AI Engineer YouTube · June 8, 2026

Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

Why it matters

Training a standard LLaMA 3B model with a 3 million token context on a single 8xH100 node fails before you even start: the model parameters alone exhaust GPU memory. Max Ryabinin from Together AI walks through the full stack of techniques needed to get there: fully sharded data parallelism, DeepSpeed Ulysses context pa
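As a rough illustration of the context-parallel piece named above, here is a minimal single-process sketch of the idea generally associated with DeepSpeed Ulysses: each worker starts with a slice of the sequence, an all-to-all exchange regroups the data so each worker holds the full sequence for a subset of attention heads, attention runs locally, and a second exchange restores the sequence split. Everything here is an assumption for illustration, not code from the talk: the function names are hypothetical, the all-to-all is simulated with NumPy concatenate and split rather than real collective communication, and causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v have shape (seq, heads, dim); non-causal for simplicity.
    d = q.shape[-1]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    return np.einsum("hqk,khd->qhd", softmax(scores), v)

def seq_to_head_shards(shards):
    # Stand-in for an all-to-all: P shards of (seq/P, heads, dim)
    # become P shards of (seq, heads/P, dim).
    full = np.concatenate(shards, axis=0)
    return np.split(full, len(shards), axis=1)

def head_to_seq_shards(shards):
    # Inverse exchange: back to P shards of (seq/P, heads, dim).
    full = np.concatenate(shards, axis=1)
    return np.split(full, len(shards), axis=0)

def ulysses_attention(q_shards, k_shards, v_shards):
    # Hypothetical helper: regroup by heads, attend locally, regroup by sequence.
    q_h, k_h, v_h = (seq_to_head_shards(s) for s in (q_shards, k_shards, v_shards))
    out_h = [attention(q, k, v) for q, k, v in zip(q_h, k_h, v_h)]
    return head_to_seq_shards(out_h)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, heads, dim, workers = 16, 8, 4, 4
    q, k, v = (rng.standard_normal((seq, heads, dim)) for _ in range(3))
    shards = [np.split(t, workers, axis=0) for t in (q, k, v)]
    out = np.concatenate(ulysses_attention(*shards), axis=0)
    print(np.allclose(out, attention(q, k, v)))
```

The point of the regrouping is that attention heads are independent of one another, so once a worker holds every token for its heads it can compute exact attention without seeing the other heads. In this sketch the number of heads must be divisible by the number of workers.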