
MIT Researchers Develop a Proxy Model Technique That Doubles LLM Training Speed
A new MIT method uses a lightweight proxy model to predict reasoning outputs, cutting the reinforcement learning rollout bottleneck in half.
Researchers at MIT have published a technique that could meaningfully reduce the time and compute required to train advanced reasoning models. The method, detailed in a paper released on February 26, addresses one of the most expensive stages in modern LLM development: the reinforcement learning rollout phase.
The Rollout Bottleneck Explained
Training a reasoning LLM is not a single process — it involves multiple stages. After initial pretraining on text data, models undergo reinforcement learning from human feedback to improve their reasoning capabilities. During this phase, the model must generate thousands of complete reasoning chains, which are then scored and used to update the model’s weights.
This rollout phase — where the model essentially thinks through problems over and over — consumes up to 85 percent of total reinforcement learning training time. It is computationally expensive because the full-sized model must generate each token sequentially, producing millions of reasoning traces across the training run.
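To see why rollouts dominate, consider a toy sketch of the generation loop (the model here is a hypothetical stand-in, not the paper's implementation): every new token requires another full forward pass over the entire context, so cost scales with both rollout length and rollout count.

```python
def generate_rollout(model, prompt, max_tokens):
    """Generate one reasoning chain token by token.

    Each iteration runs a full forward pass of the (large) model,
    which is what makes the rollout phase so expensive at scale.
    """
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(model(tokens))
    return tokens

def toy_model(context):
    # Stand-in for an LLM's next-token choice; deterministic for illustration.
    return (sum(context) + len(context)) % 10

# An RL training run repeats this thousands of times per update step.
rollouts = [generate_rollout(toy_model, [1, 2, 3], max_tokens=5) for _ in range(4)]
```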
A Smaller Model Doing the Heavy Lifting
The MIT team’s innovation is elegant in its simplicity. They train a smaller, faster proxy model to predict the outputs that the larger model would produce. The proxy handles the bulk of rollout generation, and the larger model only needs to verify and correct the proxy’s outputs. Verification is much cheaper than generating from scratch because the large model can check a batch of drafted tokens in a single forward pass, rather than producing them one at a time.
In experiments across multiple reasoning LLMs, this approach doubled training speed while preserving accuracy on benchmark evaluations. The proxy model itself requires minimal additional compute to train and can be reused across multiple training runs.
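The verify-and-correct idea resembles speculative decoding. Here is a minimal sketch of one draft-then-verify loop; the function names and toy models are hypothetical, and the paper's actual procedure may differ in its details:

```python
def proxy_assisted_rollout(big_model, proxy_model, prompt, max_tokens, block=4):
    """Hypothetical draft-then-verify rollout loop.

    The proxy drafts a block of tokens cheaply; the big model checks the
    block (in a real system, one batched forward pass), accepts the longest
    prefix it agrees with, and substitutes its own token at the first
    disagreement.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # Cheap draft: the proxy proposes a block of candidate tokens.
        draft = []
        for _ in range(block):
            draft.append(proxy_model(tokens + draft))
        # Verification: keep draft tokens until the big model disagrees.
        accepted = []
        for i, t in enumerate(draft):
            verified = big_model(tokens + draft[:i])
            if verified == t:
                accepted.append(t)
            else:
                accepted.append(verified)  # correct, discard the rest
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_tokens]

def big_model(context):
    # Stand-in for the large model's next-token choice.
    return (sum(context) + len(context)) % 5

def proxy_model(context):
    # Imperfect proxy: agrees with big_model except on short contexts.
    return big_model(context) if len(context) > 4 else 0
```

Because every accepted token either matches or is replaced by the big model's choice, the final rollout is identical to what the big model would have generated alone; the savings come from replacing most of the big model's sequential generation steps with cheap proxy drafts plus batched verification.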
Implications for AI Development Costs
The financial implications are significant. Training frontier reasoning models currently costs tens to hundreds of millions of dollars, with compute time measured in weeks on clusters of thousands of GPUs. A two-times speedup roughly halves the reinforcement learning compute bill and shortens iteration cycles for AI labs.
For the broader AI ecosystem, faster training means more experiments, more architectures explored, and ultimately faster progress toward more capable and efficient models.
Open Research for the Community
The MIT team has released their methodology and experimental results publicly, enabling other research groups and AI labs to adopt the technique. The approach is architecture-agnostic and can be applied to any model undergoing reinforcement learning fine-tuning.
Sources: MIT News, February 26, 2026; MIT CSAIL, February 2026
