
async rl

April 14, 2026

Problem setup: you are given a set of GPUs. Perhaps 8 of them. Perhaps 16 of them, but only 8 per node/cluster. Your task is to fine-tune an open-source base LLM, in a given environment, on a given task.

How do you do this?

Algorithmically, you can point to GRPO and GRPO-derived variants. But the systems challenge is a whole other ordeal.

There are two primary components:

  1. Trainer: takes a batch of (prompt, completion, reward) tuples and updates model weights via gradient descent, typically with a policy gradient objective like GRPO.
  2. Generator (inference engine): rolls out completions from the current policy given prompts, scores them with a reward model or verifier, and passes the results to the trainer.

and there are two primary ways of doing this:

  1. Colocation: a single pool of GPUs handles both training and inference, time-multiplexed, so every GPU alternates between generating rollouts and taking gradient steps. Simpler to coordinate, but GPU utilization suffers, since inference and training have very different memory and compute profiles and each phase blocks the other.
  2. Async RL: training and inference run on separate GPU pools concurrently. The generator continuously produces rollouts while the trainer continuously consumes them, decoupled via a replay buffer or queue. Higher throughput, but introduces new problems.
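The decoupling in the async setup can be sketched with two threads and a bounded queue. This is a toy illustration rather than a real training stack: `generator_loop`, `trainer_loop`, and `run_async` are hypothetical names, the rollouts are fake strings, and a "gradient step" is just a version counter being bumped.

```python
import queue
import threading


def generator_loop(rollouts, policy, lock, num_prompts):
    """Stand-in generator: emits fake rollouts tagged with the policy version used."""
    for i in range(num_prompts):
        with lock:
            version = policy["version"]  # weights the generator currently holds
        # put() blocks when the queue is full, which bounds how far ahead we run.
        rollouts.put({
            "prompt": f"prompt-{i}",
            "completion": f"completion-{i}",
            "reward": float(i % 2),  # placeholder for a reward model or verifier
            "policy_version": version,
        })
    rollouts.put(None)  # sentinel: no more rollouts


def trainer_loop(rollouts, policy, lock, batch_size):
    """Stand-in trainer: consumes full batches and bumps the version per step."""
    batch, steps = [], 0
    while True:
        item = rollouts.get()
        if item is None:
            break  # any partial batch left over is simply dropped in this sketch
        batch.append(item)
        if len(batch) == batch_size:
            with lock:
                policy["version"] += 1  # placeholder for gradient step + weight sync
            steps += 1
            batch = []
    return steps


def run_async(num_prompts=8, batch_size=4, queue_size=4):
    """Run generator and trainer concurrently; return (steps, final version)."""
    policy = {"version": 0}
    lock = threading.Lock()
    rollouts = queue.Queue(maxsize=queue_size)
    gen = threading.Thread(
        target=generator_loop, args=(rollouts, policy, lock, num_prompts)
    )
    gen.start()
    steps = trainer_loop(rollouts, policy, lock, batch_size)
    gen.join()
    return steps, policy["version"]
```

The bounded `queue.Queue` is the one real design choice here: its `maxsize` caps how many rollouts the generator can bank before the trainer catches up. In an actual system the two loops would live on separate GPU pools and the version bump would be a weight broadcast.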

and there are two primary difficulties in the async rl setup:

  1. Coordination: how to coordinate training and inference across separate GPU pools. For example, model weights must stay synchronized between the trainer and generator pools: the trainer updates weights on its GPUs, and those updates need to be periodically broadcast to the generator, which requires careful orchestration.
  2. Off-policyness: if training updates happen faster than inference, the trainer ends up updating on completions sampled from a stale policy. The data sitting in the buffer was generated by an older version of the model, which violates the on-policy assumption that GRPO relies on.
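Two common ways to live with stale data can be sketched in a few lines: drop rollouts that lag too far behind the current weights, and reweight the rest with a PPO-style clipped importance ratio. This is a minimal sketch under assumptions: `filter_stale`, `clipped_surrogate`, the `policy_version` field, and the `max_lag` and `eps` parameters are hypothetical names for illustration, not something this setup prescribes.

```python
import math


def filter_stale(rollouts, current_version, max_lag):
    """Keep rollouts whose policy is at most max_lag versions behind the trainer."""
    return [
        r for r in rollouts
        if current_version - r["policy_version"] <= max_lag
    ]


def clipped_surrogate(logp_current, logp_behavior, advantage, eps=0.2):
    """PPO-style clipped objective for one sample.

    logp_current:  log-prob of the sample under the trainer's current policy
    logp_behavior: log-prob under the (possibly stale) policy that generated it
    """
    ratio = math.exp(logp_current - logp_behavior)  # pi_current / pi_behavior
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min keeps the objective pessimistic when the ratio drifts.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    buffer = [
        {"completion": "a", "policy_version": 5},
        {"completion": "b", "policy_version": 3},
        {"completion": "c", "policy_version": 1},
    ]
    fresh = filter_stale(buffer, current_version=5, max_lag=2)
    print([r["completion"] for r in fresh])
    print(clipped_surrogate(math.log(0.5), math.log(0.25), advantage=1.0))
```

With `max_lag=2` the version-1 rollout is dropped. In the second call the raw ratio is about 2, so the clipped term (1.2 times the advantage) is the one that survives the `min`.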