16 Open-Source RL Libraries, One Shared GPU Bottleneck

A Hugging Face survey of 16 open-source reinforcement learning libraries finds the entire ecosystem has converged on async disaggregated training to fix a single brutal bottleneck: GPU idle time during long rollouts.

On March 10, Hugging Face published a comprehensive survey of 16 open-source reinforcement learning training libraries - everything from ByteDance's verl to Meta's TorchForge to NVIDIA's NeMo-RL. The conclusion: every single one of them, built independently across different organizations on different continents, arrived at the same architectural fix for the same problem. Your training GPUs are sitting idle while your inference GPUs are busy producing rollouts, and the synchronous pipeline that most teams start with is what's killing throughput.

TL;DR

  • Generating 32K-token rollouts on a 32B model takes 3.7 hours on a single H100; training GPUs sit idle the entire time
  • All 16 surveyed libraries independently converged on disaggregating inference and training onto separate GPU pools with async rollout buffers
  • Ray is the de facto orchestration layer: 8 of 16 libraries use it
  • LoRA adapter-only weight sync cuts synchronization from 100-500ms to under 1ms; 13 of the 16 libraries support LoRA
  • MoE/Expert Parallelism is the major gap - DeepSpeed-based stacks can't do it correctly, and the DeepSeek-V3.2 routing divergence bug affects every current implementation
  • Hugging Face's own TRL async GRPO trainer is "coming shortly"

The Bottleneck Nobody Wants to Measure

The core problem in RL training is easy to reproduce and uncomfortable to admit. When you run GRPO or PPO, the workflow is: generate a batch of rollouts via inference, then update the model weights, then repeat. In synchronous pipelines, inference and training take turns on the same hardware.
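The turn-taking pattern can be sketched in a few lines. This is a toy illustration of the synchronous loop, not any library's actual API; the function names and sleeps are stand-ins for hours of real inference and training work.

```python
import time

# Hypothetical stand-ins for the two phases; names are illustrative,
# not taken from any of the surveyed libraries.
def generate_rollouts(policy_version, batch_size):
    time.sleep(0.02)               # stands in for hours of inference
    return [f"rollout-{i}" for i in range(batch_size)]

def train_step(policy_version, rollouts):
    time.sleep(0.005)              # stands in for the weight update
    return policy_version + 1      # new policy version

def synchronous_rl(steps=3, batch_size=4):
    """The synchronous pattern: inference and training take turns.
    While generate_rollouts runs, the training GPUs sit idle."""
    version = 0
    for _ in range(steps):
        rollouts = generate_rollouts(version, batch_size)  # training idle
        version = train_step(version, rollouts)            # inference idle
    return version

print(synchronous_rl())  # -> 3
```

Every second spent inside `generate_rollouts` is a second the training pool does nothing, which is exactly the idle time the benchmarks below quantify.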

For short outputs on small models, this is fine. For the long-chain reasoning outputs that make RL valuable in 2026, it isn't.

The Hugging Face team ran concrete benchmarks on a single H100 80GB:

Rollout length              7B model    32B model
2K tokens, 512 rollouts     ~3 min      ~14 min
8K tokens, 512 rollouts     ~11 min     ~56 min
32K tokens, 512 rollouts    ~45 min     ~3.7 hours

A single training step for a 32B model generating 32K-token reasoning traces takes nearly four hours. Training GPUs do nothing during that time. Increasing the rollout batch size makes it worse. Adding more training GPUs makes it worse still - you're just paying for more idle compute.

The ecosystem's response is disaggregation: separate GPU pools for inference and training, connected by a rollout buffer that decouples their execution. The training loop pulls from the buffer asynchronously. Rollout generation runs continuously in the background. GPU utilization on both sides goes up.
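A minimal in-process sketch of that decoupling, using a bounded queue as the rollout buffer and a background thread as the inference pool. This is a toy model of the pattern, not any library's implementation; field names like policy_version are our own.

```python
import queue
import threading

# Bounded queue standing in for the rollout buffer between the
# inference pool and the training pool.
buffer = queue.Queue(maxsize=4)
latest_weights = {"version": 0}
stop = threading.Event()

def rollout_worker():
    """Inference pool: generates rollouts continuously in the background,
    tagging each with the policy version that produced it."""
    while not stop.is_set():
        rollout = {"tokens": [1, 2, 3],
                   "policy_version": latest_weights["version"]}
        try:
            buffer.put(rollout, timeout=0.1)
        except queue.Full:
            continue  # buffer bounded: back off instead of piling up

def training_loop(steps):
    """Training pool: pulls from the buffer asynchronously and updates
    weights without ever waiting on generation to finish a batch."""
    for _ in range(steps):
        rollout = buffer.get()
        latest_weights["version"] += 1  # stands in for the weight update
    stop.set()

t = threading.Thread(target=rollout_worker, daemon=True)
t.start()
training_loop(steps=5)
t.join(timeout=1)
print(latest_weights["version"])  # -> 5
```

The bounded `maxsize` is doing real work here: it caps how far generation can run ahead of training, which is the staleness control the survey's bounded-queue libraries rely on.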

What the Survey Actually Measured

The 16 Libraries

Library         Organization            GitHub Stars
verl            ByteDance               19,673
ART             CoreWeave / OpenPipe    8,952
SLIME           THUDM                   4,595
AReaL           Ant Group               4,338
verifiers-rl    PrimeIntellect          3,876
open-instruct   AI2 / AllenAI           3,611
ROLL            Alibaba                 2,921
Tunix           Google                  2,175
SkyRL           NovaSky-AI              1,664
NeMo-RL         NVIDIA                  1,383
PRIME-RL        PrimeIntellect          1,114
MILES           radixark                950
Atropos         NousResearch            878
OAT             SAIL-SG                 637
TorchForge      Meta                    632
PipelineRL      ServiceNow              374

The survey compared all 16 across seven axes: orchestration primitive, rollout buffer design, weight synchronization protocol, staleness management, partial rollout handling, LoRA support, and distributed training backend.

What They Measured

Orchestration: Ray won. Eight of the 16 libraries use Ray to coordinate the inference and training pools. The reason is practical: Ray handles heterogeneous GPU pools natively, provides built-in fault tolerance, and its object store does zero-copy transfers between workers. Google's Tunix runs on JAX/XLA and is TPU-primary, making it the odd one out. Meta's TorchForge uses a custom Monarch actor model. PipelineRL routes messages through Redis pub/sub. Everyone else either uses Ray or native Python concurrency.

Rollout buffers: Libraries split into four rough categories - no buffer at all (TRL, ART, fully synchronous), double-buffer with one-step lookahead (verifiers-rl, SLIME), bounded async queues of two to k batches (SkyRL, verl, NeMo-RL), and unbounded streams (PipelineRL, Atropos). The bounded queue approach is the current consensus for production use. PipelineRL goes furthest: it locks weights on a per-transformer-forward-pass basis rather than at batch boundaries, meaning weight staleness is measured in milliseconds instead of steps.
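For the bounded-queue designs, the staleness bookkeeping amounts to comparing each rollout's generating policy version against the current one and discarding anything too old. A sketch, assuming rollouts carry a policy_version tag (the field name and threshold are our illustration):

```python
MAX_STALENESS = 2  # accept rollouts at most 2 policy versions old

def filter_fresh(rollouts, current_version, max_staleness=MAX_STALENESS):
    """Keep only rollouts whose generating policy is recent enough
    for the importance-weighted loss to remain well-behaved."""
    return [r for r in rollouts
            if current_version - r["policy_version"] <= max_staleness]

batch = [
    {"id": "a", "policy_version": 10},
    {"id": "b", "policy_version": 8},
    {"id": "c", "policy_version": 7},   # 3 versions stale -> dropped
]
fresh = filter_fresh(batch, current_version=10)
print([r["id"] for r in fresh])  # -> ['a', 'b']
```

PipelineRL's per-forward-pass weight locking makes this check nearly vacuous - staleness is milliseconds - while the k-batch queues accept bounded drift in exchange for higher throughput.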

Weight synchronization: This is where the spread is widest. Most libraries use NCCL broadcast, which costs 100-500ms per sync for a 32B model. verl adds bucketing and cuts that to around 20ms. NeMo-RL and MILES use CUDA IPC zero-copy transfers, which is faster but requires inference and training to run on the same physical node. PRIME-RL and AReaL use filesystem checkpoints plus HTTP - slower, but works across any cluster topology. AReaL's Mooncake Transfer Engine (used at Kimi-K2 scale, 256 H20s) syncs a full 32B model in roughly 16-17 seconds.
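The bucketing trick is about amortization: instead of issuing one broadcast per tensor, parameters are greedily packed into large contiguous buffers. A pure-Python sketch of the packing logic (the sizes and names are illustrative, and this elides the actual NCCL transfer):

```python
def make_buckets(params, bucket_bytes=512 * 1024 * 1024):
    """Greedily pack (name, num_bytes) parameters into buckets so each
    broadcast moves one large buffer instead of thousands of tensors."""
    buckets, current, current_size = [], [], 0
    for name, size in params:
        if current and current_size + size > bucket_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# Six fake 100 MB tensors with a 256 MB bucket budget -> three buckets.
params = [(f"layer{i}.weight", 100 * 1024 * 1024) for i in range(6)]
buckets = make_buckets(params, bucket_bytes=256 * 1024 * 1024)
print([len(b) for b in buckets])  # -> [2, 2, 2]
```

Fewer, larger transfers means fewer per-call launch latencies, which is plausibly where most of the 100-500ms-to-20ms improvement comes from.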

LoRA support: Thirteen of 16 libraries support LoRA fine-tuning, which is significant because LoRA adapter-only weight sync dramatically reduces synchronization overhead. A full 32B model sync pushes hundreds of megabytes over the network. A rank-32 LoRA adapter for the same model is roughly 50MB - and with adapter-only sync, the transfer completes in under a millisecond instead of hundreds of milliseconds. SLIME and TorchForge have no LoRA support. Only ART and MILES explicitly handle MoE expert LoRA.
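The size gap is easy to verify on the back of an envelope. The model dimensions below are assumptions (roughly a Qwen-style 32B: 64 layers, hidden size 5120, LoRA on the q and v projections), not figures from the survey - the exact adapter size depends on rank and on which modules you target - but the orders of magnitude hold regardless:

```python
def lora_adapter_bytes(layers=64, hidden=5120, rank=32,
                       targets_per_layer=2, bytes_per_param=2):
    # Each adapted linear adds A (rank x d_in) and B (d_out x rank);
    # for a square projection that's rank * (hidden + hidden) params.
    per_module = rank * (hidden + hidden)
    return layers * targets_per_layer * per_module * bytes_per_param

full_bytes = 32_000_000_000 * 2    # 32B params in bf16: ~64 GB
adapter = lora_adapter_bytes()
print(f"adapter ~{adapter / 1e6:.0f} MB, "
      f"~{full_bytes / adapter:.0f}x smaller than the full model")
```

Syncing tens of megabytes instead of tens of gigabytes is what lets adapter-only sync finish in under a millisecond.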

What They Didn't Measure

The survey is honest about three unsolved problems it surfaces rather than resolves.

The MoE gap is the most consequential. Expert Parallelism - the technique that makes large MoE models like Mixtral or the DeepSeek series actually efficient - requires Megatron-backed training stacks. Only five libraries support it correctly: verl, SLIME, MILES, ROLL, and NeMo-RL. Every library built on DeepSpeed's ZeRO optimizer cannot do Expert Parallelism efficiently, because ZeRO partitions model parameters across GPUs in a way that breaks the sparsity pattern MoE relies on.

Worse, there's a newly surfaced bug specific to DeepSeek-V3.2. When the inference engine (typically vLLM) and the training engine (Megatron) run the same MoE model, floating-point arithmetic differences cause the MoE router to produce slightly different expert selections at each step. Over many async updates, this creates "abrupt shifts in the active parameter subspace" - training instability that looks like a learning problem but is actually a data contract problem. No current library fixes this. The required data contract extension is: (token_ids, logprobs, finish_reason, expert_routing, sampling_mask).

"Keep Routing refers to the MoE routing decisions made during inference that need to be preserved and passed to the training process, while Keep Sampling Mask refers to vocabulary truncation applied during sampling that must be consistently applied during loss computation to prevent distribution mismatch."
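The extended contract can be sketched as a record type. The class and field types below are our illustration of the five-tuple named above, not an existing library API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RolloutRecord:
    token_ids: list[int]           # generated tokens
    logprobs: list[float]          # inference-time log-probabilities
    finish_reason: str             # e.g. "stop" or "length"
    expert_routing: Optional[list[list[int]]] = None  # per-token expert ids
    sampling_mask: Optional[list[int]] = None         # vocab truncation mask

r = RolloutRecord(
    token_ids=[101, 2054],
    logprobs=[-0.12, -1.73],
    finish_reason="stop",
    expert_routing=[[3, 17], [3, 9]],  # "Keep Routing": reuse in training
    sampling_mask=[101, 2054, 2003],   # "Keep Sampling Mask": reuse in loss
)
print(r.finish_reason)  # -> stop
```

The point of the last two fields is that the trainer replays the inference engine's decisions instead of recomputing them, sidestepping the floating-point divergence entirely.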

Process Reward Models introduce a new sync barrier. DEEP-GRPO's pivot resampling needs KV cache state from mid-sequence - meaning the reward model needs to score partial rollouts before generation finishes. No existing library in the survey supports this.

Multi-agent straggler compounding is a less obvious problem. If you're running co-evolution between two models - each with a 5x ratio between 90th-percentile and median completion time - the joint 90th-percentile wait is roughly 25x the median. Async buffers help, but the tail latency still controls wall-clock time as the number of co-evolving agents grows.
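The compounding is easy to see in a Monte Carlo sketch. The lognormal completion-time distribution below is our assumption, not the survey's; the qualitative point - the joint tail grows with the number of co-evolving agents - holds for any heavy-tailed distribution:

```python
import random

def p90_of_joint_wait(num_agents, samples=50_000, seed=0):
    """P90 of the slowest agent's completion time in a joint step,
    under i.i.d. lognormal per-agent completion times."""
    rng = random.Random(seed)
    waits = [max(rng.lognormvariate(0, 1.2) for _ in range(num_agents))
             for _ in range(samples)]
    waits.sort()
    return waits[int(0.9 * samples)]

# The joint P90 wait rises as agents are added: the tail, not the
# median, controls wall-clock time.
for k in (1, 2, 4):
    print(k, round(p90_of_joint_wait(k), 2))
```

Async buffers absorb some of this, but since every joint step waits on the slowest participant, the worst agent's tail still sets the pace.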

[Figure: Synchronous RL training timeline, from Hugging Face.] Rollout generation and model updates are serialized: training GPUs do nothing while inference runs, and vice versa. This is what the disaggregated architecture removes. Source: huggingface.co

[Photo: Rows of server racks in a large data center.] Disaggregated RL training splits inference and training workloads across separate physical GPU pools - the kind of infrastructure that was previously viable only at hyperscaler scale. Source: commons.wikimedia.org

Should You Care?

If you're training a model that produces short outputs - a classifier, a summarizer, anything under 2K tokens - synchronous RL is probably fine. The overhead is manageable and the engineering simplicity is worth it.

If you're doing long-horizon reasoning training - the kind that makes models like OpenClaw tick, or that drives the performance gains on math and coding benchmarks - you'll hit this wall. The community has already solved it, and the solutions are public. The question is which of the 16 libraries best matches your infrastructure.

For teams starting fresh, verl (ByteDance, 19K stars) is the most battle-tested option with the most complete parallelism support. For teams already using the Hugging Face ecosystem, TRL is the natural home - and the forthcoming async GRPO trainer should close the gap with verl on the throughput front. NeMo-RL is the pick if you're on NVIDIA hardware and need CUDA IPC weight transfers. Atropos (NousResearch) is worth watching if you care about multi-agent co-evolution scenarios.

The Hugging Face team frames this survey explicitly as pre-work for TRL's async trainer: "That's the map, stay tuned, we are working on a concrete async GRPO trainer in TRL, and we'll announce it shortly." Given TRL's install base, that release will matter. The architectural choices they make - buffer depth, staleness tolerance, weight sync protocol - will get cargo-culted by a lot of teams that don't read the underlying paper.

The MoE routing divergence issue won't get fixed by any single library team. It requires a coordination effort between inference runtime developers and training framework developers. Nobody has volunteered to own it yet.


About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.