TRL #5700 is the single root cause
`OffloadActivations` creates async CUDA streams without synchronization → 5.7 GB of garbage accumulates across steps → OOM at step 19.
The fix adds `stream.synchronize()` + `stash.clear()` in `__exit__`, completely eliminating the leak.
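For concreteness, a minimal sketch of that cleanup pattern (class and attribute names are assumed from the root-cause chain below; this is not the literal TRL code):

```python
import torch

# Minimal sketch of the #5700 cleanup pattern (assumed names, not the
# literal OffloadActivations implementation).
class OffloadActivationsSketch:
    def __init__(self):
        self.s0 = torch.cuda.Stream()  # CPU→GPU reload stream
        self.s1 = torch.cuda.Stream()  # GPU→CPU offload stream
        self.fwd_stash = {}
        self.bwd_tensor_stash = {}
        self.bwd_ev_stash = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            # Wait for all in-flight async copies on both streams.
            self.s0.synchronize()
            self.s1.synchronize()
        finally:
            # Drop the references so the stashed tensors become
            # collectable and their VRAM returns to the allocator.
            self.fwd_stash.clear()
            self.bwd_tensor_stash.clear()
            self.bwd_ev_stash.clear()
```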
VL MODE — Qwopus3.5-9B (seq_len=512, 100 steps) — Baseline: 20.3 GB | Saved: 7.0 GB
| Patch | Project | PR | Peak | Saved vs Baseline | Status |
|---|---|---|---|---|---|
| BASELINE | — | — | 20.308 GB | — | BASELINE |
| P1: MatMul4Bit no_grad | bitsandbytes-foundation/bitsandbytes | #1935 | 20.308 GB | 0.0 GB | NOT REPRODUCIBLE |
| P2: bf16 CrossEntropyLoss | huggingface/transformers | #45769 | 20.308 GB | 0.0 GB | NOT REPRODUCIBLE |
| P3: OffloadActivations stream (FIX) | huggingface/trl | #5700 | 13.283 GB | 7.0 GB | ROOT CAUSE FIX |
| P4: pack_tensor contiguous | huggingface/trl | #5694 | 20.308 GB | 0.0 GB | MERGED (prevents crash) |
| P5: torch_memory_fix | — | — | 15.935 GB | 4.4 GB | NOT REPRODUCIBLE |
| ALL 5 COMBINED | — | all | 14.29 GB | 6.0 GB | Combined (P3 dominant) |
TEXT MODE — Qwopus3.5-9B (seq_len=4096, 150 steps) — Baseline: 8.5 GB | Saved: 0.0 GB
| Patch | Project | PR | Peak | Saved vs Baseline | Status |
|---|---|---|---|---|---|
| BASELINE | — | — | 8.498 GB | — | BASELINE |
| P1: MatMul4Bit no_grad | bitsandbytes-foundation/bitsandbytes | #1935 | 8.498 GB | 0.0 GB | NOT REPRODUCIBLE |
| P2: bf16 CrossEntropyLoss | huggingface/transformers | #45769 | 8.498 GB | 0.0 GB | NOT REPRODUCIBLE |
| P3: OffloadActivations stream (FIX) | huggingface/trl | #5700 | 8.498 GB | 0.0 GB | FIX (VL-only leak; text mode cannot trigger it) |
| P4: pack_tensor contiguous | huggingface/trl | #5694 | 8.498 GB | 0.0 GB | MERGED (prevents crash) |
| P5: torch_memory_fix | — | — | 24.216 GB | OOM | FAIL (OOM, broken patch) |
| ALL 5 COMBINED | — | all | 24.216 GB | OOM | FAIL (OOM, P5 breaks the combination) |
Root Cause Chain
- `OffloadActivations` creates two async CUDA streams (`s0` for CPU→GPU, `s1` for GPU→CPU)
- On `__exit__`, tensors in `bwd_tensor_stash`, `bwd_ev_stash`, and `fwd_stash` are not freed
- 5.7 GB of garbage accumulates across training steps
- VRAM grows ~0.2 GB per step until OOM at step 19 on an RTX 3090 (24 GB)
- Fix: TRL #5700 adds `stream.synchronize()` + `stash.clear()` before exit (see the reduction below)
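A hypothetical reduction of the leak (assumed names and shapes; not the actual `OffloadActivations` internals):

```python
import torch

# Each step offloads activations to pinned CPU memory on a side stream
# and stashes the GPU source tensor so the async copy has a live source.
# Without synchronize() + clear() on exit, the stash keeps every step's
# activations referenced and VRAM grows until OOM.
s1 = torch.cuda.Stream()  # GPU→CPU offload stream
bwd_tensor_stash = {}

def offload_step(step, activations):
    cpu_buf = torch.empty(activations.shape, dtype=activations.dtype,
                          pin_memory=True)
    with torch.cuda.stream(s1):
        cpu_buf.copy_(activations, non_blocking=True)
    bwd_tensor_stash[step] = activations  # reference held, never dropped
    return cpu_buf

for step in range(20):
    act = torch.randn(64, 512, 4096, device="cuda", dtype=torch.bfloat16)
    offload_step(step, act)
    # allocated VRAM climbs ~0.25 GiB per step and is never returned
    print(step, round(torch.cuda.memory_allocated() / 2**30, 2), "GiB")
```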
Contributed PRs — butterwecksolutions
✅ TRL #5694 — MERGED (first HF-org merge)
`pack_tensor` contiguous fix: `split` + `clone` + `contiguous` prevents corrupted tensors produced by views into shared memory buffers (illustrated below). Reviewed by kashif, merged by qgallouedec. Affects ALL QLoRA users.
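A small illustration of the failure mode the fix guards against (toy buffer, not the actual `pack_tensor` code):

```python
import torch

# torch.split returns views that share storage with the flat buffer.
# If the buffer is later reused or overwritten, every view silently
# changes with it; contiguous() + clone() gives each chunk its own
# storage, so buffer reuse can no longer corrupt it.
flat = torch.arange(8.0)                        # shared flat buffer
views = torch.split(flat, 4)                    # views into `flat`
safe = [v.contiguous().clone() for v in views]  # independent copies

flat.zero_()      # buffer gets reused elsewhere
print(views[0])   # tensor([0., 0., 0., 0.])  <- corrupted view
print(safe[0])    # tensor([0., 1., 2., 3.])  <- unaffected copy
```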
🔵 TRL #5700 — ROOT CAUSE FIX (kashif approved, pending merge)
`OffloadActivations` stream cleanup: adds `stream.synchronize()` + `stash.clear()` in `__exit__` (try/finally pattern per kashif's review). Eliminates the 5.7 GB VRAM leak, the sole root cause among the 5 investigated bugs, and makes VL training viable on consumer GPUs.