VRAM Fix Reproducer — Systematic Root Cause Isolation

5 candidate fixes (P1-P5 = Patch 1-5) tested in isolation. Fresh GPU per test. 500 samples, seq_len=4096.
Model: Jackrong/Qwopus3.5-9B-v3.5 | NVIDIA GeForce RTX 3090 (24 GB)

TRL #5700 is the single root cause

OffloadActivations creates async CUDA streams without synchronization → 5.7 GB garbage accumulation across steps → OOM at step 19.

The fix adds stream.synchronize() + stash.clear() in __exit__, completely eliminating the leak.

VL MODE — Qwopus3.5-9B (seq_len=512, 100 steps) — Peak VRAM per Patch (baseline: 20.3 GB)

TEXT MODE — Qwopus3.5-9B (seq_len=4096, 150 steps) — Peak VRAM per Patch (baseline: 8.5 GB)

VL MODE — Qwopus3.5-9B (seq_len=512, 100 steps) — Baseline: 20.3 GB | Saved: 7.0 GB

PatchProjectPRPeakvs BaselineStatus
BASELINE20.308 GBBASELINE
P1: MatMul4Bit no_gradbitsandbytes-foundation/bitsandbytes#193520.308 GB+0.0 GBNOT REPRODUCIBLE
P2: bf16 CrossEntropyLosshuggingface/transformers#4576920.308 GB+0.0 GBNOT REPRODUCIBLE
P3: OffloadActivations stream (FIX)huggingface/trl#570013.283 GB7.0 GB SAVEDROOT CAUSE FIX
P4: pack_tensor contiguous (#5694)huggingface/trl#569420.308 GBMERGED (prevents crash)
P5: torch_memory_fix15.935 GB+4.4 GBNOT REPRODUCIBLE
ALL 5 COMBINEDall14.29 GB+6.0 GBCombined (P3 dominant)

TEXT MODE — Qwopus3.5-9B (seq_len=4096, 150 steps) — Baseline: 8.5 GB | Saved: 0.0 GB

PatchProjectPRPeakvs BaselineStatus
BASELINE8.498 GBBASELINE
P1: MatMul4Bit no_gradbitsandbytes-foundation/bitsandbytes#19358.498 GB+0.0 GBNOT REPRODUCIBLE
P2: bf16 CrossEntropyLosshuggingface/transformers#457698.498 GB+0.0 GBNOT REPRODUCIBLE
P3: OffloadActivations stream (FIX)huggingface/trl#57008.498 GBFIX (VL-only leak, text can't trigger)
P4: pack_tensor contiguous (#5694)huggingface/trl#56948.498 GBMERGED (prevents crash)
P5: torch_memory_fix24.216 GBOOMFAIL (OOM — broken patch)
ALL 5 COMBINEDall24.216 GBOOMFAIL (OOM — P5 breaks combined)

Root Cause Chain

  1. OffloadActivations creates two async CUDA streams (s0 for CPU→GPU, s1 for GPU→CPU)
  2. On __exit__, tensors in bwd_tensor_stash, bwd_ev_stash, and fwd_stash are not freed
  3. 5.7 GB garbage stream accumulates across training steps
  4. VRAM grows ~0.2 GB per step until OOM at step 19 on RTX 3090 (24 GB)
  5. Fix: TRL #5700 adds stream.synchronize() + stash.clear() before exit

Contributed PRs — butterwecksolutions

✅ TRL #5694 — MERGED (first HF-org merge)
pack_tensor contiguous fix: split + clone + contiguous prevents corrupted tensors from shared memory buffers. Reviewed by kashif, merged by qgallouedec. Affects ALL QLoRA users.
🔵 TRL #5700 — ROOT CAUSE FIX (kashif approved, pending merge)
OffloadActivations stream cleanup: adds stream.synchronize() + stash.clear() in __exit__ (try/finally pattern per kashif review). Eliminates 5.4 GB VRAM leak — the sole root cause among 5 investigated bugs. Makes VL training viable on consumer GPUs.
🔸 bitsandbytes #1935 — MatMul4Bit no_grad: investigated, 0.0 GB effect isolated → CLOSE
🔸 transformers #45769 — bf16_loss training arg: investigated, 0.0 GB effect isolated → CLOSE