VRAM Fix Reproducer — Root Cause Isolation

Patch	Project	PR	Peak	vs Baseline	Status
BASELINE	—	—	20.308 GB	—	BASELINE
P1: MatMul4Bit no_grad	bitsandbytes-foundation/bitsandbytes	#1935	20.308 GB	+0.0 GB	NOT REPRODUCIBLE
P2: bf16 CrossEntropyLoss	huggingface/transformers	#45769	20.308 GB	+0.0 GB	NOT REPRODUCIBLE
P3: OffloadActivations stream (FIX)	huggingface/trl	#5700	13.283 GB	7.0 GB SAVED	ROOT CAUSE FIX
P4: pack_tensor contiguous (#5694)	huggingface/trl	#5694	20.308 GB	—	MERGED (prevents crash)
P5: torch_memory_fix	—	—	15.935 GB	+4.4 GB	NOT REPRODUCIBLE
ALL 5 COMBINED	—	all	14.29 GB	+6.0 GB	Combined (P3 dominant)

Patch

Project

Peak

vs Baseline

Status

BASELINE

—

20.308 GB

—

BASELINE

P1: MatMul4Bit no_grad

bitsandbytes-foundation/bitsandbytes

#1935

20.308 GB

+0.0 GB

NOT REPRODUCIBLE

P2: bf16 CrossEntropyLoss

huggingface/transformers

#45769

20.308 GB

+0.0 GB

NOT REPRODUCIBLE

P3: OffloadActivations stream (FIX)

huggingface/trl

#5700

13.283 GB

7.0 GB SAVED

ROOT CAUSE FIX

P4: pack_tensor contiguous (#5694)

huggingface/trl

#5694

20.308 GB

—

MERGED (prevents crash)

P5: torch_memory_fix

—

15.935 GB

+4.4 GB

NOT REPRODUCIBLE

ALL 5 COMBINED

—

all

14.29 GB

+6.0 GB

Combined (P3 dominant)

Patch	Project	PR	Peak	vs Baseline	Status
BASELINE	—	—	8.498 GB	—	BASELINE
P1: MatMul4Bit no_grad	bitsandbytes-foundation/bitsandbytes	#1935	8.498 GB	+0.0 GB	NOT REPRODUCIBLE
P2: bf16 CrossEntropyLoss	huggingface/transformers	#45769	8.498 GB	+0.0 GB	NOT REPRODUCIBLE
P3: OffloadActivations stream (FIX)	huggingface/trl	#5700	8.498 GB	—	FIX (VL-only leak, text can't trigger)
P4: pack_tensor contiguous (#5694)	huggingface/trl	#5694	8.498 GB	—	MERGED (prevents crash)
P5: torch_memory_fix	—	—	24.216 GB	OOM	FAIL (OOM — broken patch)
ALL 5 COMBINED	—	all	24.216 GB	OOM	FAIL (OOM — P5 breaks combined)

Patch

Project

Peak

vs Baseline

Status

BASELINE

—

8.498 GB

—

BASELINE

P1: MatMul4Bit no_grad

bitsandbytes-foundation/bitsandbytes

#1935

8.498 GB

+0.0 GB

NOT REPRODUCIBLE

P2: bf16 CrossEntropyLoss

huggingface/transformers

#45769

8.498 GB

+0.0 GB

NOT REPRODUCIBLE

P3: OffloadActivations stream (FIX)

huggingface/trl

#5700

8.498 GB

—

FIX (VL-only leak, text can't trigger)

P4: pack_tensor contiguous (#5694)

huggingface/trl

#5694

8.498 GB

—

MERGED (prevents crash)

P5: torch_memory_fix

—

24.216 GB

OOM

FAIL (OOM — broken patch)

ALL 5 COMBINED

—

all

24.216 GB

OOM

FAIL (OOM — P5 breaks combined)

Root Cause Chain

OffloadActivations creates two async CUDA streams (s0 for CPU→GPU, s1 for GPU→CPU)
On __exit__, tensors in bwd_tensor_stash, bwd_ev_stash, and fwd_stash are not freed
5.7 GB garbage stream accumulates across training steps
VRAM grows ~0.2 GB per step until OOM at step 19 on RTX 3090 (24 GB)
Fix: TRL #5700 adds stream.synchronize() + stash.clear() before exit

Contributed PRs — butterwecksolutions

✅ TRL #5694 — MERGED (first HF-org merge)
pack_tensor contiguous fix: split + clone + contiguous prevents corrupted tensors from shared memory buffers. Reviewed by kashif, merged by qgallouedec. Affects ALL QLoRA users.

🔵 TRL #5700 — ROOT CAUSE FIX (kashif approved, pending merge)
OffloadActivations stream cleanup: adds stream.synchronize() + stash.clear() in __exit__ (try/finally pattern per kashif review). Eliminates 5.4 GB VRAM leak — the sole root cause among 5 investigated bugs. Makes VL training viable on consumer GPUs.

🔸 bitsandbytes #1935 — MatMul4Bit no_grad: investigated, 0.0 GB effect isolated → CLOSE

🔸 transformers #45769 — bf16_loss training arg: investigated, 0.0 GB effect isolated → CLOSE

VRAM Fix Reproducer — Systematic Root Cause Isolation

TRL #5700 is the single root cause

VL MODE — Qwopus3.5-9B (seq_len=512, 100 steps) — Peak VRAM per Patch (baseline: 20.3 GB)

TEXT MODE — Qwopus3.5-9B (seq_len=4096, 150 steps) — Peak VRAM per Patch (baseline: 8.5 GB)

VL MODE — Qwopus3.5-9B (seq_len=512, 100 steps) — Baseline: 20.3 GB | Saved: 7.0 GB

TEXT MODE — Qwopus3.5-9B (seq_len=4096, 150 steps) — Baseline: 8.5 GB | Saved: 0.0 GB

Root Cause Chain

Contributed PRs — butterwecksolutions