
The Optimizer State Bug: A Silent Failure in DCP Resume

Author: Robin, Kroonen AI Inc.

Tags: genesis · postmortem · optimizer · adamw · dcp · fsdp

⚠️ Postmortem, March 23, 2026

Fixing the checkpoint save deadlock was only half the story. The checkpoint load path introduced a subtler failure: one that didn't crash, didn't hang, and produced no errors. It just silently ruined the model.

This is Part 2 of the Genesis checkpoint saga. Part 1 covered the FSDP checkpoint deadlock. This post covers the silent optimizer state bug discovered five days later.

What Happened

At step 8,500, training was stopped for a break. When resumed, the DCP load path only restored model weights, not the AdamW optimizer state:

❌ Broken: model only, optimizer reset to zero
```python
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        # optimizer state NOT loaded — this is the bug
    }
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])

# optimizer starts from scratch — momentum and variance are zero
```

The save code was fine; it already wrote both model and optimizer state. But the load path had been stripped down to model-only during an earlier debugging session, as a workaround for a `RuntimeError: Missing key in checkpoint state_dict: optimizer.param_groups.0.decoupled_weight_decay` raised by older checkpoints that genuinely didn't contain optimizer state.

The workaround became the bug.

Why It's Silent

When AdamW's optimizer state is reset mid-training:

  - The first-moment estimates (`exp_avg`, the momentum) go back to zero, so update directions lose their smoothing.
  - The second-moment estimates (`exp_avg_sq`) go back to zero, so the per-parameter learning rate scaling Adam derives from them is lost.
  - The internal step counter resets, so bias correction behaves as if training just began.

So training doesn't explode. It doesn't crash. It just quietly drifts into a worse optimization basin over ~500 steps.
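The state that vanishes can be inspected directly. A minimal sketch on a toy linear layer (not Genesis code) shows what AdamW accumulates per parameter, and what a freshly constructed optimizer is missing:

```python
import torch

# One training step populates AdamW's per-parameter state.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()

state = opt.state[next(model.parameters())]
print(sorted(state.keys()))  # ['exp_avg', 'exp_avg_sq', 'step']

# A fresh optimizer has none of it: exp_avg (momentum) and exp_avg_sq
# (the per-parameter scaling statistics) must be rebuilt from zero —
# exactly the situation after a model-only resume.
fresh = torch.optim.AdamW(model.parameters(), lr=1e-3)
print(len(fresh.state))  # 0
```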

The Diagnostic Signature

The telltale pattern looks backwards from normal instability:

| Metric | Before Reset | After Reset |
|---|---|---|
| Loss | ~1.1–1.3 | ~2.0–2.5 |
| Grad norm | ~0.5–0.7 | ~0.2–0.3 |
| LR / tok/s | unchanged | unchanged |

If the optimizer were diverging, grad norm would spike up. Instead it drops, because without curvature information, Adam's per-parameter scaling is broken, and the model takes smaller effective steps in the wrong directions.
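This inverted signature can be turned into a cheap monitoring check. The function below is an illustrative sketch; the window size and thresholds are assumptions for demonstration, not values from the Genesis monitoring stack:

```python
def optimizer_reset_suspect(losses, grad_norms, window=50,
                            loss_ratio=1.5, gnorm_ratio=0.6):
    """Flag the inverted signature of lost optimizer state:
    loss UP while grad norm is DOWN versus the earlier window.
    (Heuristic sketch; thresholds are illustrative assumptions.)"""
    if len(losses) < 2 * window or len(grad_norms) < 2 * window:
        return False
    mean = lambda xs: sum(xs) / len(xs)
    loss_up = mean(losses[-window:]) > loss_ratio * mean(losses[:window])
    gnorm_down = mean(grad_norms[-window:]) < gnorm_ratio * mean(grad_norms[:window])
    return loss_up and gnorm_down
```

Fed the numbers from the table above, a pre-reset window (loss ~1.2, grad norm ~0.6) followed by a post-reset window (loss ~2.2, grad norm ~0.25) trips the check, while a genuine divergence (grad norm spiking upward alongside the loss) does not.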

The Fix

Load optimizer state alongside model weights, with a try/except fallback for older checkpoints:

✅ Fixed: model + optimizer, with graceful fallback
```python
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # 1. Load model weights
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])

    # 2. Load optimizer state (with fallback for old checkpoints)
    try:
        optim_sd = {
            "optimizer": FSDP.optim_state_dict(model, optimizer),
        }
        dcp.load(optim_sd, checkpoint_id=dcp_latest)
        optim_to_load = FSDP.optim_state_dict_to_load(
            model, optimizer, optim_sd["optimizer"]
        )
        optimizer.load_state_dict(optim_to_load)
    except Exception:
        print("Optimizer state missing — falling back to reset")
```

Recovery

After applying the fix and resuming from step 8,500:

| Step | Loss | Grad Norm |
|---|---|---|
| 8,501 | 0.92 | 0.59 |
| 8,503 | 1.00 | 0.47 |
| 8,505 | 1.36 | 0.62 |
| 8,507 | 1.41 | 0.59 |
| 8,509 | 1.42 | 0.51 |

Immediate recovery. The model snapped back to its pre-reset trajectory on the first step. ~1,000 steps of compromised training were discarded.

⚠️ Update, March 23: Second optimizer reset

The training script was rewritten with activation checkpointing to fix an OOM crash (see training progress). The new FSDP wrapping changed the shard layout, making the old optimizer state incompatible. AdamW is again rebuilding from scratch at step 8,500. This time the reset was unavoidable — the architecture change required it.

Lessons

  1. Always restore optimizer state on resume. Model weights alone are not enough for AdamW. The accumulated first and second moment estimates encode critical per-parameter learning rate scaling.
  2. Workarounds become bugs. The model-only load was a valid workaround for old checkpoints. But it was left as the default path, silently breaking all future resumes.
  3. Monitor the grad norm / loss ratio. A sudden drop in grad norm paired with a loss increase is the signature of optimizer state loss. It looks nothing like divergence.
  4. Test your resume path. Run 10 steps after resume and verify the metrics match the pre-checkpoint regime. Don't assume it's fine because it didn't crash.
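Lesson 4 can be exercised without a cluster. A minimal sketch on a toy model (plain `torch.save`, not the DCP path) checkpoints model plus optimizer, resumes into fresh objects, and asserts the optimizer state actually came back:

```python
import io
import torch

def make():
    # Fixed seed so the resumed objects match the originals.
    torch.manual_seed(0)
    m = torch.nn.Linear(8, 1)
    return m, torch.optim.AdamW(m.parameters(), lr=1e-2)

model, opt = make()
x, y = torch.randn(32, 8), torch.randn(32, 1)
for _ in range(10):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# Checkpoint BOTH state dicts (in-memory buffer stands in for disk).
buf = io.BytesIO()
torch.save({"model": model.state_dict(), "optimizer": opt.state_dict()}, buf)

# Resume into fresh objects and verify the optimizer state survived.
buf.seek(0)
ckpt = torch.load(buf)
model2, opt2 = make()
model2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["optimizer"])
assert len(opt2.state) > 0, "optimizer state missing — resume is broken"
```

The same shape of check, run for a handful of steps after a real resume and compared against pre-checkpoint metrics, would have caught this bug on day one.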

Contact

If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].