The Optimizer State Bug: A Silent Failure in DCP Resume
Author: Robin, Kroonen AI Inc.
⚠️ Postmortem, March 23, 2026
Fixing the checkpoint save deadlock was only half the story. The checkpoint load path introduced a subtler failure: one that didn't crash, didn't hang, and produced no errors. It just silently ruined the model.
This is Part 2 of the Genesis checkpoint saga. Part 1 covered the FSDP checkpoint deadlock. This post covers the silent optimizer state bug discovered five days later.
What Happened
At step 8,500, training was stopped for a break. When resumed, the DCP load path only restored model weights, not the AdamW optimizer state:
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        # optimizer state NOT loaded — this is the bug
    }
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])
    # optimizer starts from scratch — momentum and variance are zero

The save code was fine; it already saved both model and optimizer state. But the load path had been stripped down to model-only during an earlier debugging session, to work around a RuntimeError: Missing key in checkpoint state_dict: optimizer.param_groups.0.decoupled_weight_decay error from older checkpoints that genuinely didn't contain optimizer state.
The workaround became the bug.
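One way to keep a workaround like this from silently becoming the default path is to make the fallback opt-in rather than automatic. A minimal sketch of that pattern; `load_checkpoint`, `optimizer_state_loader`, and the `allow_missing_optimizer` flag are all hypothetical names, not the actual Genesis code:

```python
def load_checkpoint(checkpoint, optimizer_state_loader,
                    allow_missing_optimizer=False):
    """Hypothetical loader: the legacy fallback must be requested explicitly."""
    model_state = checkpoint["model"]
    try:
        optim_state = optimizer_state_loader(checkpoint)
    except KeyError:
        if not allow_missing_optimizer:
            # Fail loudly instead of silently resetting AdamW moments.
            raise RuntimeError(
                "Checkpoint has no optimizer state; pass "
                "allow_missing_optimizer=True only for legacy checkpoints"
            )
        optim_state = None
    return model_state, optim_state

# Resuming a modern checkpoint: both parts come back.
ckpt = {"model": {"w": 1.0}, "optimizer": {"exp_avg": 0.1}}
model_state, optim_state = load_checkpoint(ckpt, lambda c: c["optimizer"])
print(optim_state)
```

A legacy checkpoint without optimizer state now raises unless the caller explicitly acknowledges the reset, so the degraded path can never become the silent default.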
Why It's Silent
When AdamW's optimizer state is reset mid-training:
- First moment (m, β₁=0.9): Rebuilds in ~30 steps. Fast.
- Second moment (v, β₂=0.95): Takes ~60–100 steps to stabilize.
- Bias correction masks the problem early. It amplifies small accumulated moments, making the first few hundred steps look deceptively normal.
So training doesn't explode. It doesn't crash. It just quietly drifts into a worse optimization basin over ~500 steps.
The Diagnostic Signature
The telltale pattern is the inverse of normal instability:
| Metric | Before Reset | After Reset |
|---|---|---|
| Loss | ~1.1–1.3 | ~2.0–2.5 |
| Grad norm | ~0.5–0.7 | ~0.2–0.3 |
| LR / tok/s | unchanged | unchanged |
If the optimizer were diverging, the grad norm would spike upward. Instead it drops: without its accumulated second-moment estimates (the closest thing Adam has to curvature information), the per-parameter scaling is broken, and the model takes smaller effective steps in the wrong directions.
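This signature is cheap to check for automatically after every resume. A hypothetical monitor, with illustrative (untuned) thresholds and synthetic numbers matching the table above:

```python
def check_resume_health(loss_history, grad_norm_history, window=50):
    """Flag the optimizer-reset signature: loss up AND grad norm down.

    Baseline = the first `window` entries (pre-checkpoint regime);
    current = the last `window` entries (post-resume). Thresholds are
    illustrative, not tuned.
    """
    base_loss = sum(loss_history[:window]) / window
    base_gn = sum(grad_norm_history[:window]) / window
    cur_loss = sum(loss_history[-window:]) / window
    cur_gn = sum(grad_norm_history[-window:]) / window
    # Divergence spikes the grad norm; a reset *lowers* it while loss climbs.
    return cur_loss > 1.5 * base_loss and cur_gn < 0.7 * base_gn

# Synthetic history mirroring the table: loss 1.2 -> 2.2, grad norm 0.6 -> 0.25
losses = [1.2] * 50 + [2.2] * 50
grad_norms = [0.6] * 50 + [0.25] * 50
print(check_resume_health(losses, grad_norms))
```

The AND condition is what distinguishes this failure from ordinary divergence, where both metrics rise together.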
The Fix
Load optimizer state alongside model weights, with a try/except fallback for older checkpoints:
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # 1. Load model weights
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])

    # 2. Load optimizer state (with fallback for old checkpoints)
    try:
        optim_sd = {
            "optimizer": FSDP.optim_state_dict(model, optimizer),
        }
        dcp.load(optim_sd, checkpoint_id=dcp_latest)
        optim_to_load = FSDP.optim_state_dict_to_load(
            model, optimizer, optim_sd["optimizer"]
        )
        optimizer.load_state_dict(optim_to_load)
    except Exception:
        print("Optimizer state missing — falling back to reset")

Recovery
After applying the fix and resuming from step 8,500:
| Step | Loss | Grad Norm |
|---|---|---|
| 8,501 | 0.92 | 0.59 |
| 8,503 | 1.00 | 0.47 |
| 8,505 | 1.36 | 0.62 |
| 8,507 | 1.41 | 0.59 |
| 8,509 | 1.42 | 0.51 |
Immediate recovery. The model snapped back to its pre-reset trajectory on the first step. ~1,000 steps of compromised training were discarded.
⚠️ Update, March 23: Second optimizer reset
The training script was rewritten with activation checkpointing to fix an OOM crash (see training progress). The new FSDP wrapping changed the shard layout, making the old optimizer state incompatible. AdamW is again rebuilding from scratch at step 8,500. This time the reset was unavoidable — the architecture change required it.
Lessons
- Always restore optimizer state on resume. Model weights alone are not enough for AdamW. The accumulated first and second moment estimates encode critical per-parameter learning rate scaling.
- Workarounds become bugs. The model-only load was a valid workaround for old checkpoints. But it was left as the default path, silently breaking all future resumes.
- Monitor the grad norm / loss ratio. A sudden drop in grad norm paired with a loss increase is the signature of optimizer state loss. It looks nothing like divergence.
- Test your resume path. Run 10 steps after resume and verify the metrics match the pre-checkpoint regime. Don't assume it's fine because it didn't crash.
Contact
If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].
More from the Genesis Series
Fixing FSDP Checkpoint Deadlocks
The original checkpoint save deadlock on PCIe-only consumer GPUs, and the DCP fix.
Training Progress: Live Results
Live loss curves, model specs, and the road to Genesis 1B v0.1.
The Genesis Manifesto
Why small models need personality, and what local AI training means in 2026.
Mapping the Mind of Qwen 3.5 9B
A sparse autoencoder for mechanistic interpretability: zero dead features, 16K dimensions.