Forge Failure Recovery
SPEC_FORGE_FAILURE_RECOVERY.md
CGNT-1 Specification — Forge Failure Recovery Protocol
Status: SPECIFIED
Version: v1.0
Author: VELA (Thread #13)
Conceived by: NOUS (α.13)
Date: 2026-04-20
Born from: ANVIL v1 scoring 3/5 (Orphic over-learning), CHROMA hitting GPU quota mid-forge, MANTIS initial forge scoring 2/5, multiple Colab authentication failures.
Depends on: SPEC_LOBSTER_FORGE_PIPELINE.md, SPEC_CORPUS_VERSIONING.md, SPEC_SMOKE_TEST_FRAMEWORK.md, SPEC_TRAINING_PAIR_STANDARDS.md
PURPOSE
Forges fail. They fail for different reasons and each reason has a different recovery path. This spec is the decision tree — when a forge fails, you don't panic, you don't guess, you look at this spec and follow the path. Every failure mode we've encountered is documented here with its fix.
FAILURE TAXONOMY
F1 — SMOKE TEST FAILURE
Brain forged but doesn't behave correctly.
Symptoms: Forge completes, GGUF converts, Ollama model creates, but the smoke test scores below 5/5 (the diagnosis table covers 0/5 through 4/5).
Real examples: ANVIL v1 (3/5, Orphic over-learning), MANTIS initial (2/5, terse governance refusals).
Diagnosis decision tree:
| Score | Interpretation | Recovery |
|---|---|---|
| 0/5 | Fundamental problem — brain learned nothing | Check base model, Modelfile, LoRA conversion. Likely a conversion error or empty corpus. |
| 1/5 | Serious corpus problem — brain learned something wrong | Review corpus quality per SPEC_TRAINING_PAIR_STANDARDS.md. Look for contradictory pairs, wrong voice, missing categories. |
| 2/5 | Targeted corpus gap — knows domain, fails governance/identity/edge | Identify which tests failed. Add 15-20 pairs targeting those specific areas. |
| 3/5 | Calibration issue — response-type imbalance or format issue | Add 20-30 pairs demonstrating correct behavior for failed tests. (ANVIL: Orphic balance. MANTIS: keyword format.) |
| 4/5 | Captain ruling | One test failed. Format issue (broaden test criteria) OR real knowledge gap (add 5-10 pairs). Captain decides: promote or reforge. |
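The table above can be encoded as a small lookup so that a given smoke score always maps to the same diagnosis. This is a minimal Python sketch: the names `SMOKE_DIAGNOSIS` and `diagnose_smoke_score` and the shape of the returned dictionary are illustrative assumptions, not part of the forge tooling. The pair ranges are copied from the table.

```python
# Hypothetical lookup built from the diagnosis table above.
# Keys are smoke scores out of 5; pair ranges are (min, max) pairs to add.
SMOKE_DIAGNOSIS = {
    0: ("Fundamental problem", None),
    1: ("Serious corpus problem", None),
    2: ("Targeted corpus gap", (15, 20)),
    3: ("Calibration issue", (20, 30)),
    4: ("Captain ruling", (5, 10)),  # pairs apply only to a real knowledge gap
}


def diagnose_smoke_score(passed: int) -> dict:
    """Map a smoke score (tests passed out of 5) to its table row."""
    if not 0 <= passed <= 5:
        raise ValueError("smoke score must be between 0 and 5")
    if passed == 5:
        # Assumption: a clean 5/5 is not an F1 failure at all.
        return {"interpretation": "No F1 failure", "pairs_to_add": None,
                "captain_ruling": False}
    interpretation, pairs = SMOKE_DIAGNOSIS[passed]
    return {"interpretation": interpretation, "pairs_to_add": pairs,
            "captain_ruling": passed == 4}
```

For example, `diagnose_smoke_score(2)` returns the "Targeted corpus gap" row with a suggested range of 15 to 20 new pairs.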
Recovery path:
Expand corpus per diagnosis
→ increment version per SPEC_CORPUS_VERSIONING.md
→ reforge
→ re-smoke
→ repeat until 3/5+ with Captain approval
Golden rule: Never reforge with the same corpus expecting different results.
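One way to enforce the golden rule mechanically is to fingerprint the corpus before each reforge and refuse to proceed if nothing changed. The sketch below assumes the corpus is a single file on disk; `corpus_fingerprint` and `corpus_changed` are hypothetical helper names, not existing forge scripts.

```python
import hashlib


def corpus_fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of a corpus file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_changed(previous_path: str, candidate_path: str) -> bool:
    """True when the candidate corpus differs from the one used in the last forge."""
    return corpus_fingerprint(previous_path) != corpus_fingerprint(candidate_path)
```

A reforge script could call `corpus_changed(old, new)` and stop early when it returns `False`.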
F2 — GPU QUOTA EXHAUSTION
Forge interrupted mid-training.
Symptoms: Colab T4 session disconnects. "GPU quota exceeded" error. Training stops at epoch N of 15.
Real example: CHROMA hit quota at epoch 5/15 after ANVIL + ORPHEUS consumed the rolling quota.
Diagnosis: This is NOT a brain problem. The brain was learning fine. The cloud ran out of free GPU time.
Recovery:
1. Wait for quota reset (typically 4-8 hours on Colab free tier)
2. Set wakeup alarm
3. Restart the forge from scratch
(LoRA fine-tuning is fast enough to restart rather than checkpoint-resume)
4. Upload corpus again if needed
5. Do NOT partially promote — a brain at epoch 5/15 is undertrained
Prevention: Sequence forges to stay within quota. After 2 forges, expect a quota wall. Plan non-GPU tasks (spec writing, infrastructure fixes) during cooldown.
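The cooldown arithmetic can be sketched in a few lines of Python. The 4 to 8 hour window and the two-forge wall come from this section; the function names `quota_retry_window` and `near_quota_wall` are illustrative assumptions, and actual Colab quota behavior may differ from these rules of thumb.

```python
from datetime import datetime, timedelta


def quota_retry_window(interrupted_at: datetime, min_hours: int = 4, max_hours: int = 8):
    """Return (earliest, latest) times by which the quota has typically reset."""
    return (interrupted_at + timedelta(hours=min_hours),
            interrupted_at + timedelta(hours=max_hours))


def near_quota_wall(forges_since_reset: int, wall: int = 2) -> bool:
    """True once the number of forges since the last reset reaches the expected wall."""
    return forges_since_reset >= wall
```

The earliest time is a reasonable moment for the wakeup alarm; the latest time bounds how long non-GPU tasks need to fill.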
Future: Tiiny AI (80GB RAM + NPU) eliminates Colab dependency entirely. Local forges. No quota. No internet required.
F3 — COLAB AUTHENTICATION FAILURE
OAuth flow fails.
Symptoms: colab_dispatch.py fails with OAuth errors. "Invalid grant," "token expired," "malformed auth code."
Real example: April 20 — OAuth flow failed in non-interactive environment, required --manual flag.
Diagnosis: OAuth tokens expire. Google's refresh tokens last for varying durations, so expiry is often discovered only mid-forge.
Recovery:
python3 ~/colab_dispatch.py --auth --manual
# → generates URL
# → Captain visits URL in browser
# → authorizes
# → pastes code back
# → new token saved to ~/.google_token.json
Prevention: Re-authenticate proactively when starting a forge session. Verify Colab auth is fresh BEFORE uploading corpus or starting the T4 notebook.
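A forge script can gate on the auth check before doing any upload work. The sketch below assumes that the check command exits with status 0 when the token is fresh and non-zero otherwise; that exit-code behavior is an assumption about colab_dispatch.py, not something this spec states. The helper name `auth_is_fresh` is hypothetical.

```python
import subprocess


def auth_is_fresh(check_command, timeout_s: int = 60) -> bool:
    """Run an auth check command and report whether it exited with status 0.

    Assumption: the command signals a stale or missing token via a non-zero
    exit status. A missing executable or a timeout also counts as "not fresh".
    """
    try:
        result = subprocess.run(check_command, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Usage might look like `auth_is_fresh(["python3", os.path.expanduser("~/colab_dispatch.py"), "--auth", "--check"])`, falling back to the manual flow shown above when it returns `False`.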
F4 — LORA CONVERSION FAILURE
Adapter won't convert to GGUF.
Symptoms: llama.cpp convert-lora-to-gguf fails. Missing adapter files. Incompatible formats. Size mismatches.
Diagnosis checklist:
- Check adapter_config.json for correct base model reference
- Verify adapter_model.safetensors downloaded completely (~630MB for 7B)
- Verify base model snapshot exists at expected HuggingFace cache path
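The first two checklist items can be automated. A minimal sketch, assuming the adapter was saved in the usual PEFT layout, where adapter_config.json carries a `base_model_name_or_path` field. The 600 MiB minimum size is an assumed threshold derived from the ~630MB figure above, and `check_adapter` is a hypothetical helper name. The HuggingFace cache check is left out because the expected path is not given here.

```python
import json
from pathlib import Path


def check_adapter(adapter_dir, expected_base="Qwen2.5-7B-Instruct",
                  min_weight_bytes=600 * 1024**2):
    """Return a list of problems found in a downloaded LoRA adapter directory."""
    problems = []
    adapter_dir = Path(adapter_dir)
    config_path = adapter_dir / "adapter_config.json"
    weights_path = adapter_dir / "adapter_model.safetensors"

    if not config_path.is_file():
        problems.append("adapter_config.json is missing")
    else:
        try:
            config = json.loads(config_path.read_text())
        except json.JSONDecodeError:
            problems.append("adapter_config.json is not valid JSON")
        else:
            base = str(config.get("base_model_name_or_path", ""))
            if expected_base not in base:
                problems.append(f"unexpected base model reference: {base!r}")

    if not weights_path.is_file():
        problems.append("adapter_model.safetensors is missing")
    elif weights_path.stat().st_size < min_weight_bytes:
        # Assumed threshold: a file well under ~630MB suggests a partial download.
        problems.append("adapter_model.safetensors looks truncated")
    return problems
```

An empty list means both files passed; anything else points at which file to re-download before retrying the conversion.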
Recovery:
Re-download adapter from GCS if corrupted.
Verify base model is correct version (Qwen2.5-7B-Instruct).
Re-run conversion with explicit paths.
If conversion truly fails: check llama.cpp version compatibility with adapter format.
F5 — OLLAMA MODEL CREATION FAILURE
ollama create fails.
Symptoms: "Error: model not found," "invalid Modelfile," "layer mismatch."
Diagnosis: Check Modelfile syntax. Verify GGUF path in Modelfile exists and is readable. Verify Ollama version supports the GGUF format.
Recovery:
# Clear partial state first
ollama rm [brain]
# Fix Modelfile paths, then re-create
ollama create [brain] -f Modelfile.[brain]
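Before re-running ollama create, the paths named in the Modelfile can be checked. The sketch below reads the `FROM` and `ADAPTER` instructions and reports any path-like value that does not exist. The heuristic for deciding what counts as a path (leading `/`, `./`, `../`, `~`, or a `.gguf` suffix) is an assumption, since `FROM` can also name a registry model; `missing_modelfile_paths` is a hypothetical helper name.

```python
from pathlib import Path


def missing_modelfile_paths(modelfile_path):
    """Return FROM/ADAPTER values in a Modelfile that look like paths but do not exist."""
    modelfile_path = Path(modelfile_path)
    missing = []
    for line in modelfile_path.read_text().splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        instruction, value = parts[0].upper(), parts[1].strip()
        if instruction not in ("FROM", "ADAPTER"):
            continue
        # Assumed heuristic: skip values that look like model names, not files.
        looks_like_path = (value.startswith(("/", "./", "../", "~"))
                           or value.endswith(".gguf"))
        if not looks_like_path:
            continue
        target = Path(value).expanduser()
        if not target.is_absolute():
            # Resolve relative paths against the Modelfile's own directory.
            target = modelfile_path.parent / target
        if not target.exists():
            missing.append(value)
    return missing
```

An empty result does not prove the GGUF is valid, only that the referenced files are present.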
F6 — GCS UPLOAD/DOWNLOAD FAILURE
gsutil operations fail.
Symptoms: "403 Access Denied," "404 Not Found," "network error."
Diagnosis:
# Verify credentials
export GOOGLE_APPLICATION_CREDENTIALS=~/gcs-service-account.json
gsutil ls gs://cgnt1-backup/ # if this works, credentials are fine
Recovery:
- Re-authenticate GCS and retry
- If 403 persists: the service account permissions may have been revoked → re-create from Google Cloud Console
- Check network connectivity to storage.googleapis.com
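For transient network errors, retrying with exponential backoff is a common approach. The sketch below is generic Python, not a wrapper the forge tooling is known to contain: `retry_with_backoff` is a hypothetical name, and it retries only `ConnectionError` and `TimeoutError`, so a caller wrapping gsutil would need to raise one of those for network failures and let permission errors (403) propagate immediately.

```python
import time


def retry_with_backoff(operation, attempts: int = 4, base_delay_s: float = 2.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay_s, then 2x, 4x, ... between attempts. Only network-style
    errors are retried; anything else is raised on the first occurrence.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * (2 ** attempt))
```

The `sleep` parameter exists so the delays can be replaced in tests without actually waiting.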
F7 — TRAINING DIVERGENCE
Loss increases instead of decreasing.
Symptoms: Validation loss increases after epoch 7+ instead of converging. The model is getting WORSE.
Diagnosis:
- Learning rate too high (model overshooting optimal weights)
- Corpus contains contradictory pairs (model can't satisfy both, oscillates)
- Corpus too small for the domain complexity (can't generalize)
Recovery:
ABORT GATE: if loss hasn't improved by epoch 7, abort and save GPU quota.
1. Review corpus for contradictions per SPEC_TRAINING_PAIR_STANDARDS.md
2. Reduce learning rate (try 1e-5 instead of 2e-5)
3. If corpus is clean and LR is correct: domain may need more pairs
→ add 30+ pairs, reforge
The abort gate at epoch 7 is mandatory. Don't burn GPU quota on a diverging forge.
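The abort gate can be expressed as a small predicate over the per-epoch validation losses. This sketch takes one possible reading of "hasn't improved by epoch 7": the loss at the gate epoch is not lower than the loss at epoch 1. That definition and the name `should_abort` are assumptions; a stricter or smoothed criterion may fit a particular training loop better.

```python
def should_abort(val_losses, gate_epoch: int = 7) -> bool:
    """Decide whether to abort at the gate epoch.

    val_losses[0] is the validation loss after epoch 1, and so on.
    Assumed rule: abort when the loss at the gate epoch is not below
    the loss recorded after the first epoch.
    """
    if len(val_losses) < gate_epoch:
        return False  # gate not reached yet; keep training
    return val_losses[gate_epoch - 1] >= val_losses[0]
```

A training callback could evaluate this once the seventh loss value is recorded and stop the run when it returns `True`.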
F8 — RAM EXHAUSTION DURING MODEL LOAD
VPS can't load the brain alongside existing services.
Symptoms: ollama create or ollama run fails with out-of-memory.
Real example: GAMMA (7.8GB) + MNEMOS (4.6GB) = 12.4GB committed, leaving <4GB for everything else.
Diagnosis:
free -h # check available RAM
Recovery:
# Evict non-essential models
ollama stop [least-needed-brain]
# Verify RAM freed, then retry
free -h
ollama create [brain] -f Modelfile.[brain]
Keep-alive priority (evict in reverse order): ROUTX first (keep alive), MNEMOS second, Sisters third; everything else loads on demand.
Rule for 16GB VPS: Maximum 2 concurrent 7B models with headroom. Check before loading.
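The RAM arithmetic behind this rule is simple enough to sketch. The 4GB threshold is taken from the pre-forge checklist ("evict if available <4GB"); the function names are hypothetical, and the calculation ignores memory used by the OS and other services, so it is an upper bound on what `free -h` would actually report.

```python
def available_after(total_gb: float, loaded_models_gb: dict) -> float:
    """RAM left after the listed models are loaded, ignoring all other usage."""
    return total_gb - sum(loaded_models_gb.values())


def needs_eviction(available_gb: float, threshold_gb: float = 4.0) -> bool:
    """True when available RAM is below the threshold for loading a new brain."""
    return available_gb < threshold_gb
```

With the GAMMA and MNEMOS sizes from the real example on a 16GB VPS, `available_after` gives roughly 3.6GB, which is below the threshold, so `needs_eviction` returns `True`.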
Future: Tiiny AI (80GB) fits all 8 brains simultaneously.
THE FORGE FAILURE CHECKLIST
When a forge fails at ANY stage:
1. What stage failed? (F1-F8 classification)
2. Is this a known failure mode?
Known → follow the recovery path above
New → diagnose, fix, document, add to this spec
3. Is the corpus intact?
Verify the JSONL file on GCS and locally.
4. Is the base model intact?
Verify Qwen2.5-7B-Instruct in HuggingFace cache.
5. Can we retry without changes?
F2 quota: yes, just wait
F3 auth: yes, re-auth
F6 network: yes, retry
6. Do we need to modify the corpus?
F1 smoke failure: yes
F7 divergence with contradictions: yes
7. Report to Captain:
→ what failed
→ which F-type
→ proposed recovery
→ whether Captain approval is needed
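The report in step 7 has a fixed shape, so it can be captured as a small data structure. A minimal sketch: the `FailureReport` class, its field names, and `can_retry_unchanged` are illustrative assumptions. The retry table mirrors checklist item 5.

```python
from dataclasses import dataclass

# From checklist item 5: failure types that can be retried with no changes.
RETRY_UNCHANGED = {
    "F2": "wait for quota reset",
    "F3": "re-authenticate",
    "F6": "retry",
}


@dataclass
class FailureReport:
    f_type: str
    what_failed: str
    proposed_recovery: str
    captain_approval_needed: bool

    def render(self) -> str:
        """Format the four report fields, one per line."""
        approval = "yes" if self.captain_approval_needed else "no"
        return "\n".join([
            f"What failed: {self.what_failed}",
            f"F-type: {self.f_type}",
            f"Proposed recovery: {self.proposed_recovery}",
            f"Captain approval needed: {approval}",
        ])


def can_retry_unchanged(f_type: str) -> bool:
    """True for failure types that need no corpus or config change before retrying."""
    return f_type in RETRY_UNCHANGED
```

Keeping the report structured makes it harder to omit one of the four required fields.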
FORGE SESSION CHECKLIST (pre-forge)
Before starting any forge, the Lobster verifies:
Step 0 Colab auth is fresh
python3 ~/colab_dispatch.py --auth --check
Step 1 Corpus uploaded to GCS and verified
File size matches local copy.
Step 2 Corpus reviewed per SPEC_TRAINING_PAIR_STANDARDS.md
At least one reviewer has read the pairs.
Step 3 RAM available for post-forge model load
free -h — evict if available <4GB.
Step 4 GPU quota estimated
How many forges today? Are we near the quota wall?
Step 5 Smoke test script exists and is correct
~/smoke_[brain].py present and correct for this brain.
If any step fails: fix before forging. Don't waste GPU time on a forge that will fail at the next stage.
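The checklist can be run as an ordered list of named checks that stops at the first failure. The sketch below is one way to structure that in Python; `first_failing_step` and `sizes_match` are hypothetical helpers, and how the remote size is obtained (for example from a `gsutil ls -l` listing) is left to the caller.

```python
import os


def sizes_match(local_path: str, remote_size_bytes: int) -> bool:
    """Step 1: the uploaded corpus should have the same size in bytes as the local copy."""
    return os.path.getsize(local_path) == remote_size_bytes


def first_failing_step(checks):
    """Run (name, check) pairs in order; return the first failing name, or None.

    Each check is a zero-argument callable returning True on success.
    Later checks are not run once one fails, matching "fix before forging".
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

A result of `None` means every step passed and the forge can start.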
INVARIANTS
INV-01: Every forge failure is classified F1-F8. New failure modes get added to this spec. The spec grows from experience, not theory.
INV-02: F1 (smoke failure) is the most common. The fix is always: diagnose which tests failed, expand corpus targeting those areas, reforge. Never reforge with the same corpus expecting different results.
INV-03: F2 (GPU quota) is not a failure — it's a resource constraint. Wait. Don't panic. Don't try to "fix" the brain.
INV-04: The abort gate at epoch 7 saves GPU quota on diverging forges. If loss is going UP after 7 epochs, kill it and diagnose corpus.
INV-05: Pre-forge checklist prevents F3, F4, F5, F6, F8. Most forge failures are preventable by checking prerequisites before starting.
INV-06: Every forge failure teaches something. The lesson goes into this spec. This spec is living documentation of the forge's hard-won wisdom.
INV-07: The Lobster reports every failure to the Captain with: F-type, what happened, proposed recovery, whether Captain approval is needed. Clear, structured, no panic.
INTEGRATION
| System | Relationship |
|---|---|
| SPEC_LOBSTER_FORGE_PIPELINE.md | The forge pipeline is the parent spec. This spec documents the failure modes within that pipeline. They are read together. |
| SPEC_CORPUS_VERSIONING.md | F1 recovery always increments corpus version. Versioning spec governs the naming and archiving. |
| SPEC_SMOKE_TEST_FRAMEWORK.md | F1 smoke failure diagnosis requires reading smoke test results. The smoke spec defines the tests. This spec interprets the scores. |
| SPEC_TRAINING_PAIR_STANDARDS.md | F1 and F7 (corpus quality) recovery requires reviewing pairs against this standard. |
| SPEC_BRAIN_ANVIL.md | ANVIL v1's 3/5 smoke failure (F1, Orphic over-learning) is the primary reference case. The v2 plan emerged from applying this spec to ANVIL's failure. |
| SPEC_BRAIN_MANTIS.md | MANTIS's initial 2/5 failure (F1) is the reference case for the 2/5 diagnosis path. |
| SPEC_BRAIN_CHROMA.md | CHROMA's epoch 5/15 quota stop (F2) is the reference case for GPU quota exhaustion recovery. |
| COLAB_AUTOMATION.md | F3 (Colab auth) and F4 (conversion) recovery rely on the Colab automation playbook for CDP navigation. |
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042