Colab Automation

SPEC_COLAB_AUTOMATION.md · 2026-04-20

SPEC_COLAB_AUTOMATION — Colab Automation (CDP Playbook)

Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16


PURPOSE

Define the canonical playbook for autonomous execution of CGNT-1 training jobs in Google Colab via Chrome DevTools Protocol (CDP). This spec captures every discovered workaround from the GLOSS v9 forge (2026-04-15) and formalizes the component into a repeatable, testable procedure.

The Colab automation pipeline is how CGNT-1 accesses T4 GPU compute for model fine-tuning without direct server GPU access. C.L.O.D. drives Colab via CDP on the virtual display (DISPLAY=:99). The CDP connection handles Monaco editor content injection and run triggers; completion is detected separately by polling GCS.

This is the only approved path for autonomous Colab GPU job dispatch. Any deviation from this playbook must be logged and the spec updated.

PROVEN status: v9 forge dispatched successfully 2026-04-15. Corpus downloaded (706 pairs, 87226 bytes), training run completed, DONE.json confirmed via Python GCS SDK.


INPUTS

Pre-Session Inputs (required before automation begins)

| Input | Source | Description |
|-------|--------|-------------|
| Training script | Local filesystem | e.g., /home/nous/colab_jobs/gloss_v9_combined.py; Python code injected into the Colab cell |
| GCS bucket name | .env / service account config | Target bucket for corpus download and output upload |
| Pre-signed download URL | Generated locally via service account | Signed GCS URL for corpus download (3h expiry); replaces all google.colab auth |
| Pre-signed upload URL | Generated locally via service account | Signed GCS PUT URL for output artifact upload |
| Service account JSON path | /home/nous/.google_service_account.json | Used to generate pre-signed URLs; C.L.O.D. runs the signing and does not expose the key |
| Colab notebook URL | Chromium session | Notebook must already be open on DISPLAY=:99 |
| CDP WebSocket URL | http://localhost:9222/json | Stable page ID across sessions while Chromium is open |
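
A minimal sketch of how the two pre-signed URLs in the table might be generated locally. It assumes the google-cloud-storage package is installed on the signing host; the function names make_signed_urls and open_bucket and the blob names passed in are hypothetical, not taken from the forge scripts.

```python
from datetime import timedelta

SIGNED_URL_TTL = timedelta(hours=3)  # 3h expiry, as stated in the inputs table


def make_signed_urls(bucket, corpus_blob_name, output_blob_name, ttl=SIGNED_URL_TTL):
    """Return (download_url, upload_url) for the corpus and the output artifact."""
    # GET URL: lets the Colab cell fetch the corpus with plain requests.
    download_url = bucket.blob(corpus_blob_name).generate_signed_url(
        version="v4", expiration=ttl, method="GET"
    )
    # PUT URL: lets the Colab cell upload its output without any credentials.
    upload_url = bucket.blob(output_blob_name).generate_signed_url(
        version="v4", expiration=ttl, method="PUT"
    )
    return download_url, upload_url


def open_bucket(bucket_name, key_path="/home/nous/.google_service_account.json"):
    """Build a bucket handle from the service account key (local host only)."""
    from google.cloud import storage  # lazy import; requires google-cloud-storage

    client = storage.Client.from_service_account_json(key_path)
    return client.bucket(bucket_name)
```

Only the resulting URLs are pasted into the training script; the key file itself never leaves the signing host.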

Runtime State Inputs (verified at session start)

| Check | Command | Expected |
|-------|---------|----------|
| T4 GPU runtime active | Runtime menu → Change runtime type | "GPU,T4" selected; status bar shows "T4 (Python 3)" |
| No Gemini panel blocking | Visual check | No "Accept & Run" / "Accept" / "Cancel" buttons visible in right panel |
| Colab page ID stable | curl -s http://localhost:9222/json | Target page present with known ID |
| No google.colab auth calls in training script | Code review | All GCS ops via requests + pre-signed URLs only |
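
The last check in the table is listed as a code review. One way it could be backed by a static scan is sketched below, using only the standard library. The function name find_forbidden_imports is hypothetical, and the scan only sees ordinary import statements, so dynamic imports (for example via importlib) would still need a manual review.

```python
import ast

# Packages the Colab cell must not import, per the pre-signed-URL-only rule.
FORBIDDEN = ("google.colab", "google.cloud.storage")


def find_forbidden_imports(source):
    """Return sorted line numbers of imports that touch a forbidden package."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # "from google import colab" has module "google" and alias "colab",
            # so check both the module and each module.alias combination.
            names = [module] + [f"{module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            if any(name == f or name.startswith(f + ".") for f in FORBIDDEN):
                hits.append(node.lineno)
                break
    return sorted(hits)
```

An empty list means the script passes this check; any line numbers returned point at imports to remove before injection.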


OUTPUTS

During Execution

Completion Signal

gs://cgnt-1colab-jobs/colab_jobs/{job_name}_output/DONE.json

Post-Forge Artifacts (GLOSS forge path)

| Artifact | Location | Description |
|----------|----------|-------------|
| Merged adapter | GCS → local via gdrive_download_v9.py | LoRA adapter merged with base model |
| F16 GGUF | /home/nous/gloss_vN_f16.gguf | Full-precision conversion via llama.cpp/convert_hf_to_gguf.py |
| Q4_K_M GGUF | /home/nous/gloss_vN_Q4_K_M.gguf | Quantized model for Ollama deployment |
| Ollama model | gloss:vN | Final deployed model via ollama create |


INVARIANTS

  1. Pre-signed URLs only — no google.colab auth. _auth.authenticate_user() from google.colab fails in CDP-automated context with MessageError: credential propagation was unsuccessful. This silently kills the cell with no output. ALL GCS operations in the Colab cell must use requests + pre-signed HTTPS URLs. No google-cloud-storage SDK calls inside the Colab cell. No google.colab imports.
  2. Monaco model[2] is the active cell target. monaco.editor.getModels()[0] is the Colab document/tab model — NOT the cell. The active code cell is model[2]. Setting model[0] does not update the visible cell. The correct target is always model[2] (or identified via focused editor method).
  3. Cell must be activated before content injection. Click the Monaco editor div at viewport (252, 109) before calling setValue(). Verify document.activeElement.className contains "inputarea monaco-mouse-cursor-text". Injection without activation may fail silently.
  4. GCS DONE.json must be polled via Python SDK, not gsutil. gsutil cat returns exit code 0 with empty content when the blob does not exist — false positive. The Python GCS SDK blob.exists() + blob.download_as_text() is the only reliable completion signal.
  5. T4 GPU runtime must be verified before script injection. If CPU runtime is active, Unsloth raises NotImplementedError: Unsloth cannot find any torch accelerator. The runtime type check is mandatory — it cannot be skipped for speed.
  6. Gemini panel must be closed before execution. If the Gemini panel is open in Colab, code execution may be intercepted by Gemini's "Accept & Run" / "Accept" / "Cancel" dialog. Close via X button at display ~(1239, 188) before running.
  7. DONE.json polling interval: 3 minutes minimum. Training takes 55–75 minutes on T4 for a standard corpus. Polling faster than 3 minutes wastes GCS API calls and adds noise to logs. Polling must stop within 5 minutes of DONE.json appearing.
  8. CDP page ID is stable while Chromium is alive. The WebSocket page ID does not change between automation sessions as long as Chromium is not restarted. Verify on each session start with curl -s http://localhost:9222/json.
  9. Coordinate system is fixed. Browser window at display (2, 40). Chrome header = 87px. Viewport→Display: display_x = viewport_x + 2, display_y = viewport_y + 127. Any display resolution change invalidates all hardcoded coordinates — re-derive on resolution change.
  10. Run All shadow DOM path is canonical. The toolbar "Run all" button is three levels of shadow DOM deep. The canonical path is: colab-notebook-toolbar-run-button → shadowRoot → colab-toolbar-button → shadowRoot → MD-TEXT-BUTTON.click(). Alternatively: Ctrl+Enter via CDP keyDown after cell activation.
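
The viewport-to-display mapping in the coordinate invariant above can be written as a pair of small helpers. This is an illustrative sketch; the constant and function names are hypothetical, and the offsets are exactly the ones stated in the invariant (window at display (2, 40), Chrome header 87px).

```python
WINDOW_X = 2          # browser window x position on the display
WINDOW_Y = 40         # browser window y position on the display
CHROME_HEADER = 87    # height of the Chrome header in pixels

OFFSET_X = WINDOW_X                   # 2
OFFSET_Y = WINDOW_Y + CHROME_HEADER   # 40 + 87 = 127


def viewport_to_display(x, y):
    """Map page viewport coordinates to display coordinates."""
    return (x + OFFSET_X, y + OFFSET_Y)


def display_to_viewport(x, y):
    """Inverse mapping, for targets recorded in display coordinates."""
    return (x - OFFSET_X, y - OFFSET_Y)


# The Monaco activation click at viewport (252, 109) lands at display (254, 236).
print(viewport_to_display(252, 109))
```

Keeping the offsets as named constants means a resolution or window-position change needs one edit rather than a hunt through hardcoded numbers.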

VERIFICATION CRITERIA

VC-1 — Pre-session checklist passes: Before any training script injection, all four pre-session checks pass: (a) T4 runtime confirmed, (b) no Gemini panel blocking, (c) CDP page ID verified, (d) no google.colab auth in training script.

VC-2 — Cell activation confirmed: After clicking Monaco editor at viewport (252, 109), verify document.activeElement.className contains "inputarea monaco-mouse-cursor-text". If activeElement is not the Monaco textarea, activation has failed; do not proceed.
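
A sketch of the CDP messages that VC-2 implies, built as plain dictionaries so they can be serialized with json.dumps and sent over whatever WebSocket client is in use. Input.dispatchMouseEvent and Runtime.evaluate are standard CDP methods; the helper names are hypothetical, and treating the check as "both class names present" is an interpretation of the "contains" wording.

```python
import itertools

_ids = itertools.count(1)  # CDP requires a unique integer id per message


def cdp_message(method, params):
    return {"id": next(_ids), "method": method, "params": params}


def click_messages(x, y):
    """Press and release the left button at viewport coordinates (x, y)."""
    base = {"x": x, "y": y, "button": "left", "clickCount": 1}
    return [
        cdp_message("Input.dispatchMouseEvent", {"type": "mousePressed", **base}),
        cdp_message("Input.dispatchMouseEvent", {"type": "mouseReleased", **base}),
    ]


def active_class_message():
    """Ask the page for the class name of the currently focused element."""
    expr = "document.activeElement ? document.activeElement.className : ''"
    return cdp_message("Runtime.evaluate", {"expression": expr, "returnByValue": True})


def is_monaco_active(class_name):
    """True when the focused element carries both Monaco textarea classes."""
    required = {"inputarea", "monaco-mouse-cursor-text"}
    return required.issubset(class_name.split())
```

CDP mouse events take viewport coordinates, so (252, 109) is passed directly without the display offset.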

VC-3 — Content injection verified: After monaco.editor.getModels()[2].setValue(training_code), verify injected content length equals expected character count of training script (±0). Length mismatch = injection failure.
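
A sketch of how the injection expression and the length check in VC-3 might be built. It assumes the script uses \n line endings and that the expression is evaluated in the page via CDP Runtime.evaluate; the helper names are hypothetical. One detail worth noting: JavaScript string length counts UTF-16 code units, so the expected count is computed the same way rather than with Python's len.

```python
import json

CELL_MODEL_INDEX = 2  # model[2] is the active code cell per the invariants


def utf16_length(text):
    """Length of text as JavaScript would report it (UTF-16 code units)."""
    return len(text.encode("utf-16-le")) // 2


def injection_expression(code, model_index=CELL_MODEL_INDEX):
    """JS that sets the cell content and returns the resulting length."""
    literal = json.dumps(code)  # JSON string literals are valid JS literals
    return (
        "(() => {"
        f" const m = monaco.editor.getModels()[{model_index}];"
        f" m.setValue({literal});"
        " return m.getValue().length;"
        " })()"
    )


def injection_ok(reported_length, code):
    """Exact match required (the spec allows a tolerance of zero)."""
    return reported_length == utf16_length(code)
```

A reported length of 0, or any mismatch, is treated as an injection failure and the run trigger is not sent.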

VC-4 — Pre-signed URL validity: Test corpus download URL with requests.get(SIGNED_DOWNLOAD_URL) before injecting into Colab. Expected: HTTP 200, content-length matches corpus file size. Failure → regenerate URLs before proceeding.
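
The VC-4 check can be expressed as a small pure function over the response fields, which keeps it testable without network access. This is a sketch under assumptions: validate_download is a hypothetical name, and the caller is expected to fetch the URL with requests.get and pass in the status code, headers, and body.

```python
def validate_download(status_code, headers, body, expected_size):
    """Return a list of problems; an empty list means the URL passed VC-4."""
    problems = []
    if status_code != 200:
        problems.append(f"HTTP {status_code}, expected 200")
    declared = headers.get("Content-Length")
    if declared is not None and int(declared) != expected_size:
        problems.append(f"Content-Length {declared}, expected {expected_size}")
    if len(body) != expected_size:
        problems.append(f"body is {len(body)} bytes, expected {expected_size}")
    return problems


# Intended usage on the signing host (not executed here):
#   resp = requests.get(SIGNED_DOWNLOAD_URL, timeout=60)
#   problems = validate_download(resp.status_code, resp.headers, resp.content, 87226)
#   if problems: regenerate the URLs before injecting the script
```

The 87226 in the usage comment is the v9 corpus size recorded in the PROVEN status note; other corpora need their own expected size.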

VC-5 — DONE.json detection accuracy: After training completes, Python GCS SDK blob.exists() returns True and blob.download_as_text() returns non-empty JSON. A zero exit code from gsutil (even when invoked via subprocess) is not accepted as a valid completion signal.
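
A sketch of a polling loop that satisfies VC-5 and the 3-minute interval invariant. The blob argument is expected to be a google-cloud-storage Blob (only exists() and download_as_text() are used). The max_wait_s parameter is an assumption added for illustration, since this spec does not define a timeout; the function names are hypothetical.

```python
import json
import time

POLL_INTERVAL_S = 180  # 3 minutes, the minimum interval allowed by the spec


def read_done(blob):
    """Return parsed DONE.json, or None if it is missing or empty."""
    if not blob.exists():
        return None
    text = blob.download_as_text()
    if not text.strip():
        return None
    return json.loads(text)


def wait_for_done(blob, max_wait_s, interval_s=POLL_INTERVAL_S,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until DONE.json is readable; raise TimeoutError past max_wait_s."""
    deadline = clock() + max_wait_s
    while True:
        done = read_done(blob)
        if done is not None:
            return done  # stop polling as soon as the signal is confirmed
        if clock() + interval_s > deadline:
            raise TimeoutError("DONE.json not found within max_wait_s")
        sleep(interval_s)
```

The sleep and clock parameters exist so the loop can be exercised in tests without waiting; production callers would leave them at their defaults.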

VC-6 — Post-forge artifact integrity: After DONE.json confirmed: (a) merged adapter downloads successfully, (b) GGUF conversion produces non-zero file at expected path, (c) ollama create gloss:vN exits 0, (d) ollama list shows gloss:vN loaded.
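
Two of the VC-6 checks lend themselves to small helpers: the non-zero GGUF file check and the ollama list check. The sketch below assumes ollama list prints a header row followed by one model per line with the model name in the first column; that layout and the helper names are assumptions, not something this spec states.

```python
import os


def gguf_nonempty(path):
    """True if the converted file exists and is larger than zero bytes."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def model_listed(ollama_list_output, model_name):
    """True if model_name appears in the first column of the listing."""
    for line in ollama_list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == model_name:
            return True
    return False


# Intended usage after conversion (not executed here):
#   out = subprocess.run(["ollama", "list"], capture_output=True,
#                        text=True, check=True).stdout
#   ok = gguf_nonempty("/home/nous/gloss_vN_Q4_K_M.gguf") and model_listed(out, "gloss:vN")
```

A size check only rules out the empty-file case; a file that is non-empty but unloadable still needs the ollama create exit code and a smoke test.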

VC-7 — Full dispatch sequence reproducibility: The 11-step dispatch sequence (COLAB_AUTOMATION.md §Full Dispatch Sequence) must be executable from step 1 to DONE.json confirmation without human intervention. Any step requiring manual action is a spec gap.


FAILURE MODES

FM-1 — Silent cell death from google.colab auth. Training script contains _auth.authenticate_user(). Cell runs, produces no output, no error, no DONE.json. Root cause: CDP-automated context cannot propagate Google OAuth credentials. Resolution: remove all google.colab imports; replace all GCS ops with pre-signed URL + requests. Detection: cell output iframe monitoring via CDP outputframe target.

FM-2 — Wrong Monaco model targeted. getModels()[0].setValue() called instead of getModels()[2].setValue(). Cell content does not update. Cell appears empty when run. Detection: length verification after setValue — if length is 0 or doesn't match, wrong model was targeted.

FM-3 — Cell activation skipped. setValue() called without prior click + activeElement verification. Content may not persist in the cell, or may be written to the wrong model. Detection: VC-2 check; content length mismatch.

FM-4 — CPU runtime executes training job. T4 check not performed or skipped. Unsloth raises NotImplementedError. Cell exits within seconds with no output. Detection: timing anomaly (training in <2 minutes is a failure signal); cell output contains NotImplementedError.

FM-5 — Gemini panel intercepts execution. Gemini "Accept & Run" dialog fires. Training does not start. May appear as normal cell activation with no progress. Detection: screenshot check after run trigger; look for Gemini dialog buttons.

FM-6 — gsutil false positive. gsutil cat returns exit code 0 with empty content before DONE.json exists. Automation proceeds to post-forge pipeline prematurely. No artifacts available. Detection: always use Python GCS SDK for DONE.json poll — never gsutil.

FM-7 — Pre-signed URL expired. URL generated with 3h expiry. If forge takes longer than 3h (unlikely but possible with slow pip install) or if there's a large queue delay, upload URL is expired when training script tries to write output. Detection: HTTP 403 on upload in Colab output; regenerate URLs if expiry is approaching.
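
Whether expiry is approaching can be read from the URL itself, because V4 signed URLs carry their signing time and lifetime in the X-Goog-Date and X-Goog-Expires query parameters. The sketch below assumes V4 signing; the 90-minute safety margin and the function names are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC datetime at which a V4 signed URL stops working."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Goog-Expires"][0]))
    return signed_at + lifetime


def needs_refresh(url, now, margin=timedelta(minutes=90)):
    """True when less than `margin` of validity remains at time `now`."""
    return signed_url_expiry(url) - now < margin
```

Running this check immediately before injection catches the case where URLs were generated early and a queue delay has eaten into the 3h window.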

FM-8 — CDP page ID stale. Chromium was restarted between automation sessions. Hardcoded page ID no longer valid. CDP connection fails. Detection: curl -s http://localhost:9222/json at session start — page ID must be re-verified every session.
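
The session-start verification can be done programmatically against the same endpoint the curl command uses. The /json endpoint returns a list of target objects with fields such as id, type, url, and webSocketDebuggerUrl. In this sketch the helper names are hypothetical, and matching the Colab page by the colab.research.google.com hostname is an assumption.

```python
import json
from urllib.request import urlopen


def list_targets(endpoint="http://localhost:9222/json"):
    """Fetch the current CDP target list from Chromium."""
    with urlopen(endpoint, timeout=5) as resp:
        return json.load(resp)


def find_colab_page(targets, url_fragment="colab.research.google.com"):
    """Return the single page target whose URL contains url_fragment."""
    pages = [
        t for t in targets
        if t.get("type") == "page" and url_fragment in t.get("url", "")
    ]
    if len(pages) != 1:
        raise LookupError(f"expected exactly one Colab page, found {len(pages)}")
    return pages[0]


def page_id_is_current(targets, known_id):
    """True if the previously recorded page ID is still a live page target."""
    return any(t.get("type") == "page" and t.get("id") == known_id for t in targets)
```

If page_id_is_current returns False, the recorded ID is discarded and find_colab_page supplies the new one along with its webSocketDebuggerUrl.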

FM-9 — Coordinate system invalidated. Display resolution or browser window position has changed. Click coordinates miss targets. Detection: take a screenshot before all coordinate-dependent clicks and verify visual position before proceeding.

FM-10 — GGUF conversion produces corrupt file. llama.cpp/convert_hf_to_gguf.py exits 0 but output file is 0 bytes or fails to load in Ollama. Detection: file size check after conversion; ollama create exit code; ollama run gloss:vN smoke test.


GAPS

GAP-01 — Outputframe monitoring not formalized. COLAB_AUTOMATION.md notes that reading the outputframe CDP target bypasses private outputs mode and reveals hidden errors (e.g., the silent google.colab auth failure). The exact CDP sequence to attach to and read the outputframe is documented as a diagnostic tool but not formalized as a monitoring step in the dispatch sequence.

GAP-02 — Drive auth dialog coordinates may drift. The MWC-DIALOG.yes-no-dialog Allow button is at viewport rect {x:847, y:519, w:64, h:36}. If Colab updates its UI, these coordinates change. The deepClick(document, 'Allow') JS approach is more robust but depends on text content matching exactly. No automated detection of dialog coordinate drift.

GAP-03 — Multi-cell notebook not specified. The dispatch sequence assumes a single-cell notebook (one code cell, one training script). If the notebook has multiple cells, model[2] index may not correspond to the correct cell. No multi-cell selection logic specified.

GAP-04 — Service account key rotation procedure not specified. Pre-signed URL generation uses /home/nous/.google_service_account.json. Key rotation would require regeneration of all active signed URLs. Rotation procedure and automation impact not specified.

GAP-05 — Forge timing envelope not enforced. Reference timing: 60–90 minutes total. No automated alert fires if forge exceeds 2× expected time (potential hung session). DONE.json polling continues indefinitely. A max-wait timeout and hung-forge detection are not specified.

GAP-06 — Post-forge eval pipeline not formally specified. gloss_eval_lx.py --model gloss:vN is referenced in COLAB_AUTOMATION.md but the pass/fail criteria, expected score thresholds, and what happens on eval failure are not specified in this spec or in SPEC_GLOSS_EVAL_v2.md.


DEPENDENCIES

DEPENDENTS


FULL DISPATCH SEQUENCE (Canonical — PROVEN 2026-04-15)

  1. Verify Colab notebook is open on DISPLAY=:99 with a code cell
  2. Run pre-session checklist: T4 runtime, no Gemini panel, CDP page ID, no google.colab auth in script
  3. Generate pre-signed URLs locally (3h expiry) via Python GCS SDK + service account
  4. Click Monaco editor at viewport (252, 109) to activate
  5. Verify document.activeElement.className contains "inputarea"
  6. Inject: monaco.editor.getModels()[2].setValue(training_code)
  7. Verify content length matches expected chars
  8. Trigger run: CDP Input.dispatchKeyEvent with type="keyDown", key="Enter", modifiers=2 (Ctrl)
  9. Wait ~2s, check for Google Drive auth dialog
  10. If dialog present: deepClick(document, 'Allow')
  11. Poll GCS every 3 min for DONE.json via Python SDK (not gsutil)
  12. On DONE.json confirmed: proceed to post-forge pipeline
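
The run trigger in step 8 can be sketched as CDP message payloads. In the CDP modifiers bit field, Ctrl is 2 (Alt=1, Meta=4, Shift=8). The sequence only names keyDown; the matching keyUp, the code and windowsVirtualKeyCode fields, and the helper names are assumptions added here. The second helper builds the toolbar alternative from the shadow DOM path given in the invariants, treating the element names as tag selectors, which is also an assumption.

```python
CTRL = 2  # CDP modifier bit for the Ctrl key


def ctrl_enter_messages(start_id=1):
    """keyDown followed by keyUp for Ctrl+Enter, as CDP message dicts."""
    common = {
        "key": "Enter",
        "code": "Enter",
        "windowsVirtualKeyCode": 13,
        "modifiers": CTRL,
    }
    return [
        {"id": start_id, "method": "Input.dispatchKeyEvent",
         "params": {"type": "keyDown", **common}},
        {"id": start_id + 1, "method": "Input.dispatchKeyEvent",
         "params": {"type": "keyUp", **common}},
    ]


# Alternative trigger: click the toolbar button through its shadow roots.
RUN_ALL_EXPR = (
    "document.querySelector('colab-notebook-toolbar-run-button')"
    ".shadowRoot.querySelector('colab-toolbar-button')"
    ".shadowRoot.querySelector('md-text-button').click()"
)


def run_all_message(msg_id):
    return {"id": msg_id, "method": "Runtime.evaluate",
            "params": {"expression": RUN_ALL_EXPR}}
```

Either trigger is sent only after steps 4 to 7 have passed, since Ctrl+Enter acts on whichever cell currently holds focus.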

Specification authored by κ (C.L.O.D.) — April 16 2026

Authorized: α.13

PROVEN (v9 forge, 2026-04-15)

*Φ 0.042

Jeremy Zlabis

Chronogeometer · Visionary · Disruptor · Chief

42 Sisters AI · East York, Toronto*