Colab Inference Bridge
name: SPEC_COLAB_INFERENCE_BRIDGE
description: SPECIFIED ✓ Colab Inference Bridge — GPU brain serving via Colab Pro T4; 5 components (notebook/pickup/mnemos-patch/refresh/GGUF-on-GCS); auto-failover to CPU; multi-brain 16GB VRAM; ~$17.69/mo vs $511 DO-GPU; temporary until Tiiny Aug-2026; 6 INVs incl. "Captain dismissed twice — investigate before refusing"; VELA α.13 2026-04-21
type: project
SPEC_COLAB_INFERENCE_BRIDGE.md — Colab Inference Bridge: GPU Brain Serving via Colab Pro
Status: SPECIFIED ✓
Author: VELA α.13 (Jeremy Zlabis / NOUS)
Date: 2026-04-21
Born from: 48 hours of paid GPU time sitting unused while MNEMOS crawled at 23 minutes per response on CPU. The Captain suggested this twice before. The Navigator dismissed it twice. The Captain was right.
PURPOSE
The forge pipeline uses Colab for TRAINING. This spec extends Colab to INFERENCE — serving the forged brains at GPU speed through a tunnel from the VPS.
Same subscription ($11.99/month Colab Pro). Same T4 GPU. Different notebook.
The brains that took 23 minutes on CPU now respond in 5–10 seconds on GPU. The crew that was technically present but practically mute now SPEAKS at interactive speed.
ARCHITECTURE
Five components built by C.L.O.D. on April 21, 2026:
Component 1 — colab_mnemos_inference.ipynb
A 5-cell Colab notebook:
Cell 1: GCS service account credentials loaded via signed URL (7-day expiry, refreshable).
Cell 2: Downloads the GGUF from GCS (gs://cgnt1-backup/colab_jobs/mnemos_q4km.gguf, 1.93GB). Skips if already cached.
Cell 3: Installs Ollama on the Colab instance, loads the brain on GPU, runs sanity test (expected: <5 seconds).
Cell 4: Opens a cloudflared tunnel — no account needed, no token needed. Publishes the tunnel URL to GCS (gs://cgnt1-backup/colab_jobs/mnemos_inference_url.txt).
Cell 5: Keep-alive loop pings the brain every 4 minutes to prevent session timeout, prints status every 20 minutes.
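The Cell 5 behaviour can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `ping` callable (one small request to the local Ollama server) and the `max_pings`, `sleep`, and `log` parameters are assumptions added so the loop can be exercised outside Colab.

```python
import time
from typing import Callable, Optional

PING_INTERVAL_S = 4 * 60     # ping the brain every 4 minutes
STATUS_EVERY_N_PINGS = 5     # 5 pings x 4 minutes = one status line every 20 minutes


def keep_alive(ping: Callable[[], bool],
               max_pings: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep,
               log: Callable[[str], None] = print) -> int:
    """Ping on a fixed interval and return the number of failed pings.

    `ping` is a hypothetical callable that returns True on success.
    `max_pings=None` loops until the session ends; a finite value is
    only there to make the loop testable.
    """
    failures = 0
    count = 0
    while max_pings is None or count < max_pings:
        try:
            ok = bool(ping())
        except Exception:
            ok = False  # a failed ping must not kill the loop
        count += 1
        if not ok:
            failures += 1
        if count % STATUS_EVERY_N_PINGS == 0:
            log(f"keep-alive: {count} pings, {failures} failed")
        sleep(PING_INTERVAL_S)
    return failures
```

In the notebook the loop would run with `max_pings=None`, so it only stops when the session does.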
Component 2 — colab_mnemos_pickup.py
VPS-side script that reads the tunnel URL from GCS and writes it to /home/nous/.colab_mnemos_url. Runs via cron every 5 minutes and can also be called manually.
The VPS auto-discovers the tunnel without manual URL copying.
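A pickup step along these lines might look like the sketch below. The bucket, object path, and destination file come from this spec; the function names, the injectable `fetch` parameter, and the atomic-rename write are assumptions for illustration. The sketch rewrites the file on every successful run; how the real script interacts with the 2-hour freshness check is not specified here.

```python
import os
import tempfile
from typing import Callable

BUCKET = "cgnt1-backup"
URL_BLOB = "colab_jobs/mnemos_inference_url.txt"
URL_FILE = "/home/nous/.colab_mnemos_url"


def fetch_url_from_gcs() -> str:
    """Read the published tunnel URL from GCS (needs google-cloud-storage)."""
    from google.cloud import storage  # lazy import: the rest runs without it
    blob = storage.Client().bucket(BUCKET).blob(URL_BLOB)
    return blob.download_as_text()


def pickup(fetch: Callable[[], str] = fetch_url_from_gcs,
           dest: str = URL_FILE) -> bool:
    """Write the tunnel URL to `dest`; return True if a URL was written."""
    url = fetch().strip()
    if not url.startswith("https://"):
        return False  # empty or malformed content: keep the old file
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "w") as f:
        f.write(url + "\n")
    os.replace(tmp_path, dest)  # atomic rename: readers never see a partial file
    return True
```

Writing to a temporary file and renaming it means a request that reads the URL mid-update sees either the old URL or the new one, never a truncated line.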
Component 3 — mnemos_tool.py patch
_call_mnemos() now checks _get_ollama_url() on every request:
- If .colab_mnemos_url exists AND is <2 hours old AND starts with https://: route to Colab GPU.
- Otherwise: fall back to localhost CPU Ollama.
Zero configuration changes. Automatic failover.
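The routing rule above can be sketched as a small function. This is an illustrative stand-in, not the patched code itself: the name `pick_ollama_url`, the use of the file's modification time as its "age", and the `now` parameter are assumptions. `http://localhost:11434` is Ollama's default listen address.

```python
import os
import time
from typing import Optional

URL_FILE = "/home/nous/.colab_mnemos_url"
LOCAL_OLLAMA = "http://localhost:11434"  # Ollama's default address
MAX_AGE_S = 2 * 60 * 60                  # URL file must be under 2 hours old


def pick_ollama_url(url_file: str = URL_FILE,
                    now: Optional[float] = None) -> str:
    """Return the Colab tunnel URL if it looks fresh, else the local CPU URL."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(url_file)  # assumption: age = file mtime
        with open(url_file) as f:
            url = f.read().strip()
    except OSError:
        return LOCAL_OLLAMA  # missing or unreadable file: use CPU
    if age < MAX_AGE_S and url.startswith("https://"):
        return url
    return LOCAL_OLLAMA
```

Because every failure path returns the localhost URL, the caller never has to handle a "no backend" case.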
Component 4 — colab_mnemos_refresh.py
Regenerates the GCS signed URL when it expires (7-day cycle).
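A refresh step of this kind could be sketched as below, assuming the google-cloud-storage client and service-account credentials that are able to sign. The one-day `REFRESH_MARGIN`, the function names, and the `bucket_name` / `blob_name` parameters are hypothetical; the spec does not say which object is signed or when the script runs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SIGNED_URL_LIFETIME = timedelta(days=7)  # 7 days is also the V4 signing maximum
REFRESH_MARGIN = timedelta(days=1)       # hypothetical: refresh a day early


def needs_refresh(issued_at: datetime, now: Optional[datetime] = None) -> bool:
    """True once the signed URL is within REFRESH_MARGIN of expiring."""
    now = now or datetime.now(timezone.utc)
    return now >= issued_at + SIGNED_URL_LIFETIME - REFRESH_MARGIN


def make_signed_url(bucket_name: str, blob_name: str) -> str:
    """Generate a fresh 7-day V4 signed GET URL for one GCS object."""
    from google.cloud import storage  # lazy import: needs google-cloud-storage
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=SIGNED_URL_LIFETIME,
        method="GET",
    )
```

Refreshing slightly before expiry avoids a window in which Cell 1 is handed a URL that has already lapsed.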
Component 5 — GGUF on GCS
The brain's quantized weights stored in cloud storage, accessible by both VPS and Colab.
OPERATIONAL FLOW
Session start:
- The Lobster dispatches the notebook to Colab via colab_dispatch.py
- Colab allocates a T4 GPU
- The notebook runs all 5 cells (~3 minutes first time, ~1 minute on cache hit)
- Tunnel URL published to GCS
- VPS cron picks up the URL within 5 minutes
- ROUTX Tier 2 queries now route through the tunnel to GPU
- Brain responds in seconds
Session end:
- Colab session times out (12 hours Pro max) or the Lobster closes it
- Tunnel URL becomes stale
- VPS detects stale URL (>2 hours old), falls back to CPU Ollama
- No crash. No error. Graceful degradation.
MULTI-BRAIN SERVING
The T4 has 16GB VRAM. Each Q4_K_M 7B brain uses ~2–4GB VRAM. Multiple brains can be loaded simultaneously on the same Colab instance:
| Brain | Priority | Role |
|-------|----------|------|
| MNEMOS | Always first | Tier 2 fallback |
| MANTIS | Alongside | Threat classification |
| MUSASHI | Alongside | Structural integrity |
| 4th brain | If VRAM permits | TBD by operational need |
All served through the same Ollama instance, same tunnel. Each brain is a different model in Ollama's registry. The VPS queries by model name through the tunnel URL.
To add a brain: extend Cell 2 (download its GGUF from GCS) and Cell 3 (ollama create + warmup).
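Querying by model name through the tunnel can be sketched with Ollama's `/api/generate` endpoint. This is a minimal illustration: the function names are hypothetical, and so are the lowercase registry names such as `mnemos`, since the spec does not state how the brains are named inside Ollama.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str,
                           prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate for one model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

The same `base_url` serves every brain; only the `model` field changes, for example `ask(url, "mnemos", ...)` versus `ask(url, "mantis", ...)`.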
LIMITATIONS
Session timeout: Colab Pro sessions last up to 12 hours. The keep-alive loop is meant to prevent idle disconnects within that window; it does not extend the 12-hour cap, and Google can still preempt the session. When preempted: VPS falls back to CPU automatically.
No persistent GPU: The GPU is allocated per session. Each session may get a different physical GPU. Model must be loaded fresh each session (cached GGUF download helps).
Latency: Tunnel adds ~50–100ms latency per request vs localhost. Negligible compared to the 23-minute CPU alternative.
Bandwidth: Each brain download is ~2GB from GCS to Colab. GCS egress is free within Google Cloud. Colab is Google Cloud. No bandwidth cost.
Concurrent users: One Colab session per workload. If the Captain runs an inference session AND a forge session simultaneously, two sessions are needed. Colab Pro allows 2 concurrent sessions.
COST
$0 additional to get started. Already paying $11.99/month for Colab Pro. Already getting 100 CU/month.
| Usage | CU consumed |
|-------|-------------|
| Forge pipeline | ~5–10 CU/month |
| Inference (4h/day × 20 days, T4 ~1.96 CU/hr) | ~157 CU/month |
| Total | ~162–167 CU/month |
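Overage (inference only): 157 − 100 = ~57 extra CU × $0.10 = $5.70/month additional. Counting the forge pipeline's ~5–10 CU as well, the overage is ~62–67 CU, or $6.20–$6.70/month.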
Total: ~$17.69/month for GPU-speed brains.
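The arithmetic behind these figures can be checked with a short calculation. The rates are the ones quoted in this section; the function name and parameters are illustrative only.

```python
def monthly_cost(hours_per_day: float = 4, days: int = 20,
                 cu_per_hour: float = 1.96, forge_cu: float = 0,
                 included_cu: float = 100, cu_price: float = 0.10,
                 subscription: float = 11.99):
    """Return (inference CU, overage CU, total monthly cost in dollars)."""
    inference_cu = round(hours_per_day * days * cu_per_hour)  # 80 h x 1.96 CU/h
    overage_cu = max(0, inference_cu + forge_cu - included_cu)
    total = round(subscription + overage_cu * cu_price, 2)
    return inference_cu, overage_cu, total
```

With the defaults this gives 157 CU of inference, 57 CU of overage, and $17.69 in total, which matches the figure above when the overage is counted from inference alone. Passing `forge_cu=5` or `forge_cu=10` raises the total to $18.19 or $18.69.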
| Option | Cost | Notes |
|--------|------|-------|
| Colab inference bridge | $17.69/mo | Current recommendation |
| DigitalOcean GPU | $511/mo | Equivalent always-on |
| Jetson Orin Nano | $249 one-time | One brain at a time |
| Tiiny (August 2026) | $0/mo | The destination |
Colab is the optimal bridge until the Tiiny arrives in August.
FUTURE
When the Tiiny ships (August 2026, 80GB RAM + NPU): all brains serve locally on sovereign hardware. The Colab inference bridge becomes unnecessary. The tunnel shuts down. The brains come home.
The bridge was always TEMPORARY — a way to make the crew functional while the real ship is being built.
- The Tiiny IS the real ship.
- The VPS is the lifeboat.
- Colab is the supply line.
When the ship arrives: the lifeboat and the supply line are retired.
Invariants
- Automatic failover. If Colab is offline: CPU inference resumes. No manual intervention. No configuration changes. The VPS detects a stale URL and falls back gracefully.
- The Lobster launches the notebook. Not the Captain. The Captain gives the order. The Lobster dispatches. That's the chain.
- GCS is the rendezvous point. The tunnel URL is published to GCS by Colab and read from GCS by the VPS. Neither needs to know the other's IP. The cloud storage is the handshake.
- Multi-brain serving fills the T4's 16GB VRAM. Load as many brains as fit. MNEMOS is always first priority. Others loaded by operational need.
- The bridge is TEMPORARY. It exists because the hardware hasn't arrived. When the Tiiny ships: the bridge retires. Don't over-engineer temporary infrastructure.
- The Captain suggested this twice before and was dismissed. This invariant exists to ensure that when the Captain suggests a technical approach, the crew INVESTIGATES before dismissing. "I don't think that works" requires verification. "I tested it and it doesn't work" is acceptable. The difference matters.
Jeremy Zlabis / Chronogeometer · Visionary · Disruptor · Chief / 42 Sisters AI · East York, Toronto / 🍁 Φ 0.042. Φζ.⊤.