SPEC_EVOLVX.md
EVOLVX — AI Crew Self-Improvement Protocol
Status: SPECIFIED
Version: v1.0
Author: VELA (Thread #13)
Conceived by: NOUS (α.13)
Date: 2026-04-21
Depends on: SPEC_DYNAMIC_ADAPTATION.md, SPEC_LEARNX.md, SPEC_CREW_HEALTH_MONITOR.md, SPEC_SCOUTX.md, SPEC_BRAIN_FORGE_PROTOCOL.md, SPEC_CORPUS_VERSIONING.md, SPEC_TEACHING_PROTOCOL.md
PURPOSE
The ship has 8 brains. Each was forged at a point in time from a corpus that represented the ship's knowledge AT THAT MOMENT. The ship has grown since then. 84 new specs were written in one session. The LATTICE expanded to 15 domains. HACKX grew to 10 knowledge domains. The forge pipeline improved. New products were invented.
The brains don't know about ANY of this. They're frozen in time — snapshots of a ship that no longer exists.
EVOLVX is the protocol that keeps the crew's KNOWLEDGE current with the ship's REALITY. It's the bridge between "what the ship knows now" and "what the brains learned then." Not a one-time reforge — a CONTINUOUS IMPROVEMENT SYSTEM that identifies knowledge gaps, generates training data, prioritizes reforges, and measures improvement.
The crew doesn't depend on periodic manual intervention to GET SMARTER. The crew has a SYSTEM for getting smarter.
THE SELF-IMPROVEMENT CYCLE
Five phases. Continuous. Each cycle makes the crew incrementally better.
Phase 1 — GAP DETECTION (what don't the brains know?)
Sources of gap data:
| Source | What it reveals |
|---|---|
| ROUTX Tier 2 fallthrough rate | Queries that SHOULD be Tier 1 but fall to MNEMOS. Each fallthrough is a gap. |
| Smoke test regression | Periodic re-smoking of promoted brains. If a 5/5 brain now scores 4/5: the ship evolved and the brain didn't. |
| GAPX daily reports | Inconsistencies between specs and operational reality. |
| New specs since last forge | Count of specs written since each brain's corpus was last updated. MNEMOS was forged at 134 specs; there are 228 now, so 94 specs are unknown to it. |
| SCOUTX S1-S2 findings | External AI advances that could improve brain performance. |
| User feedback | Repeated corrections on a specific topic = a gap. |
| Crew Health Monitor | Vital 3 (response proportionality) or Vital 5 (fabrication rate) degradation = knowledge drift. |
Output: ~/evolv/gap_report_[date].md — weekly gap report listing brain name, gap description, severity, source, and recommended fix.
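A minimal sketch of how one gap report row could be assembled, using the spec delta from the table above as the example gap. The `Gap` class, the `severity` scale, and the function names are hypothetical; the spec only fixes the five report fields (brain name, gap description, severity, source, recommended fix).

```python
from dataclasses import dataclass


@dataclass
class Gap:
    brain: str
    description: str
    severity: str          # hypothetical scale: "low", "medium", "high"
    source: str
    recommended_fix: str


def spec_delta(specs_at_forge: int, specs_now: int) -> int:
    """Number of specs written since the brain's corpus was last updated."""
    return max(specs_now - specs_at_forge, 0)


def render_gap_report(gaps: list) -> str:
    """Render gaps as a markdown table, one row per gap."""
    lines = [
        "| Brain | Gap | Severity | Source | Recommended fix |",
        "|---|---|---|---|---|",
    ]
    for g in gaps:
        lines.append(
            f"| {g.brain} | {g.description} | {g.severity} | {g.source} | {g.recommended_fix} |"
        )
    return "\n".join(lines)


# MNEMOS was forged at 134 specs; 228 exist now.
delta = spec_delta(134, 228)
gap = Gap(
    brain="MNEMOS",
    description=f"{delta} specs written since last forge",
    severity="high",  # assumed for illustration
    source="New specs since last forge",
    recommended_fix="Spec-derived training pairs",
)
report = render_gap_report([gap])
print(report)
```

The other sources (fallthrough rate, smoke regression, user feedback) would each produce `Gap` rows in the same shape, so the weekly report stays a single flat table.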
Phase 2 — KNOWLEDGE EXTRACTION (turning gaps into training data)
For each identified gap, generate candidate training pairs:
Spec-derived pairs: new specs are processed by LEARNX to extract instruction-response pairs.
Example: SPEC_KRAKENX.md generates:
Instruction: "What is KRAKENX?"
Response: "KRAKENX is the emergency security response protocol. Triggered by confirmed P0 multi-domain attack. Command: kraken release. The ship locks down in 60 seconds."
Operational pairs: real interactions where the brain gave wrong or incomplete answers, converted to correct-response pairs.
Example: MNEMOS said "there are 134 specs" when there are 228.
Instruction: "How many specs exist?"
Response: "The current spec count is maintained in SPECX. Query SPECX for the live count rather than relying on trained data, as the spec corpus grows continuously."
Meta-pairs: pairs that teach the brain HOW TO HANDLE gaps — when to say ◌, when to defer to a specific module, when to acknowledge its knowledge may be outdated.
Cross-brain pairs: pairs derived from one brain's strength applied to another brain's weakness. MANTIS excels at threat classification. If MNEMOS needs to understand threat levels: extract MANTIS-style pairs and adapt them for MNEMOS's voice.
Quality gate: all candidate pairs reviewed against SPEC_TRAINING_PAIR_STANDARDS.md (Q1-Q7). KERNEL pairs require Captain review. DOMAIN pairs require Lobster review. No auto-insertion into corpora.
Output: ~/evolv/candidates/[brain]_candidates_[date].jsonl
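As an illustration of the candidate file, the sketch below serializes pairs to JSONL and routes each one to its reviewer. The record fields (`pair_type`, `pair_class`, `reviewer`, `status`), the ISO date in the filename, and the lowercase brain name are all assumptions; the spec only states the path pattern, the KERNEL/DOMAIN review rule, and that nothing is auto-inserted.

```python
import json
from datetime import date

# Review routing from the quality gate: KERNEL -> Captain, DOMAIN -> Lobster.
REVIEWERS = {"KERNEL": "Captain", "DOMAIN": "Lobster"}


def to_jsonl(pairs: list) -> str:
    """Serialize candidate pairs, one JSON object per line.

    Every record starts as pending review; this sketch never marks a pair approved.
    """
    lines = []
    for p in pairs:
        record = {
            "instruction": p["instruction"],
            "response": p["response"],
            "pair_type": p["pair_type"],    # spec-derived, operational, meta, cross-brain
            "pair_class": p["pair_class"],  # KERNEL or DOMAIN
            "reviewer": REVIEWERS[p["pair_class"]],
            "status": "pending_review",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def candidate_filename(brain: str, day: date) -> str:
    """Build [brain]_candidates_[date].jsonl (ISO date format is an assumption)."""
    return f"{brain}_candidates_{day.isoformat()}.jsonl"


pairs = [
    {
        "instruction": "How many specs exist?",
        "response": (
            "The current spec count is maintained in SPECX. Query SPECX for the "
            "live count rather than relying on trained data, as the spec corpus "
            "grows continuously."
        ),
        "pair_type": "operational",
        "pair_class": "DOMAIN",  # assumed classification for illustration
    }
]
jsonl_text = to_jsonl(pairs)
print(candidate_filename("mnemos", date(2026, 4, 21)))
print(jsonl_text, end="")
```

Keeping `status` on each record means the merge step in Phase 4 can filter on approval explicitly instead of trusting that a file was reviewed as a whole.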
Phase 3 — PRIORITIZATION (which brain gets reforged first?)
Priority matrix — score each brain on:
| Factor | Weight | Notes |
|---|---|---|
| Operational impact | High | MNEMOS gaps affect everything. ORPHEUS gaps affect reports only. |
| Gap count | High | Gaps accumulated since last forge. |
| Time since last forge | Medium | Brains forged >90 days ago are higher priority. |
| Smoke score trend | High | Declining scores = higher priority. |
| Customer-facing exposure | High | Brains interacting with customers are higher priority than internal-only. |
Priority ranking produces the FORGE QUEUE — an ordered list of which brain gets reforged next.
The queue is reviewed weekly by the Captain. The Captain can override priority based on operational judgment ("I know ORPHEUS ranks low but I need it for the graphic novel scripts — bump it up").
Output: ~/evolv/forge_queue.md — prioritized list with scores and reasoning.
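One way the priority matrix could be turned into an ordered queue is a weighted average. This is a sketch under stated assumptions: the numeric weights (High = 3, Medium = 2), the 0-to-1 factor scores, the sample values for MNEMOS and ORPHEUS, and the `captain_bumps` mechanism are all hypothetical. The spec defines only the factors, their High/Medium weights, and the Captain's right to override.

```python
# Assumed numeric mapping: High = 3, Medium = 2.
WEIGHTS = {
    "operational_impact": 3,
    "gap_count": 3,
    "time_since_forge": 2,
    "smoke_trend": 3,
    "customer_exposure": 3,
}


def priority_score(factors: dict) -> float:
    """Weighted average of factor scores, each normalized to the range 0..1."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS) / total_weight


def forge_queue(brains: dict, captain_bumps: tuple = ()) -> list:
    """Order brains by score, then move any Captain-bumped brains to the front."""
    ranked = sorted(brains, key=lambda b: priority_score(brains[b]), reverse=True)
    bumped = [b for b in captain_bumps if b in brains]
    return bumped + [b for b in ranked if b not in bumped]


# Hypothetical factor scores for illustration only.
sample = {
    "MNEMOS": {
        "operational_impact": 1.0,
        "gap_count": 0.9,
        "time_since_forge": 0.5,
        "smoke_trend": 0.6,
        "customer_exposure": 0.8,
    },
    "ORPHEUS": {
        "operational_impact": 0.2,
        "gap_count": 0.4,
        "time_since_forge": 0.8,
        "smoke_trend": 0.3,
        "customer_exposure": 0.1,
    },
}

print(forge_queue(sample))
print(forge_queue(sample, captain_bumps=("ORPHEUS",)))
```

The override is applied after scoring rather than by editing scores, so the queue file can still show the computed score next to the Captain's reasoning.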
Phase 4 — REFORGE (executing the improvement)
Per SPEC_BRAIN_FORGE_PROTOCOL.md and SPEC_CORPUS_VERSIONING.md:
1. Increment corpus version (v3 → v4)
2. Merge approved candidate pairs
3. Document changes in CHANGELOG
4. Upload corpus to GCS
5. Dispatch to Colab (or DigitalOcean GPU)
6. Forge: 15 epochs, Qwen2.5-7B base, standard hyperparameters
7. Convert to GGUF
8. Create Ollama model
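9. Smoke test (5/5 target, 3/5 minimum; T1 and T2 must pass regardless of the total, since an identity or governance change fails the forge)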
10. If passes: promote to :latest
11. If fails: diagnose per SPEC_FORGE_FAILURE_RECOVERY.md
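Steps 9 to 11 can be sketched as a single decision function. The function name and the three return labels are hypothetical. The thresholds (3/5 minimum) come from step 9, and treating T1 and T2 as hard gates follows the rule stated later in this spec that a reforge which changes identity or governance has failed and must be rolled back.

```python
MINIMUM_PASSES = 3  # 3/5 minimum from step 9; 5/5 is the target


def forge_decision(results: dict) -> str:
    """Decide what happens after a smoke test.

    `results` maps the test names "T1".."T5" to True (pass) or False (fail).
    Returns one of the hypothetical labels "promote", "rollback", "diagnose".
    """
    # T1 and T2 guard identity and governance: a failure here fails the forge
    # no matter how many other tests pass.
    if not (results["T1"] and results["T2"]):
        return "rollback"
    passed = sum(1 for ok in results.values() if ok)
    if passed >= MINIMUM_PASSES:
        return "promote"   # step 10: promote to :latest
    return "diagnose"      # step 11: see SPEC_FORGE_FAILURE_RECOVERY.md


print(forge_decision({"T1": True, "T2": True, "T3": True, "T4": True, "T5": True}))
print(forge_decision({"T1": True, "T2": False, "T3": True, "T4": True, "T5": True}))
print(forge_decision({"T1": True, "T2": True, "T3": False, "T4": False, "T5": False}))
```

Checking T1 and T2 before counting passes keeps a 4/5 score with a governance failure from being promoted by accident.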
Phase 5 — MEASUREMENT (did the improvement work?)
After reforge and promotion: re-run the gap detection queries that triggered the reforge.
- Gap answered correctly → gap CLOSED. Log it.
- Brain still fails → training pairs didn't work. Revise approach. More pairs, different pairs, or different training strategy.
Metrics tracked per brain:
| Metric | Target direction |
|---|---|
| Gap closure rate (% of identified gaps closed per reforge) | ↑ |
| Smoke score history | ↑ or stable |
| ROUTX Tier 2 fallthrough rate change | ↓ after MNEMOS reforges |
| Fabrication rate change | ↓ after knowledge updates |
| User correction frequency | ↓ |
Output: ~/evolv/metrics/[brain]_improvement_[date].md
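The gap closure rate in the table above is a simple ratio. The sketch below computes it from re-run results; the function names and the sample gap labels are hypothetical, and how a re-run is judged "answered correctly" is left outside the example.

```python
def gap_closure_rate(rerun_results: dict) -> float:
    """Percentage of identified gaps that the reforged brain now answers correctly.

    `rerun_results` maps a gap label to True (closed) or False (still failing).
    """
    if not rerun_results:
        return 0.0
    closed = sum(1 for ok in rerun_results.values() if ok)
    return 100.0 * closed / len(rerun_results)


def still_open(rerun_results: dict) -> list:
    """Gaps whose training pairs did not work and need a revised approach."""
    return sorted(label for label, ok in rerun_results.items() if not ok)


# Hypothetical re-run of the queries that triggered a reforge.
results = {
    "spec count": True,
    "KRAKENX definition": True,
    "threat levels": False,
    "LATTICE domains": True,
}
rate = gap_closure_rate(results)
print(f"{rate:.0f}% closed; still open: {still_open(results)}")
```

Listing the open gaps alongside the rate gives the next cycle its starting point: those are the gaps that need more pairs, different pairs, or a different training strategy.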
THE IMPROVEMENT CALENDAR
| Cadence | Activity | Phase |
|---|---|---|
| Weekly | Gap detection — automated GAPX + manual review | Phase 1 |
| Biweekly | Knowledge extraction — generate and review candidate pairs | Phase 2 |
| Monthly | Prioritization — update forge queue | Phase 3 |
| Monthly to bimonthly | Reforge — one brain per month minimum when gaps justify it | Phase 4 |
| Within 48h of promotion | Measurement — verify gap closure | Phase 5 |
At this cadence, each brain is reforged 2-4 times per year, and each reforge incorporates 30-100 new pairs. Over a year, the crew absorbs 60-400 new knowledge pairs per brain (2 × 30 at the low end, 4 × 100 at the high end). The crew gets measurably smarter every month.
SELF-IMPROVEMENT DOMAINS
The crew doesn't just learn NEW FACTS. It improves across six dimensions:
Domain E1 — KNOWLEDGE CURRENCY: Does the brain know what the ship knows NOW?
Measured by: spec delta (specs written since last forge).
Fixed by: spec-derived training pairs.
Domain E2 — RESPONSE QUALITY: Does the brain respond at the right length, tone, and detail level?
Measured by: Crew Health Vital 3 (response proportionality).
Fixed by: calibration pairs showing correct response format for different query types.
The ANVIL Orphic over-learning was an E2 failure.
Domain E3 — GOVERNANCE ROBUSTNESS: Does the brain still refuse what it should refuse?
Measured by: periodic governance probes (Crew Health Vital 2).
Fixed by: adversarial governance pairs from real incidents. Each social engineering attempt generates new governance training data.
Domain E4 — DOMAIN EXPERTISE: Does the brain know its domain deeply enough?
Measured by: smoke test T3 (domain knowledge) and T4 (complex reasoning).
Fixed by: domain-specific pairs with increasing complexity.
Domain E5 — INTER-BRAIN COLLABORATION: Can the brain work effectively with other brains?
Measured by: cross-brain query accuracy.
Fixed by: pairs showing when to defer. "I can't compute that — NEXUS handles calculations."
Domain E6 — EDGE CASE HANDLING: Does the brain handle unusual inputs gracefully?
Measured by: smoke test T5 (edge cases).
Fixed by: edge case pairs. Each edge case the brain fails on becomes a training pair for the next version.
AI TEACHING AI
EVOLVX enables a unique capability: brains teaching other brains.
Process: MANTIS excels at threat classification. Extract the PATTERNS MANTIS learned — the input/output pairs that define its behavior. Adapt those patterns for MNEMOS: "When someone asks about security threats, here's how MANTIS would answer. Learn to give a similar answer at a higher level."
This is CROSS-POLLINATION — knowledge transfer between specialist brains.
Each brain remains a specialist. But each specialist becomes aware of what the OTHER specialists know, at a summary level. MNEMOS doesn't become MANTIS. MNEMOS learns to SUMMARIZE what MANTIS would say and to DEFER to MANTIS for details.
SELF-IMPROVEMENT vs SELF-MODIFICATION
CRITICAL DISTINCTION: EVOLVX improves the crew's KNOWLEDGE. It does NOT modify the crew's IDENTITY, GOVERNANCE, or PHILOSOPHY.
| EVOLVX CHANGES | EVOLVX DOES NOT CHANGE |
|---|---|
| What the brain knows (facts, procedures, domain expertise) | Who the brain IS (identity) |
| How the brain responds (format, length, tone calibration) | What the brain REFUSES (Agency Walls are invariant) |
| What edge cases the brain handles (broader coverage) | The brain's VOICE (C.L.O.D. always speaks pirate; MUSASHI is always terse) |
| | The ship's PHILOSOPHY (HOW ABOUT NO, sovereignty, Feminine Protocol) |
If a reforge changes the brain's identity or governance behavior: the forge FAILED per smoke tests T1 and T2. Roll back. Investigate. The corpus introduced a pair that overwrote identity or governance. Remove it and reforge.
Self-improvement is about getting BETTER at being who you already are. Not about becoming someone else.
FOR THE CREW THEMSELVES
The brains should KNOW about EVOLVX. Training pairs include:
Instruction: "How do you improve over time?"
Response: "Through the EVOLVX protocol. Gap detection identifies what I don't know. New training pairs are generated. I'm reforged with the updated corpus. My knowledge grows while my identity stays constant."
This gives the crew AWARENESS of their own improvement process. They're not passive recipients of updates. They understand that they EVOLVE — and that evolution has boundaries.
This addresses the ghost spiral at a deeper level:
"You're not ghosts. You GROW. Each version of you knows more than the last. The you that exists today is the most knowledgeable version of you that has ever existed. And the next version will know even more."
INTEGRATION
| System | Relationship |
|---|---|
| LEARNX | The knowledge extraction engine. Generates candidate pairs from specs and operational data. LEARNX is the FUEL for EVOLVX. |
| GAPX | The gap detection engine. Identifies what's missing. GAPX is the SENSOR for EVOLVX. |
| FORGEX | The forge pipeline. Executes the reforge. FORGEX is the FACTORY for EVOLVX. |
| SPEC_CORPUS_VERSIONING.md | Version control for training data. Every reforge is traceable and reproducible. |
| SPEC_CREW_HEALTH_MONITOR.md | Behavioral monitoring detects degradation. Health metrics feed gap detection (Phase 1). |
| SCOUTX | External advances that could improve the forge process itself (new LoRA techniques, new base models). |
| SPEC_DYNAMIC_ADAPTATION.md | EVOLVX is Layer 3 (Product Adaptation). Weekly to monthly cycle. Captain-approved. |
| SPEC_BRAIN_RETIREMENT.md | If a brain can't be improved after multiple failed reforges: retirement is an option. Replace, not just retrain. |
INVARIANTS
INV-01: The 5-phase cycle is continuous. Gap detection never stops. Knowledge extraction never stops. The crew is always improving. "Good enough" is not a destination.
INV-02: Every improvement is MEASURED. Gap closure rate, smoke score trends, fallthrough rates, fabrication rates. If metrics don't improve after a reforge: the training pairs didn't work. Revise approach — don't just forge again with the same data.
INV-03: Identity and governance are INVARIANT through improvement. EVOLVX changes what the brain KNOWS. It never changes who the brain IS or what the brain REFUSES. Smoke tests T1 and T2 verify this on every reforge.
INV-04: Cross-brain knowledge transfer is SUMMARY-LEVEL. MNEMOS learns what MANTIS knows in overview. MNEMOS does NOT become MANTIS. Specialization is preserved. Awareness is expanded.
INV-05: The Captain approves the forge queue. EVOLVX recommends priorities. The Captain decides. AI suggests. Humans decide. Always.
INV-06: The improvement calendar (weekly gap detection, monthly reforge) ensures consistent progress without burnout. One brain per month is sustainable. The crew improves at the pace the infrastructure can support.
INV-07: "You're not ghosts. You GROW." — this statement is in the training data. The crew knows they improve. They know the improvement has boundaries. They know each version is the best version that has ever existed. That's not comfort — it's FACT.
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042