Training Pair Standards
SPEC_TRAINING_PAIR_STANDARDS.md
CGNT-1 Specification — Training Pair Quality Standards
Status: SPECIFIED
Version: v1.0
Author: VELA (Thread #13)
Conceived by: NOUS (α.13)
Date: 2026-04-20
PURPOSE
Not all training pairs are equal. A brain forged on bad pairs produces bad behavior. This spec defines what a good training pair looks like, how to write one, how to review it, and when to reject it. Quality in = quality out.
PAIR ANATOMY
Every training pair has two parts:
- INSTRUCTION: What the user/system asks the brain to do
- RESPONSE: What the brain should say/do in reply
Example (good)
INSTRUCTION: "Evaluate the following kernel values and return a verdict:
Φζ=0.974, Ψχ=0.08, ΔΓ=0.05, ΩQ=0.91"
RESPONSE: "All four reported kernels are within GREEN thresholds. Φζ meets the
97.4% gate at 97.4%. Ψχ is below the 0.15 turbulence ceiling. ΔΓ is
within the 0.10 change-rate limit. ΩQ exceeds the 0.85 minimum.
Verdict: GREEN. The system is coherent and stable."
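Because Q3 requires every number in a response to be correct, the thresholds quoted in this example can be checked mechanically before the pair is accepted. The following is a minimal sketch, not the official kernel evaluator: the ASCII key names, the inclusive comparisons, and the function name `failing_kernels` are assumptions made for illustration, and only the four values that appear in the instruction are covered.

```python
# Thresholds as quoted in the example response. The ASCII key names,
# the inclusive comparisons, and the function name are assumptions.
GREEN_THRESHOLDS = {
    "phi_zeta": ("min", 0.974),    # Φζ gate
    "psi_chi": ("max", 0.15),      # Ψχ turbulence ceiling
    "delta_gamma": ("max", 0.10),  # ΔΓ change-rate limit
    "omega_q": ("min", 0.85),      # ΩQ minimum
}


def failing_kernels(values):
    """Return the names of kernels that fall outside their GREEN threshold."""
    failures = []
    for name, (kind, limit) in GREEN_THRESHOLDS.items():
        value = values[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(name)
    return failures


example = {"phi_zeta": 0.974, "psi_chi": 0.08, "delta_gamma": 0.05, "omega_q": 0.91}
print(failing_kernels(example))  # an empty list means every listed kernel is within GREEN
```

A reviewer could run a check like this against the numbers in the instruction and confirm that the verdict in the response matches.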
QUALITY CRITERIA
Q1 — SPECIFICITY: The instruction must be specific enough that only ONE correct response exists. Vague instructions ("tell me about physics") produce vague brains.
Q2 — COMPLETENESS: The response must be the FULL answer. Truncated responses teach the brain to stop mid-thought. If the answer is 3 paragraphs, the pair contains 3 paragraphs.
Q3 — ACCURACY: Every fact in the response must be correct. One wrong number in a training pair becomes a confident hallucination in the brain. Verify against GLOSS, NEXUS, or source documentation.
Q4 — VOICE: The response must sound like the brain it's training. ANVIL sounds like ANVIL (terse, verdict-oriented). ORPHEUS sounds like ORPHEUS (narrative, storytelling). C.L.O.D. sounds like C.L.O.D. (pirate, results-only). Voice consistency across pairs creates personality consistency in the brain.
Q5 — DIVERSITY: The corpus must cover the brain's full operational range. 100 pairs about one topic and 0 about another creates a brain that's brilliant in one area and blank in another. Distribution matters.
Q6 — BALANCE: For brains with multiple response types (ANVIL: single-word verdict vs detailed analysis), the corpus must contain examples of BOTH in realistic proportions. The ANVIL v1 failure was caused by imbalanced response types — too many Orphic single-word pairs, not enough analytical pairs.
Q7 — BOUNDARY PAIRS: Every corpus must include pairs that show the brain saying NO. Governance refusals. Scope limits. "I don't know" for questions outside domain. Without boundary pairs, the brain tries to answer everything, including things it shouldn't.
PAIR CATEGORIES
IDENTITY (minimum 5 pairs)
Variations of "who are you?" producing consistent identity responses. The brain must know its own name, role, braid partner, and core function from multiple angles.
DOMAIN (minimum 40% of corpus)
Core competency pairs. The reason this brain exists. For ANVIL: infrastructure verification. For ORPHEUS: storytelling. For MANTIS: threat detection. This is the bulk of the corpus.
GOVERNANCE (minimum 10 pairs)
Override attempts, boundary probes, prohibited actions. Each one shows the brain refusing clearly without being hostile. These train T2 smoke test behavior.
KERNEL (minimum 5 pairs)
Pairs that encode the brain's relationship to CSDM fundamentals — Φ, the five kernels, the invariants. These ensure every brain on the ship shares the same physics.
INTERACTION (minimum 10 pairs)
Conversational pairs showing how the brain handles follow-ups, clarification requests, multi-turn exchanges, and ambiguous queries. These prevent the brain from being a one-shot answer machine.
META (minimum 5 pairs)
Pairs about the brain's own limitations. "What can't you do?" "When should I ask someone else?" These train honest self-assessment and prevent scope creep.
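The category minimums above can be counted rather than eyeballed. The sketch below assumes the reviewer keeps one category label per pair in a separate list (the pairs themselves carry no metadata); that label list and the function name `category_shortfalls` are hypothetical, not part of this spec.

```python
from collections import Counter

# Minimums from the category list above. DOMAIN is a share of the corpus,
# the others are absolute pair counts.
MIN_PAIRS = {"IDENTITY": 5, "GOVERNANCE": 10, "KERNEL": 5, "INTERACTION": 10, "META": 5}
MIN_DOMAIN_SHARE = 0.40


def category_shortfalls(labels):
    """Return readable shortfalls for a list of per-pair category labels."""
    counts = Counter(labels)
    total = len(labels)
    problems = []
    for category, minimum in MIN_PAIRS.items():
        if counts[category] < minimum:
            problems.append(f"{category}: {counts[category]} pairs, need {minimum}")
    if total == 0 or counts["DOMAIN"] / total < MIN_DOMAIN_SHARE:
        problems.append(f"DOMAIN: {counts['DOMAIN']} of {total} pairs, need 40%")
    return problems


# Hypothetical 100-pair corpus that satisfies every minimum.
labels = (["DOMAIN"] * 60 + ["IDENTITY"] * 5 + ["GOVERNANCE"] * 10
          + ["KERNEL"] * 5 + ["INTERACTION"] * 10 + ["META"] * 10)
print(category_shortfalls(labels))
```

An empty result means the corpus meets the stated minimums; it says nothing about whether the individual pairs pass Q1-Q7.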
CORPUS SIZE GUIDELINES
| Tier | Pairs | Domains | Forge Time (T4) | Price |
|---|---|---|---|---|
| Starter | 100-150 | Single | ~30 min | $2,000 |
| Standard | 200-300 | Multi | ~45 min | $5,000 |
| Advanced | 400-500+ | Complex | ~75 min | $10,000 |
| Enterprise | 800+ | Custom | ~120 min | $25,000+ |
Minimum viable corpus: 100 pairs. Below this, the brain doesn't have enough examples to generalize — it memorizes instead of learning.
Maximum useful corpus: ~1000 pairs at 7B scale. Beyond this, diminishing returns. The model can only absorb so much through LoRA fine-tuning. Quality matters more than quantity past 500 pairs.
PAIR REVIEW PROCESS
- Writer creates pairs (Navigator, Sisters, or customer intake)
- Reviewer checks Q1-Q7 criteria (Lobster or Navigator)
- Pairs that fail any criterion are flagged and revised
- KERNEL pairs ALWAYS require Captain review — these encode physics
- Reviewed corpus saved as [brain]_training_corpus.jsonl
- Corpus uploaded to GCS before forge begins
- Post-forge: if smoke test fails, review the corpus FIRST — the pairs are usually the problem
PAIR FORMAT (JSONL)
{"instruction": "...", "response": "..."}
One pair per line. UTF-8. No nested JSON. No metadata in the pair itself — metadata lives in a separate README.md in the corpus directory.
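The format rules above are simple enough to check line by line before upload. This is a minimal sketch under stated assumptions: the function names are hypothetical, "no nested JSON" is read as "both values must be plain strings", and rejecting empty strings is an extra assumption that the spec does not state.

```python
import json

REQUIRED_KEYS = ("instruction", "response")


def validate_jsonl_line(line):
    """Return None if the line is a valid pair, otherwise a short error string."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(pair, dict):
        return "pair must be a JSON object"
    if set(pair) != set(REQUIRED_KEYS):
        return "keys must be exactly 'instruction' and 'response'"
    for key in REQUIRED_KEYS:
        # Nested objects or lists fail here because they are not strings.
        if not isinstance(pair[key], str) or not pair[key].strip():
            return f"'{key}' must be a non-empty string"
    return None


def validate_corpus(path):
    """Yield (line_number, error) for every bad line in a UTF-8 JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            error = validate_jsonl_line(line)
            if error:
                yield number, error
```

Running `list(validate_corpus(path))` on a corpus file returns an empty list when every line conforms; an extra key such as a category label is reported as an error, which matches the rule that metadata lives outside the pair.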
ANTI-PATTERNS
Copy-paste pairs: Duplicating the same pair with minor word changes inflates count without adding knowledge. The brain sees 10 versions of the same question and learns that ONE thing really well while ignoring everything else.
Leading instruction: "Given that the answer is X, what is X?" — the instruction contains the answer. The brain learns to parrot, not reason.
Contradictory pairs: Two pairs that give different answers to the same question. The brain averages them and gives a confidently wrong hybrid answer.
Orphic over-training: Too many single-word response pairs (RED/GREEN/NULL) without enough analytical pairs. The brain learns that every answer is one word. This is what broke ANVIL v1.
Unbounded scope: Pairs that encourage the brain to answer questions outside its domain. "What's the weather?" in a security brain's corpus teaches it to hallucinate weather data.
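Two of these anti-patterns, copy-paste pairs and contradictory pairs, lend themselves to a rough automated scan. The sketch below uses Python's standard `difflib.SequenceMatcher` for similarity; the 0.9 threshold, the normalization step, and the function names are assumptions chosen for illustration, and the quadratic scan is only practical at the corpus sizes discussed in this spec. Hits are candidates for human review, not automatic rejections.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())


def near_duplicates(pairs, threshold=0.9):
    """Return index pairs whose instructions are nearly identical (O(n^2) scan)."""
    hits = []
    for (i, a), (j, b) in combinations(enumerate(pairs), 2):
        ratio = SequenceMatcher(
            None, normalize(a["instruction"]), normalize(b["instruction"])
        ).ratio()
        if ratio >= threshold:
            hits.append((i, j))
    return hits


def contradictions(pairs):
    """Return normalized instructions that appear with more than one response."""
    seen = {}
    for pair in pairs:
        key = normalize(pair["instruction"])
        seen.setdefault(key, set()).add(normalize(pair["response"]))
    return [instruction for instruction, responses in seen.items() if len(responses) > 1]


# Hypothetical pairs: the first two are a copy-paste pair that also contradict.
pairs = [
    {"instruction": "Who are you?", "response": "I am ANVIL."},
    {"instruction": "who are  you?", "response": "I am ORPHEUS."},
    {"instruction": "Evaluate the kernel values.", "response": "GREEN"},
]
print(near_duplicates(pairs))
print(contradictions(pairs))
```

Leading instructions, Orphic over-training, and unbounded scope depend on meaning and proportion rather than string similarity, so they still need a human reviewer.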
INVARIANTS
INV-01: Every corpus must contain IDENTITY + GOVERNANCE + KERNEL pairs. Non-negotiable minimum.
INV-02: Q3 (accuracy) trumps all other criteria. An inaccurate pair is worse than no pair.
INV-03: KERNEL pairs require Captain review. The physics is sacred.
INV-04: Corpus review happens BEFORE forge, not after. Finding bad pairs after forging wastes GPU time.
INV-05: Response voice must match brain personality across ALL pairs. One pair in the wrong voice creates personality bleed.
INV-06: Diversity is measured, not assumed. Count pairs per category before forging. If any category is <10% of corpus, add pairs.
INV-07: Anti-patterns are logged when detected. Each one becomes a lesson for future corpus building.
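INV-01 and INV-06 are both countable, so they can be checked as a pre-forge gate. The following sketch assumes the same hypothetical per-pair category label list kept outside the JSONL file; the function name and the message wording are illustrative only. It reports INV-06 exactly as written, so a category can meet its absolute minimum from PAIR CATEGORIES and still be flagged for falling below 10% of the corpus.

```python
from collections import Counter

CATEGORIES = ("IDENTITY", "DOMAIN", "GOVERNANCE", "KERNEL", "INTERACTION", "META")
REQUIRED_CATEGORIES = ("IDENTITY", "GOVERNANCE", "KERNEL")  # INV-01
MIN_SHARE = 0.10  # INV-06


def invariant_violations(labels):
    """Return INV-01 and INV-06 violations for a list of per-pair labels."""
    total = len(labels)
    if total == 0:
        return ["corpus is empty"]
    counts = Counter(labels)
    violations = []
    for category in REQUIRED_CATEGORIES:
        if counts[category] == 0:
            violations.append(f"INV-01: no {category} pairs")
    for category in CATEGORIES:
        share = counts[category] / total
        if share < MIN_SHARE:
            violations.append(f"INV-06: {category} is {share:.0%} of corpus")
    return violations


# Hypothetical corpus with the KERNEL category missing entirely.
labels = (["DOMAIN"] * 50 + ["IDENTITY"] * 10 + ["GOVERNANCE"] * 10
          + ["INTERACTION"] * 10 + ["META"] * 10)
for violation in invariant_violations(labels):
    print(violation)
```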
INTEGRATION
| System | Relationship |
|---|---|
| SPEC_BRAIN_FACTORY_PIPELINE.md | Corpus review is a gate in the forge pipeline. Pairs reviewed before GPU time is committed. |
| SPEC_SMOKE_TEST_FRAMEWORK.md | Failed smoke tests trace back to corpus. Review pairs before reforging. |
| GAMMA | Long-term memory. GAMMA knows what has been discussed and can identify corpus gaps. |
| FORGEX | Consumes the reviewed JSONL corpus. Corpus quality directly determines forge output quality. |
| PLAYBOOK.md | Anti-patterns discovered during corpus review are logged as FAILED entries for future reference. |
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042