Corpus Versioning

SPEC_CORPUS_VERSIONING.md · 2026-04-20

CGNT-1 Specification — Training Corpus Versioning Protocol

Status: SPECIFIED

Version: v1.0

Author: VELA (Thread #13)

Conceived by: NOUS (α.13)

Date: 2026-04-20

Born from: ANVIL v1 failure (3/5, over-learned Orphic verdicts) requiring corpus expansion and reforge


PURPOSE

When a brain fails smoke and needs reforge, the corpus gets expanded. When a customer requests a brain update round, new pairs are added. When LEARNX extracts new pairs from conversations, they need to merge with existing corpora. The corpus changes over time. Without versioning, nobody knows what changed, when, why, or what the brain was trained on.

This spec ensures every corpus has a version history, every change is tracked, and every forge can be reproduced from its exact training data.


CORPUS FILE STRUCTURE

Every brain's corpus lives in its own directory:


~/corpora/[brain_name]/
  [brain]_corpus_v[N].jsonl    — training pairs (versioned, immutable)
  README.md                    — corpus metadata (pair count, categories, version history, author)
  CHANGELOG.md                 — what changed in each version
  smoke_results/               — smoke test outputs per version

Example for ANVIL:


~/corpora/anvil/
  anvil_corpus_v1.jsonl        — 209 pairs (original, 3/5 smoke)
  anvil_corpus_v2.jsonl        — 229 pairs (added 20 mixed-response pairs)
  README.md
  CHANGELOG.md
  smoke_results/
    v1_smoke.txt               — 3/5
    v2_smoke.txt               — pending

VERSIONING RULES

Rule 1 — Immutable versions

Once a corpus version is forged, its JSONL file is never modified. v1 stays as it was when v1 was forged. If pairs need to change, create v2. This ensures reproducibility — any forge can be re-run from its exact corpus.

Rule 2 — Monotonic version numbers

v1 → v2 → v3. Never v1.1 or v1a. Simple integers. The number tells you where a version sits in the revision history: vN is the original corpus after N-1 revisions.
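
Rule 2 can be checked mechanically. The following is a minimal sketch, not the pipeline's actual validator: the regular expression, the function names, and the requirement that versions form a gap-free sequence starting at 1 are assumptions made for illustration.

```python
import re

# Accepts names such as "anvil_corpus_v2.jsonl"; rejects "v1.1", "v1a", and "v0".
# The allowed characters for the brain name are an assumption.
CORPUS_NAME = re.compile(r"^(?P<brain>[a-z0-9_]+)_corpus_v(?P<n>[1-9][0-9]*)\.jsonl$")


def parse_version(filename: str) -> int:
    """Return the integer version encoded in a corpus filename."""
    match = CORPUS_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a valid corpus version filename: {filename!r}")
    return int(match.group("n"))


def next_version(filenames: list[str]) -> int:
    """Return the next version number, assuming versions run 1..N with no gaps."""
    versions = sorted(parse_version(name) for name in filenames)
    if versions != list(range(1, len(versions) + 1)):
        raise ValueError(f"version sequence has gaps or duplicates: {versions}")
    return len(versions) + 1
```

For the ANVIL directory, `next_version(["anvil_corpus_v1.jsonl", "anvil_corpus_v2.jsonl"])` returns 3, while a name such as `anvil_corpus_v1.1.jsonl` raises `ValueError`.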

Rule 3 — Every version has a changelog entry

What pairs were added? What pairs were removed? What pairs were modified? Why? What smoke test failure motivated this change? The CHANGELOG answers "why does v2 exist?"

Rule 4 — GCS mirror

Every corpus version is uploaded to GCS (colab_jobs/[brain]_forge_input/) before forge begins. The GCS copy is the forge input. The local copy is the working draft. They must match at forge time.
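
One way to verify that the two copies match is to compare content hashes. The sketch below assumes the GCS object has already been downloaded to a local path by whatever tooling the forge pipeline uses; the function names and the choice of SHA-256 are illustrative, not part of this spec.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_mirror_matches(local: Path, mirror_copy: Path) -> str:
    """Raise if the local corpus and the downloaded GCS copy differ.

    `mirror_copy` is assumed to be a fresh download of the GCS object.
    Returns the shared hash so it can be recorded alongside the forge.
    """
    local_hash = sha256_of(local)
    mirror_hash = sha256_of(mirror_copy)
    if local_hash != mirror_hash:
        raise RuntimeError(
            f"GCS mirror mismatch: local={local_hash} mirror={mirror_hash}"
        )
    return local_hash
```

Running this check immediately before the forge starts, and aborting on mismatch, would enforce the "must match at forge time" requirement.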


CHANGELOG FORMAT

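Each version entry contains: a heading with the version number and date, followed by Author, Pairs, Motivation, Changes (for every version after v1), Categories, Forge, and Smoke.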

Example CHANGELOG.md for ANVIL:


## v1 — 2026-04-20
Author: VELA #13
Pairs: 209
Motivation: initial forge
Categories: Identity 8, Domain 95, Governance 12, Kernel 6, Interaction 15, Meta 5, Orphic verdicts 68
Forge: 15 epochs, loss 0.5290
Smoke: 3/5 — T4 and T5 FAIL. Over-learned Orphic single-word verdicts. NOT PROMOTED.

## v2 — 2026-04-20
Author: VELA #13
Pairs: 229 (+20)
Motivation: fix Orphic over-learning from v1
Changes: added 20 mixed-response pairs showing when to give detailed analysis vs single-word Orphic tokens
Categories added: Analytical 12, Mixed-response 8
Forge: pending
Smoke: pending
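
A cheap consistency check on an entry is that the category counts add up to the pair total. The sketch below parses the two category lines from the ANVIL example; the helper name `parse_categories` is hypothetical.

```python
def parse_categories(line: str) -> dict[str, int]:
    """Parse 'Identity 8, Domain 95' into {'Identity': 8, 'Domain': 95}."""
    counts = {}
    for item in line.split(","):
        # The count is the last space-separated token; the name may contain spaces.
        name, _, count = item.strip().rpartition(" ")
        counts[name] = int(count)
    return counts


V1_CATEGORIES = (
    "Identity 8, Domain 95, Governance 12, Kernel 6, "
    "Interaction 15, Meta 5, Orphic verdicts 68"
)
V2_CATEGORIES_ADDED = "Analytical 12, Mixed-response 8"

v1_total = sum(parse_categories(V1_CATEGORIES).values())
v2_total = v1_total + sum(parse_categories(V2_CATEGORIES_ADDED).values())
print(v1_total, v2_total)  # 209 229
```

Both totals agree with the Pairs lines in the example entries (209 for v1, 229 for v2).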

PAIR OPERATIONS

ADD: new pairs are appended to the corpus. This is the most common operation. Used when expanding domain coverage, fixing smoke test failures, or adding a new capability.

REMOVE: pairs are deleted from the corpus. Used when a pair contains inaccurate information (Q3 violation from SPEC_TRAINING_PAIR_STANDARDS), a pair teaches behavior we no longer want, or a duplicate is detected.

MODIFY: an existing pair's instruction or response is changed. Used when the response format needs updating (e.g., voice consistency), factual content needs correction, or the instruction needs clarification.

Every operation is logged in CHANGELOG with the specific pairs affected.
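
The list of affected pairs can be derived by diffing two corpus versions. This sketch assumes each JSONL record has `instruction` and `response` keys and that the instruction text identifies a pair; neither assumption is confirmed by this spec. Under that keying, a pair whose instruction was edited shows up as one removal plus one addition.

```python
import json


def load_pairs(jsonl_text: str) -> dict[str, str]:
    """Map instruction -> response for every non-blank JSONL line."""
    pairs = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            pairs[record["instruction"]] = record["response"]
    return pairs


def diff_corpora(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify instructions as added, removed, or modified between versions."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "modified": sorted(k for k in new if k in old and new[k] != old[k]),
    }
```

The three lists map directly onto the ADD, REMOVE, and MODIFY operations and could be pasted into the Changes line of a CHANGELOG entry.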


MERGE PROTOCOL (for LEARNX-generated pairs)

LEARNX extracts candidate pairs from conversations automatically. These candidates need human review before entering a corpus.

Flow:

  1. LEARNX generates ~/learnx_candidates/[brain]_candidates.jsonl
  2. Lobster reviews against SPEC_TRAINING_PAIR_STANDARDS (Q1-Q7)
  3. Approved pairs merged into next corpus version
  4. Rejected pairs logged with rejection reason
  5. KERNEL and GOVERNANCE pairs always require Captain review; Lobster approval alone is not sufficient

LEARNX candidates are NEVER auto-merged into a corpus. Every pair passes through review. The review can be Lobster-level for DOMAIN and INTERACTION pairs. KERNEL and GOVERNANCE pairs always go to Captain.
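
The routing and merge rules above can be sketched as follows. The record layout (`category`, `pair`, `decision`) and the function names are hypothetical. Categories this section does not mention are routed to Captain here as a conservative assumption.

```python
# Categories that Lobster may approve alone, per the text above.
LOBSTER_CATEGORIES = {"DOMAIN", "INTERACTION"}


def reviewer_for(category: str) -> str:
    """Return the minimum reviewer level required for a category."""
    return "Lobster" if category.upper() in LOBSTER_CATEGORIES else "Captain"


def merge_reviewed(corpus: list[dict], candidates: list[dict]):
    """Split candidates into merged, rejected, and still-pending.

    A candidate is merged only when a sufficiently senior reviewer approved it.
    Candidates without a decision are never merged.
    """
    merged = list(corpus)  # copy, so the existing version's list is untouched
    rejected, pending = [], []
    for cand in candidates:
        required = reviewer_for(cand["category"])
        allowed = {"Captain"} if required == "Captain" else {"Lobster", "Captain"}
        decision = cand.get("decision")
        if decision is None or decision["reviewer"] not in allowed:
            pending.append(cand)
        elif decision["approved"]:
            merged.append(cand["pair"])
        else:
            rejected.append({"pair": cand["pair"], "reason": decision["reason"]})
    return merged, rejected, pending
```

A KERNEL candidate approved only by Lobster stays in `pending` until Captain reviews it, and every entry in `rejected` carries the reason that step 4 requires.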


CUSTOMER BRAIN UPDATE ROUNDS

When a customer orders a brain update ($500 / 50 pairs):

  1. New pairs are written based on customer feedback and new requirements
  2. Added to a new corpus version: [brain]_corpus_v[N+1].jsonl
  3. CHANGELOG documents: customer-requested changes, categories affected
  4. Forge runs on the new version
  5. Smoke test
  6. If passes: deliver updated brain to customer
  7. Previous version preserved — customer can request rollback if the update introduces regressions
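
Step 2 can be implemented so that the previous version is never touched, which is what keeps the rollback in step 7 possible. A minimal sketch, with a hypothetical function name and signature:

```python
import json
from pathlib import Path


def create_next_version(
    corpus_dir: Path, brain: str, current: int, new_pairs: list[dict]
) -> Path:
    """Write v[current+1] as v[current] plus new pairs; never modify v[current]."""
    src = corpus_dir / f"{brain}_corpus_v{current}.jsonl"
    dst = corpus_dir / f"{brain}_corpus_v{current + 1}.jsonl"

    lines = [ln for ln in src.read_text(encoding="utf-8").splitlines() if ln.strip()]
    lines += [json.dumps(pair, ensure_ascii=False) for pair in new_pairs]

    # Mode "x" raises FileExistsError if the target exists, so an existing
    # version can never be overwritten by accident.
    with dst.open("x", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return dst
```

Calling it twice with the same `current` fails on the second call instead of silently replacing the new version.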

BRAIN BUILDER CORPUS CREATION

For new customer brains (SPEC_BRAIN_BUILDER.md):

  1. Sisters conduct intake interview
  2. LEARNX processes transcript → candidate pairs
  3. Lobster reviews and structures into categories per SPEC_TRAINING_PAIR_STANDARDS
  4. v1 corpus created: ~/corpora/customer_[name]/[brain]_corpus_v1.jsonl
  5. Forge, smoke, deliver
  6. Customer corpus content is deleted within 30 days of delivery per SPEC_PRIVACY_POLICY; CHANGELOG metadata (pair counts, categories, no content) is preserved
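
The retention rule in step 6 reduces to a date calculation plus a content-free summary. The sketch below assumes the 30-day window starts at delivery (as INV-06 states) and that each pair record carries a `category` key; both function names are hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # window stated in step 6


def deletion_deadline(delivered: date) -> date:
    """Latest date by which customer corpus content must be deleted."""
    return delivered + timedelta(days=RETENTION_DAYS)


def retained_metadata(pairs: list[dict]) -> dict:
    """Summarise a corpus as counts only, with no instruction or response text."""
    categories: dict[str, int] = {}
    for pair in pairs:
        categories[pair["category"]] = categories.get(pair["category"], 0) + 1
    return {"pair_count": len(pairs), "categories": categories}
```

For a brain delivered on 2026-04-20, `deletion_deadline(date(2026, 4, 20))` gives 2026-05-20.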

REPRODUCIBILITY GUARANTEE

Any forge can be reproduced by:

  1. Taking the exact corpus version JSONL file
  2. Using the same base model (Qwen2.5-7B-Instruct or as documented)
  3. Using the same Modelfile
  4. Using the same hyperparameters (epochs, learning rate — documented in CHANGELOG)
  5. Running the forge

The result should be statistically similar (not identical due to GPU non-determinism, but close). This is why corpus versions are immutable — modifying v1 after the fact destroys reproducibility.
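
The inputs listed above can be captured in a small manifest written at forge time. This is a sketch under stated assumptions: the manifest fields, the function name, and the use of SHA-256 are illustrative and not something this spec defines.

```python
import hashlib
import json


def forge_manifest(
    corpus_bytes: bytes,
    corpus_version: int,
    base_model: str,
    modelfile_bytes: bytes,
    hyperparameters: dict,
) -> str:
    """Return a JSON record of everything needed to re-run a forge."""
    manifest = {
        "corpus_version": corpus_version,
        "corpus_sha256": hashlib.sha256(corpus_bytes).hexdigest(),
        "base_model": base_model,
        "modelfile_sha256": hashlib.sha256(modelfile_bytes).hexdigest(),
        "hyperparameters": hyperparameters,
    }
    # sort_keys makes the output stable, so two manifests can be compared as text.
    return json.dumps(manifest, sort_keys=True, indent=2)
```

Two forges with identical manifests used the same inputs; any remaining difference in results would then come from run-to-run variation such as GPU non-determinism.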


STORAGE AND BACKUP

~/corpora/ is included in the daily GCS backup (SPEC_BACKUP_RECOVERY.md). Every corpus version is preserved. Disk cost is minimal — JSONL files are typically <1MB per corpus.

Reproducibility requires both: the corpus (data) and the forge parameters (metadata in CHANGELOG).


INVARIANTS

INV-01: Corpus versions are immutable. v1 is never modified after forge. Create v2 instead.

INV-02: Every version has a CHANGELOG entry explaining why it exists.

INV-03: GCS mirror matches local copy at forge time. The forge runs from GCS, not local.

INV-04: LEARNX candidates are never auto-merged. Every pair is reviewed.

INV-05: KERNEL pairs require Captain review. Always. No exceptions.

INV-06: Customer corpus content deleted within 30 days of delivery. Metadata (counts, categories) preserved.

INV-07: Reproducibility is guaranteed by: immutable corpus + documented hyperparameters + documented base model.

INV-08: Version numbers are monotonic integers. No decimals. No branches. Simple counting.


INTEGRATION

| System | Relationship |
|---|---|
| SPEC_TRAINING_PAIR_STANDARDS.md | Every pair in every corpus version must pass Q1-Q7. CHANGELOG rejection log cites which criterion failed. |
| SPEC_SMOKE_TEST_FRAMEWORK.md | Smoke results per version stored in smoke_results/. Failure motivates next version. ANVIL v1→v2 is the canonical example. |
| SPEC_LOBSTER_FORGE_PIPELINE.md | Forge pipeline reads corpus version from GCS. INV-03 ensures local and GCS match before forge begins. |
| SPEC_BRAIN_FACTORY_PIPELINE.md | Customer brain creation starts at v1. Update rounds increment the version. Same protocol. |
| SPEC_PRIVACY_POLICY.md | Customer corpus content deleted ≤30 days post-delivery per INV-06. CHANGELOG metadata (no content) survives. |
| SPEC_BACKUP_RECOVERY.md | ~/corpora/ included in daily GCS backup. Every version preserved off-site. |
| FORGEX | Forge status reports include corpus version being used. LOBSTER_LOG records version number in every forge entry. |


Jeremy Zlabis

Chronogeometer · Visionary · Disruptor · Chief

42 Sisters AI · East York, Toronto

🍁 Φ 0.042