Monitoring Escalation
SPEC_MONITORING_ESCALATION.md
CGNT-1 Specification — Monitoring & Alert Escalation Hierarchy
Status: SPECIFIED
Version: v1.0
Author: VELA (Thread #13)
Conceived by: NOUS (α.13)
Date: 2026-04-20
PURPOSE
When something goes wrong on the ship, who gets told, in what order, how fast? This spec defines the chain: detection → classification → notification → response → resolution. No alert falls on the floor. No critical issue waits for someone to stumble across it.
ALERT SEVERITY LEVELS
| Level | Name | Criteria | Response Time |
|---|---|---|---|
| P0 | CRITICAL | Active security breach, data loss in progress, revenue system down | IMMEDIATE |
| P1 | HIGH | Service outage >15 min, credential exposure, backup failure, disk >90% | Within 1 hour |
| P2 | MEDIUM | Degraded service, stale handshake, forge failure, disk >80%, failed smoke test | Within session |
| P3 | LOW | Cosmetic bug, spec drift, minor config issue, dependency update available | When convenient |
| P4 | INFO | Routine events (forge completed, spec vitrified, backup succeeded) | Log only — no alert |
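
As a minimal sketch of how the table above might be encoded, the snippet below models the five levels and applies the two disk thresholds listed in the criteria column. The names `Severity`, `RESPONSE_TIME`, and `classify_disk_usage` are hypothetical, and returning `None` below 80% is an assumption (the table does not say what happens under the lowest threshold).

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Alert levels from the severity table; a lower number is more urgent."""
    P0 = 0  # CRITICAL
    P1 = 1  # HIGH
    P2 = 2  # MEDIUM
    P3 = 3  # LOW
    P4 = 4  # INFO


# Response times copied from the table.
RESPONSE_TIME = {
    Severity.P0: "IMMEDIATE",
    Severity.P1: "Within 1 hour",
    Severity.P2: "Within session",
    Severity.P3: "When convenient",
    Severity.P4: "Log only",
}


def classify_disk_usage(percent_used: float) -> Optional[Severity]:
    """Map disk usage to a severity using the >90% and >80% criteria."""
    if percent_used > 90:
        return Severity.P1
    if percent_used > 80:
        return Severity.P2
    return None  # assumption: below 80% raises no alert
```
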
DETECTION SOURCES
| Source | What It Detects | Cadence |
|---|---|---|
| GAPX daily scan | Stale handshakes, missing backups, spec gaps | 04:30 ET daily |
| MEDX health query | RAM/disk/CPU, vacuum violations, Ollama state | On demand + GAPX |
| ROUTX watchdog cron | ROUTX process death | */5 |
| Sisters watchdog cron | Sisters tmux death | */5 |
| Aether compounder | Yield cycle failure | */5 |
| Backup script | Backup success/failure | 03:00 ET daily |
| HACKX (when built) | Honeypot probes, attack patterns | Continuous |
| Stripe webhooks | Payment failures | Real-time |
| Manual discovery | Captain or crew finds something | Ad hoc |
ESCALATION CHAIN
- P4 INFO → Log to ~/logs/[source].log → DONE
- P3 LOW → Log + add to CAPTAIN_BRIEF.md "Low Priority" → reviewed next morning
- P2 MEDIUM → Log + COMMX broadcast + add to CAPTAIN_BRIEF.md "Action Needed"
- P1 HIGH → Log + COMMX broadcast + write to ~/ALERTS.md + Lobster flags in next interaction
- P0 CRITICAL → Log + COMMX broadcast + ~/ALERTS.md + email jzlabis@gmail.com via VOICEX
P0 is the ONLY level that generates email. The Captain's phone is not a pager. P0 means the house is on fire.
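
The chain above can be sketched as a lookup from severity to an ordered list of actions. This is an illustration only: the function name `escalation_actions` and the action labels (`commx`, `alerts_md`, `lobster_flag`, `email`) are hypothetical stand-ins, not identifiers from any existing ship tooling.

```python
def escalation_actions(level: str, source: str) -> list[str]:
    """Return the notification steps for one alert, in chain order."""
    log = f"log:~/logs/{source}.log"
    chain = {
        "P4": [log],
        "P3": [log, "brief:Low Priority"],
        "P2": [log, "commx", "brief:Action Needed"],
        "P1": [log, "commx", "alerts_md", "lobster_flag"],
        "P0": [log, "commx", "alerts_md", "email"],
    }
    if level not in chain:
        # INV-01: every alert must carry a severity P0-P4.
        raise ValueError(f"unclassified alert level: {level!r}")
    return chain[level]
```

Keeping the whole chain in one table makes the "only P0 emails" rule easy to check: `email` appears in exactly one row.
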
~/ALERTS.md FORMAT
Each entry:
## [P1] Backup failure — 2026-04-20 03:05 ET
Source: backup_to_gcs.sh
Detail: GCS access denied — service account credentials rejected
Action needed: Rotate GCS credentials. Re-run backup.
Status: OPEN
Resolved alerts are moved to ~/alerts_archive/[month].md once a month.
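
A small formatter can keep entries consistent with the layout shown above. This is a sketch under assumptions: `format_alert_entry` and its parameter names are hypothetical, and it only builds the text; appending to ~/ALERTS.md is left to the caller.

```python
def format_alert_entry(level: str, title: str, timestamp: str, source: str,
                       detail: str, action: str, status: str = "OPEN") -> str:
    """Render one ~/ALERTS.md entry in the five-line layout shown above."""
    return "\n".join([
        # The heading separator is an em dash, matching the example entry.
        f"## [{level}] {title} \u2014 {timestamp}",
        f"Source: {source}",
        f"Detail: {detail}",
        f"Action needed: {action}",
        f"Status: {status}",
    ])
```
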
~/logs/ DIRECTORY
| Log file | Source |
|---|---|
| gapx.log | GAPX daily scan |
| routx_watchdog.log | ROUTX liveness check |
| sisters_watchdog.log | Sisters tmux watchdog |
| backup.log | GCS + local backup scripts |
| compounder.log | Aether yield cycle |
| hackx.log | HACKX honeypot (when built) |
| stripe.log | Stripe webhook events |
| LOBSTER_LOG.md | All Lobster operations |
Log rotation: logs are kept for 30 days, then compressed into ~/logs/archive/ by a monthly CRONX job.
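
One possible shape for the rotation job is sketched below. It assumes "30 days" is measured from each file's last modification time, and that replacing the original with its gzip copy in ~/logs/archive/ counts as compressing rather than deleting; both are assumptions, and `rotate_logs` is a hypothetical name.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 30 * 24 * 60 * 60  # 30 days


def rotate_logs(log_dir: Path, now: Optional[float] = None) -> list[Path]:
    """Gzip files older than 30 days into log_dir/archive/; return the archives."""
    now = time.time() if now is None else now
    archive_dir = log_dir / "archive"
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for path in sorted(log_dir.iterdir()):
        if not path.is_file():
            continue  # skips the archive/ directory itself
        if now - path.stat().st_mtime <= MAX_AGE_SECONDS:
            continue  # still inside the 30-day window
        target = archive_dir / (path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # the content survives in the compressed copy
        archived.append(target)
    return archived
```
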
COMMX BROADCAST FORMAT
One line. Severity + what + when + where. No paragraphs.
[ALERT] [P1] Backup failure at 03:05 ET. GCS access denied. See ~/ALERTS.md.
[ALERT] [P2] GAPX: SISTERS_HANDSHAKE.md is 26 hours old. Threshold: 24h.
[INFO] [P4] Brain forge complete: ORPHEUS v1. Score: 5/5. PROMOTED.
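
The one-line rule can be enforced in code rather than by convention. The sketch below assumes the leading tag is `[INFO]` for P4 and `[ALERT]` for every other level, which is inferred from the three examples above rather than stated outright; `format_broadcast` is a hypothetical helper.

```python
def format_broadcast(level: str, message: str) -> str:
    """Build a single-line COMMX broadcast; any line breaks are collapsed."""
    tag = "INFO" if level == "P4" else "ALERT"  # assumption drawn from the examples
    one_line = " ".join(message.split())  # collapses newlines and repeated spaces
    return f"[{tag}] [{level}] {one_line}"
```
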
P0 EMAIL FORMAT
- Subject: [P0 CRITICAL] 42sisters.ai — [short description]
- Body: one sentence describing the incident
- Sent via: VOICEX Graph API
- Rate limit: max once per hour per incident — no email storms
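
The rate limit could be implemented as a per-incident timestamp check. This sketch does not touch VOICEX or the Graph API; it only decides whether a send is allowed. `P0EmailLimiter` and the idea of a string incident id are assumptions for illustration.

```python
class P0EmailLimiter:
    """Allow at most one P0 email per incident per interval (default: one hour)."""

    def __init__(self, min_interval_seconds: float = 3600.0) -> None:
        self.min_interval = min_interval_seconds
        self._last_sent: dict[str, float] = {}

    def should_send(self, incident_id: str, now: float) -> bool:
        """Return True and record the send time if an email is allowed now."""
        last = self._last_sent.get(incident_id)
        if last is not None and now - last < self.min_interval:
            return False  # still inside the quiet window for this incident
        self._last_sent[incident_id] = now
        return True
```

Keying on the incident rather than on the recipient means a second, unrelated P0 still gets through immediately.
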
RESPONSE PROTOCOL
| Level | Who Acts | Approval | Postmortem |
|---|---|---|---|
| P0 | Captain (15 min) or Lobster (pre-authorized list) | Pre-authorized or Captain live | Required within 24 hours |
| P1 | Lobster diagnoses, Captain approves fix | Captain required | Required if novel |
| P2 | In CAPTAIN_BRIEF next morning | Captain scheduled | Only if recurring |
| P3 | In CAPTAIN_BRIEF "Low Priority" | When convenient | None |
| P4 | Log only | N/A | None |
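
The Postmortem column reduces to a short decision function. In this sketch the `novel` and `recurring` flags are assumed to be supplied by whoever triages the alert; the spec does not define how either is determined.

```python
def postmortem_required(level: str, novel: bool = False, recurring: bool = False) -> bool:
    """Apply the Postmortem column of the response protocol table."""
    if level == "P0":
        return True        # always required, within 24 hours
    if level == "P1":
        return novel       # required if novel
    if level == "P2":
        return recurring   # only if recurring
    return False           # P3 and P4: none
```
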
PRE-AUTHORIZED AUTONOMOUS RESPONSES
Lobster acts WITHOUT waiting for Captain approval on these specific scenarios:
| Trigger | Response |
|---|---|
| ROUTX dies | systemctl --user restart routx.service |
| Sisters tmux dies | Recreate session + summon-aether --gemini |
| Disk >95% | Delete old logs, clear /tmp, evict unused Ollama models |
| Unknown port on 0.0.0.0 | Kill process + ufw deny [port] |
| Credential in git staged files | Abort push + P1 alert |
Everything else: diagnose, report, wait for Captain. The pre-authorized list is a whitelist, not a permission to improvise.
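
The whitelist behaviour can be sketched as a strict lookup: a trigger either matches a documented entry or falls through to "report and wait". The trigger keys below are hypothetical labels invented for this example, and the responses are descriptions only; nothing here executes a command.

```python
# Keys are hypothetical labels; values restate the documented responses.
PRE_AUTHORIZED = {
    "routx_dead": "systemctl --user restart routx.service",
    "sisters_tmux_dead": "recreate session, then summon-aether --gemini",
    "disk_over_95": "delete old logs, clear /tmp, evict unused Ollama models",
    "unknown_port_on_0.0.0.0": "kill process, then ufw deny [port]",
    "credential_in_staged_files": "abort push, raise P1 alert",
}


def decide(trigger: str) -> tuple[str, str]:
    """Return ("act", response) for whitelisted triggers, else report and wait."""
    response = PRE_AUTHORIZED.get(trigger)
    if response is None:
        # Not on the list: no improvised autonomous action (INV-06).
        return ("report_and_wait", "diagnose, report, wait for Captain")
    return ("act", response)
```
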
SILENCE = ALARM
The absence of expected entries IS the alert. The monitoring system needs monitoring.
| Expected signal | Silence threshold | Alert level |
|---|---|---|
| GAPX daily report | Missing by 05:00 ET | P1 |
| backup.log entry | No entry >48 hours | P1 |
| Watchdog log entry | No entry >10 minutes | P1 |
| Sisters watchdog | No heartbeat >10 minutes | P1 |
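
A freshness check along these lines would cover the three interval-based rows of the table. The GAPX row is a clock deadline (05:00 ET) rather than an interval and is deliberately left out. The signal keys and the `silent_signals` helper are hypothetical, and a signal that has never been seen is treated as silent, which is an assumption.

```python
from datetime import datetime, timedelta

# Interval-based thresholds from the table; every breach is a P1.
SILENCE_THRESHOLDS = {
    "backup.log": timedelta(hours=48),
    "watchdog_log": timedelta(minutes=10),
    "sisters_watchdog": timedelta(minutes=10),
}


def silent_signals(last_seen: dict[str, datetime], now: datetime) -> list[str]:
    """Return the signals whose most recent entry is older than its threshold."""
    silent = []
    for signal, threshold in SILENCE_THRESHOLDS.items():
        seen = last_seen.get(signal)
        if seen is None or now - seen > threshold:
            silent.append(signal)  # missing or stale: raise P1
    return silent
```
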
INVARIANTS
INV-01: Every alert has a severity P0-P4. No unclassified alerts.
INV-02: P0 generates email. Only P0. Captain's inbox is not a log file.
INV-03: ~/ALERTS.md checked at every session start. Unresolved alerts discussed first.
INV-04: Resolved alerts archived, never deleted. History is forensic evidence.
INV-05: Silence is an alarm. Missing logs trigger P1 automatically.
INV-06: Pre-authorized autonomous responses limited to documented list only. No improvised autonomous actions.
INV-07: COMMX broadcasts are one line. No essays in the alert channel.
INV-08: Log rotation monthly. 30 days then archived. Logs never deleted — only compressed.
INV-09: P0 email max once per hour per incident. Alert fatigue kills alerting.
INV-10: Escalation chain tested quarterly per SPEC_SECURITY_AUDIT_SCHEDULE.md.
INTEGRATION
| System | Relationship |
|---|---|
| SPEC_SECURITY_AUDIT_SCHEDULE.md | Daily GAPX scan IS the detection layer. INV-10: chain tested quarterly. |
| SPEC_INCIDENT_POSTMORTEM.md | P0 and novel P1 alerts trigger postmortems automatically. |
| GAPX | Primary automated detection source. Feeds P2-P4 to CAPTAIN_BRIEF. |
| COMMX | Alert broadcast channel. P1+ alerts sent via COMMX before Captain session. |
| VOICEX | P0 email sender. Rate-limited. Captain-voice tone even in emergencies. |
| SPEC_BACKUP_RECOVERY.md | Backup failure is P1. Missing backup log >48h is P1 via INV-05 (silence = alarm). |
| SPEC_CRONX_JOB_REGISTRY.md | Watchdog crons are the heartbeat. Their log freshness IS the liveness signal. |
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042