Oracle Cache Health

SPEC_ORACLE_CACHE_HEALTH.md · 2026-04-20

SPEC_ORACLE_CACHE_HEALTH — Oracle Cache Health Monitoring

Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16


PURPOSE

Defines the health monitoring contract for the oracle_toll cache subsystem: the running service (oracle-toll.service, port 8889), its storage directory (~/oracle_verdicts/), and the entries within it. The Oracle Verdict Pipeline's resilience depends on cache availability; cache unavailability causes non-deterministic Gemini regeneration (GAP-08 in SPEC_ORACLE_VERDICT_PIPELINE.md) and revenue-impacting divergence between emailed verdicts and browser-rendered verdicts.

This spec defines:


INPUTS

1 — Service health probe

2 — oracle_verdicts/ directory probe

3 — Cache entry integrity sample

4 — Monitor configuration

5 — Baseline (current observed state)

As of 2026-04-16: ~/oracle_verdicts/ is empty (0 entries, 4096 bytes directory). This confirms GAP-07 from SPEC_ORACLE_VERDICT_PIPELINE.md — either production verdicts are not being cached, or oracle_toll is not reachable from Northflank. Baseline is therefore unestablished; the health monitor's first run after deployment establishes the operational baseline.


OUTPUTS

1 — Health status record (written per check)

Written to ~/oracle_toll_health.log (append) in JSON-lines format:


{
  "ts": "<ISO UTC>",
  "service_reachable": true | false,
  "service_response_ms": 123,
  "http_status": 200,
  "verdicts_count": 42,
  "verdicts_dir_bytes": 819200,
  "disk_free_mb": 12340,
  "oldest_entry_age_days": 2.4,
  "newest_entry_age_days": 0.01,
  "sample_integrity": "ok" | "corrupt:{filename}",
  "overall": "healthy" | "degraded" | "failed"
}

[GAP — oracle_toll_health.log does not yet exist; this spec creates the contract for it]
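
A minimal sketch of how the monitor might write one record per check, assuming the record is serialized as a single compact JSON line. The helper name `append_health_record` is hypothetical and not part of oracle_toll.py; the caller supplies every field except `ts`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_health_record(log_path: Path, record: dict) -> str:
    """Append one health record as a single JSON line and return that line.

    Hypothetical helper: the field names come from the record layout above,
    but the function itself is an illustration, not existing code.
    """
    line = json.dumps(
        {"ts": datetime.now(timezone.utc).isoformat(), **record},
        separators=(",", ":"),
    )
    # Mode "a" creates the file if missing and never truncates existing content.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

Opening in append mode keeps earlier records intact, so an external rotation tool remains the only thing that ever shortens the file.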

2 — Health endpoint extension (oracle_toll.py /health)

The existing /health endpoint MUST be extended to include cache-layer fields:


{
  "status": "resonant" | "degraded" | "failed",
  "phi": 0.042,
  "timestamp": "<ISO>",
  "dry_run": false,
  "cache": {
    "verdicts_dir": "/home/nous/oracle_verdicts",
    "verdicts_count": 42,
    "dir_bytes": 819200,
    "oldest_entry_age_days": 2.4,
    "newest_entry_age_days": 0.01,
    "disk_free_mb": 12340,
    "status": "healthy" | "degraded" | "failed"
  }
}

[GAP — this cache block does not exist in the current /health endpoint; needs implementation]
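
One way the directory-derived fields of the cache block could be computed is sketched below. This is an illustration under assumptions, not the oracle_toll.py implementation: `cache_block` is a hypothetical name, `dir_bytes` is taken as the sum of file sizes, ages are reported as null for an empty directory, and the `status` field is left out because its derivation depends on thresholds defined elsewhere.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def cache_block(verdicts_dir: Path, now: Optional[float] = None) -> dict:
    """Collect read-only statistics about the verdict cache directory."""
    now = time.time() if now is None else now
    files = [p for p in verdicts_dir.iterdir() if p.is_file()]
    stats = [p.stat() for p in files]  # stat() only reads metadata
    mtimes = [s.st_mtime for s in stats]

    def age_days(mtime: float) -> float:
        return round((now - mtime) / 86400, 2)

    return {
        "verdicts_dir": str(verdicts_dir),
        "verdicts_count": len(files),
        # Assumption: bytes used by entries, not the directory inode itself.
        "dir_bytes": sum(s.st_size for s in stats),
        # Assumption: ages are None when there are no entries.
        "oldest_entry_age_days": age_days(min(mtimes)) if mtimes else None,
        "newest_entry_age_days": age_days(max(mtimes)) if mtimes else None,
        "disk_free_mb": shutil.disk_usage(verdicts_dir).free // (1024 * 1024),
    }
```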

3 — CREW_CHANNEL alert (threshold breach)

When any monitored metric crosses a threshold, the monitor dispatches an alert through the channel defined in SPEC_ORACLE_CACHE_ALERT.md. The health monitor is the primary source of CRITICAL alerts, which it raises when the service is unreachable for ≥ CACHE_CONSECUTIVE_FAIL_CRITICAL consecutive probes.
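
The consecutive-probe rule can be sketched as a small counter. The class name `ProbeTracker` and its return values are hypothetical. Two behaviors are assumptions rather than requirements of this spec: the alert fires once, on the probe that reaches the threshold, and a "recovered" event is emitted on the first success afterwards.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks consecutive failed probes against a CRITICAL threshold."""

    critical_after: int  # stands in for CACHE_CONSECUTIVE_FAIL_CRITICAL
    consecutive_failures: int = 0

    def record(self, reachable: bool) -> str:
        """Return "critical", "recovered", or "none" for this probe."""
        if reachable:
            was_critical = self.consecutive_failures >= self.critical_after
            self.consecutive_failures = 0
            return "recovered" if was_critical else "none"
        self.consecutive_failures += 1
        # Fire only on the probe that reaches the threshold (assumption),
        # so a long outage does not re-send the alert on every probe.
        if self.consecutive_failures == self.critical_after:
            return "critical"
        return "none"
```

A single successful probe resets the counter, so intermittent failures below the threshold never escalate.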

4 — Cleanup manifest (automated)

When entries older than CACHE_STALENESS_THRESHOLD_DAYS are removed, a cleanup record is written to ~/oracle_verdicts_cleanup.log:


[CLEANUP] <ISO_TIMESTAMP> removed <N> entries older than <X> days. Freed <Y> MB.
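
A hedged sketch of an age-gated cleanup pass that writes its audit trail before deleting anything, as the INVARIANTS section requires. The function name `cleanup_stale` and the "pending delete" line format are assumptions; only the final summary line follows the format shown above.

```python
import os
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def cleanup_stale(verdicts_dir: Path, log_path: Path,
                  threshold_days: float, now: Optional[float] = None) -> int:
    """Delete entries whose mtime is older than the threshold; return count."""
    now = time.time() if now is None else now
    cutoff = now - threshold_days * 86400
    # Selection is by age only; entry count never decides what is removed.
    stale = [p for p in sorted(verdicts_dir.iterdir())
             if p.is_file() and p.stat().st_mtime < cutoff]
    if not stale:
        return 0
    freed_mb = sum(p.stat().st_size for p in stale) / (1024 * 1024)
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as log:
        # Record every target first and force it to disk, so the audit
        # trail survives an interruption during deletion.
        for p in stale:
            log.write(f"[CLEANUP] {stamp} pending delete: {p.name}\n")
        log.flush()
        os.fsync(log.fileno())
        for p in stale:
            p.unlink(missing_ok=True)
        log.write(f"[CLEANUP] {stamp} removed {len(stale)} entries older "
                  f"than {threshold_days} days. Freed {freed_mb:.2f} MB.\n")
    return len(stale)
```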

INVARIANTS

  1. Check interval is bounded — CACHE_HEALTH_CHECK_INTERVAL_S MUST be between 30 and 3600 seconds inclusive. Values outside this range are rejected at startup. 30 seconds is the shortest allowed interval (prevents RPC flooding); 3600 seconds is the longest (the stale-detection window must not exceed 1 hour).
  1. Disk floor is hard — When free disk space on the oracle_verdicts/ mount falls below CACHE_DISK_FLOOR_MB, the health status transitions immediately to "failed" (alert severity CRITICAL) and cleanup of entries older than CACHE_STALENESS_THRESHOLD_DAYS is triggered automatically. Cleanup MUST execute before the next verdict write is attempted.
  1. Integrity sampling is non-destructive — Reading entries for integrity checks MUST NOT modify, lock, or delete any entry. Integrity checks are read-only operations.
  1. Cleanup is age-gated, not count-gated — Cleanup removes entries by age (mtime < now - CACHE_STALENESS_THRESHOLD_DAYS), never by count. Deleting the newest entries to enforce a count cap would delete active, payable customer verdicts. Count thresholds trigger warnings; age gates trigger deletion.
  1. Service liveness and cache health are separate dimensions — The service being reachable (HTTP 200 on /health) does NOT imply the cache is healthy. The oracle_verdicts/ directory could be full, corrupted, or inaccessible even when the service is up. Both dimensions are checked and reported independently.
  1. Health log is append-only — ~/oracle_toll_health.log is never truncated by the monitor. Rotation is handled by external logrotate. The monitor MUST NOT call open(path, 'w') on this file. [GAP — logrotate config for oracle_toll_health.log not yet defined]
  1. Empty directory is a valid state, not a failure — verdicts_count = 0 is healthy, not degraded. It means no paid verdicts have been cached yet (or cleanup ran recently). This is distinct from the GAP-07 case (an empty directory when verdicts are expected); detecting the latter requires cross-referencing with payment logs, which is out of scope for this spec. [GAP — cross-referencing cache count against Stripe payment count would close GAP-07 definitively but is not specified here]
  1. Cleanup is logged before execution — Before removing any entry, the cleanup process writes the list of files to be deleted to oracle_verdicts_cleanup.log. This ensures an audit trail exists even if the delete operation is interrupted mid-run.
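
The interval bound in the first invariant could be enforced at startup roughly as follows. This is a sketch: the function name `load_check_interval` is hypothetical, reading the value from the process environment is an assumption, and the default of 60 seconds is taken from the verification criteria.

```python
import os
from typing import Mapping, Optional

MIN_INTERVAL_S = 30
MAX_INTERVAL_S = 3600
DEFAULT_INTERVAL_S = 60  # default interval named in the verification criteria


def load_check_interval(env: Optional[Mapping[str, str]] = None) -> int:
    """Read and validate CACHE_HEALTH_CHECK_INTERVAL_S; raise if out of range."""
    env = os.environ if env is None else env
    raw = env.get("CACHE_HEALTH_CHECK_INTERVAL_S", str(DEFAULT_INTERVAL_S))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"CACHE_HEALTH_CHECK_INTERVAL_S must be an integer, got {raw!r}"
        ) from None
    if not MIN_INTERVAL_S <= value <= MAX_INTERVAL_S:
        raise ValueError(
            f"CACHE_HEALTH_CHECK_INTERVAL_S={value} is outside "
            f"[{MIN_INTERVAL_S}, {MAX_INTERVAL_S}] seconds"
        )
    return value
```

Raising at load time means a misconfigured monitor fails visibly at startup instead of probing too often or too rarely.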

VERIFICATION CRITERIA

Σ.✓ conditions — health monitoring is operating correctly when:

  1. Σ.✓ Service health probe fires on interval — Start the health monitor and verify via ~/oracle_toll_health.log that new JSON-lines entries appear every CACHE_HEALTH_CHECK_INTERVAL_S ± 5 seconds. After 5 minutes at default 60s interval, log contains 5 ± 1 entries.
  1. Σ.✓ Service unreachable triggers CRITICAL — Stop oracle-toll.service. Verify that after CACHE_CONSECUTIVE_FAIL_CRITICAL consecutive failed probes, a CRITICAL alert appears in ~/CREW_CHANNEL and ~/ALERT.log. Restart oracle-toll.service. Verify next successful probe logs overall: "healthy" and broadcasts a recovery notice to CREW_CHANNEL. [GAP — recovery notification is a design choice not yet specified; recovery broadcast is a recommended addition]
  1. Σ.✓ Disk floor alarm — Fill ~/oracle_verdicts/ with synthetic entries until free space drops below CACHE_DISK_FLOOR_MB. Verify health log records overall: "failed" and cleanup fires automatically. After cleanup, verify disk usage drops and next probe returns overall: "healthy".
  1. Σ.✓ Corrupt entry detection — Write a non-JSON file to ~/oracle_verdicts/test_corrupt.json. On the next health check interval, verify integrity sample catches the corrupt entry and logs sample_integrity: "corrupt:test_corrupt.json" with overall: "degraded". Verify a CRITICAL alert is dispatched per SPEC_ORACLE_CACHE_ALERT.md FAILURE MODE 4.
  1. Σ.✓ Age-based cleanup — Create synthetic entries with mtime older than CACHE_STALENESS_THRESHOLD_DAYS (use touch -t to backdate). Run cleanup manually or wait for next interval. Verify: (a) cleanup log entry written before files deleted; (b) old entries removed; (c) entries newer than threshold preserved.
  1. Σ.✓ Health endpoint extension — GET http://68.183.206.103:8889/health returns JSON with a cache block containing all seven fields defined in OUTPUTS section 2. Validate schema with a JSON Schema check.
  1. Σ.✓ Entry count thresholds — Populate ~/oracle_verdicts/ with CACHE_ENTRY_COUNT_WARN + 1 synthetic entries. Verify next health probe logs overall: "degraded" (not "failed") and dispatches WARN to CREW_CHANNEL. Populate to CACHE_ENTRY_COUNT_CRITICAL + 1 entries. Verify overall: "failed" and ERROR alert. [GAP — neither threshold has been calibrated against production usage patterns because oracle_verdicts/ is empty; defaults are provisional]
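
For the health endpoint criterion, a standard-library stand-in for a full JSON Schema check might look like the sketch below. It is an assumption-laden illustration: `cache_block_errors` is a hypothetical name, the type expectations are inferred from the example values in OUTPUTS section 2, and null ages are tolerated because the spec does not say what an empty directory reports.

```python
from typing import List

# Expected types, inferred from the example payload (an assumption).
CACHE_FIELD_TYPES = {
    "verdicts_dir": str,
    "verdicts_count": int,
    "dir_bytes": int,
    "oldest_entry_age_days": (int, float, type(None)),
    "newest_entry_age_days": (int, float, type(None)),
    "disk_free_mb": int,
    "status": str,
}
ALLOWED_STATUS = {"healthy", "degraded", "failed"}


def cache_block_errors(payload: dict) -> List[str]:
    """Return a list of problems with payload["cache"]; empty means valid."""
    cache = payload.get("cache")
    if not isinstance(cache, dict):
        return ["missing or non-object 'cache' block"]
    errors = []
    for name, expected in CACHE_FIELD_TYPES.items():
        if name not in cache:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(cache[name], bool) or not isinstance(cache[name], expected):
            errors.append(f"wrong type for {name}: {type(cache[name]).__name__}")
    status = cache.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        errors.append(f"invalid status: {status!r}")
    return errors
```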

FAILURE MODES

  1. Σ.⊠ oracle_toll service crashed — oracle-toll.service has exited (see oracle_toll.log — ERROR: address already in use observed 2026-04-16, indicating a restart storm). Health probe returns ConnectionRefusedError. Monitor detects service unreachable, increments consecutive-fail counter, dispatches CRITICAL after threshold. Meanwhile, verdictCache.ts falls back to Gemini regeneration (non-deterministic verdict risk). Mitigation: systemd Restart=always with RestartSec=20 will revive the service; health monitor tracks recovery and broadcasts when service returns.
  1. Σ.⊠ oracle_verdicts/ disk full — Filesystem mount hosting ~/oracle_verdicts/ reaches capacity. store_cached_verdict() raises OSError: [Errno 28] No space left on device and oracle_toll returns HTTP 500. Current behavior: cacheVerdict() in verdictCache.ts silently swallows the 500. Target behavior: health monitor's disk-floor check (INVARIANT-2) detects low disk pre-emptively and fires cleanup before exhaustion. If disk fills faster than check interval, HTTP 500 from oracle_toll triggers CRITICAL alert via cache write alerting path.
  1. Σ.⊠ oracle_verdicts/ directory missing — Directory deleted manually or mount point changed. oracle_toll.py calls VERDICTS_DIR.mkdir(exist_ok=True) at startup, so the directory is recreated on next service restart. However, if the mount point itself is gone, mkdir may create the directory on the root filesystem (masking the mount failure). Health monitor MUST verify the directory is on the expected filesystem (by checking device ID), not just that it exists. [GAP — device ID check not specified in current design; recommended addition]
  1. Σ.⊠ Health check loop dies — The health monitor process itself crashes. No more health records written; no more alerts. This is an unmonitored-monitor failure. Mitigation: health monitor MUST be run as a systemd service (oracle-cache-health.service) with Restart=always. NOUS or κ should verify it is running at boot per CLAUDE.md boot sequence. [GAP — oracle-cache-health.service does not yet exist]
  1. Σ.⊠ Integrity sample misses corrupt entries — Sampling N entries from a directory of M entries has a miss probability of approximately (1 - N/M)^corrupt_count, treating each corrupt entry as independently missed. For large M and small corrupt_count, corruption may not be detected in any given sample. Full integrity scan is impractical at high entry counts. Mitigation: increase sample size proportionally when verdicts_count > CACHE_ENTRY_COUNT_WARN; run full scan during off-peak hours (Sunday 02:00 UTC). [GAP — scheduled full scan not yet specified]
  1. Σ.⊠ Health log unbounded growth — ~/oracle_toll_health.log accumulates indefinitely at rate 1 line / CACHE_HEALTH_CHECK_INTERVAL_S. At 60s interval, that is ~1440 lines/day, ~525,600 lines/year. Each line is ~300 bytes → ~150 MB/year. Not critical but needs logrotate. [GAP — logrotate config not specified; recommended: daily rotation, keep 30 days]
  1. Σ.⊠ Staleness threshold removes valid entries — A customer pays, the verdict is cached, the result page loads, and the customer returns about a week later to recheck. If CACHE_STALENESS_THRESHOLD_DAYS = 7, their entry is deleted on day 7. A return on day 8 gets a 404 from oracle_toll, and the result page falls back to Gemini regeneration, which may produce a different verdict. This is the non-determinism risk (GAP-08 in parent spec). The health monitor's cleanup CANNOT be made safe for arbitrary staleness thresholds without a customer-visible TTL promise. [GAP — customer-facing cache TTL policy not established; until it is, CACHE_STALENESS_THRESHOLD_DAYS should default to 30, not 7]
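
The miss probability quoted for integrity sampling can be checked numerically. The sketch below compares the (1 - N/M)^corrupt_count expression with the exact figure for sampling N entries without replacement, which is the hypergeometric ratio C(M - k, N) / C(M, N) for k corrupt entries. Function names are hypothetical.

```python
from math import comb


def miss_probability_exact(m: int, n: int, corrupt: int) -> float:
    """P(a sample of n distinct entries out of m contains no corrupt entry)."""
    # math.comb returns 0 when n exceeds m - corrupt, giving probability 0.
    return comb(m - corrupt, n) / comb(m, n)


def miss_probability_approx(m: int, n: int, corrupt: int) -> float:
    """The spec's expression: each corrupt entry is missed independently."""
    return (1 - n / m) ** corrupt


if __name__ == "__main__":
    for m, n, k in [(1000, 10, 1), (1000, 10, 5), (10, 5, 2)]:
        print(m, n, k,
              round(miss_probability_exact(m, n, k), 4),
              round(miss_probability_approx(m, n, k), 4))
```

With one corrupt entry the two expressions agree exactly; they drift apart as the sample becomes a large fraction of the directory, where the approximation overstates the chance of a miss.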

DEPENDENCIES

| Dependency | Role |
|------------|------|
| oracle_toll.py (port 8889) | Monitored service; provides /health and /cache/ endpoints |
| ~/oracle_verdicts/ | Monitored directory; contains verdict files |
| Filesystem mount on /home/nous/ | Disk space source for floor check |
| ~/oracle_toll_health.log | Health log output (created by monitor) |
| ~/CREW_CHANNEL | Alert destination |
| ~/ALERT.log | CRITICAL alert destination + κ boot pickup |
| SPEC_ORACLE_CACHE_ALERT.md | Alert dispatch spec — health monitor triggers it |


DEPENDENTS

| Dependent | How it depends |
|-----------|---------------|
| SPEC_ORACLE_VERDICT_PIPELINE.md | Health monitoring partially addresses GAP-07 from that document (see HEALTH-GAP-07) |
| SPEC_ORACLE_CACHE_ALERT.md | Health monitor is primary source of CRITICAL alerts; both specs are companion documents |
| Oracle Verdict Pipeline (operational) | Health status determines whether cache is safe to use or fallback should be pre-emptively applied |
| C.L.O.D. boot sequence (CLAUDE.md) | Boot sequence should include tail ~/oracle_toll_health.log check |


GAPS IDENTIFIED DURING SPECIFICATION

| Gap ID | Description | Impact |
|--------|-------------|--------|
| HEALTH-GAP-01 | /health endpoint in oracle_toll.py does not include cache-layer fields — must be extended | HIGH — external consumers (Northflank health checks, monitoring dashboards) have no cache visibility |
| HEALTH-GAP-02 | No oracle-cache-health.service systemd service exists — health monitor runs nowhere | CRITICAL — spec exists but no implementation; nothing is monitoring |
| HEALTH-GAP-03 | Configuration env vars (CACHE_HEALTH_CHECK_INTERVAL_S, etc.) not yet defined or documented in AETHER_PARAMETERS.env | MEDIUM — hardcoded defaults not operationally tunable |
| HEALTH-GAP-04 | No logrotate config for oracle_toll_health.log | LOW — log will grow unbounded; 150 MB/year is manageable short-term |
| HEALTH-GAP-05 | Recovery notification (service comes back up after CRITICAL) not specified — CREW_CHANNEL receives the alarm but not the all-clear | MEDIUM — ops team left hanging after an incident |
| HEALTH-GAP-06 | CACHE_STALENESS_THRESHOLD_DAYS default of 7 days risks cleaning entries customers expect to persist — needs a customer-facing TTL policy decision before cleanup can be safely enabled | HIGH — cleanup automation blocked until policy resolved |
| HEALTH-GAP-07 | GAP-07 from parent spec (empty oracle_verdicts/ in production) is only partially addressed — cross-referencing cache count against Stripe payment count requires payment log access not available to this monitor | MEDIUM — root cause of empty directory unresolved |
| HEALTH-GAP-08 | No scheduled full integrity scan — sampling misses low-frequency corruption | LOW — sampling catches most issues; full scan deferred |


REFERENCES

| File | Role |
|------|------|
| /home/nous/oracle_toll.py | Monitored service; /health, /cache/ endpoints |
| /home/nous/memories/SPEC_ORACLE_VERDICT_PIPELINE.md | Parent spec; GAP-07 is partially addressed here |
| /home/nous/memories/SPEC_ORACLE_CACHE_ALERT.md | Companion spec — alerting contract for threshold breaches |
| /etc/systemd/system/oracle-toll.service | Service definition; Restart=always is the liveness backstop |
| /home/nous/oracle_toll.log | Service operational log (port-already-in-use errors observed 2026-04-16) |
| /home/nous/oracle_verdicts/ | Cache storage directory (currently empty) |


Φζ.⊤.


Jeremy Zlabis

Chronogeometer · Visionary · Disruptor · Chief

42 Sisters AI · East York, Toronto

🍁 Φ 0.042