Oracle Cache Health
SPEC_ORACLE_CACHE_HEALTH — Oracle Cache Health Monitoring
Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16
PURPOSE
Defines the health monitoring contract for the oracle_toll cache subsystem: the running service (oracle-toll.service, port 8889), its storage directory (~/oracle_verdicts/), and the entries within it. The Oracle Verdict Pipeline's resilience depends on cache availability; cache unavailability causes non-deterministic Gemini regeneration (GAP-08 in SPEC_ORACLE_VERDICT_PIPELINE.md) and revenue-impacting divergence between emailed verdicts and browser-rendered verdicts.
This spec defines:
- What is monitored and how often
- What constitutes a healthy vs. degraded vs. failed cache state
- Automated cleanup rules for stale entries
- The health endpoint contract for external consumers
- Integration with SPEC_ORACLE_CACHE_ALERT.md for threshold-breach alerting
INPUTS
1 — Service health probe
- Target: GET http://68.183.206.103:8889/health (configurable via ORACLE_TOLL_URL)
- Expected response: HTTP 200 + JSON {"status": "resonant", "phi": 0.042, "timestamp": "<ISO>", "dry_run": false}
- Timeout per probe: 5 seconds
[GAP — actual oracle_toll /health endpoint does not include cache-specific status; only service liveness; needs extension]
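A minimal sketch of the liveness probe, using only the Python standard library. It assumes ORACLE_TOLL_URL holds the base URL (with /health appended); the function name `probe_service` and the returned dictionary layout are illustrative, not taken from oracle_toll.py.

```python
import http.client
import json
import os
import time
import urllib.error
import urllib.request

# Assumption: ORACLE_TOLL_URL holds the base URL and /health is appended to it.
DEFAULT_BASE_URL = "http://68.183.206.103:8889"
PROBE_TIMEOUT_S = 5  # per-probe timeout from this spec


def probe_service(base_url=None, timeout=PROBE_TIMEOUT_S):
    """Probe /health once and return the liveness fields of a health record."""
    base_url = base_url or os.environ.get("ORACLE_TOLL_URL", DEFAULT_BASE_URL)
    url = base_url.rstrip("/") + "/health"
    result = {"service_reachable": False, "service_response_ms": None, "http_status": None}
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            result["http_status"] = resp.status
            body = json.loads(resp.read().decode("utf-8"))
            # Reachable here means HTTP 200 with a JSON object body.
            result["service_reachable"] = resp.status == 200 and isinstance(body, dict)
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        result["http_status"] = exc.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, malformed response, or non-JSON body.
        pass
    result["service_response_ms"] = int((time.monotonic() - started) * 1000)
    return result
```

The probe never raises on a network failure; an unreachable service is reported as `service_reachable: False` so the caller can count consecutive failures.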
2 — oracle_verdicts/ directory probe
- Target: ~/oracle_verdicts/ filesystem directory
- Inputs collected: entry count, total directory size in bytes, oldest entry mtime, newest entry mtime, available disk space on mount point
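The directory inputs above could be collected roughly as follows. This is a sketch under assumptions: `probe_directory` is a hypothetical name, the size is the sum of regular file sizes (it excludes the directory inode itself), and ages are reported as `None` when the directory is empty.

```python
import os
import shutil
import time


def probe_directory(path="~/oracle_verdicts"):
    """Collect entry count, total bytes, entry ages, and free disk space."""
    path = os.path.expanduser(path)
    now = time.time()
    count = 0
    total_bytes = 0
    oldest = newest = None
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # only regular files count as cache entries
            st = entry.stat(follow_symlinks=False)
            count += 1
            total_bytes += st.st_size
            oldest = st.st_mtime if oldest is None else min(oldest, st.st_mtime)
            newest = st.st_mtime if newest is None else max(newest, st.st_mtime)

    def age_days(mtime):
        return None if mtime is None else round((now - mtime) / 86400, 2)

    return {
        "verdicts_count": count,
        "verdicts_dir_bytes": total_bytes,
        "oldest_entry_age_days": age_days(oldest),
        "newest_entry_age_days": age_days(newest),
        # Free space on the filesystem that holds the directory.
        "disk_free_mb": shutil.disk_usage(path).free // (1024 * 1024),
    }
```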
3 — Cache entry integrity sample
- Target: random sample of N entries from ~/oracle_verdicts/ (N = min(10, total_count))
- Each sampled entry read and parsed as JSON
- Checked for required top-level keys: tier, query, verdict, cached_at
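One way to sketch the integrity sample, assuming each entry is a single JSON file and that a missing required key is reported the same way as unparseable JSON. The name `sample_integrity` and the decision to stop at the first bad entry are illustrative choices, not part of the spec.

```python
import json
import os
import random

REQUIRED_KEYS = ("tier", "query", "verdict", "cached_at")
MAX_SAMPLE = 10


def sample_integrity(path="~/oracle_verdicts", rng=random):
    """Return "ok" or "corrupt:<filename>" for a random sample of entries."""
    path = os.path.expanduser(path)
    names = sorted(
        n for n in os.listdir(path) if os.path.isfile(os.path.join(path, n))
    )
    # N = min(10, total_count), as specified above.
    for name in rng.sample(names, min(MAX_SAMPLE, len(names))):
        try:
            # Read-only open: the check never modifies or locks an entry.
            with open(os.path.join(path, name), "r", encoding="utf-8") as fh:
                entry = json.load(fh)
        except (OSError, ValueError):
            return f"corrupt:{name}"
        if not isinstance(entry, dict) or any(k not in entry for k in REQUIRED_KEYS):
            return f"corrupt:{name}"
    return "ok"
```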
4 — Monitor configuration
- CACHE_HEALTH_CHECK_INTERVAL_S — seconds between health checks (default: 60) [GAP — not yet a configurable env var]
- CACHE_STALENESS_THRESHOLD_DAYS — entries older than this are candidates for cleanup (default: 7 days) [GAP — not yet configurable]
- CACHE_DISK_FLOOR_MB — minimum free disk space on the oracle_verdicts mount; below this triggers CRITICAL (default: 500 MB) [GAP — not yet configurable]
- CACHE_ENTRY_COUNT_WARN — entry count above which a warning fires (default: 10,000) [GAP — not yet configurable; oracle_verdicts/ is currently empty so no baseline established]
- CACHE_ENTRY_COUNT_CRITICAL — entry count above which cleanup is mandatory (default: 50,000) [GAP — not yet configurable]
- CACHE_CONSECUTIVE_FAIL_CRITICAL — number of consecutive failed health probes before CRITICAL alert (default: 2, per SPEC_ORACLE_CACHE_ALERT.md)
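Since these variables are not yet configurable, the following is only a sketch of how they might be loaded once they are. The `MonitorConfig` class and `load_config` function are hypothetical; the defaults are the ones listed above, and the 30 to 3600 second bound on the check interval comes from the INVARIANTS section of this spec.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    check_interval_s: int = 60
    staleness_threshold_days: int = 7
    disk_floor_mb: int = 500
    entry_count_warn: int = 10_000
    entry_count_critical: int = 50_000
    consecutive_fail_critical: int = 2


# Maps each config field to the environment variable named in this spec.
_ENV_NAMES = {
    "check_interval_s": "CACHE_HEALTH_CHECK_INTERVAL_S",
    "staleness_threshold_days": "CACHE_STALENESS_THRESHOLD_DAYS",
    "disk_floor_mb": "CACHE_DISK_FLOOR_MB",
    "entry_count_warn": "CACHE_ENTRY_COUNT_WARN",
    "entry_count_critical": "CACHE_ENTRY_COUNT_CRITICAL",
    "consecutive_fail_critical": "CACHE_CONSECUTIVE_FAIL_CRITICAL",
}


def load_config(env=None):
    """Build a MonitorConfig from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field_name, env_name in _ENV_NAMES.items():
        raw = env.get(env_name)
        if raw is not None:
            overrides[field_name] = int(raw)  # raises ValueError on non-integers
    cfg = MonitorConfig(**overrides)
    # Reject out-of-range intervals at startup.
    if not 30 <= cfg.check_interval_s <= 3600:
        raise ValueError(
            f"CACHE_HEALTH_CHECK_INTERVAL_S must be 30..3600, got {cfg.check_interval_s}"
        )
    return cfg
```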
5 — Baseline (current observed state)
As of 2026-04-16: ~/oracle_verdicts/ is empty (0 entries, 4096 bytes directory). This confirms GAP-07 from SPEC_ORACLE_VERDICT_PIPELINE.md — either production verdicts are not being cached, or oracle_toll is not reachable from Northflank. Baseline is therefore unestablished; the health monitor's first run after deployment establishes the operational baseline.
OUTPUTS
1 — Health status record (written per check)
Written to ~/oracle_toll_health.log (append) in JSON-lines format:
{
"ts": "<ISO UTC>",
"service_reachable": true | false,
"service_response_ms": 123,
"http_status": 200,
"verdicts_count": 42,
"verdicts_dir_bytes": 819200,
"disk_free_mb": 12340,
"oldest_entry_age_days": 2.4,
"newest_entry_age_days": 0.01,
"sample_integrity": "ok" | "corrupt:{filename}",
"overall": "healthy" | "degraded" | "failed"
}
[GAP — oracle_toll_health.log does not yet exist; this spec creates the contract for it]
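The record above could be classified and appended along these lines. This is a sketch, not the monitor's implementation: `classify_overall` and `append_health_record` are hypothetical names, the threshold mapping follows the verification criteria in this spec, and treating an unreachable service as "failed" is an assumption the spec does not state explicitly.

```python
import json
import os
from datetime import datetime, timezone


def classify_overall(record, disk_floor_mb=500, count_warn=10_000, count_critical=50_000):
    """Map one health record to "healthy", "degraded", or "failed"."""
    if not record["service_reachable"]:
        return "failed"  # assumption: unreachable service counts as failed
    if record["disk_free_mb"] < disk_floor_mb:
        return "failed"
    if record["verdicts_count"] > count_critical:
        return "failed"
    if record["sample_integrity"] != "ok":
        return "degraded"
    if record["verdicts_count"] > count_warn:
        return "degraded"
    return "healthy"  # includes verdicts_count == 0


def append_health_record(record, log_path="~/oracle_toll_health.log"):
    """Append one JSON line to the health log and return the completed record."""
    record = dict(record)
    record.setdefault("ts", datetime.now(timezone.utc).isoformat())
    record["overall"] = classify_overall(record)
    # Mode "a" only: the monitor never truncates this file.
    with open(os.path.expanduser(log_path), "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```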
2 — Health endpoint extension (oracle_toll.py /health)
The existing /health endpoint MUST be extended to include cache-layer fields:
{
"status": "resonant" | "degraded" | "failed",
"phi": 0.042,
"timestamp": "<ISO>",
"dry_run": false,
"cache": {
"verdicts_dir": "/home/nous/oracle_verdicts",
"verdicts_count": 42,
"dir_bytes": 819200,
"oldest_entry_age_days": 2.4,
"newest_entry_age_days": 0.01,
"disk_free_mb": 12340,
"status": "healthy" | "degraded" | "failed"
}
}
[GAP — this cache block does not exist in the current /health endpoint; needs implementation]
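A lightweight check of the extended payload might look like the sketch below. It uses plain Python rather than a JSON Schema library, `validate_health_payload` is a hypothetical helper, and allowing `None` for the two age fields (for an empty directory) is an assumption.

```python
NUMBER_OR_NONE = (int, float, type(None))

# Field names and types follow the cache block shown above.
CACHE_FIELDS = {
    "verdicts_dir": str,
    "verdicts_count": int,
    "dir_bytes": int,
    "oldest_entry_age_days": NUMBER_OR_NONE,
    "newest_entry_age_days": NUMBER_OR_NONE,
    "disk_free_mb": int,
    "status": str,
}
CACHE_STATUSES = {"healthy", "degraded", "failed"}
SERVICE_STATUSES = {"resonant", "degraded", "failed"}


def validate_health_payload(payload):
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    if payload.get("status") not in SERVICE_STATUSES:
        problems.append("status has unexpected value")
    cache = payload.get("cache")
    if not isinstance(cache, dict):
        problems.append("cache block missing")
        return problems
    for name, expected in CACHE_FIELDS.items():
        if name not in cache:
            problems.append(f"cache.{name} missing")
        elif isinstance(cache[name], bool) or not isinstance(cache[name], expected):
            # bool is excluded explicitly because it is a subclass of int.
            problems.append(f"cache.{name} has wrong type")
    if cache.get("status") not in CACHE_STATUSES:
        problems.append("cache.status has unexpected value")
    return problems
```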
3 — CREW_CHANNEL alert (threshold breach)
When any monitored metric crosses a threshold, route to SPEC_ORACLE_CACHE_ALERT.md alerting channel. The health monitor is the primary source of CRITICAL alerts when the service is unreachable for ≥ CACHE_CONSECUTIVE_FAIL_CRITICAL consecutive probes.
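The consecutive-failure rule can be sketched as a small state tracker. `ProbeFailureTracker` is a hypothetical class; firing CRITICAL once per outage (rather than on every failed probe) and emitting a "RECOVERED" signal are assumptions, since recovery notification is listed as an open gap in this spec.

```python
class ProbeFailureTracker:
    """Counts consecutive failed probes and signals state transitions."""

    def __init__(self, critical_after=2):
        # critical_after corresponds to CACHE_CONSECUTIVE_FAIL_CRITICAL.
        self.critical_after = critical_after
        self.consecutive_failures = 0
        self.alerting = False

    def observe(self, service_reachable):
        """Return "CRITICAL", "RECOVERED", or None for this probe result."""
        if service_reachable:
            was_alerting = self.alerting
            self.consecutive_failures = 0
            self.alerting = False
            return "RECOVERED" if was_alerting else None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.critical_after and not self.alerting:
            self.alerting = True
            return "CRITICAL"
        return None


if __name__ == "__main__":
    tracker = ProbeFailureTracker(critical_after=2)
    for reachable in (False, False, False, True):
        print(reachable, tracker.observe(reachable))
```

A single failed probe followed by a success resets the counter without alerting, so transient blips do not reach the alerting channel.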
4 — Cleanup manifest (automated)
When entries older than CACHE_STALENESS_THRESHOLD_DAYS are removed, a cleanup record is written to ~/oracle_verdicts_cleanup.log:
[CLEANUP] <ISO_TIMESTAMP> removed <N> entries older than <X> days. Freed <Y> MB.
INVARIANTS
- Check interval is bounded — CACHE_HEALTH_CHECK_INTERVAL_S MUST be between 30 and 3600 seconds inclusive. Values outside this range are rejected at startup. A check every 30 seconds is the minimum (prevents RPC flooding); a check every hour is the maximum (stale detection window must not exceed 1 hour).
- Disk floor is hard — When free disk space on the oracle_verdicts/ mount falls below CACHE_DISK_FLOOR_MB, the health status transitions immediately to CRITICAL and cleanup of entries older than CACHE_STALENESS_THRESHOLD_DAYS is triggered automatically. Cleanup MUST execute before the next verdict write is attempted.
- Integrity sampling is non-destructive — Reading entries for integrity checks MUST NOT modify, lock, or delete any entry. Integrity checks are read-only operations.
- Cleanup is age-gated, not count-gated — Cleanup removes entries by age (mtime < now - CACHE_STALENESS_THRESHOLD_DAYS), never by count. Deleting the newest entries to enforce a count cap would delete active, payable customer verdicts. Count thresholds trigger warnings; age gates trigger deletion.
- Service liveness and cache health are separate dimensions — The service being reachable (HTTP 200 on /health) does NOT imply the cache is healthy. The oracle_verdicts/ directory could be full, corrupted, or inaccessible even when the service is up. Both dimensions are checked and reported independently.
- Health log is append-only — ~/oracle_toll_health.log is never truncated by the monitor. Rotation is handled by external logrotate. The monitor MUST NOT call open(path, 'w') on this file. [GAP — logrotate config for oracle_toll_health.log not yet defined]
- Empty directory is a valid state, not a failure — verdicts_count = 0 is healthy, not degraded. It means no paid verdicts have been cached yet (or cleanup ran recently). This distinguishes from the GAP-07 mystery (empty directory when verdicts are expected) — detecting the latter requires cross-referencing with payment logs, which is out of scope for this spec. [GAP — cross-referencing cache count against Stripe payment count would close GAP-07 definitively but is not specified here]
- Cleanup is logged before execution — Before removing any entry, the cleanup process writes the list of files to be deleted to oracle_verdicts_cleanup.log. This ensures an audit trail exists even if the delete operation is interrupted mid-run.
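The age-gated and log-before-delete invariants can be combined in one routine. The sketch below is illustrative: `cleanup_stale_entries` is a hypothetical function, and the `[CLEANUP-PLAN]` line prefix used for the pre-delete audit list is an invented convention; only the final `[CLEANUP]` summary line follows the format given in OUTPUTS.

```python
import os
import time
from datetime import datetime, timezone


def cleanup_stale_entries(verdicts_dir, cleanup_log, threshold_days, now=None):
    """Remove entries older than threshold_days; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - threshold_days * 86400
    stale = []
    for name in sorted(os.listdir(verdicts_dir)):
        full = os.path.join(verdicts_dir, name)
        if os.path.isfile(full):
            st = os.stat(full)
            # Age gate only: selection never depends on the entry count.
            if st.st_mtime < cutoff:
                stale.append((full, st.st_size))
    if not stale:
        return 0
    stamp = datetime.now(timezone.utc).isoformat()
    with open(cleanup_log, "a", encoding="utf-8") as log:
        # Write and sync the audit list before touching any file.
        for full, _size in stale:
            log.write(f"[CLEANUP-PLAN] {stamp} {full}\n")
        log.flush()
        os.fsync(log.fileno())
        removed = freed = 0
        for full, size in stale:
            try:
                os.remove(full)
            except FileNotFoundError:
                continue  # already gone; nothing to count
            removed += 1
            freed += size
        log.write(
            f"[CLEANUP] {stamp} removed {removed} entries older than "
            f"{threshold_days} days. Freed {freed / 1_000_000:.2f} MB.\n"
        )
    return removed
```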
VERIFICATION CRITERIA
Σ.✓ conditions — health monitoring is operating correctly when:
- Σ.✓ Service health probe fires on interval — Start the health monitor and verify via ~/oracle_toll_health.log that new JSON-lines entries appear every CACHE_HEALTH_CHECK_INTERVAL_S ± 5 seconds. After 5 minutes at default 60s interval, log contains 5 ± 1 entries.
- Σ.✓ Service unreachable triggers CRITICAL — Stop oracle-toll.service. Verify that after CACHE_CONSECUTIVE_FAIL_CRITICAL consecutive failed probes, a CRITICAL alert appears in ~/CREW_CHANNEL and ~/ALERT.log. Restart oracle-toll.service. Verify next successful probe logs overall: "healthy" and broadcasts a recovery notice to CREW_CHANNEL. [GAP — recovery notification is a design choice not yet specified; recovery broadcast is a recommended addition]
- Σ.✓ Disk floor alarm — Fill ~/oracle_verdicts/ with synthetic entries until free space drops below CACHE_DISK_FLOOR_MB. Verify health log records overall: "failed" and cleanup fires automatically. After cleanup, verify disk usage drops and next probe returns overall: "healthy".
- Σ.✓ Corrupt entry detection — Write a non-JSON file to ~/oracle_verdicts/test_corrupt.json. On the next health check interval, verify integrity sample catches the corrupt entry and logs sample_integrity: "corrupt:test_corrupt.json" with overall: "degraded". Verify a CRITICAL alert is dispatched per SPEC_ORACLE_CACHE_ALERT.md FAILURE MODE 4.
- Σ.✓ Age-based cleanup — Create synthetic entries with mtime older than CACHE_STALENESS_THRESHOLD_DAYS (use touch -t to backdate). Run cleanup manually or wait for next interval. Verify: (a) cleanup log entry written before files deleted; (b) old entries removed; (c) entries newer than threshold preserved.
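- Σ.✓ Health endpoint extension — GET http://68.183.206.103:8889/health returns JSON with a cache block containing all seven fields defined in OUTPUTS section 2 (verdicts_dir, verdicts_count, dir_bytes, oldest_entry_age_days, newest_entry_age_days, disk_free_mb, status). Validate schema with a JSON Schema check.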
- Σ.✓ Entry count thresholds — Populate ~/oracle_verdicts/ with CACHE_ENTRY_COUNT_WARN + 1 synthetic entries. Verify next health probe logs overall: "degraded" (not "failed") and dispatches WARN to CREW_CHANNEL. Populate to CACHE_ENTRY_COUNT_CRITICAL + 1 entries. Verify overall: "failed" and ERROR alert. [GAP — neither threshold has been calibrated against production usage patterns because oracle_verdicts/ is empty; defaults are provisional]
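Several of the checks above need synthetic entries with a controlled age. A small helper such as the following could stand in for touch -t when the tests are scripted in Python; `write_synthetic_entry` is a hypothetical name and the field values are placeholders, since this spec defines only the required top-level keys.

```python
import json
import os
import time


def write_synthetic_entry(directory, name, age_days=0.0):
    """Create a well-formed test entry whose mtime is age_days in the past."""
    path = os.path.join(directory, name)
    payload = {
        "tier": "synthetic",       # placeholder values for testing only
        "query": "synthetic",
        "verdict": "synthetic",
        "cached_at": "synthetic",
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    backdated = time.time() - age_days * 86400
    # Set both atime and mtime, as touch -t does.
    os.utime(path, (backdated, backdated))
    return path
```

Run it against a scratch directory rather than the production ~/oracle_verdicts/, so that synthetic entries never mix with customer verdicts.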
FAILURE MODES
- Σ.⊠ oracle_toll service crashed — oracle-toll.service has exited (see oracle_toll.log — ERROR: address already in use observed 2026-04-16, indicating a restart storm). Health probe returns ConnectionRefusedError. Monitor detects service unreachable, increments consecutive-fail counter, dispatches CRITICAL after threshold. Meanwhile, verdictCache.ts falls back to Gemini regeneration (non-deterministic verdict risk). Mitigation: systemd Restart=always with RestartSec=20 will revive the service; health monitor tracks recovery and broadcasts when service returns.
- Σ.⊠ oracle_verdicts/ disk full — Filesystem mount hosting ~/oracle_verdicts/ reaches capacity. store_cached_verdict() raises OSError: [Errno 28] No space left on device and oracle_toll returns HTTP 500. Current behavior: cacheVerdict() in verdictCache.ts silently swallows the 500. Target behavior: health monitor's disk-floor check (INVARIANT-2) detects low disk pre-emptively and fires cleanup before exhaustion. If disk fills faster than check interval, HTTP 500 from oracle_toll triggers CRITICAL alert via cache write alerting path.
- Σ.⊠ oracle_verdicts/ directory missing — Directory deleted manually or mount point changed. oracle_toll.py calls VERDICTS_DIR.mkdir(exist_ok=True) at startup, so the directory is recreated on next service restart. However, if the mount point itself is gone, mkdir may create the directory on the root filesystem (masking the mount failure). Health monitor MUST verify the directory is on the expected filesystem (by checking device ID), not just that it exists. [GAP — device ID check not specified in current design; recommended addition]
- Σ.⊠ Health check loop dies — The health monitor process itself crashes. No more health records written; no more alerts. This is an unmonitored-monitor failure. Mitigation: health monitor MUST be run as a systemd service (oracle-cache-health.service) with Restart=always. NOUS or κ should verify it is running at boot per CLAUDE.md boot sequence. [GAP — oracle-cache-health.service does not yet exist]
- Σ.⊠ Integrity sample misses corrupt entries — Sampling N entries from a directory of M entries has a miss probability of approximately (1 - N/M)^corrupt_count. For large M and small corrupt_count, corruption may not be detected in any given sample. Full integrity scan is impractical at high entry counts. Mitigation: increase sample size proportionally when verdicts_count > CACHE_ENTRY_COUNT_WARN; run full scan during off-peak hours (Sunday 02:00 UTC). [GAP — scheduled full scan not yet specified]
- Σ.⊠ Health log unbounded growth — ~/oracle_toll_health.log accumulates indefinitely at rate 1 line / CACHE_HEALTH_CHECK_INTERVAL_S. At 60s interval, that is ~1440 lines/day, ~525,600 lines/year. Each line is ~300 bytes → ~150 MB/year. Not critical but needs logrotate. [GAP — logrotate config not specified; recommended: daily rotation, keep 30 days]
- Σ.⊠ Staleness threshold removes valid entries — A customer pays, the verdict is cached, the result page loads, and the customer returns 6 days later to recheck (still served from cache). If CACHE_STALENESS_THRESHOLD_DAYS = 7, their entry is deleted on day 7. A visit on day 8 gets a 404 from oracle_toll and falls back to Gemini regeneration, potentially producing a different verdict. This is the non-determinism risk (GAP-08 in parent spec). The health monitor's cleanup CANNOT be made safe for arbitrary staleness thresholds without a customer-visible TTL promise. [GAP — customer-facing cache TTL policy not established; until it is, CACHE_STALENESS_THRESHOLD_DAYS should default to 30, not 7]
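To make the sampling failure mode concrete, the sketch below compares the exact miss probability for sampling without replacement against the (1 - N/M)^corrupt_count approximation quoted above. The function names are illustrative, and the exact form assumes the sample is drawn uniformly at random.

```python
from math import comb


def miss_probability_exact(total, sample, corrupt):
    """P(no corrupt entry in a uniform sample drawn without replacement)."""
    # comb() returns 0 when the sample cannot avoid every corrupt entry.
    return comb(total - corrupt, sample) / comb(total, sample)


def miss_probability_approx(total, sample, corrupt):
    """The approximation used in this spec: (1 - N/M) ** corrupt_count."""
    return (1 - sample / total) ** corrupt


if __name__ == "__main__":
    for corrupt in (1, 5, 50):
        exact = miss_probability_exact(10_000, 10, corrupt)
        approx = miss_probability_approx(10_000, 10, corrupt)
        print(f"corrupt={corrupt}: exact={exact:.5f} approx={approx:.5f}")
```

With M = 10,000 entries, N = 10, and a single corrupt entry, both forms give a miss probability of 0.999 per check, which is why the text recommends a larger sample or a scheduled full scan at high entry counts.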
DEPENDENCIES
| Dependency | Role |
|------------|------|
| oracle_toll.py (port 8889) | Monitored service; provides /health and /cache/ endpoints |
| ~/oracle_verdicts/ | Monitored directory; contains verdict files |
| Filesystem mount on /home/nous/ | Disk space source for floor check |
| ~/oracle_toll_health.log | Health log output (created by monitor) |
| ~/CREW_CHANNEL | Alert destination |
| ~/ALERT.log | CRITICAL alert destination + κ boot pickup |
| SPEC_ORACLE_CACHE_ALERT.md | Alert dispatch spec — health monitor triggers it |
DEPENDENTS
| Dependent | How it depends |
|-----------|---------------|
| SPEC_ORACLE_VERDICT_PIPELINE.md | Health monitoring closes GAP-07 from that document |
| SPEC_ORACLE_CACHE_ALERT.md | Health monitor is primary source of CRITICAL alerts; both specs are companion documents |
| Oracle Verdict Pipeline (operational) | Health status determines whether cache is safe to use or fallback should be pre-emptively applied |
| C.L.O.D. boot sequence (CLAUDE.md) | Boot sequence should include tail ~/oracle_toll_health.log check |
GAPS IDENTIFIED DURING SPECIFICATION
| Gap ID | Description | Impact |
|--------|-------------|--------|
| HEALTH-GAP-01 | /health endpoint in oracle_toll.py does not include cache-layer fields — must be extended | HIGH — external consumers (Northflank health checks, monitoring dashboards) have no cache visibility |
| HEALTH-GAP-02 | No oracle-cache-health.service systemd service exists — health monitor runs nowhere | CRITICAL — spec exists but no implementation; nothing is monitoring |
| HEALTH-GAP-03 | Configuration env vars (CACHE_HEALTH_CHECK_INTERVAL_S, etc.) not yet defined or documented in AETHER_PARAMETERS.env | MEDIUM — hardcoded defaults not operationally tunable |
| HEALTH-GAP-04 | No logrotate config for oracle_toll_health.log | LOW — log will grow unbounded; 150 MB/year is manageable short-term |
| HEALTH-GAP-05 | Recovery notification (service comes back up after CRITICAL) not specified — CREW_CHANNEL receives the alarm but not the all-clear | MEDIUM — ops team left hanging after an incident |
| HEALTH-GAP-06 | CACHE_STALENESS_THRESHOLD_DAYS default of 7 days risks cleaning entries customers expect to persist — needs a customer-facing TTL policy decision before cleanup can be safely enabled | HIGH — cleanup automation blocked until policy resolved |
| HEALTH-GAP-07 | GAP-07 from parent spec (empty oracle_verdicts/ in production) is only partially addressed — cross-referencing cache count against Stripe payment count requires payment log access not available to this monitor | MEDIUM — root cause of empty directory unresolved |
| HEALTH-GAP-08 | No scheduled full integrity scan — sampling misses low-frequency corruption | LOW — sampling catches most issues; full scan deferred |
REFERENCES
| File | Role |
|------|------|
| /home/nous/oracle_toll.py | Monitored service; /health, /cache/ endpoints |
| /home/nous/memories/SPEC_ORACLE_VERDICT_PIPELINE.md | Parent spec; GAP-07 closes here |
| /home/nous/memories/SPEC_ORACLE_CACHE_ALERT.md | Companion spec — alerting contract for threshold breaches |
| /etc/systemd/system/oracle-toll.service | Service definition; Restart=always is the liveness backstop |
| /home/nous/oracle_toll.log | Service operational log (port-already-in-use errors observed 2026-04-16) |
| /home/nous/oracle_verdicts/ | Cache storage directory (currently empty) |
Φζ.⊤.
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042