Oracle Cache Alert

SPEC_ORACLE_CACHE_ALERT.md · 2026-04-20

SPEC_ORACLE_CACHE_ALERT — Oracle Cache Failure Alerting

Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16


PURPOSE

Defines the alerting behavior when the oracle_toll cache service (port 8889, http://68.183.206.103:8889) is unreachable, returns unexpected errors, or delivers a corrupted cache entry. The cache is a revenue-critical dependency of the Oracle Verdict Pipeline: when it silently fails, a paying customer's verdict may be regenerated non-deterministically, differing from the verdict already delivered by email (GAP-08 in SPEC_ORACLE_VERDICT_PIPELINE.md). Silent failure (current state in verdictCache.ts) is unacceptable operationally. This spec defines the alerting contract that replaces silence.

The alert channel is CREW_CHANNEL. All severities route there. Critical severity also writes to ~/ALERT.log for κ boot pickup.


INPUTS

1 — Cache write attempt (cacheVerdict)

Triggered when the Stripe webhook completes Gemini generation and calls POST /cache/{session_id} on oracle_toll. Inputs:

2 — Cache read attempt (getCachedVerdict)

Triggered when the result page backend calls GET /cache/{session_id} on oracle_toll. Inputs:

3 — Health probe (periodic — see SPEC_ORACLE_CACHE_HEALTH.md)

Inputs provided by the health monitor: last-check timestamp, response time, HTTP status.

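4 — Retry budget configuration

Three parameters bound the retry sequence: CACHE_RETRY_COUNT, CACHE_RETRY_DELAY_MS, and CACHE_CONNECT_TIMEOUT_MS. They are not yet exposed as environment variables (see ALERT-GAP-04).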


OUTPUTS

Alert payload (written to CREW_CHANNEL and/or ALERT.log)

All alerts carry:


[ORACLE-CACHE-ALERT] {SEVERITY} | {ISO_TIMESTAMP} | {OPERATION} | {SESSION_ID_PREFIX} | {ERROR_DETAIL}
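A minimal sketch of how the payload line above could be assembled. The `CacheAlert` type, the `formatAlert` name, and the operation labels are hypothetical and do not come from verdictCache.ts; the 12-character prefix follows the session ID invariant stated later in this spec.

```typescript
// Hypothetical types and helper; names are illustrative, not taken from verdictCache.ts.
type Severity = "WARN" | "ERROR" | "CRITICAL";

interface CacheAlert {
  severity: Severity;
  timestamp: Date;
  operation: string; // assumed labels such as "cacheVerdict" or "getCachedVerdict"
  sessionId: string; // full Stripe session ID; truncated before output
  errorDetail: string;
}

const SESSION_ID_PREFIX_LENGTH = 12;

function formatAlert(alert: CacheAlert): string {
  // Only the first 12 characters of the session ID ever reach the output.
  const prefix = alert.sessionId.slice(0, SESSION_ID_PREFIX_LENGTH);
  // Collapse whitespace so one alert is always exactly one line.
  const detail = alert.errorDetail.replace(/\s+/g, " ").trim();
  return [
    `[ORACLE-CACHE-ALERT] ${alert.severity}`,
    alert.timestamp.toISOString(),
    alert.operation,
    prefix,
    detail,
  ].join(" | ");
}
```

Collapsing whitespace in the error detail is a design choice of this sketch, made so that line-oriented readers of the channel file see one alert per line.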

Fields:

Severity Levels

| Level | Trigger condition | Routing |
|-------|-------------------|---------|
| WARN | Single cache write failure, retry succeeded before threshold | CREW_CHANNEL only |
| ERROR | Cache write failed after all retries; or cache read returned unexpected non-404/non-200 status | CREW_CHANNEL |
| CRITICAL | oracle_toll service unreachable (connection refused or DNS failure) for ≥ 2 consecutive health probes; or corrupted entry detected on read | CREW_CHANNEL + ALERT.log |
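The routing column can be sketched as a small dispatcher. The `Sink` type and the `dispatchAlert` name are assumptions made for illustration; a real implementation would presumably append to ~/CREW_CHANNEL and ~/ALERT.log, but the sinks are injected here so the routing rule can be exercised without touching the filesystem.

```typescript
// Hypothetical dispatcher; sink functions stand in for file appends.
type Severity = "WARN" | "ERROR" | "CRITICAL";
type Sink = (line: string) => void;

function dispatchAlert(
  severity: Severity,
  line: string,
  crewChannel: Sink,
  alertLog: Sink,
): void {
  // Every severity is routed to CREW_CHANNEL.
  crewChannel(line);
  // Only CRITICAL is additionally written to ALERT.log.
  if (severity === "CRITICAL") {
    alertLog(line);
  }
}
```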

Retry state (internal, not persisted)

Before alerting, the alerting layer MUST execute:

  1. Wait CACHE_RETRY_DELAY_MS
  2. Retry the operation
  3. Repeat up to CACHE_RETRY_COUNT times
  4. Alert only if all retries fail

A successful retry on attempt N < CACHE_RETRY_COUNT produces a WARN (transient blip recorded, no action required). No WARN is issued if the first attempt succeeds.
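The retry sequence above might be wrapped as follows. This is a sketch under two stated assumptions: CACHE_RETRY_COUNT is read as the number of retries after the first attempt, and a success on any retry (including the last) is reported as WARN-worthy. The `withRetry` and `RetryOutcome` names are hypothetical.

```typescript
// Hypothetical retry wrapper; the caller decides which alert to emit from the outcome.
type RetryOutcome<T> =
  | { ok: true; value: T; attempts: number; warn: boolean }
  | { ok: false; error: unknown; attempts: number };

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  retryCount: number, // assumed to mean retries after the first attempt
  retryDelayMs: number,
): Promise<RetryOutcome<T>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= retryCount + 1; attempt++) {
    // Wait before every retry, never before the first attempt.
    if (attempt > 1) {
      await sleep(retryDelayMs);
    }
    try {
      const value = await operation();
      // warn is true only when at least one earlier attempt failed.
      return { ok: true, value, attempts: attempt, warn: attempt > 1 };
    } catch (error) {
      lastError = error;
    }
  }
  // All attempts failed: the caller should now emit ERROR or CRITICAL.
  return { ok: false, error: lastError, attempts: retryCount + 1 };
}
```

A caller would emit nothing when `warn` is false, a WARN when it is true, and an ERROR or CRITICAL only when `ok` is false.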


INVARIANTS

  1. No silent swallow — verdictCache.ts MUST NOT contain a bare catch {} or catch { return null } with no downstream notification. Every catch block that fires on a cache operation MUST enqueue an alert. The current // Non-fatal comment in verdictCache.ts is a specification violation.
  2. Retry before alert — An alert is NEVER issued on first failure. The retry sequence (CACHE_RETRY_COUNT attempts, CACHE_RETRY_DELAY_MS spacing) MUST complete before any alert fires. This prevents CREW_CHANNEL flooding from transient network hiccups.
  3. Alert does not block verdict delivery — The alerting path is fire-and-forget async. The calling context (webhook IIFE, verdict route) is NEVER blocked waiting for alert delivery. Verdict caching is already non-blocking; alerting inherits that non-blocking contract.
  4. Session ID is never fully exposed in alert output — Alerts log only the first 12 characters of session_id. Full session IDs are Stripe-issued identifiers and treated as customer-sensitive.
  5. CRITICAL severity reaches ALERT.log — κ reads ~/ALERT.log at boot. Any CRITICAL oracle cache alert MUST append to ~/ALERT.log so the next C.L.O.D. session sees it immediately.
  6. Alert deduplication window — If the same OPERATION + SESSION_ID_PREFIX combination produces a second alert within 60 seconds, it is suppressed (logged to file only, not re-broadcast to CREW_CHANNEL). This prevents webhook retry storms from flooding the channel. [GAP — deduplication window duration is a design choice; 60s is a recommended default, not yet implemented]
  7. Corruption detection is mandatory — A cache read that returns HTTP 200 but fails JSON parsing MUST produce a CRITICAL alert, not a silent null return. Corrupted entries indicate filesystem or write-path failure, not transient network issues.
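The deduplication window invariant could be sketched as an in-memory map keyed on OPERATION plus SESSION_ID_PREFIX. The `AlertDeduplicator` class is hypothetical, and the clock is passed in explicitly so the window can be tested without waiting. The window is measured from the last broadcast alert, which is one possible reading of the invariant.

```typescript
// Hypothetical in-memory deduplicator; state is lost on process restart.
class AlertDeduplicator {
  private readonly windowMs: number;
  private readonly lastBroadcast = new Map<string, number>();

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs;
  }

  // Returns true if the alert should be broadcast to CREW_CHANNEL,
  // false if it is a duplicate inside the window (log to file only).
  shouldBroadcast(operation: string, sessionIdPrefix: string, nowMs: number): boolean {
    const key = `${operation}|${sessionIdPrefix}`;
    const previous = this.lastBroadcast.get(key);
    if (previous !== undefined && nowMs - previous < this.windowMs) {
      return false;
    }
    this.lastBroadcast.set(key, nowMs);
    return true;
  }
}
```

In this sketch the map is never pruned; a long-running process would need to evict entries older than the window.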

VERIFICATION CRITERIA

Σ.✓ conditions — alerting is operating correctly when:

  1. Σ.✓ Write failure alert fires — When oracle_toll is stopped (systemctl stop oracle-toll.service) and cacheVerdict() is called, after exhausting CACHE_RETRY_COUNT retries, an ERROR or CRITICAL entry appears in ~/CREW_CHANNEL within (CACHE_RETRY_COUNT × CACHE_RETRY_DELAY_MS) + 5s. No alert appears before retries are exhausted.
  2. Σ.✓ Retry suppresses alert — When oracle_toll is artificially delayed (e.g., via iptables rule blocking for 400ms) and CACHE_RETRY_DELAY_MS=500, the first attempt fails, the retry succeeds, and a WARN (not ERROR) appears in CREW_CHANNEL. Verdict is still cached successfully.
  3. Σ.✓ CRITICAL reaches ALERT.log — When oracle_toll is unreachable for 2 consecutive health probes (simulate with systemctl stop oracle-toll), ~/ALERT.log contains the CRITICAL entry within one health check interval (see SPEC_ORACLE_CACHE_HEALTH.md for interval definition).
  4. Σ.✓ Corrupt entry detection — Manually write a non-JSON file to ~/oracle_verdicts/{test_id}.json and call getCachedVerdict(test_id). Verify CRITICAL alert fires and function returns null (does not throw). [GAP — current getCachedVerdict in verdictCache.ts catches JSON parse errors and returns null silently; needs an added corruption-detection alert path]
  5. Σ.✓ Alert does not block — Measure end-to-end latency of cacheVerdict() with oracle_toll down. Must complete (fail-fast) within CACHE_CONNECT_TIMEOUT_MS × CACHE_RETRY_COUNT + CACHE_RETRY_DELAY_MS × (CACHE_RETRY_COUNT - 1) + 500ms. Verdict route must still return HTTP 200 to the customer browser.
  6. Σ.✓ Session ID truncation — Review all alert entries in CREW_CHANNEL. Zero entries contain a full Stripe session ID (cs_live_ + 24+ chars). All entries show ≤ 12 character prefix only.
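The two time bounds in the criteria above can be worked through with concrete numbers. The functions below transcribe the formulas as written; the values CACHE_CONNECT_TIMEOUT_MS=2000 and CACHE_RETRY_COUNT=3 are hypothetical placeholders (the spec does not fix them), while CACHE_RETRY_DELAY_MS=500 is the value used in the retry-suppression criterion. Note that the fail-fast formula, as written, counts CACHE_RETRY_COUNT timeouts and one fewer delay.

```typescript
// Fail-fast bound for cacheVerdict() with oracle_toll down, transcribed from the criterion.
function failFastBoundMs(connectTimeoutMs: number, retryCount: number, retryDelayMs: number): number {
  const SLACK_MS = 500;
  return connectTimeoutMs * retryCount + retryDelayMs * (retryCount - 1) + SLACK_MS;
}

// Deadline for the write-failure alert to appear in CREW_CHANNEL.
function alertDeadlineMs(retryCount: number, retryDelayMs: number): number {
  const GRACE_MS = 5000;
  return retryCount * retryDelayMs + GRACE_MS;
}

// Hypothetical configuration: timeout 2000 ms, 3 retries, 500 ms delay.
const exampleFailFast = failFastBoundMs(2000, 3, 500); // 2000*3 + 500*2 + 500 = 7500
const exampleDeadline = alertDeadlineMs(3, 500);       // 3*500 + 5000 = 6500
```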

FAILURE MODES

  1. Σ.⊠ Cache service connection refused — oracle_toll.service is stopped or crashed. fetch() in verdictCache.ts throws ECONNREFUSED. Current behavior: silently swallowed. Target behavior: retry sequence fires; after exhausting retries, CRITICAL alert dispatched. Verdict delivery still proceeds via Gemini regeneration path (INVARIANT-3).
  2. Σ.⊠ Cache service connection timeout — oracle_toll is running but overloaded or blocked by network policy. fetch() hangs because the current code sets no AbortController timeout on the call. Current behavior: request hangs until the Node.js socket timeout (~2min), then the error is silently swallowed. Target behavior: an AbortController terminates the request after CACHE_CONNECT_TIMEOUT_MS; retry sequence proceeds; ERROR alert fires after retries. [GAP — no AbortController in current verdictCache.ts; this is the most dangerous failure mode because it blocks the webhook IIFE]
  3. Σ.⊠ Cache write accepted but unverified — oracle_toll returns HTTP 201, but the file write inside store_cached_verdict() fails silently (e.g., disk full). The next GET /cache/{session_id} returns 404. Current behavior: verdict route calls Gemini again (regeneration path). No alert. Target behavior: cache miss on a session that was recently written triggers an ERROR alert with note "write-confirm mismatch" on the read path. [GAP — detecting write-confirm mismatch requires tracking recently-written session IDs, which is not currently implemented]
  4. Σ.⊠ Corrupted cache entry — oracle_verdicts/{session_id}.json contains non-JSON content (disk corruption, partial write). get_cached_verdict() in oracle_toll.py raises and returns HTTP 500. getCachedVerdict() in verdictCache.ts currently returns null silently. Target behavior: HTTP 500 from oracle_toll triggers CRITICAL alert (not the same as 404). The corrupt file must be quarantined (renamed to {id}.json.corrupt) so subsequent retries do not re-encounter it.
  5. Σ.⊠ Alert channel write failure — If ~/CREW_CHANNEL is not writable (permissions issue, filesystem full), the alert itself fails silently. Target behavior: fallback to ~/ALERT.log direct write. If ALERT.log also fails, console.error as last resort. Never suppress the alert entirely. [GAP — fallback chain not yet specified; needs implementation design]
  6. Σ.⊠ Alert storm — Webhook retry logic in Stripe causes the same session_id to be processed multiple times within a 60-second window. Without deduplication, CREW_CHANNEL receives N identical alerts. Current behavior: no deduplication (no alerts at all currently). Target behavior: deduplication window per INVARIANT-6 suppresses duplicates.
  7. Σ.⊠ ORACLE_TOLL_URL misconfigured — Environment variable is set to an incorrect host/port. Every cache operation fails with DNS error or connection refused. Indistinguishable from service-down failure mode. Target behavior: at service boot, perform a single health probe to ORACLE_TOLL_URL/health; if it fails, log CRITICAL immediately so the misconfiguration is visible before any customer payment arrives. [GAP — no boot-time connectivity check in verdictCache.ts or webhook route currently]
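For the connection-timeout failure mode, the target behavior relies on an AbortController bounding each fetch call. The following is a sketch only: `fetchWithTimeout` and the `FetchLike` type are hypothetical, and the fetch implementation is injected (an assumption of this example) so the timeout path can be exercised without a live oracle_toll.

```typescript
// Minimal structural type covering what this sketch needs from fetch.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ status: number }>;

async function fetchWithTimeout(
  url: string,
  timeoutMs: number, // would be CACHE_CONNECT_TIMEOUT_MS in the spec's terms
  fetchImpl: FetchLike,
): Promise<{ status: number }> {
  const controller = new AbortController();
  // Abort the in-flight request once the timeout elapses.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}
```

In production the global `fetch` could be passed as `fetchImpl`; an aborted request rejects, and that rejection is what the retry sequence would count as a failed attempt.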

DEPENDENCIES

| Dependency | Role |
|------------|------|
| oracle_toll.py (port 8889) | Cache read/write target |
| verdictCache.ts | Client-side cache interface — alert logic lives here |
| ~/CREW_CHANNEL | Primary alert output channel |
| ~/ALERT.log | CRITICAL alert overflow + κ boot pickup |
| Node.js fetch API | Transport for cache calls (needs AbortController for timeout) |


DEPENDENTS

| Dependent | How it depends |
|-----------|---------------|
| SPEC_ORACLE_VERDICT_PIPELINE.md | Alert spec closes GAP-06 from that document |
| SPEC_ORACLE_CACHE_HEALTH.md | Health monitor feeds CRITICAL triggers into this alert spec |
| Oracle Webhook route (/api/webhook/route.ts) | Must call alerting-aware cacheVerdict() |
| Oracle Verdict route (/api/verdict/route.ts) | Must call alerting-aware getCachedVerdict() |


GAPS IDENTIFIED DURING SPECIFICATION

| Gap ID | Description | Impact |
|--------|-------------|--------|
| ALERT-GAP-01 | No AbortController timeout on fetch() in verdictCache.ts — connection hangs indefinitely when oracle_toll is slow | HIGH — can block webhook IIFE for minutes |
| ALERT-GAP-02 | No retry logic in cacheVerdict() or getCachedVerdict() — any failure is immediately final | HIGH — transient network issues cause unnecessary Gemini regeneration |
| ALERT-GAP-03 | No alert dispatch implementation exists — catch {} silently swallows all errors | CRITICAL — operations blind to cache outages |
| ALERT-GAP-04 | CACHE_RETRY_COUNT, CACHE_RETRY_DELAY_MS, CACHE_CONNECT_TIMEOUT_MS are not env vars — defaults are not configurable without code change | MEDIUM — reduces operational tuning ability |
| ALERT-GAP-05 | No deduplication window implementation — alert storm possible on Stripe webhook retries | MEDIUM — CREW_CHANNEL noise risk |
| ALERT-GAP-06 | No boot-time connectivity probe for ORACLE_TOLL_URL — misconfigured URL undetectable until first customer payment fails | HIGH — invisible misconfiguration |
| ALERT-GAP-07 | Write-confirm mismatch detection requires session ID tracking not currently implemented | LOW — Gemini fallback covers the customer path; audit trail gap only |
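One way ALERT-GAP-04 might be closed is by reading the three retry parameters from the environment with a fallback per value. This is a sketch: `loadRetryBudget` and `RetryBudget` are hypothetical names, and the default values shown are placeholders for illustration only, since this spec does not fix them. The environment is passed in as a plain record; in Node.js a caller would presumably pass `process.env`.

```typescript
interface RetryBudget {
  retryCount: number;
  retryDelayMs: number;
  connectTimeoutMs: number;
}

// Placeholder defaults for illustration only; the spec does not define these values.
const ILLUSTRATIVE_DEFAULTS: RetryBudget = {
  retryCount: 3,
  retryDelayMs: 500,
  connectTimeoutMs: 2000,
};

// Accept only positive integers; anything else falls back to the default.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) {
    return fallback;
  }
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function loadRetryBudget(env: Record<string, string | undefined>): RetryBudget {
  return {
    retryCount: readPositiveInt(env["CACHE_RETRY_COUNT"], ILLUSTRATIVE_DEFAULTS.retryCount),
    retryDelayMs: readPositiveInt(env["CACHE_RETRY_DELAY_MS"], ILLUSTRATIVE_DEFAULTS.retryDelayMs),
    connectTimeoutMs: readPositiveInt(env["CACHE_CONNECT_TIMEOUT_MS"], ILLUSTRATIVE_DEFAULTS.connectTimeoutMs),
  };
}
```

Falling back silently on a malformed value is a choice of this sketch; an implementation could equally treat a malformed value as a boot-time error.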


REFERENCES

| File | Role |
|------|------|
| /home/nous/Aether/app/app/lib/verdictCache.ts | Implementation target for retry + alert logic |
| /home/nous/oracle_toll.py | Cache service; /cache/{id} endpoints |
| /home/nous/memories/SPEC_ORACLE_VERDICT_PIPELINE.md | Parent spec; GAP-06 closes here |
| /home/nous/memories/SPEC_ORACLE_CACHE_HEALTH.md | Companion spec — health monitor feeds CRITICAL triggers |
| /home/nous/CREW_CHANNEL | Alert destination (primary) |
| /home/nous/ALERT.log | Alert destination (CRITICAL overflow + κ boot) |


Φζ.⊤.


Jeremy Zlabis

Chronogeometer · Visionary · Disruptor · Chief

42 Sisters AI · East York, Toronto

🍁 Φ 0.042