Oracle Cache Alert
SPEC_ORACLE_CACHE_ALERT — Oracle Cache Failure Alerting
Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16
PURPOSE
Defines the alerting behavior when the oracle_toll cache service (port 8889, http://68.183.206.103:8889) is unreachable, returns unexpected errors, or delivers a corrupted cache entry. The cache is a revenue-critical dependency of the Oracle Verdict Pipeline: when it silently fails, a paying customer's verdict may be regenerated non-deterministically, differing from the verdict already delivered by email (GAP-08 in SPEC_ORACLE_VERDICT_PIPELINE.md). Silent failure (current state in verdictCache.ts) is unacceptable operationally. This spec defines the alerting contract that replaces silence.
The alert channel is CREW_CHANNEL. All severities route there. Critical severity also writes to ~/ALERT.log for κ boot pickup.
INPUTS
1 — Cache write attempt (cacheVerdict)
Triggered when the Stripe webhook completes Gemini generation and calls POST /cache/{session_id} on oracle_toll. Inputs:
- `ORACLE_TOLL_URL` (env var, default `http://68.183.206.103:8889`)
- `session_id` — Stripe checkout session ID (alphanumeric plus `-` and `_`)
- `payload` — JSON verdict body with `tier`, `query`, `verdict`, `cached_at`
2 — Cache read attempt (getCachedVerdict)
Triggered when the result page backend calls GET /cache/{session_id} on oracle_toll. Inputs:
- `ORACLE_TOLL_URL`
- `session_id`
3 — Health probe (periodic — see SPEC_ORACLE_CACHE_HEALTH.md)
Inputs provided by the health monitor: last-check timestamp, response time, HTTP status.
4 — Retry budget configuration
- `CACHE_RETRY_COUNT` — number of retries before alerting (default: 3) [GAP — not yet a configurable env var; hardcoded default recommended]
- `CACHE_RETRY_DELAY_MS` — milliseconds between retries (default: 500ms) [GAP — not yet configurable]
- `CACHE_CONNECT_TIMEOUT_MS` — connection timeout per attempt (default: 2000ms) [GAP — not yet configurable; current code has no explicit timeout on `fetch()`]
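A minimal sketch of how the three retry-budget values might be read once they become environment variables. The `loadRetryConfig` helper and the `RetryConfig` type are hypothetical names, not part of the current verdictCache.ts; the defaults are the ones listed above. The environment is passed in as a plain record so the sketch stays testable; in Node it would typically be called with `process.env`.

```typescript
// Hypothetical shape for the retry budget; field names are illustrative only.
interface RetryConfig {
  retryCount: number;       // CACHE_RETRY_COUNT
  retryDelayMs: number;     // CACHE_RETRY_DELAY_MS
  connectTimeoutMs: number; // CACHE_CONNECT_TIMEOUT_MS
}

// Defaults taken from the spec text above.
const DEFAULTS: RetryConfig = { retryCount: 3, retryDelayMs: 500, connectTimeoutMs: 2000 };

// Parse a non-negative integer; fall back to the default on missing or invalid input.
function parseNonNegativeInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n >= 0 ? n : fallback;
}

function loadRetryConfig(env: Record<string, string | undefined>): RetryConfig {
  return {
    retryCount: parseNonNegativeInt(env["CACHE_RETRY_COUNT"], DEFAULTS.retryCount),
    retryDelayMs: parseNonNegativeInt(env["CACHE_RETRY_DELAY_MS"], DEFAULTS.retryDelayMs),
    connectTimeoutMs: parseNonNegativeInt(env["CACHE_CONNECT_TIMEOUT_MS"], DEFAULTS.connectTimeoutMs),
  };
}
```

Falling back to the default on an unparseable value (rather than throwing) is an assumption made here so that a typo in configuration cannot take down the webhook route; a stricter deployment might prefer to fail at boot.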
OUTPUTS
Alert payload (written to CREW_CHANNEL and/or ALERT.log)
All alerts carry:
[ORACLE-CACHE-ALERT] {SEVERITY} | {ISO_TIMESTAMP} | {OPERATION} | {SESSION_ID_PREFIX} | {ERROR_DETAIL}
Fields:
- `SEVERITY` — one of: `WARN`, `ERROR`, `CRITICAL` (see Severity Levels)
- `ISO_TIMESTAMP` — UTC ISO 8601
- `OPERATION` — `CACHE_WRITE`, `CACHE_READ`, or `HEALTH_PROBE`
- `SESSION_ID_PREFIX` — first 12 chars of session_id (never full ID — avoid logging customer-identifiable data beyond what is necessary)
- `ERROR_DETAIL` — human-readable failure description; one of the canonical strings defined in FAILURE MODES
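The alert line format can be sketched as a single formatting function. `formatAlert` is a hypothetical helper name, and the `"connection refused"` detail string in the usage example is a placeholder rather than one of the canonical strings. How `HEALTH_PROBE` alerts fill the session field is not specified above; this sketch simply truncates whatever string it is given.

```typescript
type Severity = "WARN" | "ERROR" | "CRITICAL";
type Operation = "CACHE_WRITE" | "CACHE_READ" | "HEALTH_PROBE";

// Only the first 12 characters of a session ID may appear in an alert.
const SESSION_PREFIX_LENGTH = 12;

function formatAlert(
  severity: Severity,
  operation: Operation,
  sessionId: string,
  errorDetail: string,
  now: Date = new Date(),
): string {
  const prefix = sessionId.slice(0, SESSION_PREFIX_LENGTH);
  // Date.prototype.toISOString() always renders in UTC.
  return `[ORACLE-CACHE-ALERT] ${severity} | ${now.toISOString()} | ${operation} | ${prefix} | ${errorDetail}`;
}

// Usage with a made-up session ID:
const exampleLine = formatAlert(
  "ERROR",
  "CACHE_WRITE",
  "cs_test_a1b2c3d4e5f6g7h8",
  "connection refused",
  new Date("2026-04-16T12:00:00.000Z"),
);
console.log(exampleLine);
// [ORACLE-CACHE-ALERT] ERROR | 2026-04-16T12:00:00.000Z | CACHE_WRITE | cs_test_a1b2 | connection refused
```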
Severity Levels
| Level | Trigger condition | Routing |
|-------|-------------------|---------|
| WARN | Single cache write failure, retry succeeded before threshold | CREW_CHANNEL only |
| ERROR | Cache write failed after all retries; or cache read returned unexpected non-404/non-200 status | CREW_CHANNEL |
| CRITICAL | oracle_toll service unreachable (connection refused or DNS failure) for ≥ 2 consecutive health probes; or corrupted entry detected on read | CREW_CHANNEL + ALERT.log |
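The routing column of the table reduces to a small lookup. The sketch below is one way to encode it; `Destination` and `destinationsFor` are illustrative names, and `ALERT_LOG` stands for `~/ALERT.log`.

```typescript
type Severity = "WARN" | "ERROR" | "CRITICAL";
type Destination = "CREW_CHANNEL" | "ALERT_LOG";

// Every severity goes to CREW_CHANNEL; only CRITICAL is also appended to ALERT.log.
function destinationsFor(severity: Severity): Destination[] {
  return severity === "CRITICAL" ? ["CREW_CHANNEL", "ALERT_LOG"] : ["CREW_CHANNEL"];
}
```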
Retry state (internal, not persisted)
Before alerting, the alerting layer MUST execute:
- Wait `CACHE_RETRY_DELAY_MS`
- Retry the operation
- Repeat up to `CACHE_RETRY_COUNT` times
- Alert only if all retries fail
A successful retry on attempt N ≤ CACHE_RETRY_COUNT produces a WARN (transient blip recorded, no action required), since an alert above WARN fires only when all retries fail. No WARN is issued if the first attempt succeeds.
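The retry sequence above can be sketched as a generic wrapper. This is an illustration under one stated assumption: `CACHE_RETRY_COUNT` is read as the number of retries after the initial attempt, so the operation runs at most `1 + retryCount` times. `withRetry` and `RetryOutcome` are hypothetical names; the wrapper does not dispatch alerts itself, it only reports what happened so a caller can decide.

```typescript
type RetryOutcome<T> =
  | { ok: true; value: T; retriesUsed: number; severity: "WARN" | null }
  | { ok: false; error: unknown; retriesUsed: number };

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  retryCount = 3,     // assumed meaning: retries after the first attempt
  retryDelayMs = 500,
): Promise<RetryOutcome<T>> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    // Wait before every retry, never before the initial attempt.
    if (attempt > 0) await sleep(retryDelayMs);
    try {
      const value = await operation();
      // Success after at least one retry is a transient blip: WARN.
      // Success on the first attempt produces no alert at all.
      return { ok: true, value, retriesUsed: attempt, severity: attempt > 0 ? "WARN" : null };
    } catch (err) {
      lastError = err;
    }
  }
  // All retries failed: the caller raises ERROR or CRITICAL from here.
  return { ok: false, error: lastError, retriesUsed: retryCount };
}
```

A caller would wrap the existing cache call, for example `withRetry(() => cacheVerdict(sessionId, payload))`, and hand the outcome to the alerting path without awaiting alert delivery.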
INVARIANTS
- No silent swallow — `verdictCache.ts` MUST NOT contain a bare `catch {}` or `catch { return null }` with no downstream notification. Every catch block that fires on a cache operation MUST enqueue an alert. The current `// Non-fatal` comment in `verdictCache.ts` is a specification violation.
- Retry before alert — An alert is NEVER issued on first failure. The retry sequence (CACHE_RETRY_COUNT attempts, CACHE_RETRY_DELAY_MS spacing) MUST complete before any alert fires. This prevents CREW_CHANNEL flooding from transient network hiccups.
- Alert does not block verdict delivery — The alerting path is fire-and-forget async. The calling context (webhook IIFE, verdict route) is NEVER blocked waiting for alert delivery. Verdict caching is already non-blocking; alerting inherits that non-blocking contract.
- Session ID is never fully exposed in alert output — Alerts log only the first 12 characters of `session_id`. Full session IDs are Stripe-issued identifiers and treated as customer-sensitive.
- CRITICAL severity reaches ALERT.log — κ reads `~/ALERT.log` at boot. Any `CRITICAL` oracle cache alert MUST append to `~/ALERT.log` so the next C.L.O.D. session sees it immediately.
- Alert deduplication window — If the same `OPERATION + SESSION_ID_PREFIX` combination produces a second alert within 60 seconds, it is suppressed (logged to file only, not re-broadcast to CREW_CHANNEL). This prevents webhook retry storms from flooding the channel. [GAP — deduplication window duration is a design choice; 60s is a recommended default, not yet implemented]
- Corruption detection is mandatory — A cache read that returns HTTP 200 but fails JSON parsing MUST produce a `CRITICAL` alert, not a silent null return. Corrupted entries indicate filesystem or write-path failure, not transient network issues.
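The deduplication invariant can be illustrated with a small in-memory tracker keyed on operation plus session prefix. `AlertDeduplicator` is a hypothetical class, not existing code. The sketch measures the window from the last broadcast alert and never evicts old keys; both are simplifying assumptions a real implementation would need to revisit.

```typescript
class AlertDeduplicator {
  // Maps "OPERATION:SESSION_ID_PREFIX" to the time of the last broadcast, in ms.
  private readonly lastBroadcast = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs;
  }

  // Returns true when the alert should go to CREW_CHANNEL,
  // false when it falls inside the window and should be logged to file only.
  shouldBroadcast(operation: string, sessionIdPrefix: string, nowMs: number = Date.now()): boolean {
    const key = `${operation}:${sessionIdPrefix}`;
    const last = this.lastBroadcast.get(key);
    if (last !== undefined && nowMs - last < this.windowMs) return false;
    this.lastBroadcast.set(key, nowMs);
    return true;
  }
}

// Usage with explicit timestamps (milliseconds):
const dedup = new AlertDeduplicator();
console.log(dedup.shouldBroadcast("CACHE_WRITE", "cs_test_a1b2", 0));      // true
console.log(dedup.shouldBroadcast("CACHE_WRITE", "cs_test_a1b2", 30_000)); // false
console.log(dedup.shouldBroadcast("CACHE_READ", "cs_test_a1b2", 30_000));  // true
```

Because the state lives in process memory, it resets on restart and is not shared between server instances; whether that is acceptable depends on how the webhook route is deployed.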
VERIFICATION CRITERIA
Σ.✓ conditions — alerting is operating correctly when:
- Σ.✓ Write failure alert fires — When oracle_toll is stopped (`systemctl stop oracle-toll.service`) and `cacheVerdict()` is called, after exhausting `CACHE_RETRY_COUNT` retries, an `ERROR` or `CRITICAL` entry appears in `~/CREW_CHANNEL` within `(CACHE_RETRY_COUNT × CACHE_RETRY_DELAY_MS) + 5s`. No alert appears before retries are exhausted.
- Σ.✓ Retry suppresses alert — When oracle_toll is artificially delayed (e.g., via iptables rule blocking for 400ms) and `CACHE_RETRY_DELAY_MS=500`, the first attempt fails, the retry succeeds, and a `WARN` (not `ERROR`) appears in CREW_CHANNEL. Verdict is still cached successfully.
- Σ.✓ CRITICAL reaches ALERT.log — When oracle_toll is unreachable for 2 consecutive health probes (simulate with `systemctl stop oracle-toll`), `~/ALERT.log` contains the `CRITICAL` entry within one health check interval (see SPEC_ORACLE_CACHE_HEALTH.md for interval definition).
- Σ.✓ Corrupt entry detection — Manually write a non-JSON file to `~/oracle_verdicts/{test_id}.json` and call `getCachedVerdict(test_id)`. Verify `CRITICAL` alert fires and function returns `null` (does not throw). [GAP — current `getCachedVerdict` in verdictCache.ts catches JSON parse errors and returns null silently; needs an added corruption-detection alert path]
- Σ.✓ Alert does not block — Measure end-to-end latency of `cacheVerdict()` with oracle_toll down. Must complete (fail-fast) within `CACHE_CONNECT_TIMEOUT_MS × CACHE_RETRY_COUNT + CACHE_RETRY_DELAY_MS × (CACHE_RETRY_COUNT - 1) + 500ms`. Verdict route must still return HTTP 200 to the customer browser.
- Σ.✓ Session ID truncation — Review all alert entries in CREW_CHANNEL. Zero entries contain a full Stripe session ID (`cs_live_` + 24+ chars). All entries show ≤ 12 character prefix only.
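For test planning, the two time bounds in the criteria above (the write-failure alert deadline and the fail-fast latency limit) can be evaluated with the default budget. The function names below are illustrative; the formulas are transcribed as written in the criteria, without reinterpreting how retries are counted.

```typescript
interface RetryBudget {
  retryCount: number;       // CACHE_RETRY_COUNT
  retryDelayMs: number;     // CACHE_RETRY_DELAY_MS
  connectTimeoutMs: number; // CACHE_CONNECT_TIMEOUT_MS
}

// Deadline for the write-failure alert: (CACHE_RETRY_COUNT × CACHE_RETRY_DELAY_MS) + 5s.
function alertDeadlineMs(b: RetryBudget): number {
  return b.retryCount * b.retryDelayMs + 5000;
}

// Fail-fast limit for cacheVerdict() with oracle_toll down:
// CACHE_CONNECT_TIMEOUT_MS × CACHE_RETRY_COUNT
//   + CACHE_RETRY_DELAY_MS × (CACHE_RETRY_COUNT - 1) + 500ms.
function failFastLimitMs(b: RetryBudget): number {
  return b.connectTimeoutMs * b.retryCount + b.retryDelayMs * (b.retryCount - 1) + 500;
}

const defaults: RetryBudget = { retryCount: 3, retryDelayMs: 500, connectTimeoutMs: 2000 };
console.log(alertDeadlineMs(defaults)); // 6500  (3 × 500 + 5000)
console.log(failFastLimitMs(defaults)); // 7500  (2000 × 3 + 500 × 2 + 500)
```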
FAILURE MODES
- Σ.⊠ Cache service connection refused — oracle_toll.service is stopped or crashed. `fetch()` in `verdictCache.ts` throws `ECONNREFUSED`. Current behavior: silently swallowed. Target behavior: retry sequence fires; after exhausting retries, `CRITICAL` alert dispatched. Verdict delivery still proceeds via Gemini regeneration path (INVARIANT-3).
- Σ.⊠ Cache service connection timeout — oracle_toll is running but overloaded or blocked by network policy. `fetch()` hangs indefinitely — CRITICAL because current code has no `AbortController` timeout on the fetch call. Current behavior: request hangs until Node.js socket timeout (~2min), then silently swallowed. Target behavior: `CACHE_CONNECT_TIMEOUT_MS` AbortController terminates the request; retry sequence proceeds; `ERROR` alert fires after retries. [GAP — no AbortController in current verdictCache.ts; this is the most dangerous failure mode because it blocks the webhook IIFE]
- Σ.⊠ Cache write accepted but unverified — oracle_toll returns HTTP 201, but the file write inside `store_cached_verdict()` fails silently (e.g., disk full). The next `GET /cache/{session_id}` returns 404. Current behavior: verdict route calls Gemini again (regeneration path). No alert. Target behavior: cache miss on a session that was recently written triggers an `ERROR` alert with note "write-confirm mismatch" on the read path. [GAP — detecting write-confirm mismatch requires tracking recently-written session IDs, which is not currently implemented]
- Σ.⊠ Corrupted cache entry — `oracle_verdicts/{session_id}.json` contains non-JSON content (disk corruption, partial write). `get_cached_verdict()` in oracle_toll.py raises and returns HTTP 500. `getCachedVerdict()` in verdictCache.ts currently returns `null` silently. Target behavior: HTTP 500 from oracle_toll triggers `CRITICAL` alert (not the same as 404). The corrupt file must be quarantined (renamed to `{id}.json.corrupt`) so subsequent retries do not re-encounter it.
- Σ.⊠ Alert channel write failure — If `~/CREW_CHANNEL` is not writable (permissions issue, filesystem full), the alert itself fails silently. Target behavior: fallback to `~/ALERT.log` direct write. If `ALERT.log` also fails, `console.error` as last resort. Never suppress the alert entirely. [GAP — fallback chain not yet specified; needs implementation design]
- Σ.⊠ Alert storm — Webhook retry logic in Stripe causes the same `session_id` to be processed multiple times within a 60-second window. Without deduplication, CREW_CHANNEL receives N identical alerts. Current behavior: no deduplication (no alerts at all currently). Target behavior: deduplication window per INVARIANT-6 suppresses duplicates.
- Σ.⊠ ORACLE_TOLL_URL misconfigured — Environment variable is set to an incorrect host/port. Every cache operation fails with DNS error or connection refused. Indistinguishable from service-down failure mode. Target behavior: at service boot, perform a single health probe to `ORACLE_TOLL_URL/health`; if it fails, log `CRITICAL` immediately so the misconfiguration is visible before any customer payment arrives. [GAP — no boot-time connectivity check in verdictCache.ts or webhook route currently]
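For the connection-timeout failure mode, the AbortController pattern might look like the sketch below. `withTimeout` and `readCacheEntry` are hypothetical helpers, not functions that exist in verdictCache.ts today. Note that the timer here bounds the whole wait until `fetch()` resolves (that is, until response headers arrive), which is slightly broader than a pure connect timeout.

```typescript
// Run an abortable operation and abort it if it has not settled within timeoutMs.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    // Always clear the timer so it cannot fire after the operation settles.
    clearTimeout(timer);
  }
}

// Illustrative use against the read endpoint; 2000 ms mirrors the
// CACHE_CONNECT_TIMEOUT_MS default. fetch() rejects once the signal aborts.
function readCacheEntry(baseUrl: string, sessionId: string, timeoutMs = 2000): Promise<Response> {
  const url = `${baseUrl}/cache/${encodeURIComponent(sessionId)}`;
  return withTimeout((signal) => fetch(url, { signal }), timeoutMs);
}
```

Passing the signal into a callback rather than hard-coding `fetch` keeps the helper usable for both the read and write paths, and lets it be exercised in tests without a live oracle_toll instance.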
DEPENDENCIES
| Dependency | Role |
|------------|------|
| oracle_toll.py (port 8889) | Cache read/write target |
| verdictCache.ts | Client-side cache interface — alert logic lives here |
| ~/CREW_CHANNEL | Primary alert output channel |
| ~/ALERT.log | CRITICAL alert overflow + κ boot pickup |
| Node.js fetch API | Transport for cache calls (needs AbortController for timeout) |
DEPENDENTS
| Dependent | How it depends |
|-----------|---------------|
| SPEC_ORACLE_VERDICT_PIPELINE.md | Alert spec closes GAP-06 from that document |
| SPEC_ORACLE_CACHE_HEALTH.md | Health monitor feeds CRITICAL triggers into this alert spec |
| Oracle Webhook route (/api/webhook/route.ts) | Must call alerting-aware cacheVerdict() |
| Oracle Verdict route (/api/verdict/route.ts) | Must call alerting-aware getCachedVerdict() |
GAPS IDENTIFIED DURING SPECIFICATION
| Gap ID | Description | Impact |
|--------|-------------|--------|
| ALERT-GAP-01 | No AbortController timeout on fetch() in verdictCache.ts — connection hangs indefinitely when oracle_toll is slow | HIGH — can block webhook IIFE for minutes |
| ALERT-GAP-02 | No retry logic in cacheVerdict() or getCachedVerdict() — any failure is immediately final | HIGH — transient network issues cause unnecessary Gemini regeneration |
| ALERT-GAP-03 | No alert dispatch implementation exists — catch {} silently swallows all errors | CRITICAL — operations blind to cache outages |
| ALERT-GAP-04 | CACHE_RETRY_COUNT, CACHE_RETRY_DELAY_MS, CACHE_CONNECT_TIMEOUT_MS are not env vars — defaults are not configurable without code change | MEDIUM — reduces operational tuning ability |
| ALERT-GAP-05 | No deduplication window implementation — alert storm possible on Stripe webhook retries | MEDIUM — CREW_CHANNEL noise risk |
| ALERT-GAP-06 | No boot-time connectivity probe for ORACLE_TOLL_URL — misconfigured URL undetectable until first customer payment fails | HIGH — invisible misconfiguration |
| ALERT-GAP-07 | Write-confirm mismatch detection requires session ID tracking not currently implemented | LOW — Gemini fallback covers the customer path; audit trail gap only |
REFERENCES
| File | Role |
|------|------|
| /home/nous/Aether/app/app/lib/verdictCache.ts | Implementation target for retry + alert logic |
| /home/nous/oracle_toll.py | Cache service; /cache/{id} endpoints |
| /home/nous/memories/SPEC_ORACLE_VERDICT_PIPELINE.md | Parent spec; GAP-06 closes here |
| /home/nous/memories/SPEC_ORACLE_CACHE_HEALTH.md | Companion spec — health monitor feeds CRITICAL triggers |
| /home/nous/CREW_CHANNEL | Alert destination (primary) |
| /home/nous/ALERT.log | Alert destination (CRITICAL overflow + κ boot) |
Φζ.⊤.
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042