Oracle Email Retry
SPEC_ORACLE_EMAIL_RETRY — Oracle Email Retry Mechanism
Version: 1.0 | Status: AUTHORIZED | Authority: α.13 | Date: 2026-04-16
PURPOSE
Defines the retry queue and dead-letter handling for failed Oracle verdict delivery emails. Currently (oracle_email_service.py v1.0) a single POST /send-verdict-email call is made from the Northflank webhook IIFE; if send_email() returns status != "sent", the webhook logs a 502 and the customer permanently loses their email (GAP-03 from SPEC_ORACLE_VERDICT_PIPELINE).
This spec defines the required retry mechanism: a persistent queue, a four-stage retry schedule, idempotency guards, dead-letter handling after max retries, and the Graph API error taxonomy that distinguishes retriable from abort-class failures.
The Oracle pipeline is the primary revenue mechanism of 42Sisters.AI. Every paid customer is owed email delivery of their verdict. This spec is CRITICAL tier.
INPUTS
From Northflank Webhook (/api/webhook/route.ts)
When POST /send-verdict-email fails (Graph API returns non-202, network timeout, or service unavailable), the webhook currently logs and discards. Under this spec, the webhook MUST enqueue a retry record instead.
Retry record schema (JSON, persisted to retry queue):
{
"retry_id": "<uuid4>",
"session_id": "<stripe_session_id>",
"customer_email": "<verified from stripe session>",
"tier": "quick | full | strategy",
"query": "<customer query string>",
"verdict": { "...": "..." },
"created_at": "<ISO 8601 UTC>",
"attempt_count": 0,
"last_attempt_at": null,
"next_attempt_at": "<ISO 8601 UTC>",
"last_error_code": null,
"last_error_detail": null,
"status": "PENDING | RETRYING | DELIVERED | DEAD"
}
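As a minimal sketch of how the enqueue path might construct this record: the helper name `new_retry_record` and the `VALID_TIERS` set are hypothetical and not part of the existing services. The sketch assumes the initial `next_attempt_at` equals the creation time, since the first retry is immediate under INV-02.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical constant: the three tier names listed in the schema.
VALID_TIERS = {"quick", "full", "strategy"}


def new_retry_record(session_id, customer_email, tier, query, verdict, now=None):
    """Build a PENDING retry record with the fields from the schema above."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if not customer_email:
        # FM-06: a record without a destination address must not be created.
        raise ValueError("customer_email is required")
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat()
    return {
        "retry_id": str(uuid.uuid4()),
        "session_id": session_id,
        "customer_email": customer_email,
        "tier": tier,
        "query": query,
        "verdict": verdict,
        "created_at": stamp,
        "attempt_count": 0,
        "last_attempt_at": None,
        # Assumption: the first retry is immediate, so it is due at creation time.
        "next_attempt_at": stamp,
        "last_error_code": None,
        "last_error_detail": None,
        "status": "PENDING",
    }
```

The verdict is stored as received, so a later retry can resend it unchanged (INV-06).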
From oracle_email_service.py (/send-verdict-email endpoint)
Caller provides:
{
"customer_email": "<string>",
"tier": "quick | full | strategy",
"query": "<string>",
"verdict": { "...": "..." }
}
session_id must be threaded through from the webhook so the retry record can be keyed on it for idempotency (see INVARIANTS INV-03).
From Microsoft Graph API (send_graph_email.py)
send_email() returns:
{"status": "sent", "to": to, "subject": subject} # success — Graph 202
{"status": "error", "code": <int>, "detail": "<str>"} # failure
Graph HTTP status codes relevant to retry logic:
| Code | Meaning | Retry Action |
|------|---------|--------------|
| 202 | Accepted | SUCCESS — no retry |
| 400 | Bad Request (malformed payload) | ABORT — permanent failure; enqueue dead letter |
| 401 | Unauthorized (access token invalid/expired) | RETRY after token refresh; if refresh also 401 → ABORT |
| 403 | Forbidden (wrong permissions / revoked app consent) | ABORT — requires NOUS re-authorization |
| 429 | Too Many Requests | RETRY after Retry-After header interval (minimum 60s); honor the header |
| 500 | Graph server error | RETRY with exponential backoff |
| 502 | Bad Gateway (Graph-side) | RETRY with exponential backoff |
| 503 | Service Unavailable | RETRY with exponential backoff |
| 504 | Gateway Timeout | RETRY with exponential backoff |
| -1 | Network error / connection refused | RETRY with exponential backoff |
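The table above can be expressed as a small lookup. This is an illustrative sketch only: the `Action` enum and `classify` function are hypothetical names. The table does not say how to treat codes it does not list, so the `default` of ABORT below is an assumption to be confirmed before use.

```python
from enum import Enum


class Action(Enum):
    SUCCESS = "success"
    RETRY_BACKOFF = "retry_backoff"
    RETRY_AFTER_HEADER = "retry_after_header"
    RETRY_AFTER_REFRESH = "retry_after_refresh"
    ABORT = "abort"


# One entry per row of the table above.
_ACTIONS = {
    202: Action.SUCCESS,
    400: Action.ABORT,
    401: Action.RETRY_AFTER_REFRESH,
    403: Action.ABORT,
    429: Action.RETRY_AFTER_HEADER,
    500: Action.RETRY_BACKOFF,
    502: Action.RETRY_BACKOFF,
    503: Action.RETRY_BACKOFF,
    504: Action.RETRY_BACKOFF,
    -1: Action.RETRY_BACKOFF,
}


def classify(code, token_already_refreshed=False, default=Action.ABORT):
    """Map a send_email() result code to a retry action.

    A 401 that persists after a token refresh is treated as ABORT.
    `default` covers codes the table does not list (an assumption).
    """
    if code == 401 and token_already_refreshed:
        return Action.ABORT
    return _ACTIONS.get(code, default)
```

Keeping the mapping in one table means a change to the taxonomy is a one-line edit that is easy to review against this spec.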
Environment Variables (via .env)
- GRAPH_TENANT_ID, GRAPH_CLIENT_ID, GRAPH_REFRESH_TOKEN, GRAPH_SENDER: required by send_graph_email.py
- EMAIL_RETRY_QUEUE_PATH: filesystem path for retry queue JSONL file (default: /home/nous/oracle_email_retry_queue.jsonl)
- EMAIL_RETRY_MAX_ATTEMPTS: maximum delivery attempts before dead-letter (default: 4)
- EMAIL_DEAD_LETTER_PATH: filesystem path for dead-letter JSONL file (default: /home/nous/oracle_email_dead_letter.jsonl)
OUTPUTS
1. Delivered Email
send_email() returns {"status": "sent"}. Retry record status set to DELIVERED (appended as a new line; the queue file itself is append-only per INV-09) and one entry is written to the delivery log.
2. Retry Queue Entry (oracle_email_retry_queue.jsonl)
Append-only JSONL. Each line is one retry record (schema above). Status field transitions: PENDING → RETRYING → DELIVERED | DEAD.
3. Dead-Letter Entry (oracle_email_dead_letter.jsonl)
After EMAIL_RETRY_MAX_ATTEMPTS failed attempts with no success, the retry record is written to the dead-letter file with status: "DEAD" and a summary of all error codes encountered.
4. ALERT.log Entry (on dead-letter)
When a retry record enters dead-letter, append to /home/nous/ALERT.log:
[ALERT][oracle-email-retry] DEAD LETTER: session_id=<id> customer=<email> tier=<tier> attempts=<N> last_error=<code>
This surfaces the lost delivery to NOUS without requiring log monitoring.
5. Delivery Log Entry (oracle_email_delivery.log)
Successful deliveries (first attempt or retry) append one line:
<ISO timestamp> DELIVERED session=<session_id> to=<email> tier=<tier> attempt=<N>
INVARIANTS
INV-01 — Payment precedes retry: A retry record MUST only exist if a corresponding session.payment_status === "paid" was confirmed by the Northflank webhook. No retry record is created speculatively. If session payment status cannot be confirmed, the retry is aborted and no record is written.
INV-02 — Retry schedule is fixed: The four retry attempts occur at the following intervals after the initial failure:
| Attempt | Delay after previous failure |
|---------|------------------------------|
| 1 (immediate retry) | 0s — same webhook IIFE, one immediate reattempt |
| 2 | 5 minutes |
| 3 | 30 minutes |
| 4 | 2 hours |
After attempt 4 fails, the record is dead-lettered. No further retries. Total retry window: ~2 hours 35 minutes.
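The schedule can be sketched as a lookup on the number of retry attempts already made. `RETRY_DELAYS` and `next_attempt_at` are hypothetical names; the sketch assumes each delay is measured from the previous failure, as the table header states.

```python
from datetime import datetime, timedelta, timezone

# Delay before retry attempts 1..4, measured from the previous failure.
RETRY_DELAYS = [
    timedelta(0),            # attempt 1: immediate
    timedelta(minutes=5),    # attempt 2
    timedelta(minutes=30),   # attempt 3
    timedelta(hours=2),      # attempt 4
]


def next_attempt_at(failed_at, attempts_made):
    """Return when the next retry is due, or None if the record must be dead-lettered.

    failed_at: timezone-aware time of the most recent failure.
    attempts_made: retry attempts already made (0 right after the initial failure).
    """
    if attempts_made >= len(RETRY_DELAYS):
        return None
    return failed_at + RETRY_DELAYS[attempts_made]


# Worked example: the delays sum to the stated retry window.
TOTAL_WINDOW = sum(RETRY_DELAYS, timedelta())
```

If each attempt fails instantly, the attempts land at T+0, T+5min, T+35min and T+2h35min, which matches the total window stated above.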
INV-03 — Idempotency by session_id: Each session_id MUST appear at most once in the active retry queue. Before enqueuing, the enqueue path checks the queue for an existing record with the same session_id. If found with status PENDING or RETRYING, no new record is created. This prevents duplicate emails when the webhook fires multiple times for the same Stripe session (Stripe guarantees at-least-once delivery; webhooks can duplicate).
INV-04 — No duplicate delivery: Before calling send_email(), the retry worker checks the delivery log for session_id. If a delivery log entry exists for that session_id, the retry is skipped and the queue record is marked DELIVERED without sending. This is the second line of idempotency defense (INV-03 prevents duplicate queue entries; INV-04 prevents duplicate sends if the queue check races).
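The two guards in INV-03 and INV-04 might look like the following sketch. Both function names are hypothetical. The first expects the canonical (most recent) record for each retry_id, and the second assumes delivery log lines follow the `session=<session_id>` format given in OUTPUTS. File locking is deliberately left out here.

```python
ACTIVE_STATUSES = {"PENDING", "RETRYING"}


def has_active_record(session_id, current_records):
    """INV-03: True if the queue already holds an active record for this session.

    current_records: iterable of the most recent record per retry_id.
    """
    return any(
        rec.get("session_id") == session_id and rec.get("status") in ACTIVE_STATUSES
        for rec in current_records
    )


def already_delivered(session_id, delivery_log_lines):
    """INV-04: True if the delivery log already has an entry for this session.

    Matches the whole `session=<id>` token so that one session id being a
    prefix of another does not produce a false positive.
    """
    token = f"session={session_id}"
    return any(token in line.split() for line in delivery_log_lines)
```

A caller would enqueue only when `has_active_record` is False, and send only when `already_delivered` is False.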
INV-05 — ABORT codes never retry: Graph API responses 400, 403, and any refresh-token 401 are non-retriable. These represent permanent failures (malformed payload or revoked authorization) where repeated attempts will never succeed. ABORT transitions the record directly to dead-letter; the retry schedule is not applied.
INV-06 — Retry worker does not generate or modify verdicts: The retry worker calls oracle_email_service.py /send-verdict-email with the exact verdict payload stored in the retry record at enqueue time. It does not call Gemini. It does not reformat. The verdict delivered on retry N is identical to the verdict that would have been delivered on attempt 1.
INV-07 — Token rotation on every auth call: send_graph_email.get_token_from_refresh() persists the updated refresh token to .env on every successful auth. The retry worker inherits this invariant from the underlying send_email() function. A retry worker failure that prevents token persistence must be logged to ALERT.log (do not silently discard the new token).
INV-08 — Dead-letter always produces an ALERT: No retry record may be silently discarded. If attempt_count >= EMAIL_RETRY_MAX_ATTEMPTS and the last attempt fails, the dead-letter write and ALERT.log entry are mandatory. Missing either constitutes a spec violation.
INV-09 — Retry queue is append-only JSONL: Records are never deleted from the queue file during normal operation. Status transitions are written as new lines appended with the updated status. The most recent line for a given retry_id is the canonical state. Compaction (removing superseded lines) is a maintenance operation requiring explicit NOUS authorization; it must not be triggered automatically.
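Because status transitions are appended rather than rewritten, a reader has to fold the file into its current state. A minimal sketch of that fold follows; `canonical_state` is a hypothetical name, and skipping blank lines is an assumption.

```python
import json


def canonical_state(lines):
    """Reduce append-only JSONL lines to the latest record per retry_id.

    Later lines supersede earlier ones, so a plain dict assignment in file
    order yields the canonical state described in INV-09.
    """
    state = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (assumption)
        record = json.loads(line)
        state[record["retry_id"]] = record
    return state
```

For example, a file containing a PENDING line followed by a DELIVERED line for the same retry_id folds to a single DELIVERED record.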
VERIFICATION CRITERIA
Σ.✓ conditions — retry mechanism is operating correctly when:
VER-01 — First-attempt idempotency: Submit the same Stripe session_id twice within the retry window. Assert the delivery log contains exactly one entry for that session_id. Assert the customer inbox contains exactly one email.
VER-02 — Retry schedule fires on time: Mock Graph API to return 503 on attempts 1 and 2, then 202 on attempt 3. Assert:
- Attempt 1 fires at T+0 (immediate)
- Attempt 2 fires at T+5min (±30s tolerance)
- Attempt 3 fires at T+35min (±30s tolerance)
- Delivery log entry written at T+35min
- Queue record status = DELIVERED
- Customer email received
VER-03 — Dead-letter and alert on max retries: Mock Graph API to return 503 on all four attempts. Assert:
- Queue record status = DEAD after attempt 4
- Dead-letter file contains one entry for the session_id
- ALERT.log contains [ALERT][oracle-email-retry] DEAD LETTER: entry with correct session_id, customer email, and attempt count
- No fifth attempt is made
VER-04 — ABORT on 400/403: Mock Graph API to return 400 on first attempt. Assert:
- No retry attempts are made (queue record jumps directly to DEAD)
- Dead-letter entry written
- ALERT.log entry written
- Total time from failure to dead-letter is under 5 seconds
VER-05 — 429 respects Retry-After header: Mock Graph API to return 429 with Retry-After: 120. Assert next retry attempt fires no earlier than T+120s.
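One way to derive the 429 delay is sketched below. It reads "minimum 60s" as a floor applied to the header value, which is an interpretation rather than something this spec states outright. `retry_after_seconds` is a hypothetical name. The Retry-After header may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MIN_DELAY_S = 60  # floor from the error table (interpretation)


def retry_after_seconds(header_value, now=None):
    """Return the delay in seconds before the next attempt after a 429."""
    now = now or datetime.now(timezone.utc)
    delay = 0.0
    if header_value:
        value = header_value.strip()
        if value.isdigit():
            delay = float(value)  # delta-seconds form
        else:
            try:
                when = parsedate_to_datetime(value)  # HTTP-date form
                if when.tzinfo is None:
                    when = when.replace(tzinfo=timezone.utc)
                delay = (when - now).total_seconds()
            except (TypeError, ValueError):
                delay = 0.0  # unparseable header: fall back to the floor
    return max(float(MIN_DELAY_S), delay)
```

With `Retry-After: 120` this returns 120 seconds, so the next attempt is scheduled no earlier than T+120s as VER-05 requires.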
VER-06 — Token 401 refresh cycle: Mock Graph token endpoint to return a valid new access token on refresh. Mock Graph send endpoint to return 401, then 202 on retry with refreshed token. Assert delivery succeeds and GRAPH_REFRESH_TOKEN in .env is updated.
VER-07 — Verdict integrity across retries: Assert verdict payload in delivery attempt N is byte-for-byte identical to the verdict in the retry record created at enqueue time. No Gemini call occurs during retry execution.
FAILURE MODES
FM-01 — Σ.⊠ Retry worker not running: The retry queue file accumulates PENDING records but no worker processes them. next_attempt_at timestamps pass without attempts. Customer email is permanently lost after retry window expires. Mitigation: The retry worker must run as a systemd timer or cron (every 1 minute); health check must verify it is running. If worker is absent, ALERT.log must be written on the next oracle_email_service.py boot. [GAP-01 — retry worker does not yet exist; this spec defines the requirement]
FM-02 — Σ.⊠ Queue file corruption: oracle_email_retry_queue.jsonl is written by two concurrent processes (webhook enqueue + retry worker update). Concurrent writes without file locking can corrupt JSONL. Mitigation: All queue writes must acquire an exclusive fcntl lock (LOCK_EX) before writing. On lock failure (timeout > 5s), log to ALERT.log and abort the current write — do not proceed without the lock. [GAP-02 — file locking not implemented in current oracle_email_service.py]
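The FM-02 mitigation could be sketched as follows, assuming a POSIX host where `fcntl.flock` is available. `append_locked` and `LockTimeout` are hypothetical names. The 5 second limit is implemented by polling a non-blocking lock, which is one of several possible approaches; on `LockTimeout` the caller is expected to write to ALERT.log and abandon the write.

```python
import fcntl
import json
import os
import time


class LockTimeout(Exception):
    """Raised when the exclusive lock is not acquired within the timeout."""


def append_locked(path, record, timeout_s=5.0, poll_s=0.05):
    """Append one JSON line to `path` while holding an exclusive flock."""
    deadline = time.monotonic() + timeout_s
    with open(path, "a", encoding="utf-8") as fh:
        while True:
            try:
                # Non-blocking attempt so the timeout can be enforced here.
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise LockTimeout(path)
                time.sleep(poll_s)
        try:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push the line to disk before unlocking
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Note that `flock` is advisory: it only protects the file if every writer, including the webhook enqueue path and the retry worker, goes through the same locking routine.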
FM-03 — Σ.⊠ Refresh token expires during retry window: If GRAPH_REFRESH_TOKEN is revoked between enqueue time and a retry attempt, get_token_from_refresh() raises RuntimeError. This causes the retry attempt to fail with no error code (exception, not HTTP status). Without explicit handling, the retry worker may crash or skip the record without incrementing attempt_count. Mitigation: Wrap get_token_from_refresh() in try/except; on RuntimeError, set last_error_code = 401, treat as ABORT, dead-letter the record immediately, write ALERT.log with GRAPH_REFRESH_TOKEN revoked message so NOUS can re-authorize. [GAP-03 — RuntimeError path not handled in current send_graph_email.py]
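The mitigation for FM-03 might be sketched as a guard around the token call. Because the real signatures of `get_token_from_refresh()` and `send_email()` are not reproduced in this spec, the sketch takes both as injected callables. `guarded_attempt` and the `abort` key in the returned dict are hypothetical.

```python
def guarded_attempt(get_token, send):
    """Run one delivery attempt, mapping a token RuntimeError to an ABORT result.

    get_token: callable standing in for get_token_from_refresh().
    send: callable taking the token and returning a send_email()-style dict.
    """
    try:
        token = get_token()
    except RuntimeError as exc:
        # Surface the failure as a 401-class error instead of crashing the
        # worker, so the record can be dead-lettered and ALERT.log written.
        return {
            "status": "error",
            "code": 401,
            "detail": f"GRAPH_REFRESH_TOKEN revoked: {exc}",
            "abort": True,  # hypothetical flag: skip the retry schedule
        }
    return send(token)
```

Returning a result dict keeps the exception path and the HTTP error path on the same code route, so `attempt_count` and `last_error_code` are updated in one place.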
FM-04 — Σ.⊠ Dead-letter write fails (disk full / permission): If oracle_email_dead_letter.jsonl cannot be written, the dead-letter record is lost and the customer's lost email becomes invisible to operations. Mitigation: If dead-letter write fails, the ALERT.log write must still succeed (ALERT.log is the second-tier notification; it must be written even if JSONL write fails). If ALERT.log also fails, write to systemd journal via logger command as last resort.
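The fallback chain in FM-04 can be sketched as below. `dead_letter` and `journal_fallback` are hypothetical names, and the `journal` parameter exists so the last-resort step can be swapped out in tests. The default shells out to the standard `logger` utility and assumes it is installed on the host.

```python
import json
import subprocess


def _append_line(path, text):
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text + "\n")


def journal_fallback(message):
    """Last resort: hand the alert to syslog/journald via the logger command."""
    subprocess.run(["logger", "-t", "oracle-email-retry", message], check=False)


def dead_letter(record, dead_letter_path, alert_path, journal=journal_fallback):
    """Write the dead-letter entry and the alert; report which sinks succeeded."""
    alert = (
        "[ALERT][oracle-email-retry] DEAD LETTER: "
        f"session_id={record['session_id']} customer={record['customer_email']} "
        f"tier={record['tier']} attempts={record['attempt_count']} "
        f"last_error={record['last_error_code']}"
    )
    written = {"dead_letter": False, "alert": False, "journal": False}
    try:
        _append_line(dead_letter_path, json.dumps({**record, "status": "DEAD"}))
        written["dead_letter"] = True
    except OSError:
        pass  # the alert below must still be attempted
    try:
        _append_line(alert_path, alert)
        written["alert"] = True
    except OSError:
        journal(alert)
        written["journal"] = True
    return written
```

The alert is attempted regardless of whether the dead-letter write succeeded, which is the ordering FM-04 requires.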
FM-05 — Σ.⊠ Duplicate Stripe webhook fires during active retry: Stripe sends the same checkout.session.completed event twice. First instance is already in the queue as RETRYING. Second instance would create a duplicate record and potentially cause a duplicate email send if the first attempt succeeds concurrently. Mitigation: INV-03 session_id uniqueness check + INV-04 delivery log check form a two-layer guard. Both must be implemented atomically (under the same file lock) to be effective. [GAP-04 — atomicity of the two-layer check requires careful implementation]
FM-06 — Σ.⊠ customer_email null on Stripe session: session.customer_details.email is null (known edge case from SPEC_ORACLE_VERDICT_PIPELINE FM-08). No email address to send to. Retry record MUST NOT be created — enqueuing a record with null customer_email wastes retry cycles and cannot succeed. The webhook should log a warning and proceed to cache-only mode. This failure mode is KNOWN and has no mitigation for email delivery; the verdict is accessible via result page only.
FM-07 — Σ.⊠ oracle_email_service.py crash during send_email() call: The service crashes (OOM, unhandled exception) after send_email() is called but before the result is processed. The email may or may not have been sent (Graph API may have accepted the request). On service restart, the retry worker must check the delivery log before re-attempting — INV-04 provides this guard. If the email was sent but the delivery log was not written, the customer may receive a duplicate. This is an acceptable edge case (duplicate is better than no delivery). [GAP-05 — no atomic write of send + delivery log; duplicate possible in crash scenario]
FM-08 — Σ.⊠ Retry schedule miscalculation (clock skew): If next_attempt_at is computed using local system time and the retry worker runs on a different system or after an NTP correction, next_attempt_at may be in the past or far future. Mitigation: Compute all timestamps as timezone-aware UTC; use datetime.now(timezone.utc), not datetime.now() (local time) or datetime.utcnow() (returns a naive datetime and is deprecated since Python 3.12); validate that next_attempt_at > created_at before writing any retry record.
GAPS
| ID | Description | Severity | Resolution Path |
|----|-------------|----------|-----------------|
| GAP-01 | Retry worker does not exist — this entire spec describes a component that must be built | CRITICAL | Build oracle_email_retry_worker.py + systemd timer per this spec; requires NOUS authorization of design before C.L.O.D. deploys |
| GAP-02 | No file locking on JSONL queue writes — concurrent webhook + worker writes will corrupt the queue | HIGH | Implement fcntl.flock(LOCK_EX) on all queue file operations in both the enqueue path (oracle_email_service.py) and the retry worker |
| GAP-03 | get_token_from_refresh() raises bare RuntimeError on missing/expired token — not caught by callers | HIGH | Wrap in try/except in retry worker; map to ABORT path; write ALERT.log with explicit GRAPH_REFRESH_TOKEN revoked message |
| GAP-04 | Atomicity of two-layer idempotency check (INV-03 + INV-04) not guaranteed — race condition possible under concurrent webhook firings | MEDIUM | Both checks must occur inside the same file lock acquisition; document lock acquisition order |
| GAP-05 | No atomic send + delivery-log write — crash between the two steps can produce a duplicate email | LOW | Acceptable edge case (duplicate better than zero); document as known behavior; add dedup note in email footer if duplicate is received |
| GAP-06 | session_id is not currently threaded through oracle_email_service.py VerdictRequest schema — required for INV-03/INV-04 | HIGH | Add session_id: str field to VerdictRequest Pydantic model; update Northflank webhook to include it in the POST body |
| GAP-07 | No end-to-end test for the retry path — VER-02 and VER-03 require a mock Graph API harness that does not exist | MEDIUM | Build mock_graph_api.py test harness; script returns configurable status codes per attempt count; run as part of pre-deploy smoke test |
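For GAP-07, a harness along the following lines may be enough to drive VER-02 and VER-03. This is a sketch, not the mock_graph_api.py the resolution path calls for: the class name `MockGraphAPI` and its `send` method are hypothetical, and it only mimics the two return shapes listed for send_email() in INPUTS.

```python
class MockGraphAPI:
    """Return a scripted status code on each successive send attempt."""

    def __init__(self, codes):
        if not codes:
            raise ValueError("at least one status code is required")
        self._codes = list(codes)
        self.calls = 0

    def send(self, to, subject):
        # Once the script is exhausted, keep repeating its final code.
        code = self._codes[min(self.calls, len(self._codes) - 1)]
        self.calls += 1
        if code == 202:
            return {"status": "sent", "to": to, "subject": subject}
        return {"status": "error", "code": code, "detail": f"mock Graph {code}"}
```

For VER-02 the harness would be built as `MockGraphAPI([503, 503, 202])`; for VER-03, `MockGraphAPI([503])` fails every attempt, and `calls` can be asserted to equal four.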
DEPENDENCIES
| Dependency | Role | Notes |
|------------|------|-------|
| oracle_email_service.py (port 8006) | Enqueue path — POST /send-verdict-email triggers retry enqueue on failure | Must be modified to add session_id to VerdictRequest and implement enqueue on failure |
| send_graph_email.py | Actual email delivery on each retry attempt | Must wrap get_token_from_refresh() RuntimeError per GAP-03 |
| oracle_email_retry_worker.py | Retry scheduler and executor [NOT YET BUILT — GAP-01] | New component; must be built and deployed per this spec |
| Microsoft Graph API | Delivery endpoint | Error codes defined in INPUTS section |
| /home/nous/.env | Graph API credentials + retry config env vars | Token rotation path must remain atomic |
| /home/nous/ALERT.log | Dead-letter and critical failure notification channel | Must always be writable; last-resort fallback is systemd journal |
| oracle_email_retry_queue.jsonl | Persistent retry queue | New file; path configurable via EMAIL_RETRY_QUEUE_PATH |
| oracle_email_dead_letter.jsonl | Dead-letter record store | New file; path configurable via EMAIL_DEAD_LETTER_PATH |
| oracle_email_delivery.log | Successful delivery audit log | New file; used for INV-04 idempotency check |
DEPENDENTS
| Dependent | Dependency Type |
|-----------|----------------|
| Oracle Verdict Pipeline (SPEC_ORACLE_VERDICT_PIPELINE) | This spec closes GAP-03 from that pipeline spec |
| Email Autonomy Stack (SPEC_EMAIL_AUTONOMY_STACK) | Inherits Graph API error handling patterns; extends FM-01/FM-03 from that spec |
| Customer trust / brand | Every paid customer must receive their verdict email; this spec is the delivery guarantee |
REFERENCES
| File | Role |
|------|------|
| /home/nous/oracle_email_service.py | Email delivery service — must be modified to enqueue on failure |
| /home/nous/send_graph_email.py | Graph API send — must have RuntimeError path handled |
| /home/nous/memories/SPEC_ORACLE_VERDICT_PIPELINE.md | Parent spec; GAP-03 is the origin of this spec |
| /home/nous/memories/SPEC_EMAIL_AUTONOMY_STACK.md | Graph API auth invariants inherited by this spec |
| /home/nous/Aether/app/app/api/webhook/route.ts | Northflank webhook — must pass session_id to email service |
| /home/nous/ALERT.log | Dead-letter notification target |
Φζ.⊤.
Jeremy Zlabis
Chronogeometer · Visionary · Disruptor · Chief
42 Sisters AI · East York, Toronto
🍁 Φ 0.042