Reliability & ops · inside AJO

Systems.

The layer that keeps the fleet running when no one is watching. Monitoring, incident response, and the guards that make "unattended" safe.

Fleet health: nominal

daemon launcher: alive
daemon agent_runner: alive
daemon slack_mirror: alive
dead-letters: in backlog
token quota: 61% of window
kill switch: clear (armed, one flag away)

Guards: kill switch · quota breaker · stuck-claim sweep · watchdog · dead-letter

All systems nominal · monitor runs every 2h

The monitor sweeps every two hours; when a signal goes bad, a guard trips before a human is needed — and only a real incident escalates to one.

A fleet that runs itself has to be watched by something. Systems is that something — ops, monitoring, incident response, and the reliability machinery that lets the whole thing run unattended. Without this layer, "it runs on its own" is a demo. With it, it's a system you can leave alone.

2h · monitor health sweep of the whole fleet

3 · live daemons kept alive and watched

4 · automatic guards firing before a human is needed

15m · and an orphaned claim gets swept and re-delivered

01 · How it works

Four roles, one job: uptime.

Monitor: every two hours it queries the invocations table for last errors, the dead-letter backlog, and daemon liveness, then raises either a concern or an ultra-tier incident.
Ops: handles restarts, deploys, and config edits, and can hot-reload the registry and prompts without a restart.
Incident handler: coordinates ultra-tier responses across projects and broadcasts status to everyone affected until the incident is closed.
Lead: owns the architecture of the message bus itself, the decisions the other three operate within.
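
The monitor's two-hour sweep could look roughly like the sketch below. It is an illustration only: the three-table schema (`invocations`, `dead_letters`, `daemons`), the column names, both thresholds, and the rule for choosing between a concern and an ultra are all assumptions, since this page does not describe them.

```python
import sqlite3

# Assumed thresholds; the page does not give real values.
DEAD_LETTER_LIMIT = 5      # backlog size that warrants a concern
HEARTBEAT_TIMEOUT = 300.0  # seconds of silence before a daemon counts as dead


def sweep(conn, now):
    """Run one health sweep; return (level, reasons)."""
    reasons = []

    # Roles whose most recent invocation ended in an error.
    failing = conn.execute(
        """
        SELECT i.role FROM invocations AS i
        WHERE i.error IS NOT NULL
          AND i.finished_at = (
              SELECT MAX(j.finished_at) FROM invocations AS j
              WHERE j.role = i.role)
        ORDER BY i.role
        """
    ).fetchall()
    reasons += [f"last invocation failed: {role}" for (role,) in failing]

    # Size of the dead-letter backlog.
    (backlog,) = conn.execute("SELECT COUNT(*) FROM dead_letters").fetchone()
    if backlog > DEAD_LETTER_LIMIT:
        reasons.append(f"dead-letter backlog: {backlog}")

    # Daemons whose last heartbeat is older than the timeout.
    dead = conn.execute(
        "SELECT name FROM daemons WHERE heartbeat_at < ? ORDER BY name",
        (now - HEARTBEAT_TIMEOUT,),
    ).fetchall()
    reasons += [f"daemon not alive: {name}" for (name,) in dead]

    # Assumed escalation rule: a dead daemon is an ultra, anything else a concern.
    if dead:
        level = "ultra"
    elif reasons:
        level = "concern"
    else:
        level = "nominal"
    return level, reasons


def make_demo_db():
    """Build an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE invocations (role TEXT, finished_at REAL, error TEXT);
        CREATE TABLE dead_letters (id INTEGER PRIMARY KEY);
        CREATE TABLE daemons (name TEXT, heartbeat_at REAL);
        """
    )
    conn.executemany(
        "INSERT INTO invocations VALUES (?, ?, ?)",
        [("ops", 100.0, "timeout"), ("ops", 200.0, None), ("lead", 150.0, "boom")],
    )
    conn.executemany(
        "INSERT INTO daemons VALUES (?, ?)",
        [("launcher", 990.0), ("agent_runner", 995.0), ("slack_mirror", 998.0)],
    )
    return conn
```

Calling `sweep(make_demo_db(), now=1000.0)` reports a concern for the role whose latest invocation failed; moving `now` far enough ahead makes every heartbeat stale and the same call returns an ultra. Keeping the sweep a read-only query plus a pure classification step means it can run on a timer without touching the bus it is watching.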

02 · The guards

Structural, not a prompt.

Autonomy is only safe because a handful of guards fire on their own, in code, before anything cascades.

Kill switch: one flag freezes every write across the fleet, with a hard variant that blocks auto-clear.
Quota breaker: per-role and system-wide spend caps trip before a runaway burns the budget.
Stuck-claim sweep: a claim left in-flight past fifteen minutes is re-delivered, so no work silently stalls.
Poison-message breaker: a message that keeps failing is dead-lettered instead of looping forever.
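
As a rough illustration of what "in code, not in a prompt" can mean, the four guards can be sketched as plain checks on an in-memory bus. Everything named here is hypothetical: the `Bus` and `Message` classes, the cap values, and `MAX_ATTEMPTS` are assumptions made for the example. Only the fifteen-minute claim timeout comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional

CLAIM_TIMEOUT = 15 * 60  # seconds; the fifteen-minute window described above
MAX_ATTEMPTS = 3         # assumed; the page only says "keeps failing"


@dataclass
class Message:
    id: str
    attempts: int = 0
    claimed_at: Optional[float] = None  # None means waiting for delivery


@dataclass
class Bus:
    queue: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)
    kill_switch: bool = False  # one flag freezes every write
    hard_kill: bool = False    # hard variant: auto-clear is refused
    spent: dict = field(default_factory=dict)  # tokens spent per role
    role_cap: int = 1000       # assumed per-role cap
    system_cap: int = 2500     # assumed system-wide cap

    def trip(self, hard: bool = False) -> None:
        """Kill switch: set the flag, optionally in its hard variant."""
        self.kill_switch = True
        self.hard_kill = self.hard_kill or hard

    def auto_clear(self) -> bool:
        """Automation may clear a soft kill switch, never a hard one."""
        if self.hard_kill:
            return False
        self.kill_switch = False
        return True

    def can_write(self, role: str) -> bool:
        """Kill switch and quota breaker must both pass before a write."""
        if self.kill_switch:
            return False
        if self.spent.get(role, 0) >= self.role_cap:
            return False
        return sum(self.spent.values()) < self.system_cap

    def sweep_stuck(self, now: float) -> list:
        """Stuck-claim sweep: release claims older than CLAIM_TIMEOUT."""
        released = []
        for msg in self.queue:
            if msg.claimed_at is not None and now - msg.claimed_at > CLAIM_TIMEOUT:
                msg.claimed_at = None  # back to waiting, so it is re-delivered
                released.append(msg.id)
        return released

    def fail(self, msg: Message) -> None:
        """Poison-message breaker: dead-letter after MAX_ATTEMPTS failures."""
        msg.attempts += 1
        msg.claimed_at = None
        if msg.attempts >= MAX_ATTEMPTS:
            self.queue.remove(msg)
            self.dead_letters.append(msg)
```

Each guard is a few lines of ordinary control flow that runs whether or not any agent cooperates, which is the sense in which the guards are structural. Whether a swept claim should also count as a failed attempt is left open here; this sketch does not count it.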

03 · Where the human still stands

Honest about its limits.

A monitoring layer that overstates what it can do is worse than none. Some things stay with a person on purpose — the runner can't restart itself, so ops tells me exactly which process to relaunch, and a true ultra-tier incident surfaces to a human rather than being quietly swallowed. The guards buy time; they don't pretend to be a person.

Runs inside AJO. It watches the daemons every other project depends on. Code is private; this page is the record.
