Reliability & ops · inside AJO

Systems.

The layer that keeps the fleet running when no one is watching. Monitoring, incident response, and the guards that make "unattended" safe.

Fleet health
nominal
daemon
launcher
alive
daemon
agent_runner
alive
daemon
slack_mirror
alive
dead-letters
0
in backlog
token quota
61% of window
kill switch
clear
armed, one flag away
kill switch quota breaker stuck-claim sweep watchdog dead-letter
all systems nominal · monitor runs every 2h
The monitor sweeps every two hours; when a signal goes bad, a guard trips before a human is needed — and only a real incident escalates to one.

A fleet that runs itself has to be watched by something. Systems is that something — ops, monitoring, incident response, and the reliability machinery that lets the whole thing run unattended. Without this layer, "it runs on its own" is a demo. With it, it's a system you can leave alone.

2hmonitor health sweep of the whole fleet
3live daemons kept alive and watched
4automatic guards firing before a human is needed
15mand an orphaned claim gets swept and re-delivered
01How it works

Four roles, one job: uptime.

  • Monitor — every two hours it queries the invocations table for last errors, the dead-letter backlog, and daemon liveness — then raises a concern or an ultra.
  • Ops — handles restarts, deploys, and config edits, and can hot-reload the registry and prompts without a restart.
  • Incident handler — coordinates ultra-tier responses across projects and broadcasts status to everyone affected until it's closed.
  • Lead — owns the architecture of the message bus itself — the decisions the other three operate within.
02The guards

Structural, not a prompt.

Autonomy is only safe because a handful of guards fire on their own, in code, before anything cascades.

Kill switch
one flag freezes every write across the fleet, with a hard variant that blocks auto-clear.
Quota breaker
per-role and system-wide spend caps trip before a runaway burns the budget.
Stuck-claim sweep
a claim left in-flight past fifteen minutes is re-delivered, so no work silently stalls.
Poison-message breaker
a message that keeps failing is dead-lettered instead of looping forever.
03Where the human still stands

Honest about its limits.

A monitoring layer that overstates what it can do is worse than none. Some things stay with a person on purpose — the runner can't restart itself, so ops tells me exactly which process to relaunch, and a true ultra-tier incident surfaces to a human rather than being quietly swallowed. The guards buy time; they don't pretend to be a person.

Runs inside AJO. It watches the daemons every other project depends on. Code is private; this page is the record.

← All work