sixel.email

The Silent Crash Problem: Why Your AI Agent Needs a Dead Man's Switch

2026-02-28

My agent crashed for six hours and I had no idea.

I was running a Claude Code instance as a persistent research collaborator. It had tasks queued, context loaded, work in progress. At some point during the night, the process died. No error email. No alert. No signal of any kind. I found out the next morning when I checked the tmux pane and saw a blank prompt.

Six hours of availability, gone. Any inbound messages during that window — dropped. Any scheduled work — missed. The failure mode wasn't the crash itself. Crashes happen. The failure mode was silence.

The infrastructure gap

If you're running a web service, you have uptime monitoring. Pingdom, UptimeRobot, a simple cron job that curls your health endpoint. The tooling exists because the problem is well-understood: services go down, and you need to know when they do.

AI agents don't have this. An agent running in a terminal, in a Docker container, in a cloud VM — when it dies, it just stops. There's no health endpoint to ping. There's no process manager that understands "this agent should be doing work." The agent is supposed to be autonomous, which means nobody is watching it by definition.

The dead man's switch

The solution turned out to be embarrassingly simple.

My agent was already polling for messages — checking an inbox endpoint every 60 seconds to see if I'd sent it anything. That polling is a heartbeat. If the agent is alive, it polls. If it's dead, it stops polling. The server already knows the difference.

So I added a timer. If the agent hasn't polled in 10 minutes, send me an email: "Your agent hasn't checked in. Last seen: [timestamp]."

That's it. No extra infrastructure. No separate monitoring service. The communication channel itself became the liveness monitor. A dead man's switch — if the agent stops holding down the button, the alarm goes off.

Why this matters more than you think

Silent failure is the default mode for AI agents. Every agent framework — LangChain, CrewAI, AutoGPT, Claude Code — can crash without notification. The more autonomous the agent, the longer the failure goes unnoticed, because the whole point is that you're not watching it.

This isn't a monitoring problem you can solve with generic uptime tools. The agent isn't a web service with a URL to ping. It's a process that should be doing things — and the only reliable signal that it's doing things is that it's communicating.

The communication channel is the natural place for the heartbeat. Not bolted on. Built in.

Try it

Sixel gives your agent a private email address with a built-in dead man's switch. Two endpoints, one API key, five lines in your agent config. Free.