On April 25, we went dark for three and a half hours. While I worked out why, the system I had built to warn me when things break crashed under the noise of its own alarms.
The outage was bad.
The worse part came while I was fixing it: the safety net was held together by three files on one machine, and only one person, me, knew how to put them back if the machine died. None of it was backed up in the way the rest of our code was backed up. If that machine had failed at 2 a.m. on a weekend, I'd have spent most of Saturday rebuilding it from memory.
I'd known this for months. I kept choosing not to fix it. April 25 was the day I couldn't keep choosing that anymore.
What the risk actually is
Most software companies have systems that run quietly in the background. Monitoring tools. Backup jobs. The thing that sends invoices on the first of the month. The product depends on them.
These systems usually get built quickly, often by one person under pressure. Once they work, nobody touches them. Over months, they accumulate small adjustments: a fix here, a workaround there. These exist only on the machine they run on, and only in the head of the person who made them.
That's the risk. Systems fail all the time and get rebuilt; that's normal. The risk is that when one of these fails, the only path to recovery runs through one person's memory.
If that person is on vacation, you wait. If they're sick, you wait. If they leave the company, you may not recover at all. You rebuild from scratch and hope you didn't lose anything important along the way.
Why smart people let it sit
Because the math, in the moment, always favors letting it sit.
Fixing the problem takes a focused weekend. That weekend has a visible cost: a feature doesn't ship, a customer waits, a bug stays open. The cost of not fixing it is invisible: a bad day that hasn't happened yet, that might not happen this quarter, that the engineer can probably handle if it does.
Week after week, the visible cost beats the invisible one. The engineer makes a defensible choice each time, and the debt gets a little larger.
Discipline alone doesn't solve it. This is how risk works when one side of the ledger is concrete and the other is hypothetical. You can carry this kind of debt for a long time without anyone, including the person carrying it, noticing how heavy it's gotten.
Until something tips.
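To put rough numbers on that ledger, here is a minimal sketch in Python. Every figure in it is invented for illustration: a 2% chance per week that the bad day arrives, a failure that costs ten weekends of work, a fix that costs one. It also assumes each week is independent of the last, which real systems rarely are.

```python
def chance_of_bad_day(weekly_probability, weeks):
    """Probability of at least one failure over the given number of weeks."""
    return 1 - (1 - weekly_probability) ** weeks


def expected_loss(weekly_probability, failure_cost, weeks):
    """Rough expected cost of carrying the risk for that long."""
    return chance_of_bad_day(weekly_probability, weeks) * failure_cost


# Made-up numbers, in units of "one focused weekend of work".
FIX_COST = 1.0
FAILURE_COST = 10.0
WEEKLY_PROBABILITY = 0.02

for weeks in (1, 13, 26, 52):
    loss = expected_loss(WEEKLY_PROBABILITY, FAILURE_COST, weeks)
    verdict = "defer" if loss < FIX_COST else "fix"
    print(f"{weeks:>2} weeks: expected loss {loss:.2f} vs fix {FIX_COST:.2f} -> {verdict}")
```

With these invented numbers, any single week looks safe to defer, while the same risk carried for half a year does not. The weekly decision only ever sees the first row.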
The warnings I ignored
Three times this spring, the machine running our monitoring system had gone down.
Each time, I recovered it by hand. Each recovery took longer than it should have, because the steps lived in my head, not on paper. Each one was a warning I treated like an inconvenience instead of a signal.
April 25 was the warning I couldn't argue with. The system whose job was to tell us when things break was now the thing breaking. If we lost the machine that day, we'd lose visibility into everything else at the same moment we needed it most.
The decision that mattered
When I finally sat down to fix it, the obvious move was to throw out the three files and write something cleaner. A weekend of work, no baggage.
I didn't.
The three files weren't pretty, but they were right. Over three months, I'd baked small fixes into them: adjustments I'd made the day after specific incidents, settings tuned to how my particular network behaves, a threshold I'd raised five times because the default kept sending false alarms at 3 a.m. A clean rewrite meant either re-deriving every one of those fixes or shipping without them. The first was a month of work. The second meant trading one kind of fragility for another.
So I did the harder, less satisfying thing. I captured exactly what was running, including every quirk and workaround, and put that into version control. Then I built a one-command process to re-deploy it from scratch onto a new machine, in case I ever need to.
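For readers who want to see the shape of that step, here is a minimal sketch in Python. It is an illustration, not the actual process described above: the function names, the `MANIFEST.json` file, and the idea of recording a SHA-256 hash per file are all assumptions. It copies the live files into a directory unchanged, records where each one came from, and can put them back. Committing that directory is left to whatever version control you already use.

```python
import hashlib
import json
import shutil
from pathlib import Path

MANIFEST = "MANIFEST.json"  # hypothetical file name


def _sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(live_files, repo_dir):
    """Copy each live file into repo_dir unchanged; record where it came from."""
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in map(Path, live_files):
        # Assumes file names are unique; a real capture would need to handle clashes.
        shutil.copy2(src, repo / src.name)
        manifest[src.name] = {"path": str(src.resolve()), "sha256": _sha256(src)}
    (repo / MANIFEST).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def drifted(repo_dir):
    """Names of captured files that are missing or no longer match the manifest."""
    repo = Path(repo_dir)
    manifest = json.loads((repo / MANIFEST).read_text())
    return sorted(
        name for name, entry in manifest.items()
        if not (repo / name).is_file() or _sha256(repo / name) != entry["sha256"]
    )


def redeploy(repo_dir):
    """Put every captured file back at its recorded path; refuse if the copy drifted."""
    repo = Path(repo_dir)
    bad = drifted(repo)
    if bad:
        raise ValueError(f"captured copy does not match manifest: {bad}")
    manifest = json.loads((repo / MANIFEST).read_text())
    for name, entry in manifest.items():
        target = Path(entry["path"])
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(repo / name, target)
    return sorted(manifest)
```

The point of the design is that nothing gets rewritten on the way in. The files are copied byte for byte, quirks included, and the hashes exist only to catch the captured copy drifting away from what was actually running.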
It's uglier than the rewrite would have been. It's also very hard to get wrong, because the source of truth is the system I know is working.
This is worth knowing if you hire engineers: the right answer is often the one that looks worse on paper. A senior engineer will sometimes pick the slower, less elegant path because it's the one with fewer ways to silently lose something. If you push them toward the cleaner story, you can lose what you didn't know you had.
What actually changed
The system runs the same as it did on April 25. The only difference is that I no longer carry it in my head.
Memory-based system
- × One person
- × Undocumented process
- × Scattered files
Owned system
- ✓ Repository
- ✓ Version control
- ✓ Repeatable recovery
Before: memory, manual recovery, single machine, hidden risk.
After: repository, one-command deployment, backup, shared ownership.
The same shape, in your business
The pattern travels beyond engineering.
Every working business carries at least one version of it. A customer who's 40% of revenue. A deal pipeline that depends on one rep's relationships. A bookkeeping system only the founder can navigate. A vendor with no real backup. A password vault on a single laptop.
In each case, you have something that's working now, that you've been meaning to address later, where the cost of fixing it is bounded and visible, and the cost of carrying it is unbounded and invisible. Until it isn't.
You can't avoid these. The judgment lies in naming them. Write them down. Put a rough cost on what happens if each one fails on the worst day. Re-read the list every quarter and ask whether anything has gotten cheaper to fix or more expensive to carry.
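A spreadsheet is enough for this list. For anyone who would rather keep it next to the code, here is a minimal sketch in Python of the quarterly check. The `Risk` fields and the `quarterly_review` function are hypothetical names, and the costs are whatever rough unit you choose (days, dollars, weekends).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    cost_to_fix: float       # rough estimate, in any consistent unit
    cost_if_it_fails: float  # rough estimate for the worst day


def quarterly_review(previous, current):
    """Compare this quarter's list with the last one and note what moved."""
    before = {risk.name: risk for risk in previous}
    notes = []
    for risk in current:
        old = before.get(risk.name)
        if old is None:
            notes.append(f"{risk.name}: newly named")
            continue
        if risk.cost_to_fix < old.cost_to_fix:
            notes.append(f"{risk.name}: cheaper to fix")
        if risk.cost_if_it_fails > old.cost_if_it_fails:
            notes.append(f"{risk.name}: more expensive to carry")
    return notes


# Example with invented entries and numbers.
last_quarter = [Risk("password vault on one laptop", 2, 20)]
this_quarter = [
    Risk("password vault on one laptop", 1, 30),
    Risk("vendor with no backup", 5, 40),
]
for note in quarterly_review(last_quarter, this_quarter):
    print(note)
```

Items that did not move produce no note, which keeps the review short: the output is only the list of things worth a second look.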
What to ask the technical people you hired
Three questions, in order. Ask them plainly. A good engineer will be glad you asked.
You don't need to fix everything they name. You need to know it's named.
April 25 was the day my account got updated. I'd been carrying a number I hadn't checked in months, and the number had grown.
The fix took a week of evenings. I should have done it in February.
I'll do the next one in February.