A customer placed an order in production. The confirmation page loaded.

But the cook and bartender dashboards showed nothing. The order existed in the database: payment_status: pending

That was the problem.

The dashboards filter on payment_status = 'paid'. An order that never gets confirmed, never shows up.

No alert. No error. No food. Then the customer waits.

Production incident flow
Customer places orderConfirmation page loads successfully
payment_status = pendingOrder created in database but never confirmed
Dashboard receives nothingFilters on payment_status = 'paid'
Kitchen sees no orderOrder never made. Customer waits.
Root cause discoveredDatabase schema not applied in production

What actually happened

Staging worked. Production didn't.

Same code. Same Docker images. Different outcome.

The difference: staging had the correct database schema, but production didn't.

The root cause

The deploy workflow had this line:

deploy.yml
cat ~/[app]/init-scripts/schema.sql | docker exec -i [app]-postgres-production \
  psql -U [app]_user -d [app] || echo 'Schema may already exist'

That last part is the lie:

the problem
|| echo 'Schema may already exist'

If the file is missing → the pipeline prints a message and continues.

If the SQL has errors → same thing.

If psql can't connect → same thing.

The build stays green but the schema is not applied. And nobody knows.

I wrote a deploy step that skipped verification and moved on. I never checked if the schema applied. The app never checked if its tables existed. The health check returned HTTP 200 because the server booted, not because it worked.

What the pipeline looked like
Commit
CI Passed
green
Deploy
green
Production Broken
broken

The audit

I grepped both workflow files for || true, || echo, and unguarded curl calls.

Eight hits. Two broke production.

RiskPatternImpact
High schema.sql || echo (staging) DB schema not applied, app broken
High schema.sql || echo (production) DB schema not applied, app broken
Medium ssh-keygen || echo (production) Bad key undetected, deploy fails later
Low grep || true (staging) Expected behavior, no risk
Low grep || true (production x2) Expected behavior, no risk
Low curl without -f flag (staging) Cache not purged, stale content
Low curl without -f flag (production) Cache not purged, stale content

Five were harmless. One caused confusion. Two broke production.

The fix

Three layers. Each catches what the others miss.

1

Stop on first failure

Added set -euo pipefail to every SSH block:

deploy.yml
ssh user@server "
  set -euo pipefail
  docker login ghcr.io ...
  docker compose pull
  docker compose up -d
"
FlagEffect
-eExit immediately if any command fails
-uTreat unset variables as errors
-o pipefailFail if any command in the pipe fails

Without pipefail, a pipeline like cat file | psql only checks the exit code of psql. If cat fails because the file is missing, the pipe still reports success. psql receives empty input and exits 0.

With pipefail, the pipe fails if either side fails.

This alone would have caught the schema bug.

2

Replace silence with validation

Replace every || echo with proper error handling.

Silent failure
  • || echo
  • || true
  • Ignored errors
  • Green pipeline
Controlled failure
  • Validation
  • Logs
  • Warnings
  • Explicit errors

Schema validation — before:

before
cat schema.sql | docker exec -i postgres psql ... || echo 'Schema may already exist'

After:

after
if [ ! -f ~/[app]/init-scripts/schema.sql ]; then
  echo 'ERROR: schema.sql not found! Did SCP step fail?'
  exit 1
fi
echo 'Applying database schema...'
cat ~/[app]/init-scripts/schema.sql | docker exec -i postgres psql ...
echo 'Schema applied successfully'

SSH key validation — before:

before
ssh-keygen -l -f ~/.ssh/deploy_key || echo "Key validation failed"

After:

after
if ! ssh-keygen -l -f ~/.ssh/deploy_key; then
  echo "ERROR: SSH key validation failed!"
  echo "Check PRODUCTION_SSH_KEY secret format"
  exit 1
fi
echo "SSH key validated successfully"
3

Validate before deploying

Added a validation step before any file gets to the server:

deploy.yml — pre-deploy validation
- name: Validate required files exist
  run: |
    echo "Validating deployment files..."
    if [ ! -f infrastructure/docker/docker-compose.production.yml ]; then
      echo "ERROR: docker-compose.production.yml not found"
      exit 1
    fi
    if [ ! -d infrastructure/docker/init-scripts ]; then
      echo "ERROR: init-scripts directory not found"
      exit 1
    fi
    if [ ! -f infrastructure/docker/init-scripts/schema.sql ]; then
      echo "ERROR: schema.sql not found"
      exit 1
    fi
    echo "All required files present"

Catches missing files on the build runner, where the error message is obvious.

Pipeline after fix
Commit
Validation
files checked
Schema Check
verified
Deployment
deployed
Production Verified
healthy

Tradeoffs

Not every || true is wrong. grep exits 1 when it finds zero matches. That's correct behavior, not an error. If your .env file has no STRIPE_ lines, grep -v '^STRIPE_' .env exits 1 and set -e kills the build.

The fix is a comment, not removal:

intentional || true
# || true is intentional: grep exits 1 when no matches found;
# this is expected if .env has no STRIPE_ lines
grep -v '^STRIPE_' .env > .env.tmp || true

The Cloudflare cache purge was a different case. After making it fail loudly, it blocked every deploy. The API token didn't have the right permissions.

A cache issue was killing deploys that had nothing to do with the application.

So I switched to continue-on-error: true with a warning:

deploy.yml — non-blocking step
- name: Purge Cloudflare Cache
  continue-on-error: true  # TODO: fix API token permissions
  run: |
    if ! curl -sf ... purge_cache; then
      echo "WARNING: Cloudflare cache purge failed (non-blocking)"
    else
      echo "Cloudflare cache purged successfully"
    fi

There's a spectrum:

Silent failure
|| echo
Loud failure
exit 1
Right answer
continue-on-error
+ WARNING log
+ TODO comment

Silence looks like success.

Failure looks like a problem.

The right answer tells the truth.

Takeaway

Add set -euo pipefail to every shell block. Audit every || true and || echo.

For each one, ask: if this fails, should the build continue?

If yes, document it.

If no, let it fail.

The rule

Validate files before deploying them. If you allow a failure, log a warning. If a deploy step doesn't verify its own success, you're guessing it worked. Five commits fixed that.

Originally published on Medium. Read the original article →