The setup

Two deploy workflows. One for staging. One for production.

Same steps. Same structure. Maintained separately.

They started identical. Within weeks, they weren't.

A missing SCP step in production meant database migrations never reached the server. Production had no item_type column. The cook dashboard — responsible for separating food and drink orders — broke. No orders appeared.

This is how the drift happened, how I found it, and how I made it structurally impossible.

The system

A restaurant ordering platform deployed via GitHub Actions into Docker containers.

Two environments:

  • Staging → auto-deploys on push
  • Production → deploys manually

Each environment had its own workflow file.

How drift happens

Nobody writes two different workflows on purpose. You copy one file, change the secrets and domain names, and move on.

Then:

  • You fix a bug in staging → forget to port the fix to production
  • Someone adds a step to production → forget to add it to staging

Now the files are different in ways nobody tracks.

Workflow drift timeline

  • Week 1
      staging.yml: identical to production
      production.yml: identical to staging
  • Week 3
      staging.yml: + SCP step added, + migrations enabled
      production.yml: no SCP step, still uses schema.sql
  • Week 5
      staging.yml (229 lines): + fail-loudly, + grep comments
      production.yml (190 lines): + SSH key validation, no grep comments

Each change made sense on its own. Nobody noticed the divergence because nobody diffed the files. They looked similar enough that you assumed they were the same.

Reference

Martin Kleppmann describes this class of problem in Designing Data-Intensive Applications — the challenge of maintaining consistency across replicas. Two workflow files are replicas of the same deployment logic. Without a sync mechanism, they will diverge. The question isn't if, but when. And whether you'll notice before it matters.

The bug

Staging got a new step: SCP the database migration files to the server before deploying.

These files are the record of every structural change to the database: every column added, every table altered, in order. Copy them to the server and run them, and the database arrives at the correct state. Skip them and the database falls behind.

This replaced the old pattern of baking schema.sql into Docker's init-scripts, which only run on first container creation.
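
The article does not show run-migrations.sh, so the following is only a minimal sketch of the idea: apply whatever has not been applied yet, in order, on every deploy (unlike init scripts, which run once). It assumes migrations are .sql files whose names sort in run order and that applied names are tracked in a plain-text ledger file. The function names and the ledger are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions: migration files are named so that they sort in
# run order (001_init.sql, 002_add_item_type.sql, ...) and applied names are
# recorded one per line in a ledger file. Both conventions are hypothetical.

# Print the migrations in directory $1 that are not yet listed in ledger $2.
pending_migrations() (
  dir=$1
  ledger=$2
  [ -f "$ledger" ] || ledger=/dev/null   # no ledger yet: everything is pending
  for path in "$dir"/*.sql; do
    [ -e "$path" ] || continue           # directory has no .sql files
    name=$(basename "$path")
    # -q quiet, -x whole-line match, -F literal string
    grep -qxF -- "$name" "$ledger" || printf '%s\n' "$name"
  done
)

# Apply each pending migration with the command given as $3..., stopping at
# the first failure. A name is written to the ledger only after it succeeds.
apply_pending() (
  dir=$1
  ledger=$2
  shift 2
  pending_migrations "$dir" "$ledger" | while IFS= read -r name; do
    "$@" "$dir/$name" </dev/null || exit 1
    printf '%s\n' "$name" >> "$ledger"
  done
)
```

A call might look like `apply_pending ~/app/migrations ~/app/.applied psql -v ON_ERROR_STOP=1 -d app_staging -f`, where the ledger path is an assumption. Because a failed migration is never recorded, the next deploy retries it instead of skipping it.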

deploy-staging.yml
# Added to deploy-staging.yml
- name: Copy docker-compose and migrations to server
  run: |
    scp docker-compose.staging.yml ${USER}@${HOST}:~/app/
    scp -r migrations ${USER}@${HOST}:~/app/
    scp run-migrations.sh ${USER}@${HOST}:~/app/

Production never got this step.

What broke

A migration that added a new column, item_type, ran on staging. The cook dashboard filters by item type:

  • Food → cook dashboard
  • Drinks → bartender dashboard

Production never received the migration. The column didn't exist. The query failed silently. No orders appeared.

Staging (working)

  1. Build Docker images
  2. SCP migrations to server
  3. Deploy via SSH
  4. run-migrations.sh
  5. Purge Cloudflare cache
  6. Health check

Result: item_type exists · dashboard works

Production (broken)

  1. Build Docker images
  2. SCP migrations to server (missing)
  3. Deploy via SSH
  4. run-migrations.sh (missing)
  5. Purge Cloudflare cache
  6. Health check

Result: item_type missing · dashboard broken

Why it passed checks

Both environments passed health checks.

App loaded. API responded.

But the check only tested whether the server returned HTTP 200. It didn't test whether the schema was correct.
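
A health check can also ask the database whether the schema is what the code expects. The sketch below is one way to do that, not the check the project actually runs. It assumes psql is available inside the Postgres container, that the database user is postgres, and that the column lives on a table called orders; all three are assumptions, since the article names only the column.

```shell
#!/bin/sh
# Sketch only. Assumed, not from the article: database user "postgres",
# psql installed in the container, and a table named "orders".

# column_exists CONTAINER DATABASE TABLE COLUMN
# Exit 0 if the column exists, 1 if it does not, 2 if the query could not run.
column_exists() (
  container=$1 database=$2 table=$3 column=$4
  query="SELECT 1 FROM information_schema.columns WHERE table_name = '$table' AND column_name = '$column' LIMIT 1"
  # -t: rows only, -A: unaligned output, -c: run a single command
  result=$(docker exec "$container" psql -U postgres -d "$database" -tAc "$query") || exit 2
  [ "$result" = "1" ]
)
```

Used after the HTTP check, for example `column_exists app-postgres-staging app_staging orders item_type || exit 1`, this turns a missing migration into a failed deploy instead of an empty dashboard.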

The audit

I diffed both workflow files. 41 differences. I sorted each one.

| Category | Count | Examples |
|---|---|---|
| Intentional | 12 | Different secrets, domains, image tags, compose file names, database names |
| Drifted | 7 | Missing SCP step, missing migrations, different SSH key handling, different cache settings, inconsistent comments |
| Shared | 22 | Build steps, Docker login, health check pattern, .env handling, container restart |

The pattern was obvious:

Key insight

Intentional differences → values.
Drifted differences → logic.

Values should vary. Logic should not.
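
The values-versus-logic split can be approximated mechanically: mask the values that are expected to differ, then diff what is left. This is a rough sketch, not the audit process used here (that was done by hand). The patterns only mask the environment names; domains, image tags, and similar values would need their own patterns.

```shell
#!/bin/sh
# Sketch only: the substitution patterns are illustrative and incomplete.

# Replace environment-specific names with a placeholder so only logic remains.
normalize_workflow() {
  sed -E \
    -e 's/(staging|production)/<ENV>/g' \
    -e 's/(STAGING|PRODUCTION)/<ENV>/g' \
    "$1"
}

# logic_drift FILE_A FILE_B
# Print the diff of the normalized files. Exit 0 if they match, 1 if they differ.
logic_drift() (
  a=$(mktemp) b=$(mktemp)
  normalize_workflow "$1" > "$a"
  normalize_workflow "$2" > "$b"
  status=0
  diff "$a" "$b" || status=$?
  rm -f "$a" "$b"
  exit "$status"
)
```

Anything `logic_drift deploy-staging.yml deploy-production.yml` still prints is either a value the patterns missed or a step that exists in only one file.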

The fix: one workflow, two callers

GitHub Actions supports reusable workflows via workflow_call.

Instead of duplicating logic:

  • Put all deployment steps in one shared workflow
  • Let each environment pass its own parameters

Before: 419 lines across 2 files

  • deploy-staging.yml: 229 lines, all logic
  • deploy-production.yml: 190 lines, same logic copied

Duplication. Drift risk. Hidden differences.

After: 348 lines across 3 files

  • deploy.yml (shared): 291 lines, all deployment logic
  • staging.yml and production.yml: 30 + 27 lines, parameters only

Shared logic. Different parameters. Can't drift.

Shared workflow (core idea)

deploy.yml
# deploy.yml
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      compose_file:
        required: true
        type: string
      image_tag:
        required: true
        type: string
      customer_url:
        required: true
        type: string
      staff_url:
        required: true
        type: string
      postgres_container:
        required: true
        type: string
      database_name:
        required: true
        type: string
      payment_mode:
        required: true
        type: string
      use_docker_cache:
        type: boolean
        default: true
      run_e2e_tests:
        type: boolean
        default: false
    secrets:
      SSH_KEY:
        required: true
      HOST:
        required: true
      USER:
        required: true
      STRIPE_SECRET_KEY:
        required: true
      STRIPE_WEBHOOK_SECRET:
        required: true
      CLOUDFLARE_ZONE_ID:
        required: true
      CLOUDFLARE_API_TOKEN:
        required: true

Caller (staging)

deploy-staging.yml (30 lines total)
name: Deploy to Staging

on:
  push:
    branches: [staging]
  workflow_dispatch:

jobs:
  deploy:
    permissions:
      contents: read
      packages: write
    uses: ./.github/workflows/deploy.yml
    with:
      environment: staging
      compose_file: docker-compose.staging.yml
      image_tag: staging
      customer_url: https://staging.example.com
      staff_url: https://staging-manage.example.com
      postgres_container: app-postgres-staging
      database_name: app_staging
      payment_mode: dummy
      use_docker_cache: false
      run_e2e_tests: true
    secrets:
      SSH_KEY: ${{ secrets.STAGING_SSH_KEY }}
      HOST: ${{ secrets.STAGING_HOST }}

Now:

  • Add a step → both environments get it
  • Fix a bug → both environments get it
  • Forget something → impossible (by structure)
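
That guarantee holds only while the callers stay free of logic. A small guard can enforce it in CI. The sketch below is hypothetical and not part of the project: it rejects a caller file that defines its own steps or inline scripts, and requires that the file delegate to a workflow under ./.github/workflows/.

```shell
#!/bin/sh
# Sketch only: a hypothetical lint for "thin" caller workflows.

# check_thin_caller FILE
# Exit 1 if FILE contains steps:/run: keys or does not call a local reusable workflow.
check_thin_caller() (
  file=$1
  # Match "steps:" or "run:" as a key, optionally as the first key of a list item.
  if grep -nE '^[[:space:]]*(-[[:space:]]*)?(steps|run):' "$file"; then
    echo "$file: contains deployment logic; move it into the shared workflow" >&2
    exit 1
  fi
  grep -qE '^[[:space:]]*uses:[[:space:]]*\./\.github/workflows/' "$file" || {
    echo "$file: does not call a reusable workflow" >&2
    exit 1
  }
)
```

Running it over each caller, for example `check_thin_caller .github/workflows/deploy-staging.yml`, prints the offending lines with their line numbers when someone adds a step to the wrong file.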

Tradeoffs

Reusable workflows have constraints.

1. Explicit secrets

By default, a reusable workflow cannot see the caller's secrets, so the caller passes each one by name. The staging caller lists seven secrets even though it adds no logic:

secrets block
secrets:
  SSH_KEY: ${{ secrets.STAGING_SSH_KEY }}
  HOST: ${{ secrets.STAGING_HOST }}
  USER: ${{ secrets.STAGING_USER }}
  STRIPE_SECRET_KEY: ${{ secrets.STRIPE_SECRET_KEY_STAGING }}
  STRIPE_WEBHOOK_SECRET: ${{ secrets.STRIPE_WEBHOOK_SECRET_STAGING }}
  CLOUDFLARE_ZONE_ID: ${{ secrets.CLOUDFLARE_ZONE_ID }}
  CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}

You can use secrets: inherit to pass all secrets implicitly, but that gives the shared workflow access to secrets it doesn't need. I chose explicit. The repetition is worth knowing exactly what each environment exposes.

2. Drift detection ≠ drift prevention

The unified workflow prevents future code drift. It doesn't detect existing drift. I found the seven drifted differences by diffing the files by hand. An automated check that compares staging and production configurations on a schedule would catch this earlier.
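
One shape such a check could take, sketched under assumptions: each environment exports a list of facts about itself, one per line (for instance table.column names dumped from information_schema), and a scheduled job compares the two lists. How the lists are produced is left open. The function name and the labels are hypothetical.

```shell
#!/bin/sh
# Sketch only: compares two line-oriented listings, one per environment.
# What goes into the listings (columns, applied migrations, ...) is an assumption.

# env_drift STAGING_LIST PRODUCTION_LIST
# Print entries present in only one list. Exit 0 if the lists match, 1 otherwise.
env_drift() (
  a=$(mktemp) b=$(mktemp)
  # comm needs sorted input; use the same collation for sort and comm.
  LC_ALL=C sort -u "$1" > "$a"
  LC_ALL=C sort -u "$2" > "$b"
  report=$(
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/staging only: /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/production only: /'
  )
  rm -f "$a" "$b"
  [ -z "$report" ] && exit 0
  printf '%s\n' "$report"
  exit 1
)
```

With column listings as input, the incident described here would have shown up as a single line: `staging only: orders.item_type` (the table name is assumed).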

Open ticket

TICKET-013: Establish Environment Parity with Ansible. Not built yet. The workflow unification came first because it prevents the most damaging class of drift: missing deployment steps.

What changed

| Metric | Before | After |
|---|---|---|
| Workflow files | 2 | 1 + 2 callers |
| Total lines | 419 | 348 |
| Lines that can drift | 419 | 57 |
| Driftable logic | 229 / 190 | 0 |
| Missing SCP bug | Possible | Impossible |
| Change in one place | No | Yes |

One commit for the implementation. One merge PR. The refactor took a morning.

The pattern

The DRY principle, from Andy Hunt and Dave Thomas's The Pragmatic Programmer, applies to infrastructure code the same way it applies to application code. Two functions with the same logic should be one function called twice with different arguments. Two workflow files with the same deployment logic should be one workflow called twice with different parameters.

The difference in CI/CD: the cost of drift is delayed. Duplicate application code causes bugs immediately. Change one copy and the other behaves differently. Duplicate workflow code causes bugs on the next deploy that hits the drifted path. You might not encounter it for weeks. When you do, the symptom (missing database column) has no obvious connection to the cause (missing SCP step in a workflow file you haven't opened in a month).

The takeaway

If you maintain multiple workflow files:

  1. Diff them
  2. Categorize every difference as intentional (values) or drifted (logic)
  3. Extract the shared logic into a reusable workflow
  4. Keep environment files thin — parameters and secrets only
  5. Test on staging first
  6. Accept deployment timing differences

Unified code doesn't mean synchronized deployments. If production deploys are manual, build a habit or an alert for triggering them after merges.
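
An alert of that kind can be as small as counting the commits production has not received. The sketch below assumes the deployed commit SHA is recorded somewhere at deploy time and that the release branch is origin/main. Neither detail comes from the article, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumed: the SHA of the last production deploy is known,
# and the branch production deploys from is origin/main.

# commits_behind DEPLOYED_SHA [BRANCH]
# Print how many commits on BRANCH are not contained in DEPLOYED_SHA.
commits_behind() {
  git rev-list --count "$1..${2:-origin/main}"
}

# check_production_freshness DEPLOYED_SHA [BRANCH]
# Exit 0 if up to date, 1 if production is behind, 2 if git could not answer.
check_production_freshness() (
  behind=$(commits_behind "$@") || exit 2
  if [ "$behind" -gt 0 ]; then
    echo "production is $behind commit(s) behind"
    exit 1
  fi
)
```

Run on a schedule, a non-zero exit from `check_production_freshness "$DEPLOYED_SHA"` can fail the job and notify whoever triggers production deploys.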

Originally published on Medium.