The setup
Two deploy workflows. One for staging. One for production.
Same steps. Same structure. Maintained separately.
They started identical. Within weeks, they weren't.
A missing SCP step in production meant database migrations never reached the server. Production had no item_type column. The cook dashboard — responsible for separating food and drink orders — broke. No orders appeared.
This is how the drift happened, how I found it, and how I made it structurally impossible.
The system
A restaurant ordering platform deployed via GitHub Actions into Docker containers.
Two environments:
- Staging → auto-deploys on push
- Production → deploys manually
Each environment had its own workflow file.
How drift happens
Nobody writes two different workflows on purpose. You copy one file, change the secrets and domain names, and move on.
Then:
- You fix a bug in staging → forget to port the fix to production
- Someone adds a step to production → forgets to add it to staging
Now the files are different in ways nobody tracks.
| Staging workflow | Production workflow |
|---|---|
| Identical to production | Identical to staging |
| + SCP step added; + migrations enabled | No SCP step; still uses schema.sql |
| + fail-loudly; + grep comments | + SSH key validation; no grep comments |
Each change made sense on its own. Nobody noticed the divergence because nobody diffed the files. They looked similar enough that you assumed they were the same.
Martin Kleppmann describes this class of problem in Designing Data-Intensive Applications — the challenge of maintaining consistency across replicas. Two workflow files are replicas of the same deployment logic. Without a sync mechanism, they will diverge. The question isn't if, but when. And whether you'll notice before it matters.
The bug
Staging got a new step: SCP the database migration files to the server before deploying.
These files are the record of every structural change to the database: every column added, every table altered, in order. Copy them to the server and run them, and the database arrives at the correct state. Skip them and the database falls behind.
This replaced the old pattern of baking schema.sql into Docker's init-scripts, which only run on first container creation.

    # Added to deploy-staging.yml
    - name: Copy docker-compose and migrations to server
      run: |
        scp docker-compose.staging.yml ${USER}@${HOST}:~/app/
        scp -r migrations ${USER}@${HOST}:~/app/
        scp run-migrations.sh ${USER}@${HOST}:~/app/

Production never got this step.
What broke
A migration that added a new column, item_type, ran on staging. The cook dashboard filters by item type.
- Food → cook dashboard
- Drinks → bartender dashboard
Production never received the migration. The column didn't exist. The query failed silently. No orders appeared.
Why it passed checks
Both environments passed health checks.
App loaded. API responded.
But the check only tested whether the server returned HTTP 200. It didn't test whether the schema was correct.
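A health check that also verifies the schema would have caught this. The step below is a minimal sketch, not the author's actual check. It assumes a Postgres container, a table named `orders`, a `postgres` database user, and working SSH access from the runner, none of which the article confirms. `USER` and `HOST` are the same environment variables the SCP step uses, and the container and database names are the staging values from the caller shown later.

```yaml
# Hypothetical post-deploy step: fail the deploy if the expected column is missing.
- name: Verify schema after deploy
  env:
    PG_CONTAINER: app-postgres-staging  # staging container name
    DB_NAME: app_staging                # staging database name
    DB_USER: postgres                   # assumption: default superuser
  run: |
    # Ask Postgres whether orders.item_type exists ("orders" is an assumed table name).
    result=$(ssh "${USER}@${HOST}" \
      "docker exec ${PG_CONTAINER} psql -U ${DB_USER} -d ${DB_NAME} -tAc \
      \"SELECT 1 FROM information_schema.columns \
      WHERE table_name = 'orders' AND column_name = 'item_type' LIMIT 1\"")
    if [ "${result}" != "1" ]; then
      echo "Schema check failed: orders.item_type is missing" >&2
      exit 1
    fi
```

Unlike an HTTP 200 probe, this fails the job when the column is absent, so a skipped migration surfaces at deploy time rather than in the dashboard.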
The audit
I diffed both workflow files. 41 differences. I sorted each one.
| Category | Count | Examples |
|---|---|---|
| Intentional | 12 | Different secrets, domains, image tags, compose file names, database names |
| Drifted | 7 | Missing SCP step, missing migrations, different SSH key handling, different cache settings, inconsistent comments |
| Shared | 22 | Build steps, Docker login, health check pattern, .env handling, container restart |
The pattern was obvious:
Intentional differences → values.
Drifted differences → logic.
Values should vary. Logic should not.
The fix: one workflow, two callers
GitHub Actions supports reusable workflows via workflow_call.
Instead of duplicating logic:
- Put all deployment steps in one shared workflow
- Let each environment pass its own parameters
Before (two separate workflow files): duplication, drift risk, hidden differences.
After (one shared workflow, two callers): shared logic, different parameters, can't drift.
Shared workflow (core idea)

    # deploy.yml
    on:
      workflow_call:
        inputs:
          environment:
            required: true
            type: string
          compose_file:
            required: true
            type: string
          image_tag:
            required: true
            type: string
          customer_url:
            required: true
            type: string
          staff_url:
            required: true
            type: string
          postgres_container:
            required: true
            type: string
          database_name:
            required: true
            type: string
          payment_mode:
            required: true
            type: string
          use_docker_cache:
            type: boolean
            default: true
          run_e2e_tests:
            type: boolean
            default: false
        secrets:
          SSH_KEY:
            required: true
          HOST:
            required: true
          USER:
            required: true
          STRIPE_SECRET_KEY:
            required: true
          STRIPE_WEBHOOK_SECRET:
            required: true
          CLOUDFLARE_ZONE_ID:
            required: true
          CLOUDFLARE_API_TOKEN:
            required: true

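The inputs and secrets only pay off once the steps consume them. Here is a minimal sketch of what the `jobs` section of deploy.yml could look like, reusing the SCP step from the staging workflow. The runner label, the checkout step, the key file path, and the `ssh-keyscan` host trust step are assumptions; the article does not show the real job.

```yaml
# deploy.yml (continued): a hypothetical jobs section
jobs:
  deploy:
    runs-on: ubuntu-latest              # assumption
    env:
      HOST: ${{ secrets.HOST }}
      USER: ${{ secrets.USER }}
    steps:
      - uses: actions/checkout@v4

      - name: Load SSH key
        env:
          SSH_KEY: ${{ secrets.SSH_KEY }}
        run: |
          mkdir -p ~/.ssh
          printf '%s\n' "${SSH_KEY}" > ~/.ssh/deploy_key
          chmod 600 ~/.ssh/deploy_key
          ssh-keyscan -H "${HOST}" >> ~/.ssh/known_hosts

      # The same step for every environment; only the compose file name varies.
      - name: Copy docker-compose and migrations to server
        run: |
          scp -i ~/.ssh/deploy_key "${{ inputs.compose_file }}" "${USER}@${HOST}:~/app/"
          scp -i ~/.ssh/deploy_key -r migrations "${USER}@${HOST}:~/app/"
          scp -i ~/.ssh/deploy_key run-migrations.sh "${USER}@${HOST}:~/app/"
```

Because the step lives in one file, there is no second copy from which the migration transfer can go missing.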
Caller (staging)

    name: Deploy to Staging
    on:
      push:
        branches: [staging]
      workflow_dispatch:
    jobs:
      deploy:
        permissions:
          contents: read
          packages: write
        uses: ./.github/workflows/deploy.yml
        with:
          environment: staging
          compose_file: docker-compose.staging.yml
          image_tag: staging
          customer_url: https://staging.example.com
          staff_url: https://staging-manage.example.com
          postgres_container: app-postgres-staging
          database_name: app_staging
          payment_mode: dummy
          use_docker_cache: false
          run_e2e_tests: true
        secrets:
          SSH_KEY: ${{ secrets.STAGING_SSH_KEY }}
          HOST: ${{ secrets.STAGING_HOST }}

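For comparison, the production caller would differ only in its trigger and its values. This is a sketch: the article does not show the production file, so every value below (file name, URLs, container and database names, `payment_mode: live`, and the production secret names) is a hypothetical placeholder modeled on the staging caller.

```yaml
name: Deploy to Production
on:
  workflow_dispatch:   # manual trigger only, since production deploys manually
jobs:
  deploy:
    permissions:
      contents: read
      packages: write
    uses: ./.github/workflows/deploy.yml
    with:
      environment: production
      compose_file: docker-compose.production.yml   # hypothetical
      image_tag: production                          # hypothetical
      customer_url: https://example.com              # hypothetical
      staff_url: https://manage.example.com          # hypothetical
      postgres_container: app-postgres-production    # hypothetical
      database_name: app_production                  # hypothetical
      payment_mode: live                             # hypothetical value
      # use_docker_cache and run_e2e_tests fall back to the shared defaults
    secrets:
      SSH_KEY: ${{ secrets.PRODUCTION_SSH_KEY }}                            # hypothetical name
      HOST: ${{ secrets.PRODUCTION_HOST }}                                  # hypothetical name
      USER: ${{ secrets.PRODUCTION_USER }}                                  # hypothetical name
      STRIPE_SECRET_KEY: ${{ secrets.STRIPE_SECRET_KEY_PRODUCTION }}        # hypothetical name
      STRIPE_WEBHOOK_SECRET: ${{ secrets.STRIPE_WEBHOOK_SECRET_PRODUCTION }} # hypothetical name
      CLOUDFLARE_ZONE_ID: ${{ secrets.CLOUDFLARE_ZONE_ID }}
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```

The file contains values and nothing else, which is the property that keeps logic from drifting between environments.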
Now:
- Add a step → both environments get it
- Fix a bug → both environments get it
- Forget something → impossible (by structure)
Tradeoffs
Reusable workflows have constraints.
1. Explicit secrets
By default, a workflow invoked through workflow_call does not inherit the caller's secrets. The caller must pass each one by name, so it lists seven secrets even though it adds no logic:

    secrets:
      SSH_KEY: ${{ secrets.STAGING_SSH_KEY }}
      HOST: ${{ secrets.STAGING_HOST }}
      USER: ${{ secrets.STAGING_USER }}
      STRIPE_SECRET_KEY: ${{ secrets.STRIPE_SECRET_KEY_STAGING }}
      STRIPE_WEBHOOK_SECRET: ${{ secrets.STRIPE_WEBHOOK_SECRET_STAGING }}
      CLOUDFLARE_ZONE_ID: ${{ secrets.CLOUDFLARE_ZONE_ID }}
      CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}

You can use secrets: inherit to pass all secrets implicitly, but that gives the shared workflow access to secrets it doesn't need. I chose explicit. The repetition is worth knowing exactly what each environment exposes.
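For reference, the `secrets: inherit` variant of the staging caller would look roughly like the sketch below. One consequence worth noting: inherited secrets keep the caller's names, so the shared workflow would have to read `secrets.STAGING_SSH_KEY` rather than `secrets.SSH_KEY`, which ties the shared logic to environment-specific names. The article does not say how that renaming would be handled, so treat this as an illustration of the syntax only.

```yaml
jobs:
  deploy:
    permissions:
      contents: read
      packages: write
    uses: ./.github/workflows/deploy.yml
    with:
      environment: staging
      compose_file: docker-compose.staging.yml
      image_tag: staging
      customer_url: https://staging.example.com
      staff_url: https://staging-manage.example.com
      postgres_container: app-postgres-staging
      database_name: app_staging
      payment_mode: dummy
      use_docker_cache: false
      run_e2e_tests: true
    # Passes every secret the caller can see, under its original name.
    secrets: inherit
```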
2. Drift detection ≠ drift prevention
The unified workflow prevents future code drift. It doesn't detect existing drift. I found the seven drifted differences by diffing the files by hand. An automated check that compares staging and production configurations on a schedule would catch this earlier.
That work is tracked as TICKET-013: Establish Environment Parity with Ansible. It isn't built yet. The workflow unification came first because it prevents the most damaging class of drift: missing deployment steps.
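In the meantime, a scheduled workflow could approximate the check. The sketch below is not the Ansible approach the ticket names; it is a simpler stand-in that compares the column lists of both databases once a day. It assumes Postgres, a `postgres` database user, key-based SSH access from the runner, and `PRODUCTION_*` secrets plus production container and database names that mirror the staging ones. All of those are hypothetical.

```yaml
name: Environment parity check
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
  workflow_dispatch:
jobs:
  compare-schemas:
    runs-on: ubuntu-latest
    env:
      # One row per column, in a stable order, so the two outputs can be diffed.
      QUERY: >-
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, column_name
    steps:
      - name: Load SSH keys
        env:
          STAGING_KEY: ${{ secrets.STAGING_SSH_KEY }}
          PRODUCTION_KEY: ${{ secrets.PRODUCTION_SSH_KEY }}   # hypothetical secret
        run: |
          mkdir -p ~/.ssh
          printf '%s\n' "${STAGING_KEY}" > ~/.ssh/staging_key
          printf '%s\n' "${PRODUCTION_KEY}" > ~/.ssh/production_key
          chmod 600 ~/.ssh/staging_key ~/.ssh/production_key

      - name: Dump column lists
        env:
          STAGING: ${{ secrets.STAGING_USER }}@${{ secrets.STAGING_HOST }}
          PRODUCTION: ${{ secrets.PRODUCTION_USER }}@${{ secrets.PRODUCTION_HOST }}   # hypothetical secrets
        run: |
          ssh -i ~/.ssh/staging_key -o StrictHostKeyChecking=accept-new "${STAGING}" \
            "docker exec app-postgres-staging psql -U postgres -d app_staging -tAc \"${QUERY}\"" \
            > staging-columns.txt
          # Production container and database names are assumptions.
          ssh -i ~/.ssh/production_key -o StrictHostKeyChecking=accept-new "${PRODUCTION}" \
            "docker exec app-postgres-production psql -U postgres -d app_production -tAc \"${QUERY}\"" \
            > production-columns.txt

      - name: Fail if the schemas differ
        run: diff -u staging-columns.txt production-columns.txt
```

`diff` exits non-zero when the files differ, so a missing column such as item_type would turn the scheduled run red. A run that fails between a staging deploy and the matching production deploy is expected, which makes the failure double as a reminder.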
What changed
One commit for the implementation. One merge PR. The refactor took a morning.
The pattern
The DRY principle (Don't Repeat Yourself), introduced by Andy Hunt and Dave Thomas in The Pragmatic Programmer, applies to infrastructure code the same way it applies to application code. Two functions with the same logic should be one function called twice with different arguments. Two workflow files with the same deployment logic should be one workflow called twice with different parameters.
The difference in CI/CD: the cost of drift is delayed. Duplicate application code causes bugs immediately. Change one copy and the other behaves differently. Duplicate workflow code causes bugs on the next deploy that hits the drifted path. You might not encounter it for weeks. When you do, the symptom (missing database column) has no obvious connection to the cause (missing SCP step in a workflow file you haven't opened in a month).
The takeaway
If you maintain multiple workflow files:
- Diff them
- Categorize every difference as intentional (values) or drifted (logic)
- Extract the shared logic into a reusable workflow
- Keep environment files thin — parameters and secrets only
- Test on staging first
- Accept deployment timing differences
That last point matters because unified code doesn't mean synchronized deployments. If production deploys are manual, build a habit or an alert for triggering them after merges.
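One way to build such an alert is a small workflow that opens a reminder issue whenever the production-bound branch changes. This is only a sketch: the branch name `main`, the issue wording, and the choice of an issue as the alert channel are assumptions, not something the article describes.

```yaml
name: Remind to deploy production
on:
  push:
    branches: [main]   # assumption: the branch production is deployed from
permissions:
  issues: write
jobs:
  remind:
    runs-on: ubuntu-latest
    steps:
      - name: Open a reminder issue
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh issue create \
            --repo "${GITHUB_REPOSITORY}" \
            --title "Production deploy pending for ${GITHUB_SHA::7}" \
            --body "Commit ${GITHUB_SHA} was merged. Trigger the production deploy workflow manually."
```

Closing the issue after the manual deploy gives a lightweight record of which merges have reached production.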