I recently refactored a data pipeline: 14 sequential scripts, 6 intermediate CSV files, three separate API calls to the same source.

It worked. But it was one upstream schema change away from breaking.

The actual problem wasn't the code

When I drew out the pipeline, the issue became obvious. We were pulling data from Shopify three separate times, flattening the responses in Python, writing the results to CSV files, then loading it into PostgreSQL, where we immediately start joining it back together.

Key insight

We were doing in Python what the database was designed to do natively.

The 14 scripts weren't the disease. They were a symptom of one design decision made early on: using Python as the transformation layer. Over time, that decision had compounded.

The book that changed how I saw the system

I was reading Martin Kleppmann's Designing Data-Intensive Applications during this period. Two ideas landed differently once I had a real system in front of me:

Schema-on-read vs schema-on-write

The old pipeline enforced schema at write time. Python needed to know the exact structure of the JSON before it could flatten it. Any change from Shopify's API broke the script.

If instead you store the raw JSON and let the database read from it, you can handle schema evolution gracefully. New field in the API response? It's already in your database. Just update the query that reads it.

Push computation to the data

We were pulling raw JSON into Python, transforming it, then loading it into PostgreSQL. PostgreSQL is built for transforming data. The right move was to store the raw JSON and let SQL do the work.

There's a gap between understanding these ideas in the abstract and seeing exactly where they apply in a system you're responsible for. That gap is where the work is.

What the refactor looked like

Old architecture
  • 3 separate API extractions (orders via REST, products via REST, orders via GraphQL)
  • Python flattening scripts to normalize the nested JSON
  • CSV as an intermediate format
  • 14 scripts running sequentially
New architecture
  • A single GraphQL query fetching orders with all nested data in one call
  • Raw JSON inserted directly into PostgreSQL as a JSONB column
  • SQL transforms that extract, cast, and join data at query time
  • 3 scripts total

The results

14 → 3
Scripts reduced
0
Intermediate CSV files
0
Fragile flattening logic

If Shopify changes their response format, the raw data is already in the database. The only adjustment is updating the SQL, not the extraction code.

The tradeoffs I accepted

Every architectural decision introduces tradeoffs. Here are the ones I consciously accepted:

Batch over real-time

Data is as fresh as the last ETL run, not live. For this use case (supplier cost dashboards), that's fine. If the business needed sub-minute freshness, I'd revisit it.

Denormalized final table

The output table has one row per line item, which means order-level fields (customer name, shipping address) repeat across rows. That's a storage tradeoff I accepted because it makes dashboard queries simpler — no complex joins at read time.

Learning curve

PostgreSQL's JSONB functions aren't something every engineer knows. I wrote documentation. That's the cost of using the right tool.

What I'd take to the next project

The pattern

Preserve raw data. Defer transformation.

Storing the raw API response as JSONB means the database is the source of truth, not a derived artifact. You can always reprocess it. You can't go back and re-flatten a CSV that's already been overwritten.

One more lesson: draw the system before you touch it.

The 14-script pipeline was built incrementally, one reasonable addition at a time. The complexity was invisible until the entire system was mapped out on paper.

Originally published on Medium. Read the original article →