What dbt is

Last time, I argued for moving the data cleaning off the app server and into the database. This is the build: how I rebuilt a Shopify pipeline with dbt, what broke on the first run, and how I proved the new version matched the old one, row for row.

The old pipeline cleaned its data in pandas on the app server before anything reached the database. Fourteen scripts handed CSV files down a line. I argued for the opposite: store the raw API responses, then transform them with SQL, inside the database.

That post was the case. This one is the build. The tool was dbt. It rebuilds the same output table as a tested chain of SQL you can read and check.

If you have not met dbt, the short version: you write each transformation as a SQL file. dbt calls it a model. Each model selects from the ones before it, so a chain forms: raw data at the top, cleaned tables in the middle, the finished table at the bottom. dbt reads those references and runs them in order. You get a map of how every table is built, for free.

The part I care about most is the testing. Next to each model, you write what you expect to be true: this column has no nulls, this column is unique, one row per order. dbt checks those claims on every run. What used to be a hopeful comment in a Python file, the database now enforces.

That last test, the one about uniqueness, is where this story turns.

The shape of the build

I kept the structure boring, which is the point. Raw data comes in as seeds and sources. Seeds are small CSVs dbt loads and owns, like the shipping-rate table; sources are tables something else already filled, which dbt only reads and tests. A staging layer cleans each one, renaming columns and fixing types, and nothing else. An intermediate layer does the joins. A final table, the mart, is what the rest of the company reads.

The old pandas script did all of this in one file, in memory, in one pass. Now each step is its own small model with a name and a test. When something breaks, dbt tells you which step and which row.

Before: one pandas script
  • Reads three CSVs, joins, casts, derives, writes — all at once
  • State lives in memory; nothing in between is inspectable
  • A failure points at a line of Python, not a row of data
After: a dbt DAG
  • Small models: staging, intermediate, mart — each one named
  • Every table is queryable and rebuildable on its own
  • Tests ride alongside the models and run every build

A snag worth showing

The move had friction. The first thing that bit me was the IDs. Shopify order IDs are huge, larger than a standard Postgres integer holds. Load them as integers, and the seed fails with an integer out of range before you transform a single row.

The fix is unglamorous: load every seed column as text, then cast it in the staging layer, guarding blanks as you go.

models/staging/stg_line_items.sql SQL
-- seeds load as text; nullif guards blanks before the cast

nullif(price, '')::numeric as price

That one line, repeated across the staging models, is most of what moving to the database feels like: careful translation of the casts and edge cases the old code never made explicit.

The shipping fee, and the bug it was hiding

Now the part I did not expect.

The store charges shipping by weight. A rate table says a parcel between this weight and that weight costs this much. In the old code, one small function did that lookup. Give it a parcel's weight, and it returns the matching price from the rate table:

routes/line_items.py Python
def get_shipping_fee(weight):
    rate = rates[(rates.min_weight <= weight) & (rates.max_weight >= weight)]
    return rate['shipping_fee'].values[0] if not rate.empty else None

Read the last line again. It finds every rate band that matches the weight, then takes .values[0], the first one. For two years, nobody questioned it.

I rebuilt the lookup as a SQL join: a parcel matches a band when its weight falls between the band's minimum and maximum. Then I added the test I add to every table: a check on its grain, what one row is supposed to stand for. Here, one row per order.

It failed.

dbt stopped the build: you promised one row per order, and here are two.

The cause sat in the data, not my code. Past about 570 grams, the weight ranges in the rate table start to overlap. One range runs up to 660 grams; the next already starts at 650. So a 650-gram parcel falls inside both rows at once, each with its own fee.

min_weightmax_weightshipping_feecontains 650g?
640659.9919.80yes
650669.9919.96yes

Nothing in the code raised a flag. .values[0] returns the first row a filter matches, so when a weight fell in two bands, the code kept the lower fee and dropped the other, for every order, for two years. The number was probably right anyway, by luck of the sort order. But the code never recorded that choice. It took the first row and moved on.

The SQL fix makes the choice on purpose:

models/intermediate/int_order_shipping.sql SQL
-- lowest band wins, on purpose (matches the old pandas .values[0])

row_number() over (
  partition by order_id order by min_weight asc
) as rn

-- ... then, in the final select:
where rn = 1

The difference. Same answer as the old code produced: the lowest band wins. The overlap itself is a data-entry error in the rate table, so this copies the old behaviour rather than curing it; the real fix is correcting the weights upstream. Either way, the choice is now explicit in the SQL, with a comment explaining "why", instead of buried in a method call. The grain test passes because the tie is settled in the open.

Proving the new pipeline matched the old

A rewrite is worth trusting only if you can show it gives the same answers. So I built both paths: the new SQL one and a faithful copy of the old logic, reading the raw JSON. Then I wrote a test that compares them row by row, in both directions.

Result. Zero rows different. The new pipeline matches the old one exactly, and dbt rechecks that on every run.

Same outputs, new internals. That makes the whole job a refactor, and the parity test is the proof. The one known defect, the overlapping rate bands, rides along unchanged on purpose; fixing it is separate work, since it would change the answers.

What the move bought me

Last time the argument was about clutter: fewer scripts, fewer files, raw data kept raw. The build delivered that. Fourteen scripts became a handful of small models.

The thing I did not see coming is the one I would lead with now. Rebuilding it in dbt did more than tidy the pipeline. The old code would return an answer, any answer, with no signal that the inputs were ambiguous. dbt checks instead: you declare what should be true, and the first time the data disagrees, the build stops and points at the row.

The overlapping bands had been there for two years. A test that demands one row per order is what surfaced them.

Key takeaway

dbt did not just move transformations. It made assumptions visible and testable.

Part 2 of 2. Grounded in a real dbt build.

Originally published on Medium. Read the original article →
Let's talk

Drowning in scripts and CSV handoffs?

We help teams move from fragile pipelines to tested, maintainable data infrastructure.

Book a free consultation