Column-level data lineage in practice

·5 min read·

Every data team eventually asks two questions that lineage exists to answer. If I change this column, what breaks? And, when a number is wrong: where did this come from? Without lineage you answer both by grepping the warehouse and hoping. With it, you get an exact list.

Most lineage you'll see is table-level: model A feeds model B feeds the dashboard. That's useful for a high-altitude map, but it's too coarse for the questions that actually keep you up. Table-level lineage says "these two models are connected." It can't tell you that renaming orders.amount breaks revenue.gross and finance.mrr but leaves the other forty columns in those models untouched. For that you need column-level lineage — the derivation of each output column traced back through every transformation to the source columns it depends on.

The data lineage viewer below is the mental model: nodes are datasets, edges are flows, and clicking a connection traces the column-level impact path through the graph. Drag it around — the structure is the point.

Loading artifact: data-lineage...

Where the lineage actually comes from#

You don't annotate lineage by hand — you'd never keep it current. It's derived, and there are two complementary sources.

Static: parse the SQL#

Your transformations are the lineage. A SELECT is a precise statement of which input columns produce which output column, and you can recover that by parsing the query into an AST and walking it. sqlglot does exactly this — it has a lineage function that, given a column and the SQL (plus the schema), returns the upstream columns it derives from:

from sqlglot.lineage import lineage
 
node = lineage(
    "total_revenue",
    "SELECT SUM(price * qty) AS total_revenue FROM orders",
)
# walking the node yields the real dependencies:
#   orders.price, orders.qty

Run that across every model in a dbt project and you've reconstructed the whole column graph from nothing but the SQL you already wrote. No manual metadata, no drift — regenerate it on every commit and it's always true.

i

The reason static SQL parsing is so powerful: the lineage can't go stale, because it is the code. If the SQL changes, the next parse reflects it. Hand-maintained lineage documentation is wrong the moment someone merges a PR; derived lineage is correct by construction.

Runtime: capture from the orchestrator#

Static parsing covers SQL-defined transformations. But pipelines also move data through Python, ingestion tools, and steps that aren't a single query. For those you want lineage emitted as the job runs. OpenLineage is the open standard for this — a spec for run-level lineage events (which job read which datasets and wrote which others, with column-level facets) that integrations emit from orchestrators like Airflow and Dagster. Marquez is the reference store.

The two together give you full coverage: static parsing for the transformation logic, runtime events for everything that happens around it.

Using lineage when it counts#

A lineage graph that just sits in a catalog is a museum piece. The value shows up at two specific moments.

Impact analysis, before you change anything. This connects directly to schema contracts. In the dlt and DuckDB pipeline post I argued for freezing schemas in production so an upstream change fails loud instead of corrupting quietly. Lineage is the other half of that loop: when the contract catches a changed column, the lineage graph tells you the blast radius — exactly which downstream columns, models, and dashboards depend on it. You go from "something upstream changed, hope nothing broke" to "this change touches these three columns, I'll update them in the same PR."

Root-cause, after a number goes wrong. A metric looks off. Column-level lineage lets you walk upstream from the bad output column through each transformation to the source, checking the logic at every hop, instead of bisecting the whole warehouse. The wrong SUM or the bad join reveals itself because you only have to inspect the columns actually on the path.

There's a monitoring dimension too. A pipeline health dashboard that tracks freshness and flags schema drift becomes far more actionable when a drift alert can name the downstream columns at risk — that's lineage feeding observability rather than just raising an undifferentiated alarm.

What I'd actually do#

If you're starting from nothing: run sqlglot over your SQL models in CI and persist the column graph as an artifact on every commit. That alone gives you accurate impact analysis for the majority of teams, for roughly the cost of an afternoon — no platform, no license. Layer OpenLineage in later when you need runtime coverage across non-SQL steps, or when the org wants one catalog across many tools.

The mistake is treating lineage as documentation — something you draw once and admire. Treat it as derived data: regenerated from the code, queried at decision time, wired into the moments where a wrong assumption is about to cost you. Table-level lineage gives you a map. Column-level lineage gives you the precision to change things without flinching.

[S.01]§ Related

Data pipelines with dlt and DuckDB

Most pipeline code is glue nobody wants to maintain. dlt and DuckDB let you skip the glue and keep the parts that matter — schema inference, incremental loading, and contracts that fail loudly instead of silently corrupting your warehouse.

6 min read

SQL window functions: a practical guide

Window functions are the single highest-leverage SQL feature for analytics work, and the frame clause is the part everyone skips — right up until their running total is silently wrong. Here's the whole thing, practically.

5 min read