Data pipelines with dlt and DuckDB

Most data pipeline code is glue. Paginate an API, flatten the JSON, guess at types, create the table if it doesn't exist, figure out what changed since last time, handle the new field that appeared overnight. None of it is interesting, all of it breaks, and you maintain it forever.

dlt↗ (data load tool) and DuckDB↗ are the combination I reach for to delete most of that glue. dlt is an open-source Python library that handles extract, normalize, and load declaratively — you write a function that yields data, it works out the schema and the SQL. DuckDB is an embedded analytical database: no server, fast, and it runs the same on your laptop and in production. Together they make the boring 80% of a pipeline disappear so you can spend attention on the 20% that's actually about your data.

Here's how I think about each stage, and the two features — write dispositions and schema contracts — that separate a toy pipeline from one you'd trust in production.

A whole pipeline is about ten lines#

The thing that surprises people coming from hand-rolled ingestion is how little code dlt needs. A resource yields records; a pipeline loads them into a destination. That's the program.

import dlt
from dlt.sources.helpers import requests
 
@dlt.resource(write_disposition="merge", primary_key="id")
def issues(repo: str):
    url = f"https://api.github.com/repos/{repo}/issues"
    yield requests.get(url).json()
 
pipeline = dlt.pipeline(
    pipeline_name="github",
    destination="duckdb",
    dataset_name="github_data",
)
load_info = pipeline.run(issues("dlt-hub/dlt"))

You never wrote a CREATE TABLE. dlt inferred the schema from the data, created the issues table in DuckDB, and — because the API returns nested JSON — unnested child objects and arrays into their own related tables with foreign keys back to the parent. That normalization step is the part people most often get wrong by hand, and it's the part dlt is best at.

The dlt → DuckDB sandbox below lets you watch this happen: pick a source, choose a write disposition and a schema contract, run it, and step through extract → normalize → load. It's the same model the real library uses, made visible.

Loading artifact: dlt-pipeline-sandbox...

Write dispositions are the decision that actually matters#

When you load data twice, what should happen? That question has three answers, and picking the wrong one is how warehouses end up with either duplicate rows or missing history.

append — every run adds rows. Right for immutable event streams; wrong for anything you'll re-fetch, because you'll get duplicates.
replace — drop and reload the whole table each run. Simple and correct for small reference data you can afford to refetch entirely.
merge — upsert on a primary_key (and optionally a merge_key). This is the one you want for most real sources: re-running is idempotent, updates land in place, and you don't accumulate garbage.

The merge in the example above means I can run that pipeline every hour and the issues table always reflects current state without duplicating a single row. Idempotent re-runs are the difference between a pipeline you can schedule and walk away from and one you have to babysit.

For sources where you only want new records, dlt's incremental loading tracks a cursor — a timestamp or monotonic id — and only requests rows past the last high-water mark:

@dlt.resource(write_disposition="merge", primary_key="id")
def issues(repo, updated=dlt.sources.incremental("updated_at")):
    url = f"https://api.github.com/repos/{repo}/issues?since={updated.last_value}"
    yield requests.get(url).json()

Now each run pulls only what changed since last time, and merges it. Small, fast, idempotent.

Schema contracts decide whether you fail loud or corrupt quietly#

This is the feature I wish more people knew about, because it's the one that saves you at 2am.

Source schemas drift. A SaaS API adds a field. A column that was always an integer shows up as a string. A new nested object appears. By default, dlt evolves the schema to accommodate the change — usually what you want in development, occasionally a disaster in production, where a silently-added column or a silently-changed type can break everything downstream that assumed stability.

A schema contract lets you say, per pipeline, exactly how much change you'll tolerate — across three levels (tables, columns, data_type) with four modes each (evolve, freeze, discard_value, discard_row):

pipeline.run(
    source,
    schema_contract={
        "tables": "evolve",      # allow brand-new tables
        "columns": "freeze",     # but a new column on an existing table is an error
        "data_type": "freeze",   # and a type change is an error, not a coercion
    },
)

The pattern I use: evolve in dev, freeze in production. While you're building, let the schema move so you discover the real shape of the data. Once it's feeding something downstream, freeze the columns and types so an upstream change becomes a loud, catchable error instead of a quiet corruption that surfaces three transformations later.

A frozen contract turns "a field changed type and half the dashboards are now wrong" into "the pipeline stopped and told me which field." That's the trade you want. Failing loud is a feature.

Why DuckDB belongs here#

DuckDB earns its place as the destination for two unglamorous reasons. First, there's no infrastructure — it's a library, the database is a file, and your dev pipeline is byte-for-byte the same as production with the destination swapped. Second, it's a genuine columnar OLAP engine, so once dlt has loaded normalized tables you can query them with real analytical SQL immediately — joins across the unnested child tables, window functions, aggregates — without moving the data anywhere.

That last part connects directly to two things I've written about separately. Once your data lands in clean relational tables, the analytical work is SQL — and window functions are where most of the leverage is; I broke those down in a practical guide to SQL window functions. And when a frozen contract does catch an upstream change, the next question is always "what downstream breaks?" — which is a data lineage question.

The whole point#

dlt and DuckDB don't make pipelines magic. They make the boring parts boring again — schema inference, normalization, idempotent loading — so the only decisions left are the ones that need a human: which write disposition matches the source, and how much schema drift you're willing to absorb before something should break. Get those two right and the pipeline mostly runs itself.

Play with the sandbox to build the intuition, then go read the dlt docs↗ — they're genuinely good. The library does more than fits in one post: REST API sources with declarative pagination, secrets management, dozens of destinations. But write dispositions and schema contracts are the two ideas that change how you think about ingestion.