dlt REST API sources: declarative pagination without the glue

In the dlt and DuckDB pipeline post I argued that most pipeline code is glue you never wanted to maintain, and that dlt deletes the schema half of it: no CREATE TABLE, no hand-rolled normalization, no guessing at types. But there's a second pile of glue that post left for later, and it's the one you re-write for every single API you ingest — the pagination loop, the auth header, the since= parameter that fetches only what changed. dlt has a declarative answer for that too. It's the REST API source, and it turns an entire connector into a configuration dictionary.

That earlier post actually closed by naming what it hadn't covered: "REST API sources with declarative pagination, secrets management, dozens of destinations." This is that post. The dlt → DuckDB sandbox below is still the right mental model for the stages a source drives — extract → normalize → load — and the REST API source is just a declarative way to author the extract half.

[Interactive artifact: dlt-pipeline-sandbox]

A source is a config, not a connector

The thing to internalize is that a REST API source is data, not code. You describe the API — where it lives, how it authenticates, which endpoints you want — and dlt builds the connector from that description.

import dlt
from dlt.sources.rest_api import RESTAPIConfig, rest_api_source
 
config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
        "auth": {"token": dlt.secrets["github_token"]},
    },
    "resource_defaults": {
        "primary_key": "id",
        "write_disposition": "merge",
        "endpoint": {"params": {"per_page": 100}},
    },
    "resources": [
        "stargazers",
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",
                "params": {"state": "open", "sort": "updated"},
            },
        },
    ],
}
 
pipeline = dlt.pipeline(
    pipeline_name="github",
    destination="duckdb",
    dataset_name="github_data",
)
pipeline.run(rest_api_source(config))

Three pieces carry the whole thing. client holds the connection-level concerns — base URL, auth, and pagination — that apply to every request. resource_defaults applies settings to every resource at once; here it sets the merge write disposition and primary_key I argued for in the parent post, so each endpoint upserts instead of duplicating. resources is the list of endpoints — a bare string when the defaults are all you need, a dict when an endpoint takes its own params. There is no request loop anywhere in that program, and there isn't going to be one.
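To make the relationship between resource_defaults and resources concrete, here is a minimal sketch of how a bare string and a dict entry could each be expanded against the defaults. The helpers deep_merge and expand_resource are hypothetical names of mine. This illustrates the idea and is not dlt's actual implementation.

```python
from copy import deepcopy
from typing import Any, Dict, Union

Resource = Union[str, Dict[str, Any]]


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def expand_resource(resource: Resource, defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: turn one `resources` entry into a full resource dict."""
    if isinstance(resource, str):
        # Assumption for this sketch: a bare string is both the name and the path.
        resource = {"name": resource, "endpoint": {"path": resource}}
    return deep_merge(defaults, resource)


defaults = {
    "primary_key": "id",
    "write_disposition": "merge",
    "endpoint": {"params": {"per_page": 100}},
}
issues = {
    "name": "issues",
    "endpoint": {"path": "issues", "params": {"state": "open", "sort": "updated"}},
}

stargazers_full = expand_resource("stargazers", defaults)
issues_full = expand_resource(issues, defaults)
```

In this sketch the issues resource ends up with per_page, state, and sort side by side, while stargazers inherits everything and only gains a name and a path.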

Pagination is the glue dlt actually deletes

The default behaviour is the one you want most of the time: dlt tries to figure pagination out on its own. On the first response it inspects the body and headers and infers how the API pages (a Link header, a next URL embedded in the JSON, a cursor token in the body), then follows that to exhaustion. For a well-behaved API you configure nothing and still get every page. Offset and page-number schemes are harder to infer from a single response, so expect to name those explicitly.

When the guess can't be made, you name the paginator instead of writing the loop. dlt ships paginators for the common schemes — the RFC 8288 (Web Linking; formerly RFC 5988) Link header, a next-page URL inside the JSON body, offset/limit, page-number, and opaque cursor tokens echoed back each page. You set one on the client (or per endpoint) as a string alias, or as a dict when it needs parameters:

# simplest: follow the RFC 8288 Link header
"paginator": "header_link",

# explicit page-number paging with a safety bound
"paginator": {
    "type": "page_number",
    "base_page": 1,
    "page_param": "page",
    "total_path": None,     # no total in the response — stop on an empty page
    "maximum_page": 50,
}

The win isn't that dlt knows every API — it's that pagination becomes declared intent instead of a hand-written while loop that silently breaks the day the API renames its next field. Let auto-detection try first; name the paginator only when the guess is wrong, and reach for a custom one only when the scheme is genuinely strange.
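For contrast, this is roughly the hand-written loop that the page_number config above stands in for. It is a sketch under my own assumptions: fake_fetch and RECORDS are hypothetical in-memory stand-ins for an API, and treating maximum_page as an inclusive bound is a choice I made here, not a statement about dlt's exact behaviour.

```python
from typing import Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, int]]


def paginate_by_page_number(
    fetch_page: Callable[[int], Page],
    base_page: int = 1,
    maximum_page: Optional[int] = None,
) -> Iterator[Page]:
    """Yield pages until one comes back empty or the safety bound is hit."""
    page = base_page
    while maximum_page is None or page <= maximum_page:
        items = fetch_page(page)
        if not items:
            return  # empty page: the API has nothing more to give
        yield items
        page += 1


# Hypothetical in-memory "API": 250 records served 100 per page.
RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(page: int, per_page: int = 100) -> Page:
    start = (page - 1) * per_page
    return RECORDS[start : start + per_page]


pages = list(paginate_by_page_number(fake_fetch, base_page=1, maximum_page=50))
page_sizes = [len(p) for p in pages]
```

Every API with a different scheme needs its own variant of this loop, which is the maintenance cost the declared paginator removes.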

Auth is configuration too

The same idea covers credentials. A bearer token is one line; an API key spells out where it goes:

"auth": {
    "type": "api_key",
    "name": "apiKey",
    "api_key": dlt.secrets["news_api_key"],
    "location": "query",     # or "header"
}

Note the dlt.secrets[...] lookups in both examples. Secrets resolve from .dlt/secrets.toml or environment variables, never from the config you commit — that's the "secrets management" the parent post gestured at, and it's why the connector description above is safe to check into git as-is.
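What the location field decides can be shown in a few lines. This is only an illustration of the idea: apply_api_key is a hypothetical helper of mine, not a dlt function, and the key value is a placeholder.

```python
from typing import Dict, Tuple


def apply_api_key(
    headers: Dict[str, str],
    params: Dict[str, str],
    name: str,
    api_key: str,
    location: str = "header",
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return new (headers, params) with the key attached at the chosen location."""
    headers, params = dict(headers), dict(params)  # never mutate the caller's dicts
    if location == "query":
        params[name] = api_key
    elif location == "header":
        headers[name] = api_key
    else:
        raise ValueError(f"unsupported location: {location!r}")
    return headers, params


# Placeholder value; a real key would come from a secrets store, not source code.
headers, params = apply_api_key({}, {"q": "dlt"}, "apiKey", "not-a-real-key", "query")
```

With "query" the key travels as ?apiKey=... next to the other parameters; with "header" it would be sent as a request header of the same name.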

Incremental loading drops into the same config

In the parent post, incremental loading meant hand-writing a ?since={last_value} into the request URL. The REST API source makes the cursor a property of the parameter instead:

{
    "name": "issues",
    "endpoint": {
        "path": "issues",
        "params": {
            "since": {
                "type": "incremental",
                "cursor_path": "updated_at",
                "initial_value": "2024-01-01T00:00:00Z",
            },
        },
    },
}

dlt persists the high-water mark across runs and passes since= automatically. Combined with the merge and primary_key from resource_defaults, every run pulls only the rows that changed and upserts them in place. Incremental, idempotent, and you wrote no state-tracking code — the same merge/incremental pairing from the parent post, declared rather than coded.
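The bookkeeping being declared away looks something like the sketch below. It assumes a hypothetical in-memory upstream (UPSTREAM, fetch_since) and a plain dict standing in for persisted pipeline state; dlt's real state handling is more involved than this.

```python
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical upstream data; ISO 8601 UTC strings in one format compare correctly as text.
UPSTREAM: List[Row] = [
    {"id": "1", "updated_at": "2023-12-30T09:00:00Z"},
    {"id": "2", "updated_at": "2024-02-01T12:00:00Z"},
    {"id": "3", "updated_at": "2024-03-15T08:30:00Z"},
]


def fetch_since(since: str) -> List[Row]:
    """Stand-in for GET /issues?since=...: rows updated at or after `since`."""
    return [row for row in UPSTREAM if row["updated_at"] >= since]


def run_incremental(state: Dict[str, str], initial_value: str) -> List[Row]:
    """One pipeline run: read the cursor, fetch, then advance the cursor."""
    since = state.get("last_value", initial_value)
    rows = fetch_since(since)
    if rows:
        state["last_value"] = max(row["updated_at"] for row in rows)
    return rows


state: Dict[str, str] = {}
first_run = run_incremental(state, "2024-01-01T00:00:00Z")

# Something changes upstream between runs.
UPSTREAM.append({"id": "4", "updated_at": "2024-04-01T00:00:00Z"})
second_run = run_incremental(state, "2024-01-01T00:00:00Z")
```

Because the filter is inclusive, the row sitting exactly on the high-water mark comes back on the second run as well. That overlap is harmless only because the load is a merge on primary_key, which is why the two settings belong together.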

Dependent resources without an orchestrator

A shape you hit constantly: fetch a list, then fetch a child collection for each item. dlt resolves that dependency straight from the parent's data:

{
    "name": "issue_comments",
    "endpoint": {
        "path": "issues/{resources.issues.number}/comments",
    },
    "include_from_parent": ["id"],
}

The {resources.issues.number} reference tells dlt to run issues first, then call the comments endpoint once per issue number, threading the parent's id onto each child row via include_from_parent. What would otherwise be a nested fan-out with manual key propagation is a path template and one field list.
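Written out by hand, that fan-out might look like the sketch below. ISSUES, COMMENTS, and fan_out are hypothetical stand-ins, and the template uses a simplified {number} placeholder instead of dlt's {resources.issues.number} syntax. The _issues_id column name follows the parent-name prefix convention as I understand it from the dlt docs, so treat it as an assumption.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]

# Hypothetical in-memory stand-ins for the two endpoints.
ISSUES: List[Row] = [{"id": 9001, "number": 1}, {"id": 9002, "number": 2}]
COMMENTS: Dict[str, List[Row]] = {
    "issues/1/comments": [{"id": 11, "body": "first"}, {"id": 12, "body": "second"}],
    "issues/2/comments": [],
}


def fan_out(
    parents: List[Row],
    path_template: str,
    fetch: Callable[[str], List[Row]],
    include_from_parent: List[str],
    parent_name: str,
) -> Iterator[Row]:
    """Call the child endpoint once per parent row and tag each child row."""
    for parent in parents:
        path = path_template.format(**parent)  # e.g. "issues/1/comments"
        for child in fetch(path):
            row = dict(child)
            for field in include_from_parent:
                # Prefix with the parent name so it cannot clash with the child's own id.
                row[f"_{parent_name}_{field}"] = parent[field]
            yield row


comment_rows = list(
    fan_out(ISSUES, "issues/{number}/comments", lambda path: COMMENTS[path], ["id"], "issues")
)
```

Each comment keeps its own id and gains the parent's id under a separate key, which is what lets you join comments back to issues after loading.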

When to drop a level

The declarative source covers the large majority of REST APIs, but not the genuinely awkward ones — a signed pagination scheme, a response you have to reshape before it's loadable. For those, dlt exposes the layer beneath it: a RESTClient whose paginate() method you drive in plain Python, reusing the same paginator and auth building blocks. You trade the config-as-connector ergonomics for full control without leaving dlt's machinery. That escalation path is the part I value most — declarative for the 90%, imperative for the 10%, one library for both.
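A minimal sketch of that lower level, assuming the RESTClient, BearerTokenAuth, and HeaderLinkPaginator classes from dlt's rest_client helpers. The flatten_issue reshape step and the choice of fields are hypothetical, and iter_issue_pages is defined but not called here because it needs a real token and network access.

```python
from typing import Any, Dict, Iterator, List


def flatten_issue(issue: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical reshape step: keep a few fields and lift the nested author login."""
    return {
        "id": issue["id"],
        "number": issue["number"],
        "title": issue["title"],
        "author": (issue.get("user") or {}).get("login"),
    }


def iter_issue_pages(token: str) -> Iterator[List[Dict[str, Any]]]:
    """Page through the issues endpoint by hand, reshaping each page before yielding."""
    # Imported here so the reshape logic above can be used and tested without dlt.
    from dlt.sources.helpers.rest_client import RESTClient
    from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
    from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

    client = RESTClient(
        base_url="https://api.github.com",
        auth=BearerTokenAuth(token=token),
        paginator=HeaderLinkPaginator(),
    )
    params = {"state": "open", "per_page": 100}
    for page in client.paginate("/repos/dlt-hub/dlt/issues", params=params):
        yield [flatten_issue(issue) for issue in page]


# The reshape step can be checked on its own with a hand-made record.
sample = {"id": 1, "number": 7, "title": "Example", "user": {"login": "octocat"}}
flat = flatten_issue(sample)
```

The loop is back in your code, but the paginator and auth objects are the same building blocks the declarative config names by alias.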

The whole point

The REST API source is the same bet as the rest of dlt: push the boring, error-prone, identical-across-every-source work (paging, auth plumbing, incremental bookkeeping) into declared configuration, and keep your attention for the decisions that are actually about your data. It pairs directly with the write dispositions and schema contracts from the parent post: the source decides what to pull and how to page it; the disposition and contract decide how it lands and what's allowed to change. Wire those together and an API pipeline is a config dictionary plus a dlt.pipeline(...) definition and a single pipeline.run(...) call.

That preference (declare the boring parts, hand-write only the judgment) is the same instinct that runs through everything in how I build AI-native: move effort to where it survives contact with production, and let well-designed tools delete the glue. Open the dlt → DuckDB sandbox to watch the extract → normalize → load flow a source drives, then point a real rest_api_source at an API you actually use. The dlt docs cover the REST API source in full.