SQL window functions: a practical guide

·5 min read·

If I could teach an analytics engineer exactly one thing beyond the basics, it would be window functions. They're the feature that turns SQL from "fetch and aggregate" into "compute over context" — rankings, period-over-period changes, running totals, moving averages — all without leaving the database or collapsing your rows. And yet most people use ROW_NUMBER() to dedup and stop there, never touching the frame clause, which is where half the power lives and most of the bugs hide.

Here's the practical whole of it. The window functions explainer below is the best way to build intuition — pick a function, then hover any result row to see exactly which input rows its window frame covers and why. I'll refer back to it.

Loading artifact: sql-window-functions...

The one idea: aggregate without collapsing#

GROUP BY computes an aggregate and throws your rows away — five orders become one row with a SUM. A window function computes across a set of related rows (the window) while keeping every original row. You get the total and the detail, side by side.

The whole thing hangs off the OVER clause, which has three parts:

SUM(amount) OVER (
  PARTITION BY customer_id   -- reset the calculation per group
  ORDER BY order_date        -- order rows within the partition
  ROWS BETWEEN ...           -- the frame: which rows are in scope
)

PARTITION BY is "group, but don't collapse." ORDER BY sequences rows inside each partition. The frame narrows it to a sliding window of rows. Master those three and everything below is just picking a function to put in front.

Ranking: pick the right one of three#

Three functions rank rows, and the difference between them is exactly how they treat ties.

SELECT
  name,
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,   -- 1,2,3,4 — always unique
  RANK()       OVER (ORDER BY score DESC) AS rank,       -- 1,2,2,4 — ties share, gaps after
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank  -- 1,2,2,3 — ties share, no gaps
FROM players;

The dedup pattern is worth memorising because you'll write it constantly:

SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM contacts
) t
WHERE rn = 1;   -- the latest row per email, duplicates gone

Offset: LAG and LEAD for period-over-period#

LAG reaches back to a previous row; LEAD reaches forward. This is how you compute change over time without a self-join — month-over-month growth is two lines:

SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_month,
  revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM monthly_revenue;

Anytime you catch yourself joining a table to itself on date = date - 1, stop — it's a LAG.

The frame clause, and the gotcha that bites everyone#

Now the part people skip. When you use a window aggregate with ORDER BY, you've implicitly created a frame — and the default is not what most people assume.

A running total looks like this:

SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

That accumulates from the first row through the current one. Good. But here's the trap: if you omit the frame, SQL defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, not ROWS. With RANGE, all rows that share the same ORDER BY value are treated as one peer group and collapsed into the same frame — so if two orders share a date, your "running total" jumps by both at once on each of those rows instead of stepping through them.

!

A running total that's mysteriously wrong on dates with duplicate values is almost always the RANGE-vs-ROWS default. When you want a true row-by-row accumulation, write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW explicitly. Don't rely on the default — it's RANGE, and it will surprise you.

A moving average is the same machinery with a bounded frame — a trailing 7-row window:

AVG(value) OVER (
  ORDER BY day
  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg

Hover the rows in the explainer with a moving average selected and you'll see the frame slide — that's the whole concept made literal, and it's worth thirty seconds.

Why this is the highest-leverage SQL you can learn#

Window functions are where a lot of pipeline logic that people push into Python or a BI tool actually belongs — in the query, close to the data, expressed once. Dedup at load time. Sessionization. Cohort retention. Rolling metrics. The same engines I lean on make this cheap: once dlt has landed clean relational tables in DuckDB, this is the analytical layer that sits on top, and DuckDB runs window functions fast.

Learn the three-part OVER clause, memorise the dedup and LAG patterns, and — above all — write your frames explicitly. The default will not do what you think, and a silently-wrong running total is the kind of bug that ships to a dashboard and lives there for months.

[S.01]§ Related

Data pipelines with dlt and DuckDB

Most pipeline code is glue nobody wants to maintain. dlt and DuckDB let you skip the glue and keep the parts that matter — schema inference, incremental loading, and contracts that fail loudly instead of silently corrupting your warehouse.

6 min read

Column-level data lineage in practice

Table-level lineage tells you two models are connected. Column-level lineage tells you that changing one field breaks exactly these three downstream columns and nothing else. That precision is the difference between a confident change and a prayer.

5 min read