AI For Humanity

Lecture 12: Double ML - Second Serving

Jonas Skjold Raaschou-Pedersen

2025-11-26

Previously on “AI For Humanity”

  • Talked about estimands — interpretable quantities we would like to estimate
  • We link causal estimand to estimable estimand by causal assumptions
    • If assumptions hold we can estimate the causal estimand using our data
  • Estimand often depends on high-dimensional nuisances
    • Need to estimate nuisances — but they are not interesting in themselves
  • AI For Humanity, therefore also for Econometricians
  • Use AI/Machine Learning to deal with nuisances
  • Naive approach by plugging in machine learners doesn’t work (generally)
    • Need to be more clever — Double/Debiased Machine Learning is that
    • Neyman Orthogonality and Cross-fitting as central components

Double/Debiased Machine Learning Recap

  • Target parameter (estimand) \(\theta_0\) defined as the solution to a moment condition \[\begin{align} E\left[ \psi(W; \theta_{0}, \eta_{0}) \right] & = 0 \end{align}\] where \(\eta_{0}\) is a nuisance parameter, \(\psi\) a Neyman orthogonal score function and \(W\) a tuple of data (e.g. \(W = (Y, D, X)\))
  • Estimating \(\eta_{0}\) using machine learning leads to regularization bias and overfitting bias
  • Neyman orthogonality: \[\begin{align} \frac{\partial}{\partial \lambda} E\left[ \psi(W; \theta_{0}, \eta_{0} + \lambda [\eta - \eta_{0}]) \right] \Big|_{\lambda =0} & = 0, \quad \forall \eta \in \mathcal{T}, \quad \lambda \in [0, 1) \end{align}\]
  • Cross-fitting to reduce dependence between first-step estimation error and the data
  • Neyman orthogonality alleviates regularization bias
  • Cross-fitting alleviates overfitting bias
  • Focus on estimand frees our mind and heals our soul

DML Estimator

  • A DML estimator is a plug-in estimator that leverages a first-step nuisance parameter estimator by relying on Neyman orthogonal scores and cross-fitting

  • A DML estimator solves: \[\begin{align} \label{eq:dml} \hat{\theta}: \quad \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_{k}} \psi(W_{i}; \hat{\theta}, \hat{\eta}_{-k}) = 0 \end{align} \tag{1}\]

    where:

    • \(\{I_{k}\}_{k=1}^{K}\) is a random partition of the individuals \(\{1, \ldots, n\}\) into \(K\) subsamples
    • \(\{\hat{\eta}_{-k}\}_{k=1}^{K}\) are cross-fitted nuisance parameter estimators, each calculated on the observations excluding those in the sub-sample \(I_{k}\)

General DML Algorithm
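
The steps behind Equation 1 can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes a score that is linear in the target, \(\psi = \psi_{a}\theta + \psi_{b}\), so the estimating equation has a closed-form solution. The helper names `make_folds`, `cross_fit` and `solve_linear_score` are hypothetical, and OLS stands in as a placeholder for any machine learner.

```python
import numpy as np

def make_folds(n, n_folds, seed=0):
    """Random partition {I_k} of range(n) into n_folds index arrays."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), n_folds)

def cross_fit(fit, predict, X, y, folds):
    """Out-of-fold predictions: the model used on fold I_k is trained without I_k."""
    n = len(y)
    preds = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        model = fit(X[train_mask], y[train_mask])
        preds[test_idx] = predict(model, X[test_idx])
    return preds

def solve_linear_score(psi_a, psi_b):
    """Solve mean(psi_a * theta + psi_b) = 0 for theta."""
    return -np.mean(psi_b) / np.mean(psi_a)

# Placeholder learner (OLS with intercept); swap in any fit/predict pair.
def ols_fit(X, y):
    Xc = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xc, y, rcond=None)[0]

def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Simulated data (an assumption, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
folds = make_folds(len(y), n_folds=5)
y_hat = cross_fit(ols_fit, ols_predict, X, y, folds)
```

Every observation receives a nuisance prediction from a model that never saw it, which is the role of \(\hat{\eta}_{-k}\) in Equation 1.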

The search for estimands

  • We’ll consider two methods today
  • Difference-in-Differences
    • Arguably the most popular method in the social sciences
    • Interesting in itself
    • Also a good showcase of the importance of being clear about what you are estimating (i.e. your estimand)
  • Instrumental Variables
    • Estimand of interest is the LATE

Two period Difference-in-Differences

  • \(n\) units observed in two periods \(t \in \{0, 1\}\)
  • \(Y_{t}\) outcome
  • \(D\) binary treatment equal to \(1\) for the treated units
  • Potential outcomes (POs): \(Y_t(d)\) where \(t,d\in\{0,1\}\)
  • Observed outcome \(Y_{t} = DY_{t}(1) + (1 - D)Y_{t}(0)\)
  • Estimand, the average treatment effect on the treated (ATT):

\[\begin{align} \tau := E[Y_{1}(1) - Y_{1}(0) \mid D = 1] = E[Y_{1} \mid D = 1] - E[Y_{1}(0) \mid D = 1] \end{align} \tag{2}\]

  • Data: \((Y_1, Y_0, D)\)

DID illustration


Identifying assumptions

  • Consistency

\[\begin{align} Y_{t} = DY_{t}(1) + (1 - D)Y_{t}(0) \end{align}\]

  • No Anticipation (NA)

\[\begin{align} E[Y_{0}(0) \mid D = 1] = E[Y_{0}(1) \mid D = 1] \end{align}\]

  • Parallel Trends (PT)

\[\begin{align} E[Y_{1}(0) - Y_{0}(0) \mid D = 1] = E[Y_{1}(0) - Y_{0}(0) \mid D = 0] \end{align}\]

DID illustration


Identification with Consistency, NA and PT assumptions

What we came from

Identification two period simple DID

  • Counterfactual for the treated identified by:

\[\begin{align} E[Y_{1}(0) \mid D = 1] = E[Y_{0} \mid D = 1] + E[Y_{1} - Y_{0} \mid D = 0] \end{align}\]

  • Plugging this into the ATT gives the identified parameter:

\[\begin{align} \tau & = E[Y_{1} \mid D = 1] - E[Y_{1}(0) \mid D = 1] \\ & = E[Y_{1} - Y_{0} \mid D = 1] - E[Y_{1} - Y_{0} \mid D = 0] \end{align}\]

  • A difference in differences
  • Natural plug-in estimator using sample averages:

\[\begin{align} \hat{\tau} = E_{n}[Y_{1} - Y_{0} \mid D = 1] - E_{n}[Y_{1} - Y_{0} \mid D = 0] \end{align}\]

OLS estimator two period DID

  • Data: \(n\) i.i.d. tuples \((Y_{i0}, Y_{i1}, D_{i}, T_{i})\) where
    • \(Y_{i0}\) outcome in period \(t = 0\)
    • \(Y_{i1}\) outcome in period \(t = 1\)
    • \(D_{i} = \mathbf{1}\{\text{unit $i$ treated in $t = 1$}\}\) treatment indicator
    • \(T_{i} = \mathbf{1}\{t = 1\}\) post-treatment period dummy
  • Linear model:

\[\begin{align} Y_{it} = \alpha_{1} + \alpha_{2}T_{i} + \alpha_{3} D_{i} + \tau_{\text{ols}}(T_{i} \cdot D_{i}) + \varepsilon_{it}, \quad E[\varepsilon_{it} \mid D_{i}, T_{i}] = 0 \end{align}\]

  • Under the same assumptions, \(\tau_{\text{ols}} = \tau\)
  • OLS estimator \(\hat{\tau}_{\text{ols}}\) equivalent to the plug-in estimator \(\hat{\tau}\)
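
The equivalence between the plug-in and OLS estimators can be checked numerically. The sketch below uses simulated data (the data-generating process and all variable names are assumptions for illustration) and plain NumPy least squares instead of a regression package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
D = rng.integers(0, 2, size=n)                  # treatment group indicator
Y0 = 1.0 + 0.5 * D + rng.normal(size=n)         # outcome in period t = 0
Y1 = Y0 + 1.0 + 2.0 * D + rng.normal(size=n)    # common trend 1, treatment effect 2

# Plug-in estimator: difference of mean first differences
dY = Y1 - Y0
tau_plugin = dY[D == 1].mean() - dY[D == 0].mean()

# OLS on long-format data: Y ~ 1 + T + D + T * D
Y = np.concatenate([Y0, Y1])
T = np.concatenate([np.zeros(n), np.ones(n)])
D_long = np.concatenate([D, D])
Z = np.column_stack([np.ones(2 * n), T, D_long, T * D_long])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
tau_ols = coef[3]                               # coefficient on the interaction T * D
```

The regression is saturated in \((T, D)\), so the interaction coefficient reproduces the difference of the four cell means and coincides with the plug-in estimate up to floating-point error.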

Exercise: Plug-in and OLS

  • Dataset as in previous slide
>>> df.head(2)
shape: (2, 4)
┌─────┬─────┬─────┬───────────┐
│ id  ┆ D   ┆ T   ┆ Y         │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i8  ┆ f64       │
╞═════╪═════╪═════╪═══════════╡
│ 0   ┆ 1   ┆ 0   ┆ -1.085631 │
│ 0   ┆ 1   ┆ 1   ┆ 2.08532   │
└─────┴─────┴─────┴───────────┘
  1. Download the dataset
wget -P data/ https://gist.githubusercontent.com/jsr-p/5e61db4595cbf718d6791aaad31b0cec/raw/066ee05186fcb4d59ff61edf466f1cbb0936d1bd/did1-lecture12.csv

or directly in Python:

import polars as pl
df = pl.read_csv(
    "https://gist.githubusercontent.com/jsr-p/5e61db4595cbf718d6791aaad31b0cec/raw/066ee05186fcb4d59ff61edf466f1cbb0936d1bd/did1-lecture12.csv"
)
  2. Implement the plug-in and OLS estimators; apply both to the dataset
# Copy this if you like
import statsmodels.formula.api as smf

Something is missing…

  • Covariates \(X\)

  • Usually parallel trends assumption more plausible conditional on covariates

  • Conditional No Anticipation (CNA)

\[\begin{align} E[Y_{0}(0) \mid X, D = 1] = E[Y_{0}(1) \mid X, D = 1] \end{align}\]

  • Conditional Parallel Trends (CPT) \[\begin{align} E[Y_{1}(0) - Y_{0}(0) \mid X, D = 1] = E[Y_{1}(0) - Y_{0}(0) \mid X, D = 0] \end{align}\]

  • Overlap

\[\begin{align} \exists \epsilon > 0: P(D = 1) \geq \epsilon \text{ and } P(D = 1 \mid X) \leq 1 - \epsilon \end{align}\] i.e. there is a treatment group and controls for every value of \(X\)

Hackin’ away

Conditional ATT (CATT)

\[\begin{align} \tau(X) &= E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1] \\ &= E[Y_{1}\mid X, D = 1] - E[Y_{1}(0)\mid X, D = 1] \end{align}\]

Add \(0\) trick:

\[\begin{align} E[Y_{1}(0)\mid X, D = 1] &= E[Y_{1}(0)\mid X, D = 1] \pm E[Y_{0}(0)\mid X, D = 1] \\ &= E[Y_{0}(0)\mid X, D = 1] + E[Y_{1}(0) - Y_{0}(0)\mid X, D = 1] \\ &\overset{\text{CNA}}{=} E[Y_{0}(1)\mid X, D = 1] + E[Y_{1}(0) - Y_{0}(0)\mid X, D = 1] \\ & \overset{\text{CPT}}{=} E[Y_{0}(1)\mid X, D = 1] + E[Y_{1}(0) - Y_{0}(0)\mid X, D = 0] \\ &= E[Y_{0}\mid X, D = 1] + E[Y_{1} - Y_{0}\mid X, D = 0] \end{align}\]

Hence the CATT:

\[\begin{align} \tau(X) = E[Y_{1} - Y_{0}\mid X, D = 1] - E[Y_{1} - Y_{0}\mid X, D = 0] \end{align}\]

CATT to ATT by the LIE

  • Derive the ATT from the CATT using the LIE:

\[\begin{align} \tau &= E[ \underbrace{ E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1] }_{\tau(X)}\mid D = 1 ] \\ &= E\left[ E[Y_{1} - Y_{0}\mid X, D = 1] \mid D = 1\right] - E[E[Y_{1} - Y_{0}\mid X, D = 0] \mid D = 1] \\ &= E[Y_{1} - Y_{0}\mid D = 1] - E[E[Y_{1} - Y_{0}\mid X, D = 0] \mid D = 1] \end{align}\]

  • The Law of Iterated Expectations (LIE)
    • It might be called the LIE — but it’s a truth in the deepest sense
    • For random variables \(X, Y, Z\) we used: \(E[E[Y \mid X, Z] \mid Z] = E[Y \mid Z]\)
  • With \(\Delta Y := Y_{1} - Y_{0}\):

\[\begin{align} \tau &= E[\Delta Y\mid D = 1] - E[E[\Delta Y\mid X, D = 0] \mid D = 1] \end{align} \tag{3}\]
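
Equation 3 suggests a simple regression-adjustment estimator: fit \(E[\Delta Y \mid X, D = 0]\) on the controls and average its predictions over the treated. The sketch below is an illustration under assumptions: the simulated data-generating process (true ATT of 2, an \(X\)-specific trend) and the linear model for the control regression are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=n)
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-X)))      # treatment more likely for high X
dY = 1.0 + 2.0 * X + 2.0 * D + rng.normal(size=n)  # X-specific trend, ATT = 2

# Unconditional DID ignores the X-specific trend
tau_naive = dY[D == 1].mean() - dY[D == 0].mean()

# Equation 3: fit E[dY | X, D = 0] on controls, then average over the treated
Xc = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xc[D == 0], dY[D == 0], rcond=None)[0]
g0_hat = Xc @ beta
tau_ra = dY[D == 1].mean() - g0_hat[D == 1].mean()
```

Because the treated have systematically higher \(X\) here, the unconditional estimator picks up the trend difference, while the adjusted estimator averages the control trend over the treated covariate distribution.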

But why not just OLS on model with covariates?

  • Extend previous model with covariates:

\[\begin{align} Y_{it} = \alpha_{1} + \alpha_{2}T_{i} + \alpha_{3} D_{i} + \tau_{\text{ols}}(T_{i} \cdot D_{i}) + \theta' X_{i} + \varepsilon_{it}, \quad E[\varepsilon_{it} \mid X_{i}, D_{i}, T_{i}] = 0 \end{align} \tag{4}\]

  • Seems good! However, Equation 4 implicitly imposes:
    1. \(E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1] = \tau_{\text{ols}}\)
    2. For \(d \in \{0, 1\}\): \(E[Y_{1} - Y_{0} \mid X, D = d] = E[Y_{1} - Y_{0} \mid D = d]\)
  • The two implicit assumptions mean that the regression model:
    1. Assumes treatment effects that are homogeneous in \(X\)
    2. Rules out \(X\)-specific trends in both treated and comparison groups
  • If the implicit assumptions are violated, the estimand \(\tau_{\text{ols}}\) is, in general, different from the ATT, \(E[Y_{1}(1) - Y_{1}(0) \mid D = 1]\)
  • This motivates the use of a DML estimator instead to target the ATT directly

Neyman orthogonal score for the ATT in two-period DID setup

  • Identification got us Equation 3
  • Neyman orthogonal score for our parameter:

\[\begin{align} \varphi(W; \tau, \eta_{0}) & = \frac{D - P(D = 1 \mid X)}{P(D = 1)[1 - P(D = 1 \mid X)]} (\Delta Y - E[\Delta Y\mid X, D = 0]) - \frac{D}{P(D = 1)} \tau \\&= \frac{D - m_{0}(X)}{p_{0}[1 - m_{0}(X)]} [\Delta Y - g_{0}(0, X)] - \frac{D}{p_{0}} \tau, \end{align}\] where \(W = (Y_{1}, Y_{0}, D, X)\) data, \(\tau\) our estimand the ATT, and \(\eta = (p, m, g)\) are the nuisance parameters with true values: \[\begin{align*} m_{0}(X) := P(D = 1 \mid X), \quad g_{0}(0, X) := E[\Delta Y\mid X, D = 0], \quad p_{0} := P(D = 1) \end{align*}\]

  • DML framework enables use of machine learners for estimating potentially high-dimensional \(m_{0}(X)\) and \(g_{0}(0, X)\) (a simple sample average is enough for \(p_{0}\))
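
Since the score is linear in \(\tau\), setting its sample mean to zero gives a closed-form estimate. The sketch below evaluates the score with the true nuisance functions on simulated data; the function names `att_did_score` and `solve_att` and the data-generating process are assumptions for illustration.

```python
import numpy as np

def att_did_score(tau, dY, D, m, g0, p):
    """Per-observation score for the ATT in the two-period DID setup."""
    weight = (D - m) / (p * (1.0 - m))
    return weight * (dY - g0) - D / p * tau

def solve_att(dY, D, m, g0, p):
    """Closed-form solution of mean(score) = 0, using linearity in tau."""
    weight = (D - m) / (p * (1.0 - m))
    return np.mean(weight * (dY - g0)) / np.mean(D / p)

# Simulated data with known nuisances (illustration only), true ATT = 2
rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=n)
m0 = 1.0 / (1.0 + np.exp(-0.5 * X))        # P(D = 1 | X)
D = rng.binomial(1, m0)
g0 = 1.0 + np.sin(X)                       # E[dY | X, D = 0], non-linear in X
dY = g0 + 2.0 * D + rng.normal(size=n)
tau_hat = solve_att(dY, D, m0, g0, D.mean())
```

In practice the nuisances are unknown and are replaced by cross-fitted estimates.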

DML Algorithm DID
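
A minimal sketch of how cross-fitting and the Neyman orthogonal score for the ATT can be combined, assuming scikit-learn learners. The function name `dml_did_att`, the propensity clipping, and the simulated data are assumptions added for illustration; the standard error uses the usual plug-in variance for scores that are linear in the target.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import StratifiedKFold

def dml_did_att(dY, D, X, learner_m, learner_g, n_folds=5, seed=0, clip=0.01):
    """Cross-fitted DML estimate of the ATT and a plug-in standard error."""
    n = len(dY)
    m_hat, g_hat, p_hat = np.empty(n), np.empty(n), np.empty(n)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, D):
        # Propensity score m(X) = P(D = 1 | X), fitted without fold I_k
        m_model = clone(learner_m).fit(X[train], D[train])
        m_hat[test] = m_model.predict_proba(X[test])[:, 1]
        # Outcome regression g(0, X) = E[dY | X, D = 0], fitted on training controls
        ctrl = train[D[train] == 0]
        g_model = clone(learner_g).fit(X[ctrl], dY[ctrl])
        g_hat[test] = g_model.predict(X[test])
        # p = P(D = 1), a simple training-sample average
        p_hat[test] = D[train].mean()
    m_hat = np.clip(m_hat, clip, 1.0 - clip)   # added safeguard, not part of the score
    psi_b = (D - m_hat) / (p_hat * (1.0 - m_hat)) * (dY - g_hat)
    psi_a = -D / p_hat
    tau_hat = -psi_b.mean() / psi_a.mean()
    psi = psi_a * tau_hat + psi_b
    se = np.sqrt(np.mean(psi**2) / psi_a.mean() ** 2 / n)
    return tau_hat, se

# Simulated data (illustration only), true ATT = 2
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
dY = 1.0 + X[:, 0] - 0.5 * X[:, 1] + 2.0 * D + rng.normal(size=n)
tau_hat, se = dml_did_att(dY, D, X, LogisticRegression(), LinearRegression())
```

Any classifier with `predict_proba` and any regressor with `predict` can be passed in place of the linear learners, for example random forests.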

Exercise: Non-linear DGP

# True ATT equals 2
>>> df.head(2)
shape: (2, 9)
┌─────┬─────┬────────────┬─────┬───────────┬───────────┬──────────┬───────────┬───────────┐
│ id  ┆ t   ┆ Y          ┆ D   ┆ X1        ┆ X2        ┆ X3       ┆ X4        ┆ ite       │
│ --- ┆ --- ┆ ---        ┆ --- ┆ ---       ┆ ---       ┆ ---      ┆ ---       ┆ ---       │
│ i64 ┆ i32 ┆ f64        ┆ i32 ┆ f64       ┆ f64       ┆ f64      ┆ f64       ┆ f64       │
╞═════╪═════╪════════════╪═════╪═══════════╪═══════════╪══════════╪═══════════╪═══════════╡
│ 0   ┆ 0   ┆ 233.21274  ┆ 1   ┆ 0.21216   ┆ -0.102127 ┆ 0.250056 ┆ 0.976466  ┆ 1.587659  │
│ 0   ┆ 1   ┆ 251.857401 ┆ 1   ┆ 0.21216   ┆ -0.102127 ┆ 0.250056 ┆ 0.976466  ┆ 2.081767  │
└─────┴─────┴────────────┴─────┴───────────┴───────────┴──────────┴───────────┴───────────┘

Exercise: Non-linear DGP

  1. Download the dataset
wget -P data/ https://gist.githubusercontent.com/jsr-p/b7c620d0692c4c63c807f5b0882f786a/raw/04e45b7066e4e3d7df55be650407265f2700704c/did2-lecture12.csv

or directly in Python:

import polars as pl
df = pl.read_csv(
    "https://gist.githubusercontent.com/jsr-p/b7c620d0692c4c63c807f5b0882f786a/raw/04e45b7066e4e3d7df55be650407265f2700704c/did2-lecture12.csv"
)
  2. Implement the DML Algorithm for the two-period DID assuming CNA, CPT and Overlap. The true ATT equals \(2\).

  3. Try using both random forests and linear models for the nuisances. Also estimate the linear model by OLS as a comparison.

You can use the snippet below and the loading function above.

import numpy as np
import polars as pl
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

Difference-in-Differences - Staggered Adoption

  • \(T\) periods indexed by \(t = 1, 2, \ldots, T\)
  • Units can receive binary treatment in any \(t > 1\)
  • \(D_{i, t}\) indicator for whether unit \(i\) receives treatment in period \(t\)
  • \(G_{i} = \min \{t \mid D_{i, t} = 1\}\) period of treatment
  • Treatment absorbing state: \(D_{i, t} = 1\) for \(t \geq G_{i}\)
  • We have variation in treatment timing across the units
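
The setup above can be made concrete with a small treatment matrix. In this sketch the helper names are hypothetical, and coding never-treated units as `np.inf` is a convention assumed here, not something fixed by the slides.

```python
import numpy as np

def first_treatment_period(D):
    """G_i = min{t : D[i, t] = 1} with periods numbered 1..T; np.inf if never treated."""
    D = np.asarray(D)
    ever_treated = D.any(axis=1)
    first = D.argmax(axis=1) + 1.0     # argmax returns the first column equal to 1
    return np.where(ever_treated, first, np.inf)

def is_absorbing(D):
    """Check that treatment never switches off once it has started."""
    return bool((np.diff(np.asarray(D), axis=1) >= 0).all())

# Three units, T = 4 periods: treated from t = 2, from t = 4, and never treated
D = np.array([[0, 1, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
G = first_treatment_period(D)
```

Units sharing the same value of `G` form a cohort, which is the source of the variation in treatment timing.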

Static TWFE

  • Static two-way fixed effects (TWFE) model:

\[\begin{align} Y_{i, t} = \alpha_{i} + \phi_{t} + D_{i, t}\beta^{\text{DD}} + \varepsilon_{i, t} \end{align} \tag{5}\] with \(\alpha_{i}\) unit and \(\phi_{t}\) time fixed effects

  • \(\beta^{\text{DD}}\) is a sensible causal estimand, right?
  • Only if there is no heterogeneity in treatment effects across either time or units
  • Otherwise the estimand involves “forbidden comparisons”: already-treated units are used as controls
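
A noise-free toy panel illustrates the problem. The numbers below are made up for illustration: two units, three periods, untreated outcomes equal to zero, and treatment effects that grow over time for the early-treated unit.

```python
import numpy as np

# Early unit: treated from t = 2 with effects 1 (t = 2) and 3 (t = 3).
# Late unit:  treated from t = 3 with effect 1.
Y = np.array([[0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
n_units, n_periods = Y.shape

# Long format: unit dummies, time dummies (first period dropped), treatment
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
unit_dummies = (unit[:, None] == np.arange(n_units)).astype(float)
time_dummies = (time[:, None] == np.arange(1, n_periods)).astype(float)
Z = np.column_stack([unit_dummies, time_dummies, D.ravel()])

coef = np.linalg.lstsq(Z, Y.ravel(), rcond=None)[0]
beta_dd = coef[-1]
avg_att = Y[D == 1].mean()   # Y(0) = 0, so treated outcomes equal treatment effects
```

In this panel the TWFE coefficient works out to \(\tau_{E,2} - \tfrac{1}{2}\tau_{E,3} + \tfrac{1}{2}\tau_{L,3} = 0\) although every treatment effect is positive (their average is \(5/3\)). The early unit's period-3 effect enters with a negative weight because that unit serves as a control for the late unit.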

Bacon Decomposition - Notation

\[\begin{align} \bar{y}_{b}^{POST(a)} := \frac{1}{T - (a - 1)} \sum_{t = a}^{T} \frac{ \sum_{i} y_{it} \mathbf{1}\{t_{i} =b \} }{ \sum_{i} \mathbf{1}\{t_{i} =b \} } \end{align}\] Sample mean outcome of the units treated at \(t = b\), averaged over the post-treatment periods \(t = a, \ldots, T\) of treatment date \(a\)

Bacon Decomposition

OLS estimator using model in Equation 5 is a weighted average of all possible two-by-two DD estimators

Forbidden comparison \(\hat{\beta}^{2 \times 2, \ell}_{k\ell}\): compares group \(\ell\) with group \(k\), where the control group consists of units that were already treated in an earlier period

Bacon Decomposition

Dynamic TWFE

  • Often interested in dynamic causal effects
  • Dynamic TWFE:

\[\begin{align} Y_{i, t} = \alpha_{i} + \phi_{t} + \sum_{r \neq -1} \mathbf{1}\{R_{i, t} = r\} \beta_{r} + \varepsilon_{i, t} \end{align}\]

where \(R_{i, t} = t - G_{i}\) is time relative to treatment

  • The \(\beta_{r}\)s equal a sensible causal estimand, right?
  • Not generally, no!
    • If the only heterogeneity is in time since treatment, they do
    • If there is heterogeneity across cohorts, they don’t
  • Again “forbidden comparisons” — see Sun and Abraham (2021)
  • Wish: a method that allows for heterogeneity in time and cohorts
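
The relative-time indicators in the dynamic TWFE model can be built as follows. This is a sketch with a hypothetical helper `event_time_dummies`; never-treated units are coded with `np.inf`, which is an assumption made here so that all their indicators are zero.

```python
import numpy as np

def event_time_dummies(G, periods, reference=-1):
    """Return relative times r (without the reference) and indicators 1{R_it = r}."""
    G = np.asarray(G, dtype=float)
    R = np.asarray(periods, dtype=float)[None, :] - G[:, None]   # R[i, t] = t - G_i
    rel_times = np.unique(R[np.isfinite(R)])
    rel_times = rel_times[rel_times != reference]
    dummies = (R[:, :, None] == rel_times[None, None, :]).astype(float)
    return rel_times, dummies

G = np.array([2.0, 3.0, np.inf])     # np.inf marks a never-treated unit
rel_times, dummies = event_time_dummies(G, periods=[1, 2, 3])
```

`dummies[i, t, j]` equals one when unit `i` is `rel_times[j]` periods away from treatment in the period stored at position `t`; the omitted category \(r = -1\) serves as the reference period.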

General theme in previous examples

  • Starting point is the regression model
  • Goal is to estimate some sensible causal estimand
  • Starting from the regression model potentially leads us astray
  • Hmm …

Solution

  • Why start from the regression model?
  • How about starting from the estimand and targeting it with whatever it takes?
  • Maybe we could:
    • Start with our estimand
    • Do identification through causal assumptions
    • End up with some estimable quantity that depends on nuisances
    • Estimate the nuisances and then estimate our target parameter of interest
  • Callaway and Sant’Anna 2021 do exactly that
    • They don’t do DML; but they do provide a Neyman orthogonal score

Callaway and Sant’Anna 2021

  • Still in Staggered Adoption setting
  • Individuals partitioned into cohorts \(G_{i} = \min \{t \mid D_{i, t} = 1\}\)
  • Generalize estimand from two-period DID Equation 2 to staggered adoption setting
  • Allow for heterogeneity across time and cohorts directly
  • Estimand: group-time average treatment effect

\[\begin{align} ATT(g, t) = E\left( Y_{t}(g) - Y_{t}(0) \mid G^{g} = 1 \right) \end{align} \tag{6}\]

where \(G^{g} := \mathbf{1}\{G = g\}\) and \(Y_{t}(0)\) counterfactual outcome

  • This estimand allows us to answer

    Are ATTs heterogeneous across groups and time?
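
To make the estimand concrete, here is a sketch of the simplest possible estimator of \(ATT(g, t)\). It assumes unconditional parallel trends, no anticipation, and never-treated units as the comparison group, which is a simplification of the covariate-adjusted approach in the paper. The function name `att_gt` and the noise-free toy panel are assumptions for illustration.

```python
import numpy as np

def att_gt(Y, G, g, t):
    """Unconditional ATT(g, t) with never-treated units as the comparison group.

    Y is an (n, T) array whose column j holds period j + 1; G holds first
    treatment periods, with np.inf for never-treated units.
    """
    base = g - 1                               # last period before cohort g is treated
    diff = Y[:, t - 1] - Y[:, base - 1]        # long difference Y_t - Y_{g-1}
    return diff[G == g].mean() - diff[np.isinf(G)].mean()

# Toy panel: common trend equal to t, effect 1 + (t - g) for cohort g
periods = np.arange(1, 5)
G = np.array([2.0, 2.0, 3.0, 3.0, np.inf, np.inf])
Y = np.tile(periods.astype(float), (len(G), 1))
for i, g_i in enumerate(G):
    treated = periods >= g_i
    Y[i, treated] += 1.0 + (periods[treated] - g_i)
```

For example, `att_gt(Y, G, g=2, t=4)` recovers the effect of cohort 2 two periods after treatment in this toy panel.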

Callaway and Sant’Anna 2021

  • Let \(e = t - g\) be the event time, i.e. the time elapsed since treatment
  • Can stitch \(ATT(g, t)\)’s together to a new estimand: e.g. average effect of treatment on the treated \(e\) periods after treatment
    • Plotting these corresponds to classic “event study plots”
  • Concretely, can aggregate \(ATT(g, t)\)’s with respect to \(e\) as

\[\begin{align} \theta_{es}(e) = \sum_{g \in \mathcal{G}} \mathbf{1}\{g + e \leq T\} P(G = g \mid G + e \leq T)ATT(g, g + e) \end{align}\]

  • See the paper for other aggregation schemes (e.g. average effect for a given cohort across all post-treatment periods)
  • Identification: the assumptions needed are similar to the ones in the two-period case with covariates, plus some more
  • See paper for doubly robust estimand with corresponding Neyman Orthogonal score
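
The event-study aggregation \(\theta_{es}(e)\) is a cohort-size-weighted average, which a few lines of code make explicit. In this sketch the function name `theta_es`, the dictionary inputs and the numbers are hypothetical; cohort shares stand in for \(P(G = g \mid G + e \leq T)\), and only treated cohorts are included.

```python
def theta_es(e, att, cohort_sizes, T):
    """Aggregate ATT(g, g + e) over cohorts observed e periods after treatment.

    att maps (g, t) to ATT(g, t); cohort_sizes maps g to the number of units
    with G = g (treated cohorts only).
    """
    eligible = [g for g in cohort_sizes if g + e <= T]
    total = sum(cohort_sizes[g] for g in eligible)
    # cohort_sizes[g] / total estimates P(G = g | G + e <= T)
    return sum(cohort_sizes[g] / total * att[(g, g + e)] for g in eligible)

T = 4
cohort_sizes = {2: 30, 3: 10}
att = {(2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0, (3, 3): 2.0, (3, 4): 4.0}
event_study = {e: theta_es(e, att, cohort_sizes, T) for e in range(3)}
```

For \(e = 2\) only cohort 2 is still observed, so the aggregate equals \(ATT(2, 4)\); the composition of cohorts therefore changes across event times.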

DML Staggered Adoption

  • Starting from the estimand turned out to be the right way cf. Callaway and Sant’Anna 2021
  • Our new friend:
    • The group-time average treatment effect on the treated estimand, \(ATT(g, t)\)
  • Besides enlightening us, Callaway and Sant’Anna 2021 also gave us a Neyman orthogonal score
  • Estimand & Neyman Orthogonal score: all we need to do Double/Debiased Machine Learning

CSA2021 - example

DoubleML Package

Instrumental Variables (IV)

  • Causal DAG for IV
  • We cannot always just “control for everything” to get at causal estimates

LATE

  • LATE estimand
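
For a binary instrument \(Z\), and under the usual IV assumptions (relevance, exclusion, independence and monotonicity), the LATE without covariates can be written as the Wald ratio \((E[Y \mid Z = 1] - E[Y \mid Z = 0]) / (E[D \mid Z = 1] - E[D \mid Z = 0])\). The sketch below checks this on simulated data; the instrument name `Z`, the function `wald_late` and the data-generating process are assumptions for illustration.

```python
import numpy as np

def wald_late(Y, D, Z):
    """Ratio of the reduced-form effect of Z on Y to the first-stage effect of Z on D."""
    reduced_form = Y[Z == 1].mean() - Y[Z == 0].mean()
    first_stage = D[Z == 1].mean() - D[Z == 0].mean()
    return reduced_form / first_stage

rng = np.random.default_rng(0)
n = 20_000
Z = rng.integers(0, 2, size=n)              # binary instrument
complier = rng.random(n) < 0.6              # 60% compliers, 40% never-takers
D = (Z * complier).astype(int)              # only compliers take treatment when Z = 1
Y = 1.0 + 3.0 * D + rng.normal(size=n)      # effect of 3 for treated compliers
late_hat = wald_late(Y, D, Z)
```

The estimate targets the average effect among compliers only, which is why it is a local effect.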

DML IV

  • Neyman Orthogonal score