Exercise Class 12

Author

Jonas Skjold Raaschou-Pedersen

Published

November 27, 2025

Exercise 1 - Two-Period DID

Implement the DML algorithm for the two-period DID with covariates, as shown in the slides from the last lecture, assuming conditional no anticipation (CNA), conditional parallel trends (CPT) and overlap.

While implementing the algorithm you can pick any nuisance estimators you like.

Load the data using the snippet below.
```
import polars as pl


df = pl.read_csv(
    "https://gist.githubusercontent.com/jsr-p/b7c620d0692c4c63c807f5b0882f786a/raw/04e45b7066e4e3d7df55be650407265f2700704c/did2-lecture12.csv"
)
```
Note: The true ATT equals \(2\).
Use your algorithm to compare the estimates, standard errors and confidence intervals when using the following three sets of nuisance estimators for \(m_{0}(X)\) and \(g_{0}(0, X)\), respectively.
1. LogisticRegression, LinearRegression
2. RandomForestClassifier, RandomForestRegressor
3. LGBMClassifier, LGBMRegressor from the lightgbm package.
For the \(p_{0}\) nuisance use DummyRegressor in all three of the above cases.

You can use the snippet below as a starting point.
```
import numpy as np
import polars as pl
from sklearn.dummy import DummyRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
```
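As a hedged illustration of the final estimation step only: the sketch below takes nuisance predictions as given and computes the ATT estimate, standard error and confidence interval from a Neyman-orthogonal score of the form \(\psi = \frac{D - m(X)}{p\,(1 - m(X))}(\Delta Y - g(0, X)) - \frac{D}{p}\theta\). It assumes this form matches the one in the slides. The function name `did_dml_estimate` and the simulated data are hypothetical, and the true nuisance functions are plugged in here where your implementation would use cross-fitted predictions.

```python
import numpy as np


def did_dml_estimate(dy, d, m_hat, g0_hat, p_hat, z_crit=1.96):
    """Final DML step for a two-period DID score, given nuisance predictions.

    Hypothetical helper: assumes the score is linear in theta,
    psi = psi_a * theta + psi_b.
    """
    psi_b = (d - m_hat) / (p_hat * (1.0 - m_hat)) * (dy - g0_hat)
    psi_a = -d / p_hat
    theta = -psi_b.mean() / psi_a.mean()
    psi = psi_a * theta + psi_b
    # Sandwich variance: E[psi^2] / J^2 with J = E[psi_a]
    var = np.mean(psi**2) / psi_a.mean() ** 2
    se = np.sqrt(var / len(dy))
    return theta, se, (theta - z_crit * se, theta + z_crit * se)


# Simulated data (an assumed DGP, not the exercise data); the true ATT is 2.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
m_true = 1.0 / (1.0 + np.exp(-0.5 * x))      # P(D = 1 | X)
d = rng.binomial(1, m_true).astype(float)
dy = 1.0 + x + 2.0 * d + rng.normal(size=n)  # Delta Y = Y_1 - Y_0
g0_true = 1.0 + x                            # E[Delta Y | D = 0, X]
p_hat = np.full(n, d.mean())                 # constant prediction of P(D = 1)

# Oracle nuisances stand in for cross-fitted predictions in this sketch.
theta, se, ci = did_dml_estimate(dy, d, m_true, g0_true, p_hat)
print(f"theta = {theta:.3f}, se = {se:.3f}, ci = ({ci[0]:.3f}, {ci[1]:.3f})")
```

In your own solution, `m_true` and `g0_true` would be replaced by out-of-fold predictions, for example from `cross_val_predict`, with the outcome regression fitted on control observations only.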

Exercise 2 - Instrumental Variables

This exercise will revisit the paper “Children and their Mom’s Labor Supply” by Angrist and Evans (1998). The data is a modified version of the data from the original paper. You can download the dataset on Absalon here. The provided data is from an old introductory econometrics exam with some modifications¹. We are interested in estimating the causal link between fertility and female labor supply using the data at hand, in particular using Double/Debiased Machine Learning.

An overview of the variables is listed below:

workm: Dummy, indicating if the mother works or not
weekm: Weeks worked during the year
faminc: Family (joint) income in US dollars
lfaminc: Log of faminc
agem: Age of mother
agem1stkid: Age of mother at first birth
nkids: Number of kids born to mother
morekids: Dummy, indicating if mother had more than two children
boy1stkid: Dummy, indicating if firstborn is a boy
boy2ndkid: Dummy, indicating if secondborn is a boy
samesex: Dummy, indicating if firstborn and secondborn have the same sex
black: Dummy, indicating if mother is Black (reference: White)
hisp: Dummy, indicating if mother is Hispanic (reference: White)
other: Dummy, indicating if mother is of another race (reference: White)
lowedu: Dummy, indicating if mother’s education is less than high school
hsedu: Dummy, indicating if mother is a high-school graduate
hedu: Dummy, indicating if mother’s education is more than high school

After having downloaded the data and moved it into e.g. your data folder in your project directory, you can load the data with the following snippet:

```
import polars as pl

df = pl.read_parquet("data/angristevans.parquet")
```

For a recap of instrumental variables and potential outcomes, see chapter 4 of Angrist & Pischke (2008). The setup is as follows: we consider an outcome variable \(Y\), endogenous treatment \(D\), instrument \(Z\) and covariates \(X\). The endogenous treatment is confounded, so a simple regression will provide biased estimates of the causal effect of \(D\) on \(Y\).

In the provided dataset:

\(Y\): either workm, weekm or lfaminc
\(D\): morekids
\(Z\): samesex
\(X\): all other variables

Our estimand of interest is the local average treatment effect (LATE): \[\begin{align*} \theta_{0} := E[Y(1) - Y(0) \mid D(1) > D(0)] \end{align*} \tag{1}\] identified by (without covariates) \[\begin{align*} \tilde{\theta} = \frac{ E[Y \mid Z = 1] - E[Y \mid Z = 0] }{ E[D \mid Z = 1] - E[D \mid Z = 0] }, \end{align*} \tag{2}\] which is also known as the simple Wald estimand, and in the case with covariates, by: \[\begin{align*} \theta = \frac{ E[ E[Y \mid X, Z = 1] - E[Y \mid X, Z = 0] ] }{ E[E[D \mid X, Z = 1] - E[D \mid X, Z = 0]] }. \end{align*} \tag{3}\]
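To make Equation 2 concrete, the following minimal sketch computes the plug-in Wald estimate on simulated data and checks that it coincides with the just-identified IV estimator \((Z'X)^{-1}Z'y\) with a constant, which is what 2SLS reduces to with one binary instrument. The data-generating process and all variable names are assumptions for illustration; the sketch uses plain NumPy instead of the linearmodels package.

```python
import numpy as np

# Simulated data (an assumed DGP, not the Angrist-Evans data).
rng = np.random.default_rng(1)
n = 5_000
z = rng.binomial(1, 0.5, size=n).astype(float)                # binary instrument
u = rng.normal(size=n)                                        # unobserved confounder
d = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(float)  # endogenous treatment
y = 1.0 - 0.5 * d + u + rng.normal(size=n)                    # outcome

# Plug-in Wald estimator: replace each expectation by a sample average.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    d[z == 1].mean() - d[z == 0].mean()
)

# Just-identified IV with a constant: beta = (Z'X)^{-1} Z'y.
X = np.column_stack([np.ones(n), d])   # regressors: constant and treatment
Zm = np.column_stack([np.ones(n), z])  # instruments: constant and Z
beta_iv = np.linalg.solve(Zm.T @ X, Zm.T @ y)

print(f"Wald: {wald:.4f}, IV slope: {beta_iv[1]:.4f}")
assert np.allclose(wald, beta_iv[1])
```

The two numbers agree up to floating-point error because, with a binary instrument, the IV slope equals the ratio of the sample covariances of \(Z\) with \(Y\) and with \(D\), which is the ratio of the two differences in means.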

In the following, do each step for each of the three outcomes workm, weekm and lfaminc.

Install the linearmodels package.
Estimate Equation 2 using a simple plug-in estimator (i.e. by replacing the expectations with sample averages). Then estimate Equation 2 using 2SLS, adding a constant term before estimating the model. Assert that the estimates from the two estimators are the same using np.allclose.
Now estimate the quantity in Equation 3 using only 2SLS from the linearmodels package. Use the covariates below (where const is a constant term that you should add to your dataset):
```
covariates = [
 "const",
 "boy1stkid",
 "boy2ndkid",
 "agem",
 "agem1stkid",
 "black",
 "hisp",
 "other",
 "hsedu",
 "highedu",
]
```
Compare the estimates to the simple Wald estimates from the previous exercise.
Estimate Equation 3 again but now using Double/Debiased Machine Learning. Specifically, you should use the Interactive IV model (IIVM). Follow the example in the documentation. You can use any machine learners as the nuisance estimators.
Compare all estimates of the LATE estimand in Equation 1 that you have obtained in the previous exercises. Construct a dataframe with the estimates and confidence intervals, sorted by the name of the outcome variable. How do the estimates of the percentage change in joint family income in US dollars differ between the models?

Note: the change in percentage can be computed as \(100 \cdot (\exp(\hat{\theta}) - 1)\) where \(\hat{\theta}\) is some estimate of the LATE when using lfaminc as the outcome variable.
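As a small worked example of the transformation in the note, the sketch below converts a log-point estimate and its confidence bounds into percentage changes. The numbers are hypothetical placeholders, not results from the data, and `pct_change` is an illustrative helper name.

```python
import math


def pct_change(theta_hat: float) -> float:
    """Percentage change implied by a coefficient on a log outcome."""
    return 100.0 * (math.exp(theta_hat) - 1.0)


# Hypothetical estimate and confidence bounds (not results from the data).
theta_hat, ci_low, ci_high = -0.05, -0.09, -0.01

pct = pct_change(theta_hat)
# The transformation is increasing, so the bounds can be mapped directly.
pct_ci = (pct_change(ci_low), pct_change(ci_high))
print(f"{pct:.2f}% (CI: {pct_ci[0]:.2f}%, {pct_ci[1]:.2f}%)")
```

For small estimates the result is close to \(100 \cdot \hat{\theta}\), but the gap grows with the magnitude of \(\hat{\theta}\), so the exact formula is the safer choice when comparing models.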
Optional: Try using different nuisance estimators. Does it make a difference which one you use?
Optional: Implement the DML algorithm using the score for the IIVM model as shown in the lecture slides here.
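For the last optional task, a hedged sketch of the final estimation step may help. It assumes the IIVM score has the usual doubly robust form, with \(g(z, X) = E[Y \mid X, Z = z]\), \(r(z, X) = E[D \mid X, Z = z]\) and \(m(X) = P(Z = 1 \mid X)\); check this against the slides. The helper name `iivm_estimate` and the simulated data are hypothetical, and the true nuisance functions are plugged in where cross-fitted predictions would normally go.

```python
import numpy as np


def iivm_estimate(y, d, z, g0, g1, r0, r1, m):
    """Final DML step for an IIVM-type score, given nuisance predictions."""
    # Doubly robust term for E[E[Y | X, Z=1] - E[Y | X, Z=0]]
    psi_b = g1 - g0 + z * (y - g1) / m - (1 - z) * (y - g0) / (1 - m)
    # Same construction for D, entering the score with a minus sign
    psi_a = -(r1 - r0 + z * (d - r1) / m - (1 - z) * (d - r0) / (1 - m))
    theta = -psi_b.mean() / psi_a.mean()
    psi = psi_a * theta + psi_b
    se = np.sqrt(np.mean(psi**2) / psi_a.mean() ** 2 / len(y))
    return theta, se


# Simulated data (an assumed DGP): 20% always-takers, 50% compliers,
# 30% never-takers, homogeneous effect of 1, so the true LATE is 1.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n).astype(float)
u = rng.uniform(size=n)
always, complier = u < 0.2, (u >= 0.2) & (u < 0.7)
d = np.where(always, 1.0, np.where(complier, z, 0.0))
y = x + 1.0 * d + rng.normal(size=n)

# Oracle nuisances for this DGP; replace with cross-fitted predictions.
r1, r0 = np.full(n, 0.7), np.full(n, 0.2)  # E[D | X, Z = z]
g1, g0 = x + 0.7, x + 0.2                  # E[Y | X, Z = z]
m = np.full(n, 0.5)                        # P(Z = 1 | X)

theta, se = iivm_estimate(y, d, z, g0, g1, r0, r1, m)
print(f"theta = {theta:.3f}, se = {se:.3f}")
```

The ratio structure mirrors Equation 3: the numerator averages the adjusted effect of \(Z\) on \(Y\), and the denominator averages the adjusted effect of \(Z\) on \(D\).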

Footnotes

See p. 4 in the link for a more complete explanation of the data. We have removed all observations with faminc equal to \(0\).