Exercise Class 12
Exercise 1 - Two-Period DID
Implement the DML Algorithm for the two-period DID with covariates as shown in the slides from last lecture assuming CNA, CPT and Overlap.
While implementing the algorithm you can pick any nuisance estimators you like.
Load the data using the snippet below.
import polars as pl

df = pl.read_csv(
    "https://gist.githubusercontent.com/jsr-p/b7c620d0692c4c63c807f5b0882f786a/raw/04e45b7066e4e3d7df55be650407265f2700704c/did2-lecture12.csv"
)

Note: The true ATT equals \(2\).
Use your algorithm to compare the estimates, standard errors and confidence intervals when using the following three sets of nuisance estimators for \(m_{0}(X)\) and \(g_{0}(0, X)\), respectively:
- LogisticRegression, LinearRegression
- RandomForestClassifier, RandomForestRegressor
- LGBMClassifier, LGBMRegressor from the lightgbm package

For the \(p_{0}\) nuisance, use DummyRegressor in all three of the above cases.

You can use the snippet below as a starting point.
import numpy as np
import polars as pl
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
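Once cross-fitted nuisance predictions are available, the last step of the algorithm is to plug them into the score and solve for the ATT. The sketch below illustrates that step only. It assumes the slides use the standard doubly robust ATT score for two-period panel data, \(\psi = \frac{D - m(X)}{p\,(1 - m(X))}(\Delta Y - g(0, X)) - \frac{D}{p}\theta\); the function name `did_att_from_nuisances`, the simulated data, and the use of the true (oracle) nuisance functions instead of fitted learners are all illustrative assumptions, not part of the exercise material.

```python
import numpy as np


def did_att_from_nuisances(delta_y, d, m_hat, g0_hat, p_hat):
    """ATT estimate, standard error and 95% CI from nuisance predictions.

    delta_y: outcome change Y_1 - Y_0; d: treatment indicator;
    m_hat: predictions of P(D = 1 | X);
    g0_hat: predictions of E[Y_1 - Y_0 | D = 0, X];
    p_hat: estimate of P(D = 1) (scalar or one value per observation).
    """
    delta_y, d, m_hat, g0_hat, p_hat = (
        np.asarray(a, dtype=float) for a in (delta_y, d, m_hat, g0_hat, p_hat)
    )
    n = d.shape[0]
    # The score is linear in theta: psi = psi_a * theta + psi_b.
    psi_a = -d / p_hat
    psi_b = (d - m_hat) / (p_hat * (1.0 - m_hat)) * (delta_y - g0_hat)
    theta = -psi_b.mean() / psi_a.mean()
    # Plug-in variance: E[psi^2] / J^2 with J = E[psi_a].
    psi = psi_a * theta + psi_b
    se = np.sqrt(np.mean(psi**2) / psi_a.mean() ** 2 / n)
    ci = (theta - 1.96 * se, theta + 1.96 * se)
    return theta, se, ci


# Simulated data (an assumption for illustration) with a true ATT of 2.
rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
m_true = 1.0 / (1.0 + np.exp(-0.5 * x))  # true propensity P(D = 1 | X)
d = rng.binomial(1, m_true)
delta_y = 1.0 + x + 2.0 * d + rng.normal(size=n)

# Oracle nuisances stand in for cross-fitted predictions here.
theta_hat, se_hat, ci_hat = did_att_from_nuisances(
    delta_y, d, m_hat=m_true, g0_hat=1.0 + x, p_hat=d.mean()
)
print(theta_hat, se_hat, ci_hat)
```

In the exercise itself, `m_hat`, `g0_hat` and `p_hat` would be replaced by out-of-fold predictions from the chosen learners, with the outcome regression trained on control units only.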
Exercise 2 - Instrumental Variables
This exercise will revisit the paper “Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size” by Angrist and Evans (1998). The data is a modified version of the data from the original paper. You can download the dataset on Absalon here. The provided data is from an old introductory econometrics exam with some modifications. We are interested in estimating the causal link between fertility and female labor supply using the data at hand, in particular using Double/Debiased Machine Learning.
An overview of the variables is listed below:
- workm: Dummy, indicating if the mother works or not
- weekm: Weeks worked during the year
- faminc: Family (joint) income in US dollars
- lfaminc: Log of faminc
- agem: Age of mother
- agem1stkid: Age of mother at first birth
- nkids: Number of kids born to mother
- morekids: Dummy, indicating if mother had more than two children
- boy1stkid: Dummy, indicating if firstborn is a boy
- boy2ndkid: Dummy, indicating if secondborn is a boy
- samesex: Dummy, indicating if firstborn and secondborn have the same sex
- black: Dummy, indicating mother’s race (reference: White)
- hisp: Dummy, indicating mother’s race (reference: White)
- other: Dummy, indicating mother’s race (reference: White)
- lowedu: Dummy, indicating if mother’s education is less than high school
- hsedu: Dummy, indicating if mother is a high-school graduate
- hedu: Dummy, indicating if mother’s education is more than high school
After having downloaded the data and moved it into e.g. your data folder in your project directory, you can load the data with the following snippet:
import polars as pl

df = pl.read_parquet("data/angristevans.parquet")

For a recap of instrumental variables and potential outcomes, see chapter 4 of Angrist & Pischke (2008). The setup is as follows: we consider an outcome variable \(Y\), endogenous treatment \(D\), instrument \(Z\) and covariates \(X\). The endogenous treatment is confounded, so a simple regression of \(Y\) on \(D\) will provide biased estimates of the effect of \(D\) on \(Y\).
In the provided dataset:
- \(Y\): either workm, weekm or lfaminc
- \(D\): morekids
- \(Z\): samesex
- \(X\): all other variables
Our estimand of interest is the local average treatment effect (LATE): \[\begin{align*} \theta_{0} := E[Y(1) - Y(0) \mid D(1) > D(0)] \end{align*} \tag{1}\] identified by (without covariates) \[\begin{align*} \tilde{\theta} = \frac{ E[Y \mid Z = 1] - E[Y \mid Z = 0] }{ E[D \mid Z = 1] - E[D \mid Z = 0] }, \end{align*} \tag{2}\] which is also known as the simple Wald estimand, and in the case with covariates, by: \[\begin{align*} \theta = \frac{ E[ E[Y \mid X, Z = 1] - E[Y \mid X, Z = 0] ] }{ E[E[D \mid X, Z = 1] - E[D \mid X, Z = 0]] }. \end{align*} \tag{3}\]
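As a small numerical illustration of Equation 2, the sketch below computes the plug-in Wald estimate on simulated data and checks it against the just-identified IV ratio \(\widehat{\text{Cov}}(Y, Z) / \widehat{\text{Cov}}(D, Z)\), which is what 2SLS with a constant and a single binary instrument reduces to. The data-generating process and all variable names are assumptions made for illustration; they are not the Angrist and Evans data.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 10_000
z = rng.binomial(1, 0.5, size=n)  # binary instrument
u = rng.normal(size=n)  # unobserved confounder of D and Y
d = (0.5 * z + u + rng.normal(size=n) > 0.5).astype(float)  # treatment
y = 1.0 - 0.4 * d + u + rng.normal(size=n)  # outcome

# Plug-in Wald estimator: replace the expectations in Equation 2
# with sample averages within the Z = 1 and Z = 0 groups.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

# Just-identified IV with a constant: ratio of sample covariances.
iv_ratio = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

assert np.allclose(wald, iv_ratio)
print(wald, iv_ratio)
```

The two numbers agree up to floating-point error because, with a binary instrument, both sample covariances equal the corresponding difference in group means times the same factor, which cancels in the ratio.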
In the following, do each step for each of the three outcomes workm, weekm and lfaminc.
Install the linearmodels package.
Estimate Equation 2 using a simple plug-in estimator (i.e. by replacing the expectations with sample averages). Then estimate Equation 2 again, now using 2SLS. Add a constant term before estimating the model. Assert that the estimates you get from the two estimators are the same using np.allclose.
Now estimate the quantity in Equation 3 using only 2SLS from the linearmodels package. Use as covariates the ones below (where const is a constant term that you should add to your dataset):

covariates = [
    "const",
    "boy1stkid",
    "boy2ndkid",
    "agem",
    "agem1stkid",
    "black",
    "hisp",
    "other",
    "hsedu",
    "highedu",
]

Compare the estimates to the simple Wald estimates from the previous exercise.
Estimate Equation 3 again but now using Double/Debiased Machine Learning. Specifically, you should use the Interactive IV model (IIVM). Follow the example in the documentation. You can use any machine learners as the nuisance estimators.
Compare all the estimates of the LATE estimand (Equation 1) that you obtained in the previous exercises. Construct a dataframe with the estimates and confidence intervals, sorted by the name of the outcome variable. How do the estimates of the percentage change in joint family income in US dollars differ between the models?
Note: the change in percentage can be computed as \(100 \cdot (\exp(\hat{\theta}) - 1)\) where \(\hat{\theta}\) is some estimate of the LATE when using lfaminc as the outcome variable.
Optional: Try using different nuisance estimators. Does it make a difference which one you use?
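The percentage-change formula from the note above can be applied to a point estimate and, because the transformation is monotone, to the endpoints of a confidence interval as well. The numbers below are made up purely for illustration and are not results from the data; `pct_change` is a hypothetical helper name.

```python
import math


def pct_change(theta: float) -> float:
    """Percentage change implied by a log-point effect theta."""
    # expm1(theta) computes exp(theta) - 1 accurately for small theta.
    return 100.0 * math.expm1(theta)


# Hypothetical estimate and confidence interval on the log scale.
theta_hat = -0.05
ci_theta = (-0.09, -0.01)

pct = pct_change(theta_hat)
pct_ci = tuple(pct_change(bound) for bound in ci_theta)

print(round(pct, 2))  # -4.88
print(pct_ci)
```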
Optional: Implement the DML algorithm using the score for the IIVM model as shown in the lecture slides here.
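For the optional implementation, the final step can be sketched as follows, assuming the slides use the standard IIVM score from the DML literature with nuisances \(g(z, X) = E[Y \mid Z = z, X]\), \(r(z, X) = E[D \mid Z = z, X]\) and \(m(X) = P(Z = 1 \mid X)\). The function name `iivm_late_from_nuisances`, the simulated data, and the use of the true (oracle) nuisance functions in place of cross-fitted predictions are assumptions for illustration.

```python
import numpy as np


def iivm_late_from_nuisances(y, d, z, g0_hat, g1_hat, r0_hat, r1_hat, m_hat):
    """LATE estimate and standard error from IIVM nuisance predictions."""
    y, d, z, g0_hat, g1_hat, r0_hat, r1_hat, m_hat = (
        np.asarray(a, dtype=float)
        for a in (y, d, z, g0_hat, g1_hat, r0_hat, r1_hat, m_hat)
    )
    n = y.shape[0]
    w1 = z / m_hat  # inverse-probability weight for Z = 1
    w0 = (1.0 - z) / (1.0 - m_hat)  # inverse-probability weight for Z = 0
    # Linear score psi = psi_a * theta + psi_b: the numerator (psi_b) and
    # denominator (-psi_a) are doubly robust versions of the two
    # differences in Equation 3.
    psi_b = g1_hat - g0_hat + w1 * (y - g1_hat) - w0 * (y - g0_hat)
    psi_a = -(r1_hat - r0_hat + w1 * (d - r1_hat) - w0 * (d - r0_hat))
    theta = -psi_b.mean() / psi_a.mean()
    psi = psi_a * theta + psi_b
    se = np.sqrt(np.mean(psi**2) / psi_a.mean() ** 2 / n)
    return theta, se


# Simulated data (an assumption for illustration) with a true LATE of 1.5.
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
m_true = 1.0 / (1.0 + np.exp(-0.5 * x))  # P(Z = 1 | X)
z = rng.binomial(1, m_true)
u = rng.uniform(size=n)  # latent type: u < 0.2 always-taker, u < 0.7 complier
d = np.where(z == 1, u < 0.7, u < 0.2).astype(float)
y = 1.0 + x + 1.5 * d + 2.0 * (u - 0.5) + rng.normal(size=n)

# Oracle nuisances stand in for cross-fitted predictions.
r0, r1 = np.full(n, 0.2), np.full(n, 0.7)
g0, g1 = 1.0 + x + 1.5 * r0, 1.0 + x + 1.5 * r1

late_hat, late_se = iivm_late_from_nuisances(y, d, z, g0, g1, r0, r1, m_true)
print(late_hat, late_se)
```

With real data, each nuisance would be replaced by out-of-fold predictions from a learner of your choice, with \(g(z, X)\) and \(r(z, X)\) fitted separately on the \(Z = 1\) and \(Z = 0\) subsamples.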