Talked about estimands — interpretable quantities we would like to estimate
We link causal estimand to estimable estimand by causal assumptions
If assumptions hold we can estimate the causal estimand using our data
Estimand often depends on high-dimensional nuisances
Need to estimate nuisances — but they are not interesting in themselves
AI For Humanity, and therefore also for Econometricians
Use AI/Machine Learning to deal with nuisances
Naive approach by plugging in machine learners doesn’t work (generally)
Need to be more clever: Double/Debiased Machine Learning (DML) does exactly that
Neyman Orthogonality and Cross-fitting as central components
Double/Debiased Machine Learning Recap
Target parameter (estimand) \(\theta_0\) defined as the solution to a moment condition\[\begin{align}
E\left[ \psi(W; \theta_{0}, \eta_{0}) \right] & = 0
\end{align}\] where \(\eta_{0}\) is a nuisance parameter, \(\psi\) a Neyman orthogonal score function and \(W\) a tuple of data (e.g. \(W = (Y, D, X)\))
Estimating \(\eta_{0}\) using machine learning leads to regularization bias and overfitting bias
Focus on estimand frees our mind and heals our soul
DML Estimator
A DML estimator is a plug-in estimator that leverages a first-step nuisance parameter estimator by relying on Neyman orthogonal scores and cross-fitting
\(\{I_{k}\}_{k=1}^{K}\) is a random partition of the individuals \(\{1, \ldots, n\}\) into \(K\) subsamples
\(\{\hat{\eta}_{-k}\}_{k=1}^{K}\) are cross-fitted nuisance parameter estimators, each calculated on the observations excluding those in the sub-sample \(I_{k}\)
General DML Algorithm
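A minimal sketch of how the two ingredients fit together, assuming the score is linear in the target parameter, \(\psi = \psi_{a}\theta + \psi_{b}\), so that the moment condition can be solved in closed form. The linear form and the names `make_folds`, `dml_linear_score`, `psi_a` and `psi_b` are assumptions made for this illustration only.

```python
import random


def make_folds(n, K, seed=0):
    """Randomly partition the indices 0, ..., n-1 into K folds I_1, ..., I_K."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]


def dml_linear_score(data, folds, fit_nuisance, psi_a, psi_b):
    """Cross-fitted estimate for a score of the (assumed) form psi = psi_a * theta + psi_b."""
    sum_a = 0.0
    sum_b = 0.0
    for fold in folds:
        held_out = set(fold)
        # Nuisance estimate eta_hat_{-k}: fitted on all observations outside fold k.
        train = [w for i, w in enumerate(data) if i not in held_out]
        eta_hat = fit_nuisance(train)
        # The score is evaluated only on fold k, never on the training observations.
        for i in fold:
            sum_a += psi_a(data[i], eta_hat)
            sum_b += psi_b(data[i], eta_hat)
    # Solve sum_i psi(W_i; theta, eta_hat) = 0 for theta.
    return -sum_b / sum_a


# Toy usage: with psi = y - theta and no nuisance, the estimate is the sample mean.
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = make_folds(len(ys), K=3)
theta_hat = dml_linear_score(
    ys,
    folds,
    fit_nuisance=lambda train: None,
    psi_a=lambda w, eta: -1.0,
    psi_b=lambda w, eta: w,
)
```

The toy call only checks the mechanics. In a real application `fit_nuisance` would train machine learners on the complement of each fold.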
The search for estimands
We’ll consider two methods today
Difference-in-Differences
Arguably the most popular method in the social sciences
Interesting in itself
Also a good showcase of the importance of being clear about what you are estimating (i.e. your estimand)
Instrumental Variables
Estimand of interest is the LATE
Two period Difference-in-Differences
\(n\) units observed in two periods \(t \in \{0, 1\}\)
\(Y_{t}\) outcome
\(D\) binary treatment equal to \(1\) for the treated units
Potential outcomes (POs): \(Y_t(d)\) where \(t,d\in\{0,1\}\)
\[\begin{align}
\exists \epsilon > 0: P(D = 1) \geq \epsilon
\text{ and }
P(D = 1 \mid X) \leq 1 - \epsilon
\end{align}\] i.e. there is a treatment group and controls for every value of \(X\)
Hackin’ away
Conditional ATT (CATT)
\[\begin{align}
\tau(X)
&= E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1]
\\
&= E[Y_{1}\mid X, D = 1]
- E[Y_{1}(0)\mid X, D = 1]
\end{align}\]
Add \(0\) trick:
\[\begin{align}
E[Y_{1}(0)\mid X, D = 1]
&= E[Y_{1}(0)\mid X, D = 1]
\pm
E[Y_{0}(0)\mid X, D = 1]
\\
&=
E[Y_{0}(0)\mid X, D = 1]
+
E[Y_{1}(0) - Y_{0}(0)\mid X, D = 1]
\\
&\overset{\text{CNA}}{=}
E[Y_{0}(1)\mid X, D = 1]
+
E[Y_{1}(0) - Y_{0}(0)\mid X, D = 1]
\\
& \overset{\text{CPT}}{=}
E[Y_{0}(1)\mid X, D = 1]
+
E[Y_{1}(0) - Y_{0}(0)\mid X, D = 0]
\\
&=
E[Y_{0}\mid X, D = 1]
+
E[Y_{1} - Y_{0}\mid X, D = 0]
\end{align}\]
Hence the CATT:
\[\begin{align}
\tau(X)
=
E[Y_{1} - Y_{0}\mid X, D = 1]
-
E[Y_{1} - Y_{0}\mid X, D = 0]
\end{align}\]
CATT to ATT by the LIE
Derive the ATT from the CATT using the LIE:
\[\begin{align}
\tau
&= E[
\underbrace{
E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1]
}_{\tau(X)}\mid D = 1
]
\\
&= E\left[ E[Y_{1} - Y_{0}\mid X, D = 1]
\mid D = 1\right]
-
E[E[Y_{1} - Y_{0}\mid X, D = 0] \mid D = 1]
\\
&= E[Y_{1} - Y_{0}\mid D = 1]
-
E[E[Y_{1} - Y_{0}\mid X, D = 0] \mid D = 1]
\end{align}\]
The Law of Iterated Expectations (LIE)
It might be called the LIE — but it’s a truth in the deepest sense
For random variables \(X, Y, Z\) we used: \(E[E[Y \mid X, Z] \mid Z] = E[Y \mid Z]\)
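The identity can be checked exactly on a small discrete distribution. The joint pmf below is hypothetical and chosen only for illustration; exact fractions avoid rounding noise.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Z = z, Y = y); keys are (x, z, y).
pmf = {
    (0, 0, 1): F(1, 20), (0, 0, 3): F(1, 20),
    (1, 0, 2): F(4, 20), (1, 0, 6): F(2, 20),
    (0, 1, 0): F(3, 20), (0, 1, 4): F(1, 20),
    (1, 1, 5): F(2, 20), (1, 1, 9): F(6, 20),
}


def cond_mean_y(keep):
    """E[Y | event], where the event is described by keep(x, z)."""
    mass = sum(p for (x, z, y), p in pmf.items() if keep(x, z))
    return sum(y * p for (x, z, y), p in pmf.items() if keep(x, z)) / mass


def lie_lhs(z0):
    """E[ E[Y | X, Z] | Z = z0 ]: weight E[Y | X = x, Z = z0] by P(X = x | Z = z0)."""
    p_z = sum(p for (x, z, y), p in pmf.items() if z == z0)
    total = F(0)
    for x0 in (0, 1):
        p_xz = sum(p for (x, z, y), p in pmf.items() if x == x0 and z == z0)
        total += cond_mean_y(lambda x, z: x == x0 and z == z0) * p_xz / p_z
    return total


def lie_rhs(z0):
    """E[Y | Z = z0], computed directly from the joint pmf."""
    return cond_mean_y(lambda x, z: z == z0)
```

For this pmf both sides equal 3 at \(Z = 0\) and 17/3 at \(Z = 1\).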
With \(\Delta Y := Y_{1} - Y_{0}\):
\[\begin{align}
\tau &= E[\Delta Y\mid D = 1]
-
E[E[\Delta Y\mid X, D = 0] \mid D = 1]
\end{align} \tag{3}\]
Seems good! However, Equation 4 implicitly imposes:
\(E[Y_{1}(1) - Y_{1}(0) \mid X, D = 1] = \tau_{\text{ols}}\)
For \(d \in \{0, 1\}\): \(E[Y_{1} - Y_{0} \mid X, D = d] = E[Y_{1} - Y_{0} \mid D = d]\)
The two implicit assumptions mean that the regression model:
Assumes treatment effects that are homogeneous in \(X\)
Rules out \(X\)-specific trends in both treated and comparison groups
If the implicit assumptions are violated, the estimand \(\tau_{\text{ols}}\) is, in general, different from the ATT, \(E[Y_{1}(1) - Y_{1}(0) \mid D = 1]\)
This motivates the use of a DML estimator instead to target the ATT directly
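To make the role of \(X\)-specific trends concrete, the sketch below computes population quantities exactly in a hypothetical setup with a binary covariate. All numbers are invented for illustration. It compares the ATT, the identified expression \(E[\Delta Y\mid D = 1] - E[E[\Delta Y\mid X, D = 0] \mid D = 1]\), and the unconditional contrast \(E[\Delta Y\mid D = 1] - E[\Delta Y\mid D = 0]\) that ignores \(X\). The unconditional contrast is used here as a simple stand-in and is not claimed to equal \(\tau_{\text{ols}}\).

```python
from fractions import Fraction as F

# Hypothetical population with a binary covariate X.
p_x = {0: F(1, 2), 1: F(1, 2)}   # P(X = x)
m = {0: F(1, 5), 1: F(3, 5)}     # propensity score P(D = 1 | X = x)
trend = {0: F(1), 1: F(3)}       # E[Y1(0) - Y0(0) | X = x], the same for D = 0 and D = 1 (CPT)
catt = {0: F(1), 1: F(2)}        # tau(x) = E[Y1(1) - Y1(0) | X = x, D = 1]

# Observed conditional means of Delta Y = Y1 - Y0 implied by CNA and CPT; keys are (x, d).
dy = {(x, 1): trend[x] + catt[x] for x in p_x}
dy.update({(x, 0): trend[x] for x in p_x})


def p_x_given_d(x, d):
    """P(X = x | D = d) by Bayes' rule."""
    def p_d_given(v):
        return m[v] if d == 1 else 1 - m[v]
    return p_x[x] * p_d_given(x) / sum(p_x[v] * p_d_given(v) for v in p_x)


# ATT: average the CATT over the covariate distribution of the treated.
att = sum(catt[x] * p_x_given_d(x, 1) for x in p_x)

# Identified expression: E[Delta Y | D = 1] - E[ E[Delta Y | X, D = 0] | D = 1 ].
att_identified = (sum(dy[(x, 1)] * p_x_given_d(x, 1) for x in p_x)
                  - sum(dy[(x, 0)] * p_x_given_d(x, 1) for x in p_x))

# Unconditional contrast E[Delta Y | D = 1] - E[Delta Y | D = 0], which ignores X.
unconditional_did = (sum(dy[(x, 1)] * p_x_given_d(x, 1) for x in p_x)
                     - sum(dy[(x, 0)] * p_x_given_d(x, 0) for x in p_x))
```

Here the ATT and the identified expression both equal 7/4, while the unconditional contrast equals 31/12. The gap arises because the trend depends on \(X\) and the treated and comparison groups have different distributions of \(X\).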
Neyman orthogonal score for the ATT in two-period DID setup
\[\begin{align}
\varphi(W; \tau, \eta_{0})
& =
\frac{D - P(D = 1 \mid X)}{P(D = 1)[1 - P(D = 1 \mid X)]}
(\Delta Y - E[\Delta Y\mid X, D = 0])
- \frac{D}{P(D = 1)} \tau
\\&=
\frac{D - m_{0}(X)}{p_{0}[1 - m_{0}(X)]}
[\Delta Y - g_{0}(0, X)]
- \frac{D}{p_{0}} \tau,
\end{align}\] where \(W = (Y_{1}, Y_{0}, D, X)\) is the data, \(\tau\) is our estimand (the ATT), and \(\eta = (p, m, g)\) collects the nuisance parameters, with true values \(\eta_{0} = (p_{0}, m_{0}, g_{0})\): \[\begin{align*}
m_{0}(X) := P(D = 1 \mid X),
\quad
g_{0}(0, X) := E[\Delta Y\mid X, D = 0],
\quad
p_{0} := P(D = 1)
\end{align*}\]
The DML framework enables the use of machine learners to estimate the potentially high-dimensional \(m_{0}(X)\) and \(g_{0}(0, X)\) (a simple sample average is enough for \(p_{0}\))
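Because the score is linear in \(\tau\), the empirical moment condition can be solved in closed form. As a sketch, assume a single estimate \(\hat{p}\) (the sample mean of \(D\)) is used in all folds, and write \(\hat{m}_{-k}\) and \(\hat{g}_{-k}\) for the cross-fitted nuisance estimators (notation introduced here for illustration):
\[\begin{align}
0 &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_{k}}
\left(
\frac{D_{i} - \hat{m}_{-k}(X_{i})}{\hat{p}[1 - \hat{m}_{-k}(X_{i})]}
[\Delta Y_{i} - \hat{g}_{-k}(0, X_{i})]
- \frac{D_{i}}{\hat{p}} \hat{\tau}
\right)
\\
\Rightarrow \quad
\hat{\tau} &=
\frac{
\sum_{k=1}^{K} \sum_{i \in I_{k}}
\frac{D_{i} - \hat{m}_{-k}(X_{i})}{1 - \hat{m}_{-k}(X_{i})}
[\Delta Y_{i} - \hat{g}_{-k}(0, X_{i})]
}{
\sum_{i=1}^{n} D_{i}
}
\end{align}\]
Under this assumption \(\hat{p}\) cancels from the estimator. If \(\hat{p}\) were instead estimated separately within each fold, it would no longer cancel exactly.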
DML Algorithm DID
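A minimal sketch of a cross-fitted estimator built on the two-period DID score, under loud simplifying assumptions: the data are simulated from a hypothetical DGP with a single binary covariate, and cell means stand in for machine learners (sufficient only because \(X\) is binary). The function names `simulate`, `fit_nuisances` and `dml_did_att` are illustrative and do not come from any library.

```python
import random


def simulate(n, seed=0):
    """Hypothetical DGP with binary X, X-specific trends and a true ATT of 1.75."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        d = 1 if rng.random() < (0.6 if x else 0.2) else 0
        # Delta Y = X-specific trend + treatment effect for the treated + noise.
        delta_y = (1 + 2 * x) + d * (1 + x) + rng.gauss(0, 1)
        data.append((x, d, delta_y))
    return data


def fit_nuisances(train):
    """Cell means stand in for machine learners (enough here because X is binary)."""
    m_hat, g_hat = {}, {}
    for x in (0, 1):
        cell = [(d, dy) for (xx, d, dy) in train if xx == x]
        controls = [dy for (d, dy) in cell if d == 0]
        m_hat[x] = sum(d for d, _ in cell) / len(cell)  # estimate of P(D = 1 | X = x)
        g_hat[x] = sum(controls) / len(controls)        # estimate of E[Delta Y | X = x, D = 0]
    return m_hat, g_hat


def dml_did_att(data, K=5, seed=0):
    """Cross-fitted ATT estimate based on the two-period DID score."""
    n = len(data)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    p_hat = sum(d for _, d, _ in data) / n              # simple average for P(D = 1)
    num = 0.0
    den = 0.0
    for fold in folds:
        held_out = set(fold)
        # Nuisances are fitted without fold k, then evaluated on fold k only.
        m_hat, g_hat = fit_nuisances([w for i, w in enumerate(data) if i not in held_out])
        for i in fold:
            x, d, dy = data[i]
            num += (d - m_hat[x]) / (p_hat * (1 - m_hat[x])) * (dy - g_hat[x])
            den += d / p_hat
    return num / den                                    # root of the summed score in tau


data = simulate(20_000)
tau_hat = dml_did_att(data)
n_treated = sum(d for _, d, _ in data)
# Unconditional difference in mean Delta Y between treated and comparison units.
naive = (sum(dy for _, d, dy in data if d == 1) / n_treated
         - sum(dy for _, d, dy in data if d == 0) / (len(data) - n_treated))
```

In this simulated setup `tau_hat` should land close to the true ATT of 1.75, while `naive` targets a population value of roughly 2.58 because it ignores the \(X\)-specific trends.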
Exercise: Non-linear DGP
Non-linear DGP; nuisances non-linear and high-dimensional
\[\begin{align}
\bar{y}_{b}^{POST(a)} := \frac{1}{T - (a - 1)} \sum_{t = a}^{T}
\frac{
\sum_{i} y_{it} \mathbf{1}\{t_{i} =b \}
}{
\sum_{i} \mathbf{1}\{t_{i} =b \}
}
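\end{align}\] Sample mean of the outcome for units treated at \(t = b\), averaged over the post-period \(t = a, \ldots, T\) associated with treatment date \(t = a\); \(t_{i}\) denotes the treatment date of unit \(i\)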
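A small sketch of how this average could be computed from a panel. The data layout (nested dictionaries, periods labelled \(1, \ldots, T\)) and the function name `post_period_mean` are assumptions made for illustration.

```python
def post_period_mean(y, treat_date, a, b, T):
    """Average over t = a, ..., T of the period-t mean outcome among units with treatment date b.

    y[i][t] is the outcome of unit i in period t (periods labelled 1, ..., T);
    treat_date[i] is the treatment date t_i of unit i. Both layouts are assumptions.
    """
    group = [i for i, t_i in treat_date.items() if t_i == b]
    period_means = [sum(y[i][t] for i in group) / len(group) for t in range(a, T + 1)]
    return sum(period_means) / (T - (a - 1))


# Tiny hypothetical panel: units 1 and 2 are treated at t = 2, unit 3 at t = 3.
y = {
    1: {1: 0.0, 2: 2.0, 3: 4.0},
    2: {1: 2.0, 2: 4.0, 3: 6.0},
    3: {1: 1.0, 2: 1.0, 3: 5.0},
}
treat_date = {1: 2, 2: 2, 3: 3}
early_post = post_period_mean(y, treat_date, a=2, b=2, T=3)
```

For the early-treated group the period means are 3 (at \(t = 2\)) and 5 (at \(t = 3\)), so `early_post` equals 4.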
Bacon Decomposition
The OLS estimator of the model in Equation 5 is a weighted average of all possible two-by-two DD estimators
Forbidden comparison \(\hat{\beta}^{2 \times 2, \ell}_{k\ell}\): compares group \(\ell\) with group \(k\), where the control group consists of units that were already treated in an earlier period
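Why such a comparison is problematic can be seen in a stylised example with hypothetical group means. It assumes that \(k\) is the early-treated group and \(\ell\) the late-treated group, that there is no common time trend, and that the effect on the early group grows over time. The helper `dd_2x2` is illustrative.

```python
# Hypothetical group means by period (no common time trend, for clarity).
# Group k is treated from period 2 on, and its effect grows from 1 to 3.
# Group ell is treated from period 3 on, with an effect of 1.
y_k = {1: 0.0, 2: 1.0, 3: 3.0}
y_ell = {1: 0.0, 2: 0.0, 3: 1.0}


def dd_2x2(treated, control, pre, post):
    """Two-by-two DD: change in the treated group minus change in the control group."""
    return (treated[post] - treated[pre]) - (control[post] - control[pre])


# Clean comparison: early group k as treated, not-yet-treated group ell as control.
clean = dd_2x2(y_k, y_ell, pre=1, post=2)        # (1 - 0) - (0 - 0) = 1
# Forbidden comparison: late group ell as treated, already-treated group k as control.
forbidden = dd_2x2(y_ell, y_k, pre=2, post=3)    # (1 - 0) - (3 - 1) = -1
```

The forbidden comparison returns \(-1\) even though the effect on the late-treated group is \(+1\) in this example, because the change in the already-treated control group picks up its own growing treatment effect.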