Exercise Class 11
Exercise 1 - Scores and Estimators
Let \(W = (Y, D, X)\) be a tuple of data with outcome \(Y\), treatment \(D\) and covariates \(X\). We assume we have access to a dataset of \(n\) observations \(\{(Y_{i}, D_{i}, X_{i})\}_{i=1}^{n}\) (below we will simulate from a specific DGP). In the following, our estimand is the average treatment effect (ATE):
\[\begin{align*} \theta_{0} = E[Y(1) - Y(0)]. \end{align*}\]
Recall the scores from the lecture for the ATE estimand:
Regression adjustment score:
\[\begin{align*} m(W; \theta_{0}, \eta_{0}) = \ell_{0}(1, X) - \ell_{0}(0, X) - \theta_{0} \end{align*} \tag{1}\] where \(\ell_{0}(D, X) := E[Y \mid D, X]\) and \(\eta_{0} = \ell_{0}\).
Inverse propensity weighted (IPW) score:
\[\begin{align*} m(W; \theta_0, \eta_0) = \alpha_0(D, X)Y - \theta_0 \end{align*} \tag{2}\] where \[\begin{align*} \alpha_{0}(D, X) = \frac{D}{r_{0}(X)} - \frac{1 - D}{1 - r_{0}(X)}, \quad r_{0}(X) = E[D \mid X] = P(D = 1 \mid X), \end{align*}\] and \(\eta_{0} = \alpha_{0}\).
Doubly robust score:
\[\begin{align*} \psi(W; \theta_{0}, \eta_{0}) = \alpha_{0}(D, X)[Y - \ell_{0}(D, X)] + \ell_{0}(1, X) - \ell_{0}(0, X) - \theta_{0} \end{align*} \tag{3}\] where \(\ell_{0}(D, X) := D \ell_{0}(1, X) + (1 - D) \ell_{0}(0, X)\) and \(\eta_{0} = (\alpha_{0}, \ell_{0})\).
Optional: Read through “The basics of double/debiased machine learning” in the DoubleML package documentation as a refresher/another perspective.
Read the first part of the DoubleML documentation on scores. Convince yourself that the scores in Equation 1, Equation 2 and Equation 3 are linear in \(\theta\), such that they can be written in the form: \[\begin{align*} \psi(W; \theta, \eta) = \psi_{a}(W; \eta)\theta + \psi_{b}(W; \eta). \end{align*}\]
An estimator for our estimand/target parameter can then be constructed as: \[\begin{align*} \tilde{\theta}_{0} = -\frac{E_{n}[\psi_{b}(W; \eta)]}{E_{n}[\psi_{a}(W; \eta)]}, \end{align*}\] where we use the notation \(E_{n}[f(W)] := \frac{1}{n} \sum_{i=1}^{n} f(W_{i})\) for the average over the \(n\) observations of a function of random variable(s) \(W\).
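As a quick check of the linear form, take the doubly robust score in Equation 3 (a worked sketch; the other two scores follow the same pattern). Collecting the terms that multiply \(\theta\) gives
\[\begin{align*} \psi_{a}(W; \eta) &= -1, \\ \psi_{b}(W; \eta) &= \alpha(D, X)[Y - \ell(D, X)] + \ell(1, X) - \ell(0, X), \end{align*}\]
so that \(E_{n}[\psi_{a}(W; \eta)] = -1\) and the estimator reduces to the sample average \(\tilde{\theta}_{0} = E_{n}[\psi_{b}(W; \eta)]\).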
Write out an expression for a plug-in estimator for the three scores in Equation 1, Equation 2 and Equation 3.
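One possible way to turn the three scores into code is sketched below. It assumes the nuisance predictions are already available as NumPy arrays; the names `l1_hat`, `l0_hat` and `r_hat` (predictions of \(\ell_{0}(1, X)\), \(\ell_{0}(0, X)\) and \(r_{0}(X)\)) and the function names are illustrative choices, not taken from the lecture or the package.

```python
import numpy as np


def ate_regression_adjustment(l1_hat, l0_hat):
    """Plug-in estimator based on Equation 1."""
    return np.mean(l1_hat - l0_hat)


def ate_ipw(y, d, r_hat):
    """Plug-in estimator based on Equation 2."""
    alpha_hat = d / r_hat - (1 - d) / (1 - r_hat)
    return np.mean(alpha_hat * y)


def ate_doubly_robust(y, d, l1_hat, l0_hat, r_hat):
    """Plug-in estimator based on Equation 3."""
    alpha_hat = d / r_hat - (1 - d) / (1 - r_hat)
    # Prediction at the observed treatment status.
    l_hat = d * l1_hat + (1 - d) * l0_hat
    return np.mean(alpha_hat * (y - l_hat) + l1_hat - l0_hat)


# Tiny hand-made example (hypothetical numbers, chosen for easy checking).
y = np.array([3.0, 1.0, 4.0, 2.0])
d = np.array([1.0, 0.0, 1.0, 0.0])
l1_hat = np.array([3.0, 2.5, 4.0, 3.5])
l0_hat = np.array([1.5, 1.0, 2.5, 2.0])
r_hat = np.full(4, 0.5)

theta_ra = ate_regression_adjustment(l1_hat, l0_hat)
theta_ipw = ate_ipw(y, d, r_hat)
theta_dr = ate_doubly_robust(y, d, l1_hat, l0_hat, r_hat)
```

In this toy example the outcome predictions fit the observed outcomes exactly, so the correction term of the doubly robust estimator vanishes and it coincides with the regression adjustment estimator.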
Install the DoubleML Python package using `uv`.

Try applying your three estimators on a simulated dataset from the interactive regression model.
- Use the `make_irm_data` function to simulate a dataset.
- Set `theta=0.5, n_obs=1_000, dim_x=10`.
- You can choose any nuisance function estimators that you like. You could for instance pick the ones in the snippet below.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Learner for the outcome regression E[Y | D, X]
ml_g = RandomForestRegressor(
    n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2
)
# Learner for the propensity score P(D = 1 | X)
ml_m = RandomForestClassifier(
    n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2
)
```

Hint: You can set `return_type="np.ndarray"` in `make_irm_data` to directly get the data returned as a tuple of numpy arrays `(X, y, d)`.

Optional: Make a simulation study comparing the three estimators using the simulation function above. You can change the hyperparameters of the simulation function if you like.
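The structure of such a simulation study could look roughly like the sketch below. To stay self-contained it makes several simplifying assumptions that differ from the exercise: it uses a small hypothetical DGP instead of `make_irm_data`, fits the outcome regressions by OLS within each treatment arm, and plugs in the true propensity score instead of estimating it. All function names are illustrative.

```python
import numpy as np


def simulate(n, theta, rng):
    """Hypothetical DGP (not make_irm_data): one confounder drives D and Y."""
    x = rng.normal(size=(n, 3))
    r = 1 / (1 + np.exp(-0.5 * x[:, 0]))  # true propensity score
    d = rng.binomial(1, r).astype(float)
    y = theta * d + x[:, 0] + rng.normal(size=n)
    return x, y, d, r


def ols_predict(x_train, y_train, x_new):
    """Fit OLS with an intercept on the training data and predict on x_new."""
    z_train = np.column_stack([np.ones(len(x_train)), x_train])
    z_new = np.column_stack([np.ones(len(x_new)), x_new])
    beta, *_ = np.linalg.lstsq(z_train, y_train, rcond=None)
    return z_new @ beta


def three_estimates(x, y, d, r_hat):
    """Return the (regression adjustment, IPW, doubly robust) estimates."""
    l1_hat = ols_predict(x[d == 1], y[d == 1], x)
    l0_hat = ols_predict(x[d == 0], y[d == 0], x)
    alpha_hat = d / r_hat - (1 - d) / (1 - r_hat)
    l_hat = d * l1_hat + (1 - d) * l0_hat
    ra = np.mean(l1_hat - l0_hat)
    ipw = np.mean(alpha_hat * y)
    dr = np.mean(alpha_hat * (y - l_hat) + l1_hat - l0_hat)
    return ra, ipw, dr


rng = np.random.default_rng(0)
theta, n, n_rep = 0.5, 1_000, 200

results = np.empty((n_rep, 3))
for rep in range(n_rep):
    x, y, d, r = simulate(n, theta, rng)
    results[rep] = three_estimates(x, y, d, r_hat=r)

# Monte Carlo bias and standard deviation of each estimator.
bias = results.mean(axis=0) - theta
std = results.std(axis=0, ddof=1)
for name, b, s in zip(["RA", "IPW", "DR"], bias, std):
    print(f"{name:>3}: bias={b:+.3f}, std={s:.3f}")
```

For the exercise itself you would replace `simulate` with a call to `make_irm_data` and the OLS and oracle nuisances with the learners of your choice.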
Exercise 2 - DML Algorithm
Recall that, of the three scores introduced at the start, only the score in Equation 3 is Neyman orthogonal. Hence, we will now implement the DML algorithm using only that one.
Implement the DML estimator for the ATE (Steps 1 to 8 of the DML Algorithm for the ATE). That is, implement \(\hat{\theta}\) that solves
\[\begin{align*} \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_{k}} \psi(W_{i}; \hat{\theta}, \hat{\eta}_{-k}) = 0 \end{align*} \tag{4}\]
where \(\psi\) is the score in Equation 3.
Compute an estimate with your estimator using the same simulated dataset as in the previous exercise.
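A minimal sketch of the cross-fitting idea behind Equation 4 is given below. It is not the package implementation. It assumes OLS for the outcome regressions, a hand-written Newton-method logistic regression for the propensity score, clipping of the predicted propensities, and a small hypothetical DGP for the demonstration; all helper names are made up for illustration. Because \(\psi_{a} = -1\) for the score in Equation 3, solving Equation 4 amounts to averaging the out-of-fold values of \(\psi_{b}\).

```python
import numpy as np


def ols_fit_predict(x_train, y_train, x_new):
    """Fit OLS with an intercept and predict on x_new."""
    z_train = np.column_stack([np.ones(len(x_train)), x_train])
    z_new = np.column_stack([np.ones(len(x_new)), x_new])
    beta, *_ = np.linalg.lstsq(z_train, y_train, rcond=None)
    return z_new @ beta


def logistic_fit_predict(x_train, d_train, x_new, n_iter=25):
    """Logistic regression via Newton's method; returns P(D = 1 | X) on x_new."""
    z = np.column_stack([np.ones(len(x_train)), x_train])
    beta = np.zeros(z.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-z @ beta))
        hess = (z * (p * (1 - p))[:, None]).T @ z
        beta += np.linalg.solve(hess, z.T @ (d_train - p))
    z_new = np.column_stack([np.ones(len(x_new)), x_new])
    return 1 / (1 + np.exp(-z_new @ beta))


def dml_ate(x, y, d, n_folds=5, thresh=0.01, seed=0):
    """Cross-fitted doubly robust ATE estimate and out-of-fold psi_b values."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    psi_b = np.empty(n)
    for test_idx in folds:
        # Nuisances are fitted on all observations outside the current fold.
        train = np.setdiff1d(np.arange(n), test_idx)
        tr1, tr0 = train[d[train] == 1], train[d[train] == 0]
        l1_hat = ols_fit_predict(x[tr1], y[tr1], x[test_idx])
        l0_hat = ols_fit_predict(x[tr0], y[tr0], x[test_idx])
        r_hat = logistic_fit_predict(x[train], d[train], x[test_idx])
        r_hat = np.clip(r_hat, thresh, 1 - thresh)
        # Evaluate the score on the held-out fold.
        d_k, y_k = d[test_idx], y[test_idx]
        alpha_hat = d_k / r_hat - (1 - d_k) / (1 - r_hat)
        l_hat = d_k * l1_hat + (1 - d_k) * l0_hat
        psi_b[test_idx] = alpha_hat * (y_k - l_hat) + l1_hat - l0_hat
    return psi_b.mean(), psi_b


# Hypothetical DGP with true ATE 0.5 (not make_irm_data).
rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=(n, 3))
d = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x[:, 0]))).astype(float)
y = 0.5 * d + x[:, 0] + rng.normal(size=n)

theta_hat, psi_b = dml_ate(x, y, d)
print(f"theta_hat = {theta_hat:.3f}")
```

Returning the per-observation `psi_b` values alongside the estimate is a design choice that makes it easy to compute a standard error afterwards.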
Tips:
Finish your implementation of the DML Algorithm for the ATE by implementing the estimator for the covariance matrix. Using the estimated score from the previous exercise and your covariance estimator, compute the standard error and a 95% confidence interval for \(\theta_0\). Use the same dataset again.
Tip:
- Read section 8.1 of the DoubleML documentation
- Note that the estimator simplifies because of the score being linear in \(\theta\)
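For a score that is linear in \(\theta\), the variance estimator can be sketched as \(\hat{\sigma}^{2} = E_{n}[\psi(W; \hat{\theta}, \hat{\eta})^{2}] / E_{n}[\psi_{a}(W; \hat{\eta})]^{2}\), with standard error \(\sqrt{\hat{\sigma}^{2} / n}\). The function below is one way to code this under the assumption of a \(1/n\) scaling and a normal-approximation interval; the function name and the toy numbers are illustrative, and a different scaling convention would give slightly different values.

```python
from statistics import NormalDist

import numpy as np


def dml_se_and_ci(psi_b, psi_a=-1.0, level=0.95):
    """Estimate, standard error and confidence interval for a linear score."""
    psi_b = np.asarray(psi_b, dtype=float)
    n = psi_b.size
    # psi_a may be a scalar (as for the ATE scores) or one value per observation.
    psi_a = np.broadcast_to(np.asarray(psi_a, dtype=float), psi_b.shape)
    j_hat = psi_a.mean()
    theta_hat = -psi_b.mean() / j_hat
    # Score evaluated at the estimate.
    psi = psi_a * theta_hat + psi_b
    sigma2_hat = np.mean(psi**2) / j_hat**2
    se = np.sqrt(sigma2_hat / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return theta_hat, se, (theta_hat - z * se, theta_hat + z * se)


# Toy values of psi_b (hypothetical numbers, chosen for easy checking).
psi_b = np.array([1.0, 2.0, 3.0, 4.0])
theta_hat, se, ci = dml_se_and_ci(psi_b)
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}, ci = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With \(\psi_{a} = -1\), the centred score is simply \(\psi_{b} - \hat{\theta}\), so the standard error is the root mean squared deviation of the \(\psi_{b}\) values divided by \(\sqrt{n}\).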
Try testing your function against the DoubleML implementation, `DoubleMLIRM`. You should get the exact same estimate and standard error as the package. To do this, you should:
- Use `LinearRegression` and `LogisticRegression` as nuisance function estimators (to avoid randomness).
- Trim your propensity score (estimator of \(P(D = 1 \mid X)\)) using `np.clip(a=r_hat, a_min=thresh, a_max=1 - thresh)` with `thresh=0.01` (the default in the package); here `r_hat` is a vector of predicted probabilities.
- Split the data up front, use these splits in your own DML estimator and provide the same splits to the DoubleML package. Set `draw_sample_splitting=False` when constructing the `DoubleMLIRM` model object to indicate that you provide the splits yourself. You can use the snippet below as a starting point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
import doubleml as dml

ml_g = LinearRegression()
ml_m = LogisticRegression()

# `obj_dml_data` is the simulated dataset as a DoubleMLData object
dml_irm_obj = dml.DoubleMLIRM(
    obj_dml_data, ml_g, ml_m, draw_sample_splitting=False
)
# `splits` is a list of (train, test) index pairs created with `KFold`
dml_irm_obj.set_sample_splitting(splits)

# TODO: fit the DoubleMLIRM model and your implemented estimator

# Assuming `est`, `se` are your estimate and standard error
# and `est_pkg`, `se_pkg` are the estimate and standard error
# obtained from the package.
# Assert that our estimates match the package
assert np.allclose(est, est_pkg)
assert np.allclose(se, se_pkg)
```