Exercise Class 11

Author

Jonas Skjold Raaschou-Pedersen

Published

November 20, 2025

Exercise 1 - Scores and Estimators

Let \(W = (Y, D, X)\) be a tuple of data with outcome \(Y\), treatment \(D\) and covariates \(X\). We assume we have access to a dataset of \(n\) observations \(\{(Y_{i}, D_{i}, X_{i})\}_{i=1}^{n}\) (below we will simulate from a specific DGP). In the following, our estimand is the average treatment effect (ATE):

\[\begin{align*} \theta_{0} = E[Y(1) - Y(0)]. \end{align*}\]

Recall the scores from the lecture for the ATE estimand:

Regression adjustment score:

\[\begin{align*} m(W; \theta_{0}, \eta_{0}) = \ell_{0}(1, X) - \ell_{0}(0, X) - \theta_{0} \end{align*} \tag{1}\] where \(\ell_{0}(D, X) := E[Y \mid D, X]\).
Inverse propensity weighted (IPW) score:

\[\begin{align*} m(W; \theta_0, \eta_0) = \alpha_0(D, X)Y - \theta_0 \end{align*} \tag{2}\] where \[\begin{align*} \alpha_{0}(D, X) = \frac{D}{r_{0}(X)} - \frac{1 - D}{1 - r_{0}(X)}, \quad r_{0}(X) = E[D \mid X] = P(D = 1 \mid X), \end{align*}\] and \(\eta_{0} = \alpha_{0}\).
Doubly robust score:

\[\begin{align*} \psi(W; \theta_{0}, \eta_{0}) = \alpha_{0}(D, X)[Y - \ell_{0}(D, X)] + \ell_{0}(1, X) - \ell_{0}(0, X) - \theta_{0} \end{align*} \tag{3}\] where \(\ell_{0}(D, X) := D \ell_{0}(1, X) + (1 - D) \ell_{0}(0, X)\) and \(\eta_{0} = (\alpha_{0}, \ell_{0})\).

Optional: Read through “The basics of double/debiased machine learning” in the DoubleML package documentation as a refresher/another perspective.
Read the first part of the DoubleML documentation on scores. Convince yourself that the scores in Equation 1, Equation 2 and Equation 3 are linear in \(\theta\), so that they can be written in the form: \[\begin{align*} \psi(W; \theta, \eta) = \psi_{a}(W; \eta)\theta + \psi_{b}(W; \eta). \end{align*}\]

An estimator for our estimand/target parameter can then be constructed as: \[\begin{align*} \tilde{\theta}_{0} = -\frac{E_{n}[\psi_{b}(W; \eta)]}{E_{n}[\psi_{a}(W; \eta)]}, \end{align*}\] where we use the notation \(E_{n}[f(W)] := \frac{1}{n} \sum_{i=1}^{n} f(W_{i})\) for the average over the \(n\) observations of a function of random variable(s) \(W\).
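As a small illustration of the formula above, the sketch below computes the estimate from per-observation values of \(\psi_{a}\) and \(\psi_{b}\). The helper name `linear_score_estimate` and the toy numbers are assumptions made for illustration only; they come neither from the lecture nor from DoubleML.

```python
import numpy as np


def linear_score_estimate(psi_a, psi_b):
    """Solve E_n[psi_a] * theta + E_n[psi_b] = 0 for theta."""
    psi_a = np.asarray(psi_a, dtype=float)
    psi_b = np.asarray(psi_b, dtype=float)
    return -psi_b.mean() / psi_a.mean()


# Toy values (made up): when psi_a equals -1 for every observation,
# the estimate reduces to the sample average of psi_b.
psi_a = -np.ones(4)
psi_b = np.array([0.2, 0.7, 0.4, 0.7])
theta_tilde = linear_score_estimate(psi_a, psi_b)
print(theta_tilde)
```

Writing the estimator in terms of \(\psi_{a}\) and \(\psi_{b}\) means the same helper can be reused for any score that is linear in \(\theta\).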
Write out an expression for a plug-in estimator for the three scores in Equation 1, Equation 2 and Equation 3.
Install the DoubleML Python package using uv.
Try applying your three estimators on a simulated dataset from the interactive regression model.
- Use the make_irm_data function to simulate a dataset.
- Set theta=0.5, n_obs=1_000, dim_x=10.
- You can choose any nuisance function estimators that you like. You could for instance pick the ones in the snippet below.
```
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


ml_g = RandomForestRegressor(
    n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2
)
ml_m = RandomForestClassifier(
    n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2
)
```
Hint: You can set return_type="np.ndarray" in make_irm_data to directly get the data returned as a tuple of numpy arrays (X, y, d).
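To check your expressions before plugging in fitted learners, one option is to evaluate the three scores with the true nuisance functions on a DGP where they are known. The sketch below does this with NumPy only. The DGP is a hypothetical stand-in and is not the one produced by make_irm_data, and the oracle values `l1`, `l0` and `r` would be replaced by predictions from `ml_g` and `ml_m` in the actual exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
theta_true = 0.5  # hypothetical true ATE

# Hypothetical DGP with known nuisance functions (not make_irm_data).
x = rng.uniform(size=n)
r = 0.3 + 0.4 * x                      # propensity P(D = 1 | X), in [0.3, 0.7]
d = rng.binomial(1, r)
y = theta_true * d + x + rng.normal(size=n)

# Oracle nuisance values; in the exercise these would be fitted predictions.
l1 = theta_true + x                    # E[Y | D = 1, X]
l0 = x                                 # E[Y | D = 0, X]
l_d = d * l1 + (1 - d) * l0            # E[Y | D, X] at the observed D
alpha = d / r - (1 - d) / (1 - r)      # IPW weights

# For all three scores psi_a = -1, so each estimate is the mean of psi_b.
theta_ra = np.mean(l1 - l0)                        # Equation 1
theta_ipw = np.mean(alpha * y)                     # Equation 2
theta_dr = np.mean(alpha * (y - l_d) + l1 - l0)    # Equation 3
print(theta_ra, theta_ipw, theta_dr)
```

With oracle nuisances all three estimates should land close to the true ATE. Differences between the estimators become visible once the nuisances are estimated.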
Optional: Make a simulation study comparing the three estimators using the simulation function above. You can change the hyperparameters of the simulation function if you like.

Exercise 2 - DML Algorithm

Recall that of the three scores introduced at the start, only the score in Equation 3 is Neyman orthogonal. Hence, we will now implement the DML algorithm using only that score.

Implement the DML estimator for the ATE (Steps 1 to 8 of the DML Algorithm for the ATE). That is, implement \(\hat{\theta}\) that solves

\[\begin{align} \label{eq:dml} \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_{k}} \psi(W_{i}; \hat{\theta}, \hat{\eta}_{-k}) = 0 \end{align} \tag{4}\]

where \(\psi\) is the score in Equation 3.

Compute an estimate with your estimator using the same simulated dataset as in the previous exercise.
Tips:
- For splitting the data into \(K\) folds you can use this function from sklearn
- For an animation of the cross-fitting procedure check out this video from the DoubleML documentation
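The structure of the cross-fitting loop can be sketched as follows. This is a minimal illustration under simplifying assumptions, not a reference solution: ordinary least squares and a clipped linear probability model stand in for the machine learning nuisance estimators, `np.array_split` on a random permutation stands in for sklearn's fold splitter, and the helper names `fit_ols`, `predict_ols` and `dml_ate` as well as the toy DGP are hypothetical.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares coefficients for y ~ 1 + x."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, x):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def dml_ate(x, y, d, n_folds=5, thresh=0.01, seed=0):
    """Cross-fitted doubly robust ATE estimate (illustrative sketch)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    psi_b = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        x_tr, y_tr, d_tr = x[train], y[train], d[train]
        # Nuisance functions are fitted on the other folds only ...
        coef1 = fit_ols(x_tr[d_tr == 1], y_tr[d_tr == 1])
        coef0 = fit_ols(x_tr[d_tr == 0], y_tr[d_tr == 0])
        coef_r = fit_ols(x_tr, d_tr)  # linear probability model stand-in
        # ... and evaluated on the held-out fold.
        x_te, y_te, d_te = x[test_idx], y[test_idx], d[test_idx]
        l1 = predict_ols(coef1, x_te)
        l0 = predict_ols(coef0, x_te)
        r = np.clip(predict_ols(coef_r, x_te), thresh, 1 - thresh)
        alpha = d_te / r - (1 - d_te) / (1 - r)
        l_d = d_te * l1 + (1 - d_te) * l0
        psi_b[test_idx] = alpha * (y_te - l_d) + l1 - l0
    # psi_a = -1, so the moment equation is solved by the mean of psi_b.
    return psi_b.mean(), psi_b


# Toy data (hypothetical DGP with true ATE 0.5).
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-1, 1, size=(n, 3))
d = rng.binomial(1, 0.5 + 0.2 * x[:, 0])
y = 0.5 * d + x @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
theta_hat, psi_b = dml_ate(x, y, d)
print(theta_hat)
```

Returning the per-observation values `psi_b` alongside the estimate is a design choice that makes it easy to compute a standard error afterwards.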
Finish your implementation of the DML Algorithm for the ATE by implementing the estimator for the covariance matrix (a scalar variance here, since \(\theta_0\) is one-dimensional). Using the estimated scores from the previous exercise and your covariance estimator, compute the standard error and a 95% confidence interval for \(\theta_0\). Use the same dataset again.
Tips:
- Read section 8.1 of the DoubleML documentation
- Note that the estimator simplifies because of the score being linear in \(\theta\)
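For a scalar parameter and a score that is linear in \(\theta\), the variance estimator is commonly written as \(\hat{\sigma}^{2} = E_{n}[\psi^{2}] / E_{n}[\psi_{a}]^{2}\) with standard error \(\sqrt{\hat{\sigma}^{2} / n}\). The sketch below implements that form. It rests on the assumption that this matches the estimator described in the documentation, so check it against section 8.1. The helper name `dml_se_and_ci` and the toy inputs are made up for illustration.

```python
from statistics import NormalDist

import numpy as np


def dml_se_and_ci(psi_a, psi_b, level=0.95):
    """Estimate, standard error and confidence interval for a linear score."""
    psi_a = np.asarray(psi_a, dtype=float)
    psi_b = np.asarray(psi_b, dtype=float)
    n = len(psi_b)
    theta = -psi_b.mean() / psi_a.mean()
    psi = psi_a * theta + psi_b          # score evaluated at the estimate
    j = psi_a.mean()                     # derivative of the moment in theta
    sigma2 = np.mean(psi**2) / j**2      # asymptotic variance estimate
    se = np.sqrt(sigma2 / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal quantile, about 1.96
    return theta, se, (theta - z * se, theta + z * se)


# Toy inputs (made up): psi_a = -1 for every observation.
theta, se, ci = dml_se_and_ci(-np.ones(4), np.array([1.0, 2.0, 3.0, 4.0]))
print(theta, se, ci)
```

In practice `psi_b` would be the cross-fitted score values from your DML implementation rather than four hand-picked numbers.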
Try testing your function against the DoubleML implementation, DoubleMLIRM. You should get the exact same estimate and standard error as the package. To do this, you should:
- Use LinearRegression and LogisticRegression as nuisance function estimators (to avoid randomness)
- Trim your propensity score (estimator of \(P(D = 1 \mid X)\)) using np.clip(a=r_hat, a_min=thresh, a_max=1 - thresh) with thresh=0.01 (the default in the package); here r_hat is a vector of predicted probabilities.
- Split the data up front, use the same splits in your own DML estimator, and provide them to the DoubleML package. Set draw_sample_splitting=False when constructing the DoubleMLIRM model object to indicate that you provide the splits yourself. You can use the snippet below as a starting point.
```
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
import doubleml as dml


ml_g = LinearRegression()
ml_m = LogisticRegression()

dml_irm_obj = dml.DoubleMLIRM(
    obj_dml_data, ml_g, ml_m, draw_sample_splitting=False
)
# `splits` is a list of (train, test) index tuples, e.g. from `KFold`
dml_irm_obj.set_sample_splitting(splits)

# TODO: fit the DoubleMLIRM model and your implemented estimator

# Assuming `est`, `se` are your estimate and standard error
# and `est_pkg`, `se_pkg` are the estimate and standard error
# from the package.
# Assert that our estimates match the package
assert np.allclose(est, est_pkg)
assert np.allclose(se, se_pkg)
```
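The trimming step mentioned above can be illustrated in isolation. The predicted probabilities in this sketch are made-up numbers; it only shows how clipping bounds the inverse propensity weights that enter \(\alpha_{0}\).

```python
import numpy as np

thresh = 0.01
r_hat = np.array([0.001, 0.25, 0.5, 0.999])  # made-up predicted probabilities
r_trim = np.clip(a=r_hat, a_min=thresh, a_max=1 - thresh)

# Largest weights 1 / r and 1 / (1 - r) before and after trimming.
w_raw = max((1 / r_hat).max(), (1 / (1 - r_hat)).max())
w_trim = max((1 / r_trim).max(), (1 / (1 - r_trim)).max())
print(r_trim, w_raw, w_trim)
```

Without trimming, a single observation with a propensity estimate near 0 or 1 can dominate the score average, which is one reason your estimate may differ from the package if you skip this step.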