Exercise Class 2
If you haven’t solved all the coding exercises from last week, please do so before attempting this week’s. Nevertheless, in the following we take this repository as our starting point. You can compare your own solutions from last week with it if you like.
Exercise 0 - Testing
To verify that your functions are correct we will try something new this week. We will install the pytest module with uv as a development dependency.
Run

```
uv add pytest --dev
```

in your shell.
Verify that the following field is now present in your `pyproject.toml` file:

```toml
[dependency-groups]
dev = [
    "pytest>=8.4.2",
]
```

Make a directory called `tests` in your project directory (the directory where your `pyproject.toml` file is located). Add a `__init__.py` file inside the `tests` directory.

Add a file called `test_test.py` inside the `tests` directory. Paste the following function inside that file:

```python
def test_test():
    assert 1 > 0
```

Run:
```
pytest
```

You should get output similar to:

```
================================= test session starts =================================
platform linux -- Python 3.13.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/jsr-p/ai-4-humanity/exercises
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 1 item

tests/test_test.py .                                                            [100%]

================================== 1 passed in 0.02s ==================================
```

If so, great: you are now ready to bulletproof your code with tests. We won’t go into details, but browse around the docs of pytest if you think tests are interesting.
Our use case for pytest is only for you to verify that your functions are correct while solving the exercises. See the next exercise.
Download the tests for this week’s exercises from github here. You should place the files

```
tests/test_tree.py
tests/test_network.py
```

in your own `tests` directory of your project.
Exercises Tree-based methods
Exercise 1 - Impurity functions
Add `scikit-learn` as a project dependency using `uv add`.

Make a file called `src/ai4h/models/tree.py`, i.e. a new module for functions related to tree-based methods.
All the following functions should be implemented inside your new `tree.py` module.
Implement a function named `cprobs` that, for a given input vector of numeric class labels, computes the relative counts. This function might be of interest.

It should have the following signature:

```python
from numpy.typing import NDArray


def cprobs(y: NDArray[np.integer]): ...
```

Test it with:

```
pytest tests/test_tree.py -k test_cprobs
```

Implement the `gini` impurity function from this slide.

It should have the following signature:

```python
def gini(y: NDArray[np.integer]):
    """gini impurity function"""
    ...
```

Compared to the definition in the slides, the function should take the vector of outcome labels, `y`, as input and not the conditional probabilities. Use the `cprobs` function to compute the conditional probabilities inside your implementation.

Test it with:

```
pytest tests/test_tree.py -k test_gini
```
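For reference, here is a minimal sketch of what the two functions could look like; the use of `np.unique` is just one option (it may or may not be the helper the exercise links to):

```python
import numpy as np
from numpy.typing import NDArray


def cprobs(y: NDArray[np.integer]) -> NDArray[np.floating]:
    """Relative frequency of each class label in y (a sketch)."""
    _, counts = np.unique(y, return_counts=True)
    return counts / y.shape[0]


def gini(y: NDArray[np.integer]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions (a sketch)."""
    p = cprobs(y)
    return float(1.0 - np.sum(p**2))
```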
Exercise 2 - Goodness of split and best split
Implement a function called `goodness_split` with the following signature:

```python
def goodness_split(
    y_left: NDArray,
    y_right: NDArray,
    imp_parent: float,
    impurity_fn: Callable[[NDArray], float] = gini,
) -> float: ...
```

This corresponds to the one defined in this slide.

Note: The type hint `Callable[[NDArray], float] = gini` for the `impurity_fn` argument means that `impurity_fn` is a function taking a numpy array (with any values) and returning a single `float`, and that its default value is the `gini` function defined in the previous exercise. See also here.

Test it with:

```
pytest tests/test_tree.py -k test_goodness_split
```

The next function we implement is the one that will find the best split using our `goodness_split` function:

```python
def best_split(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    min_samples_leaf: int = 1,
) -> tuple[int, float, float]: ...
```

Here is some pseudocode for what the function should do:
1. Let `n, d` = number of samples and number of features from shape of `X`.
2. If `n < 2 * min_samples_leaf`:
return `(-1, NaN, 0.0)` (not enough samples to split).
3. Compute parent impurity = `impurity_fn(y)`.
4. Initialize:
`best_feature = -1`
`best_threshold = NaN`
`best_gain = 0.0`
5. For each feature index `j` from `0` to `d-1`:
a. Extract column `xj = X[:, j]`.
b. Compute sorted order of `xj`, keeping stability.
c. Create `x_sorted` and `y_sorted` accordingly.
d. For each split index `i` from `min_samples_leaf-1` to `n - min_samples_leaf - 1`:
i. If `x_sorted[i] == x_sorted[i+1]`, skip (no threshold between equal values).
      ii. Compute `gain = goodness_split(y_sorted[0:i+1], y_sorted[i+1:], imp_parent, impurity_fn)`, i.e. using the parent impurity.
iii. If `gain > best_gain`:
– Set `best_threshold = average of x_sorted[i] and x_sorted[i+1]` (i.e. midpoint between the two feature values)
– Set `best_gain = gain`.
– Set `best_feature = j`.
6. Return `(best_feature, best_threshold, best_gain)`.
Use `np.argsort` with `kind="stable"` for sorting.
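As a reference point, here is a minimal sketch of both functions following the slide definition and the pseudocode above; it is one possible implementation and not necessarily the exact one the tests were written against (it assumes the `gini` function from Exercise 1 is in scope):

```python
from typing import Callable

import numpy as np
from numpy.typing import NDArray


def goodness_split(
    y_left: NDArray,
    y_right: NDArray,
    imp_parent: float,
    impurity_fn: Callable[[NDArray], float] = gini,
) -> float:
    """Impurity reduction: parent impurity minus the weighted
    average impurity of the two children (a sketch)."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return imp_parent - (
        n_left / n * impurity_fn(y_left) + n_right / n * impurity_fn(y_right)
    )


def best_split(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    min_samples_leaf: int = 1,
) -> tuple[int, float, float]:
    """Exhaustive search over features and thresholds (a sketch)."""
    n, d = X.shape
    if n < 2 * min_samples_leaf:
        return -1, float("nan"), 0.0
    imp_parent = impurity_fn(y)
    best_feature, best_threshold, best_gain = -1, float("nan"), 0.0
    for j in range(d):
        xj = X[:, j]
        order = np.argsort(xj, kind="stable")
        x_sorted, y_sorted = xj[order], y[order]
        for i in range(min_samples_leaf - 1, n - min_samples_leaf):
            if x_sorted[i] == x_sorted[i + 1]:
                continue  # no threshold between equal feature values
            gain = goodness_split(
                y_sorted[: i + 1], y_sorted[i + 1 :], imp_parent, impurity_fn
            )
            if gain > best_gain:
                best_feature = j
                best_threshold = float((x_sorted[i] + x_sorted[i + 1]) / 2)
                best_gain = gain
    return best_feature, best_threshold, best_gain
```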
Exercise 3 - Building the tree
Copy the following code and insert it into your `tree.py` file:

```python
from dataclasses import dataclass


def compute_pred(y: NDArray) -> float:
    """computes predicted values.

    Guesses whether classification or regression tree by inspecting
    dtype of outcome variable.
    """
    if np.issubdtype(y.dtype, np.integer):
        return int(np.argmax(np.bincount(y)))
    return float(y.mean())


@dataclass
class TerminalNode:
    prediction: float
    impurity: float
    depth: int
    n_samples: int

    def predict_one(self, _):
        return self.prediction

    def predict(self, X):
        return np.full(len(X), self.prediction)

    def pretty(self, indent=""):
        return f"{indent}Leaf(pred={self.prediction:.2f}, n={self.n_samples})\n"


@dataclass
class NonterminalNode:
    feature: int
    threshold: float
    left: "Node"
    right: "Node"
    impurity: float
    depth: int
    n_samples: int

    def predict_one(self, x):
        return (
            self.left.predict_one(x)
            if x[self.feature] <= self.threshold
            else self.right.predict_one(x)
        )

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

    def pretty(self, indent=""):
        out = (
            f"{indent}Node(f{self.feature}<={self.threshold:.2f}, n={self.n_samples})\n"
        )
        return out + self.left.pretty(indent + "  ") + self.right.pretty(indent + "  ")


Node = TerminalNode | NonterminalNode
```

Convince yourself that the `predict` methods for the two node classes make sense, e.g. what happens if we input `X = np.array([[6.1, 2.6, 5.6, 1.4]])` (and we assume the fitted tree is as in the lecture)?

Write a function called `build_tree` that builds the tree using `best_split` and the `TerminalNode` and `NonterminalNode` classes above.

It should have the following signature:

```python
def build_tree(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    max_depth: int = 32,
    min_samples_leaf: int = 1,
    depth: int = 0,
) -> Node: ...
```

Note that we have three arguments besides the input data and the impurity function:

- `max_depth`: parameter to set a max depth that the tree can grow to
- `min_samples_leaf`: minimum number of samples in terminal nodes before stopping
- `depth`: the depth of the current node in the tree
`build_tree` should build the tree recursively as follows:

- If `depth >= max_depth` or `len(y) <= 2 * min_samples_leaf` or `best_split` gives no valid/beneficial split:
  - Return a `TerminalNode`
  - The `TerminalNode` should compute the prediction using the `compute_pred` function given above
- Otherwise:
  - Split on the best feature/threshold and return a `NonterminalNode` with its `left` and `right` values being `Node`s returned from a recursive call to `build_tree` on the two subsamples based on the found split
  - The conditions above ensure the function terminates.
This is the recursive binary splitting procedure.
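As a point of reference, here is a minimal sketch of one possible `build_tree` that follows the recursion above; it assumes the `best_split`, `compute_pred`, and node definitions from earlier in the module, and your own version may of course differ:

```python
def build_tree(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    max_depth: int = 32,
    min_samples_leaf: int = 1,
    depth: int = 0,
) -> Node:
    """Recursive binary splitting (a sketch)."""
    n = len(y)
    imp = impurity_fn(y)
    feature, threshold, gain = best_split(
        X, y, impurity_fn=impurity_fn, min_samples_leaf=min_samples_leaf
    )
    # stop if the tree is deep enough, the node is too small, or no split helps
    if depth >= max_depth or n <= 2 * min_samples_leaf or gain <= 0.0:
        return TerminalNode(
            prediction=compute_pred(y), impurity=imp, depth=depth, n_samples=n
        )
    mask = X[:, feature] <= threshold
    kwargs = dict(
        impurity_fn=impurity_fn,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        depth=depth + 1,
    )
    return NonterminalNode(
        feature=feature,
        threshold=threshold,
        left=build_tree(X[mask], y[mask], **kwargs),
        right=build_tree(X[~mask], y[~mask], **kwargs),
        impurity=imp,
        depth=depth,
        n_samples=n,
    )
```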
The function returns a `Node`, which can be either a terminal or a nonterminal node, cf. the definition `Node = TerminalNode | NonterminalNode`.

Test it with:

```
pytest tests/test_tree.py -k test_build_tree
```

Build a tree as follows:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2025, train_size=0.8
)
tree = build_tree(X_train, y_train, impurity_fn=gini, min_samples_leaf=5)
```

This should build the tree as shown in the lecture.
If you print out the `tree` variable you should get:

```
NonterminalNode(feature=2, threshold=np.float64(2.45), left=TerminalNode(prediction=0, impurity=0.0, depth=1, n_samples=36), right=NonterminalNode(feature=3, threshold=np.float64(1.75), left=NonterminalNode(feature=2, threshold=np.float64(4.85), left=NonterminalNode(feature=0, threshold=np.float64(5.15), left=TerminalNode(prediction=1, impurity=0.32, depth=4, n_samples=5), right=TerminalNode(prediction=1, impurity=0.0, depth=4, n_samples=38), impurity=0.045429962141698255, depth=3, n_samples=43), right=TerminalNode(prediction=1, impurity=0.5, depth=3, n_samples=6), impurity=0.14993752603082044, depth=2, n_samples=49), right=TerminalNode(prediction=2, impurity=0.0, depth=2, n_samples=35), impurity=0.49744897959183676, depth=1, n_samples=84), impurity=0.66375, depth=0, n_samples=120)
```

Compare it to the figure from the lecture.
You can format it manually to show the tree structure as follows:

```
NonterminalNode(
    feature=2,
    threshold=np.float64(2.45),
    left=TerminalNode(prediction=0, impurity=0.0, depth=1, n_samples=36),
    right=NonterminalNode(
        feature=3,
        threshold=np.float64(1.75),
        left=NonterminalNode(
            feature=2,
            threshold=np.float64(4.85),
            left=NonterminalNode(
                feature=0,
                threshold=np.float64(5.15),
                left=TerminalNode(
                    prediction=1, impurity=0.32, depth=4, n_samples=5
                ),
                right=TerminalNode(
                    prediction=1, impurity=0.0, depth=4, n_samples=38
                ),
                impurity=0.045429962141698255,
                depth=3,
                n_samples=43
            ),
            right=TerminalNode(
                prediction=1, impurity=0.5, depth=3, n_samples=6
            ),
            impurity=0.14993752603082044,
            depth=2,
            n_samples=49
        ),
        right=TerminalNode(
            prediction=2, impurity=0.0, depth=2, n_samples=35
        ),
        impurity=0.49744897959183676,
        depth=1,
        n_samples=84
    ),
    impurity=0.66375,
    depth=0,
    n_samples=120
)
```

A beautiful tree grown by you!
You can also use the `pretty` method:

```python
print(tree.pretty())
```

```
NonTerminalNode(f2<=2.45, n=120)
  Terminal(pred=0.00, n=36)
  NonTerminalNode(f3<=1.75, n=84)
    NonTerminalNode(f2<=4.85, n=49)
      NonTerminalNode(f0<=5.15, n=43)
        Terminal(pred=1.00, n=5)
        Terminal(pred=1.00, n=38)
      Terminal(pred=1.00, n=6)
    Terminal(pred=2.00, n=35)
```
Optional: Exercise O1 - Regression Tree
Make an `mse` impurity function to be used instead of `gini` for regression data (i.e. a continuous outcome variable).

If you already have a working `build_tree` function then you are done. All you need is a dataset to apply it to and to set `impurity_fn=mse`. Try running the test:

```
pytest tests/test_tree.py -k test_regression_tree
```

It estimates a regression tree on the `diabetes` dataset of sklearn. Play around with the code in the test if you like.
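A minimal sketch of such a function, using the within-node mean squared deviation from the node mean (check this against the definition used in the lecture):

```python
def mse(y: NDArray) -> float:
    """Mean squared deviation from the node mean, i.e. an impurity
    function for regression trees (a sketch)."""
    if len(y) == 0:
        return 0.0
    return float(np.mean((y - y.mean()) ** 2))
```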
Optional: Exercise O2 - Random Forest
Make a `DecisionTreeClassifier` class in your `tree.py` file that follows sklearn’s model API; it should:

- have a `fit` method
- have a `predict` method
- take the `max_depth` and `min_samples_leaf` parameters as input in the constructor, and also the impurity function
Make a module `src/ai4h/models/ensemble.py`.

Create a `RandomForestClassifier` inside `ensemble.py`. Try to make it follow sklearn’s API with a minimal set of relevant parameters; see the source for help (if you like).

Test it on the `iris` data and compare it.

Note: No tests are given for this exercise.
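For inspiration, here is a minimal sketch of what the two classes could look like. The class names mirror sklearn’s, but the parameter choices, the bootstrap sampling, the majority vote via `np.bincount`, and the import path are all assumptions rather than a prescribed solution:

```python
import numpy as np
from numpy.typing import NDArray

# assumes build_tree and gini from your tree.py module
from ai4h.models.tree import build_tree, gini


class DecisionTreeClassifier:
    """Thin wrapper around build_tree with a fit/predict API (a sketch)."""

    def __init__(self, *, max_depth=32, min_samples_leaf=1, impurity_fn=gini):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.impurity_fn = impurity_fn
        self.tree_ = None

    def fit(self, X: NDArray, y: NDArray):
        self.tree_ = build_tree(
            X,
            y,
            impurity_fn=self.impurity_fn,
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
        )
        return self

    def predict(self, X: NDArray) -> NDArray:
        return self.tree_.predict(X)


class RandomForestClassifier:
    """Bagged ensemble of decision trees with majority voting (a sketch)."""

    def __init__(
        self, *, n_estimators=100, max_depth=32, min_samples_leaf=1, rng=None
    ):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.rng = rng or np.random.default_rng()
        self.trees_: list[DecisionTreeClassifier] = []

    def fit(self, X: NDArray, y: NDArray):
        n = len(y)
        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = self.rng.integers(0, n, size=n)  # bootstrap sample
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth, min_samples_leaf=self.min_samples_leaf
            )
            self.trees_.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X: NDArray) -> NDArray:
        # majority vote over the individual trees' predictions
        preds = np.stack([t.predict(X) for t in self.trees_]).astype(int)
        return np.array([np.bincount(col).argmax() for col in preds.T])
```

Note that a full random forest would also sample a random subset of features at each split; the sketch above only bootstraps the rows.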
Exercises Neural Networks
Exercise 4 - Simplest Neural Network from the lecture
Make a module `src/ai4h/models/network.py`.

Copy the following helpers to your module:

```python
from pathlib import Path

import numpy as np
from numpy.typing import NDArray
from sklearn.datasets import fetch_openml

fp_proj = Path(__file__).parents[2]
fp_data = fp_proj.joinpath("data")
fp_data.mkdir(exist_ok=True)
mnist_file = fp_data / "mnist.npz"


def onehot(y: NDArray[np.uint8], n_classes: int = 10) -> NDArray[np.uint8]:
    return np.eye(n_classes, dtype=np.uint8)[y]


def prepare_nielsen(
    X: NDArray[np.int64],
    y: NDArray[np.uint8],
    y_transform=lambda x: x.reshape(10, 1),
    one_hot: bool = True,
) -> list[tuple[NDArray, NDArray]]:
    """Massages data into the shape MN's code expects."""
    return [
        (x.reshape(784, 1) / 255, y_transform(v))
        for x, v in zip(X, onehot(y) if one_hot else y)
    ]


def download_mnist():
    if mnist_file.exists():
        print("Mnist data already exist in data folder")
        return
    mnist = fetch_openml("mnist_784", as_frame=False)
    X = mnist["data"].astype(np.uint8)  # type: ignore
    y = mnist["target"].astype(np.uint8)  # type: ignore
    np.savez_compressed(mnist_file, X=X, y=y)
    print(f"Mnist data saved to {fp_data}")


def load_mnist() -> tuple[NDArray[np.int64], NDArray[np.uint8]]:
    if not mnist_file.exists():
        raise ValueError("Mnist data hasn't been downloaded")
    data = np.load(fp_data / "mnist.npz")
    X, y = data["X"], data["y"]
    return X, y
```

Copy the following function to your module:
```python
from numpy.typing import NDArray
import numpy as np


def softmax(z: NDArray, axis: int = 1) -> NDArray:
    z_shift = z - z.max(axis=axis, keepdims=True)  # numerical stability
    expz = np.exp(z_shift)
    return expz / expz.sum(axis=axis, keepdims=True)
```

Read the first part of this section on wiki and explain the function to yourself.
Implement a function named `cross_entropy` that computes the cross-entropy loss as shown here. You should add `eps = 1e-9` to the predicted values before taking the log for numerical stability.

Test it with:
```
pytest tests/test_network.py -k test_cross_entropy
```
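A minimal sketch of what it could look like; the argument names and the fact that it operates on a single pair of vectors (rather than a batch) are assumptions to check against the test:

```python
def cross_entropy(y: NDArray, y_hat: NDArray) -> float:
    """Cross-entropy loss -sum_k y_k * log(y_hat_k) for one example (a sketch)."""
    eps = 1e-9  # avoid log(0)
    return float(-np.sum(y * np.log(y_hat + eps)))
```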
Copy the following into your module:

```python
class BasicNetwork:
    """Basic network with 1 hidden layer (30 units)."""

    def __init__(self, rng: np.random.Generator | None = None):
        rng = rng or np.random.default_rng()
        self.W_1 = rng.normal(0, 0.1, size=(_, _))
        self.b_1 = np.zeros((_, _))
        self.W_2 = rng.normal(0, 0.1, size=(_, _))
        self.b_2 = np.zeros((_, _))
```

Replace the placeholders `_` with the correct dimensions, cf. the example of the slides here.
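As a sanity check on the shapes: with MNIST inputs reshaped to column vectors of length 784 (see `prepare_nielsen`), a hidden layer of 30 units, and 10 output classes, one consistent fill of the placeholders would be the following; verify it against the slides:

```python
# assumed dimensions: 784 inputs, 30 hidden units, 10 output classes
self.W_1 = rng.normal(0, 0.1, size=(30, 784))  # hidden layer weights
self.b_1 = np.zeros((30, 1))                   # hidden layer bias
self.W_2 = rng.normal(0, 0.1, size=(10, 30))   # output layer weights
self.b_2 = np.zeros((10, 1))                   # output layer bias
```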
Implement a method `train_step` that does a forward pass and then does backpropagation. It should have the following signature:

```python
def train_step(
    self,
    x: NDArray,
    y: NDArray,
) -> tuple[float, NDArray, NDArray, NDArray, NDArray]:
```

It returns the loss and gradients for a single input example `x, y`, i.e. its return statement should be `return loss, dW1, db1, dW2, db2`. Get guidance from the lecture.

You can use the following helper functions:
```python
def sigmoid(z: NDArray):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z: NDArray):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))
```
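For orientation, here is a sketch of one way the method could look for a 784-30-10 network with a sigmoid hidden layer, softmax output, and cross-entropy loss; the intermediate variable names are made up here, so check the details against the lecture and the tests:

```python
def train_step(
    self,
    x: NDArray,
    y: NDArray,
) -> tuple[float, NDArray, NDArray, NDArray, NDArray]:
    # forward pass
    z1 = self.W_1 @ x + self.b_1               # (30, 1)
    a1 = sigmoid(z1)                           # (30, 1)
    z2 = self.W_2 @ a1 + self.b_2              # (10, 1)
    y_hat = softmax(z2, axis=0)                # (10, 1)
    loss = cross_entropy(y, y_hat)

    # backward pass
    delta2 = y_hat - y                         # error at the output layer
    dW2 = delta2 @ a1.T                        # (10, 30)
    db2 = delta2                               # (10, 1)
    delta1 = (self.W_2.T @ delta2) * sigmoid_prime(z1)  # error at the hidden layer
    dW1 = delta1 @ x.T                         # (30, 784)
    db1 = delta1                               # (30, 1)
    return loss, dW1, db1, dW2, db2
```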
Test it with:

```
pytest tests/test_network.py -k test_train_step
```
Implement a method `train_update` that takes as input a mini-batch of data and a learning rate and does the following:

- loops over each tuple of pairs of vectors `(x, y)`
- applies `train_step` and accumulates the gradients into suitable variables
- does an SGD update by dividing the accumulated gradients by the total number of observations in the mini-batch
  - i.e. it performs batch SGD (average of gradients over the mini-batch)

It should have the following signature:

```python
def train_update(
    self, mini_batch: list[tuple[NDArray, NDArray]], lr: float
) -> float:
```

Test it with:
```
pytest tests/test_network.py -k test_train_update
```
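A sketch of what the method could look like under the same assumptions as above (accumulate the per-example gradients, average them, then take one gradient step of size `lr`); returning the mean loss is an assumption:

```python
def train_update(
    self, mini_batch: list[tuple[NDArray, NDArray]], lr: float
) -> float:
    n = len(mini_batch)
    total_loss = 0.0
    dW1, db1 = np.zeros_like(self.W_1), np.zeros_like(self.b_1)
    dW2, db2 = np.zeros_like(self.W_2), np.zeros_like(self.b_2)
    for x, y in mini_batch:
        loss, gW1, gb1, gW2, gb2 = self.train_step(x, y)
        total_loss += loss
        dW1 += gW1
        db1 += gb1
        dW2 += gW2
        db2 += gb2
    # average the gradients over the mini-batch and take one SGD step
    self.W_1 -= lr * dW1 / n
    self.b_1 -= lr * db1 / n
    self.W_2 -= lr * dW2 / n
    self.b_2 -= lr * db2 / n
    return total_loss / n
```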
Implement a `predict` method that computes a forward pass for a single observation.

```python
def predict(self, x: NDArray) -> NDArray: ...
```

Test it with:
```
pytest tests/test_network.py -k test_predict
```

Make a training loop where you estimate the model. Get inspiration from here.
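To tie things together, here is a sketch of the `predict` method and of a simple training loop; the epoch count, batch size, and learning rate are arbitrary placeholders, and the loop assumes the data helpers defined at the top of the module:

```python
def predict(self, x: NDArray) -> NDArray:
    """Forward pass for a single observation (a sketch)."""
    a1 = sigmoid(self.W_1 @ x + self.b_1)
    return softmax(self.W_2 @ a1 + self.b_2, axis=0)
```

```python
# sketch of a training loop; hyperparameters are placeholders
rng = np.random.default_rng(0)

download_mnist()
X, y = load_mnist()
data = prepare_nielsen(X, y)

net = BasicNetwork(rng=rng)
epochs, batch_size, lr = 5, 32, 0.5
for epoch in range(epochs):
    perm = rng.permutation(len(data))
    batches = [
        [data[i] for i in perm[k : k + batch_size]]
        for k in range(0, len(perm), batch_size)
    ]
    losses = [net.train_update(batch, lr) for batch in batches]
    print(f"epoch {epoch}: mean loss {np.mean(losses):.4f}")
```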
Optional: Exercise 5 - Neural Network theory
Prove the following about the softmax function:
For any coordinates \(i, k \in \{1,\dots,K\}\): \[ \frac{\partial \mathrm{sm}(z)_i}{\partial z_k} = \mathrm{sm}(z)_i \, [\delta_{ik} - \mathrm{sm}(z)_k], \] where \(\delta_{ik} = \mathbf{1}\{i = k\}\) is the Kronecker delta.
Prove the equation for the error of the output layer from the lecture. That is, for
\[\begin{aligned} \ell = -\sum_{k = 1}^{10} y_{k} \log \hat{y}_{k} = -\sum_{k = 1}^{10} y_{k} \log \mathrm{sm}(z^{l_{2}})_{k} \end{aligned}\] show that \[\begin{aligned} \frac{\partial \ell}{\partial z_{j}^{l_{2}}} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_{j}^{l_{2}}} = \hat{y}_{j} - y_{j} \end{aligned}\]
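A sketch of the main steps for the second part, assuming the softmax Jacobian from the first part and that \(y\) is one-hot so that \(\sum_{k} y_{k} = 1\): \[\begin{aligned} \frac{\partial \ell}{\partial z_{j}^{l_{2}}} = -\sum_{k = 1}^{10} y_{k} \frac{\partial \log \hat{y}_{k}}{\partial z_{j}^{l_{2}}} = -\sum_{k = 1}^{10} \frac{y_{k}}{\hat{y}_{k}} \, \hat{y}_{k} \left[\delta_{kj} - \hat{y}_{j}\right] = -\sum_{k = 1}^{10} y_{k} \left[\delta_{kj} - \hat{y}_{j}\right] = \hat{y}_{j} \sum_{k = 1}^{10} y_{k} - y_{j} = \hat{y}_{j} - y_{j}. \end{aligned}\]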