Exercise Class 2
If you haven’t solved all the coding exercises from last week, please do so before attempting this week’s. Nevertheless, in the following we take this repository as our starting point. You can compare your own solutions from last week with it if you like.
Exercise 0 - Testing
To verify that your functions are correct we will try something new this week. We will install the pytest module with uv as a development dependency.
Run

```
uv add pytest --dev
```

in your shell.
Verify that the following field is now present in your `pyproject.toml` file:

```toml
[dependency-groups]
dev = [
    "pytest>=8.4.2",
]
```

Make a directory called `tests` in your project directory (the directory where your `pyproject.toml` file is located). Add a `__init__.py` file inside the `tests` directory.

Add a file called `test_test.py` inside the `tests` directory. Paste the following function inside that file:

```python
def test_test():
    assert 1 > 0
```

Run:
```
pytest
```

You should get output similar to:

```
================================= test session starts =================================
platform linux -- Python 3.13.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/jsr-p/ai-4-humanity/exercises
configfile: pyproject.toml
plugins: anyio-4.10.0
collected 1 item

tests/test_test.py .                                                            [100%]

================================== 1 passed in 0.02s ==================================
```

If so, great: you are now ready to bulletproof your code with tests. We won’t go into details, but browse around the docs of pytest if you think tests are interesting.
Our use case for pytest is only for you to verify that your functions are correct while solving the exercises. See the next exercise.
Download the tests for this week’s exercises from github here. You should place the files

```
tests/test_tree.py
tests/test_network.py
```

in your own `tests` directory of your project.
Exercises Tree-based methods
Exercise 1 - Impurity functions
Add `scikit-learn` as a project dependency using `uv add`.

Make a file called `src/ai4h/models/tree.py`, i.e. a new module for functions related to tree-based methods.
All the following functions should be implemented inside your new `tree.py` module.
Implement a function named `cprobs` that, for a given input vector of numeric class labels, computes the relative counts. This function might be of interest.

It should have the following signature:

```python
from numpy.typing import NDArray


def cprobs(y: NDArray[np.integer]): ...
```

Test it with:

```
pytest tests/test_tree.py -k test_cprobs
```

Implement the `gini` impurity function from this slide.

It should have the following signature:

```python
def gini(y: NDArray[np.integer]):
    """gini impurity function"""
    ...
```

Compared to the definition in the slides, the function should take the vector of outcome labels, `y`, as input and not the conditional probabilities. Use the `cprobs` function to compute the conditional probabilities inside your implementation.

Test it with:

```
pytest tests/test_tree.py -k test_gini
```
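For reference, here is a minimal sketch of what the two functions could look like; the use of `np.unique` is just one option (it may or may not be the helper the exercise links to):

```python
import numpy as np
from numpy.typing import NDArray


def cprobs(y: NDArray[np.integer]) -> NDArray[np.floating]:
    """Relative frequency of each class label in y (a sketch)."""
    _, counts = np.unique(y, return_counts=True)
    return counts / y.shape[0]


def gini(y: NDArray[np.integer]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions (a sketch)."""
    p = cprobs(y)
    return float(1.0 - np.sum(p**2))
```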
Exercise 2 - Goodness of split and best split
Implement a function called `goodness_split` with the following signature:

```python
def goodness_split(
    y_left: NDArray,
    y_right: NDArray,
    imp_parent: float,
    impurity_fn: Callable[[NDArray], float] = gini,
) -> float: ...
```

This corresponds to the one defined in this slide.

Note: The type hint `Callable[[NDArray], float] = gini` for the `impurity_fn` argument means that `impurity_fn` is a function taking a numpy array (with any values) and returning a single `float`, and that its default value is the `gini` function defined in the previous exercise. See also here.

Test it with:

```
pytest tests/test_tree.py -k test_goodness_split
```

The next function we implement is the one that will find the best split using our `goodness_split` function:

```python
def best_split(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    min_samples_leaf: int = 1,
) -> tuple[int, float, float]: ...
```

Here is some pseudocode for what the function should do:
1. Let `n, d` = number of samples and number of features from shape of `X`.
2. If `n < 2 * min_samples_leaf`:
return `(-1, NaN, 0.0)` (not enough samples to split).
3. Compute parent impurity = `impurity_fn(y)`.
4. Initialize:
`best_feature = -1`
`best_threshold = NaN`
`best_gain = 0.0`
5. For each feature index `j` from `0` to `d-1`:
a. Extract column `xj = X[:, j]`.
b. Compute sorted order of `xj`, keeping stability.
c. Create `x_sorted` and `y_sorted` accordingly.
d. For each split index `i` from `min_samples_leaf-1` to `n - min_samples_leaf - 1`:
i. If `x_sorted[i] == x_sorted[i+1]`, skip (no threshold between equal values).
      ii. Compute `gain = goodness_split(y_sorted[0:i+1], y_sorted[i+1:], imp_parent, impurity_fn)`, i.e. using the parent impurity.
iii. If `gain > best_gain`:
– Set `best_threshold = average of x_sorted[i] and x_sorted[i+1]` (i.e. midpoint between the two feature values)
– Set `best_gain = gain`.
– Set `best_feature = j`.
6. Return `(best_feature, best_threshold, best_gain)`.
Use `np.argsort` with `kind="stable"` for sorting.
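As a reference point, here is a minimal sketch of both functions following the slide definition and the pseudocode above; it is one possible implementation and not necessarily the exact one the tests were written against (it assumes the `gini` function from Exercise 1 is in scope):

```python
from typing import Callable

import numpy as np
from numpy.typing import NDArray


def goodness_split(
    y_left: NDArray,
    y_right: NDArray,
    imp_parent: float,
    impurity_fn: Callable[[NDArray], float] = gini,
) -> float:
    """Impurity reduction: parent impurity minus the weighted
    average impurity of the two children (a sketch)."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return imp_parent - (
        n_left / n * impurity_fn(y_left) + n_right / n * impurity_fn(y_right)
    )


def best_split(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    min_samples_leaf: int = 1,
) -> tuple[int, float, float]:
    """Exhaustive search over features and thresholds (a sketch)."""
    n, d = X.shape
    if n < 2 * min_samples_leaf:
        return -1, float("nan"), 0.0
    imp_parent = impurity_fn(y)
    best_feature, best_threshold, best_gain = -1, float("nan"), 0.0
    for j in range(d):
        xj = X[:, j]
        order = np.argsort(xj, kind="stable")
        x_sorted, y_sorted = xj[order], y[order]
        for i in range(min_samples_leaf - 1, n - min_samples_leaf):
            if x_sorted[i] == x_sorted[i + 1]:
                continue  # no threshold between equal feature values
            gain = goodness_split(
                y_sorted[: i + 1], y_sorted[i + 1 :], imp_parent, impurity_fn
            )
            if gain > best_gain:
                best_feature = j
                best_threshold = float((x_sorted[i] + x_sorted[i + 1]) / 2)
                best_gain = gain
    return best_feature, best_threshold, best_gain
```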
Exercise 3 - Building the tree
Copy the following code and insert it into your `tree.py` file:

```python
from dataclasses import dataclass


def compute_pred(y: NDArray) -> float:
    """computes predicted values.

    Guesses whether classification or regression tree by inspecting
    dtype of outcome variable.
    """
    if np.issubdtype(y.dtype, np.integer):
        return int(np.argmax(np.bincount(y)))
    return float(y.mean())


@dataclass
class TerminalNode:
    prediction: float
    impurity: float
    depth: int
    n_samples: int

    def predict_one(self, _):
        return self.prediction

    def predict(self, X):
        return np.full(len(X), self.prediction)

    def pretty(self, indent=""):
        return f"{indent}Leaf(pred={self.prediction:.2f}, n={self.n_samples})\n"


@dataclass
class NonterminalNode:
    feature: int
    threshold: float
    left: "Node"
    right: "Node"
    impurity: float
    depth: int
    n_samples: int

    def predict_one(self, x):
        return (
            self.left.predict_one(x)
            if x[self.feature] <= self.threshold
            else self.right.predict_one(x)
        )

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

    def pretty(self, indent=""):
        out = (
            f"{indent}Node(f{self.feature}<={self.threshold:.2f}, n={self.n_samples})\n"
        )
        return out + self.left.pretty(indent + "  ") + self.right.pretty(indent + "  ")


Node = TerminalNode | NonterminalNode
```

Convince yourself that the `predict` methods for the two node classes make sense, e.g. what happens if we input `X = np.array([[6.1, 2.6, 5.6, 1.4]])` (and we assume the fitted tree is as in the lecture)?

Write a function called `build_tree` that builds the tree using `best_split` and the `TerminalNode` and `NonterminalNode` classes above.

It should have the following signature:

```python
def build_tree(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    max_depth: int = 32,
    min_samples_leaf: int = 1,
    depth: int = 0,
) -> Node: ...
```

Note that we have three arguments besides the input data and the impurity function:

- `max_depth`: parameter to set a max depth that the tree can grow to
- `min_samples_leaf`: minimum number of samples in terminal nodes before stopping
- `depth`: the depth of the current node in the tree
`build_tree` should build the tree recursively as follows:

- If `depth >= max_depth` or `len(y) <= 2 * min_samples_leaf` or `best_split` gives no valid/beneficial split:
  - Return a `TerminalNode`
  - The `TerminalNode` should compute the prediction using the `compute_pred` function given above
- Otherwise:
  - Split on the best feature/threshold and return a `NonterminalNode` with its `left` and `right` values being `Node`s returned from a recursive call to `build_tree` on the two subsamples based on the found split
  - The conditions above ensure the function terminates.
This is the recursive binary splitting procedure.
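As a point of reference, here is a minimal sketch of one possible `build_tree` that follows the recursion above; it assumes the `best_split`, `compute_pred`, and node definitions from earlier in the module, and your own version may of course differ:

```python
def build_tree(
    X: NDArray,
    y: NDArray,
    *,
    impurity_fn: Callable[[NDArray], float] = gini,
    max_depth: int = 32,
    min_samples_leaf: int = 1,
    depth: int = 0,
) -> Node:
    """Recursive binary splitting (a sketch)."""
    n = len(y)
    imp = impurity_fn(y)
    feature, threshold, gain = best_split(
        X, y, impurity_fn=impurity_fn, min_samples_leaf=min_samples_leaf
    )
    # stop if the tree is deep enough, the node is too small, or no split helps
    if depth >= max_depth or n <= 2 * min_samples_leaf or gain <= 0.0:
        return TerminalNode(
            prediction=compute_pred(y), impurity=imp, depth=depth, n_samples=n
        )
    mask = X[:, feature] <= threshold
    kwargs = dict(
        impurity_fn=impurity_fn,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        depth=depth + 1,
    )
    return NonterminalNode(
        feature=feature,
        threshold=threshold,
        left=build_tree(X[mask], y[mask], **kwargs),
        right=build_tree(X[~mask], y[~mask], **kwargs),
        impurity=imp,
        depth=depth,
        n_samples=n,
    )
```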
The function returns a `Node`, which can be either a terminal or a nonterminal node, cf. the definition `Node = TerminalNode | NonterminalNode`.

Test it with:

```
pytest tests/test_tree.py -k test_build_tree
```

Build a tree as follows:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2025, train_size=0.8
)
tree = build_tree(X_train, y_train, impurity_fn=gini, min_samples_leaf=5)
```

This should build the tree as shown in the lecture.
If you print out the `tree` variable you should get:

```
NonterminalNode(feature=2, threshold=np.float64(2.45), left=TerminalNode(prediction=0, impurity=0.0, depth=1, n_samples=36), right=NonterminalNode(feature=3, threshold=np.float64(1.75), left=NonterminalNode(feature=2, threshold=np.float64(4.85), left=NonterminalNode(feature=0, threshold=np.float64(5.15), left=TerminalNode(prediction=1, impurity=0.32, depth=4, n_samples=5), right=TerminalNode(prediction=1, impurity=0.0, depth=4, n_samples=38), impurity=0.045429962141698255, depth=3, n_samples=43), right=TerminalNode(prediction=1, impurity=0.5, depth=3, n_samples=6), impurity=0.14993752603082044, depth=2, n_samples=49), right=TerminalNode(prediction=2, impurity=0.0, depth=2, n_samples=35), impurity=0.49744897959183676, depth=1, n_samples=84), impurity=0.66375, depth=0, n_samples=120)
```

Compare it to the figure from the lecture.
You can format it manually to show the tree structure as follows:

```
NonterminalNode(
    feature=2,
    threshold=np.float64(2.45),
    left=TerminalNode(prediction=0, impurity=0.0, depth=1, n_samples=36),
    right=NonterminalNode(
        feature=3,
        threshold=np.float64(1.75),
        left=NonterminalNode(
            feature=2,
            threshold=np.float64(4.85),
            left=NonterminalNode(
                feature=0,
                threshold=np.float64(5.15),
                left=TerminalNode(
                    prediction=1, impurity=0.32, depth=4, n_samples=5
                ),
                right=TerminalNode(
                    prediction=1, impurity=0.0, depth=4, n_samples=38
                ),
                impurity=0.045429962141698255,
                depth=3,
                n_samples=43
            ),
            right=TerminalNode(
                prediction=1, impurity=0.5, depth=3, n_samples=6
            ),
            impurity=0.14993752603082044,
            depth=2,
            n_samples=49
        ),
        right=TerminalNode(
            prediction=2, impurity=0.0, depth=2, n_samples=35
        ),
        impurity=0.49744897959183676,
        depth=1,
        n_samples=84
    ),
    impurity=0.66375,
    depth=0,
    n_samples=120
)
```

A beautiful tree grown by you!
You can also use the `pretty` method:

```python
print(tree.pretty())
```

```
NonTerminalNode(f2<=2.45, n=120)
  Terminal(pred=0.00, n=36)
  NonTerminalNode(f3<=1.75, n=84)
    NonTerminalNode(f2<=4.85, n=49)
      NonTerminalNode(f0<=5.15, n=43)
        Terminal(pred=1.00, n=5)
        Terminal(pred=1.00, n=38)
      Terminal(pred=1.00, n=6)
    Terminal(pred=2.00, n=35)
```
Optional: Exercise O1 - Regression Tree
Make an `mse` impurity function to be used instead of `gini` for regression data (i.e. a continuous outcome variable).

If you already have a working `build_tree` function then you are done. All you need is a dataset to apply it to and to set `impurity_fn=mse`. Try running the test:

```
pytest tests/test_tree.py -k test_regression_tree
```

It estimates a regression tree on the `diabetes` dataset of sklearn. Play around with the code in the test if you like.
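A minimal sketch of such a function, using the within-node mean squared deviation from the node mean (check this against the definition used in the lecture):

```python
def mse(y: NDArray) -> float:
    """Mean squared deviation from the node mean, i.e. an impurity
    function for regression trees (a sketch)."""
    if len(y) == 0:
        return 0.0
    return float(np.mean((y - y.mean()) ** 2))
```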
Optional: Exercise O2 - Random Forest
Make a `DecisionTreeClassifier` class in your `tree.py` file that follows sklearn’s model API; it should:

- have a `fit` method
- have a `predict` method
- take the `max_depth` and `min_samples_leaf` parameters as input in the constructor, and also the impurity function
Make a module `src/ai4h/models/ensemble.py`.

Create a `RandomForestClassifier` inside `ensemble.py`. Try to make it follow sklearn’s API with a minimal set of relevant parameters; see the source for help (if you like).

Test it on the `iris` data and compare it.

Note: No tests are given for this exercise.
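For inspiration, here is a minimal sketch of what the two classes could look like. The class names mirror sklearn’s, but the parameter choices, the bootstrap sampling, the majority vote via `np.bincount`, and the import path are all assumptions rather than a prescribed solution:

```python
import numpy as np
from numpy.typing import NDArray

# assumes build_tree and gini from your tree.py module
from ai4h.models.tree import build_tree, gini


class DecisionTreeClassifier:
    """Thin wrapper around build_tree with a fit/predict API (a sketch)."""

    def __init__(self, *, max_depth=32, min_samples_leaf=1, impurity_fn=gini):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.impurity_fn = impurity_fn
        self.tree_ = None

    def fit(self, X: NDArray, y: NDArray):
        self.tree_ = build_tree(
            X,
            y,
            impurity_fn=self.impurity_fn,
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
        )
        return self

    def predict(self, X: NDArray) -> NDArray:
        return self.tree_.predict(X)


class RandomForestClassifier:
    """Bagged ensemble of decision trees with majority voting (a sketch)."""

    def __init__(
        self, *, n_estimators=100, max_depth=32, min_samples_leaf=1, rng=None
    ):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.rng = rng or np.random.default_rng()
        self.trees_: list[DecisionTreeClassifier] = []

    def fit(self, X: NDArray, y: NDArray):
        n = len(y)
        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = self.rng.integers(0, n, size=n)  # bootstrap sample
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth, min_samples_leaf=self.min_samples_leaf
            )
            self.trees_.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X: NDArray) -> NDArray:
        # majority vote over the individual trees' predictions
        preds = np.stack([t.predict(X) for t in self.trees_]).astype(int)
        return np.array([np.bincount(col).argmax() for col in preds.T])
```

Note that a full random forest would also sample a random subset of features at each split; the sketch above only bootstraps the rows.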
Exercises Neural Networks
Exercise 4 - Simplest Neural Network from the lecture
Make a module `src/ai4h/models/network.py`.

Copy the following helpers to your module:

```python
from pathlib import Path

import numpy as np
from numpy.typing import NDArray
from sklearn.datasets import fetch_openml

fp_proj = Path(__file__).parents[2]
fp_data = fp_proj.joinpath("data")
fp_data.mkdir(exist_ok=True)
mnist_file = fp_data / "mnist.npz"


def onehot(y: NDArray[np.uint8], n_classes: int = 10) -> NDArray[np.uint8]:
    return np.eye(n_classes, dtype=np.uint8)[y]


def prepare_nielsen(
    X: NDArray[np.int64],
    y: NDArray[np.uint8],
    y_transform=lambda x: x.reshape(10, 1),
    one_hot: bool = True,
) -> list[tuple[NDArray, NDArray]]:
    """Massages data into the shape MN's code expects."""
    return [
        (x.reshape(784, 1) / 255, y_transform(v))
        for x, v in zip(X, onehot(y) if one_hot else y)
    ]


def download_mnist():
    if mnist_file.exists():
        print("Mnist data already exist in data folder")
        return
    mnist = fetch_openml("mnist_784", as_frame=False)
    X = mnist["data"].astype(np.uint8)  # type: ignore
    y = mnist["target"].astype(np.uint8)  # type: ignore
    np.savez_compressed(mnist_file, X=X, y=y)
    print(f"Mnist data saved to {fp_data}")


def load_mnist() -> tuple[NDArray[np.int64], NDArray[np.uint8]]:
    if not mnist_file.exists():
        raise ValueError("Mnist data hasn't been downloaded")
    data = np.load(fp_data / "mnist.npz")
    X, y = data["X"], data["y"]
    return X, y
```

Copy the following function to your module:
```python
from numpy.typing import NDArray
import numpy as np


def softmax(z: NDArray, axis: int = 1) -> NDArray:
    z_shift = z - z.max(axis=axis, keepdims=True)  # numerical stability
    expz = np.exp(z_shift)
    return expz / expz.sum(axis=axis, keepdims=True)
```

Read the first part of this section on wiki and explain the function to yourself.
Implement a function named `cross_entropy` that computes the cross-entropy loss as shown here. You should add `eps = 1e-9` to the predicted values before taking the log for numerical stability.

Test it with:
```
pytest tests/test_network.py -k test_cross_entropy
```
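A minimal sketch of what it could look like; the argument names and the fact that it operates on a single pair of vectors (rather than a batch) are assumptions to check against the test:

```python
def cross_entropy(y: NDArray, y_hat: NDArray) -> float:
    """Cross-entropy loss -sum_k y_k * log(y_hat_k) for one example (a sketch)."""
    eps = 1e-9  # avoid log(0)
    return float(-np.sum(y * np.log(y_hat + eps)))
```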
Copy the following into your module:

```python
class BasicNetwork:
    """Basic network with 1 hidden layer (30 units)."""

    def __init__(self, rng: np.random.Generator | None = None):
        rng = rng or np.random.default_rng()
        self.W_1 = rng.normal(0, 0.1, size=(_, _))
        self.b_1 = np.zeros((_, _))
        self.W_2 = rng.normal(0, 0.1, size=(_, _))
        self.b_2 = np.zeros((_, _))
```

Replace the placeholders `_` with the correct dimensions, cf. the example of the slides here.
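As a sanity check on the shapes: with MNIST inputs reshaped to column vectors of length 784 (see `prepare_nielsen`), a hidden layer of 30 units, and 10 output classes, one consistent fill of the placeholders would be the following; verify it against the slides:

```python
# assumed dimensions: 784 inputs, 30 hidden units, 10 output classes
self.W_1 = rng.normal(0, 0.1, size=(30, 784))  # hidden layer weights
self.b_1 = np.zeros((30, 1))                   # hidden layer bias
self.W_2 = rng.normal(0, 0.1, size=(10, 30))   # output layer weights
self.b_2 = np.zeros((10, 1))                   # output layer bias
```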
Implement a method `train_step` that does a forward pass and then does backpropagation. It should have the following signature:

```python
def train_step(
    self,
    x: NDArray,
    y: NDArray,
) -> tuple[float, NDArray, NDArray, NDArray, NDArray]:
```

It returns the loss and gradients for a single input example `x, y`, i.e. its return statement should be `return loss, dW1, db1, dW2, db2`. Get guidance from the lecture.

You can use the following helper functions:
```python
def sigmoid(z: NDArray):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z: NDArray):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))
```
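For orientation, here is a sketch of one way the method could look for a 784-30-10 network with a sigmoid hidden layer, softmax output, and cross-entropy loss; the intermediate variable names are made up here, so check the details against the lecture and the tests:

```python
def train_step(
    self,
    x: NDArray,
    y: NDArray,
) -> tuple[float, NDArray, NDArray, NDArray, NDArray]:
    # forward pass
    z1 = self.W_1 @ x + self.b_1               # (30, 1)
    a1 = sigmoid(z1)                           # (30, 1)
    z2 = self.W_2 @ a1 + self.b_2              # (10, 1)
    y_hat = softmax(z2, axis=0)                # (10, 1)
    loss = cross_entropy(y, y_hat)

    # backward pass
    delta2 = y_hat - y                         # error at the output layer
    dW2 = delta2 @ a1.T                        # (10, 30)
    db2 = delta2                               # (10, 1)
    delta1 = (self.W_2.T @ delta2) * sigmoid_prime(z1)  # error at the hidden layer
    dW1 = delta1 @ x.T                         # (30, 784)
    db1 = delta1                               # (30, 1)
    return loss, dW1, db1, dW2, db2
```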
Test it with:

```
pytest tests/test_network.py -k test_train_step
```
Implement a method `train_update` that takes as input a mini-batch of data and a learning rate and does the following:

- loops over each tuple of pairs of vectors `(x, y)`
- applies `train_step` and accumulates the gradients into suitable variables
- does an SGD update by dividing the accumulated gradients by the total number of observations in the mini-batch
  - i.e. it performs batch SGD (average of gradients over the mini-batch)

It should have the following signature:

```python
def train_update(
    self, mini_batch: list[tuple[NDArray, NDArray]], lr: float
) -> float:
```

Test it with:
```
pytest tests/test_network.py -k test_train_update
```
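A sketch of what the method could look like under the same assumptions as above (accumulate the per-example gradients, average them, then take one gradient step of size `lr`); returning the mean loss is an assumption:

```python
def train_update(
    self, mini_batch: list[tuple[NDArray, NDArray]], lr: float
) -> float:
    n = len(mini_batch)
    total_loss = 0.0
    dW1, db1 = np.zeros_like(self.W_1), np.zeros_like(self.b_1)
    dW2, db2 = np.zeros_like(self.W_2), np.zeros_like(self.b_2)
    for x, y in mini_batch:
        loss, gW1, gb1, gW2, gb2 = self.train_step(x, y)
        total_loss += loss
        dW1 += gW1
        db1 += gb1
        dW2 += gW2
        db2 += gb2
    # average the gradients over the mini-batch and take one SGD step
    self.W_1 -= lr * dW1 / n
    self.b_1 -= lr * db1 / n
    self.W_2 -= lr * dW2 / n
    self.b_2 -= lr * db2 / n
    return total_loss / n
```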
Implement a `predict` method that computes a forward pass for a single observation.

```python
def predict(self, x: NDArray) -> NDArray: ...
```

Test it with:
```
pytest tests/test_network.py -k test_predict
```

Make a training loop where you estimate the model. Get inspiration from here.
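To tie things together, here is a sketch of the `predict` method and of a simple training loop; the epoch count, batch size, and learning rate are arbitrary placeholders, and the loop assumes the data helpers defined at the top of the module:

```python
def predict(self, x: NDArray) -> NDArray:
    """Forward pass for a single observation (a sketch)."""
    a1 = sigmoid(self.W_1 @ x + self.b_1)
    return softmax(self.W_2 @ a1 + self.b_2, axis=0)
```

```python
# sketch of a training loop; hyperparameters are placeholders
rng = np.random.default_rng(0)

download_mnist()
X, y = load_mnist()
data = prepare_nielsen(X, y)

net = BasicNetwork(rng=rng)
epochs, batch_size, lr = 5, 32, 0.5
for epoch in range(epochs):
    perm = rng.permutation(len(data))
    batches = [
        [data[i] for i in perm[k : k + batch_size]]
        for k in range(0, len(perm), batch_size)
    ]
    losses = [net.train_update(batch, lr) for batch in batches]
    print(f"epoch {epoch}: mean loss {np.mean(losses):.4f}")
```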
Optional: Exercise 5 - Neural Network theory
Prove the following about the softmax function:
For any coordinates \(i, k \in \{1,\dots,K\}\): \[ \frac{\partial \mathrm{sm}(z)_i}{\partial z_k} = \mathrm{sm}(z)_i \, [\delta_{ik} - \mathrm{sm}(z)_k], \] where \(\delta_{ik} = \mathbf{1}\{i = k\}\) is the Kronecker delta.
Prove the equation for the error of the output layer from the lecture. That is, for
\[\begin{aligned} \ell = -\sum_{k = 1}^{10} y_{k} \log \hat{y}_{k} = -\sum_{k = 1}^{10} y_{k} \log \mathrm{sm}(z^{l_{2}})_{k} \end{aligned}\] show that \[\begin{aligned} \frac{\partial \ell}{\partial z_{j}^{l_{2}}} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_{j}^{l_{2}}} = \hat{y}_{j} - y_{j} \end{aligned}\]
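A sketch of the main steps for the second part, assuming the softmax Jacobian from the first part and that \(y\) is one-hot so that \(\sum_{k} y_{k} = 1\): \[\begin{aligned} \frac{\partial \ell}{\partial z_{j}^{l_{2}}} = -\sum_{k = 1}^{10} y_{k} \frac{\partial \log \hat{y}_{k}}{\partial z_{j}^{l_{2}}} = -\sum_{k = 1}^{10} \frac{y_{k}}{\hat{y}_{k}} \, \hat{y}_{k} \left[\delta_{kj} - \hat{y}_{j}\right] = -\sum_{k = 1}^{10} y_{k} \left[\delta_{kj} - \hat{y}_{j}\right] = \hat{y}_{j} \sum_{k = 1}^{10} y_{k} - y_{j} = \hat{y}_{j} - y_{j}. \end{aligned}\]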