AI For Humanity

Lecture 2 - Tree-based methods and Neural Networks

Jonas Skjold Raaschou-Pedersen

2025-09-10

Tree-based Methods

  • Why Tree-based methods?
    • Perform well on tabular data
    • Decision Trees, Random Forest, XGBoost are tree-based methods
    • Hence good to know the fundamentals!
    • Naturally handle mixed feature types (numeric & categorical)
    • Provide interpretability (e.g. decision tree diagrams)
    • Require minimal preprocessing (no scaling needed, robust to outliers)
  • Later on in the course we consider the Causal Forest method
    • Builds upon Random Forests
  • And so we build the foundation

Setup

Consider a dataset with outcomes belonging to \(3\) classes…

Setup

The Iris dataset (Fisher, 1936)

Setup

from sklearn.datasets import load_iris

print(load_iris().DESCR)

yields:

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
  • Goal: classify iris flowers into the three categories
  • Approach: we’ll use a Decision Tree as our classifier

source 1 & 2

All 2-combinations of features

Could we carve out the feature space…?

Asking questions to the data

Questions that arise

  • How do we “ask questions to the data”, i.e. how do we choose a split?
  • How do we measure how “good” a split is?
  • When to stop splitting?
  • How to make predictions in the leaves?

Approach we’ll take

  • Cannot consider every possible partition of the feature space into e.g. \(J\) boxes
  • Instead: top-down, greedy approach known as recursive binary splitting
    • top-down: starts at top of tree then successively splits data (indicated via branches)
    • greedy: at each step, the best split is locally optimal (doesn’t look ahead)
    • recursive: after splitting, solve similar problem on the two subsamples
    • binary: the splits are binary (yielding a binary tree; see previous slide)

Setup

  • Data tuple \((X, Y) \in \mathcal{X} \times \mathcal{Y}\) where \(\mathcal{X}\) is the feature space and \(\mathcal{Y}\) the outcome space
    • Classification: \(\mathcal{Y} = \{1, 2, \ldots, K\}\)
    • Regression: \(\mathcal{Y} = \mathbb{R}\)
      • Note: we won’t consider Regression Trees in the lecture; only in the exercise class
    • Iris data:
      • \(\mathcal{X} = \mathbb{R}^{4}\) (sepal length, sepal width, petal length, petal width)
      • \(\mathcal{Y} = \{1, 2, 3\}\) (Setosa, Versicolour, Virginica)
  • Dataset \(\{(X_{i}, Y_{i})\}_{i=1}^{n}\) (e.g. \(n = 150\) in the Iris data)
  • ESLII and ISL books both denote nodes in the decision tree as “boxes”; I’ll instead follow the notation and terminology in (Breiman, 1984)
  • Node: A node \(t\) is defined as a subset of \(\mathcal{X}\) (the feature space) i.e. \(t \subset \mathcal{X}\)
    • Example: \(t = \{X_{i} \mid X_{i, 3} \leq 2.45\}\) (all samples with petal length \(\leq\) 2.45 cm)
  • Nonterminal node: a node where a split is made
  • Terminal node: a node where no further split is made

Inspecting the Tree

  • Question: consider \(X_{i} = (6.1, 2.6, 5.6, 1.4)^{T}\); what is the predicted label \(\hat{Y}_{i}?\)
    • Order of \(X\) variables: (sepal length, sepal width, petal length, petal width)

Construction of tree

  • Construction of a tree:
    1. Selecting the splits
    2. Declaring a node terminal or continue splitting it
    3. Assignment of each terminal node to a class
  • Class assignment is simple
  • Main problem: finding good splits and knowing when to stop splitting
  • We want to choose the split of each node so that the data in each of the descendant nodes are “purer” than the data in the parent node

Counting and proportions

  • Define:
    • \(N\): total number of samples
    • \(N_{j}\): total number of samples in class \(j\)
    • \(N(t)\): number of samples going to node \(t\)
    • \(N_{j}(t)\): number of samples going to node \(t\) of class \(j\)
  • Let \(p(t)\) be the proportion of samples going to node \(t\); \(p(j, t)\) the proportion going to node \(t\) and being class \(j\); and \(p(j \mid t)\) the proportion being class \(j\) given node \(t\)
  • Then we can compute the proportions: \[\begin{aligned} &p(t) = \frac{N(t)}{N} \\ &p(j, t) = \frac{N_{j}(t)}{N} \\ &p(j \mid t) = p(j, t) / p(t) = \frac{N_{j}(t)}{N} / \frac{N(t)}{N} = \frac{N_{j}(t)}{N(t)} \end{aligned}\]
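
A minimal numpy sketch of these quantities; the counts below are purely illustrative, not taken from the Iris tree:

import numpy as np

N = 150                        # total number of samples
N_j_t = np.array([0, 49, 5])   # illustrative N_j(t): samples of each class in node t
N_t = N_j_t.sum()              # N(t)

p_t = N_t / N                  # p(t)
p_j_t = N_j_t / N              # p(j, t) for j = 1, ..., K
p_j_given_t = N_j_t / N_t      # p(j | t) = p(j, t) / p(t)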

Counting and proportions

  • Consider the split of node \(t\) into \(t_{L}\) and \(t_{R}\) (left and right), then the following holds: \[\begin{aligned} & p(t_{L}) + p(t_{R}) = p(t) \\ & p_{L} = \frac{p(t_{L})}{p(t)}, \ p_{R} = \frac{p(t_{R})}{p(t)}, \ p_{L} + p_{R} = 1 \end{aligned}\] where \(p_{L}\) is the proportion going to the left node and \(p_{R}\) to the right

  • E.g. cf. previous slide \[\begin{aligned} p_{L} = \frac{p(t_{L})}{p(t)} = \frac{ \frac{N(t_{L})}{N} }{ \frac{N(t)}{N} } = \frac{N(t_{L})}{N(t)} \end{aligned}\]

Inspecting the Tree

  • Question: consider the nonterminal node with \(N(t) = 84\).

    • Compute the vector of estimated conditional probabilities: \[ (p(1 \mid t), p(2 \mid t), p(3 \mid t))^{T} \] using the expression on the previous slides and the counts in the tree
    • Compute \(p_{L}\) and \(p_{R}\) relative to the same node \(t\)

Approaching a criterion for splitting

  • Define impurity measure of node \(t\): \[\begin{aligned} i(t) = \phi(p(1 \mid t), \ldots, p(K \mid t)) \end{aligned}\] where \(\phi\) is the impurity function defined on all \(K\)-tuples of probabilities

  • Goodness of split \(s\) for node \(t\): \[\begin{aligned} \Delta i(s, t) & := i(t) - p_{R}i(t_{R}) - p_{L}i(t_{L}) \\ &= i(t) - \frac{N(t_{R})}{N(t)}i(t_{R}) - \frac{N(t_{L})}{N(t)}i(t_{L}) \end{aligned}\]

  • The split \(s\) splits node \(t\) into two disjoint subsets \(t_{L}\), \(t_{R}\)

  • Splitting rule: at each nonterminal node \(t\) the split selected, \(s^{*}\), is the one which maximizes \(\Delta i(s, t)\) i.e. \(\Delta i(s^{*}, t) = \max_{s} \Delta i(s, t)\)

    • The split should decrease the weighted impurity of nodes \(t_{L}\) and \(t_{R}\) compared to \(t\)
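
A minimal sketch of this goodness-of-split computation from class counts; the function names and the counts in the example are illustrative, not part of the lecture code:

import numpy as np

def gini(N_j_t: np.ndarray) -> float:
    # Gini impurity of a node from its class counts
    p = N_j_t / N_j_t.sum()
    return float(np.sum(p * (1 - p)))

def delta_i(N_j_t: np.ndarray, N_j_tL: np.ndarray, N_j_tR: np.ndarray) -> float:
    # Goodness of split: i(t) - p_L * i(t_L) - p_R * i(t_R), here with Gini impurity
    N_t, N_tL, N_tR = N_j_t.sum(), N_j_tL.sum(), N_j_tR.sum()
    p_L, p_R = N_tL / N_t, N_tR / N_t
    return gini(N_j_t) - p_L * gini(N_j_tL) - p_R * gini(N_j_tR)

# Illustrative: a node with counts (0, 49, 5) split into (0, 47, 1) and (0, 2, 4)
print(delta_i(np.array([0, 49, 5]), np.array([0, 47, 1]), np.array([0, 2, 4])))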

Split \(s\)

Illustration of split from (Breiman, 1984)

Impurity functions

  • Gini index: \[\begin{aligned} \phi(p(1 \mid t), \ldots, p(K \mid t)) = \sum_{j = 1}^{K} p(j \mid t)(1 - p(j \mid t)) \end{aligned}\]

  • Entropy: \[\begin{aligned} \phi(p(1 \mid t), \ldots, p(K \mid t)) = -\sum_{j = 1}^{K} p(j \mid t) \log p(j \mid t) \end{aligned}\]

  • Misclassification error (all three are sketched in code after this list): \[\begin{aligned} \phi(p(1 \mid t), \ldots, p(K \mid t)) = 1 - \max_{j} p(j \mid t) \end{aligned}\]

  • Using the misclassification error as the impurity function has drawbacks

    • The criterion \(\Delta i(s, t)\) may be zero for all splits \(s\)
    • Reducing the misclassification rate at each step turns out to be a poor guide for the overall multistep tree-growing procedure
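
A minimal numpy sketch of the three impurity functions above, each taking a vector of conditional probabilities \(p(j \mid t)\); the names and the example vector are illustrative:

import numpy as np

def gini(p: np.ndarray) -> float:
    # Gini index: sum_j p_j (1 - p_j)
    return float(np.sum(p * (1 - p)))

def entropy(p: np.ndarray) -> float:
    # Entropy: -sum_j p_j log p_j, with the convention 0 log 0 = 0
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def misclassification(p: np.ndarray) -> float:
    # Misclassification error: 1 - max_j p_j
    return float(1 - np.max(p))

p = np.array([0.0, 49 / 54, 5 / 54])   # illustrative p(j | t) for a node
print(gini(p), entropy(p), misclassification(p))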

Impurity functions

In the two-class case; image from ESLII.

Computing impurity

  • With \(i(t) = \sum_{j = 1}^{K} p(j \mid t)(1 - p(j \mid t))\) (Gini impurity function) and the expression

\[\begin{aligned} \Delta i(s, t) = i(t) - p_{R}i(t_{R}) - p_{L}i(t_{L}) \end{aligned}\]

we can seek the best split \(s\) for each nonterminal node

Computing impurity

  • Question: compute \(\Delta i(s, t)\) for the nonterminal node from before with \(N(t) = 84\); you can use the function below and \(p_{L}\) and \(p_{R}\) from before.
import numpy as np

def gini_from_counts(N_j_t: np.ndarray, N_t: int) -> float:
    # Gini impurity of node t from its class counts N_j(t) and node size N(t)
    p_j_t = N_j_t / N_t
    return float(np.sum(p_j_t * (1 - p_j_t)))
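
For example, with purely hypothetical class counts (not the counts of the node in the question):

N_j_t = np.array([0, 49, 5])         # hypothetical class counts for some node t
print(gini_from_counts(N_j_t, N_t=54))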

Building a Tree

  • With the splitting rule that seeks \(s\) that maximizes \(\Delta i(s, t)\) you are ready to build/grow a tree!
  • One stopping rule is to stop growing the tree whenever a split would result in either child node having fewer than min_samples_leaf samples
    • this parameter exists in sklearn as well; there, terminal nodes are called leaves
  • We leave out details for how to search for the splits to the exercise class

Growing (our toolbox)

Define:

  • Majority label at node \(t\): \(k(t) := \underset{j}{\arg\max} \ p(j \mid t)\)

  • Resubstitution estimate \[\begin{aligned} r(t) = 1 - \max_{j}p(j \mid t) = 1 - p(k(t) \mid t) \end{aligned}\]

  • Node misclassification cost: \(R(t) := r(t)p(t)\)

  • Tree misclassification cost: \[\begin{aligned} R(T) = \sum_{t \in \tilde{T}} R(t) = \sum_{t \in \tilde{T}}r(t)p(t). \end{aligned}\]

where \(\tilde{T}\) is the set of all terminal nodes.
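
A minimal sketch of \(R(T)\) computed from the class counts of the terminal nodes; the counts below are illustrative:

import numpy as np

# Illustrative class counts N_j(t) for each terminal node t in the tree
terminal_counts = [np.array([50, 0, 0]), np.array([0, 49, 5]), np.array([0, 1, 45])]
N = sum(counts.sum() for counts in terminal_counts)   # total number of samples

R_T = 0.0
for N_j_t in terminal_counts:
    N_t = N_j_t.sum()
    r_t = 1 - N_j_t.max() / N_t    # r(t) = 1 - p(k(t) | t)
    p_t = N_t / N                  # p(t)
    R_T += r_t * p_t               # R(t) = r(t) p(t); summing over terminal nodes gives R(T)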

Problem with growing

  • Problem: splitting a node \(t\) into child nodes is guaranteed not to increase the misclassification cost (and typically decreases it); i.e. for any split of a node \(t\) into \(t_{L}\) and \(t_{R}\), \[\begin{aligned} R(t) \geq R(t_{L}) + R(t_{R}). \end{aligned}\]
  • The resubstitution error rate is therefore biased downwards; if we simply minimize it, we always prefer a bigger tree.
    • Overfitting!
  • The strategy is:
    • Grow a large tree \(T_{max}\) until a stopping criterion (e.g. min_samples_leaf) is met
    • Prune the tree (i.e. delete nodes) in a clever way

Pruning

  • Cost-complexity measure \[\begin{aligned} R_{\alpha}(T) = R(T) + \alpha \vert \tilde{T} \vert \end{aligned}\]
  • Can think of \(\alpha\) as the complexity cost per terminal node
    • High \(\alpha\) \(\rightarrow\) simpler tree; low \(\alpha\) \(\rightarrow\) more complex tree
  • Finding the optimal subtree for each \(\alpha\) (which yields a nested sequence of pruned trees) is called cost-complexity pruning
    • Unfortunately we won’t have time for it
    • See here or here if you are interested [or ch. 3. (Breiman, 1984)]
  • Finally, \(K\)-fold cross-validation is used to choose \(\alpha\)
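
A minimal sketch of how the cross-validated choice of \(\alpha\) could look in sklearn, using cost_complexity_pruning_path and the ccp_alpha parameter; the choice of cv=5 is illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alphas from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=2025).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best K-fold cross-validated accuracy
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=2025), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
]
best_alpha = max(zip(scores, path.ccp_alphas))[1]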

The running example

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2025, train_size=0.8
)
clf = DecisionTreeClassifier(
    min_samples_leaf=5,
    criterion="gini",
    random_state=2025,
)
clf.fit(X_train, y_train)

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(8, 8))
plot_tree(
    clf,
    filled=False,
    proportion=False,
    ax=axes,
    impurity=True,
    feature_names=[
      'sepal length (cm)',
      'sepal width (cm)',
      'petal length (cm)',
      'petal width (cm)'
    ],
)
# plt.savefig("figs/iris_dt2.png", bbox_inches="tight", pad_inches=0)
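
Continuing the snippet above, a quick sketch of evaluating the fitted tree on the held-out test data:

# Held-out accuracy of the fitted tree
y_pred = clf.predict(X_test)
print((y_pred == y_test).mean())   # equivalently: clf.score(X_test, y_test)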

From Trees to Forests

  • Bagging
    • Build many trees on bootstrapped samples
    • Bagged Trees
  • Random Forests
    • Build many trees on bootstrapped samples
    • Improvement over Bagged Trees: Random subset of features at each split
    • Decorrelating the trees
    • Aggregate predictions (majority vote / average)
  • Boosting
    • Grow trees sequentially, each focusing on the previous trees’ errors → reduces bias
  • Random Forests are easy to implement once we know Decision Trees / Regression Trees
    • Hence all the time spent on Decision Trees!
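
A minimal sklearn sketch of a Random Forest on the Iris data; the hyperparameters (n_estimators=500, max_features="sqrt") are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=2025, train_size=0.8
)
rf = RandomForestClassifier(
    n_estimators=500,      # number of trees grown on bootstrapped samples
    max_features="sqrt",   # random subset of features at each split (decorrelation)
    random_state=2025,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # majority-vote predictions on the test set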

Trees and Forests

Neural Networks

  • Fundamental method that underlies all of the generative AI revolution
    • Transformers = fancy neural networks
  • Large subject: today we’ll focus on a simple example but still with a lot of depth
  • In the next lectures you will build upon this e.g. when considering language models

Setup

  • The MNIST dataset consists of 70,000 images of handwritten digits (0–9), each a 28 × 28 pixel grayscale image
    • 50,000 training images; 10,000 validation images; 10,000 test images
  • We reshape the pixel matrices into \(28 \cdot 28 = 784\)-dimensional vectors, \(x_{i} \in \mathbb{R}^{784 \times 1}\), and divide the pixel values by \(255\) so they lie in \([0, 1]\)
  • We one-hot encode the labels \(y_{i} \in \{0, 1\}^{10 \times 1}\) (with a \(1\) at the index corresponding to the true label and \(0\) elsewhere)
  • Our goal is to estimate a neural network (think of it as a complex non-linear statistical model) to correctly predict the class labels
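
A minimal numpy sketch of this preprocessing, assuming images is an (n, 28, 28) array of pixel values and labels an (n,) array of digits; how MNIST is loaded depends on your setup:

import numpy as np

def preprocess(images: np.ndarray, labels: np.ndarray):
    # Flatten 28 x 28 images to 784-vectors and scale pixel values to [0, 1]
    X = images.reshape(len(images), 784).astype(np.float64) / 255.0
    # One-hot encode the labels into {0, 1}^10
    Y = np.zeros((len(labels), 10))
    Y[np.arange(len(labels)), labels.astype(int)] = 1.0
    return X, Y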

A Neural Network

source

Neural Network Model

  • A neural network takes as input \(x \in \mathbb{R}^{784 \times 1}\) and successively transforms it as follows. Below we set \(a^{0} = x\)
  • Activations of the \(j\)th neuron in the \(l\)th layer are related to the activations in the \((l - 1)\)th layer by: \[\begin{aligned} a^{l}_{j} = \sigma \left( \sum_{k} w_{jk}^{l} a^{l-1}_{k} + b_{j}^{l} \right) = \sigma (z^{l}_{j}) \end{aligned}\] \[\begin{aligned} z^{l}_{j} = \sum_{k} w_{jk}^{l} a^{l-1}_{k} + b_{j}^{l} \end{aligned}\]
  • Weight notation \(w^{l}_{jk}\) to denote weight for the connection from the \(k\)th neuron in the \((l - 1)\)th layer to the \(j\)th neuron in the \(l\)th layer
  • \(b_{j}^{l}\) for the \(j\)th bias in the \(l\)th layer

Neural Network Weights

Loss function

  • To measure whether our predictions are correct we need a loss function
  • Let \(\hat{y} \in \mathbb{R}^{10 \times 1}\) be the predictions from our neural network model based on input \(x\)
    • They are predicted probabilities of being in each of the \(10\) classes
  • We will measure how good our model is by the cross-entropy loss function:

\[\ell := \text{CE}(y, \hat{y}) := -\sum_{j=1}^{10} y_j \log \hat{y}_j\]

  • For a sample of \(n\) observations the loss is:

\[\mathcal{L} := -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{10} y_{i, j} \log \hat{y}_{i, j}\]
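
A minimal numpy sketch of this sample loss, assuming Y holds the one-hot labels and Y_hat the predicted probabilities, both of shape (n, 10); the small eps guard against log(0) is an implementation detail, not part of the formula:

import numpy as np

def cross_entropy(Y: np.ndarray, Y_hat: np.ndarray, eps: float = 1e-12) -> float:
    # Mean cross-entropy over the n samples: -(1/n) sum_i sum_j y_ij log yhat_ij
    return float(-np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1)))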

Loss function - Intuition

Again:

\[\ell = -\sum_{j=1}^{10} y_j \log \hat{y}_j\]

  • We want to minimize this function; note:
    • Only one term in the sum is non-zero, the one corresponding to the true label, \(j\), say
    • Also, \(\hat{y}_{j} \in (0, 1)\) so \(\log \hat{y}_{j} \in (-\infty, 0)\) and \(-\log \hat{y}_{j} \in (0, \infty)\)
    • To minimize \(-\log \hat{y}_{j}\) we want \(\hat{y}_{j}\) as close to \(1\) as possible; i.e. we want our model to have as high a predicted probability for the true class as possible

Neural Network Model Vectorized

  • Can write the previous equations as vectors: \[\begin{aligned} z^{l} := W^{l}a^{l-1} + b^{l}, \quad a^{l} = \sigma \left(z^{l}\right) \end{aligned}\] where \(W^{l}\) has dimension \(d_{l} \times d_{l-1}\) (dimension of layer \(l\) times that of layer \(l - 1\))
  • \(\sigma\) is the sigmoid function for the intermediate layers while it is the softmax function for the output (final) layer
  • We denote the softmax function by \(\mathrm{sm}\): \[\begin{aligned} \mathrm{sm}: \mathbb{R}^K \to \mathbb{R}^K,\qquad \mathrm{sm}(z)_i = \frac{\exp (z_{i})}{\sum_{j=1}^K \exp (z_{j})}. \end{aligned}\]
  • Transforms a given vector \(z\) into a vector of probabilities
  • Very handy when we want to classify digits (or predict the next token in a sentence, to be introduced later in the course)
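
A minimal numpy sketch of the softmax function; subtracting the maximum is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Map a vector z to a vector of probabilities that sum to 1
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))   # approx. [0.090, 0.245, 0.665]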

Neural Network Model - Forward pass

  • In the simplest example in Nielsen’s code he uses the layer sizes \((784, 30, 10)\), leading to the weight matrices: \[\begin{aligned} W^{1} \in \mathbb{R}^{30 \times 784}, \ W^{2} \in \mathbb{R}^{10 \times 30}, \ b^{1} \in \mathbb{R}^{30 \times 1}, \ b^{2} \in \mathbb{R}^{10 \times 1}. \end{aligned}\]
  • A given input vector \(x \in \mathbb{R}^{784 \times 1}\) (with one-hot encoded label \(y\)) is passed through the network as: \[\begin{aligned} & \mathbb{R}^{30 \times 1} \ni a^{1} = \sigma(z^{1}) = \sigma(W^{1}x + b^{1}) \\ & \mathbb{R}^{10 \times 1} \ni \hat{y} := a^{2} = \mathrm{sm}(z^{2}) = \mathrm{sm}(W^{2}a^{1} + b^{2}) \end{aligned}\] and then compute the loss as: \[\begin{aligned} \ell := \text{CE}(y, \hat{y}) := -\sum_{j=1}^{10} y_j \log \hat{y}_j \end{aligned}\]
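
A minimal numpy sketch of this forward pass; it mirrors the equations above but is not Nielsen’s actual implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / np.sum(e)

def forward(x, W1, b1, W2, b2):
    # Shapes: x (784, 1), W1 (30, 784), b1 (30, 1), W2 (10, 30), b2 (10, 1)
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)            # (30, 1) hidden activations
    z2 = W2 @ a1 + b2
    y_hat = softmax(z2)         # (10, 1) predicted class probabilities
    return z1, a1, z2, y_hat

With a one-hot label y of shape (10, 1), the loss for this single example is then -float(np.sum(y * np.log(y_hat))).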

Neural Network Model - Code example

  • go through function example_lecture in scripts/estimate_network_mn_rw.py until backpropagation part

Updating the parameters

  • We update the parameters using stochastic gradient descent (SGD)

  • To perform SGD we need to compute the derivatives of the loss function wrt. our parameters: \[ \frac{\partial \ell}{\partial \theta}, \quad \theta = (W^{1}, W^{2}, b^{1}, b^{2}) \]

  • Spelling out the loss function for our two-layer network:

\[\begin{aligned} \ell := \text{CE}(y, \hat{y}) := -\sum_{j=1}^{10} y_j \log \hat{y}_j = -\sum_{j=1}^{10} y_j \log \mathrm{sm}(W^{2}\sigma(W^{1}x + b^{1}) + b^{2})_{j} \end{aligned}\]

  • Want:

\[ \frac{\partial \ell}{\partial W^{1}}, \ \frac{\partial \ell}{\partial b^{1}}, \ \frac{\partial \ell}{\partial W^{2}}, \ \frac{\partial \ell}{\partial b^{2}} \]

Updating the parameters

  • SGD updates as:

\[ \theta \leftarrow \theta - \eta \frac{\partial \ell}{\partial \theta} \] for \(\eta\) the learning rate; step in opposite direction of steepest ascent

  • But that doesn’t answer the question of how to compute the derivatives
    • Backpropagation does that!

Backpropagation - Equations

Backpropagation

  • Do I have to understand backpropagation?
    • Yes, says Andrey Karpathy (superstar AI researcher)

Backpropagation - example

  • It can be shown that the error of the output layer with the cross-entropy loss function equals: \[\begin{aligned} \delta^{l_{2}} = \frac{\partial \ell}{\partial z^{l_{2}}} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{l_{2}}} = \hat{y} - y. \end{aligned}\]
    • \(\frac{\partial \ell}{\partial \hat{y}}\) the derivative of cross-entropy loss wrt. predicted probabilities
    • \(\frac{\partial \hat{y}}{\partial z^{l_{2}}}\) the derivative of softmax function wrt. input vector
  • Then we compute the derivatives: \[\begin{aligned} \frac{\partial \ell}{\partial b^{l_{2}}} = \delta^{l_{2}}, \quad \frac{\partial \ell}{\partial W^{l_{2}}} = \delta^{l_{2}}(a^{l_{1}})^{T} \end{aligned}\]
  • Note that \(a^{l_{1}} \in \mathbb{R}^{30 \times 1}\), \(\delta^{l_{2}} \in \mathbb{R}^{10 \times 1}\), and we want to update \(W^{2} \in \mathbb{R}^{10 \times 30}\); hence we should compute the outer product of \(\delta^{l_{2}}\) and \(a^{l_{1}}\) yielding the expression above.

Backpropagation - example continued

  • Next, we backpropagate the errors to the first layer: \[\begin{aligned} \delta^{l_{1}} = [(W^{l_{2}})^{T} \delta^{l_{2}}] \odot \sigma'(z^{l_{1}}) = [(W^{l_{2}})^{T} \delta^{l_{2}}] \odot (\sigma(z^{l_{1}}) \odot [\iota - \sigma(z^{l_{1}})]) \end{aligned}\] where \(\iota\) is a vector of ones and we used the derivative of the sigmoid function (e.g. equation (4) last week); we then update the weights in the first layer with the derivatives: \[\begin{aligned} \frac{\partial \ell}{\partial b^{l_{1}}} = \delta^{l_{1}}, \quad \frac{\partial \ell}{\partial W^{l_{1}}} = \delta^{l_{1}}(a^{l_{0}})^{T} \end{aligned}\] where we compute the outer product of \(\delta^{l_{1}} \in \mathbb{R}^{30 \times 1}\) and \(a^{l_{0}} \in \mathbb{R}^{784 \times 1}\) to update \(W^{1} \in \mathbb{R}^{30 \times 784}\).

  • Update using just computed derivatives and rule \(\theta \leftarrow \theta - \eta \frac{\partial \ell}{\partial \theta}\)
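
A minimal numpy sketch of these backpropagation equations and the update rule for the two-layer network; it follows the slides above and is not the code in scripts/estimate_network_mn.py:

def backprop_step(x, y, a1, y_hat, W1, b1, W2, b2, eta=0.1):
    # x (784, 1), y and y_hat (10, 1), a1 (30, 1), all from a forward pass
    delta2 = y_hat - y                            # output-layer error (10, 1)
    dW2 = delta2 @ a1.T                           # outer product, (10, 30)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (a1 * (1 - a1))    # backpropagate through the sigmoid, (30, 1)
    dW1 = delta1 @ x.T                            # outer product, (30, 784)
    db1 = delta1
    # SGD update: theta <- theta - eta * dl/dtheta
    return W1 - eta * dW1, b1 - eta * db1, W2 - eta * dW2, b2 - eta * db2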

Backpropagation - code example

  • go through backprop in function example_lecture in scripts/estimate_network_mn.py
  • see code here
  • torch example: automatic differentiation
    • scripts/estimate_network_torch.py
    • you will implement this yourself in the exercise class

Extensions

  • Nielsen’s implementation passes single vectors through the network one at a time
    • Inefficient
    • Can pass \((B, 784)\) matrices of batches through the network and backprop at once
  • Regularization
    • Dropout
    • Weight decay
  • Weight initialization
  • More…
  • We’ll probably return to it

Wrap up

  • Implementation details are left for the exercise classes