Exercise Class 9

Author

Jonas Skjold Raaschou-Pedersen

Published

November 6, 2025

Recall last week’s exercise class. This week we construct a variation of the contraction curve from a simplified simulated dataset akin to the one in Kleinberg et al. (2018). The simulated data is stored in the CSV file crime.csv, which can be downloaded here or from Absalon here.


Printing out an overview of the data shows:

import polars as pl


df = pl.read_csv("data/crime.csv")
print(df)
shape: (50_000, 6)
┌───────────┬───────────────┬─────┬─────────┬──────┬───────┐
│ PriorFTAs ┆ FelonyArrests ┆ Age ┆ Release ┆ FTA  ┆ ID    │
│ ---       ┆ ---           ┆ --- ┆ ---     ┆ ---  ┆ ---   │
│ i64       ┆ i64           ┆ i64 ┆ i64     ┆ i8   ┆ u32   │
╞═══════════╪═══════════════╪═════╪═════════╪══════╪═══════╡
│ 1         ┆ 3             ┆ 21  ┆ 0       ┆ null ┆ 0     │
│ 1         ┆ 3             ┆ 20  ┆ 1       ┆ 0    ┆ 1     │
│ 3         ┆ 1             ┆ 23  ┆ 1       ┆ 0    ┆ 2     │
│ 3         ┆ 2             ┆ 18  ┆ 1       ┆ 0    ┆ 3     │
│ 3         ┆ 3             ┆ 19  ┆ 0       ┆ null ┆ 4     │
│ …         ┆ …             ┆ …   ┆ …       ┆ …    ┆ …     │
│ 4         ┆ 3             ┆ 33  ┆ 0       ┆ null ┆ 49995 │
│ 5         ┆ 3             ┆ 35  ┆ 1       ┆ 0    ┆ 49996 │
│ 2         ┆ 3             ┆ 34  ┆ 1       ┆ 0    ┆ 49997 │
│ 1         ┆ 3             ┆ 38  ┆ 0       ┆ null ┆ 49998 │
│ 4         ┆ 3             ┆ 46  ┆ 1       ┆ 0    ┆ 49999 │
└───────────┴───────────────┴─────┴─────────┴──────┴───────┘

The data consists of \(50,000\) rows and \(6\) columns:

  1. PriorFTAs: Number of prior failures to appear (FTA) for the individual.
  2. FelonyArrests: Number of prior felony arrests.
  3. Age: Age of the individual (in years).
  4. Release: Indicator for whether the individual was released pretrial (1) or detained (0).
  5. FTA: Outcome variable — whether the individual failed to appear in court (1 = failed to appear, 0 = appeared, null = missing).
  6. ID: Unique identifier for each individual (ranging from 0 to 49,999).

Exercise 1 - Contraction curve

  1. Download the data and place it in a folder in your project directory, e.g. data/. The data can then be loaded with polars using the snippet above.

    Tip: the data can be downloaded directly to the data/ folder using wget as follows:

    wget -P data/ https://gist.githubusercontent.com/jsr-p/db713b7779f30e205ea93ac93e542a9d/raw/2661cd31ac85ffdae392e2d54483c49eb88ee047/crime.csv

    which you can copy, paste and run in your shell.

  2. Split the data into train and test sets using train_test_split. Set test_size=0.2 and random_state=42.

  3. Estimate a simple logistic regression model with sklearn, i.e. use model = LogisticRegression(). The features should be ["PriorFTAs", "FelonyArrests", "Age"] and the outcome "FTA". Before estimating the model, create a boolean array and use it to keep only the rows of the training data where "FTA" is not null.

  4. Compute the risk score, defined as the predicted probability that FTA equals 1, using the predict_proba method of the estimated model object.

  5. Assign the predicted risk score to a dataframe consisting of all the individuals in the test set.

  6. Subset the above-mentioned dataframe to only include individuals where “FTA” is not null. Then partition the individuals into 10 quantile bins based on the assigned risk score. Name the quantile groups Q1, Q2, ..., Q10, where Q1 holds the 10% of individuals with the lowest risk, and so on. Name the new column holding the quantile groups bin.

    Tip:
    • Use the qcut method from polars to partition the individuals into quantile bins. The labels can be set with the labels argument.
    • If you use polars (you should!) cast the bin column to an enum such that you can sort by the quantile group labels; this will become useful in the following.
  7. Starting from the highest quantile group (i.e. with highest risk of committing a crime), loop through each quantile group and select all individuals in the given quantile group or higher.

    Then do the following:

    • Based on the model prediction we jail all individuals with a risk score in group lab or higher, where lab is the loop variable from the hint below. Concretely, construct a column Jailed equal to 1 for all individuals in lab or higher; otherwise it should equal 0.
    • For those individuals with Jailed equal to 1 set the outcome FTA equal to 0 (jailed individuals cannot commit a crime).
    • Finally, compute the jail rate (average Jailed) and the crime rate (average FTA).
    • Save the results in a list

    After the loop construct a dataframe with the jail and crime rate from each iteration of the loop above.

    Hint: The loop should go as:

    for lab in ["Q10", "Q9", "Q8", "Q7", "Q6", "Q5", "Q4", "Q3", "Q2", "Q1"]:
        # ... insert code here
  8. Do the same as above but now instead of using a partition of the individuals based on the model prediction, use a random partition of the individuals. This corresponds to just selecting a random individual to jail instead of the one with the highest model risk.

    Hint: One way to do this is to use numpy.random.randint to draw a random integer from 1 to 10 for each individual. Note that the upper bound of randint is exclusive.

  9. Concatenate the dataframes from 7. and 8. into one dataframe. Make a column to distinguish between the algorithm (results from 7.) and the random decision (results from 8.).

  10. Compute the crime rate for the test set. Denote this as the baseline crime rate. Then compute the relative decrease in crime using the dataframe from 9. and the baseline crime rate. Do this for the algorithm and the random decision. Plot the curves using matplotlib. Interpret the figure.

    You should get results similar to:

Note: This is a slightly different contraction curve than the one in Kleinberg et al. (2018). However, the idea is basically the same.

Optional: Exercise 2 - Exam project workshop

Spend some time discussing your idea, or ideas, for the exam project. See the announcement on Absalon for more details. You are welcome to discuss with your fellow students and the TA.