# Practical: Decision Trees

Herman Kamper, 2020-2021. Licensed under [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/).

## Preliminaries

In [None]:
%matplotlib inline
from matplotlib.colors import ListedColormap
from scipy.spatial import distance
from sklearn import datasets, tree
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## 1. Decision tree on the Iris data set

In this section we train a decision tree on the Iris data set. We will use scikit-learn to train the model, and then visualise the decision boundary.

### 1.1. Load and visualise the data

**Question:** The code below uses the scikit-learn `datasets` module to load two features from the Iris data set, specifically the petal length and petal width. The class labels are also loaded. Produce a scatter plot with the petal length on the $x$-axis and the petal width on the $y$-axis, with each class shown in a different colour. A legend should indicate which colour corresponds to which class.

In [None]:
# Load data
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

In [None]:

# Answer: Add code and new cells here
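
One possible way to produce such a plot is sketched below. It is not the only valid answer: it assumes one `scatter` call per class, so that matplotlib assigns each class its own default colour and legend entry. The axis labels and the use of `iris["target_names"]` for the legend are choices made for this sketch.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the same two features as in the cell above
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

fig, ax = plt.subplots()

# One scatter call per class: each call gets the next default colour
# and its own legend entry
for label, name in enumerate(iris["target_names"]):
    ax.scatter(X[y == label, 0], X[y == label, 1], label=name)

ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
ax.legend()
```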


### 1.2. Train a decision tree on the two features from the Iris data set

**Questions:**
- Use the scikit-learn `DecisionTreeClassifier` in `sklearn.tree` to fit a decision tree to the data.
- Read through the parameters of `DecisionTreeClassifier` and see whether you can map each of them to what you learned about the way in which decision trees are grown.

In [None]:

# Answer: Add code and new cells here
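
A minimal sketch of fitting the classifier is given below. The values `criterion="gini"`, `max_depth=2` and `random_state=0` are illustrative assumptions for this sketch, not settings prescribed by the practical. They are spelled out here only because `criterion` selects the impurity measure used to score splits and `max_depth` limits how far the tree is grown.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

# criterion: impurity measure used to choose splits
# max_depth: stop growing the tree beyond this depth (illustrative value)
# random_state: fixes tie-breaking between equally good splits
clf = tree.DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y)

# Inspect the size of the fitted tree and classify two example flowers
print("Depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())
print("Predictions:", clf.predict([[1.5, 0.3], [5.5, 2.0]]))
```

Try changing `max_depth` or setting `min_samples_leaf` to see how the stopping criteria change the depth and the number of leaves.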


### 1.3. Visualise the decision boundary

**Questions:**
- Visualise the decision boundaries by completing the code below. Specifically, you need to add the line that makes predictions on a grid covering the input feature space. The model's decision boundary is then visualised by plotting its prediction at each of the grid points.
- On the same figure, also plot the data as you did in Section 1.1.

In [None]:
# Question: Complete the code below

# Make grid
N = 150
grid_x1 = np.linspace(np.min(X[:, 0]) - 0.1, np.max(X[:, 0]) + 0.1, N)
grid_x2 = np.linspace(np.min(X[:, 1]) - 0.1, np.max(X[:, 1]) + 0.1, N)
grid_x1, grid_x2 = np.meshgrid(grid_x1, grid_x2)
X_grid = np.c_[grid_x1.ravel(), grid_x2.ravel()]

# Answer: Replace the line below to give the correct output from the model
random_predictions = np.random.randint(max(y) + 1, size=X_grid.shape[0])
grid_predictions = random_predictions.reshape(grid_x1.shape)

# Plot the decision boundary
plt.contourf(grid_x1, grid_x2, grid_predictions, cmap=ListedColormap(["C0", "C1", "C2"]))
plt.xlim([np.min(X[:, 0]) - 0.1, np.max(X[:, 0]) + 0.1])
plt.ylim([np.min(X[:, 1]) - 0.1, np.max(X[:, 1]) + 0.1])

# Answer: Add the code to also plot the data
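
The grid prediction step could look roughly like the sketch below. It assumes a classifier named `clf` trained on the two petal features; the name and the `max_depth=2` setting are assumptions of this sketch. The key idea is that `predict` expects one row per point, so the flattened grid is passed in and the output is reshaped back to the grid layout that `contourf` needs.

```python
import numpy as np
from sklearn import datasets, tree

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

# Hypothetical model for this sketch; any fitted classifier would do
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Same grid construction as in the cell above
N = 150
grid_x1 = np.linspace(np.min(X[:, 0]) - 0.1, np.max(X[:, 0]) + 0.1, N)
grid_x2 = np.linspace(np.min(X[:, 1]) - 0.1, np.max(X[:, 1]) + 0.1, N)
grid_x1, grid_x2 = np.meshgrid(grid_x1, grid_x2)
X_grid = np.c_[grid_x1.ravel(), grid_x2.ravel()]  # shape (N*N, 2)

# Predict a class for every grid point, then restore the 2-D grid layout
grid_predictions = clf.predict(X_grid).reshape(grid_x1.shape)
```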

**Questions:**
- Use the `plot_tree` function in `sklearn.tree` to visualise the resulting decision tree. You can use `plt.savefig("tree.pdf")` to export the tree to a PDF, where it might be easier to look at the structure.
- Repeat the steps in Sections 1.1 and 1.2, but instead of training a decision tree on only two features, now train it on all four of the Iris data set features. How does it differ from the tree trained on only two features?

In [None]:

# Answer: Add code and new cells here
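
As a rough sketch, a tree trained on all four features can be drawn with `plot_tree` as shown below. The variable name `clf_all`, the figure size and `filled=True` are choices made for this sketch, and the tree is grown without a depth limit, which is an assumption rather than a requirement of the practical.

```python
import matplotlib.pyplot as plt
from sklearn import datasets, tree

iris = datasets.load_iris()

# Train on all four features this time (no depth limit in this sketch)
clf_all = tree.DecisionTreeClassifier(random_state=0)
clf_all.fit(iris["data"], iris["target"])

# Draw the tree; feature and class names make the nodes easier to read
fig, ax = plt.subplots(figsize=(12, 8))
annotations = tree.plot_tree(
    clf_all,
    feature_names=iris["feature_names"],
    class_names=list(iris["target_names"]),
    filled=True,  # colour each node by its majority class
    ax=ax,
)
```

Calling `fig.savefig("tree.pdf")` afterwards exports the figure so that the node text can be read at higher zoom.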


## 2. Decision tree on heart disease data set

The heart disease data set (available [here](http://faculty.marshall.usc.edu/gareth-james/ISL/data.html)) contains the records of patients who presented with chest pain and were subsequently diagnosed as having or not having heart disease.

In this section, you will train a decision tree to classify whether an unseen (test) patient has heart disease or not.

### 2.1. Load the data using the pandas data analysis library

[Pandas](https://pandas.pydata.org/) is a useful Python library to deal with and visualise tabular data. The code below loads the heart disease data from a CSV file into a pandas `DataFrame`.

**Questions:**
- Read through the pandas documentation to make sure you understand each of the steps.
- Split the data into training and test sets, using 20 of the patients as test patients. (Should you select the last 20 patients as test subjects? Or do you select randomly? Does it matter?)

In [None]:
# Download the data if you did not already get it together with the notebook
import urllib.request
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/kamperh/data414/main/practicals/decision_trees/heart.csv",
    "heart.csv"
)

In [None]:
# Load data
df = pd.read_csv("heart.csv")
df = df.drop("Unnamed: 0", axis=1)
df = df.dropna()

# Convert discrete features to integers
df.ChestPain = pd.factorize(df.ChestPain)[0]
df.Thal = pd.factorize(df.Thal)[0]

# Construct input design matrix X and output vector y
X = df.drop("AHD", axis=1)
y = pd.factorize(df.AHD)[0]

In [None]:
# Show the DataFrame
X

In [None]:

# Answer: Add code and new cells here
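
One way to hold out 20 patients is scikit-learn's `train_test_split`, sketched below on a small made-up `DataFrame` so that the example runs without `heart.csv`. The stand-in data, the column names and `random_state=0` are assumptions of this sketch. Passing an integer as `test_size` requests that exact number of test items, `shuffle=True` selects them randomly rather than taking the last rows, and `stratify=y` keeps the class proportions similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the heart disease data (random values)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(30, 80, size=100),
    "Chol": rng.integers(150, 350, size=100),
})
y = rng.integers(0, 2, size=100)

# Randomly hold out exactly 20 patients for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, shuffle=True, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```

With the real data, the same call can be applied directly to the `X` and `y` constructed in the loading cell.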


### 2.2. Train and evaluate a decision tree

**Questions:**
- Use scikit-learn to train a decision tree on the training data.
- Visualise the tree.
- Evaluate the model by calculating the recall, precision and accuracy on the test data.

In [None]:

# Answer: Add code and new cells here
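
The evaluation step might be sketched as follows. To stay runnable without `heart.csv`, the sketch uses synthetic data from `make_classification` as a stand-in; the data, `max_depth=3` and `random_state=0` are all assumptions of this sketch. It treats class 1 as the positive class, which is the default in `precision_score` and `recall_score`.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (binary labels)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0
)

# Train on the training set only, then predict the held-out patients
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix entries: true/false negatives and positives
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = accuracy_score(y_test, y_pred)    # (tp + tn) / total
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
recall = recall_score(y_test, y_pred)        # tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}, precision: {precision:.2f}, recall: {recall:.2f}")
```

With only 20 test patients, each misclassification changes the accuracy by 5 percentage points, so the estimates are coarse.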
