Resampling Methods

Olatomiwa Bifarin.
PhD Candidate Biochemistry and Molecular Biology
@ The University of Georgia

This is a draft copy, a work in progress

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import style
#For Seaborn plots
#import seaborn as sns; sns.set(style='white')

# Sharper, more legible graphics
%config InlineBackend.figure_format = 'retina'

# Set seaborn figure labels to 'talk', to be more visible. 
#sns.set_context('talk', font_scale=0.8)

Resampling methods are techniques for repeatedly drawing samples from a dataset and refitting a model on each sample, so that we can estimate how well the model generalizes. In my last notebook-blog, I hinted at an analogy between a 12-year-old girl studying for an exam and our machine trying to learn. If you are the instructor setting the exam questions, you will agree with me that giving exactly the same examples from the class notes isn't a bright idea. No? What you want is to find out whether our dear student has learnt the material, so you give her new problems to solve. In the same vein, you want to evaluate your machine learning models on data they have not seen, and the various ways of accomplishing this are called resampling methods.

1. Validation Set Method

In this method, the samples are randomly divided into a training set and a test (or validation) set at a ratio of your choosing, as demonstrated below.

In [2]:
# Let's import the breast cancer Wisconsin dataset [link]
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data["data"]
y = data["target"]
feature_names = data["feature_names"]

df = pd.DataFrame(data=X, columns=feature_names)
df.head()
Out[2]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

Using train_test_split from scikit-learn's model_selection module:

In [3]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)
Training Features Shape: (455, 30)
Training Labels Shape: (455,)
Testing Features Shape: (114, 30)
Testing Labels Shape: (114,)
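
To see the validation set method in action, we can fit a model on the training split and estimate its test error on the held-out split. This is only a minimal sketch; the choice of logistic regression as the estimator is my own assumption, not part of the original notebook.

In [ ]:
# Sketch: fit a classifier on the training split and score it on the test split
# (logistic regression is just an illustrative choice of model)
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=5000)  # large max_iter so the solver converges
clf.fit(X_train, y_train)
print('Validation set accuracy: %.3f' % clf.score(X_test, y_test))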

2. k-Fold Cross Validation Method

A CV method in which the dataset is divided into k folds. One of the k folds is used as the test set while the remaining k - 1 folds are used as the training set. The entire process is repeated k times, with a different fold (chunk) of the dataset used as the test set each time.

k = 5 and k = 10 are popular choices.

In [4]:
# Lets take the first 20 samples of the breast cancer data set 
# to demonstrate Kfold methods
from sklearn.model_selection import KFold

# Create a KFold splitter; shuffle is off by default, so the folds are taken
# in order (pass shuffle=True with a random_state to randomize the split)
kf = KFold(n_splits=5)
for train, test in kf.split(X[:20]):
    print('train: %s, test: %s' % (train, test))
train: [ 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19], test: [0 1 2 3]
train: [ 0  1  2  3  8  9 10 11 12 13 14 15 16 17 18 19], test: [4 5 6 7]
train: [ 0  1  2  3  4  5  6  7 12 13 14 15 16 17 18 19], test: [ 8  9 10 11]
train: [ 0  1  2  3  4  5  6  7  8  9 10 11 16 17 18 19], test: [12 13 14 15]
train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15], test: [16 17 18 19]
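
In practice, you rarely loop over the folds by hand; scikit-learn's cross_val_score runs the whole k-fold procedure and returns one score per fold. The sketch below assumes logistic regression as the estimator, purely for illustration.

In [ ]:
# Sketch: 5-fold cross-validation over the full dataset with cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print('Fold accuracies:', np.round(scores, 3))
print('Mean CV accuracy: %.3f' % scores.mean())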

In the case of an imbalanced dataset, stratified k-fold is recommended: each fold is constructed so that it contains approximately the same percentage of each class as the complete dataset, as sketched below.
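
Here is a minimal sketch of stratified splitting with scikit-learn's StratifiedKFold; printing the class counts in each test fold shows that the class proportions stay roughly constant across folds.

In [ ]:
# Sketch: stratified 5-fold splitting preserves class proportions in each fold
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train, test in skf.split(X, y):
    # counts of class 0 and class 1 in this test fold
    print('test fold class counts:', np.bincount(y[test]))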

3. Leave One Out Cross Validation Method

Leave one out cross validation (LOOCV) is a special case of the k-fold CV method where k = n, the number of observations, so each test fold contains a single sample.

In [5]:
# Let's take the first 10 samples of the breast cancer data set to demonstrate LOOCV
from sklearn.model_selection import LeaveOneOut

LOOCV = LeaveOneOut()
for train, test in LOOCV.split(X[:10]):
    print("%s %s" % (train, test))
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
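
To actually estimate test error with LOOCV, the splitter can be passed to cross_val_score, which fits the model n times (once per held-out sample). Again, logistic regression is only an illustrative choice, and with n fits this can be slow on large datasets.

In [ ]:
# Sketch: LOOCV error estimate via cross_val_score (n model fits, so it can be slow)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

loo_scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=LeaveOneOut())
print('LOOCV accuracy: %.3f' % loo_scores.mean())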

4. Bias-Variance Tradeoff

The bias-variance tradeoff also influences the choice of cross-validation method. On the bias side, the validation set method has the highest bias: because it trains on a smaller fraction of the dataset, it tends to overestimate the test error. At the other extreme, the LOOCV method has the lowest bias of the three, since each model is trained on n - 1 samples. The reverse is the case for variance. Here, variance refers to how much the estimate of the test error would change if the procedure were repeated with different splits of the data. Variance is highest with the LOOCV method, followed by the k-fold method, and lowest with the validation set method. Given this tradeoff, k-fold CV with k = 5 or k = 10 is a popular compromise, as illustrated in the sketch below.
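
As a rough illustration of the tradeoff, we can compare the mean and spread of the fold scores for different values of k. This is a small experiment of my own, not from the original notebook, and again assumes logistic regression as the estimator.

In [ ]:
# Sketch: compare mean and spread of fold scores for k = 5 and k = 10
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for k in (5, 10):
    scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=k)
    print('k = %2d  mean accuracy = %.3f  std = %.3f' % (k, scores.mean(), scores.std()))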

References and Resources

  • Introduction to Statistical Learning, Chapter 5: Resampling Methods
  • Cross-validation: evaluating estimator performance, scikit-learn documentation
  • Wikipedia, Resampling (statistics)