Supervised Learning Basics

Now that we’ve surveyed the different unsupervised learning algorithms, let’s move on to supervised learning:

Supervised learning in a nutshell (with scikit-learn):

First, the available data needs to be split into a training and a test set, where we assign the majority of the data points to the training set and the rest to the test set. (Please note that, as an exception, in this graphic, \(X\) and \(\mathbf{y}\) are rotated by 90 degrees, i.e., the features are in the rows and the data points are in the columns.) Next, we need to decide on the type of model that we want to use to describe the relationship between the inputs \(X\) and the outputs \(\mathbf{y}\); this model also comes with some hyperparameters that we need to set (which are passed as arguments when instantiating the respective sklearn class). Then we can call the .fit(X_train, y_train) method on the model to learn the internal model parameters by minimizing some model-specific objective function on the training data. Now the model is ready to generate predictions for new data points, i.e., by calling .predict(X_test), we obtain the predicted values \(\mathbf{\hat{y}}\) for the test points. Finally, to get an estimate of how useful the model will be in practice, we evaluate it by comparing the predicted target values on the test set to the corresponding true labels.
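
Expressed in code, this workflow could look as follows (a minimal sketch with made-up data and a random forest as a stand-in for whatever model we choose):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# made-up dataset: 500 data points with 5 features each
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# split the data into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# instantiate the model with its hyperparameters & learn its internal parameters on the training data
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# generate predictions for the test points and compare them to the true labels
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))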

In the following sections, we introduce the different approaches to supervised learning and explain when to use which kind of model, then discuss how to evaluate and select the hyperparameters of a supervised learning model.

Different types of models

The most important task of a data scientist is to select an appropriate model (and its hyperparameters) for solving a problem.

Three considerations when choosing a supervised learning model

  1. Problem type:
    What kind of problem are you trying to solve: regression or classification?
    → Depends on the type of target variable, i.e., if it has continuous or discrete values.

  2. Problem complexity:
    How complicated is the relationship between the input features and target variable: linear or nonlinear?
    → Depends on the available data, i.e., how easily the target can be predicted from the inputs.

  3. Algorithmic approach:
    Which type of model works best for this dataset size & complexity: features-based or similarity-based?
    → Depends on the model you choose, i.e., it either learns according to the first or second strategy.

Problem type: regression vs. classification

The type of the target variable that we want to predict determines whether we are dealing with a regression or classification problem.

Regression:
Prediction of continuous value(s) (e.g., price, number of users, etc.).

Classification:
Prediction of discrete values:

  • binary (e.g., product will be faulty: yes/no)
  • multi-class (e.g., picture displays cat/dog/house/car/…)
  • multi-label (e.g., picture may display multiple objects)

→ Many classification models actually predict probabilities for the different classes, i.e., a score between 0 and 1 for each class. The final class label is then chosen by applying a threshold on this score (typically 0.5 for binary classification problems) or by taking the outcome with the highest score (in multi-class problems).

⇒ Whether we are dealing with a regression or classification problem is important to know and has implications for our overall workflow, e.g., how we define & measure success. However, the actual models that we use to solve these problems are very similar, e.g., almost all sklearn models exist in either a Regressor or Classifier variant to generate the appropriate output for the respective problem type.

Note

In some cases it is even entirely up to us whether to frame the task as a regression or classification problem.

Example: A product is deemed faulty if it breaks within the warranty period of 6 months, where we assume that the product would normally be used at most 300 times during these 6 months.

Depending on the use case, we might either be interested in how long a product lasts in total, or only whether this product will be a warranty case or not, i.e., we could formulate the problem as:

  • A regression task: Predict after how many uses the product will break.
  • A classification task: Predict whether the product will break before it has been used 300 times.

However, in many cases with a continuous target variable, we can learn a more accurate prediction model with the regression approach, since here the labels carry more information and edge case errors are not penalized as much. For example, in the classification case, predicting that a product is okay when it lasted 299 uses would be just as wrong as predicting ‘okay’ for a product that lasted only 2 uses.

If the workflow in which our model will later be embedded requires a classification output, we can still transform the regression output of the model into a classification output by simply setting a threshold, i.e., in this case, if the regression model predicts a value lower than 300, we would output ‘faulty’ and otherwise ‘okay’.
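
For illustration, a tiny sketch with hypothetical regression predictions:

import numpy as np

# hypothetical regression outputs: predicted number of uses until each product breaks
predicted_uses = np.array([250.0, 299.0, 480.0])

# derive the classification output by thresholding at 300 uses
predicted_faulty = predicted_uses < 300
print(predicted_faulty)  # [ True  True False] -> 'faulty', 'faulty', 'okay'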

Problem complexity: linear or nonlinear

In accordance with the product warranty example described above, we now illustrate what it means for a problem to be linear or nonlinear on a small toy dataset:

The first input feature we consider is the production temperature, which is an independent variable, since the operator in control of the production process is free to just choose some value here. The target variable that we want to predict is the lifetime of the produced product and is dependent on this input variable. For the different temperature settings that were tested, we have collected the respective lifetime measurements (and, as you can imagine, getting such labels can be quite expensive, since these products not only needed to be produced, but also go through some stress test to determine after how many uses they break).

As you might have already guessed, the relationship between the target variable lifetime and the input feature temperature (top left) is a linear one, i.e., we could describe it by a straight line. Since the lifetime is a continuous variable, this is a regression problem. However, by defining some cutoff on the target variable (e.g., at 300 uses, to distinguish warranty cases from products that outlast the warranty), this problem can be transformed into a classification problem, where the task is to distinguish between faulty and good products (bottom left). Again, the relationship between the target and the input variable temperature is linear, since the problem can be solved by drawing one line parallel to the y-axis, which defines a threshold on the temperature: all data points with a temperature lower than this threshold would be classified as ‘ok’, while products produced at a higher temperature would be predicted as ‘broken’.

In the center panel, we have the same situation, only this time with a different input feature, pressure. In the regression case, the lifetime values now follow a quadratic curve, i.e., we could not adequately describe the relationship between our target variable and the input feature ‘pressure’ with a single straight line, which makes this a nonlinear problem. Similarly, in the classification case below, we would need to define two thresholds on the pressure values to separate the two classes, i.e., again we could not solve the problem with a single straight line.

While these four examples were univariate problems, i.e., with a single input variable, in machine learning we usually deal with multivariate problems, often with hundreds or even thousands of input features. For the case of two input variables, this is illustrated on the right, where the data is now shown in a scatter plot whose axes correspond to the features temperature and pressure, while the color of the dots indicates the target value (either on a continuous scale or as discrete class labels). In the multidimensional case, a problem would be considered linear if it can be solved by a single hyperplane (instead of a line) in the input feature space.

As illustrated in the above examples, whether a problem can be solved by a simple linear model (i.e., a single straight line or hyperplane) or requires a more complex nonlinear model to adequately describe the relationship between the input features and target variable entirely depends on the given data.
This also means that sometimes we can just install an additional sensor to measure a feature that is linearly related to the target variable, or do some feature engineering, and then get satisfactory results with a linear model, i.e., with the right preprocessing, a nonlinear problem can sometimes be transformed into a linear one.
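
For example (a small sketch with made-up data), adding the squared pressure as an engineered feature lets a linear model capture the quadratic relationship from the example above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# made-up data with a quadratic relationship between pressure and lifetime
rng = np.random.default_rng(0)
pressure = rng.uniform(0, 10, size=(100, 1))
lifetime = 500 - 20 * (pressure[:, 0] - 5) ** 2 + rng.normal(scale=5, size=100)

# engineer an additional squared feature -> the problem becomes linear in [pressure, pressure^2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pressure)
print(LinearRegression().fit(X_poly, lifetime).score(X_poly, lifetime))  # R^2 close to 1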

Algorithmic approaches: features-based vs. similarity-based models

Finally, let’s look at how the different models work and arrive at their predictions. This is what really distinguishes the various algorithms: we have already established that there always exists a regression and a classification variant of each model, and that some models are inherently expressive enough to describe nonlinear relationships in the data, while others will only yield satisfactory results if there exists a linear relationship between the available input features and the target variable.

Features-based models learn some parameters or rules that are applied directly to a new data point’s input feature vector \(\mathbf{x} \in \mathbb{R}^d\). Similarity-based models, on the other hand, first compute a vector \(\mathbf{s} \in \mathbb{R}^n\) with the similarities of the new sample to the training data points and the model then operates on this vector instead of the original input features.

Left: Features-based models are described by a parametrized function that operates on the original input features. For example, in the regression case (top), the predicted target for a new data point is the value on the line corresponding to its input features, while in the classification case (bottom), the prediction is made according to on which side of the dividing line the inputs lie. Right: The similarity-based models make the prediction for a new sample based on this data point’s nearest neighbors. Depending on the type of model, this could be as simple as predicting the average target value of the nearest neighbors (regression, top) or their most frequent class (classification, bottom).
Important

This distinction between algorithmic approaches is not only interesting from a theoretical point of view, but even more so from a practitioner’s perspective: When using a similarity-based algorithm, we have to be deliberate about which features to include when computing the similarities, make sure that these features are appropriately scaled, and in general think about which similarity measure is appropriate for this data. For example, we could capture domain knowledge by using a custom similarity function specifically tailored to the problem. When using a features-based model, on the other hand, the model itself can learn which features are most predictive by assigning individual weights to each input feature and therefore possibly ignore irrelevant features or account for variations in heterogeneous data. But of course, subject matter expertise is still beneficial here, as it can, for example, guide us when engineering additional, more informative input features.

Okay, now, when should we use which approach?

Features-based models:

  • Number of features should be less than the number of samples!
  • Good for heterogeneous data due to individual feature weights (although scaling is usually still a good idea).
  • Easier to interpret (since they describe a direct relationship between input features & target).

Similarity-based models:

  • Nonlinear models for small datasets.
  • Need appropriate similarity function → domain knowledge! (especially for heterogeneous data)

This is an overview of the different features-based and similarity-based models we’ll discuss in the following chapters, as well as for which dataset size and problem complexity they are appropriate: While linear models are rather simple and – as the name suggests – only yield good results for linear problems, they are very efficient and can be applied to large datasets. Decision trees are very efficient as well, but can also capture more complex relationships between inputs and output. Similarity-based models like k-nearest neighbors (kNN) or their more sophisticated cousin kernel methods, on the other hand, can be used to solve very complex nonlinear problems, however, as they require the computation of a similarity matrix that scales with the number of training points, they should only be applied to comparatively small datasets (~a few thousand points). Finally, neural networks are the most sophisticated model class and can be used to solve extremely complex problems, but they are also rather data hungry (depending on the size of the network).
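
To make the distinction concrete, here is a small sketch (with made-up data) that fits one features-based and one similarity-based regression model; note the scaling step in the kNN pipeline, as discussed above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up data: 300 samples with 2 features (think temperature & pressure)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# features-based: learns one weight per input feature
lin_reg = LinearRegression().fit(X_train, y_train)

# similarity-based: predicts the average target value of the k nearest training points;
# scaling the features first ensures they contribute equally to the distance computation
knn_reg = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_train, y_train)

print(lin_reg.score(X_test, y_test), knn_reg.score(X_test, y_test))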

Model Evaluation

Since in supervised learning problems we know the ground truth, we can objectively evaluate different models and benchmark them against each other.

Is this a good model for the task?

I.e., does the model generate reliable predictions for new data points?

  • Split the data into training and test sets to be able to get a reliable estimate of how the model will later perform when applied to new data points that it wasn’t trained on.

  • Quantify the quality of the model’s predictions on the test set with a suitable evaluation metric (depending on the problem type).
    Evaluation metrics:

    • Regression: mean absolute error, mean squared error, \(R^2\)
    • Classification: (balanced) accuracy
    • Ranking: precision/recall (→ F1 score), hits@k

    ⇒ Are some mistakes worse than others (e.g., consider false positives vs. false negatives in medical tests)?
    ⇒ Always choose a single metric/KPI to optimize (possibly with additional constraints like runtime).

  • Compare the model to a ‘stupid baseline’ that predicts the mean (→ regression) or most frequent class (→ classification).
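
Such a baseline is readily available in sklearn as a DummyClassifier (or DummyRegressor for regression); a minimal sketch with made-up data:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier  # for regression: DummyRegressor(strategy="mean")
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# made-up classification data
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'stupid baseline': always predict the most frequent class from the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))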

Evaluation Metrics

We start with three evaluation metrics for regression problems: the mean absolute error, mean squared error, and \(R^2\).

Mean Absolute Error (MAE)

This is probably the most straightforward regression error metric and additionally easy to interpret since the error is given in the same units of measurement as the target variable (e.g., if we’re predicting a price in euros, we would know exactly by how many euros the model is off on average).
from sklearn.metrics import mean_absolute_error

Mean Squared Error (MSE)

Since this regression error metric is differentiable, it is often used internally when optimizing the parameters of a model (e.g., in linear regression). When reporting the final error of a model, one often takes the square root of the result, i.e., instead reports the root mean squared error (RMSE), since this is again in the same units as the original target variable (but still less intuitive than the MAE).
from sklearn.metrics import mean_squared_error

\(R^2\)

The \(R^2\), or coefficient of determination, essentially compares the MSE of a regression model against the MSE of the ‘stupid baseline’ for the regression (i.e., predicting the mean), i.e., it normalizes the MSE by the variance of the data. In the best case, the \(R^2\) is 1, i.e., the model explains the data perfectly, and in the worst case, it can even become negative, i.e., when the model performs worse than simply predicting the mean.
from sklearn.metrics import r2_score
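
A minimal sketch with made-up true and predicted values, which also shows how the \(R^2\) relates to the MSE of the mean-prediction baseline:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up true and predicted target values
y_true = np.array([300.0, 150.0, 420.0, 270.0])
y_pred = np.array([280.0, 180.0, 400.0, 300.0])

print(mean_absolute_error(y_true, y_pred))          # in the same units as the target
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE: again in the target's units
print(r2_score(y_true, y_pred))
# the R^2 normalizes the model's MSE by the MSE of the mean-prediction baseline:
baseline = np.full_like(y_true, y_true.mean())
print(1 - mean_squared_error(y_true, y_pred) / mean_squared_error(y_true, baseline))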

Now let’s look at evaluation metrics for classification problems.

Classification errors in detail

Most (binary) classification models predict the probability that a sample belongs to a certain class, i.e., a score between 0 and 1. By setting a threshold on this probability (typically at 0.5), the prediction score can be transformed into the class label, i.e., either predicting the positive or negative class for this sample. We distinguish between two types of mistakes that a classification model can make in this prediction: false positives (FP), i.e., incorrectly predicting the positive class for samples belonging to the negative class, and false negatives (FN), i.e., incorrectly predicting the negative class for samples from the positive class. Depending on the goal of our application, one type of error might be worse than the other, e.g., in a medical test we might rather tell someone that they should come in for a second test to confirm some problematic results than send them home with an undetected disease. By moving the classification threshold, we can control the trade-off between FP and FN.
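
A small sketch with hypothetical prediction scores, illustrating how lowering the threshold trades false negatives for false positives:

import numpy as np

# hypothetical predicted probabilities for the positive class and the corresponding true labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])
y_true = np.array([0, 0, 1, 1, 1, 0])

for threshold in (0.5, 0.3):
    y_pred = (y_score >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    print(f"threshold={threshold}: FP={fp}, FN={fn}")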

Accuracy

The accuracy is the most widely used classification evaluation metric, where we simply check, out of all samples, how many were classified correctly (i.e., TP and TN). However, this can be misleading for unequal class distributions and we should always compare the accuracy of the model against the ‘stupid baseline’ for classification, i.e., what the accuracy would be for a “model” that always predicts the most frequent class.
from sklearn.metrics import accuracy_score

Unbalanced class distributions: Accuracy vs. Balanced Accuracy

Below we see the decision boundaries of two models on a toy dataset, where the background color indicates whether the model predicts the blue or red class for a data point in this area. Which model do you think is more useful?

With unbalanced class distributions, e.g., in this case a lot more samples from the blue compared to the red class, the accuracy of a model that simply always predicts the most frequent class can be quite large. But while a 90% accuracy might sound impressive when we report the performance of a model to the project’s stakeholders, this does not necessarily mean that the model is useful, especially since in real world problems the undersampled class is often the one we care about most, e.g., people with a rare disease or products that have a defect.

Instead, the balanced accuracy is often the more informative measure when evaluating classification models and can help us to distinguish between a model that has actually learned something and the ‘stupid baseline’:

Balanced Accuracy

To avoid the pitfalls of accuracy, consider the misclassification rates of both classes separately:

For the balanced accuracy, first the fraction of correctly classified points is computed for each class individually and then these values are averaged.
from sklearn.metrics import balanced_accuracy_score
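
A small sketch with made-up imbalanced labels, where a “model” that always predicts the majority class achieves 90% accuracy but only 50% balanced accuracy:

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# made-up labels: 90 samples from the majority class, 10 from the minority class
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))           # 0.9 -- sounds impressive, but is misleading
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- exposes the useless model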

Multi-class problems: micro vs. macro averaging

The accuracy and balanced accuracy scores can be generalized to the multi-class classification case. Here we instead use the terms micro- and macro-averaging to describe the two strategies (which can also be used for other kinds of metrics like the F1-score), where micro-averaging means we compute the score by averaging over all samples, while macro-averaging means we first compute the score for each class separately and then average over the values for the different classes.

Micro-averaged score (→ accuracy_score):

\[ \begin{aligned} \frac{TP + TN}{TP + FN + TN + FP} = \frac{TP_\text{pos} + TP_\text{neg}}{n_\text{pos} + n_\text{neg}} & \quad \Rightarrow \quad \frac{\sum_c TP_{c}}{\sum_c n_{c}} \end{aligned} \]

\(n_{c}\) : number of samples belonging to class \(c\)
\(TP_{c}\) : number of correctly classified samples from class \(c\)

Macro-averaged score (→ balanced_accuracy_score):

\[ \begin{aligned} \frac{1}{2} \left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP} \right) = \frac{1}{2} \left(\frac{TP_\text{pos}}{n_\text{pos}} + \frac{TP_\text{neg}}{n_\text{neg}} \right) & \quad \Rightarrow \quad \frac{1}{C} \sum_{c=1}^C \frac{TP_{c}}{n_{c}} \end{aligned} \]
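
In sklearn, these two formulas correspond to micro- and macro-averaged recall; a small sketch with made-up multi-class labels:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# made-up multi-class labels
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# micro-averaging over all samples equals the plain accuracy ...
print(accuracy_score(y_true, y_pred), recall_score(y_true, y_pred, average="micro"))
# ... while macro-averaging the per-class scores equals the balanced accuracy
print(balanced_accuracy_score(y_true, y_pred), recall_score(y_true, y_pred, average="macro"))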

Multi-class problems: Confusion matrix

Similarly, the table with the TP/FP/TN/FN entries can be extended for the multi-class classification case:

The heatmap on the left shows the (normalized) confusion matrix for a ten-class classification problem (recognizing handwritten digits), while the plot on the right shows example images for each case. Examining the confusion matrix and some individual examples can give us more faith in the predictions of our model, as we might realize that some misclassifications (highlighted in red) could also happen to a human, e.g., the 4 that was classified as a 1 or even the 4 that was classified as a 7 (which might even be a labeling error from when the dataset was originally created).
from sklearn.metrics import confusion_matrix
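
A minimal usage sketch with made-up labels:

from sklearn.metrics import confusion_matrix

# made-up multi-class labels (rows of the matrix: true class, columns: predicted class)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# normalize='true' divides each row by the number of samples in that class
print(confusion_matrix(y_true, y_pred, normalize="true"))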

Model Selection

After we’ve chosen an appropriate evaluation metric for our problem, we can use the resulting scores to automatically select the best hyperparameters for a model and ultimately the best model.

The case for an additional validation set

As we’ve established in the beginning, before experimenting with any models, the dataset should be split into a training and a test set. However, this isn’t all: Since we typically experiment with many different types of models, and for each model type with dozens of hyperparameter settings, we should not use this test set to evaluate each of these model candidates. With everything we try out, it might happen that we end up choosing a model that just by chance performs well on this test set but does not generalize to new data later, and we would have no way of finding this out before deploying the model in production. Therefore, we introduce an additional data split, the validation set, which is used to evaluate the different candidate models, while the test set remains locked away until we’re ready to evaluate our final model to get a realistic estimate of how it performs on new data.

If the original dataset is quite big, say, over 100k samples (depending on the diversity of the data, e.g., the number of classes), then it is usually enough to just split the data into training, validation, and test sets at the start, where the validation and test sets contain about 10% of the data each and should be representative of the diversity of the original dataset. However, when the original dataset is smaller, it might not be possible to get such representative splits, which is when a technique called cross-validation (“x-val”) comes in handy.

Important

Especially when working with small datasets, it is important that these splits are well balanced, i.e., that all classes are represented equally in the training, validation, and test sets. This is also called stratified sampling.
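
A minimal sketch of such stratified splits (with made-up imbalanced data), first splitting off the test set and then the validation set from the remainder:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# made-up imbalanced classification data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=... keeps the class proportions (roughly) the same in all splits
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.11, stratify=y_temp, random_state=0)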

Cross-Validation

In a k-fold cross-validation, the dataset is split into k parts, where each part is once the designated validation set, while the remaining k-1 parts are used for training. This way, the model is trained and evaluated k times, each time on different splits. By computing the mean and standard deviation of the error metrics on all folds, we get a reliable estimate of the model’s generalization error and its variation due to the diversity of the dataset. The extreme case of the k-fold cross-validation is the Leave-One-Out cross-validation, where the model is always evaluated on only a single sample.
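
A minimal sketch of a 5-fold cross-validation in sklearn (with a logistic regression on made-up data standing in for the model under evaluation):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# made-up classification data
X, y = make_classification(n_samples=500, random_state=0)

# train & evaluate the model on 5 different train/validation splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="balanced_accuracy")
print(scores.mean(), scores.std())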
Caution

Most datasets are collected over longer time periods. Often, the samples are correlated over time, i.e., samples collected around the same time are more similar to each other than samples collected weeks or months apart. This is often very apparent in time series data (e.g., seasonality effects in sales data), but it can also be true for other types of data (e.g., the topics discussed in newspaper articles change over time; a camera lens might slowly accumulate dust). To get a realistic estimate of how well the model will perform on new data, it is usually best to use the most recent samples as the test set. Additionally, it might be necessary to use a time series split for the cross-validation, where the model is always trained on past data and evaluated on newer data. If there is a big difference between the model performance on random vs. chronological train/validation splits, this is a strong indication that the samples are correlated over time!
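
sklearn implements such a chronological split as TimeSeriesSplit, where, assuming the samples are ordered by time, the model is always trained on the earlier folds and validated on the following one; a minimal sketch:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# made-up data; assume the samples are ordered chronologically
X, y = np.arange(20).reshape(-1, 1), np.arange(20)

for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train up to sample", train_idx[-1], "-> validate on samples", val_idx)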

Hyperparameter Tuning

Often it is necessary to systematically evaluate a given model with different hyperparameter values to find the best settings. One straightforward approach for doing this is a grid search: In a grid search, we define the different values we want to test for each of the model’s hyperparameters and then all combinations of these different values for all hyperparameters are automatically evaluated, similar to how we would do it manually with nested for-loops. This is very useful, as often the different hyperparameter settings influence each other. Conveniently, sklearn furthermore combines this with a cross-validation. However, with many individual settings, this also comes at a computational cost, as the model is trained and evaluated \(k \times m_1 \times m_2 \times \dots \times m_i\) times, where \(k\) is the number of folds in the cross-validation and \(m_1...m_i\) are the number of values that need to be tested for each of the i hyperparameters of the model.

For example, with two hyperparameters, the grid search results could look something like the plot below, which shows a heatmap of the average accuracy achieved with each hyperparameter combination of a model in the cross-validation:

While sklearn’s grid search method tells us directly what the best hyperparameter combination is out of the ones it tested (marked with a red star in the plot), it is important to check the complete set of results to verify that we have covered the whole range of possible hyperparameter values that could give good results. For example, in the plot above, we see a peak in the middle with the results getting worse to the sides, i.e., we know that better hyperparameter values are unlikely to lie outside of the range we’ve tested.
It is generally a good idea to first start with a large range of values and then zoom in to the area that seems most promising. And of course knowledge about the different algorithms helps a lot in choosing reasonable settings as well.

Besides the basic grid search, there also exist other, more advanced hyperparameter tuning routines. For example, sklearn additionally implements a randomized search, and other dedicated libraries provide even fancier approaches, such as Bayesian optimization.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
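
A minimal sketch of such a grid search with cross-validation (here with a kNN classifier on made-up data; the hyperparameter names and value ranges are of course model-specific):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# made-up classification data
X, y = make_classification(n_samples=500, random_state=0)

# 2 x 5 = 10 hyperparameter combinations, each evaluated in a 5-fold cross-validation -> 50 model fits
param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [1, 3, 5, 10, 20]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)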