Ensemble Methods

What is better than one model? Multiple models!

Main idea

Train multiple models & combine their predictions (regression: average the predictions; classification: take the most frequent class, i.e., a majority vote); a minimal sketch follows the list below. The individual models can differ in several ways:

  • Different types of models.

  • Same type of model but with different hyperparameter settings (this can also include the random seed used when initializing the model, e.g., for neural networks).

  • Models trained on different subsets of the data (different selections of samples and/or features).

  • Boosting: models are trained sequentially, and each additional model focuses on the data points that the previous models got wrong.
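
A minimal sketch of the main idea in Python, assuming a synthetic dataset and an illustrative choice of models (any of the variations above would work the same way):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# illustrative synthetic dataset (binary classification)
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# three different types of models, all trained on the same data
models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]
preds = np.array([m.fit(X_train, y_train).predict(X_test) for m in models])

# classification: majority vote (here for binary 0/1 labels);
# for regression one would instead average: preds.mean(axis=0)
vote = (preds.mean(axis=0) > 0.5).astype(int)

# fraction of models that agree with the majority -> rough certainty estimate
certainty = np.where(vote == 1, preds.mean(axis=0), 1 - preds.mean(axis=0))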

Pros
  • More stable predictions (tip: use individual models that each overfit a bit on their own).

  • Get an estimate of how certain the prediction is → how many models agree?

Careful
  • Computationally expensive (depending on the models used).

Popular example

Random Forest: Multiple decision trees trained on random subsamples of the data. This exploits the fact that decision trees can be sensitive to small variations in the dataset: the individual trees end up quite different, and averaging their predictions reduces the variance.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
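
A quick usage sketch for the classifier, assuming an illustrative synthetic dataset and hyperparameter values:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each of the 100 trees is trained on a bootstrap sample of the data and
# considers a random subset of the features at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))

# the predicted probabilities are the averaged votes of the trees,
# i.e., how many trees agree -> a rough certainty estimate
print(forest.predict_proba(X_test[:3]))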

For more advanced approaches, check out the voting ensemble (VotingClassifier / VotingRegressor) and boosting methods from sklearn, with which arbitrary models can be combined into an ensemble (see the sketch below).
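
For instance, a voting ensemble that combines three different model types; the dataset, models, and settings here are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(probability=True)),  # probability=True is required for soft voting
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",  # "hard": majority vote; "soft": average the predicted probabilities
)
ensemble.fit(X, y)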

Ensemble methods like random forests and gradient boosting trees give very good results on real-world structured datasets and dominate the leaderboards of many competitions on Kaggle, a website where companies can upload datasets for data scientists to benchmark themselves against each other and even win prize money.