ML with Python

The exercises accompanying this book use the programming language Python.

Why Python?

How?

If you’re unfamiliar with Python, have a look at this Python tutorial, which was written specifically to teach the basics needed for the examples in this book. This cheat sheet additionally summarizes the most important steps when developing a machine learning solution, including code snippets that use the libraries introduced below.

If you want to learn more about software engineering best practices in general, you might also like my other book Research Software Engineering: A Primer.

Overview of Python Libraries for ML

Note

By convention, the libraries are imported with specific abbreviations (e.g., np or pd). It is highly recommended that you stick to these conventions, which you will also see in many code examples online (e.g., on StackOverflow).

numpy (& scipy):
everything needed for scientific computing, incl. random numbers, linear algebra, basic statistics, and optimization. The main data structure to represent vectors and matrices is the numpy array (e.g., np.array([1,2])).

import numpy as np
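As a minimal sketch of the functionality listed above (the concrete values, variable names, and the seed are arbitrary choices made for illustration):

```python
import numpy as np

# vectors and matrices are both represented as numpy arrays
x = np.array([1.0, 2.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# linear algebra: matrix-vector product and solving the linear system A @ z = x
y = A @ x
z = np.linalg.solve(A, x)

# basic statistics: mean of each column
col_means = A.mean(axis=0)

# random numbers: a seeded generator makes the results reproducible
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=(100, 2))

print(y)
print(z)
print(col_means)
print(samples.shape)
```

Optimization routines are not part of numpy itself but live in scipy (e.g., in the scipy.optimize submodule).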

pandas:
higher-level data manipulation with data stored in a DataFrame, a table structure similar to the data frames in R; very useful for loading and cleaning data and for some initial exploration with different plots

import pandas as pd
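A short sketch of a typical load, clean, and explore sequence could look as follows. The column names and values are made up for illustration; in practice, the DataFrame would usually be loaded from a file, e.g., with pd.read_csv().

```python
import pandas as pd

# small hypothetical dataset with one missing value
df = pd.DataFrame({
    "height_cm": [172.0, 181.0, None, 165.0],
    "weight_kg": [68.0, 85.0, 74.0, 59.0],
    "group": ["a", "b", "a", "b"],
})

# cleaning: drop all rows that contain missing values
df_clean = df.dropna()

# exploration: summary statistics of the numeric columns
print(df_clean.describe())

# exploration: mean weight per group
group_means = df_clean.groupby("group")["weight_kg"].mean()
print(group_means)
```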

matplotlib (& seaborn):
create plots (e.g., plt.plot(), plt.scatter(), plt.imshow()).

import matplotlib.pyplot as plt
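The three plot types mentioned above can be sketched as follows. This example uses the axes methods ax.plot(), ax.scatter(), and ax.imshow(), which correspond to the plt functions of the same name; the data and the output file name are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)

# one figure with three subplots next to each other
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# line plot of a function
axes[0].plot(x, np.sin(x))
axes[0].set_title("plot")

# scatter plot of random 2D points
axes[1].scatter(rng.normal(size=50), rng.normal(size=50))
axes[1].set_title("scatter")

# display a matrix as an image
axes[2].imshow(rng.random((10, 10)), cmap="viridis")
axes[2].set_title("imshow")

fig.tight_layout()
# save to a file (in an interactive session, plt.show() displays the figure instead)
fig.savefig("example_plots.png")
```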

plotly:
create interactive plots (e.g., px.parallel_coordinates())

import plotly.express as px

scikit-learn:
includes many (non-deep learning) machine learning algorithms, preprocessing tools, and evaluation functions with a unified interface: depending on their type, all models have .fit(), .transform(), and/or .predict() methods. This makes it very easy to switch out models in the code by changing only the line where the model is initialized.

# import the model class from the specific submodule
from sklearn.xxx import Model
from sklearn.metrics import accuracy_score

# initialize the model (usually we also set some parameters here)
model = Model()

# preprocessing/unsupervised learning methods:
model.fit(X)  # only pass feature matrix X
X_transformed = model.transform(X)  # e.g., the StandardScaler would return a scaled feature matrix

# supervised learning methods:
model.fit(X, y)  # pass features and labels for training
y_pred = model.predict(X_test)  # generate predictions for new points
# evaluate the model (the internal score function uses the model's preferred evaluation metric)
print("The model is this good:", model.score(X_test, y_test))  # .score() internally calls .predict()
# for classifiers, .score() is the accuracy (for regression models it is R^2 instead)
print("Equivalently (for a classifier):", accuracy_score(y_test, y_pred))
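To make the template above concrete, here is a minimal runnable sketch. The Iris dataset, the LogisticRegression classifier, and the split and model parameters are choices made for this illustration only; any other scikit-learn dataset or classifier could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load a small example dataset and hold out 30% of the points for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# preprocessing: fit the scaler on the training features only,
# then apply the same transformation to training and test data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# supervised learning: fit on features and labels, then predict for new points
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# for a classifier, .score() returns the accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
print("Score:", model.score(X_test_scaled, y_test))
```

To try a different classifier, only the import and the line where the model is initialized need to change; the .fit() and .predict() calls stay the same.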

torch (or keras):
neural network models (more details on these libraries in the section on deep learning)

import torch
# alternatively: from tensorflow import keras

Additional useful libraries to publish and share your work:

  • FastAPI (easy way to create APIs, e.g., so that your models can be queried through an endpoint on the web)
  • streamlit (create interactive dashboards and web apps from simple Python scripts)
  • papermill (parametrize Jupyter notebooks, e.g., to create reports from templates)

Additional useful Natural Language Processing (NLP) libraries:

  • transformers (Hugging Face: pre-trained neural network models for different tasks)
  • spacy (modern & fast NLP tools)
  • nltk (traditional NLP tools)
  • gensim (topic modeling)
  • beautifulsoup (for parsing websites)