ML with Python
The exercises accompanying this book use the programming language Python.
Why Python?
- free & open source (unlike, e.g., MATLAB)
- easy to learn; great for fast prototyping
- general purpose language (unlike, e.g., R): easy to incorporate ML into regular applications or web apps
- fast: many numerical operations are backed by C libraries
- a lot of open source ML libraries with a very active community!
How?
- regular scripts (i.e., normal text files ending in .py), especially useful for function definitions that can be reused in different projects
- IPython shell: interactive console to execute code
- Jupyter Notebooks (i.e., special files ending in .ipynb): great for experimenting & sharing work with others (also works with other programming languages: Jupyter stands for Julia, Python, and R; you can even mix languages in the same notebook)
If you’re unfamiliar with Python, have a look at this Python tutorial specifically written to teach you the basics needed for the examples in this book. This cheat sheet additionally provides a summary of the most important steps when developing a machine learning solution, incl. code snippets using the libraries introduced below.
If you want to learn more about software engineering best practices in general, you might also like my other book Research Software Engineering: A Primer.
Overview of Python Libraries for ML
The libraries are conventionally imported under specific abbreviations (e.g., np or pd). It is highly recommended that you stick to these conventions; you will also see them in many code examples online (e.g., on StackOverflow).
numpy (& scipy): everything needed for scientific computing, incl. random numbers, linear algebra, basic statistics, and optimization. The main data structure to represent vectors and matrices is the numpy array (e.g., np.array([1, 2])).
import numpy as np
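For example, a minimal sketch of typical numpy operations (all values here are made up for illustration):

import numpy as np

# create a vector and a 2x2 matrix
x = np.array([1., 2.])
A = np.array([[3., 1.], [1., 2.]])

# basic linear algebra: matrix-vector product and solving A @ w = x for w
y = A @ x
w = np.linalg.solve(A, x)

# random numbers and basic statistics
r = np.random.default_rng(42).normal(size=1000)
print(y, w, r.mean(), r.std())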
pandas: higher-level data manipulation with data stored in a DataFrame (a table similar to a data frame in R); very useful for loading data, cleaning, and some exploration with different plots
import pandas as pd
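For example, a minimal sketch of typical pandas operations (the small dataset here is made up for illustration):

import pandas as pd

# a small made-up dataset for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [38000, 52000, 61000, None],
})

# quick look at the data
print(df.head())
print(df.describe())

# cleaning: drop rows with missing values
df = df.dropna()

# in practice, data is usually loaded from a file, e.g., df = pd.read_csv("data.csv")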
matplotlib (& seaborn): create plots (e.g., plt.plot(), plt.scatter(), plt.imshow()).
import matplotlib.pyplot as plt
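For example, a minimal sketch of a static plot (the data is made up for illustration):

import matplotlib.pyplot as plt
import numpy as np

# some made-up data for illustration
x = np.linspace(0, 10, 100)

plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")
plt.scatter(x[::10], np.sin(x[::10]))
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()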
plotly: create interactive plots (e.g., px.parallel_coordinates())
import plotly.express as px
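For example, a minimal sketch using one of the example datasets that ship with plotly:

import plotly.express as px

# example dataset included in plotly
df = px.data.iris()

# interactive parallel coordinates plot, colored by the numeric species id
fig = px.parallel_coordinates(df, color="species_id")
fig.show()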
scikit-learn: includes a lot of (non-deep learning) machine learning algorithms, preprocessing tools, and evaluation functions with a unified interface, i.e., all models (depending on their type) have the same .fit(), .transform(), and/or .predict() methods, which makes it very easy to switch out models in the code by just changing the line where the model is initialized
# import the model class from the specific submodule
from sklearn.xxx import Model
from sklearn.metrics import accuracy_score

# initialize the model (usually we also set some parameters here)
model = Model()

# preprocessing/unsupervised learning methods:
# only pass feature matrix X
model.fit(X)
X_transformed = model.transform(X)  # e.g., the StandardScaler would return a scaled feature matrix

# supervised learning methods:
# pass features and labels for training
model.fit(X, y)
y_pred = model.predict(X_test)  # generate predictions for new points

# evaluate the model (the internal score function uses the model's preferred evaluation metric)
print("The model is this good:", model.score(X_test, y_test))  # .score() internally calls .predict()
print("Equivalently:", accuracy_score(y_test, y_pred))
torch (or keras): neural network models (more details on these libraries in the section on deep learning)

import torch
(or: from tensorflow import keras)
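As a tiny sketch of what a torch model can look like (a made-up two-layer network; see the deep learning section for real examples):

import torch
import torch.nn as nn

# a made-up network with 10 input features, one hidden layer, and 2 output classes
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

# forward pass on a random batch of 4 samples
x = torch.randn(4, 10)
print(model(x).shape)  # -> torch.Size([4, 2])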
Additional useful libraries to publish and share your work:
- FastAPI (easy way to create APIs, e.g., so that your models can be queried through an endpoint on the web; see the sketch below)
- streamlit (create interactive dashboards and web apps from simple Python scripts)
- papermill (parametrize Jupyter notebooks, e.g., to create reports from templates)
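For example, a minimal FastAPI sketch that exposes a prediction endpoint (the input schema and the dummy "prediction" are placeholders; in a real app you would call your trained model here):

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Input(BaseModel):
    features: List[float]  # placeholder input schema

@app.post("/predict")
def predict(data: Input):
    # placeholder: in a real app, call your trained model here, e.g., model.predict(...)
    return {"prediction": sum(data.features)}

Assuming this code lives in a file called main.py, the app can then be served with a web server such as uvicorn (e.g., uvicorn main:app).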
Additional useful Natural Language Processing (NLP) libraries:
- transformers (Hugging Face: pre-trained neural network models for different tasks; see the sketch below)
- spacy (modern & fast NLP tools)
- nltk (traditional NLP tools)
- gensim (topic modeling)
- beautifulsoup (for parsing websites)
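For example, a minimal sketch with the transformers library (on first use, this downloads a default pre-trained sentiment analysis model chosen by the library):

from transformers import pipeline

# create a pipeline for a standard task; a default pre-trained model is downloaded automatically
classifier = pipeline("sentiment-analysis")
print(classifier("This book makes machine learning easy to understand!"))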