ML with Python

The exercises accompanying this book use the programming language Python.

Why Python?
  • free & open source (unlike, e.g., MATLAB)

  • easy; fast prototyping

  • general purpose language (unlike, e.g., R): easy to incorporate ML into regular applications or web apps

  • fast: many numerical operations are backed by C libraries

  • a lot of open source ML libraries with a very active community!!

How to use Python?
  • regular scripts (i.e., normal text files ending in .py), especially useful for function definitions that can be reused in different projects

  • iPython shell: interactive console to execute code

  • Jupyter Notebooks (i.e., special files ending in .ipynb): great for experimenting & sharing work with others (also works with other programming languages: Jupyter stands for Julia, Python, and R; you can even mix languages in the same notebook)
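
As a minimal sketch of the first option, a reusable helper function could live in its own script. The file name utils.py and the function normalize are hypothetical examples chosen for illustration; they are not part of the book's exercises.

```python
# utils.py (hypothetical file name): a regular script with a reusable function


def normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # all values are identical: avoid a division by zero
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


if __name__ == "__main__":
    # this part only runs when the file is executed directly (python utils.py),
    # not when the function is imported elsewhere (from utils import normalize)
    print(normalize([2, 4, 6]))
```

In another script or notebook in the same folder, the function could then be reused with `from utils import normalize`.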

If you’re unfamiliar with Python, have a look at this Python tutorial specifically written to teach you the basics needed for the examples in this book. This cheat sheet additionally provides a summary of the most important steps when developing a machine learning solution, incl. code snippets using the libraries introduced below.

Overview of Python Libraries for ML

The libraries are conventionally imported with specific abbreviations (e.g., np or pd). It is highly recommended that you stick to these conventions, which you will also see in many code examples online (e.g., on StackOverflow).
numpy (& scipy)

everything needed for scientific computing, incl. random numbers, linear algebra, basic statistics, and optimization. The main data structure to represent vectors and matrices is the numpy array (e.g., np.array([1,2])).

  import numpy as np
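
The following is a minimal sketch of the kind of operations mentioned above (arrays, linear algebra, random numbers, basic statistics). The concrete numbers and variable names are made up for illustration.

```python
import numpy as np

# vectors and matrices are both represented as numpy arrays
v = np.array([1.0, 2.0])
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# linear algebra: matrix-vector product and solving the system A x = v
Av = A @ v
x = np.linalg.solve(A, v)

# random numbers: a seeded generator makes the results reproducible
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=(100, 2))

# basic statistics, computed per column (axis=0)
col_means = samples.mean(axis=0)
col_stds = samples.std(axis=0)
print(Av, x, col_means, col_stds)
```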

pandas

higher-level data manipulation with data stored in a DataFrame (a table similar to R's data frames); very useful for loading data, cleaning, and some exploration with different plots

  import pandas as pd
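
A minimal sketch of a typical pandas workflow, assuming a small made-up table (the column names and values are purely illustrative); in practice, the data would usually be loaded from a file, e.g., with pd.read_csv().

```python
import pandas as pd

# small made-up table with one missing value
df = pd.DataFrame({
    "height_cm": [170.0, 182.0, None, 165.0],
    "group": ["a", "b", "a", "b"],
})

# cleaning: drop rows that contain missing values
df_clean = df.dropna()

# exploration: summary statistics and a group-wise mean
print(df_clean.describe())
group_means = df_clean.groupby("group")["height_cm"].mean()
print(group_means)
```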
matplotlib (& seaborn)

create plots (e.g., plt.plot(), plt.scatter(), plt.imshow()).

  import matplotlib.pyplot as plt
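
A minimal sketch of the three plot types mentioned above, here drawn on separate axes of one figure (using the equivalent axes methods instead of the plt functions). The data is made up, and the "Agg" backend line is only an assumption for running the code without a display; it should be removed in a Jupyter notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, np.sin(x))                      # line plot
axes[0].set_title("plot")
axes[1].scatter(x, np.cos(x), s=10)             # scatter plot
axes[1].set_title("scatter")
axes[2].imshow(np.outer(np.sin(x), np.cos(x)))  # matrix displayed as an image
axes[2].set_title("imshow")
fig.tight_layout()
```

In a script, the figure could then be written to disk with fig.savefig() or displayed with plt.show() when an interactive backend is used.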

plotly

create interactive plots (e.g., px.parallel_coordinates())

  import plotly.express as px

scikit-learn

includes a lot of (non-deep learning) machine learning algorithms, preprocessing tools, and evaluation functions with a unified interface, i.e., all models (depending on their type) have these .fit(), .transform(), and/or .predict() methods, which makes it very easy to switch out models in the code by just changing the line where the model was initialized

  # import the model class from the specific submodule
  from sklearn.<submodule> import Model  # <submodule> and Model are placeholders
  from sklearn.metrics import accuracy_score

  # initialize the model (usually we also set some parameters here)
  model = Model()

  # preprocessing/unsupervised learning methods:
  model.fit(X)  # only pass feature matrix X
  X_transformed = model.transform(X)  # e.g., the StandardScaler would return a scaled feature matrix

  # supervised learning methods:
  model.fit(X, y)  # pass features and labels for training
  y_pred = model.predict(X_test)  # generate predictions for new points
  # evaluate the model (the internal score function uses the model's preferred evaluation metric, e.g., accuracy for classifiers)
  print("The model is this good:", model.score(X_test, y_test))  # .score() internally calls .predict()
  print("Equivalently:", accuracy_score(y_test, y_pred))
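
To make the generic template above concrete, here is a minimal sketch using the iris toy dataset that ships with scikit-learn. The choice of StandardScaler and LogisticRegression, the split ratio, and the parameter values are assumptions made for illustration, not recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load a small toy dataset and hold out part of it for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# preprocessing: fit the scaler on the training features only, then transform
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# supervised learning: fit on features and labels, then predict new points
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# for classifiers, .score() reports the accuracy, so both numbers agree
print("score:   ", model.score(X_test_scaled, y_test))
print("accuracy:", accuracy_score(y_test, y_pred))
```

Because of the unified interface, swapping LogisticRegression for another classifier would only require changing the import and the line where the model is initialized.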
torch (or keras)

neural network models (more details on these libraries in the section on neural networks)

  import torch
  # alternatively: from tensorflow import keras
Additional useful Natural Language Processing (NLP) libraries: