A Practitioner’s Guide to Machine Learning

Data Preprocessing

Now that we better understand our data and have verified that it is (hopefully) of good quality, we can get it ready for our machine learning algorithms.

Raw Data: can come in many different forms, e.g., sensor measurements, pixel values, text (e.g., an HTML page), an SAP database, …
→ n data points, stored as rows in an Excel sheet, as individual files, etc.

What constitutes one data point? It’s always important to be really clear about what one data point actually is, i.e., what the inputs look like and what we want back as a result from the model for each sample / observation. Think of this in terms of how you plan to integrate the ML model with the rest of your workflow: what data is generated in the previous step and can be used as input for the ML part, and what is needed as an output for the following step?
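
One way to make this explicit is to write down the structure of a single data point before touching any ML code. The following is only a sketch: the class name DataPoint and all of its fields (temp_c, pressure, note, is_defective) are hypothetical placeholders for whatever the previous workflow step produces and the following step needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataPoint:
    # Inputs: what the previous workflow step provides (hypothetical fields).
    temp_c: float
    pressure: float
    note: str
    # Output: what the following step needs from the model.
    # None means the value is not known yet, e.g., at prediction time.
    is_defective: Optional[bool] = None


# A labeled sample for training and an unlabeled one for prediction.
labeled = DataPoint(temp_c=21.5, pressure=1.0, note="ok", is_defective=False)
unlabeled = DataPoint(temp_c=23.0, pressure=1.1, note="recalibrated after alarm")
print(labeled)
print(unlabeled)
```

Writing the inputs and the output side by side like this tends to surface open questions early, for example whether a value is really available at the time the prediction has to be made.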

Preprocessing

Preprocessing means transforming and enriching the raw data before applying ML, for example:

remove / correct missing or wrongly entered data (e.g., misplaced decimal point)
exclude zero-variance features (i.e., variables that always have the same value) and nonsensical variables (e.g., IDs)
feature extraction: transform non-numerical inputs (e.g., unstructured data like text) into numerical values
feature engineering: compute additional/better features from the original variables

⇒ feature matrix \(\,X \in \mathbb{R}^{n\times d}\): n data points; each represented as a d-dimensional vector (i.e., with d features)
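
The preprocessing steps listed above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not a general recipe: the raw records, the column names (temp_c, pressure, voltage, note), the rule that temperatures above 100 contain a misplaced decimal point, the median imputation, and the two derived features are all hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical raw data: one dict per data point.
raw = [
    {"id": 101, "temp_c": 21.5, "pressure": 1.0, "voltage": 5.0, "note": "ok"},
    {"id": 102, "temp_c": None, "pressure": 1.2, "voltage": 5.0, "note": "sensor drift suspected"},
    {"id": 103, "temp_c": 225.0, "pressure": 0.9, "voltage": 5.0, "note": "ok"},
    {"id": 104, "temp_c": 23.0, "pressure": 1.1, "voltage": 5.0, "note": "recalibrated after alarm"},
]


def preprocess(rows):
    # Use only the measured variables; the ID is excluded as a nonsensical variable.
    names = ["temp_c", "pressure", "voltage"]
    X = np.array(
        [[np.nan if r[k] is None else r[k] for k in names] for r in rows],
        dtype=float,
    )

    # Correct wrongly entered data (assumption: temperatures above 100
    # are the result of a misplaced decimal point).
    temp = X[:, 0]  # view into X, so changes below modify X in place
    too_high = np.nan_to_num(temp, nan=0.0) > 100
    temp[too_high] /= 10

    # Fill missing values with the median of the observed values per column.
    medians = np.nanmedian(X, axis=0)
    missing = np.isnan(X)
    X[missing] = np.take(medians, np.nonzero(missing)[1])

    # Feature extraction: turn the free-text note into a number (word count).
    n_words = np.array([len(r["note"].split()) for r in rows], dtype=float)

    # Feature engineering: compute an additional feature from the original variables.
    temp_per_pressure = X[:, 0] / X[:, 1]

    X = np.column_stack([X, n_words, temp_per_pressure])
    names = names + ["note_n_words", "temp_per_pressure"]

    # Exclude zero-variance features (here: the constant voltage column).
    keep = X.std(axis=0) > 0
    X = X[:, keep]
    names = [name for name, k in zip(names, keep) if k]
    return X, names


X, feature_names = preprocess(raw)
print(X.shape)        # (4, 4), i.e., n = 4 data points with d = 4 features
print(feature_names)
```

The order of the steps matters in this sketch: the implausible value is corrected before the median is computed, so the wrongly entered 225.0 does not distort the value used for imputation.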

Prediction Targets?

→ label vector \(\mathbf{y}\): n-dimensional vector with one target value per data point
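
As a small sketch of how the feature matrix and the label vector relate, assume the preprocessed data sits in a single numeric table whose last column holds the target. Both the table layout and the values below are hypothetical.

```python
import numpy as np

# Hypothetical preprocessed table: the last column is the prediction target.
data = np.array([
    [21.5, 1.0, 0.0],
    [22.5, 1.2, 1.0],
    [22.5, 0.9, 0.0],
    [23.0, 1.1, 1.0],
])

X = data[:, :-1]  # feature matrix with shape (n, d)
y = data[:, -1]   # label vector with shape (n,)

n, d = X.shape
# Row i of X and entry i of y must describe the same data point.
assert y.shape == (n,)
print(n, d)  # 4 2
```

Checking that X and y have the same number of rows is a cheap safeguard, since filtering one without the other silently misaligns inputs and targets.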