A Practitioner’s Guide to Machine Learning

ML use cases

The inputs that the ML algorithms operate on can come in many forms…

Structured vs. unstructured data: Data can come in various forms and while some data types require additional preprocessing steps, in principle ML algorithms can be used with all kinds of data.

The main distinction when characterizing data is made between structured data, which is any dataset that contains individual measurements / variables / attributes / features that represent unique quantities, and unstructured data, which can not be subdivided into meaningful variables. For example, in images “first pixel from the left” or in texts “10th word in the second paragraph” is not what we would call a variable, while “size in square meters” and “number of bedrooms” are useful quantities to describe an apartment. Structured data is often heterogeneous, since the different variables in a dataset typically stand for very different things. For example, when working with sensor data, a dataset normally does not consist of only temperature measurements, but additionally it could contain, e.g., pressure and flow values, which have different units and measurement scales. Unstructured data, on the other hand, is homogeneous, e.g., there is no qualitative difference between the 10th and the 100th pixel in an image.

…but our goal, i.e., the desired outputs, determines the type of algorithm we should use for the task:

If our goal is to simply understand the dataset a bit better and get an overview of it, then a dimensionality reduction algorithm from the area of unsupervised learning can be used to visualize it. If we want to discover patterns in the data, then we would also choose an algorithm from the field of unsupervised learning, but here it depends on which kinds of patterns we want to find: Anomaly detection algorithms identify individual data points that deviate from the rest (e.g., a malfunctioning machine or a fraudulent credit card transaction). Clustering algorithms identify groups of similar data points (e.g., for customer segmentation). If our goal is to make specific predictions, i.e., given an input get a corresponding output (e.g., predict whether a product will be faulty if it is produced under certain conditions), then we need to use an algorithm from the area of supervised learning. Here the main distinction is between regression and classification models: in regression tasks, the target variable that should be predicted is continuous (e.g., number of users, price, etc.), while in classification tasks the target variable is discrete, i.e., can only take on one distinct value and there is no continuum between the different values (e.g., an animal in a picture can either be a cat or a dog, but not something in between). While unsupervised and supervised learning cover most general use cases, other algorithms exist for more specific applications, e.g., recommender systems.

Some example input → output tasks and what type of ML algorithm solves them:

Input \(X\)	Output \(Y\)	ML Algorithm Category
questionnaire answers	customer segmentation	clustering
sensor measurements	everything normal?	anomaly detection
past usage of a machine	remaining lifetime	regression
email	spam (yes/no)	classification (binary)
image	which animal?	classification (multi-class)
user’s purchases	products to show	recommender systems
search query	relevant documents	information retrieval
audio	text	speech recognition
text in English	text in French	machine translation

Input \(X\)

Output \(Y\)

ML Algorithm Category

questionnaire answers

customer segmentation

clustering

sensor measurements

everything normal?

anomaly detection

past usage of a machine

remaining lifetime

regression

spam (yes/no)

classification (binary)

image

which animal?

classification (multi-class)

user’s purchases

products to show

recommender systems

search query

relevant documents

information retrieval

audio

text

speech recognition

text in English

text in French

machine translation

To summarize (see also: overview table as PDF):

Existing ML solutions & corresponding output (for one data point):

Dimensionality Reduction: (usually) 2D coordinates (to create a visualization of the dataset)
Outlier/Anomaly Detection: anomaly score (usually a value between 0 and 1 indicating how likely it is that this point is an outlier)
Clustering: cluster index (a number between 0 and k-1 indicating to which of the k clusters a data point belongs (or -1 for outliers))
Regression: a continuous value (any kind of numeric quantity that should be predicted)
Classification: a discrete value (one of several mutually exclusive categories)
Deep Learning: unstructured output like a text or image (e.g., speech recognition, machine translation, image generation, or neural style transfer)
Recommender Systems & Information Retrieval: ranking of a set of items (recommender systems, for example, rank the products that a specific user might be most interested in; information retrieval systems rank other items based on their similarity to a given query item)
Reinforcement Learning: a sequence of actions (specific to the state the agent is in)

Let’s start with a more detailed look at the different unsupervised & supervised learning algorithms and what they are good for:

To apply unsupervised learning algorithms, we only need a feature matrix \(X\), while learning a prediction model with supervised learning algorithms additionally requires the corresponding labels \(\mathbf{y}\).

Even if our ultimate goal is to predict something (i.e., use supervised learning), it can still be helpful to first use unsupervised learning to get a better understanding of the dataset, for example, by visualizing the data with dimensionality reduction methods to see all samples and their diversity at a glance, by identifying outliers to clean the dataset, or, for classification problems, by first clustering the samples to check whether the given class labels match the naturally occurring groups in the data or if, e.g., two very similar classes could be combined to simplify the problem.

Same dataset, different use cases

To illustrate the usefulness of the five different types of unsupervised and supervised learning algorithms, lets apply them to this example dataset:

m²	# Bedr	# Bath	Renovated	…	Price	Sold
125	4	2	2000	…	500k	1
75	2	1	1990	…	350k	1
150	6	2	2010	…	750k	0
…	…	…	…	…	…	…
35	5	2	1999	…	620k	0
65	3	1	2015	…	220k	1
100	3	1	2003	…	450k	0

This is a small toy dataset with structured data about different apartments, which someone might have gathered from a real estate website. It includes the size of the apartment in square meters, the number of bedrooms, the number of bathrooms, the year it was last renovated, and finally the price of the listing and whether it was sold for this price (1) or not (0).

Dimensionality Reduction

Use Cases:

create a 2D visualization to explore the dataset as a whole, where we can often already visually identify patterns like samples that can be grouped together (clusters) or that don’t belong (outliers)
noise reduction and/or feature engineering as a data preprocessing step to improve the performance in the following prediction task

Example Unsupervised Learning: Dimensionality Reduction: Goal: Visualize the dataset

The first step when working with a new dataset is usually to visualize it, to get a better overview of all the samples and their diversity. This is done with a dimensionality reduction algorithm, which takes the original high dimensional data as input, where each column (= feature) in the table is one dimension, and outputs a lower dimensional representation of the samples, i.e., a new matrix with fewer columns (usually two for a visualization). With these two new features, here called \(z_1\) and \(z_2\), we can create a scatter plot of the dataset, where each sample / row (in this case each apartment) is represented as one point in this new 2D coordinate system. We can think of this plot as a map of our dataset that enables us to view all data points at a glance. This plot often shows interesting patterns, for example, groups of similar points, which would be located close to each other in this 2D map. Please note that for most dimensionality reduction methods, it is not possible to describe what is behind this new coordinate system. Specifically, these are not just the two most informative original features, but completely new dimensions that summarize the information of the original inputs. To better interpret these plots, it is helpful to color the dots afterwards by some variable, which can then reveal the driving factors behind the most salient patterns in the dataset. In this example, we could have used the price of each apartment to color the respective dot in the map, which might then reveal that similarly priced apartments are arranged next to each other.

Possible challenges:

transforming the data with dimensionality reduction methods constructs new features as a (non)linear combination of the original features, which decreases the interpretability of the subsequent analysis results

Anomaly Detection

Use Cases:

clean up the data, e.g., by removing samples with wrongly entered values, as a data preprocessing step to improve the performance in the following prediction task
create alerts for anomalies, for example:
- fraud detection: identify fraudulent credit card transaction in e-commerce
- monitor a machine to see when something out of the ordinary happens or the machine might require maintenance

Example Unsupervised Learning: Anomaly Detection: Goal: Find outliers in the dataset

Next, we can check the dataset for outliers and then subsequently correct or remove these samples. An outlier detection algorithm outputs for each sample an anomaly score, which indicates whether this data point deviates from the norm. We can use these scores to colorize the 2D map of the dataset generated in the previous step to see the anomalies in context. One drawback is that an anomaly detection algorithm does not tell us why it considers an individual point an outlier. A data scientist needs to examine the points identified as outlier to see, e.g., if these should be removed due to flawed measurements or if they constitute some interesting edge cases. In this example, the sample identified as an anomaly is an apartment that supposedly has a size of only 35 \(m^2\), but at the same time 5 bedrooms, i.e., most likely the person that originally entered the data made a mistake and the size of the listing should actually be 135 \(m^2\).

Possible challenges:

you should always have a good reason for throwing away data points — outliers are seldom random, sometimes they reveal interesting edge cases that should not be ignored

Clustering

Use Cases:

identify groups of related data points, for example:
- customer segmentation for targeted marketing campaign

Example Unsupervised Learning: Clustering: Goal: Find naturally occurring groups in the dataset

We can also check if the dataset contains naturally occurring groups. This is accomplished with a clustering algorithm, which returns a cluster index for each sample, where points with the same index are in the same cluster. Please note that these cluster indices are not ordered and when running the algorithm again, the samples might be assigned different numbers, however, the groups of samples that were assigned the same number should still be in a cluster together, i.e., this cluster might now just be called ‘5’ instead of ‘3’. These cluster indices can again be used to colorize the 2D map of the dataset to see the clusters in context. While a clustering algorithm groups similar points together, it does not tell us why the points were assigned to a cluster and what this cluster means. Therefore, the data scientist again needs to examine the results to try to describe the different clusters. In our example, the clusters might be “cheap studio apartments”, “large family apartments”, and “luxurious penthouses”. In unsupervised learning, there is no correct solution and a different algorithm might return different results. Just use the solution that is most helpful for your use case.

Possible challenges:

no ground truth: difficult to choose between different models and parameter settings → the algorithms will always find something, but whether this is useful (i.e., what the identified patterns mean) can only be determined by a human in a post-processing step
many of the algorithms rely on similarities or distances between data points, and it can be difficult to define an appropriate measure for this or know in advance which features should be compared (e.g., what makes two customers similar?)

Unsupervised learning has no ground truth

It is important to keep in mind that unsupervised learning problems have no right or wrong answers. Unsupervised learning algorithms simply recognize patterns in the data, which may or may not be meaningful for us humans.

For example, there exist a bunch of different unsupervised learning algorithms that group data points into clusters, each with a slightly different strategy and definition of what it means for two samples to be similar enough that they can be put into the same cluster.

The first instinct of a human is to group these images according to the fruit displayed on them, however, there is nothing inherently wrong with clustering the images based on a different characteristic, such as their background color, whether or not the fruit has a leaf attached, in which direction the stem is pointing, etc.

It is up to the data scientist to examine the results of an unsupervised learning algorithm and make sense of them. And if they don’t match our expectations, we can simply try a different algorithm.

Regression & Classification

Use Cases:

Learn a model to describe an input-output relationship and make predictions for new data points, for example:
- predict in advance whether a product produced under the proposed process conditions will be of high quality or would be a waste of resources
- churn prediction: identify customers that are about to cancel their contract (or employees that are about to quit) so you can reach out to them and convince them to stay
- price optimization: determine the optimal price for a product (often used for dynamic pricing, e.g., to adapt prices based on the device a customer uses (e.g., new iPhone vs old Android phone) when accessing a website)
- predictive maintenance: predict how long a machine component will last
- sales forecasts: predict revenue in the coming weeks and how much inventory will be required to satisfy the demand

Example Supervised Learning: Classification: Goal: Predict a discrete value for each data point

Next, we can predict whether an apartment will be sold for the listed price. Since the variable “sold” only takes on the discrete values ‘yes’ (1) or ‘no’ (0), this is a binary classification problem. A classification model uses the attributes of an apartment together with the listing’s price as inputs and predicts whether the apartment will be sold for this price. Since we have the true labels available for the initially collected dataset, we can evaluate how well the model performed by computing the number of wrong predictions it generated. This is the nice thing about supervised learning: We can objectively determine how good a solution is and benchmark different models against each other, while in unsupervised learning the data scientist needs to manually examine the results to make sense of them.

Example Supervised Learning: Regression: Goal: Predict a continuous value for each data point

Finally, we can predict a reasonable price for a listing. Since prices are continuous values, this is a regression problem, where the model uses as inputs the attributes of the apartments and predicts a suitable price. Again we have the true prices available and can compute the deviation of the regression model’s estimates from the original price set by a real estate agent.

Possible challenges:

success is uncertain: while it is fairly straightforward to apply the models, it is difficult to determine in advance whether there even exists any relation between the measured inputs and targets (→ beware of garbage in, garbage out!)
appropriate definition of the output/target/KPI that should be modeled, i.e., what does it actually mean for a process to run well and how might external factors influence this definition (e.g., can we expect the same performance on an exceptionally hot summer day?)
missing important input variables, e.g., if there exist other influencing factors that we haven’t considered or couldn’t measure, which means not all of the target variable’s variance can be explained
lots of possibly irrelevant input variables that require careful feature selection to avoid spurious correlations, which would result in incorrect ‘what-if’ forecasts since the true causal relationship between the inputs and outputs isn’t captured
often very time intensive data preprocessing necessary, e.g., when combining data from different sources and engineering additional features

Deep Learning

Use Cases:

automate tedious, repetitive tasks otherwise done by humans, for example (see also ML is everywhere!):
- text classification (e.g., identify spam / hate speech / fake news; forward customer support request to the appropriate department)
- sentiment analysis (subtask of text classification: identify if text is positive or negative, e.g., to monitor product reviews or what social media users are saying about your company)
- speech recognition (e.g., transcribe dictated notes or add subtitles to videos)
- machine translation (translate texts from one language into another)
- image classification / object recognition (e.g., identify problematic content (like child pornography) or detect street signs and pedestrians in autonomous driving)
- image captioning (generate text that describes what’s shown in an image, e.g., to improve the online experience for for people with visual impairment)
- predictive typing (e.g., suggest possible next words when typing on a smartphone)
- data generation (e.g., generate new photos/images of specific objects or scenes)
- style transfer (transform a given image into another style, e.g., make photos look like van Gogh paintings)
- separate individual sources of an audio signal (e.g., unmix a song, i.e., separate vocals and instruments into individual tracks)
replace classical simulation models with ML models: since exact simulation models are often slow, the estimation for new samples can be speed up by instead predicting the results with an ML model, for example:
- AlphaFold: generate 3D protein structure from amino acid sequence (to facilitate drug development)
- SchNet: predict energy and other properties of molecules given their configuration of atoms (to speed up materials research)

Possible challenges:

selecting a suitable neural network architecture & getting it to work properly; especially when replacing traditional simulation models it is often necessary to develop a completely new type of neural network architecture specifically designed for this task and inputs / outputs, which requires a lot of ML & domain knowledge, intuition, and creativity
computational resources (don’t train a neural network without a GPU!)
data quality and quantity: need a lot of consistently labeled data, i.e., many training instances labeled by human annotators who have to follow the same guidelines (but can be mitigated in some cases by pre-training the network using self-supervised learning)

Information Retrieval

Use Cases:

improve search results by identifying similar items: given a query, rank results, for example:
- return matching documents / websites given a search query
- show similar movies given the movie a user is currently looking at (e.g., same genre, director, etc.)

Possible challenges:

quality of results depends heavily on the chosen similarity metric; identifying semantically related items is currently more difficult for some data types (e.g., images) than others (e.g., text)

Recommender Systems

Use Cases:

personalized suggestions: given a sample from one type of data (e.g., user, protein structure), identify the most relevant samples from another type of data (e.g., movie, drug composition), for example:
- show a user movies that other users with a similar taste also liked
- recommend molecule structures that could fit into a protein structure involved in a certain disease

Possible challenges:

little / incomplete data, for example, different users might like the same item for different reasons and it is unclear whether, e.g., a user didn’t watch a movie because he’s not interested in it or because he just didn’t notice it yet

Reinforcement Learning

Use Cases:

Determine an optimal sequence of actions given changing environmental conditions, for example:
- virtual agent playing a (video) game
- robot with complex movement patterns, e.g., picking up differently shaped objects from a box

⇒ Unlike in regular optimization, where the optimal inputs given a single specific external condition are determined, here an “agent” (= the RL algorithm) tries to learn an optimal sequence of inputs to maximize the cumulative reward received over multiple time steps, where there can be a significant time delay between the inputs and the rewards that they generate (e.g., in a video game we might need to pick up a key in the beginning of a level, but the door that can be opened with it only comes several frames later).

Possible challenges:

usually requires a simulation environment for the agent to learn in before it starts acting in the real world, but developing an accurate simulation model isn’t easy and the agent will exploit any bugs if that results in higher rewards
can be tricky to define a clear reward function that should be optimized (imitation learning is often a better option, where the agent instead tries to mimic the decisions made by a human in some situation)
difficult to learn correct associations when there are long delays between critical actions and the received rewards
agent generates its own data: if it starts off with a bad policy, it will be tricky to escape from this (e.g., in a video game, if the agent always falls down a gap instead of jumping over it, it never sees the rewards that await on the other side and therefore can’t learn that it would be beneficial to jump over the gap)

Other

ML algorithms are categorized by the output they generate for each input. If you want to solve an ‘input → output’ problem with a different output than the ones listed above, you’ll likely have to settle in for a multi-year research project — if the problem can be solved with ML at all!

To solve complex problems, we might need multiple algorithms: Example: virtual assistant (e.g., Siri or Alexa): “Hey <smart speaker>, tell me a joke!” → a random joke

This might look like an input-output problem, but it would be very difficult and inefficient to solve it directly. Instead, we break the problem down into smaller subtasks that can be solved with existing algorithms:

Trigger word detection:
audio → “Hey <smart speaker>” (yes/no)?
Speech recognition:
audio → text
Intent classification:
text → (joke/timer/weather/…)?
Request-specific program (e.g., select random joke)
Speech generation:
text → audio

First, the smart speaker needs to know whether it was activated with a specific trigger word (e.g., “Hey Siri”). This is a simple binary classification task (trigger word: yes/no), which is usually performed on the device itself, since we don’t want that everything we say is continuously streamed into the cloud. Next, the spoken words that follow the trigger word are transcribed into text. Text is easier to handle, because, for example, variations due to different accents are removed. Based on this text, the intent is recognized, i.e., which of the different functionalities of the virtual assistant should be used (e.g., tell a joke, play music, set an alarm, etc.). This is a multi-class classification problem. The next step is to execute the request, which is not done with ML, but instead some task-specific program is run, e.g., to select a joke from a database or set a timer, etc., based on the apps installed on the device. Finally, the output of the program needs to be converted back into an audio signal. For this again an ML model can help to get smoothly spoken text — and in the near future maybe with the voice of Morgan Freeman or some other famous person like in "Deep Fake" applications.

⇒ It is generally advisable to first think about how a problem could be decomposed into easier-to-solve subproblems, especially since there might already be a large dataset or pre-trained ML model available for one of these subtasks. For example, speech recognition models can be trained on audio books and transcribed political speeches in addition to the data collected from the smart speaker users.

When one ML model receives as input the output of another ML model, this means as soon as we roll out a new version of the ML model at the beginning of the chain, we should also retrain the models following this one, since they might now receive slightly different inputs, i.e., experience a data drift.