Data is the new oil!?
Let’s take a step back. Because it all begins with data.
You’ve probably heard this claim before: “Data is the new oil!” It suggests that data is valuable. But is it?
Oil is considered valuable because we have important use cases for it: powering our cars, heating our homes, and producing plastics or fertilizers.
Similarly, our data is only as valuable as what we make of it. So what can we use data for?
The main use cases belong to one of two categories:

Insights
We can generate insights either through continuous monitoring (“Are we on track?”) or a targeted analysis (“What’s wrong?”).
Monitoring
By visualizing important variables or metrics in reports or dashboards, we make the status quo transparent and quantify our progress towards a goal.
For this purpose, we often devise so-called Key Performance Indicators (KPIs), i.e., custom metrics that tell us how well things are going. For example, if we’re working on a web application, one KPI we might want to track could be “user happiness”. Unfortunately, true user happiness is difficult to measure, but we can instead check how many users return to our site and how long they stay, and then combine these and other measurements into a proxy variable that we call “user happiness”.
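As a rough illustration, such a proxy could be a weighted combination of the measured quantities. The sketch below is made up: the input metrics, the weights, and the 30-minute normalization are arbitrary assumptions, not a standard definition.

```python
def user_happiness(return_rate, avg_session_minutes):
    """Combine a return rate (0-1) and the average session length
    (in minutes) into a single proxy score between 0 and 1."""
    # Assumption: 30 minutes already counts as a fully engaged session.
    session_score = min(avg_session_minutes / 30.0, 1.0)
    # Weights chosen arbitrarily for illustration.
    return 0.6 * return_rate + 0.4 * session_score

print(user_happiness(return_rate=0.45, avg_session_minutes=12.0))  # ~0.43
```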
A KPI is only a reliable measure if it is not simultaneously used to control people’s behavior, as they will otherwise try to game the system (Goodhart’s Law). For example, if our goal is high-quality software, counting the number of bugs in our software is not a reliable measure of quality if we simultaneously reward programmers for every bug they find and fix.
Ideally, these metrics are combined with thresholds that trigger alerts to automatically notify us if things go south and corrective action becomes necessary. For example, we could set up an alert on the health of a system or machine to notify a technician when maintenance is needed.
Similarly, one KPI for a customer service department could be the number of hours it takes to resolve a customer request. The target state could be ‘less than 48 hours’, and if the average exceeds 96 hours for more than a month, this could be a sign that more service agents need to be hired.
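A minimal sketch of such an alert check for this KPI might look as follows; the data source and notification channel are hypothetical (in practice the alert would page someone or post to a chat channel rather than print):

```python
from statistics import mean

TARGET_HOURS = 48  # desired resolution time per customer request
ALERT_HOURS = 96   # threshold that should trigger an alert

def check_resolution_times(resolution_hours_this_month):
    """Compare the monthly average resolution time against the thresholds."""
    avg = mean(resolution_hours_this_month)
    if avg > ALERT_HOURS:
        print(f"ALERT: average resolution time {avg:.1f}h exceeds {ALERT_HOURS}h")
    elif avg > TARGET_HOURS:
        print(f"Warning: average {avg:.1f}h is above the {TARGET_HOURS}h target")
    else:
        print(f"OK: average resolution time {avg:.1f}h is within target")

check_resolution_times([80.0, 110.5, 98.0, 120.0])  # -> ALERT
```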
Unfortunately, what kind of corrective action will get us back on track is often not obvious and usually requires us to dig deeper into the data with an ad hoc analysis to identify the root cause of the problem.
Ad Hoc Analysis
An ad hoc data analysis can help us answer questions such as:
- Why are we not reaching our goal?
- What should we do next?
Arriving at satisfactory answers is often more art than science. But we have a multitude of tools and techniques to help us with this.
Automation
As described in the following sections, machine learning models can be used to automate ‘input → output’ tasks that would otherwise require a human (expert). These tasks are usually easy for an (appropriately trained) human, for example:
- Translating texts from one language into another
- Sorting out products with scratches when they pass a checkpoint on the assembly line
- Recommending movies to a friend
For this to work, the ML models need to be trained on a lot of historical data (e.g., texts in both languages, images of products with and without scratches, information about different users and which movies they watched).
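To make this concrete, here is a minimal sketch of such a training step for the scratch-detection example, using scikit-learn; the data, feature names, and measurements are invented purely for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy historical data: [surface_roughness, reflectivity] per product,
# labeled 1 = scratched, 0 = ok (made-up values, not real measurements).
X_train = [
    [0.90, 0.20], [0.80, 0.30], [0.85, 0.25],  # scratched
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],  # ok
]
y_train = [1, 1, 1, 0, 0, 0]

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The trained model now maps new, unseen inputs to outputs on its own.
print(model.predict([[0.88, 0.22]]))  # -> [1], i.e., likely scratched
```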
The resulting software can then either be used to automate the task completely, or we can keep a human in the loop who can intervene and correct the suggestions made by the model.
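One simple way to keep a human in the loop is to only automate predictions the model is confident about and escalate the rest. The sketch below reuses the toy scratch-detection setup from above; the 0.8 confidence threshold is an arbitrary assumption:

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as above: [surface_roughness, reflectivity]
X_train = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]]
y_train = [1, 1, 0, 0]  # 1 = scratched, 0 = ok
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def classify_or_escalate(features, threshold=0.8):
    """Automate confident predictions, route uncertain ones to a person."""
    proba = model.predict_proba([features])[0]
    label = int(proba.argmax())
    if proba[label] >= threshold:
        return label, "automated"       # model decides on its own
    return label, "needs human review"  # a person checks the suggestion

print(classify_or_escalate([0.85, 0.25]))  # clear-cut case -> likely automated
print(classify_or_escalate([0.50, 0.55]))  # ambiguous case -> likely escalated
```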