Data is the new oil!?

Let’s take a step back, because it all begins with data. You’ve probably heard this claim before: “Data is the new oil!” This suggests that data is valuable. But is it?
Oil is considered valuable because we have important use cases for it: powering our cars, heating our homes, and producing plastics or fertilizers. Similarly, our data is only as valuable as what we make of it. So what can we use data for?

The main use cases fall into one of two categories, insights and automation:

Insights

We can generate insights either through continuous monitoring (“Are we on track?”) or a targeted analysis (“What’s wrong?”).

Monitoring

By visualizing important variables or metrics in reports or dashboards, we increase transparency of the status quo and quantify our progress towards some goal.

For this purpose, we often devise so-called Key Performance Indicators (KPIs), i.e., custom metrics that tell us how well things are going. For example, if we’re working on a web application, one KPI we might want to track could be “user happiness”. Unfortunately, true user happiness is difficult to measure, but we can instead check the number of users returning to our site and how long they stay, and then combine these and other measurements into a proxy variable that we call “user happiness”.

A KPI is only a reliable measure if it is not simultaneously used to control people’s behavior, as they will otherwise try to game the system (Goodhart’s Law). For example, if our goal is high-quality software, counting the number of bugs in our software is not a reliable measure of quality if we simultaneously reward programmers for every bug they find and fix.

Ideally, these metrics are combined with alert thresholds that automatically notify us when things go south and a corrective action becomes necessary. For example, we could set up an alert on the health of a system or machine to notify a technician when maintenance is necessary.

For every KPI we should define:

  • the target state, i.e., the value or corridor that indicates that operations are running smoothly

  • the alert threshold, i.e., when a corrective action is necessary

  • what corrective action could be taken by whom and who needs to be informed about it

For example, one KPI for a customer service department could be the number of hours it takes for a customer request to be resolved. The target state could be ‘less than 48 hours’, and if the average exceeds 96 hours for more than a month, this could be a sign that the department needs to hire more service agents.
Unfortunately, what kind of corrective action will get us back on track is often not obvious and usually requires us to dig deeper into the data with an ad hoc analysis to identify the root cause of the problem.
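
The three parts of a KPI definition can be sketched in a few lines of Python. This is only a minimal illustration of the customer service example: the class `KpiDefinition`, the function `check_kpi`, the intermediate “watch” state, and all numbers are assumptions made for this sketch, and it looks at a single average instead of tracking whether the threshold was exceeded for more than a month.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class KpiDefinition:
    """Hypothetical container for the three things we define per KPI."""
    name: str
    target_max: float       # target state: averages below this are fine
    alert_threshold: float  # averages above this call for a corrective action
    action: str             # what corrective action could be taken
    owner: str              # who takes it or needs to be informed


def check_kpi(kpi: KpiDefinition, values: Sequence[float]) -> str:
    """Classify the average of the observed values as ok, watch, or alert."""
    average = sum(values) / len(values)
    if average < kpi.target_max:
        return "ok"
    if average <= kpi.alert_threshold:
        return "watch"  # outside the target state, but no alert yet
    return f"alert: {kpi.owner} should {kpi.action}"


resolution_time = KpiDefinition(
    name="hours until a customer request is resolved",
    target_max=48,
    alert_threshold=96,
    action="check whether more service agents are needed",
    owner="head of customer service",
)

# Made-up resolution times (in hours), for illustration only
print(check_kpi(resolution_time, [30, 41, 52, 39]))
print(check_kpi(resolution_time, [90, 120, 101, 99]))
```

Writing the owner and the action into the definition itself means an alert always arrives together with a first suggestion for what to do, even if the real root cause still has to be found.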

Ad Hoc Analysis

An ad hoc data analysis can help us answer questions such as

  • Why are we not reaching our goal?

  • What should we do next?

Arriving at satisfactory answers is often more art than science. But we have a multitude of tools and techniques to help us with this.

Exploratory Data Analysis

As described in more detail in the chapter on Data Analysis, in an exploratory analysis we dive deep into the data, e.g., by generating lots of plots to better understand the different variables, KPIs, and their relationships.
Example: By visualizing the number of customer requests we received over time, we notice a sharp increase around the same time a new feature of our software was released. After further investigation, we conclude that many users are confused by this new functionality and we need to make the user interface more self-explanatory.

Statistical Inference

Statistical inference enables us to draw conclusions that reach beyond the data at hand. Often we would like to make a statement about a whole population (e.g., all humans currently living on this earth), but we only have access to a few (hopefully representative) observations to draw our conclusion from. Statistical inference is about changing our mind under uncertainty: We start with a null hypothesis and then check if what we see in the sample dataset makes this null hypothesis look ridiculous, at which point we reject it and go with our alternative hypothesis instead.
Example: Your company has an online store and wants to roll out a new recommendation system, but you are unsure whether customers will find these recommendations helpful and buy more. Therefore, before going live with the new system, you perform an A/B test, where a percentage of randomly selected users see the new recommendations, while the others are routed to the original version of the online store. The null hypothesis is that the new version is no better than the original. But it turns out that the average sales volume of customers seeing the new recommendations is a lot higher than that of the customers browsing the original site. This difference is so large that in a world where the null hypothesis were true, it would be extremely unlikely that a random sample would give us these results. We therefore reject the null hypothesis and go with the alternative hypothesis that the new recommendations generate higher sales.
Read this article to learn more about the difference between analysts and statisticians and why they should work on distinct splits of your dataset.
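
One way to make “extremely unlikely under the null hypothesis” concrete for the A/B test above is a permutation test: if the new recommendations made no difference, the group labels would be interchangeable, so we can shuffle them many times and count how often chance alone produces a difference at least as large as the observed one. The sketch below is one possible choice of test, not necessarily the one used in practice; the function name `permutation_test` and the sales figures are made up for illustration.

```python
import random


def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """One-sided permutation test for 'treatment mean > control mean'.

    Returns an estimated p-value: the share of random relabelings whose
    difference in means is at least as large as the observed difference.
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_treatment = len(treatment)

    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign users to the two groups
        diff = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if diff >= observed:
            at_least_as_large += 1

    # Add 1 to both counts so the estimate is never exactly zero
    return (at_least_as_large + 1) / (n_permutations + 1)


# Made-up sales volume per customer, for illustration only
original_site = [20, 22, 19, 24, 21, 23, 18, 22, 20, 21]
new_recommendations = [27, 30, 26, 29, 31, 28, 27, 30, 29, 32]

p_value = permutation_test(original_site, new_recommendations)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely arise if the null hypothesis were true, which is the point at which we reject it.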

Predictive Analytics

Using historical data, we can generate a predictive model that makes predictions about future scenarios to aid with planning. These models are often created using supervised learning, a subfield of machine learning.
Example: Use sales forecasts to better plan inventory levels.
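
As a minimal sketch of this idea, the following code fits a straight-line trend to past monthly sales with ordinary least squares and extrapolates it to the next months. A linear trend is about the simplest predictive model we could choose and real forecasts usually need more (e.g., seasonality); the function names and the sales numbers are assumptions made for this illustration.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(values, steps_ahead):
    """Extrapolate the fitted trend to the next `steps_ahead` time steps."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(steps_ahead)]


# Made-up units sold per month, for illustration only
monthly_sales = [102, 108, 121, 127, 143, 149]
print(forecast(monthly_sales, steps_ahead=3))
```

The forecast values could then be compared against current stock to decide how much to reorder.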

Interpreting Predictive Models

Given a model that makes accurate predictions for new data points, we can interpret this model and explain its predictions to understand root causes in a process.
Example: Given a model that predicts the quality of a product from the process conditions, identify which conditions result in lower quality products.

What-if Analysis

Given a model that makes accurate predictions for new data points, we can use this model in a “what-if” forecast to explore how a system might react to different conditions to make better decisions (but use with caution!).
Example: Given a model that predicts the remaining lifetime of a machine component under some process conditions, simulate how quickly this component would deteriorate if we changed the process conditions.

Going one step further, this model can also be used inside an optimization loop that systematically evaluates different inputs to find the optimal settings.
Example: Given a model that predicts the quality of a product from the process conditions, automatically determine the best production settings for a new type of raw material.
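
Both ideas can be sketched with a few lines of Python. Here, `predict_quality` is only a stand-in for a trained model (a hypothetical formula, not something learned from data), and the settings `temperature` and `pressure` as well as their value ranges are assumptions made for this illustration. The optimization loop is a plain grid search, the simplest possible choice.

```python
from itertools import product


def predict_quality(temperature, pressure):
    """Stand-in for a trained model (hypothetical formula, not learned)."""
    return 100 - 0.05 * (temperature - 180) ** 2 - 2.0 * (pressure - 5) ** 2


def best_settings(model, temperatures, pressures):
    """Evaluate the model on every combination and return the best one."""
    candidates = product(temperatures, pressures)
    return max(candidates, key=lambda settings: model(*settings))


# What-if analysis: compare the prediction for current and changed conditions
current = predict_quality(temperature=160, pressure=5)
changed = predict_quality(temperature=170, pressure=5)
print(f"predicted quality changes by {changed - current:+.1f}")

# Optimization loop: systematically try all combinations on a grid
temperatures = range(150, 211, 10)  # 150, 160, ..., 210
pressures = [3, 4, 5, 6, 7]
print(best_settings(predict_quality, temperatures, pressures))
```

One reason for caution: the search should stay within the range of conditions covered by the training data, since a model's predictions for settings it has never seen can be arbitrarily wrong.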

Automation

As described in the following sections, machine learning models can be used to automate ‘input → output’ tasks otherwise requiring a human (expert). These tasks are usually easy for an (appropriately trained) human, for example:

  • Translating texts from one language into another

  • Sorting out products with scratches when they pass a checkpoint on the assembly line

  • Recommending movies to a friend

For this to work, the ML models need to be trained on a lot of historical data (e.g., texts in both languages, images of products with and without scratches, information about different users and which movies they watched).

The resulting software can then either be used to automate the task completely, or we can keep a human in the loop who can intervene and correct the suggestions made by the model.
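
A common way to keep a human in the loop is to accept only those predictions the model is confident about and pass the rest on for manual review. The following sketch assumes the model outputs a confidence score between 0 and 1 together with each label; the function `route_prediction`, the threshold of 0.9, and the example predictions are made up for illustration.

```python
def route_prediction(label, confidence, threshold=0.9):
    """Accept confident predictions automatically, send the rest to a human."""
    if confidence >= threshold:
        return ("automatic", label)
    return ("human review", label)


# Made-up model outputs for products passing the checkpoint
predictions = [("scratch", 0.97), ("ok", 0.62), ("ok", 0.99)]

for label, confidence in predictions:
    decision, suggested_label = route_prediction(label, confidence)
    print(f"{suggested_label!r} ({confidence:.2f}) -> {decision}")
```

Raising the threshold sends more cases to the human and lowers the risk of automated mistakes, while lowering it automates more of the task.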