Solving problems with ML

Solving “input → output” problems with ML requires three main steps:

[Figure: the three main steps for solving problems with ML]

1. Identify a suitable problem

The first (and arguably most important) step is to identify where machine learning can (and should) be used in the first place.

ML project checklist

Motivation

  • What problem do you want to solve?
    Machine learning can help you in various ways by generating insights from large amounts of (possibly unstructured) data, improving decision making and planning processes by providing predictions about future events, or automating tedious tasks otherwise requiring human experts.
    Where do you see a lot of inefficiencies around you that could be mitigated by a better use of data? For example, you could look for opportunities to decrease wasted resources / time / costs or increase revenue / customer satisfaction / etc.
    To systematically identify problems or opportunities, it can be helpful to create a process map or customer journey map.

  • In what way(s) would this generate value for your organization?
    How could your organization make money on this or reduce costs?

    • Could this improve an internal process (e.g., a process could be run more efficiently using the insights from an analysis, or a tedious task that would otherwise require a human worker could be automated with an ML model)?

    • Could the ML model be integrated as a new feature within an existing product and thereby, e.g., make this product more appealing to customers?

    • Could the ML solution be sold as an entirely new product, e.g., offered as a Software-as-a-Service (SaaS) solution?

    Please note that how the ML solution will be used in the end might also be a strategic decision that can be different for every organization. For example, an ML solution that recognizes scratches in produced products might be used by one company to improve their internal production process, while another company that produces the machines that make the products could integrate this as a new feature in their machines, and a third company might offer this as a SaaS solution compatible with different production lines.

  • How do you know you’ve accomplished your goal?
    What would success look like, i.e., what’s your definition of ‘done’? Can you quantify the progress towards your goal with a KPI?
    What is the status quo, i.e., how far are you from your goal right now?

  • What impact would this project have?
    Think of the impact in terms of

    • Magnitude: Small improvement or revolution? Will the solution result in a strategic advantage?

    • Scale: How often will this be used? How many users/customers/employees will benefit?

      For example:

      • Small process optimization, but since this process is used every day across the whole organization, it saves countless hours

      • New feature that revolutionizes the product and sets you apart from the competition, but the market for it is tiny

    • Would this have any valuable side effects? What will be different? Any additional opportunities that could arise from this? Can you create synergies between departments that work with similar data?

Solution Outline

  • What are the deliverables?
    Does the solution consist of a piece of software that is deployed somewhere to continuously make predictions for new data points, or are you more interested in the insights gained from a one-off analysis?

  • Is there a simpler solution, i.e., without using ML?
    Use ML only when the rules that map inputs to outputs are unknown or too complex to specify by hand, i.e., when they need to be learned from data; otherwise a simpler rule-based solution is usually easier to build and maintain.

  • Can the problem be solved with an existing ML algorithm?
    Ask an ML expert whether a similar problem has already been solved before. Instead of spending years on research to come up with a novel algorithm, it might also be possible to break the input-output problem down into simpler subproblems with known solutions.

  • What is the input data?

    • What is one data point / sample / observation?

    • What kind of inputs does the ML model receive (e.g., image / text / sensor measurements / etc.)?

  • In case of a software solution, how will the ML model be integrated with the existing setup?

    • Which system do the inputs for the ML model come from? What happens to the outputs of the ML model?

    • How will the ML model be deployed (e.g., cloud or edge device — see step 3 below)? Does this require any special hardware?

    • What are the plans w.r.t. pipelines for future data collection, model monitoring, and automated retraining?

  • Should you build this yourself or can you buy it?
    Does the solution require unique subject matter expertise that is only available at your organization, e.g., because you’re analyzing data generated by your own specific processes or machines? And/or will the solution be a key part of your business, e.g., a new feature that makes your product more attractive?
    Or is this a common (but complex) problem, for which a solution already exists (e.g., offered as a Software-as-a-Service (SaaS) product), that you could buy off the shelf? For example, extracting the relevant information from scanned invoices to automate bookkeeping processes is a relatively complex task for which many good solutions already exist, so unless you are working in a company building bookkeeping software and plan to sell a better alternative to these existing solutions, it probably doesn’t make sense to implement this yourself.

    Some general points you might want to consider when deciding whether to buy an ML solution or build it yourself:

    • How much effort would be required in terms of preprocessing your data before you could use the off-the-shelf ML solution?

    • How difficult would it be to integrate the output from the off-the-shelf ML solution into your general workflow? Does it do exactly what you need or would additional post-processing steps be required?

    • How reliable is the off-the-shelf ML solution? Are there any benchmarks available and/or can you test it with some common examples and edge cases yourself?

    • How difficult would it be to implement the ML solution yourself? For example, what kind of open source libraries exist that solve such a task? Do you have the necessary ML talent or would you need to hire, e.g., freelancers?

    • Can the off-the-shelf ML solution be deployed in-house or does it run on an external server and would this bring with it any data privacy issues?

    • How high are the on-going licensing fees and what is included in terms of maintenance (e.g., how frequently are the models retrained)?

    Unless the ML solution will be an integral part of your business, in the end it will probably come down to comparing costs for developing, implementing, running, and maintaining the system yourself vs. costs for integrating the off-the-shelf solution into your existing workflow (incl. necessary data preprocessing) and on-going licensing fees.

Challenges & Risks

  • Is there enough high-quality data available?
    ML models need the right inputs and learn best from unambiguous targets.

    • How much data was already collected (including rare events)? Is this the right data or do you need, e.g., additional labels?
      → Ask a subject matter expert whether she thinks all the relevant input data is available to compute the desired output. This is usually easy to determine for unstructured data such as images — if a human can see the thing in the image, ML should too. But for structured data, such as a spreadsheet with hundreds of columns of sensor measurements, this might be impossible to tell before doing any analysis on the data.

    • How difficult is it to get access to all of the data and combine it neatly in one place? Who would you need to talk to in order to set up or improve the data infrastructure?

    • How much preprocessing is necessary (e.g., feature engineering, i.e., computing new variables from the existing measurements)? What should be the next steps to improve data quality and quantity? (A quick programmatic data audit, like the sketch after this checklist, can help answer these questions.)

  • What would be the worst case scenario when the model is wrong?
    Your ML system (like humans) will make mistakes. This is especially true since your input data will probably change over time and users might even try to intentionally deceive the system (e.g., spammers come up with more sophisticated messages if their original ones are caught by the spam filter).
    What would be the worst case scenario and how much risk are you willing to take?
    Instead of going all in with ML from day 1, is there a way your system can be monitored in the beginning while still providing added value (i.e., a human-in-the-loop solution)?

  • What else could go wrong (e.g., legal issues / ethical concerns)?
    Are there any concerns w.r.t. data privacy? What about accountability, i.e., do the decisions of the machine learning model need to be transparent, for example, if someone is denied credit because of an algorithmically generated credit score? Why might users get frustrated with the solution?

For more details check out this blog article.
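
To make the data-related questions in the checklist more concrete, here is a small sketch of the kind of quick data audit referenced above, using pandas (the file name and column names are assumptions for illustration):

```python
import pandas as pd

# load the data collected so far (hypothetical file and column names)
df = pd.read_csv("sensor_data.csv")

# how much data is there?
print(f"{len(df)} samples with {df.shape[1]} columns")

# which inputs are incomplete, i.e., where might additional data collection be needed?
print("missing values per column:")
print(df.isna().sum())

# are rare events (e.g., defective parts) represented at all?
print("target distribution:")
print(df["label"].value_counts(normalize=True))

# duplicates can inflate the apparent amount of data
print(f"{df.duplicated().sum()} duplicate rows")
```

Such an audit won’t answer the business questions, but it gives you an honest picture of how much usable data you actually have before committing to a project.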

2. Devise a working solution

Once a suitable “input → output” problem has been identified, historical data needs to be gathered and the right ML algorithm needs to be selected and applied to obtain a working solution. This is what the next chapters are all about.

To solve a concrete problem using ML, we follow a workflow like this:

[Figure: the typical ML project workflow]

We always start with some kind of question or problem that should be solved with ML. To solve it, we need data, which we most likely have to clean before we can work with it (e.g., merge different Excel files, fix missing values, etc.). Then it’s time for an exploratory analysis to better understand what we’re dealing with. Depending on the type of data, we also need to extract appropriate features or engineer additional ones, for which domain knowledge / subject matter expertise is invaluable. All these steps are grouped under "preprocessing" (red box), and they are not linear, as we often find ourselves jumping back and forth between them. For example, by visualizing the dataset, we realize that the data contains some outliers that need to be removed, or after engineering new features, we go back and visualize the dataset again.

Next comes the ML part (green box): we normally start with some simple model, evaluate it, try a more complex model, experiment with different hyperparameters, ... and at some point realize that we’ve exhausted our ML toolbox and are still not happy with the performance. This means we need to go back and either engineer better features or, if this also doesn’t help, collect more and/or better data (e.g., more samples, data from additional sensors, cleaner labels, etc.).

Finally, when we’re confident in the model’s predictions, there are two routes we can take: either the data science route, where we communicate our findings to the stakeholders (which most likely results in further questions), or the ML software route, where the final model is deployed in production. Here it is important to continuously monitor the model’s performance and to collect new data so that the model can be retrained, especially as the inevitable data or concept drifts occur. Above all, working on a machine learning project is a very iterative process.
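
For a tabular dataset with numeric measurements, a first pass through this workflow might look like the following sketch using pandas and scikit-learn (the file name, column names, and choice of a random forest are assumptions for illustration; in practice we iterate over these steps many times):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# load & clean the raw data (hypothetical file and column names)
df = pd.read_csv("production_data.csv")
df = df.dropna(subset=["defect"])             # drop samples without a label
df = df.fillna(df.median(numeric_only=True))  # naive imputation of missing measurements

# feature engineering: derive an additional variable from existing measurements
df["temp_diff"] = df["temp_out"] - df["temp_in"]

# split off a test set to get an honest performance estimate on unseen samples
X = df.drop(columns=["defect"])
y = df["defect"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# start with a robust baseline model, evaluate it, then iterate
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```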

Unfortunately, due to a lack of standardized data infrastructure in many companies, the sad truth is that data scientists usually spend (at least) about 90% of their time collecting, cleaning, and otherwise preprocessing the data to get it into a format where the ML algorithms can be applied:

[Figure: how a data scientist’s time is typically spent]

While sometimes frustrating, the time spent cleaning and preprocessing the data is never wasted, as ML algorithms can only achieve decent results when they are built on a solid data foundation.
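
As a small illustration of what this cleaning work often looks like in practice, the following sketch merges several Excel exports into one tidy table with pandas (the folder layout, file names, and columns are assumptions for illustration):

```python
from pathlib import Path

import pandas as pd

# combine monthly Excel exports into one table (hypothetical folder and file names)
files = sorted(Path("exports").glob("orders_*.xlsx"))
df = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)

# harmonize column names and data types across the merged files
df.columns = df.columns.str.strip().str.lower()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# remove exact duplicates and rows where key information is missing
df = df.drop_duplicates()
df = df.dropna(subset=["order_id", "order_date"])

# fill remaining gaps in numeric columns with a neutral default
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(0)

# save the cleaned dataset for the subsequent analysis & modeling steps
df.to_csv("orders_clean.csv", index=False)
```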

3. Get it ready for production

When the prototype solution has been implemented and meets the required performance level, it has to be deployed, i.e., integrated into the general workflow and infrastructure so that it can actually be used to improve the respective process in practice (as a piece of software that continuously makes predictions for new data points). There are generally two strategies for doing this:

  1. The ML model runs on an “edge” device, i.e., on each individual machine (e.g., a mobile phone) where the respective data is generated, and the output of the model is used directly in subsequent process steps. This is often the best strategy when results need to be computed in real time and/or a continuous Internet connection cannot be guaranteed, e.g., in self-driving cars. However, the downside is that, depending on the type of ML model, comparatively expensive computing equipment needs to be installed in each machine, e.g., GPUs for neural network models.

  2. The ML model runs in the “cloud”, i.e., on a central server, e.g., in the form of a web application that receives data from individual users, processes it, and sends back the results (see the sketch below). This is often the more efficient solution if a response within a few seconds is sufficient for the use case. However, processing personal information in the cloud also raises privacy concerns. One of the major benefits of this solution is that it is easier to update the ML model, for example, when more historical data becomes available or if the process changes and the model now has to deal with slightly different inputs (we’ll discuss this further in later chapters).
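
As an illustration of the second strategy, here is a minimal sketch of such a web application using FastAPI and a previously trained and saved scikit-learn model (the model file, feature names, and endpoint are assumptions for illustration; a production setup would add input validation, authentication, logging, and monitoring):

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# model trained and saved beforehand, e.g., with joblib.dump(model, "model.joblib")
model = joblib.load("model.joblib")

class Features(BaseModel):
    temp_in: float
    temp_out: float
    pressure: float

@app.post("/predict")
def predict(features: Features):
    # turn the request into the tabular format the model was trained on
    X = pd.DataFrame([features.dict()])
    prediction = model.predict(X)[0]
    return {"prediction": str(prediction)}

# run locally with: uvicorn main:app --reload  (assuming this file is saved as main.py)
```

A client (or another system) then sends a POST request with the feature values as JSON and receives the model’s prediction in the response.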

→ As these decisions heavily depend on your specific use case, they go beyond the scope of this book. Search online for “MLOps” or read the book Designing Machine Learning Systems to find out more about these topics, and consider hiring a machine learning or data engineer to set up the required infrastructure in your company.