A Practitioner’s Guide to Machine Learning

Conclusion

Now that you’ve learned a lot about machine learning (ML) theory, especially the different algorithms, it is time for a reality check.

Hype vs. Reality

In the introduction, we saw many examples that contribute to the ML hype. However, when applying ML in practice, for example in the manufacturing industry, the reality often looks quite different, and not every idea works out as hoped:

Hype: Big Data, Deep Learning	Reality:
Database with millions of examples	150 manual entries in an Excel sheet
Homogeneous unstructured data (e.g., pixels, sound, text)	Measurements from different sources with different scales (e.g., temperature, flow, pressure sensors)
Fancy deep learning architectures	Neural networks are tricky to train and even more difficult to explain → Need to understand and trust predictions to make business decisions

But it can be done! A good example comes from the startup alcemy, which uses ML to optimize the production of CO2-reduced cement. They describe how they overcame the above-mentioned challenges in this talk.

What if I do have Big Data?: Practical definition:
Big Data is what does not fit in your RAM anymore.

Solutions:

Do you really have Big Data?
Measurements from 100 sensors every second for one year:
❌ \(100\times 365 \times 24 \times 60 \times 60 \times 64 \text{ bits} \approx 25\text{ GB}\)
Process changes slowly → take hourly averages:
✅ \(100\times 365 \times 24 \times 64 \text{ bits} \approx 7\text{ MB}\)
Get more RAM! (E.g., through a cloud service like AWS).
Use an ensemble method, i.e.,
a) split the data into equal (RAM-sized) chunks,
b) train a model on each chunk,
c) combine the predictions of all models.
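
The size estimates in the first item can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every measurement is stored as a 64-bit (8-byte) value and ignores timestamps and any storage overhead. The function name `dataset_size_bytes` is made up for this illustration.

```python
# Rough size estimate for sensor data stored as 64-bit values.
BYTES_PER_VALUE = 64 // 8  # 64 bits = 8 bytes


def dataset_size_bytes(n_sensors, samples_per_day, n_days=365):
    """Raw size in bytes, ignoring timestamps and storage overhead."""
    return n_sensors * samples_per_day * n_days * BYTES_PER_VALUE


# One measurement per second vs. one hourly average, for 100 sensors and one year.
per_second = dataset_size_bytes(100, samples_per_day=24 * 60 * 60)
per_hour = dataset_size_bytes(100, samples_per_day=24)

print(f"one value per second: {per_second / 1e9:.1f} GB")
print(f"hourly averages:      {per_hour / 1e6:.1f} MB")
```

Whether hourly averages are acceptable depends on how quickly the underlying process changes, so the aggregation interval is a modeling decision, not just a storage trick.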

Splitting the data, processing each chunk separately, and combining the results is also the idea behind ‘big data’ frameworks, e.g., the MapReduce approach. These frameworks are especially useful when the data no longer fits onto a single hard drive.
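
The chunk-wise ensemble from steps a) to c) could look roughly like the following sketch. It is not taken from any particular library: the helper names (`fit_line`, `chunks`, `train_ensemble`, `predict`) and the toy data are assumptions for illustration, and a one-feature least-squares line stands in for whatever model you would actually train. It also assumes the rows are in random order, so that every chunk is representative of the whole dataset.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def chunks(xs, ys, chunk_size):
    """a) Yield consecutive chunks that are small enough to fit in RAM."""
    for start in range(0, len(xs), chunk_size):
        yield xs[start:start + chunk_size], ys[start:start + chunk_size]


def train_ensemble(xs, ys, chunk_size):
    """b) Train one model per chunk."""
    return [fit_line(cx, cy) for cx, cy in chunks(xs, ys, chunk_size)]


def predict(models, x):
    """c) Combine the models by averaging their predictions."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


# Toy data (made up): y = 2x + 1 plus noise. In a real setting, each chunk
# would be read from disk one at a time instead of keeping everything in memory.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(10_000)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

models = train_ensemble(xs, ys, chunk_size=1_000)
print(len(models), "models, prediction at x=5:", round(predict(models, 5.0), 2))
```

Averaging is only one way to combine the models; for classification, a majority vote over the per-chunk predictions would be the analogous choice.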

Machine Learning is just the tip of the iceberg: You were already warned that in their day-to-day operations, data scientists usually spend only about 10% of their time doing the fun machine learning stuff, while the bulk of their work consists of gathering and cleaning data. This is true for an individual ML project. If your goal is to become a data-driven enterprise that uses AI in production for a wide range of applications, there are some additional challenges that should be addressed — but which would typically not be the responsibility of a data scientist (alone):

See also: Sculley, David, et al. “Hidden technical debt in machine learning systems.” Advances in Neural Information Processing Systems. 2015.

On the plus side, things like a centralized data infrastructure and a clear governance process only need to be set up once, and then all future ML projects will benefit from them.

Domain knowledge is key!: In the introduction, you saw the Venn diagram showing that ML lies at the intersection of math and computer science. However, this is not the complete picture. In the previous chapters, you’ve hopefully picked up on the fact that, in order to build trustworthy models that use meaningful features to arrive at robust conclusions, it is necessary to combine ML with domain knowledge and an understanding of the business problem, which is then often referred to as Data Science:

As we will argue in the next section, it is unrealistic to expect an individual data scientist to be an expert in all three areas, and we therefore instead propose three data-related roles to divide responsibilities in an organization.

Take Home Messages

ML is and will be transforming all areas of our lives, including work.
ML has limitations:
- Performance: Some problems are hard.
- Data Quality & Quantity: Garbage in, garbage out!
- Causality & Adversarial Attacks ⇒ Explainability!!
Combine ML with subject matter expertise!
It’s an iterative process:
- Don’t expect ML to work right away!
- Monitor and update after initial release.

Also:

Be clear about what you want to do (inputs & outputs; model type; evaluation metric).
Data preprocessing and feature engineering are at least as important as the “real” ML stuff.
Fancy deep learning methods might not work with the data in your Excel sheet.
But linear models and decision trees are great too (with the right features).
Always be careful when evaluating your models; manually examine some errors.
KNOW YOUR DATA!
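
As one possible way to act on the advice to manually examine some errors, the sketch below sorts the test samples by absolute error and prints the worst ones for inspection. Everything in it is an assumption for illustration: the data is synthetic, three test rows are deliberately corrupted to simulate recording mistakes, and a simple least-squares line stands in for the real model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Toy data (made up): y = 3x plus a little noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + rng.gauss(0, 0.2) for x in xs]

# Simulate recording mistakes in three test rows (e.g., a wrong offset or unit).
corrupted = [120, 150, 180]
for i in corrupted:
    ys[i] += 50

# Train on the first half, evaluate on the second half.
slope, intercept = fit_line(xs[:100], ys[:100])
errors = [(i, abs(ys[i] - (slope * xs[i] + intercept))) for i in range(100, 200)]

# Look at the individual samples with the largest errors, not just the average.
worst = sorted(errors, key=lambda item: item[1], reverse=True)[:5]
for i, err in worst:
    print(f"row {i}: x={xs[i]:.2f}, y={ys[i]:.2f}, abs. error={err:.2f}")
```

In this toy setup the corrupted rows end up at the top of the list, which an aggregate metric alone would not reveal; on real data, the worst cases tend to be a good starting point for a conversation with a subject matter expert.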