Saturday, September 28, 2019

Practical Steps for Building Intelligent Systems

This month we have a talk encouraging us to think more broadly than a specific algorithm or library, and to consider wider questions: understanding the problem, evaluating solutions, understanding data and bias, measuring performance and accuracy, and weighing the benefits of a solution.


The slides for this talk are here: [pdf].

A video recording of the session is on the group's YouTube channel: [link].


The Challenge

A naive approach to building a solution to a machine learning problem is to pick a learning algorithm, train the model with data, and then use it.

This is naive for many reasons. Here are just some of them:

  • We didn't think about the suitability of the data we're training with
  • We didn't understand the accuracy of the trained model
  • We didn't consider alternative learning algorithms
  • We didn't understand the value of the solution, comparing benefits to costs, not all of which are technical


We were very lucky to have Aleksandra Osipova, a data scientist at Headforwards, to lead this session, encouraging us to think about these broader important questions in the context of a methodology for developing machine learning solutions.

Aleksandra has a master's in Complex System Modelling from King's College London, where she worked on computational neuroscience.


Overview

The following diagram summarises the key areas covered by Aleksandra's talk.


She started by outlining a set of key questions we should be asking from the very start of a machine learning problem:

  • why - what's the problem
  • what - what are the possible solutions
  • how - machine learning techniques
  • how well - robustness, accuracy, scalability
  • value - benefits vs costs including operational and people
  • risks - data bias and incompleteness, ethics



The Problem

Aleksandra rightly emphasised the still too common enthusiasm for a technical solution which doesn't actually solve the problem at hand, or, more fundamentally, doesn't address the users' needs.

User-centric approaches, encouraged by agile and similar methodologies, can really help here by enforcing a solution-agnostic analysis of the problem.

Sometimes the best solution is not a complex machine learning model, and your process needs to give you the ability to find these simpler options.


Data

Data is how we train our machine learning models. Its importance is hard to overstate: to a very large extent, it is the only thing the model can learn from.

Our analysis should identify which data we need to solve a problem. Often we don't have ideal or perfect data, and so the challenge is to understand how useful the data we do have is.

This is the step that carries most risk. Failure modes include:

  • poor data quality that isn't identified
  • insufficient data to actually learn from
  • bias in data
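
As a rough illustration of the kind of checks that can catch these failure modes early, here is a minimal sketch using pandas; the file name and the "outcome" label column are hypothetical, and this is only one way such checks might look, not a prescribed method from the talk.

```python
import pandas as pd

# Hypothetical training data with a binary label column called "outcome".
df = pd.read_csv("training_data.csv")

# Poor data quality: how much of each column is missing?
print("Fraction missing per column:\n", df.isna().mean())

# Insufficient data: is there enough to learn from at all?
print("Number of rows:", len(df))

# Bias: a heavily skewed label (or demographic column) is a warning sign.
print("Label balance:\n", df["outcome"].value_counts(normalize=True))
```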


As with identifying the problem, Aleksandra underlined the importance of speaking with domain experts - people who understand or have experience of the problem domain.

Domain knowledge can help select data and shape the learning process, and can make an intractable problem feasible and a difficult problem easier.

It is a common saying that "80% of a data scientist's work is understanding the data". We'd extend this to understanding the problem domain too.

One area that Aleksandra picked up on, and which is too often overlooked, is that data itself can be dynamic. It is therefore important to explore how data behaves in its different dynamic phases. EEG brain wave data is a good example: widening our view beyond individual data points, we can see the signal moving through different categories of dynamic behaviour. Looking at individual data points alone would be too narrow a view.
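
One simple way to look beyond individual points is to summarise a signal over sliding windows, so different phases of behaviour become visible. A minimal sketch with a synthetic signal (not real EEG data, and not a technique from the talk itself):

```python
import numpy as np
import pandas as pd

# Synthetic signal with two distinct phases: low variance, then high variance.
rng = np.random.default_rng(42)
signal = np.concatenate([rng.normal(0, 0.1, 500), rng.normal(0, 1.0, 500)])

# Rolling statistics reveal the change in dynamics that single points would hide.
series = pd.Series(signal)
rolling_std = series.rolling(window=50).std()

print(rolling_std.iloc[100], rolling_std.iloc[800])  # small vs large variance
```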


Machine Learning

Before diving into a solution or technique, Aleksandra recommended we apply the discipline of developing and testing hypotheses.

Hypothesis testing is a mature, if not always well understood, field of statistics. It provides assurance that we haven't incorrectly selected a hypothesis when the statistical evidence in fact supports an alternative or opposite hypothesis. The Wikipedia page provides good illustrations of hypothesis testing, with the criminal trial analogy being particularly educational.
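
As a concrete illustration of the discipline - state a null hypothesis, then check whether the data gives enough evidence to reject it - here is a minimal two-sample t-test sketch using scipy. The two groups are synthetic, and the test choice is just one possibility.

```python
import numpy as np
from scipy import stats

# Null hypothesis: the two groups have the same mean.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=10.5, scale=2.0, size=200)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Only reject the null hypothesis if the evidence is strong enough.
alpha = 0.05
print(f"p-value = {p_value:.4f}, reject null: {p_value < alpha}")
```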

Staying broad in context, Aleksandra explored key questions around the machine learning step, many non-technical. For example:

  • is there a cost to obtaining and using the necessary data?
  • what is the computational cost of the given machine learning model?
  • does the machine learning model need to be updated online, even in real-time, or is it trained off-line?
  • is updating the model with new data possible, feasible?


As discussed before, data is key. Aleksandra again underlined the need to formalise the data preparation steps: collection, cleaning, perhaps normalisation, and transformation rules if needed. In theory these seem trivial, but in practical deployments they become key steps for monitoring data quality and raising an alert when incoming data is faulty.
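
One way to formalise those preparation steps is to express them as a single pipeline object, so the same cleaning and normalising rules are applied at training time and at prediction time. A minimal sketch with scikit-learn; the particular steps and the model choice are assumptions for illustration only.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Collection and cleaning happen upstream; the repeatable steps live here.
prep_and_model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # normalise features
    ("model", LogisticRegression(max_iter=1000)),   # an illustrative model
])

# prep_and_model.fit(X_train, y_train) applies every step in order, and
# prep_and_model.predict(X_new) reuses exactly the same transformations.
```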

During the data exploration phase, domain knowledge or other techniques can lead to a reduction of data, either by removing variables, or by combining them into more meaningful ones which train a model more effectively.
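
Principal component analysis is one common, purely data-driven way of combining correlated variables into fewer, more informative ones; domain knowledge often does a better job, but a small sketch (on synthetic data) shows the idea:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 raw variables, two of which are near-copies of others.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
X = np.hstack([base, base[:, :2] + rng.normal(scale=0.01, size=(300, 2))])

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # fewer, combined variables
```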

During the training phase, a very well known approach is to train on a subset of the data, and then to verify the model on a different subset not previously seen by the model. This makes a lot of sense - don't test on data you've already trained on.
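
A minimal sketch of that split using scikit-learn; the data here is synthetic and the classifier is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in practice X and y come from the prepared dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold back 20% as a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Evaluate only on the previously unseen data.
print("held-out accuracy:", model.score(X_test, y_test))
```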

The group discussed strategies for how this split can be made, and the conversation was inconclusive. It likely depends on the domain and model more than any theory about an optimal split.

Predicting the scalability of training or using a model is difficult. At one level, we can understand the computational complexity of the algorithms, but in practice these estimates can be disrupted by hardware and software effects, for example the characteristics of network-attached storage or saturation of a network. The best approach is to predict as well as we can, but also to test on representative infrastructure to avoid surprises where the impact of a scaling failure would be significant.
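
Such a test can be as simple as timing training runs at increasing data sizes and seeing how the cost actually grows. A rough sketch on synthetic data, with an arbitrary model chosen only for illustration:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Time training at increasing sizes to see how cost grows in practice,
# rather than relying on theoretical complexity alone.
for n in (1000, 2000, 4000, 8000):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    start = time.perf_counter()
    SVC().fit(X, y)
    print(f"n={n}: {time.perf_counter() - start:.2f}s")
```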


Evaluation

A trained model can give us correct answers, but that alone is insufficient assurance in real-world scenarios.

In the real world, we often need to know how often a trained model is correct/incorrect. Better yet, we should have an idea of how far wrong or right the answers are.

Aleksandra presented an excellent definition of precision and accuracy, both providing useful insights into how well a model works. The Wikipedia page has a good illustration of the difference.
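
In a classification setting these ideas map onto familiar metrics; a minimal sketch of computing accuracy and precision with scikit-learn, on made-up labels, might look like this:

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Accuracy: the fraction of all predictions that are correct.
print("accuracy:", accuracy_score(y_true, y_pred))

# Precision: of the cases predicted positive, how many really are positive.
print("precision:", precision_score(y_true, y_pred))
```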


The group discussed the relative merits of the loss functions often used in different kinds of learning model.

At one level, the shape of the loss function drives the learning (down an error gradient), but at another level of sophistication, the choice of loss function can affect the efficiency or robustness of learning. In reality, even if the theory is robust, it is not a substitute for testing against your own domain and data. There was also a discussion on whether loss functions are the same as regression metrics like mean squared error.
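
To make the robustness point concrete, here is a small sketch comparing squared-error and absolute-error summaries of the same residuals; the numbers are invented, and the squared version is dominated by the single outlier, which is exactly the property that changes how robust learning is.

```python
import numpy as np

# Residuals (prediction errors) for a hypothetical regression model,
# including one outlier.
residuals = np.array([0.1, -0.2, 0.05, 0.15, -3.0])

mse = np.mean(residuals ** 2)     # mean squared error: dominated by the outlier
mae = np.mean(np.abs(residuals))  # mean absolute error: more robust to it

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")
```

On the question raised in the discussion: the same formula, mean squared error say, can serve both as the loss that drives training and as a reported metric; the difference lies in the role it plays rather than in the mathematics.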

The term cross-validation is used often in the field, and simply refers to strategies for splitting a limited labelled data set into training and test sets to check that a trained model hasn't over- or under-fitted. Over-fitting is when a model simply memorises the training data: it appears to perform very well, but performs badly against previously unseen data. A better model will generalise and perform well against unseen data, even if it doesn't perform quite so well on the training data itself.
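
A minimal sketch of k-fold cross-validation with scikit-learn, again on synthetic stand-in data: each fold is held out once while the model trains on the rest, so every labelled example is used for validation exactly once.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a limited labelled data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold scores:", scores, "mean:", scores.mean())
```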

The following diagram from an accessible article on cross-validation illustrates the difference between validation error and training error.


As an aside, I remember ROC analysis as being very good at summarising various modes of failure in a simple visual form:
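
For reference, the curve itself can be computed directly from a model's scores; a minimal sketch on synthetic data, with an arbitrary classifier chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# False positive rate vs true positive rate at every score threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```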




Value, Benefits and Costs

Aleksandra talked about the non-technical implications of a model. She explored how different approaches have different needs in terms of skills and roles, and how this can represent either a cost or a source of organisational inertia.

This discussion was in the broader context of value. Not only is value related to computation, infrastructure and people cost, it must also be related to how well a solution solves a problem.

Members of the group discussed examples of huge efforts yielding only tiny gains or real-world effects. The disconnect between pure statistical measures and practical value was highlighted, leading to the conclusion that the correctness of a solution can't be understood without the context of its real-world impact.


Risks

The group discussion was vibrant and continued after the session. A key strand was the risks of naive use of a trained model:

  • biased outputs from biased training data
  • biased outputs from the learning algorithm / model itself
  • ethical issues around loss of privacy from learned insights, which data subjects were unaware of when they provided their data
  • ethical questions around "just because you can, should you?"



Reference

Aleksandra recommended the Tour of The Most Popular Machine Learning Algorithms as a good overview of methods and the kinds of problems they can be used for:



There is also a semi-humorous collection of loss functions which demonstrate either faulty methodologies or very challenging training scenarios: