Friday, May 15, 2020

Visualising And Optimising Classifier Performance With ROC Curves

This month we take a look again at measures of classifier performance, and learn about ROC curves which visualise classifier performance so different classifiers can be compared. We also see how ROC curves help us optimise tuneable classifiers.

There is an accompanying video on the group's youtube channel:


Slides used in the video are here: [link].


Classifier Performance Is Not One Number

Many machine learning models can be considered classifiers - deciding whether an input belongs to one of two or more classes.

The following shows an example of a model which classifies a patient as healthy or unwell with a disease.


How do we communicate a sense of this classifier's performance?

We could simply use a score of 4/5 because in testing it classified 5 people as healthy, but only 4 of the 5 were actually healthy. This seems reasonable.

However, we might care more about unwell people because healthy people aren't the problem that needs addressing. We could then say the score is 3/5 because of the 5 people it classified as unwell, only 3 were actually unwell.

Alternatively, the greater concern might be avoiding unwell people walking around the community after being incorrectly told they are well. In that case 4/5 might actually be the better score, or perhaps 1/5, which more directly reflects the fraction of people incorrectly told they are well.

What this tells us is that one score is rarely sufficient to understand how well a classifier performs.


Confusion Matrix

We've just discussed several scenarios where different measures are useful for different tasks. These include the fraction incorrectly classified as well, and the fraction correctly classified as unwell.


The above matrix, known as a confusion matrix, shows all four combinations of correct and incorrect classification:
  • True Positive TP - actually well and correctly classified as well
  • False Negative FN - actually well but incorrectly classified as unwell
  • False Positive FP - actually unwell but incorrectly classified as well
  • True Negative TN - actually unwell and correctly classified as unwell

These four numbers provide a fuller picture of how the classifier behaves. We can see the true positives and true negatives are higher than the false positives and false negatives. If we chose to, we could focus on an individual score if it was important, like the false positive (telling someone they are well when they are not).

It can be convenient to combine some of these four numbers into a score that is useful for a given scenario. The following are some very common scores:
  • accuracy is $\frac{(TP+TN)}{(TP+TN+FP+FN)}$ and expresses the correct classifications as a fraction of all classifications.
  • recall is $\frac{TP}{TP+FN}$ and expresses how many of the actually positive data are classified as positive.
  • precision is $\frac{TP}{TP+FP}$ and expresses how much of the data classified as positive is actually positive.
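
As a concrete illustration, here is a minimal Python sketch that computes these scores from the four confusion matrix counts. The counts are taken from the worked example above (4 true positives, 2 false negatives, 1 false positive, 3 true negatives); only the formulas matter.

# counts from the worked example above (well = positive class)
TP, FN, FP, TN = 4, 2, 1, 3

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # correct classifications as a fraction of all
recall    = TP / (TP + FN)                    # fraction of actually positive data classified positive
precision = TP / (TP + FP)                    # fraction of data classified positive that is actually positive

print(accuracy, recall, precision)            # 0.7 0.666... 0.8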


Understanding these scores intuitively doesn't happen quickly, but happens through use. It is better to understand the plain English concepts and look these scores up when needed.

Looking at extreme examples can help develop intuitive understanding. The following shows a classifier which, given the same data, classifies only one person as healthy, and everyone else as unwell.


The classifier is not wrong in its healthy classification. However it has lumped almost everyone in the unwell class, which is intuitively wrong. We can consider this classifier as extremely cautious in assigning a healthy classification. We can see the precision is perfect because there are no mistakes in the healthy class. However the recall is very low because it has missed so many actually well people. As a combined score, the accuracy is half, which makes sense because overall it has got half of the classifications wrong.

The next picture shows a different classifier, which given the same data, just classifies everyone as healthy. It probably did no work at all and just lumped everyone in that category.


We can see this time the recall is high, perfect in fact. That's because the classifier doesn't miss any healthy people. But because it has classified unwell people as healthy, the precision is poor.

The confusion matrix is an excellent concise summary of how a classifier behaves. If we want to compare several classifiers, a visual approach is more manageable than trying to digest several 2x2 confusion matrices.


A good way of visualising classifier performance, specifically the information contained in a confusion matrix, is a ROC plot.


ROC Plots

A ROC plot simply plots the True Positive Rate against False Positive Rate. These are just the TP and FP normalised to the range 0 to 1:

  • TPR is $\frac{TP}{\text{total P}} = \frac{TP}{TP+FN}$
  • FPR is $\frac{FP}{\text{total N}} = \frac{FP}{FP+TN}$


The TPR takes account of FN as well as TP, and similarly, the FPR takes account of TN as well as FP. As such, the two scores TPR and FPR reflect all four values in a confusion matrix.

As a side note, you might have expected ROC plots to use TNR and not FPR. The FPR became the standard for historical reasons. In any case FPR represents the same information as TNR because FPR = 1 - TNR.


The following shows several key points on the ROC plot. It is worth becoming familiar with them as they are a good mental reference to compare real world classifiers with.


A classifier that assigns everything negative has zero TPR, but also zero FPR. This is represented as a point in the bottom left corner of the square ROC plot. It is worth remembering this is a useless classifier.

A classifier that assigns everything positive has a TPR of 1.0, but also an FPR of 1.0. This useless classifier is plotted at the top right of the ROC plot.

A perfect classifier which makes no mistakes at all has a TPR = 1.0, and a FPR = 0.0. This sits at the top left of the ROC plot.

A useless classifier which randomly assigns the positive classification to half the data has a TPR of 0.5, and a FPR of 0.5. This sits in the middle of the plot.

These examples show us that the useless classifiers seem to sit on the diagonal from the bottom left to the top right, and the perfect classifier sits at the top left.

This is correct. The following shows additional useless classifiers which randomly assign a positive classification for 30% and 80% of data. Whenever TPR = FPR it means the classifier is not using any useful information to assign classifications.


If that diagonal line is where useless classifiers sit, and the ideal classifier is at the top left, it means we want good classifiers to be above and to the left of points on that diagonal.


We can see that moving a point up means increasing the true positive rate. This is good. Moving to the left means reducing the false positive rate. This is also a good thing. Doing both is even better!

The following ROC plot shows the three classifiers we discussed earlier.


We can see that the moderately good classifier is in the top left part of the ROC plot. It's not an excellent classifier and that's why it is not closer to the top left corner.

We can also see the classifier that assigns all data to be positive at the top right. It is on the "useless" line as expected.

The very bad classifier which only managed to identify one positive instance is at the bottom left very close to the corner. It's not perfectly bad, but it is pretty bad!

We've seen how ROC plots make it easy to see and compare the performance of classifiers.

Tuning Classifiers

Many classifiers can be tuned or configured. These might be simple settings for thresholds, or more sophisticated changes like the number of nodes in a neural network.

Altering these settings changes the performance of the classifier. Sometimes the change is simple, but sometimes it is not.

The following shows our well/unwell classifier with a parameter that can be tuned.


It is good practice to run experiments with varying settings and record how well the classifier did on test data. Each test will result in a confusion matrix which we can plot on a ROC chart.


In the above illustration we can see several points from several experiments. The dots form a curve, called a ROC curve.

It is easy to see that as the setting is changed from one extreme to another, the classifier starts badly, improves, and then gets worse again. We can read off the point which is closest to the top-left ideal corner, and this will be the ideal setting for that classifier.

Easy!


Classifiers With Scores

Sometimes we can't adjust or tune a classifier. In these cases we can sometimes look behind the classification and find a value that led to that classification, often a probability. For example, neural networks have output nodes that have continuous values, often in the range 0.0 to 1.0, and these can be interpreted as the probability of a positive classification.

We can still take advantage of the ROC method for finding an optimal classifier. We do this by ranking the test data by score, for example descending probability.

We then use a threshold value above which we say the classifier should produce a positive classification. Of course this might not match the actual classes of the data. We produce a confusion matrix for the data above that threshold, and plot the point on the ROC curve. We repeat this for different thresholds, each time producing a confusion matrix, which leads to a point on a ROC plot.

As we lower the threshold, more points are classified as positive. This also increases the false positives, and we move towards the top right of the ROC plot.
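
If the sklearn library is available, its roc_curve function performs this ranking and threshold sweep for us. The following is a minimal sketch with invented labels and scores, purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# invented test data: actual classes and the classifier's scores (probabilities)
y_true  = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# sweep the threshold over the scores, getting one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker='o')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()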


The resulting dots on a ROC plot allow us to easily read off the ideal threshold which balances true positive and false positives.

In some cases, this method allows us to find a threshold that is close to, but not exactly, 0.5, even when our intuition tells us it should be 0.5.


Area Under The Curve

ROC plots are rich with information, and we should resist the temptation to reduce their information into a single value. However, a score known as the area under the curve, or AUC, is commonly used. It is simply the area under a ROC curve.
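
If sklearn is available, the AUC can be computed directly from test labels and classifier scores. A minimal sketch, again with invented data:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# area under the ROC curve, a single number between 0 and 1
print(roc_auc_score(y_true, y_score))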


We can intuitively see that the top purple curve represents a better performing classifier or classifiers because all the points are closer to the top left. The area under the purple curve is larger than the area under the green curve.

It is easy to conclude that a larger AUC means a better classifier. However, it is worth being aware that this single value can hide important features of a ROC curve.


Here we can see the green curve actually gets closer to the top left than the purple curve, and so is a better classifier when configured properly. Simply comparing AUC would have misled us into thinking the purple classifier was better.


Class Skew?

A common concern is whether an imbalance between actual positive and negative data might skew the ROC curves or cause them to become inaccurate. This is a reasonable concern as many statistical techniques are in fact impacted by unbalanced classes.

ROC curves aren't affected by class imbalance because the TPR is calculated only from actually positive data, and the FPR only from actually negative data. If either of these were calculated using counts from both classes, an imbalance would affect the TPR and FPR values.

Let's look at an example to see this in action. The following shows the previous test data and confusion matrix.


The TPR is 4/6 and the FPR is 1/4.

If we now double the population of healthy people, that is, double the number of actually positive data, we have a new confusion matrix.


Doubling the population of actually well people has resulted in a doubling of the true positives, and also of the false negatives. The FP and TN counts are not affected at all because we still have 4 actually unwell people, and the same classifier treats them just as before.

Since TPR is a ratio, we have 8/12 which is the same as 4/6.

So class imbalance doesn't affect TPR and FPR and so ROC curves are not affected.
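
A quick Python check, using the counts from the example above, confirms that doubling the actually positive data leaves both ratios unchanged.

TP, FN, FP, TN = 4, 2, 1, 3
print(TP / (TP + FN), FP / (FP + TN))    # TPR 0.666..., FPR 0.25

TP, FN = 2 * TP, 2 * FN                  # double the actually positive data
print(TP / (TP + FN), FP / (FP + TN))    # still 0.666... and 0.25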

This is an important feature because it allows us to easily compare classifiers tested on different data sets.


Finding The Best Classifier - Simple

Up to this point we've seen how easy it is to read off the optimal classifier on a ROC plot. The intention was to underline this ease and simplicity.

Let's revisit this with a challenging ROC plot. Which of the two classifiers is better?


Both the green and purple classifiers seem to have points close to the ideal top left. In fact, their distance to the top left corner seems equal. How do we decide which is best?

The green classifier has a lower TPR but also a lower FPR at its optimal region. The purple classifier has a larger TPR but also a higher FPR.

We have to decide whether false positives are more costly to our scenario or not. In some situations, maximising the true positive rate no matter what is best. In other cases, we want to avoid false positives because the cost is too high, for example allowing someone to walk around the community with a dangerous infectious disease.

ROC curves don't take this decision away from us; they bring it to the fore and allow us to make informed decisions.


Finding The Best Classifier - Costs of False Positives and False Negatives

Here we'll look again at finding the best classifier by looking more closely at the relative costs of false positives and false negatives.

Let's start with the simple case where the cost of a false positive is the same as a false negative.


Here we move a line with gradient 1 down from the top left corner until it touches the ROC curve.  The point at which it touches the curve is considered the optimal classifier, or configuration of a classifier. This is what we have been doing before, but without saying we were using a line with gradient 1.

Now consider what happens if we know the cost of a false positive is twice the cost of a false negative. We want to be more cautious now and not accept as many false positives (because they cost more now). That means moving to the left on the ROC curve for our optimal point.


In fact, if you did the algebra, you'd find the slope of the line we use to touch the ROC curve is 2, to reflect the double cost of FP.

Similarly, if the cost of false positives was half that of false negatives, we'd use a line with gradient 0.5.


The false positive rate is higher now, and that's ok because they're cheap.

We might be tempted to say that the gradient is simply $\frac{\text{cost of FP}}{\text{cost of FN}}$, as we've seen so far.

Actually, the class balance or imbalance does matter now. If we have a very small number of actual negatives in the data, then even if each false positive carries a higher cost, false positives cannot occur in large numbers. So our gradient also needs to take the class balance into account.

$$\text{gradient} = \frac{\text{neg}}{\text{pos}} \times \frac{\text{cost of FP}}{\text{cost of FN}}$$

The video tutorial linked above includes an example which includes both factors.
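
As a small worked example with made-up numbers: if the test data contains 400 actual negatives and 100 actual positives, and a false positive costs twice as much as a false negative, then

$$\text{gradient} = \frac{400}{100} \times \frac{2}{1} = 8$$

and we would slide a steep line of gradient 8 down from the top left, which pushes the optimal operating point towards a low false positive rate.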


New Classifiers From Old

Have a look at the following ROC plots of two tuneable classifiers.


The optimal configuration for the green classifier has a moderate TPR but a low FPR. The optimal point for the purple classifier has a higher TPR but a higher FPR too. Using either on its own, we can't have a TPR-FPR combination that's somewhere between the two.

Can we combine the two classifiers in some way to achieve an intermediate performance? MJ Scott (paper linked below) proves that we can do this.


In the plot above we have the green classifier A, and purple classifier B. Scott proves that we can combine A and B to create a classifier C which has a TPR-FPR that lies on the line between the two. This is done by randomly using A and B on data instances in a proportion that matches how far along the line we want to be. Clearly, using A most of the time will mean we're closer to A. The following shows C which is halfway between A and B, achieved by using A and B half the time at random.
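
A minimal Python sketch of this idea shows how little machinery it needs; predict_a and predict_b below are hypothetical stand-ins for any two already-trained classifiers.

import random

def combined_predict(x, predict_a, predict_b, proportion_a=0.5):
    # delegate each instance to A with probability proportion_a, otherwise to B
    if random.random() < proportion_a:
        return predict_a(x)
    return predict_b(x)

# two toy threshold classifiers standing in for A and B
predict_a = lambda x: int(x > 0.3)   # more liberal: higher TPR but higher FPR
predict_b = lambda x: int(x > 0.7)   # more cautious: lower TPR and lower FPR

print([combined_predict(x, predict_a, predict_b) for x in [0.2, 0.5, 0.9]])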


This can be even more useful than interpolating performance.

The following shows a dataset which a simple linear classifier will find hard to partition well.


There are two classes, shown as blue and green, and the task of finding a threshold to cleanly separate the two is impossible.


Any attempt at a good threshold has large amounts of the wrong class on each side. The following shows the ROC plot for a classifier as this threshold varies.


We can see that as the threshold rises, the TPR stops rising while the FPR increases. At the halfway point, the classifier is as useless as a random classifier. Continuing on, we have a rise in TPR but also a large FPR too. We can see straight away that the ROC curve does not go anywhere near the ideal top left corner.

By picking the best points on this curve, we can combine them to achieve a classification rate that is closer to the ideal corner and not possible with the individual classifiers.

If you're not convinced this works, the video includes a worked example with new TPR and FPR rates worked out for a combined classifier.

The following is based on a plot showing experimental results from Scott's paper.


This example and illustrations are taken from Scott's paper.

Summary

The key benefits of ROC analysis are:
  • An easy to read and interpret visual representation of classifier performance. 
  • An easy visual way to compare classifiers, even different kinds of classifiers.
  • Independence from class balance allows comparison of classifiers tested on different data.
  • Makes finding an optimal classifier or configuration easy, even when we care about costs of classifier errors.
  • Combine classifiers to make one that can perform better than individual ones.


Further Reading

Saturday, April 11, 2020

Visualising High-Dimensional Data With t-SNE

This post is a beginner-friendly introduction to t-SNE, a method for visualising high-dimensional data in a smaller dimensional space, like a 2-d plot, whilst preserving clusters in that data.

There is an accompanying video on the group's youtube channel:


Slides used for the video are here: [link].

There is also sample Python code below illustrating how to use t-SNE.


Problem - Visualising High Dimensional Data

Often we're working with data that has many dimensions. For example, data about people might include features such as age, height, weight, walking speed, resting heart rate, average blood sugar level, hours of sleep, and so on.  Each of these is a dimension of the data.

Visualising data is always an excellent idea. Visualisations can tell us very quickly the broad nature of the data, and also reveal patterns and relations in that data. Visualisations are a good start to data investigation, not just the end point of an investigation.

However, visualising data with more than 2 or 3 dimensions is not easy. Data with 2 dimensions can be shown on a flat chart, a scatter plot for example. Data with 3 dimensions can be shown on a flat paper or computer display by simulating a 3D space, but these can risk obscuring data or misrepresenting relations in that data. Visualising data with 4 dimensions is much harder to do well.

Imagine trying to show 5 dimensions (age, height, weight, heart rate and hours of sleep) all on the same chart. That would be difficult to do well.

It would help if we could somehow show high-dimensional data on a 2-d plot. Reducing data down to 2-dimensions inevitably means losing information. So this reduction will only be useful if it preserves a desired element of the original data.

One useful element worth preserving is the closeness of data, so we can see clusters of related data points on that 2-d plot.

This is what t-SNE does.

The t-SNE method was developed around 2008 by researchers including Geoffrey Hinton of neural networks fame.


How t-SNE Works

A simplified explanation of how t-SNE works is as follows:
  • distances between points in high-dimensional space are calculated
  • data points are placed randomly in a small-dimensional space

.. then the following steps are repeated for each point ..
  • each point is subject to attractive and repulsive forces with other points
  • if two points have a short distance (in the high-dimensional space) they attract
  • if two points have a large distance (in the high-dimensional space) they repel
  • each point is moved a small amount according to the balance of these forces

Note that the smaller dimensional space is often 2-dimensional to make plots easier, but doesn't have to be.

After many iterations, the points that were placed randomly in the small-dimensional space will have moved so that they end up close to the points they are also close to in the high-dimensional space.

Because this method starts with a random initialisation, the resulting arrangement of points in the smaller 2-d space will be different every time we run the process.


Python Examples

Here we'll work through four simple python notebooks, with each illustrating a key element of the t-SNE process.


Example 1
The first notebook starts with simple 2-dimensional data which happens to be clustered around 3 points. The data isn't high dimensional but we're starting with it just to keep things simple, and to see the effect of t-SNE on simple data so we get a good intuition for it.


We can see 3 fairly well spread clusters being transformed into 3 very tight clusters. This shows the exaggerating effect of t-SNE, bringing close data points even closer, and pushing data points that aren't close even further apart.

The python code is here, and you can copy it and experiment yourself:

The t-SNE tool from the popular sklearn library doesn't take much code to use at all. We create a TSNE object, specifying the number of dimensions it will reduce data down to. We then simply apply it to the data to be processed.
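
A minimal sketch of that usage, with sample clustered data generated by make_blobs rather than the notebook's actual data:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# simple 2-d data clustered around 3 points
data, labels = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# create a TSNE object that reduces to 2 dimensions, then apply it to the data
embedded = TSNE(n_components=2).fit_transform(data)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=10)
plt.show()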


Example 2
In this example we look at data that is 3-dimensional. Again, this isn't particularly high-dimensional. The following shows a 3-d plot of this data.


You may be able to see clusters in this data, but you can imagine similar data where the clusters are not distinct at all.

The following shows 2-d slices of the 3-dimensional data. Again, clusters may be visible but they aren't clear.


If we try to count the clusters, many of us would say there are 6 clusters.

The following shows the 3-dimensional data transformed by the t-SNE process.


The t-SNE visualisation shows very clear clusters. What's important is that there are 7 clusters, not 6.

The previous visualisations were obscuring the 7 clusters, and only very carefully chosen angles on the 3-d plot would have shown them. The t-SNE process shows its strength in separating them out in a 2-dimensional space.

The python notebook is online:



Example 3
The third example uses data with 64 dimensions, much higher than the 2 and 3 dimensional data we were looking at before. The data is actually a set of 8 pixel by 8 pixel images of digits. Our code selects only 500 from the 1500 digits to keep our visualisations clear.

The following shows a digit "0". We can think of each digit as a 64-dimensional data point, with each pixel value being one of the 64 dimensions.


The following shows a 2-d plot of dimensions 2 and 3, and also a 3-d plot of dimensions 2, 3 and 4.


None of these plots shows any clusters in the high-dimensional data. Try adjusting the code yourself to see if different 2-d and 3-d slices of the 64 dimensions reveal a more insightful chart.

Processing the data with t-SNE to reduce it down to 2 dimensions results in the following plot.


There are distinct clusters visible, which really shows the power of t-SNE. I count 9 clusters.

Plotting the data points with colours showing the actual digits confirms that the t-SNE clusters are very much correct.


However, the plot does show one weakness. At the bottom (and elsewhere) some of the digits seem to have been merged into one cluster, or have points in what looks like a different cluster. This is very likely due to the fact that digits can sometimes be ambiguous themselves.

The code is online:



Example 4
The final example is a test to push the t-SNE method with a deliberately challenging data set.

The data set consists of 2 clusters, both encompassed in a ring. It'll be interesting to see what t-SNE makes of the outer ring which does have points that are close together, but which span a much larger area than the 2 simple clusters.


We can see that t-SNE has managed to separate out the 3 groups, and also retain the broad topology of the ring.

Impressive!

The code is at:



Tuning t-SNE

The example code specifies how many dimensions to reduce data down to. It also specifies another parameter called perplexity. This parameter tunes the balance of attractive and repulsive forces we discussed above, and in a rough way, specifies how many neighbours a point should have. The higher the value, the more the process will try to bring points together.
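
A small sketch of experimenting with this, using arbitrary sample data and perplexity values; running it shows how differently the same data can be laid out.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

data, labels = make_blobs(n_samples=300, centers=3, random_state=42)

# embed the same data with three different perplexity values and compare
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    embedded = TSNE(n_components=2, perplexity=perplexity).fit_transform(data)
    ax.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=10)
    ax.set_title(f"perplexity = {perplexity}")
plt.show()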

The following article explains the effect of changing this perplexity, and how it can mislead or result in very different clusters for the same data.

That article includes interactive demonstrations showing how different kinds of data react to different perplexity values, and is really worth experimenting with.

Importantly, the article shows how t-SNE can mislead, for example showing clusters from data which is actually random.


Further Reading

Tuesday, January 28, 2020

Functional Programming and Data Science Applications

This month we had a talk introducing functional programming, its benefits and challenges, and potential applications to data science.

This blog post won't replicate all of the detail of the talk, but provides an introduction for newcomers and some examples of key concepts.


The talk was given by Callum O'Brien who has expertise in both functional programming and mathematics. Amongst his many other interests are operating systems and their design.

Callum's slides are here: (plain text), and his code examples are here: (link).

A video recording of the talk is on the group youtube channel: [youtube].


Context

Functional programming has often been thought of as a niche, difficult and very theoretical pursuit with little application to the real-world building of technology.

Against this false perception, the last decade has seen a growing interest in functional languages and functional programming styles in everyday languages. Demand for functional programmers is also rising in some sectors.

Today, software is much larger and more complicated than it was 20 and 30 years ago. As the size of software grows, so does the challenge of avoiding or finding errors in that code. Some classes of error are a result of that complexity. Because functional programming avoids entire classes of error by design, the growth of software has seen a growth in functional programming.


What is Functional Programming?

Every programming language has a model for how it works, and how you should code.

The model for 6502 assembly language is the CPU registers and memory, and programs manipulate data amongst these using moves and simple operations like arithmetic. Object oriented programming languages encourage you to think in terms of semantically high-level objects which contain data and methods which may act on that data.

The model for functional programming is strictly mathematical functions, just like f(x) = 2x + 3 for example. This might seem like an unusual model for a programming language so let's look at some properties of mathematical functions.

  • Mathematical functions always give the same output given the same input. 
  • Mathematical functions don't refer to information beyond what they are passed as arguments, and don't leave any information beyond what is returned by the function.
  • Simpler mathematical functions can be combined, or composed, to achieve more sophisticated computation.

Let's think about how each of these might apply to a programming language.


1. Referential Transparency
The first property might sound trivial. Of course f(x) = 2x + 3 always gives f(5) = 13. But if you're an experienced programmer you will know that, too often, the functions we come across don't always behave this way. Too many hours are spent trying to find bugs which cause functions to give unexpected results. The worst bugs are where functions give unexpected results sometimes, but not always.

Now, if a programming language guaranteed that functions always returned the same outputs given the same inputs, many hours of debugging would be avoided. Our confidence in the correctness of a program also increases if we know the functions behave in this strict manner.

Purely functional languages only allow you to write functions that have this property, called referential transparency.



2. No Side-Effects
That second property also sounds trivial. To evaluate the function f(5), all the information needed is in the definition of the function, 2x+3, and the information passed as an argument, 5. There is no need to look outside the function at all. If our code obeyed this property, it would mean our functions would work totally independently of whatever else was happening in our code, or indeed elsewhere in our system. This independence is needed to support the first property. If a function depended on some external variable, or some data in some object, then it wouldn't be independent, and we couldn't make strict guarantees on it always producing the same result given the same input.

That second property also says that a function doesn't leave behind any information after it has been evaluated. Evaluating f(5) leaves us with only one outcome, which is the answer 13. No information elsewhere is changed. Again this sounds trivial, but in the real world software functions break this rule frequently, changing information outside the function, or leaving behind newly created information. Changing external data means others that depend on that data can no longer guarantee the same output for every input. In complex software, many bugs happen because something has changed data in a way you didn't expect. The dependencies can grow into an entangled error-prone web.

Purely functional languages enforce this property by preventing your functions having any side-effects.



3. Composability
The third property also sounds trivial. If we have g(x) = 3x, we can have combinations like h(x) = f(x) + g(x) as well as h(x) = f(g(x)). We can use smaller functions to build bigger ones, and they too will behave in the same strict way as the smaller ones. While this might not sound like much, for software engineering, guaranteeing that a bigger program made of many smaller functions will also behave just as well as the smaller ones is a significant benefit.

But there's more in this third property. That example, f(g(x)), shows the function f() being passed a function g() and not a number. To avoid using data that is external to our function, the only way to pass information to a function is as an argument. If functions were only allowed to take and return integers, building useful programs would be really hard. Allowing functions to take other functions as arguments makes building programs much easier.

If we're composing programs out of functions, and those functions can take functions as arguments, and those functions could also take functions .. we can see that functions must also have the ability to return functions not just integers. This is why functional programming languages treat functions as first-class citizens.
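
The following Python snippet, kept deliberately simple, illustrates composing functions and passing functions as arguments, using the same f and g as above.

def f(x):
    return 2 * x + 3

def g(x):
    return 3 * x

def compose(outer, inner):
    # returns a new function that computes outer(inner(x))
    return lambda x: outer(inner(x))

h = compose(f, g)
print(h(5))   # f(g(5)) = f(15) = 33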



That was a lot of discussion from just three observations about mathematical functions!

Why Maths?
It is worth asking why languages might want to use mathematical functions as a model. The reason is that mathematics is strict enough for us to be able to make guarantees about mathematical functions, and compositions of mathematical functions. Software engineering would benefit hugely from similar guarantees. We've talked about guaranteeing the same outputs for the same inputs. But there are other guarantees too. In mathematics we can prove things about entire classes of functions and algebras. In software, this would mean we can apply guarantees to entire classes of functions or data types. These guarantees can help us understand how a program will behave, and can also help the compiler to make optimisations.

The key mathematical theory for computation, lambda calculus, was developed in the 1930s by Alonzo Church (Alan Turing's PhD supervisor), before practical computers were ever built. Many functional languages derive their design from this lambda calculus.


Let's now bring some of this theory to life with some illustrative examples.


Example: Pure Function vs Impure Function

Let's have a look at this python function which joins two lists together.


# impure function

a = [1,2,3]
b = [4,5,6]

def impure_append(b):
    a.extend(b)
    return a

print(impure_append(b))

print(a)


When we run the function, we do get a result which is the two lists joined together. However, the list a has been modified by the function, and we can see this when we print it after the impure_append() has been called.


[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6]


Clearly impure_append() is not a pure function because it modifies data outside itself. Another reason it is impure is that the output of impure_append(b) does not just depend on b, but on data outside itself which can change. In fact, if we call impure_append(b) again we get a different result.


[1, 2, 3, 4, 5, 6, 4, 5, 6]


Now have a look at a pure version of this function.


# pure function

a = [1,2,3]
b = [4,5,6]

def pure_append(a,b):
    return a + b

print(pure_append(a,b))

print(a)


When we run the function, we do get a result which is the two lists joined together. We can also see that the list a has not been modified.


[1, 2, 3, 4, 5, 6]
[1, 2, 3]


The function pure_append() will always give the same results for the same inputs, and it does not modify any data outside itself.

Although python is not a strict purely functional language, we can still benefit from building our programs with well behaved safer functions.


Example: Imperative vs Declarative

The requirement that functions depend only on the information passed to them (which may include functions), and that they have no side-effects, encourages a style of programming with no statements that create or change state directly. Instead we write expressions which express how the output is related to the input, a style called declarative programming.

Lots of programming is done in an imperative style. We state what needs to happen to data, and in what order, for the output to be calculated from the input.

In declarative programming, we only state how output is related to input using functions, just like a mathematical relationship using mathematical functions.

Have a look at the following python code for a factorial function. It is written in a traditional imperative style, where data is created and then changed in a loop. The order the instructions are carried out is important to the correctness of the result.


# imperative factorial

def imperative_factorial(n):
    
    if n == 0: return 1
    
    product = 1
    for i in range(1, n+1):
        product *= i
    return product

imperative_factorial(5)


Another observation is that product is 1 initially, but later is something else. This can't be true mathematically. In functional programs variables are immutable. This shouldn't be seen as a limitation, but as a guarantee that those variables, or indeed functions since they're first class, will always have a single, correct value.

The following is a declarative version. It relies on the fact that n! = n * (n-1)!, or factorial(n) = n * factorial(n-1). Although it might take extra thought, the resulting code is simpler. The definition of the function doesn't focus on the steps to calculate, but on how the input and output are related.


# declarative factorial

def declarative_factorial(n):
    
    if n == 0:
        return 1
    else:
        return n * declarative_factorial(n-1)

declarative_factorial(5)


What's more, this definition is recursive, which means the function is defined in terms of itself. Because functions must be pure, functional programming relies much more on solving problems recursively. Iterating is done recursively, not with a counter-based loop.

The above examples were in python to demonstrate the ideas in a familiar language. Functional languages are often much more concise in expressing these kinds of function. The following is a Haskell definition of a factorial function.


factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * factorial (n - 1)


The first line declares what our function factorial does. It takes an Integer and returns an Integer. The second line defines factorial for the case it is applied to 0. The final line defines factorial of n as n multiplied by factorial applied to (n-1). Simple and clean! Notice also how this definition is much more mathematical, stating the relationships and special cases (n = 0).

A more compelling illustration of declarative programming is the Haskell example of the Quicksort sorting algorithm. Quicksort takes an item, a pivot, from an unsorted list and then moves everything that is smaller than it to the left, and everything bigger to the right. The pivot is now in the right place, but the sub-list to the left and right are unsorted. The same method is then applied recursively to each left and right sub-lists. Even saying this tells you that writing Python or C code would take quite a few lines. The following is a Haskell declarative function that does this.


quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser  = filter (< p) xs
        greater = filter (>= p) xs


The first line is the signature, stating that the quicksort function works on a list of order-able data types, taking a list as input and returning a list. The next line simply states that the quicksort of an empty list is an empty list. The third line has the magic. It says that the quicksort of list (p:xs), where p is the head and xs the tail, is simply the concatenation of a quicksorted lesser list, with that p, and with a quicksorted greater list ... where the lesser list is all values from the tail xs that are less than p, and greater is those values greater than or equal to p.

Notice how this very concise definition of the quicksort function focusses on what a quicksorted list is, rather than the memory-level mechanics of how to re-arrange a list to get it sorted.

You can try Haskell yourself using an online interpreter: https://repl.it/languages/haskell together with a beginner-friendly practical tutorial http://learnyouahaskell.com/.


Performance and Parallelism

In recent years, it has become harder and harder to make commercially available computers faster. More specifically, the performance of a single CPU has not increased to the extent it did 20 and 30 years ago. Instead, CPU and system builders are trying to improve performance through parallelism - multiple cores in CPUs, and multiple CPUs for larger systems. For even larger systems, higher performance can be sought with computing nodes distributed over a network.

The hardware is only half the solution. We need our software to work efficiently across multiple computing nodes, and also to scale well with more nodes.

Experienced programmers will tell you that developing concurrent and parallel software in most languages is a major challenge. Not all tasks can be parallelised, but when they can, software is often plagued by hard-to-debug issues like race conditions, synchronisation and locking.

Parallelism makes ensuring data is not incorrectly changed even harder.

Luckily, functional languages require that no function can rely on, or change, information outside of itself. This isolation, which some might have considered painful, is a huge benefit in this era of parallel and distributed computing.

It is striking how easily functional programming languages, with their roots in mathematical theory developed in the 1930s before practical computers existed, can take advantage of modern parallel hardware, where languages like C, Java and Python struggle!

There is something else to say about performance. Because a pure function always returns the same result for the same inputs, we can safely remember previous results and avoid doing the calculations again the next time a function is called with the same arguments. This is called memoisation.

In languages that don't require functions to be pure, this kind of optimisation is difficult and limited to a small number of very well understood functions, which are rarely those written by the user.
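
In Python we can opt in to this explicitly, for example with functools.lru_cache; it is only safe because the function being cached is pure. A sketch, reusing the factorial from earlier:

from functools import lru_cache

@lru_cache(maxsize=None)
def declarative_factorial(n):
    if n == 0:
        return 1
    return n * declarative_factorial(n - 1)

print(declarative_factorial(30))   # computed recursively the first time
print(declarative_factorial(30))   # answered instantly from the cache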


Testing & Verification

The task of testing software is dramatically easier if it is purely functional. We can test that a function produces the correct output for a given input, and that is a complete test of that function's behaviour. We don't need to worry about external factors changing the output because pure functions don't use external information, and don't change information outside themselves. In practice, this avoids the huge effort put into mocking up external or surrounding software or systems.

Because pure functional languages are based on a mathematical framework, they are easier to "prove" correct. In reality, this depends on how closely a particular language aligns to a mathematical framework such as lambda calculus. Haskell programs are much easier to verify than C for example, and are amenable to proof assistants like Coq.


Types, TypeClasses

Functional programming languages, having come from a mathematical foundation where correctness is emphasised, will have a strong type system. Types describe values, so 1, 4, 6 are all values of type integer, and 'a', 'x', 'z' are all values of type char. Haskell has a rich type system which applies also to functions because they are first-class citizens.

Type classes define both possibilities and constraints for types. For example, integer and character types both implement equality and so equality is a type class. Languages like Haskell have a very rich type system.

Many classes of error can be avoided more elegantly by enforcing consistency and constraints at the type level.

I found this video useful in understanding the difference between types and type classes:



The Input/Output Puzzle (and Monads)

A very common question about pure functional programming is how any program can be useful if it can't interact with the outside world. To be useful, many programs need to obtain information from a user, a file system, or a network, for example, and also write information to similar external systems.

It seems input/output contravenes the need to keep functions pure and have no side-effects.

How to solve this question has been a hot topic of research in recent years and is ongoing. The core aim of functional language designers is to ensure that a program, which is a composition of functions, retains the referential transparency guarantee - which is necessary for both validating the correctness of code, and also for correctly optimising the execution of the program. A secondary aim is to ensure programmers don't inadvertently cause bugs by reading invalid data or overwriting good data with bad.

Imagine a function getName() that reads information from the user (or a database, or a network server). If we call it again, it might return a different value. Straight away this breaks referential transparency. Remember, all functions must always return the same output when called with the same input.

Let's consider another problem. Imagine two requests to read external information, a = getPrice() and b = getPrice() called at a later time. A good functional compiler or runtime knows that a pure function will always return the same answer. Here it will return the first answer to the second call, because, well, why would it calculate it again? Even if it didn't do this, there's no guarantee the first one will actually be run first, because pure functions can be run in any order as long as their arguments don't define an order, and here they don't. It gets worse! Some languages are lazy, which is a good thing, but doesn't work with the time-sensitive need to "get this price now, not later".

One solution is to model the external world as data type and use these as additional arguments to functions. So a function getName() is passed a state of the world, and it returns not just the desired information but also a new world state, because the world has changed. Code would look something like this (name, world2) = getName(world1). Another call to getName() could not use world1 again because that world is now in the past. And some functional languages enforce this. Notice how getName(world1) and getName(world2) can now legitimately return different values because they have different arguments.
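
The following Python sketch is purely illustrative (it is not how Haskell implements IO), but it shows the mechanics of passing an explicit world value so that successive calls legitimately take different arguments.

def get_price(world):
    # take the next price from the explicit world state
    price = world["prices"][0]
    # return the price together with a new world in which that price has been consumed
    new_world = {**world, "prices": world["prices"][1:]}
    return price, new_world

world1 = {"prices": [100, 105, 103]}
price_a, world2 = get_price(world1)   # 100
price_b, world3 = get_price(world2)   # 105 - a different result is legitimate
                                      # because the argument is different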

Too often the question of functional programming and side-effects leads to the mention of something mysterious called a monad. Monads are considered by many as a difficult concept with many warning mere mortals to keep away.

Words like monad, functors, and monoids come from the mathematical study of computation, functions and strict programming languages that aim to give us the same guarantees as mathematics itself.

We won't go into detail here but simply leave an intuition about monads. The augmenting of a function's arguments and return value with a world value, which we just talked about, needs to be done in a way that still allows functions to be chained together and composed in the way we're used to. The underlying structure that allows this is a monad. That's it. A monad doesn't magically solve side-effects; it is just a way of augmenting functions with that additional unique-making information that enables functions to be pure, whilst also enabling those functions to be composable.

Monads have become lore in the world of computer science, responsible for many memes, myths and legends!


For a programmer, a nice feature of languages like Haskell is that any function that wants to have a side effect has to declare it in its signature and return type. You can't get away with slipping in a getPrice() inside an ordinary looking function that returns an integer. The return type might be IO Integer, or something like that. In this way, anyone else using your function knows that it has a side-effect, reading data from an external source for example.

You can read a good explanation of the IO problem here:


A mathematics-light starting point for understanding monads:



Recommendations For Real World Use

When considering functional programming for real world use, there are several factors to evaluate.

An important consideration is the scarcity of people who can develop software in Haskell, as an example. Haskell is not yet as widespread as Javascript and Python. This scarcity is then a risk to be compared to potential benefit. There is growing demand for Haskell programmers in sectors like finance, and this is viable because there is sufficient availability of Haskell programmers in those regions.

A pragmatic strategy for all sectors is to infuse the discipline and good practice encouraged by functional programming into your existing development. Many languages are adopting syntax to support functional style programming, but simply using recursion, list comprehensions or maps isn't sufficient for writing safer more correct code. Adopting patterns that minimise shared state and make transparent what functions are doing is what will have an impact.

The natural ability of functional programming to do parallel computation with ease is a major growth area for functional programming, and one worth investigating if this applies to your work. The Erlang functional language has been used in highly-concurrent telecoms for some years now.

One potential disadvantage of functional languages is that they make use of memory very liberally. In the early days, even simple programs would consume inordinate amounts of memory, but today's compilers and runtimes are much more efficient. If your needs require constrained or highly predictable memory consumption, then functional programming is unlikely to be a good fit. For the vast majority of sectors, memory is a very cheap resource.


More Reading


Saturday, November 30, 2019

Special Event - Big Data Challenges in Physics

This month was a special event jointly organised with the Institute of Physics who have been focussing on the theme of big data in 2019.


Dr Craig McNeile's slides are here: (pdf). A video of his talk is on the group's youtube channel: (video).

Prof Robert J Hicken's slides are here: (pdf). A video of his talk is on the group's youtube channel: (video).


Talk 1: Using Supercomputers to Search for the Breakdown of the Standard Model of Particle Physics

Dr Craig McNeile, a Lecturer in Theoretical Physics, University of Plymouth, gave an introductory overview of some of the big open questions in particle physics.


Craig started by clearly setting the scene and the problem. Observing how the universe moves suggests there is more matter than is suggested by the amount of light we observe.


In fact the observed matter is only 5% of the matter that is there causing gravitational effects. The invisible 95% is called dark matter - it doesn't interact with electromagnetic radiation, so we can't see it even with instruments that record x-rays or infrared.

That dark matter must be composed of sub-atomic particles that we haven't yet discovered. And the search for dark matter is centred on the search for these particles.

Scientists have, over the years, discovered many particles, most recently by smashing particles together at high energy to see if new particles are created with that energy. The higher the energy, the greater the chance of creating larger mass particles. But higher energy colliders are more expensive; the Large Hadron Collider at CERN cost just under £4 billion to construct.

Detecting particles sounds simple but is a complex process of deduction from vast amounts of data generated by detectors around a collision. The amount of data generated is huge and the calculations required to deduce the presence of particles, with sufficient confidence, requires super computing resources.

To guide the search, we need theories that predict how matter is composed and organised. The Standard Model describes all known particles, and all interactions (weak, strong, electromagnetic) except gravitation. Because it is a highly symmetric model, it describes particles that were not observed initially but were found years after they were predicted. The Higgs boson is a recent notable example.

The Standard Model doesn't predict dark matter - so new theories are needed. The so-called "string theory" is one of these, as is QCD, quantum chromodynamics.

The search for particles is complicated by the fact that the Heisenberg uncertainty principle means that for a short period of time, particles/energy can pop into existence and then disappear - fluctuations in the vacuum. These short-lived particles do have an effect on longer lived particles, which becomes significant when we're trying to sift through observations to try to find the subtle effects of new particles.


Modelling the theoretical dynamics of QCD, so they can be compared with observations, is mathematically involved, and in practice is done using the Monte Carlo random sampling method, which makes otherwise intractable calculations feasible.

These discrete calculations are huge and require a distributed computing architecture. Craig indicated how GPUs are particularly effective, and how compute nodes use MPI and other message passing protocols to coordinate distributed computation. He explained how computations over 2000 cores are fairly common!

Craig discussed how significant effort is put into optimising code to take advantage of the hardware. He suggested that a compute efficiency of 38% of the theoretical peak was achieved. This indicates how difficult it is to optimise code to best take advantage of the available hardware's capabilities.

A real problem is that optimising code for particular hardware is a human-intensive and expensive task, and the benefits don't last as new and better hardware architectures emerge regularly. The portability of code across computers, particularly with respect to compute efficiency, is a hard problem.

Craig also gave us an insight into the challenge of data storage, organisation, and transport. The datasets are typically large, at the petabyte scale, and computation needs the data to be local to the compute nodes. This means data is often being transferred across relatively slow networks.

Craig underlined the importance of visualisation to make sense of the data and computation results. The following shows an effect on the vacuum between a quark and anti-quark as they are separated, something which is not so easily understood merely by looking at the numerical results of the simulations.


Interestingly, Craig explained how the economics of high performance computing mean that building your own clusters is always cheaper than trying to achieve similar performance with consumer clouds such as AWS, Azure or Google.

The Dirac academic project is a collaboration between universities and includes systems with thousands of nodes, eg Cambridge with 1152 Skylake nodes and Edinburgh with 4116 Xeon processors!

Despite the huge theoretical effort and large sums of money invested in experiments and computational power, the search for dark matter continues!

One member of the audience asked an interesting question as to why people search for something for which there seems to be such thin or subtle evidence or clues. In my opinion, the search for the truth is an inherent human motivation, particularly when that truth is mathematically beautiful.


Talk 2: The Challenge of Storing Ever Bigger Data

Prof Robert J Hicken, a Professor of Condensed Matter Physics, University of Exeter, gave an overview of the challenge of storing and retrieving ever larger volumes of data, and the need for new technologies which can scale to billions of users, are faster and energy efficient.


Rob set the scene with figures showing the huge amounts of data we store, to the scale of exabytes. Billions of people are uploading videos and photos constantly, and demand these are stored for potential later retrieval!

The volume of data isn't the key problem on its own, the energy required to organise and retrieve that data is significant.

"One Google search uses ~ 1 kJ of energy (equivalent to running a 60 W light bulb for 60s), and Google carries out more than 1 trillion searches per year"

This is a huge amount of energy when we think about how many google searches happen globally in any time interval.

Perhaps surprisingly, spinning platter hard disks (HDD) maintain an edge over solid state storage (SSD) in terms of the density of data per unit area. This is important in large installations such as cloud-scale data centres. HDDs also remain cheaper per volume of data too.


As demand for storage grows we need to explore how this technology can either evolve or change more drastically.

Rob's research has this motivation at heart - greater storage density, at lower energy consumption, and faster where possible.

Rob took us back to basics, explaining how traditional storage, tape and HDDs, uses materials which can be locally magnetised in one of two directions to denote the binary bits, 1 and 0.


Surely the smaller these regions the more data we can store per area?

The problem is that the smaller the 'grains' the more susceptible to thermal noise they become. This could be addressed by using materials that require larger energy to switch between 1 and 0, but this then means we haven't addressed the energy efficiency challenge.

Rob summarised the trade-offs as a triangle, a trilemma of storage density, writability and thermal stability, stating we can only have 2 of the 3.


Rob then explained that this limiting triangle can be addressed by new paradigms for the data recording process.

One approach is to use heat assisted magnetic recording, HAMR, to lower the energy barrier to writing data. Another is to use microwave excitation which has a similar effect, MAMR.

Rob explained that Seagate will be manufacturing HAMR heads in the UK and showed a video illustrating how the precision required is achieved using a planar fibre optic guided laser to focus energy via a gold peg.


This is an example of new, and relatively advanced techniques, being industrialised and commercialised to meet demand.

Rob showed a chart showing HAMR to be used for the period 2019-2023, beyond which a heated dot method is predicted until 2025. Beyond that is an open question.

One area of excitement is the ultrafast demagnetisation effect, where an ultrashort laser pulse of 100 femtoseconds caused a halving of magnetisation in a nickel film in less than 1 picosecond.


This is far faster than any other observed change in magnetic state - and is a very promising basis for future storage technologies. Typical access times for data storage and retrieval are about a nanosecond, yet data can be transmitted over fibre optics at Gbit/s.

Looking more broadly, Rob shared phenomena where circularly polarised optical pulses also produce very fast changes in magnetisation in materials - such light could lead to even more energy efficient storage, which also does away with the need for managing heat egress from the HAMR technique.


Thoughts

The evening was a special event with invited scientists, and this attracted a larger audience. The broader impact is that science is brought closer and more directly to the community and this is always a good thing - not only from the engagement between academia, industry and community, but also because it makes science a more realistic career option for many.

For this reason I was particularly pleased to see younger members of secondary school age at the event!


As Toby of Headforwards also stated, any successful region needs a healthy vibrant and active science and technology scene. It makes the region more externally visible, attracts investment, and provides pathways for high-skills careers.

More fundamentally, it allows people to meet and this real-life social network is the basis of all successful ecosystems and economies. And supporting this is the aim of Data Science Cornwall.

Personally I was very pleased to have strengthened links with the Institute of Physics, and hope to do more with them in the region.


Acknowledgements

This event was jointly organised with the Institute of Physics, the leading professional body and learned society for physics in the UK and Ireland.

The event was also supported by the Cornwall Science Community, formerly the Cornwall branch of the British Science Association.

Ongoing advice and refreshments sponsorship were kindly provided by Headforwards.

Thanks also to Falmouth University for continuing to host us.