Thursday, August 12, 2021

Proof, Provers, and the Lean Theorem Prover - Summer Special!

In this summer special we took a break from messy data, and - just for fun - took refuge in the reassuring consistency and cleanliness of logic and mathematics, by exploring proofs and proof assistants like LEAN.



We were very lucky to have James Arthur, a Cornwall-based pure mathematician and mathematics communicator, give an introduction to proofs and the LEAN proof assistant, followed by a worked example.

The video for this talk is at our channel: [youtube]. Slides are here: [pdf or pdf].


What is a Proof Anyway?

James started by explaining that mathematics is different to science. Science starts with theories about how the world works, and tries to find evidence in support of those theories, until someone finds evidence that disproves or refines those theories. 

Mathematics, on the other hand, is built on logical arguments, with truths leading to new truths solely by following a system of consistent logic. You can't argue with mathematical discoveries if you agree with the truths they were built on, and the logic of the argument that connects the two. 

It is possible to work backwards to the very foundations on which branches of mathematics have grown, and find that you arrive at very simple, fundamental statements that can't be decomposed further - these are the starting axioms of mathematics. Mathematicians will of course experiment to see where different axioms lead ... but that is a story for another day!

James went on to explain that lemmas are smaller, less important theorems, and that both theorems and lemmas are traditionally structured as a statement and then a proof.

A proof can be deductive - simple statements leading to subsequent statements - but other strategies are possible. A common example is proof by contradiction, where a correct chain of logical arguments starting from an assumption leads to an evidently false conclusion, which proves the starting assumption was false. This is a surprisingly useful and common strategy. For example, proving $\sqrt{2}$ is irrational is done by contradiction (proof).
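
In outline (the linked proof gives the details), the argument assumes $\sqrt{2} = \frac{a}{b}$ with the fraction in lowest terms, and follows the logic until it breaks:

$$\sqrt{2} = \frac{a}{b} \;\Rightarrow\; 2b^2 = a^2 \;\Rightarrow\; a \text{ is even} \;\Rightarrow\; a = 2c \;\Rightarrow\; b^2 = 2c^2 \;\Rightarrow\; b \text{ is even}$$

Both $a$ and $b$ being even contradicts the assumption that $\frac{a}{b}$ was in lowest terms, so no such fraction can exist.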

Another strategy is induction, which is slightly more sophisticated. If we know that 0 belongs to a set, and we can show that whenever a natural number $n$ belongs to the set its successor $n+1$ does too, then like falling dominoes, every natural number belongs to the set. What makes this useful is if the set itself is interesting. Let's phrase this a different way to make sure we understand it:

  • If $\phi(0)$ is true,
  • And we can show that $\phi(n)$ is true implies $\phi(n+1)$ is also true,
  • Then by induction $\phi(n)$ is true for all natural numbers $n$.

You can follow a simple worked example (here).
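
As a flavour of how such an argument looks (the linked example may use a different statement), take $\phi(n)$ to be the claim that $0 + 1 + \dots + n = \frac{n(n+1)}{2}$:

$$\phi(0): \quad 0 = \frac{0 \times 1}{2}$$

$$\phi(n) \Rightarrow \phi(n+1): \quad 0 + 1 + \dots + n + (n+1) = \frac{n(n+1)}{2} + (n+1) = \frac{(n+1)(n+2)}{2}$$

The base case is clearly true, the inductive step is simple algebra, and so by induction $\phi(n)$ holds for every natural number $n$.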


What is a theorem prover, what is a proof assistant?

Developing proofs is not easy, and is a learned skill. 

It is too easy to make errors in logic as we assert new facts from previous facts, or to make a mistake about completeness of coverage, or to incorrectly assume a statement about integers is true for real numbers ... the number of ways in which we might have errors or gaps in our proofs is overwhelmingly large.

A proof assistant helps us write correct proofs by telling us when there is an error in the logic.

Although the terms are sometimes mixed, a proof assistant is different to a theorem prover, which tries to find, on its own, a proof that connects starting facts to a goal.

The development of a theorem prover, and indeed a theorem generator, has been a holy grail of computer science and mathematics since the early days of digital computing. Great advances have been made but the problem has not been universally solved. 

The LEAN proof assistant is a relatively modern project, and has been enthusiastically embraced by mathematicians.

A unique element of LEAN is that it interactively shows you the state of a proof, including the state of sub-goals towards the overall proof goal. Watch James's example next to see this in action.
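
As a tiny taste of the style - this is not James's example, just a sketch written in Lean 3 / mathlib-flavoured syntax - here is a proof by induction that $0 + n = n$ for natural numbers:

theorem zero_add_example (n : ℕ) : 0 + n = n :=
begin
  -- Lean displays the current goal, and after 'induction' it shows two sub-goals:
  -- the base case and the inductive step
  induction n with d hd,
  { refl },                   -- base case: 0 + 0 = 0 holds by definition
  { rw [nat.add_succ, hd] }   -- inductive step: 0 + succ d becomes succ (0 + d), then use hd
end

At each step LEAN shows the remaining goals, which is exactly the interactive feedback described above.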


James's Worked Example: Area Of A Circle

James explained how LEAN proofs are written in a functional language, and proceeded with a live demonstration. Due to time constraints he didn't get the chance to complete it.

He did recommend a separate demonstration, in which he used LEAN to prove the area of a circle.



Further Reading

Read a bit more about the LEAN proof assistant:

Monday, August 2, 2021

Data Challenge Cornwall 2021

A discussion late last year led to the realisation that many data scientists, entrepreneurs and small businesses don't know about the challenges in established Cornish industries. This led to the idea of the Data Challenge Cornwall 2021.



Aims

The aims are more about developing the local tech ecosystem than about solving any particular data challenge:

  • Widen awareness of key sectors and industries in Cornwall.

  • Inspire more people into high skilled careers in Cornwall.

  • Develop relationships between industry, businesses and community. 

  • Encourage new business activity around challenge solutions.


Diverse Sectors

There were data challenges from the following varied sectors:

  • Cornish Lithium - Optical Character Recognition of Historical Mining Records
  • NHS Cornwall - Visualising GP Surgery Data
  • Coastline Housing - Predicting Faults & Identifying Resident Support Needs
  • Crowdfunder - Realtime Data Visualisation of Pledge Activity and Trends
  • Smartline - Cluster Analysis for Creating Personas without Bias


Videos, Slides, Data

Recordings of the data challenge presentations are in the following playlist on our youtube channel:

The video descriptions contain links to slides and data.


Thanks

Special thanks to the Institute of Physics for supporting this first Data Challenge Cornwall 2021.



Monday, September 21, 2020

Periodogram vs FFT for Exploring X-Rays from Cygnus X-3

This month we ran a data challenge, a chance to test and stretch our data exploration and detective skills. This article walks through one approach to finding a pattern in the data, underlining good data mining practice in the process.



A video walkthrough of the challenge and solution is here: [youtube]. The slides are here: [link]. 

The challenge guide is here: [pdf].



The Mysterious Cygnus X-3

Astronomers and astrophysicists work out what objects far away in deep space are by analysing the only thing that reaches us - radiation emitted by those objects. Some objects are easy to classify, some difficult. 

Cygnus X-3 has always been a difficult object to classify because its emitted radiation doesn't easily fall into a known category. Over recent years evidence has grown that it is a binary system of two objects rotating around each other. It also gives off x-rays which are powerful and erratic, like explosions, suggesting a non-stable state.



For this challenge, a set of x-ray measurements from the SWIFT mission was provided to participants as a CSV file. Amongst other data, it contains measurements of received x-rays (counts per area per time), and associated timestamps.

The challenge was to find any pattern in the data, and if possible, infer what it meant for the physics of Cygnus X-3.

Given that there is still lots to understand about Cygnus X-3, it is entirely possible that we might uncover a new insight!



A First Look At The Data

It is always a good idea to look at any data we are intending to explore and apply tools to. In fact we should do this after every step of a data pipeline to ensure the data does look like what we expect it to.

You can follow the python notebook we used here:
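
Loading and summarising the data takes only a couple of lines of pandas. This is a sketch - the filename here is made up, and the actual notebook may differ:

import pandas as pd

# load the SWIFT x-ray measurements; the file has two columns, time and rate
df = pd.read_csv('cygnus_x3_swift.csv')

# always inspect the shape and types of the data before doing anything else
df.info()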


The following summarises the shape of our data.

RangeIndex: 67220 entries, 0 to 67219
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    67220 non-null  float64
 1   rate    67220 non-null  float64
dtypes: float64(2)
memory usage: 1.0 MB

We can see there are 67,220 data rows, organised as two columns labelled time and rate (amount of radiation).

The following shows the very first view of our data.


We can see straight away that there are some very large outliers. Before we rush to remove that data from our set, we should consider whether they are genuine measurements recording a flare up of radiation from Cygnus X-3. This is a valid hypothesis that shouldn't be discounted immediately.

As data exploration should be exploration, we can see what happens if we do remove those very large values. One hypothesis is that a radiation sensor measuring photon counts should never give negative readings, and that the huge spike is in fact a large negative number. It might be too simplistic to remove the smaller negative values, because there may be a systematic shift in the data as it is recorded. One common approach is to keep only values within 1 or 3 standard deviations of the mean. Again, just because everyone does it doesn't make it right - that is false comfort. The larger values might be the most informative.

We'll proceed with a brutal cutting off of negative values, keeping the option open to revisit this decision.
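
In pandas this is a one-liner. A sketch, assuming the dataframe loaded earlier is called df:

# discard negative rates, keeping the original dataframe so the decision can be revisited
clean = df[df['rate'] >= 0]
clean.info()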

The following shows the data after it has had negative values removed.


We can now see more detail in the data. There appears to be a train of spikes, with much of the data below a magnitude of approximately 0.1. One participant suggested exploring the train of spikes to see if there was a layering of progressively larger spikes following an exponential pattern.

Int64Index: 64403 entries, 0 to 67219
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    64403 non-null  float64
 1   rate    64403 non-null  float64
dtypes: float64(2)
memory usage: 1.5 MB

The cleaning process removed 2,817 data rows, leaving 64,403 rows.


Fourier Analysis

A common first strategy is to see if data contains repeating elements. The go-to tool for uncovering a periodic pattern is the Fourier transform.

A Fourier transform switches the view of data from time to frequency. The following illustrates the idea.


We can see a simple pure sine wave sin(x) has only one frequency, and so has only one peak in the frequency domain. Similarly, a faster pure sine wave sin(2x) also has only one frequency, but a higher one, and so the peak is further to the right at a larger value.

The following shows signals that are made of two pure sine waves combined.


Although the first signal looks less simple, a Fourier analysis shows that it is made up of just two frequencies, that is, sin(x) + sin(2x). The final signal is similar, but this time the higher frequency component has a larger magnitude, that is, sin(x) + 2sin(2x).

You can experiment with building signals yourself:

Fourier transforms are very useful and very powerful. We can see how they can reveal the presence of dominant frequencies if periodic patterns do indeed exist in the data.
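
As a sketch of how that might look in code (assuming the cleaned dataframe from earlier is called clean), using scipy's DCT:

import numpy as np
from scipy.fft import dct

# discrete cosine transform of the rate values; note this implicitly treats
# the samples as if they were taken at regular intervals
spectrum = np.abs(dct(clean['rate'].values, norm='ortho'))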

The following shows the results of the discrete cosine transform (DCT), a slight simplification of the fast Fourier transform (FFT).


Looking at the results, and also zooming into the region near the origin, shows no apparent periodicity in the data at all. The large first spike reflects the presence of a zero-frequency "DC" component in the data, and can be ignored here.

Should we give up on trying to find any periodicity in the data?


Irregular Data Sampling

Before giving up, let's revisit the data. We assumed the radiation measurements were taken at regular intervals. Looking closely at the data shows this isn't true. There are many ways to show this, such as viewing the differences between consecutive time stamps. 
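
A quick sketch of that check, again assuming the cleaned dataframe is called clean:

import numpy as np

# gaps between consecutive time stamps; regular sampling would give one constant value
gaps = np.diff(clean['time'].values)
print(gaps.min(), gaps.mean(), gaps.max())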

The following shows the time stamps plotted one after another. If they were regularly spaced apart, they would follow the straight blue line - we can see they don't.


The popular FFT (and the simpler DCT) assume data is regularly sampled. That means we applied them inappropriately here.


Re-Sampling Data

One approach we can take to remedy the irregularity of the data is to resample it into regular time intervals. Python's rich libraries provide many data processing tools, including the ability to resample data, see the notebook for example code.
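
A sketch of one way to do this with pandas, assuming the time column has already been converted into a datetime index (how to do that depends on the units used in the raw file); here the indexed dataframe is hypothetically called indexed:

# resample to regular intervals, averaging the rate within each interval
daily   = indexed['rate'].resample('1D').mean()
weekly  = indexed['rate'].resample('1W').mean()
monthly = indexed['rate'].resample('1M').mean()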

The following shows the DCTs of the data resampled to 1 day, 1 week and 1 month intervals.


Again, there don't appear to be any standout peaks in frequency.

The process of resampling loses information by definition, which may have destroyed any weak periodic signal in the data. 


Periodograms

Looking back at the data again, we see that we have 64,403 rows over about 13.9 years. That's roughly 10 samples per day on average, which seems very sparse, and perhaps too infrequent to reveal any periodic signal.

The challenges seem stacked against us.

Luckily we can use a different approach for finding periodicities in data, called a periodogram. There are several variants of the method, but essentially they perform a brute-force search for the frequency in a given range that best fits the data.


They are much simpler than the very optimised FFT which also makes assumptions about the data. For us, the key benefit is that they don't require data to be regularly sampled.
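
A minimal sketch using scipy (the frequency range here is illustrative; the actual search range depends on the units of the time column):

import numpy as np
from scipy.signal import lombscargle

# candidate angular frequencies to test against the irregularly sampled data
freqs = np.linspace(0.0001, 10.0, 100000)

# Lomb-Scargle power at each candidate frequency
power = lombscargle(clean['time'].values, clean['rate'].values, freqs)

# the best-fitting period; lombscargle works in angular frequency
period = 2 * np.pi / freqs[np.argmax(power)]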

The following shows the results of a Lomb-Scargle periodogram after the algorithm was configured to search across a rather wide range of frequencies.


This time there is a definite peak in the data. It corresponds to a period of 4.79 hours.


Cygnus X-3 Rotation Period

Through several distinct methods, the scientific consensus is that Cygnus X-3 is a binary system in which the two bodies rotate around each other with a period of 4.79 hours.

It is remarkable that we arrived so accurately at the same conclusion with this very simple exploration of the data.


Principles

As we worked through this data challenge we were reminded of good principles for working with data:



Further Reading

Although simple in concept, there is a lot of complexity in effectively using Fourier transform algorithms. The following additional resources include guides to avoiding the pitfalls.

Saturday, July 18, 2020

Typesetting Beautiful Mathematics in LaTeX, Lyx and MathJax

In this session we introduce LaTeX, which is used pervasively in science, mathematics and computing both to typeset documents and to render equations for use on the web.


The video for this talk is on [youtube].

The accompanying slides are online [here].


Why LaTeX?

It is worth setting out the value of learning, or at least knowing about, LaTeX. LaTeX is pervasive in the mathematics, science and computing community, where it is used to typeset documents such as papers, journal articles and books, as well as for rendering equations on the web.

Although LaTeX itself is quite old, it is still widely used, and the language is assumed by many other platforms and tools.

This is aside from the fact that LaTeX renders mathematics to a very high standard, and is used almost exclusively for publication quality rendering.


The following are sample pages from Ian Goodfellow's seminal 2014 paper on GANs showing extensive typesetting of mathematics that would challenge ordinary word processors.


Typesetting Equations

The following shows the same equation rendered in PDF exports from Google Docs, MS Word and LaTeX.


The top right shows Google Docs' rendering of the equation, and it is clearly the worst, with elements that are not just ugly but illegible. The top left from MS Word is acceptable. The bottom rendering by LaTeX is the most sophisticated, not just clear but also well balanced in terms of spacing and stroke weights.


This set of images shows a deliberately challenging equation. The LaTeX rendering at the bottom succeeds in laying out glyphs for a multi-level expression. Furthermore, it shows LaTeX can make use of a much wider range of symbols which are difficult or impossible with the other systems, in this case the "maps to" right arrow.


History and Concepts

It is very helpful to understand how LaTeX works - and doesn't work. This avoids the common confusion that arises when it is assumed to work like an ordinary word processor.

Donald Knuth is one of the leading computer scientists and mathematicians of the last 100 years. He is mostly known for his encyclopaedic and authoritative The Art of Computer Programming series.


During the 1970s many organisations automated and mechanised their processes, and Knuth's publisher replaced the extremely skilled human craftspeople who manually typeset sophisticated mathematics for publication. Dissatisfied with the quality of the new system, he embarked on a journey talking to typesetters and learning as much as he could about their art. In 1978 he released TeX, his software for typesetting text and mathematics, which encapsulated as much of the expertise of those typesetters as he could.


Today TeX is considered fairly low-level and is not often used directly. Instead, we use other software which provides higher-level abstractions, such as title, chapter, paragraph heading, and so on. LaTeX is the most popular such software, and is, in effect, a set of macros around TeX.

TeX and LaTeX are open source, and part of a very rich and active ecosystem, with a vibrant global community.

An important feature of this ecosystem is packages, which provide macros for typesetting things much more easily than with plain TeX. Packages exist for a wide range of tasks, as small as fractions and tables, and as big as entire document classes. Document classes define a set of standards for particular types of documents, such as journal articles and academic books. Some publishers, such as the IEEE, provide a document class which defines precisely the design of their own journal articles.

Although we don't typically use plain TeX directly, the accompanying video provides a demonstration, to show the nature of the code, and the typographic refinement of the output. The following shows well-judged spacing between the letters F in 'efficiency' and the ligature that replaces FI in 'fiasco'.


The video also demonstrates a minimal LaTeX document (source). The similarities to HTML and other markup languages are significant. We can see how the document class is defined, how additional packages are imported, and how sections like the title and author are tagged.
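
A minimal LaTeX document, in the same spirit as the one demonstrated (the actual source may differ), looks something like this:

\documentclass{article}   % the document class defines the overall design
\usepackage{amsmath}      % an additional package for mathematics

\title{A Minimal Example}
\author{A N Author}

\begin{document}
\maketitle

\section{Introduction}
Inline mathematics such as $a^2 + b^2 = c^2$ sits within the text, and display
mathematics can be aligned on the equals signs:

\begin{align}
  y &= ax^2 + bx + c \\
    &= a\left(x + \frac{b}{2a}\right)^2 + c - \frac{b^2}{4a}
\end{align}

\end{document}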


The output PDF shows a pleasantly typeset page with headings and some mathematics aligned according to the equals sign.


Most practitioners use LaTeX directly, treating it as just another markup language like HTML.

However, not everyone enjoys writing code to write their documents. For them, tools exist which hide the underlying LaTeX and TeX. A leading example is Lyx.


Lyx

Lyx is an open source tool, available for Linux, MacOS and Windows. It looks like a normal word processor, and is intended to be used without ever editing LaTeX source code.

The following shows Lyx being used to create a document, and we can see the main body text, section and chapter headers, and an included graphic, as well as a display equation.


The video walks through the steps of creating a new blank document, entering text, and creating both inline and display (paragraph) equations. We also look at a more complex document, a book, which includes child documents for each chapter, and makes use of automated references to chapters and figures.

The video also demonstrates an important principle of TeX, LaTeX and Lyx - the separation of content from layout. The editors encourage us to write the content, and not manually fiddle with layout and style. The idea is that the TeX engine works out the correct page layout using the content we provide. In Lyx, this is enforced by removing multiple spaces and line breaks as soon as we type them. Lyx is not a WYSIWYG editor, by design. It is possible to override Lyx and LaTeX's document class standards, but this is not encouraged.

The following shows sample pages from a work-in-progress book, developed with Lyx. The document is a book class, and the defaults are mostly retained. The chapter heading typeface has been changed using Lyx's options.


The quality of layout and typography, thanks to Donald Knuth's TeX, gives a distinct professional impression.


Ease Of Use vs Capability

Word processors like Google Docs and MS Word are easy to use, so we have to be clear about when we should invest in learning to use LaTeX or Lyx.

The following is a chart often seen in discussions about LaTeX.


The chart shows that for low-complexity documents, word processors like Google Docs and MS Word are adequate and easy to use. As the complexity increases, both in terms of size and richness of content, especially mathematical content, those word processors soon reach their limits, and even before they do, they become more difficult to use.

LaTeX, on the other hand, has a higher initial barrier because even simple documents aren't trivial to create. However, as complexity rises, LaTeX shows the capabilities it was designed for - large and rich documents.

Lyx provides a sweet spot in the middle. It is easy to use for simple documents, and its user interface covers the majority of uses that LaTeX is otherwise put to. That is, many kinds of documents can be created with Lyx - journal articles and books, letters and reports - without ever needing to write code. However, because it is a simplification, there will be things it can't do. Even then, Lyx provides the ability to embed selected LaTeX within a Lyx document when only minor amounts of code are required.


Maths On The Web

A second, and growing, use case for rendering mathematics is on the web. We are increasingly writing mathematical content in blogs and online articles.

Modern web browsers implement many web technologies, from WebGL to crypto, from WebRTC to WebXR. One of these is MathML, for rendering mathematics. Sadly this standard is not very human-friendly, and is considered something other software should generate.


For this reason, libraries such as MathJax have emerged. They convert LaTeX-style expressions into MathML or, where that isn't available, other rendering mechanisms like SVG or HTML with CSS.

This is just one example of a tool which assumes a basic knowledge of LaTeX, emphasising the benefits of becoming familiar with it.

MathJax can be used with many web platforms and custom developments too. The video demonstrates how MathJax is integrated into Blogger. The tutorial shows how to create simple inline and display equations, as well as a sampling of more complex structures like arrays.

This blog has MathJax included by adding the javascript library to the template. The following is an inline equation, created by typing LaTeX between dollar signs, $y=ax^2 + bx + c$.

The following is an example of display mathematics, which uses double dollar signs to indicate a display formula.

$$ y=ax^2 +bx + c $$

The following shows a similar example which uses LaTeX code to create a coloured bounding box, which is particularly effective on the web, more so than in print.

$$ \bbox[yellow,5px]
{
e^x=\lim_{n\to\infty} \left( 1+\frac{x}{n} \right)^n
}
$$

The video shows this and other equations being coded and rendered. A really useful online LaTeX equation renderer is at codecogs, and is a good way to experiment or practice.  The following shows a more interesting formula being written in LaTeX and rendered, ready for download to be used elsewhere.



Common Examples of Maths LaTeX

The following presents some LaTeX commands for creating commonly used symbols. Remember to use enclosing dollar signs for inline maths, and double dollar signs for display (paragraph) maths.

LaTeX Code / Result

x^2 $$x^2$$
\frac{1}{x} $$\frac{1}{x}$$
\sum_{1}^{n} x^2 $$\sum_{1}^{n} x^2$$
\int_{1}^{\infty} x \, dx $$\int_{1}^{\infty} x \, dx$$
\ln(x) + \pi \alpha \beta + \gamma \rho \sigma + \delta \epsilon $$\ln(x) + \pi \alpha \beta + \gamma \rho \sigma + \delta \epsilon$$
\prod_{p \, \text{prime}} \left( \frac{1}{p} \right) $$\prod_{p \, \text{prime}} \left( \frac{1}{p} \right)$$
< \ldots > \leq \sim \rightarrow \subset \supset \subseteq \supseteq \times \cdot $$< \ldots > \leq \sim \rightarrow \subset \supset \subseteq \supseteq \times \cdot$$

Fuller references are linked below.


References

This tutorial is intended as an introduction to the concepts behind LaTeX, and not an exhaustive reference to the (very many) capabilities of TeX, LaTeX and the most popular extension packages. There are many good references online and in print.

A very useful collection of MathJax commands, a subset of LaTeX, is collated in a stackexchange post.

Overleaf's explanation of maths specific LaTeX is helpful:

Documentation for Lyx isn't as well organised, but some have produced their own comprehensive guides, including this one (pdf).


Recommended LaTeX distributions:

Friday, May 15, 2020

Visualising And Optimising Classifier Performance With ROC Curves

This month we take a look again at measures of classifier performance, and learn about ROC curves which visualise classifier performance so different classifiers can be compared. We also see how ROC curves help us optimise tuneable classifiers.

There is an accompanying video on the group's youtube channel:


Slides used in the video are here: [link].


Classifier Performance Is Not One Number

Many machine learning models can be considered classifiers - deciding whether an input belongs to one of two or more classes.

The following shows an example of a model which classifies a patient as healthy or unwell with a disease.


How do we communicate a sense of this classifier's performance?

We could simply use a score of 4/5 because in testing it classified 5 people as healthy, but only 4 of the 5 were actually healthy. This seems reasonable.

However, we might care more about unwell people because healthy people aren't the problem that needs addressing. We could then say the score is 3/5 because of the 5 people it classified as unwell, only 3 were actually unwell.

Alternatively the greater concern might be avoiding sick people walking around the community who have been incorrectly told they are well. So 4/5 might actually be a better score, or perhaps 1/5 which better reflects the number of people incorrectly told they are well.

What this tells us is that one score is rarely sufficient to understand how well a classifier performs.


Confusion Matrix

We've just discussed several scenarios where different measures are useful for different tasks. These include the fraction incorrectly classified as well, and the fraction correctly classified as unwell.


The above matrix, known as a confusion matrix, shows all four combinations of correct and incorrect classification:
  • True Positive TP - actually well and correctly classified as well
  • False Negative FN - actually well but incorrectly classified as unwell
  • False Positive FP - actually unwell but incorrectly classified as well
  • True Negative TN - actually unwell and correctly classified as unwell

These four numbers provide a fuller picture of how the classifier behaves. We can see the true positives and true negatives are higher than the false positives and false negatives. If we chose to, we could focus on an individual score if it was important, like the false positive (telling someone they are well when they are not).

It can be convenient to combine some of these four numbers into a score that is useful for a given scenario. The following are some very common scores:
  • accuracy is $\frac{(TP+TN)}{(TP+TN+FP+FN)}$ and expresses the correct classifications as a fraction of all classifications.
  • recall is $\frac{TP}{TP+FN}$ and expresses how many of the actually positive data are classified as positive.
  • precision is $\frac{TP}{TP+FP}$ and expresses how much of the data classified as positive is actually positive.


Understanding these scores intuitively doesn't happen quickly, but comes through use. It is better to understand the plain English concepts and look the formulas up when needed.
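
As a small illustration, here is a sketch in Python using the counts from the worked example later in this article (TP=4, FN=2, FP=1, TN=3):

# counts from the confusion matrix
TP, FN, FP, TN = 4, 2, 1, 3

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 7/10 = 0.7
recall    = TP / (TP + FN)                    # 4/6  ≈ 0.67
precision = TP / (TP + FP)                    # 4/5  = 0.8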

Looking at extreme examples can help develop intuitive understanding. The following shows a classifier which, given the same data, classifies only one person as healthy, and everyone else as unwell.


The classifier is not wrong in its healthy classification. However it has lumped almost everyone in the unwell class, which is intuitively wrong. We can consider this classifier as extremely cautious in assigning a healthy classification. We can see the precision is perfect because there are no mistakes in the healthy class. However the recall is very low because it has missed so many actually well people. As a combined score, the accuracy is half, which makes sense because overall it has got half of the classifications wrong.

The next picture shows a different classifier, which given the same data, just classifies everyone as healthy. It probably did no work at all and just lumped everyone in that category.


We can see this time the recall is high, perfect in fact. That's because the classifier doesn't miss any healthy people. But because it has classified unwell people as healthy, the precision is poor.

The confusion matrix is an excellent concise summary of how a classifier behaves. If we want to compare several classifiers, a visual approach is more manageable than trying to digest several 2x2 confusion matrices.


A good way of visualising classifier performance, specifically the information contained in a confusion matrix, is a ROC plot.


ROC Plots

A ROC plot simply plots the True Positive Rate against False Positive Rate. These are just the TP and FP normalised to the range 0 to 1:

  • TPR is $\frac{TP}{\text{total P}} = \frac{TP}{TP+FN}$
  • FPR is $\frac{FP}{\text{total N}} = \frac{FP}{FP+TN}$


The TPR takes account of FN as well as TP, and similarly the FPR takes account of TN as well as FP, and as such the two scores TPR and FPR reflect all four values in a confusion matrix.

As a side note, you might have expected ROC plots to use TNR and not FPR. The FPR became the standard for historical reasons. In any case FPR represents the same information as TNR because FPR = 1 - TNR.


The following shows several key points on the ROC plot. It is worth becoming familiar with them as they are a good mental reference to compare real world classifiers with.


A classifier that assigns everything as negative has zero TPR, but also zero FPR. This is represented as a point in the bottom left corner of the ROC square plot. It is worth remembering this is a useless classifier.

A classifier that assigns everything as positive has a TPR of 1.0, but also an FPR of 1.0. This useless classifier is plotted at the top right of the ROC plot.

A perfect classifier which makes no mistakes at all has a TPR = 1.0, and a FPR = 0.0. This sits at the top left of the ROC plot.

A useless classifier which randomly assigns the positive classification to half the data has a TPR of 0.5, and a FPR of 0.5. This sits in the middle of the plot.

These examples show us that the useless classifiers seem to sit on the diagonal from the bottom left to the top right, and the perfect classifier sits at the top left.

This is correct. The following shows additional useless classifiers which randomly assign a positive classification for 30% and 80% of data. Whenever TPR = FPR it means the classifier is not using any useful information to assign classifications.


If that diagonal line is where useless classifiers sit, and the ideal classifier is at the top left, it means we want good classifiers to be above and to the left of points on that diagonal.


We can see that moving a point up means increasing the true positive rate. This is good. Moving to the left means reducing the false positive rate. This is also a good thing. Doing both is even better!

The following ROC plot shows the three classifiers we discussed earlier.


We can see that the moderately good classifier is in the top left part of the ROC plot. It's not an excellent classifier and that's why it is not closer to the top left corner.

We can also see the classifier that assigns all data to be positive at the top right. It is on the "useless" line as expected.

The very bad classifier which only managed to identify one positive instance is at the bottom left very close to the corner. It's not perfectly bad, but it is pretty bad!

We've seen how ROC plots make it easy to see and compare the performance of classifiers.

Tuning Classifiers

Many classifiers can be tuned or configured. These might be simple settings for thresholds, or more sophisticated changes like the number of nodes in a neural network.

Altering these settings changes the performance of the classifier. Sometimes the change is simple, but sometimes it is not.

The following shows our well/unwell classifier with a parameter that can be tuned.


It is good practice to run experiments with varying settings and record how well the classifier did on test data. Each test will result in a confusion matrix which we can plot on a ROC chart.


In the above illustration we can see several points from several experiments. The dots form a curve, called a ROC curve.

It is easy to see that as the setting is changed from one extreme to another, the classifier starts badly, improves, and then gets worse again. We can read off the point which is closest to the top-left ideal corner, and this will be the ideal setting for that classifier.

Easy!


Classifiers With Scores

Sometimes we can't adjust or tune a classifier. In these cases we can sometimes look behind the classification and find a value that led to that classification, often a probability. For example, neural networks have output nodes that have continuous values, often in the range 0.0 to 1.0, and these can be interpreted as the probability of a positive classification.

We can still take advantage of the ROC method for finding an optimal classifier. We do this by ranking the test data by score, for example descending probability.

We then choose a threshold value above which we say the classifier should produce a positive classification. Of course this might not match the actual classes of the data. We produce a confusion matrix for the classifications made with that threshold, and plot the point on the ROC plot. We repeat this for different thresholds, each time producing a confusion matrix, which leads to a point on the ROC plot.

As we lower the threshold, we classify more points as positive. This also increases the false positives, and we move towards the top right of the ROC plot.
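
Libraries such as scikit-learn automate this threshold sweep. A minimal sketch, using made-up labels and scores:

import numpy as np
from sklearn.metrics import roc_curve

# hypothetical test labels (1 = positive) and classifier scores
y_true  = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

# one (FPR, TPR) point per threshold, ready to plot as a ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)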


The resulting dots on a ROC plot allow us to easily read off the ideal threshold which balances true positive and false positives.

In some cases, this method allows us to find a threshold that is close to, but not exactly, 0.5, even when our intuition tells us it should be 0.5.


Area Under The Curve

ROC plots are rich with information, and we should resist the temptation to reduce that information to a single value. However, a score known as the area under the curve, or AUC, is commonly used. It is simply the area under a ROC curve.


We can intuitively see that the top purple curve represents a better performing classifier, or classifiers, because all the points are closer to the top left. The area under that purple curve is larger than the area under the green curve.

It is easy to conclude that a larger AUC means a better classifier. However, it is worth being aware that this single value can hide important features of a ROC curve.


Here we can see the green curve actually gets closer to the top left than the purple curve, and so is a better classifier when configured properly. Simply comparing AUC would have wrongly suggested that the purple classifier was better.


Class Skew?

A common concern is whether an imbalance between actual positive and negative data might skew the ROC curves or cause them to become inaccurate. This is a reasonable concern as many statistical techniques are in fact impacted by unbalanced classes.

ROC curves aren't affected by class imbalance because the TPR is calculated only from the actually positive data, and the FPR only from the actually negative data. If either of these were calculated using both the positive and negative counts, an imbalance would affect the TPR and FPR values.

Let's look at an example to see this in action. The following shows the previous test data and confusion matrix.


The TPR is 4/6 and the FPR is 1/4.

If we now double the population of healthy people, that is, double the number of actually positive data, we have a new confusion matrix.


Doubling the population of actually well people has resulted in a doubling of the true positives, and also of the false negatives. The FP and TN counts are not affected at all because we still have 4 actually unwell people, and the same classifier treats them just as before.

Since TPR is a ratio, we now have 8/12, which is the same as 4/6. The FPR remains 1/4.

So class imbalance doesn't affect TPR or FPR, and so ROC curves are not affected.

This is an important feature because it allows us to easily compare classifiers tested on different data sets.


Finding The Best Classifier - Simple

Up to this point we've seen how easy it is to read off the optimal classifier on a ROC plot. The intention was to underline this ease and simplicity.

Let's revisit this with a challenging ROC plot. Which of the two classifiers is better?


Both the green and purple classifiers seem to have points close to the ideal top left. In fact, their distance to the top left corner seems equal. How do we decide which is best?

The green classifier has a lower TPR but also a lower FPR at its optimal region. The purple classifier has a higher TPR but also a higher FPR.

We have to decide whether false positives are more costly to our scenario or not. In some situations, maximising the true positive rate no matter what is best. In other cases, we want to avoid false positives because the cost is too high, for example allowing someone to walk around the community with a dangerous infectious disease.

ROC curves don't take this decision away from us; they bring it to the fore and allow us to make informed decisions.


Finding The Best Classifier - Costs of False Positives and False Negatives

Here we'll look again at finding the best classifier, by looking more closely at the relative costs of false positives and false negatives.

Let's start with the simple case where the cost of a false positive is the same as a false negative.


Here we move a line with gradient 1 down from the top left corner until it touches the ROC curve.  The point at which it touches the curve is considered the optimal classifier, or configuration of a classifier. This is what we have been doing before, but without saying we were using a line with gradient 1.

Now consider what happens if we know the cost of a false positive is twice the cost of a false negative. We want to be more cautious now and not accept as many false positives (because they cost more now). That means moving to the left on the ROC curve for our optimal point.


In fact, if you did the algebra, you'd find the slope of the line we use to touch the ROC curve is 2, to reflect the double cost of FP.

Similarly, if the cost of false positives was half that of false negatives, we'd use a line with gradient 0.5.


The false positive rate is higher now, and that's ok because they're cheap.

We might be tempted to say that the gradient is simply $\frac{\text{cost of FP}}{\text{cost of FN}}$, as we've seen so far.

Actually, the class balance, or imbalance, does matter now. If we have a very small number of actual negatives in the data, then even if a false positive carries a higher cost, false positives can't occur in large numbers. So our gradient also needs to take into account the class balance.

$$\text{gradient} = \frac{\text{neg}}{\text{pos}} \times \frac{\text{cost of FP}}{\text{cost of FN}}$$

The video tutorial linked above includes an example which includes both factors.
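
In code, choosing the operating point then amounts to finding the ROC point touched first by a line of that gradient, which is the point maximising TPR minus gradient times FPR. A sketch with made-up numbers:

import numpy as np

# hypothetical ROC points from a threshold sweep
fpr = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
tpr = np.array([0.0, 0.5, 0.7, 0.85, 0.95, 1.0])

neg, pos = 100, 100          # class counts in the test data
cost_fp, cost_fn = 2.0, 1.0  # a false positive costs twice a false negative

gradient = (neg / pos) * (cost_fp / cost_fn)

# the iso-cost line touches the ROC curve at the point maximising tpr - gradient * fpr
best = np.argmax(tpr - gradient * fpr)
print(fpr[best], tpr[best])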


New Classifiers From Old

Have a look at the following ROC plots of two tuneable classifiers.


The optimal configuration for the green classifier has a moderate TPR but a low FPR. The optimal point for the purple classifier has a higher TPR but a higher FPR too. Using either on its own, we can't achieve a TPR-FPR combination that's somewhere between the two.

Can we combine the two classifiers in some way to achieve an intermediate performance? MJ Scott (paper linked below) proves that we can do this.


In the plot above we have the green classifier A, and the purple classifier B. Scott proves that we can combine A and B to create a classifier C whose TPR-FPR lies on the line between the two. This is done by randomly using A and B on data instances in a proportion that matches how far along the line we want to be. Clearly, using A a lot will mean we're closer to A. The following shows C, which is halfway between A and B, achieved by using A and B each half the time at random.


This can be even more useful than interpolating performance.

The following shows a dataset which a simple linear classifier will find hard to partition well.


There are two classes, shown as blue and green, and the task of finding a threshold to cleanly separate the two is impossible.


Any attempt at a good threshold has large amounts of the wrong class on each side. The following shows the ROC plot for a classifier as this threshold varies.


We can see that as the threshold rises, the TPR stops rising while the FPR keeps increasing. At the halfway point, the classifier is as useless as a random classifier. Continuing on, we have a rise in TPR but also a large FPR too. We can see straight away that the ROC curve does not go anywhere near the ideal top left corner.

By picking the best points on this curve, we can combine them to achieve a classification rate that is closer to the ideal corner and not possible with either individual classifier.

If you're not convinced this works, the video includes a worked example with new TPR and FPR rates worked out for a combined classifier.
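
The arithmetic is straightforward. A sketch with made-up operating points for A and B (not the values from the video):

# hypothetical operating points for classifiers A and B
tpr_a, fpr_a = 0.60, 0.10
tpr_b, fpr_b = 0.90, 0.40

# use A on a random fraction p of instances, and B on the rest;
# the combined classifier C lies on the straight line between A and B
p = 0.5
tpr_c = p * tpr_a + (1 - p) * tpr_b   # 0.75
fpr_c = p * fpr_a + (1 - p) * fpr_b   # 0.25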

The following is based on a plot showing experimental results from Scott's paper.


This example and illustrations are taken from Scott's paper.

Summary

The key benefits of ROC analysis are:
  • An easy to read and interpret visual representation of classifier performance. 
  • An easy visual way to compare classifiers, even different kinds of classifiers.
  • Independence from class balance, allowing comparison of classifiers tested on different data.
  • A simple way to find an optimal classifier or configuration, even when we care about the costs of classifier errors.
  • The ability to combine classifiers to make one that performs better than the individual ones.


Further Reading