Tuesday, January 28, 2020

Functional Programming and Data Science Applications

This month we had a talk introducing functional programming, its benefits and challenges, and potential applications to data science.

This blog post won't replicate all of the detail of the talk, but will instead provide an introduction for newcomers and some examples of the key concepts.


The talk was given by Callum O'Brien who has expertise in both functional programming and mathematics. Amongst his many other interests are operating systems and their design.

Callum's slides are here: (plain text), and his code examples are here: (link).

A video recording of the talk is on the group youtube channel: [youtube].


Context

Functional programming has often been thought of as a niche, difficult and very theoretical pursuit with little application to building real-world technology.

Against this false perception, the last decade has seen a growing interest in functional languages and functional programming styles in everyday languages. Demand for functional programmers is also rising in some sectors.

Today, software is much larger and more complicated than it was 20 and 30 years ago. As the size of software grows, so does the challenge of avoiding or finding errors in that code. Some classes of error are a result of that complexity. Because functional programming avoids entire classes of error by design, the growth of software has seen a growth in functional programming.


What is Functional Programming?

Every programming language has a model for how it works, and how you should code.

The model for 6502 assembly language is the CPU registers and memory, and programs manipulate data amongst these using moves and simple operations like arithmetic. Object oriented programming languages encourage you to think in terms of semantically high-level objects which contain data and methods which may act on that data.

The model for functional programming is strictly mathematical functions, just like f(x) = 2x + 3 for example. This might seem like an unusual model for a programming language so let's look at some properties of mathematical functions.

  • Mathematical functions always give the same output given the same input. 
  • Mathematical functions don't refer to information beyond what they are passed as arguments, and don't leave any information beyond what is returned by the function.
  • Simpler mathematical functions can be combined, or composed, to achieve more sophisticated computation.

Let's think about how each of these might apply to a programming language.


1. Referential Transparency
The first property might sound trivial. Of course f(x) = 2x + 3 always gives f(5) = 13. But any experienced programmer will know that many of the functions we come across don't behave this way. Too many hours are spent hunting bugs which cause functions to give unexpected results. The worst bugs are where functions give unexpected results sometimes, but not always.

Now, if a programming language guaranteed that functions always returned the same outputs given the same inputs, many hours of debugging would be avoided. Our confidence in the correctness of a program also increases if we know the functions behave in this strict manner.

Purely functional languages only allow you to write functions that have this property, called referential transparency.



2. No Side-Effects
That second property also sounds trivial. To evaluate the function f(5), all the information needed is in the definition of the function, 2x+3, and the information passed as an argument, 5. There is no need to look outside the function at all. If our code obeyed this property, our functions would work totally independently of whatever else was happening in our code, or indeed elsewhere in our system. This independence is needed to support the first property. If a function depended on some external variable, or some data in some object, then it wouldn't be independent, and we couldn't make strict guarantees on it always producing the same result given the same input.

That second property also says that a function doesn't leave behind any information after it has been evaluated. Evaluating f(5) leaves us with only one outcome, which is the answer 13. No information elsewhere is changed. Again this sounds trivial, but in real-world software, functions break this rule frequently, changing information outside the function or leaving behind newly created information. Changing external data means others that depend on that data can no longer guarantee the same output for every input. In complex software, many bugs happen because something has changed data in a way you didn't expect. The dependencies can grow into an entangled, error-prone web.

Purely functional languages enforce this property by preventing your functions having any side-effects.



3. Composability
The third property also sounds trivial. If we have g(x) = 3x, we can have combinations like h(x) = f(x) + g(x) as well as h(x) = f(g(x)). We can use smaller functions to build bigger ones, and they too will behave in the same strict way as the smaller ones. While this might not sound like much, for software engineering, guaranteeing that a bigger program made of many smaller functions will also behave just as well as the smaller ones is a significant benefit.

But there's more in this third property. In the composition h(x) = f(g(x)), we built a new function out of the functions f and g, treating functions themselves as values. To avoid using data that is external to our function, the only way to pass information to a function is as an argument. If functions were only allowed to take and return numbers, building useful programs would be really hard. Being able to pass functions as arguments makes building programs much easier.

If we're composing programs out of functions, and those functions can take functions as arguments, and those functions could also take functions ... we can see that functions must also be able to return functions, not just numbers. This is why functional programming languages treat functions as first-class citizens.
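Even in python, where functions are first-class citizens, we can see this idea at work. Here is a minimal sketch of composition, using the f and g from above:


# composing functions in python

def compose(f, g):
    # compose takes two functions and returns a new function
    return lambda x: f(g(x))

def f(x):
    return 2*x + 3

def g(x):
    return 3*x

h = compose(f, g)

print(h(5))   # f(g(5)) = f(15) = 33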



That was a lot of discussion from just three observations about mathematical functions!

Why Maths?
It is worth asking why languages might want to use mathematical functions as a model. The reason is that mathematics is strict enough for us to be able to make guarantees about mathematical functions, and about compositions of mathematical functions. Software engineering would benefit hugely from similar guarantees. We've talked about guaranteeing the same outputs for the same inputs, but there are other guarantees too. In mathematics we can prove things about entire classes of functions and algebras. In software, this would mean we can apply guarantees to entire classes of functions or data types. These guarantees can help us understand how a program will behave, and can also help the compiler to make optimisations.

The key mathematical theory for computation, lambda calculus, was developed in the 1930s by Alonzo Church (Alan Turing's PhD supervisor), before practical computers were ever built. Many functional languages derive their design from the lambda calculus.


Let's now bring some of this theory to life with some illustrative examples.


Example: Pure Function vs Impure Function

Let's have a look at this python function which joins two lists together.


# impure function

a = [1,2,3]
b = [4,5,6]

def impure_append(b):
    a.extend(b)
    return a

print(impure_append(b))

print(a)


When we run the function, we do get a result which is the two lists joined together. However, the list a has been modified by the function, and we can see this when we print it after impure_append() has been called.


[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6]


Clearly impure_append() is not a pure function because it modifies data outside itself. Another reason it is impure is that the output of impure_append(b) does not depend on b alone, but on data outside the function which can change. In fact, if we call impure_append(b) again, we get a different result.


[1, 2, 3, 4, 5, 6, 4, 5, 6]


Now have a look at a pure version of this function.


# pure function

a = [1,2,3]
b = [4,5,6]

def pure_append(a,b):
    return a + b

print(pure_append(a,b))

print(a)


When we run the function, we do get a result which is the two lists joined together. We can also see that the list a has not been modified.


[1, 2, 3, 4, 5, 6]
[1, 2, 3]


The function pure_append() will always give the same results for the same inputs, and it does not modify any data outside itself.

Although python is not a strict purely functional language, we can still benefit from building our programs with well-behaved, safer functions.


Example: Imperative vs Declarative

The requirement that functions depend only on the information passed to them (including functions passed as arguments), and that they have no side-effects, encourages a style of programming with no statements that create or change state directly. Instead, we write expressions which express how the input is related to the output - a style called declarative programming.

Lots of programming is done in an imperative style. We state what needs to happen to data, and in what order, for the output to be calculated from the input.

In declarative programming, we only state how output is related to input using functions, just like a mathematical relationship using mathematical functions.

Have a look at the following python code for a factorial function. It is written in a traditional imperative style, where data is created and then changed in a loop. The order the instructions are carried out is important to the correctness of the result.


# imperative factorial

def imperative_factorial(n):
    
    if n == 0: return 1
    
    product = 1
    for i in range(1, n+1):
        product *= i
    return product

imperative_factorial(5)


Another observation is that product is 1 initially, but later is something else. This can't be true mathematically. In functional programs, variables are immutable. This shouldn't be seen as a limitation, but as a guarantee that those variables, and indeed functions since they're first-class, will always have a single, always-correct value.

The following is a declarative version. It relies on the fact that n! = n * (n-1)! or factorial(n) = n * factorial(n-1). Although it might take extra thought, the resulting code is simpler. The definition of the function doesn't focus on what to calculate but focusses on how the input and output are related.


# declarative factorial

def declarative_factorial(n):
    
    if n == 0:
        return 1
    else:
        return n * declarative_factorial(n-1)

declarative_factorial(5)


What's more, this definition is recursive, which means the function is defined in terms of itself. Because functions must be pure, functional programming relies much more on solving problems recursively. Iterating is done recursively, not with a counter-based loop.

The above examples were in python to demonstrate the ideas in a familiar language. Functional languages are often much more concise in expressing these kinds of function. The following is a Haskell definition of a factorial function.


factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * factorial (n - 1)


The first line declares what our function factorial does: it takes an Integer and returns an Integer. The second line defines factorial for the case where it is applied to 0. The final line defines factorial of n as n multiplied by factorial applied to (n-1). Simple and clean! Notice also how this definition is much more mathematical, stating the relationships and the special case (n = 0).

A more compelling illustration of declarative programming is the Haskell example of the Quicksort sorting algorithm. Quicksort takes an item from an unsorted list, the pivot, and moves everything smaller than it to the left, and everything bigger to the right. The pivot is then in the right place, but the sub-lists to its left and right are still unsorted. The same method is then applied recursively to each left and right sub-list. Even describing this suggests that the equivalent Python or C code would take quite a few lines. The following Haskell function does it declaratively.


quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser  = filter (< p) xs
        greater = filter (>= p) xs


The first line is the signature, stating that the quicksort function works on lists of order-able data types: it takes a list as input and returns a list. The next line simply states that the quicksort of an empty list is an empty list. The third line has the magic. It says that the quicksort of a list (p:xs), where p is the head and xs the tail, is simply the concatenation of the quicksorted lesser list, p itself, and the quicksorted greater list ... where lesser is all values from the tail xs that are less than p, and greater is those values greater than or equal to p.

Notice how this very concise definition of the quicksort function focusses on what a quicksorted list is, rather than the memory-level mechanics of how to re-arrange a list to get it sorted.

You can try Haskell yourself using an online interpreter: https://repl.it/languages/haskell together with a beginner-friendly practical tutorial http://learnyouahaskell.com/.


Performance and Parallelism

In recent years, it has become harder and harder to make commercially available computers faster. More specifically, the performance of a single CPU has not increased to the extent it did 20 and 30 years ago. Instead, CPU and system builders are trying to improve performance through parallelism - multiple cores in CPUs, and multiple CPUs for larger systems. For even larger systems, higher performance can be sought with computing nodes distributed over a network.

The hardware is only half the solution. We need our software to work efficiently across multiple computing nodes, and also to scale well with more nodes.

Experienced programmers will tell you that developing concurrent and parallel software in most languages is a major challenge. Not all tasks can be parallelised, but when they can, the software is often plagued by hard-to-debug issues like race conditions, and by the synchronisation and locking needed to avoid them.

Parallelism makes ensuring data is not incorrectly changed even harder.

Luckily, functional languages require that no function can rely on, or change, information outside of itself. This isolation, which some might have considered painful, is a huge benefit in this era of parallel and distributed computing.

It is striking how painlessly functional programming languages, with their roots in mathematical theory developed in the 1930s before computers existed, can take advantage of modern parallel hardware, while languages like C, Java and Python struggle!

There is something else to say about performance. Because a pure function always returns the same result for the same inputs, we can safely remember previous results and avoid doing the calculations again the next time a function is called with the same arguments. This is called memoisation.

In languages that don't require functions to be pure, this kind of optimisation is difficult and limited to a small number of very well understood functions, which are rarely those written by the user.
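Python's standard library makes this easy to try. Here is a minimal sketch using functools.lru_cache to memoise a pure factorial function - safe only because the function always returns the same result for the same argument:


# memoising a pure function

from functools import lru_cache

@lru_cache(maxsize=None)
def factorial(n):
    # pure, so caching previous results is safe
    return 1 if n == 0 else n * factorial(n - 1)

factorial(500)   # computed recursively
factorial(500)   # the second call is a cheap cache lookup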


Testing & Verification

The task of testing software is dramatically easier if it is purely functional. We can test that a function produces the correct output for a given input, and that is a complete test of that function's behaviour. We don't need to worry about external factors changing the output, because pure functions don't use external information and don't change information outside themselves. In practice, this avoids the huge effort put into mocking up external or surrounding software and systems.
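As a small illustration, a complete behavioural test of the earlier pure_append() function is nothing more than input-output assertions - no mocks, and no setup of surrounding state:


# testing a pure function

def pure_append(a, b):
    return a + b

# same inputs always give the same outputs, and there is nothing else to check
assert pure_append([1,2,3], [4,5,6]) == [1, 2, 3, 4, 5, 6]
assert pure_append([], []) == []
assert pure_append([7], []) == [7]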

Because pure functional languages are based on a mathematical framework, they are easier to "prove" correct. In reality, this depends on how closely a particular language aligns to a mathematical framework such as lambda calculus. Haskell programs are much easier to verify than C programs, for example, and are amenable to proof assistants like Coq.


Types, TypeClasses

Functional programming languages, having come from a mathematical foundation where correctness is emphasised, tend to have strong type systems. Types describe values, so 1, 4, 6 are all values of type integer, and 'a', 'x', 'z' are all values of type char. Haskell has a rich type system which applies also to functions, because they are first-class citizens.

Type classes define both possibilities and constraints for types. For example, integer and character types both implement equality, and so equality (Eq in Haskell) is a type class that covers them both.

Many classes of error can be avoided more elegantly by enforcing consistency and constraints at the type level.
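Haskell's type classes don't map directly onto python, but python's typing.Protocol offers a rough analogy: we declare the capability a type must support, and a type checker enforces the constraint. A sketch of an Ord-like constraint:


# a rough python analogy to Haskell's Ord type class

from typing import Protocol, TypeVar

class SupportsLessThan(Protocol):
    def __lt__(self, other) -> bool: ...

T = TypeVar("T", bound=SupportsLessThan)

def smallest(items: list[T]) -> T:
    # like Haskell's `Ord a => [a] -> a`: any order-able type will do
    result = items[0]
    for item in items[1:]:
        if item < result:
            result = item
    return result

print(smallest([3, 1, 2]))    # 1
print(smallest(["b", "a"]))   # a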

I found this video useful in understanding the difference between types and type classes:



The Input/Output Puzzle (and Monads)

A very common question about pure functional programming is how any program can be useful if it can't interact with the outside world. To be useful, many programs need to obtain information from a user, a file system, or a network, for example, and also write information back to similar external systems.

It seems input/output contravenes the requirement that functions stay pure and have no side-effects.

How to solve this puzzle has been a hot topic of research in recent years, and work is ongoing. The core aim of functional language designers is to ensure that a program, which is a composition of functions, retains the referential transparency guarantee - necessary both for validating the correctness of code and for correctly optimising the execution of the program. A secondary aim is to ensure programmers don't inadvertently cause bugs by reading invalid data or overwriting good data with bad.

Imagine a function getName() which reads information from the user (or a database, or a network server). If we call it again, it might return a different value. Straight away this breaks referential transparency. Remember, all functions must always return the same output when called with the same inputs.

Let's consider another problem. Imagine two requests to read external information, a = getPrice() and b = getPrice(), the second called at a later time. A good functional compiler or runtime knows that a pure function will always return the same answer, so it will simply reuse the first answer for the second call - why would it calculate it again? Even if it didn't do this, there's no guarantee the first call will actually run first, because pure functions can be evaluated in any order as long as their arguments don't impose one, and here they don't. It gets worse! Some languages are lazy, which is a good thing, but laziness doesn't sit well with the time-sensitive need to "get this price now, not later".

One solution is to model the external world as a data type, and to pass values of it as additional arguments to functions. So a function getName() is passed a state of the world, and it returns not just the desired information but also a new world state, because the world has changed. Code would look something like this: (name, world2) = getName(world1). Another call to getName() could not use world1 again, because that world is now in the past - and some functional languages enforce this. Notice how getName(world1) and getName(world2) can now legitimately return different values, because they have different arguments.
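A toy python sketch of this idea, with the world modelled as an immutable list of pending inputs (the names here are illustrative, not any real library):


# a pure 'read' that returns a value AND a new world state

def get_name(world):
    name = world[0]
    new_world = world[1:]
    return name, new_world

world1 = ["alice", "bob"]
name1, world2 = get_name(world1)   # ("alice", ["bob"])
name2, world3 = get_name(world2)   # ("bob", [])


Each call is pure: the same world value always produces the same result.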

Too often the question of functional programming and side-effects leads to the mention of something mysterious called a monad. Monads are considered by many to be a difficult concept, with plenty of people warning mere mortals to keep away.

Words like monad, functor and monoid come from the mathematical study of computation and functions, applied to strict programming languages that aim to give us the same guarantees as mathematics itself.

We won't go into detail here, but will simply leave an intuition about monads. That augmenting of a function's arguments and return value with a world value, which we just talked about, needs to be done in a way that still allows functions to be chained together - composable - in the way we're used to. The underlying structure that allows this is a monad. That's it. A monad doesn't magically solve side-effects; it is just a way of augmenting functions with that additional unique-making information, enabling functions to stay pure whilst remaining composable.
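Here is a rough python sketch of that intuition, and only a sketch: a "world-carrying" function takes a world and returns a (value, new world) pair, and a small bind helper chains two of them together so that the plumbing stays hidden. The names and numbers are illustrative:


# chaining world-carrying functions

def bind(computation, next_step):
    # next_step takes a value and returns another world-carrying computation
    def chained(world):
        value, new_world = computation(world)
        return next_step(value)(new_world)
    return chained

get_price = lambda world: (100, world + 1)               # a toy 'read a price'
add_tax = lambda price: lambda world: (price * 1.2, world + 1)

program = bind(get_price, add_tax)
print(program(0))   # (120.0, 2) - the value plus the final world state


The bind helper is the underlying structure that keeps each step pure while letting us compose the steps into bigger programs.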

Monads have become lore in the world of computer science, responsible for many memes, myths and legends!


For a programmer, a nice feature of languages like Haskell is that any function that wants to have a side-effect has to declare it in its signature and return type. You can't get away with slipping a getPrice() inside an ordinary-looking function that returns an integer - the return type must be something like IO Integer. In this way, anyone else using your function knows that it has a side-effect, reading data from an external source for example.

You can read a good explanation of the IO problem here:


A mathematics-light starting point for understanding monads:



Recommendations For Real World Use

When considering functional programming for real world use, there are several factors to evaluate.

An important consideration is the scarcity of people who can develop software in, say, Haskell, which is not yet as widespread as Javascript or Python. This scarcity is a risk to be weighed against the potential benefit. There is growing demand for Haskell programmers in sectors like finance, where it is viable because those sectors have access to a sufficient pool of Haskell programmers.

A pragmatic strategy for all sectors is to infuse the discipline and good practice encouraged by functional programming into your existing development. Many languages are adopting syntax to support a functional style of programming, but simply using recursion, list comprehensions or maps isn't sufficient for writing safer, more correct code. Adopting patterns that minimise shared state and make transparent what functions are doing is what will have an impact, as the small sketch below illustrates.
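Even in everyday python the shift is visible - from mutating shared state to pure expressions:


# imperative style: mutate an accumulator step by step
orders = [120, 40, 260, 80]

total = 0
for order in orders:
    if order > 100:
        total += order

# declarative style: a pure expression relating output to input
total = sum(order for order in orders if order > 100)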

The natural ability of functional programming to do parallel computation with ease is a major growth area for functional programming, and one worth investigating if this applies to your work. The Erlang functional language has been used in highly-concurrent telecoms for some years now.

One potential disadvantage of functional languages is that they use memory very liberally. In the early days even simple programs would consume inordinate amounts of memory; today's compilers and runtimes are much more efficient. If you require tightly constrained or highly predictable memory consumption, functional programming is unlikely to be a good fit - but for the vast majority of sectors, memory is a very cheap resource.


More Reading


Saturday, November 30, 2019

Special Event - Big Data Challenges in Physics

This month was a special event jointly organised with the Institute of Physics who have been focussing on the theme of big data in 2019.


Dr Craig McNeile's slides are here: (pdf). A video of his talk is on the group's youtube channel: (video).

Prof Robert J Hicken's slides are here: (pdf). A video of his talk is on the group's youtube channel: (video).


Talk 1: Using Supercomputers to Search for the Breakdown of the Standard Model of Particle Physics

Dr Craig McNeile, a Lecturer in Theoretical Physics, University of Plymouth, gave an introductory overview of some of the big open questions in particle physics.


Craig started by clearly setting the scene and the problem. Observing how the universe moves suggests there is more matter present than the amount of light we observe would imply.


In fact the observed matter is only 5% of the matter that must be there causing the gravitational effects. The invisible 95% is called dark matter - it doesn't interact with electromagnetic radiation, so we can't see it even with instruments that record x-rays or infrared.

That dark matter must be composed of sub-atomic particles that we haven't yet discovered. And the search for dark matter is centred on the search for these particles.

Scientists have, over the years, discovered many particles, most recently by smashing particles together at high energy to see if new particles are created with that energy. The higher the energy, the greater the chance of creating larger mass particles. But higher energy colliders are more expensive - the Large Hadron Collider at CERN cost just under £4 billion to construct.

Detecting particles sounds simple but is a complex process of deduction from vast amounts of data generated by detectors around a collision. The amount of data generated is huge and the calculations required to deduce the presence of particles, with sufficient confidence, requires super computing resources.

To guide the search, we need theories that predict how matter is composed and organised. The Standard Model describes all known particles, and all interactions (weak, strong, electromagnetic) except gravitation. Because it is a highly symmetric model, it describes particles that were not observed initially but were found years after they were predicted. The Higgs boson is a recent notable example.

The Standard Model doesn't predict dark matter - so new theories are needed. The so-called "string theory" is one of these. Much of the computational work Craig went on to describe concerns QCD, quantum chromodynamics, the theory of the strong force within the Standard Model.

The search for particles is complicated by the Heisenberg uncertainty principle: for a short period of time, particles can pop into existence and then disappear - fluctuations in the vacuum. These short-lived particles do have an effect on longer-lived particles, which becomes significant when we're sifting through observations to find the subtle effects of new particles.


Modelling the theoretical dynamics of QCD, so they can be compared with observations, is mathematically involved, and in practice is done using the Monte Carlo random sampling method, which makes otherwise infeasible calculations tractable.

These discrete calculations are huge and require a distributed computing architecture. Craig indicated how GPUs are particularly effective, and how compute nodes use MPI and other message passing protocols to coordinate distributed computation. He explained that computations over 2000 cores are fairly common!

Craig discussed how significant effort is put into optimising code to take advantage of the hardware. He suggested that a compute efficiency of 38% of the theoretical peak was achieved. This indicates how difficult it is to optimise code to best take advantage of the available hardware's capabilities.

A real problem is that optimising code for particular hardware is a human-intensive and expensive task, and the benefits don't last, as new and better hardware architectures emerge regularly. The portability of code across computers, particularly with respect to compute efficiency, is a hard problem.

Craig also gave us an insight into the challenge of data storage, organisation, and transport. The datasets are typically large - petabyte scale - and computation needs the data to be local to the compute nodes. This means it is often being transferred across relatively slow networks.

Craig underlined the importance of visualisation to make sense of the data and computation results. The following shows an effect on the vacuum between a quark and anti-quark as they are separated, something which is not so easily understood merely by looking at the numerical results of the simulations.


Interestingly, Craig explained how the economics of high performance computing mean that building your own clusters is always cheaper than trying to achieve similar performance with consumer clouds such as AWS, Azure or Google.

The Dirac academic project is a collaboration between universities and includes systems with thousands of nodes, e.g. Cambridge with 1152 Skylake nodes and Edinburgh with 4116 Xeon processors!

Despite the huge theoretical effort and large sums of money invested in experiments and computational power, the search for dark matter continues!

One member of the audience asked an interesting question as to why people search for something for which there seems to be such thin or subtle evidence or clues. In my opinion, the search for the truth is an inherent human motivation, particularly when that truth is mathematically beautiful.


Talk 2: The Challenge of Storing Ever Bigger Data

Prof Robert J Hicken, a Professor of Condensed Matter Physics, University of Exeter, gave an overview of the challenge of storing and retrieving ever larger volumes of data, and the need for new technologies which can scale to billions of users, are faster and energy efficient.


Rob set the scene with figures showing the huge amounts of data we store, to the scale of exabytes. Billions of people are uploading videos and photos constantly, and demand these are stored for potential later retrieval!

The volume of data isn't the key problem on its own; the energy required to organise and retrieve that data is significant.

"One Google search uses ~ 1 kJ of energy (equivalent to running a 60 W light bulb for 60s), and Google carries out more than 1 trillion searches per year"

This is a huge amount of energy when multiplied by the number of Google searches happening globally in any time interval.

Perhaps surprisingly, spinning platter hard disks (HDD) maintain an edge over solid state storage (SSD) in terms of the density of data per unit area. This is important in large installations such as cloud-scale data centres. HDDs also remain cheaper per volume of data too.


As demand for storage grows we need to explore how this technology can either evolve or change more drastically.

Rob's research has this motivation at heart - greater storage density, at lower energy consumption, and faster where possible.

Rob took us back to basics, explaining how traditional storage media, tape and HDDs, use materials which can be locally magnetised in one of two directions to denote the binary bits, 1 and 0.


Surely, the smaller these regions, the more data we can store per unit area?

The problem is that the smaller the 'grains', the more susceptible they become to thermal noise. This could be addressed by using materials that require more energy to switch between 1 and 0, but then we haven't addressed the energy efficiency challenge.

Rob summarised the trade-offs as a triangle, a trilemma of storage density, writability and thermal stability, stating we can only have 2 of the 3.


Rob then explained that this limiting triangle can be addressed by new paradigms for the data recording process.

One approach is to use heat assisted magnetic recording, HAMR, to lower the energy barrier to writing data. Another is to use microwave excitation which has a similar effect, MAMR.

Rob explained that Seagate will be manufacturing HAMR heads in the UK, and showed a video illustrating how the required precision is achieved using a planar fibre-optic guided laser to focus energy via a gold peg.


This is an example of new, and relatively advanced techniques, being industrialised and commercialised to meet demand.

Rob showed a chart with HAMR in use for the period 2019-2023, beyond which a heated-dot method is predicted until 2025. Beyond that is an open question.

One area of excitement is the ultrafast demagnetisation effect, where an ultrashort laser pulse of 100 femtoseconds caused a halving of the magnetisation in a nickel film in less than 1 picosecond.


This is far faster than any other observed change in magnetic state - and a very promising basis for future storage technologies. Typical times for storing or retrieving a bit are about a nanosecond, even though data can be transmitted over fibre optics at Gbit/s rates.

Widening the view, Rob shared phenomena where circularly polarised optical pulses also produce very fast changes in magnetisation in materials - and such light could lead to even more energy-efficient storage, which also does away with the need to manage the heat egress of the HAMR technique.


Thoughts

The evening was a special event with invited scientists, and this attracted a larger audience. The broader impact is that science is brought closer and more directly to the community and this is always a good thing - not only from the engagement between academia, industry and community, but also because it makes science a more realistic career option for many.

For this reason I was particularly pleased to see younger members of secondary school age at the event!


As Toby of Headforwards also stated, any successful region needs a healthy vibrant and active science and technology scene. It makes the region more externally visible, attracts investment, and provides pathways for high-skills careers.

More fundamentally, it allows people to meet and this real-life social network is the basis of all successful ecosystems and economies. And supporting this is the aim of Data Science Cornwall.

Personally I was very pleased to have strengthened links with the Institute of Physics, and hope to do more with them in the region.


Acknowledgements

This event was jointly organised with the Institute of Physics, the leading professional body and learned society for physics in the UK and Ireland.

The event was also supported by the Cornwall Science Community, formerly the Cornwall branch of the British Science Association.

Ongoing advice and refreshments sponsorship were kindly provided by Headforwards.

Thanks also to Falmouth University for continuing to host us.

Saturday, September 28, 2019

Practical Steps for Building Intelligent Systems

This month we had a talk encouraging us to think more broadly than a specific algorithm or library, and to consider wider questions: understanding the problem, evaluating solutions, understanding data and bias, measuring performance and accuracy, and weighing the benefits of a solution.


The slides for this talk are here: [pdf].

A video recording of the session is on the group youtube channel: [link].


The Challenge

A naive approach to building a solution to a machine learning problem is to pick a learning algorithm, train the model with data, and then use it.

This is naive for many reasons. Here are just some of them:

  • We didn't think about the suitability of the data we're training with
  • We didn't understand the accuracy of the trained model
  • We didn't consider alternative learning algorithms
  • We didn't understand the value of the solution, comparing benefits to costs, not all of which are technical


We were very lucky to have Aleksandra Osipova, a data scientist at Headforwards, to lead this session, encouraging us to think about these broader important questions in the context of a methodology for developing machine learning solutions.

Aleksandra has a masters in Complex System Modelling from Kings College London, where she worked on computational neuroscience.


Overview

The following diagram summarises the key areas covered by Aleksandra's talk.


She started by outlining a set of key questions we should start asking from the start of a machine learning problem:

  • why - what's the problem
  • what - what are the possible solutions
  • how - machine learning techniques
  • how well - robustness, accuracy, scalability
  • value - benefits vs costs including operational and people
  • risks - data bias and incompleteness, ethics



The Problem

Aleksandra rightly emphasised the still too common enthusiasm for a technical solution which doesn't actually solve the problem at hand or, more fundamentally, address the user needs.

User-centric approaches encouraged by agile and other methodologies can really help here by enforcing a solution-agnostic analysis of the problem.

Sometimes the best solution is not a complex machine learning model. And your process needs to give you the ability to find these.


Data

Data is how we train our machine learning models, and its importance is critical: to a very large extent, the data is the only thing we can learn from.

Our analysis should identify which data we need to solve a problem. Often we don't have ideal or perfect data, and so the challenge is to understand how useful the data we do have is.

This is the step that carries most risk. Failure modes include:

  • poor data quality that isn't identified
  • insufficient data to actually learn from
  • bias in data


As with identifying the problem, Aleksandra underlined the importance of speaking with domain experts - people who understand or have experience of the problem domain.

Domain knowledge can help select data and shape the learning process, and can make intractable problems feasible and difficult problems easier.

It is a common saying that "80% of a data scientist's work is understanding the data". We'd extend this to understanding the problem domain.

One area that Aleksandra picked up on, and which is too often overlooked by others, is that data itself can be dynamic. It is therefore important to explore how data behaves dynamically in different phases. A good example is EEG brain wave data: it is dynamic, but widening our view, we can see it follows different categories of dynamic behaviour. Looking at individual data points would be too narrow a view.


Machine Learning

Before diving into a solution or technique, Aleksandra recommended we apply the discipline of developing and testing hypotheses.

Hypothesis testing is a mature, if not always well understood, field of statistics. It provides assurance that we haven't incorrectly selected a hypothesis when in fact there is statistical evidence to support alternative or inverse hypotheses. The wikipedia page provides good illustrations of hypothesis testing, with the criminal trial being particularly educational.
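As a small, illustrative example of the machinery (the numbers here are made up), a two-sample t-test with scipy:


# two-sample t-test

from scipy import stats

# illustrative samples: a metric measured for two versions of a model
a = [0.12, 0.15, 0.11, 0.14, 0.13]
b = [0.16, 0.18, 0.17, 0.19, 0.15]

# null hypothesis: the two samples share the same mean
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)   # a small p-value is evidence against the null hypothesis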

Staying broad in context, Aleksandra explored key questions around the machine learning step, many non-technical. For example:

  • is there a cost to obtaining and using the necessary data?
  • what is the computational cost of the given machine learning model?
  • does the machine learning model need to be updated online, even in real-time, or is it trained off-line?
  • is updating the model with new data possible, feasible?


As discussed before, data is key. Aleksandra again underlined the need to formalise the data preparation steps - collection, cleaning, perhaps normalisation, and transformation rules if needed. In theory these seem trivial, but in practical deployments they become key steps for monitoring data quality and flagging faults in the incoming data.

During the data exploration phase, domain knowledge or other techniques can lead to a reduction of data, either by removal, or by combining variables into more meaningful ones which train a model more effectively.

During the training phase, it is a very well known approach to train on a subset of the data, and then to verify the model on a different subset not previously seen by the model. This makes a lot of sense - don't test on data you've already seen.

The group discussed strategies for how this split can be made, and the conversation was inconclusive. It likely depends on the domain and model more than any theory about an optimal split.

Predicting the scalability of training or using a model is difficult. At one level, we can understand the computational complexity of the algorithms, but in practice these estimates can be disrupted by hardware and software effects, for example the characteristics of network connected storage and saturation of a network. The best approach is to predict as best we can but to test on representative infrastructure to avoid surprises when the impact of a scaling failure would be significant.


Evaluation

A trained model can give us answers which are correct. This is insufficient assurance in real world scenarios.

In the real world, we often need to know how often a trained model is correct/incorrect. Better yet, we should have an idea of how far wrong or right the answers are.

Aleksandra presented an excellent definition of precision and accuracy, both providing useful insights into how well a model works. The wikipedia page has a good illustration of the difference.


The group discussed the relative merits of loss function often used in different kinds of learning model.

At one level, the shape of the loss function drives the learning (down an error gradient), but at another level of sophistication, the choice of loss function can affect the efficiency or robustness of learning. In reality, even if the theory is robust, it is not a substitute for testing against your own domain and data. There was also a discussion on whether loss functions are the same as regression metrics like mean squared error.

The term cross-validation is used often in the field, and simply refers to strategies for splitting a limited labelled dataset into training and test sets to check that a trained model hasn't over- or under-fitted. Over-fitting is when a model simply memorises the training data and appears to perform very well, but performs badly against previously unseen data. A better model will generalise and perform well against unseen data, even if it doesn't perform quite so well on the training data itself.
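A minimal sketch of the idea using scikit-learn (the model and dataset here are purely illustrative):


# train/test split and cross-validation

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# hold out 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on unseen data

# k-fold cross-validation: five different train/test splits
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))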

The following diagram from an accessible article on cross-validation illustrates the difference between validation error and training error.


As an aside, I remember ROC analysis as being very good at summarising various modes of failure in a simple visual form:




Value, Benefits and Costs

Aleksandra talked about the non-technical implications of a model. She explored how different approaches have different needs in terms of skills and roles, and how this can be either a cost or a source of organisational inertia.

This discussion was in the broader context of value. Not only is value related to computation, infrastructure and people costs, it must also be related to how well a solution solves a problem.

Members of the group discussed examples of huge efforts yielding only tiny gains or real world effects. The disconnect between pure statistical measures and practical value was highlighted, leading to the conclusion that the correctness of a solution can't be understood without the context of its real world impact.


Risks

The group discussion was vibrant and continued after the session. A key strand of discussion was around the risks around naive use of a trained model:

  • biased outputs from biased training data
  • biased outputs from the learning algorithm / model itself
  • ethical issues around loss of privacy from learned insights, of which data subjects were unaware when they provided their data
  • ethical questions around "just because you can, should you?"



Reference

Aleksandra recommended the Tour of The Most Popular Machine Learning Algorithms as a good overview of methods and the kinds of problems they can be used for:



There is also a semi-humorous collection of loss functions which demonstrate either faulty methodologies or very challenging training scenarios:





Friday, July 26, 2019

Hands-On Introduction to PyTorch

This month we ran a hands-on introductory tutorial on PyTorch.


The slides are online (link).


Machine Learning

We started with a quick overview of machine learning, and a very simple illustrative example.

The following shows a machine which is being trained to convert kilometres to miles. Although we know how to do that conversion, let's pretend we don't and that we've built a machine that needs to learn to do that from examples.

When we build that machine, we make some assumptions about the model. Here the assumption is that miles and kilometres are related by a multiplicative factor. We don't know what that parameter is, so we start with a randomly chosen number 0.5.


When we're training that machine, we need to know what the correct answer should be to a given question. It is with examples of correct pairs of kilometres and miles that we train the machine - the training data.

When we ask the machine to convert 100 km to miles, the answer it gives us is 100 * 0.5 = 50 miles. We know this is wrong.


The difference between the machine's answer, 50, and the correct answer, 62.137, is 12.137. That's the error.

We can use this error to guide how we adjust that parameter inside the machine. We need to increase that parameter to make the output larger and closer to the correct answer. Let's try 0.6.


This time the machine tells us that 100 km is 60 miles, which is much closer to the correct 62.137. The error has been reduced to 2.137.

If we repeat this process of trying different known-good examples, tuning the parameter in response to the error after each one, the machine should get better and better at doing the conversion. The error should fall.

This is essentially how many machine learning systems work. The key common elements are the use of known-good questions and answers as a training dataset, and the use of the error to guide the refinement of the machine learning model.
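The whole loop can be sketched in a few lines of python (the learning rate and the repetition count here are chosen purely for illustration):


# training a one-parameter machine

examples = [(100, 62.137), (50, 31.069), (200, 124.274)]

parameter = 0.5            # initial guess at the multiplicative factor
learning_rate = 0.00001    # small steps

for km, miles in examples * 100:
    output = parameter * km
    error = miles - output
    parameter += learning_rate * error * km   # adjust in proportion to the error

print(parameter)   # approaches the true factor, 0.62137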


Neural Networks

Combining several small learning units intuitively allows more complex datasets to be learned.

Animal brains and nervous systems are also collections of neurons working together to learn and perform complex tasks.


Historically, research in machine learning took inspiration from nature, and the design of neural networks is indeed influenced by animal physiology.

The following shows a simple neural network, made of 3 layers, with each node connected to every other node in preceding and subsequent layers.


In modern neural networks the adjustable parameter is the strength of the connections between nodes, not a parameter inside the node like our first kilometres-to-miles example.

As signals pass through a network, they are combined when connections bring several signals into a node. As signals emerge from a node, they are subject to a threshold function. This is how animal neurons work - they only pass on a signal once it reaches a certain strength. In artificial neural networks, the non-linearity of this threshold function, or activation function, is essential to a network's ability to learn complex data.

If we write a mathematical expression which relates the output of the network to the incoming signal and all the many link weights, it will be a rather scary function. Instead of trying to solve this scary function, we instead take a simpler more approximate approach.

The following shows the error (the difference between the correct answer and the actual output) as it relates to the link weights in a network. As the weights are varied, the error can go up or down. We want to vary the weights to minimise the error. We can do this by working out the local slope and taking small steps down it. Over many iterations we should find ourselves moving down the error to a minimum.


This is called gradient descent, and it is ideal for learning iteratively from a large training dataset.

We didn't spend a lot of time on how a neural network works and is trained, as the focus of the session was on using PyTorch. If you want to learn about how a neural network actually works, there are many good tutorials online and textbooks. Make Your Own Neural Network, which I wrote, is designed to be as accessible as possible.


PyTorch

We could write code to implement a neural network from scratch. That would be a good educational experience, and I do recommend you try it once for a very simple network. When you do that, you realise one of the most laborious steps is doing the algebra for working out the gradient descent for each of the many links in a neural network. If we change the design of the network, we have to do this all over again.

Luckily, frameworks like Tensorflow and PyTorch are designed to make this easy. One of the key things they do is to automate the gradient calculations when given a network architecture.

PyTorch is considered to be more community-led in its design and evolution, and also more pythonic in its idioms. Being pythonic, its error messages are also much easier to interpret - a real issue for new Tensorflow users.


PyTorch Variables


In the session, we started by seeing how PyTorch variables are familiar and do work like normal python variables. We also saw how PyTorch remembers how one variable depends on others, and uses these remembered relationships (called a computation graph) to automatically work out the derivatives of one variable with respect to another.

The following shows how a simple PyTorch variable, named x, is created and given a value of 3.5. You can see that we set an option to enable automatic gradient calculation.


x = torch.tensor(3.5, requires_grad=True)


PyTorch's variables are called tensors, and are similar to python numpy arrays.

The following defines a new variable y, which is calculated from x.


y = (x-1) * (x-2) * (x-3)


PyTorch will assign the value 1.8750 to y, which is a simple calculation using x = 3.5. But in addition to this, PyTorch will remember that y depends on x, and use the definition of y to work out the gradient of y with respect to x.

We can ask PyTorch to work out the gradients and print it out:


# work out gradients
y.backward()

# what is gradient at x = 3.5
x.grad


PyTorch correctly gives us a gradient of 5.75 at x = 3.5.
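We can check that by hand. Applying the product rule to y = (x-1)(x-2)(x-3) gives dy/dx = (x-2)(x-3) + (x-1)(x-3) + (x-1)(x-2):


# checking the gradient by hand

x = 3.5
dy_dx = (x-2)*(x-3) + (x-1)*(x-3) + (x-1)*(x-2)

print(dy_dx)   # 5.75, matching the x.grad calculated by PyTorch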


You can try the code yourself:

This ability to do automatic gradient calculations is really helpful in training neural networks as we perform gradient descent down the error function.


GPU Acceleration with PyTorch

We then looked at how PyTorch makes it really easy to take advantage of GPU acceleration. GPUs are very good at performing massively parallel calculations. You can think of a CPU as a single-lane road which can allow fast traffic, and a GPU as a very wide motorway with many lanes, which allows much more traffic to pass.

Without GPU acceleration, many applications of machine learning would not have been possible.

The only downside is that only Nvidia's GPUs, and its software framework known as CUDA, are mature and well-adopted in both research and industry. Sadly the competition, for example from AMD, has been late and isn't as mature or as fully featured.

Twenty years ago, using GPUs to accelerate calculations was painful compared to how easy it is today. PyTorch makes it really easy.

The following code checks to see if cuda acceleration is available, and if so, sets the default tensor type to be a floating point tensor on the GPU:


if torch.cuda.is_available():
  torch.set_default_tensor_type(torch.cuda.FloatTensor)
  print("using cuda:", torch.cuda.get_device_name(0))
  pass


The following shows that Google's free colab hosted python notebook service gives us access to a Tesla T4, a very expensive GPU!


If we now create tensors, they are by default on the GPU. The following shows us testing where a tensor is located:
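A minimal sketch of that check, assuming the default tensor type has been set to a CUDA type as above:


# where does this tensor live?

x = torch.tensor(3.5)

print(x.device)   # expect cuda:0 when the default tensor type is a CUDA type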


We can see the tensor x is on the GPU.

You can try the code yourself, and see an example of a matrix multiplication being done on the GPU:


MNIST Challenge

Having looked at the basics of PyTorch we then proceeded to build a neural network that we hoped would learn to classify images of hand-written digits.

People don't write in a precise and consistent way, and this makes recognising hand-written digits a difficult task, even for human eyes!


This challenge has become a standard test for machine learning researchers, and goes by the name MNIST.

Learning is often best done with physical action, rather than passive listening or reading. For this reason we spent a fair bit of time typing in code to implement the neural network.

We started by downloading the training and test datasets. The training dataset is used to train the neural network, and the test dataset is intentionally separate and used to test how well a trained neural network performs. The images in the test set will not have been seen by the neural network and so there can't be any memorising of images.

We started by exploring the contents of the dataset files, which are in csv format. We used the pandas framework to load the data and preview it. We confirmed the training data had 60,000 rows of 785 numbers.

The first number is the actual number the image represents and the rest of the 784 numbers are the pixel values for the 28x28 images.

We used matplotlib's imshow() to view images of selected rows. The following shows the top of the training dataset and highlights the 5th row (index 4 because we start at 0). The label is 9, and further down we can see a 9 when we draw the pixel values as an image.


Having explored the data directly, we proceeded to build a PyTorch dataset class.


# dataset class

class MnistDataset(torch.utils.data.Dataset):
    
    def __init__(self, csv_file):
        self.data_df = pandas.read_csv(csv_file, header=None)
        pass
    
    def __len__(self):
        return len(self.data_df)
    
    def __getitem__(self, index):
        # image target (label)
        label = self.data_df.iloc[index,0]
        image_target = torch.zeros((10))
        image_target[label] = 1.0
        
        # image data, normalised from 0-255 to 0-1
        image_values = torch.FloatTensor(self.data_df.iloc[index,1:].values) / 255.0
        
        # return label, image data tensor and target tensor
        return label, image_values, image_target
    
    def plot_image(self, index):
        arr = self.data_df.iloc[index,1:].values.reshape(28,28)
        plt.title("label = " + str(self.data_df.iloc[index,0]))
        plt.imshow(arr, interpolation='none', cmap='Blues')
        pass
    
    pass


The dataset class inherits from PyTorch's Dataset class and is specialised for our own needs. In our example, we have it load data from the csv file when an object is initialised. We need to tell it how big our data is, which in our case is simply the number of rows of the csv file - in our code, the length of the pandas dataframe. We also need to tell the class how to get an item of data from the dataset. In our example we use the opportunity to normalise the pixel values from 0-255 to 0-1. We return the label, these normalised pixel values, and also something we call image_target.

That image_target is a 1-hot vector of size 10, all zero except one set to 1.0 at the position that corresponds to the label. So if the image is a 0 then the vector has all zeros except the first one. If the image is a 9 then the vector is all zeros except the last one.

As an optional extra, I've added an image plotting function which will draw an image from the pixel values from a given record in the data set. The following shows a dataset object called mnist_dataset instantiated from the MnistDataset class we defined, and uses the plot_image function to draw the 11th record.


We then worked on the main neural network class, which is again inherited from PyTorch and key parts specified for our own needs.


# classifier class

class Classifier(nn.Module):
    
    def __init__(self):
        # initialise parent pytorch class
        super().__init__()
        
        # define neural network layers
        self.model = nn.Sequential(
            nn.Linear(784, 200),
            nn.Sigmoid(),
            nn.Linear(200, 10),
            nn.Sigmoid()
        )
        
        # create error function
        self.error_function = torch.nn.BCELoss()

        # create optimiser, using simple stochastic gradient descent
        self.optimiser = torch.optim.SGD(self.parameters(), lr=0.01)
        
        # counter and accumulator for progress
        self.counter = 0;
        self.progress = []
        pass
    
    
    def forward(self, inputs):
        # simply run model
        return self.model(inputs)
    
    
    def train(self, inputs, targets):
        # calculate the output of the network
        outputs = self.forward(inputs)
        
        # calculate error
        loss = self.error_function(outputs, targets)
        
        # increase counter and accumulate error every 10
        self.counter += 1;
        if (self.counter % 10 == 0):
            self.progress.append(loss.item())
            pass
        if (self.counter % 10000 == 0):
            print("counter = ", self.counter)
            pass
        

        # zero gradients, perform a backward pass, and update the weights.
        self.optimiser.zero_grad()
        loss.backward()
        self.optimiser.step()

        pass
    
    
    def plot_progress(self):
        df = pandas.DataFrame(self.progress, columns=['loss'])
        df.plot(ylim=(0, 1.0), figsize=(16,8), alpha=0.1, marker='.', grid=True, yticks=(0, 0.25, 0.5))
        pass
    
    pass


The class needs to be initialised by calling the initialisation of its parent class. Beyond that, we add our own initialisation. We create a neural network model, and the key lines of code define a 3-layer network: an input layer of 784 nodes, a hidden middle layer of 200 nodes, and a final output layer of 10 nodes. The 784 matches the number of pixel values for each image. The 10 outputs correspond to the possible digits 0-9.

The following diagram summarises these dimensions.


The activation function is set as Sigmoid(), a simple and fairly popular s-shaped function which squashes values into the range 0 to 1.

We also define an error function which summarises how far wrong the network's output is from what it should be. There are several options provided by PyTorch, and we've chosen BCELoss, which is often suitable for cases where only one of the output nodes should be 1, corresponding to a classification network. Another common option is MSELoss, the mean squared error.
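
As a quick illustration, not part of the session's code, the following compares the two loss functions on a made-up output and target. The numbers are purely illustrative.


# illustrative comparison of BCELoss and MSELoss on a made-up example
output = torch.FloatTensor([0.9, 0.1, 0.1])   # hypothetical network output
target = torch.FloatTensor([1.0, 0.0, 0.0])   # what the output should be

print(torch.nn.BCELoss()(output, target).item())   # approx 0.105
print(torch.nn.MSELoss()(output, target).item())   # 0.01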

We also create an optimiser which performs the gradient descent steps. Here we've chosen simple stochastic gradient descent, or SGD. Other options are available and you can find more detail on them here (link).
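
As an aside, swapping in a different optimiser is a one-line change. For example, Adam is a popular alternative, though we didn't use it in this session.


# illustrative alternative to SGD, not used in the session's code
self.optimiser = torch.optim.Adam(self.parameters())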

In the initialisation function I've also set up a counter and an empty list, progress, which we'll use to monitor training progress.

We need to define a forward() function in our class as this is expected by PyTorch. In our case it is trivially simple: it passes the 784-sized data through the model and returns the 10-sized output.

We have defined a train() function which does the real training. It takes both the pixel data and that 1-hot target vector we saw earlier coming from the dataset class, representing what the output should be. It passes the pixel data to the forward() function and keeps the network's output. The chosen error function is then used to calculate the error, often called a loss.

Finally, the following key code calculates the error gradients in the network from this loss, and uses the chosen gradient descent method to take a step down the gradient for all the network weights. You'll see this key code in almost all PyTorch programs.


# zero gradients, perform a backward pass, and update the weights.
self.optimiser.zero_grad()
loss.backward()
self.optimiser.step()


You might be asking why the first line of code is needed: why zero the gradients before they are calculated again? The reason is that PyTorch accumulates gradients by default. More complex models can be built which combine gradients from multiple sources, and to support that, PyTorch doesn't automatically zero gradients when calculating new ones, so we have to do it ourselves before each backward pass.
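
A minimal sketch of this accumulation behaviour, separate from our classifier code:


# gradients accumulate across backward() calls until explicitly zeroed
x = torch.tensor([2.0], requires_grad=True)

y = x * x          # dy/dx = 2x = 4 at x = 2
y.backward()
print(x.grad)      # tensor([4.])

y = x * x
y.backward()
print(x.grad)      # tensor([8.]) - the new gradient was added to the old one

x.grad.zero_()     # what optimiser.zero_grad() does for every network weight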

In the train() function we've also added code to increase the counter at every call of the function, and we use this counter to add the current error (loss) to the progress list every 10 iterations. Every 10,000 iterations we print out the current count so we can see progress during training.

That's the core code. As an optional extra, I've added a function to plot a chart of the loss against training iteration so we can visualise how well the network trained.

The following simple code shows how easy it is to use this neural network class.


%%time

# create classifier

C = Classifier()

# train classifier

epochs = 3

for i in range(epochs):
    print('training epoch', i+1, "of", epochs)
    for label, image_data_tensor, target_tensor in mnist_dataset:
        C.train(image_data_tensor, target_tensor)
        pass
    pass


You can see we create an object from our PyTorch neural network class Classifier, here unimaginatively called C. We use a loop to work through all the data in mnist_dataset 3 times; each complete pass is called an epoch. You can see we take the label, pixel data and target vector for each image from the dataset and pass them to the C.train() function, which does all the hard work we discussed above.

Working through the training dataset of 60,000 images, pushing the data through the neural network, calculating the error, using it to calculate the gradients inside the network, and taking a step down the gradient, takes about 3 minutes on Google's infrastructure.

After training we plotted the chart of loss against training iteration. Everyone in the session had a plot where the error fell towards zero over time. A reducing error means the network is getting better at correctly classifying the images of digits.


Because our neural networks are initialised with random link weights, the plots for everyone in the room were slightly different in detail, but the same in the broad trend of diminishing error.

The following shows the trained neural network's forward() function being used to classify the 14th record in the dataset.
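
A sketch of that query, using the objects defined above; record index 13 is the 14th record.


# pick the 14th record from the dataset
label, image_data_tensor, target_tensor = mnist_dataset[13]

# query the trained network
output = C.forward(image_data_tensor)

# plot the 10 outputs as a bar chart
pandas.DataFrame(output.detach().numpy()).plot(kind='bar', legend=False, ylim=(0,1))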


The plot shows it is a 6. The neural network's 10 outputs are shown as a bar chart. It is clear the node corresponding to 6 has a value close to 1.0 and the rest are very small. This means the network thinks the image is a 6 with high confidence.

There will be cases where an image is ambiguous and two or more of the network's outputs will have medium or high values. For example, some renditions of a 9 look like a 4.

We performed a very simplistic calculation of how well the network performs, keeping a score of how often our trained network correctly classifies images from the test dataset. All of us in the session had networks which scored about 90%; that is, about 9,000 of the 10,000 test images were correctly classified.
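
A sketch of that scoring loop. It assumes a second dataset object, mnist_test_dataset, created from the 10,000-image test csv in the same way as mnist_dataset.


# count how many test images the trained network classifies correctly
score = 0
items = 0

for label, image_data_tensor, target_tensor in mnist_test_dataset:
    answer = C.forward(image_data_tensor).detach().numpy()
    if answer.argmax() == label:
        score += 1
        pass
    items += 1
    pass

print(score, items, score/items)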

That's impressive for a neural network kept very simple for educational purposes.

The above code doesn't run on the GPU. As a final step we set the default tensor type to be on the GPU and re-ran the code. A small bit of code in the dataset class also needed to be changed to assert this tensor type on the pixel data, as the current version of PyTorch didn't seem to apply the newly set default. No change was needed to the neural network code at all.
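
A sketch of those changes, assuming a CUDA-capable GPU is available. At the time of writing, setting the default tensor type was the simplest way to move everything to the GPU.


# use the GPU by default for new tensors (assumes CUDA is available)
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    pass

# in the dataset class, the pixel data needed its tensor type asserted, e.g.
#   image_values = torch.cuda.FloatTensor(self.data_df.iloc[index,1:].values) / 255.0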

This time the training took about 2 minutes. Given that the Tesla T4 costs over £2,000, this didn't seem like much of a gain.

We discussed how GPUs only really shine when their massive parallelism is employed. If we increase the size of the data, so each record is larger than 28*28=784 values, or increase the size of the network, then we'll see significant improvements over a CPU.

The following graph shows the effect of increasing the middle layer of our neural network from 200 up to 5000. We can see that training time increases with a CPU but stays fairly constant with a GPU.


You can see and run the full code used for the tutorial online:


Thoughts

Most of the attendees had never written a machine learning application, and some had little coding experience.

Despite this, everyone felt that they had developed an intuition for how machine learning and neural networks in particular work. The hands-on working with code also helped us understand the mechanics of using PyTorch, something which can't easily be developed just by listening passively or reading about it.

A member asked if we had implemented "AI". The answer was an emphatic yes! We discussed how the word AI has been overused and mystified, and how the session demystified AI and neural networks.

Demystifying AI aside, we rightly appreciated how the relatively simple idea of a neural network, and our simple implementation of it, still managed to learn to identify images with such a high degree of accuracy.