Saturday, September 28, 2019

Practical Steps for Building Intelligent Systems

This month we had a talk encouraging us to think more broadly than a specific algorithm or library, and to consider wider questions: understanding the problem, evaluating solutions, understanding data and bias, measuring performance and accuracy, and the benefits of a solution.


The slides for this talk are here: [pdf].

A video recording of the session is on the group youtube channel: [link].


The Challenge

A naive approach to building a solution to a machine learning problem is to pick a learning algorithm, train the model with data, and then use it.

This is naive for many reasons. Here are just some of them:

  • We didn't think about the suitability of the data we're training with
  • We didn't understand the accuracy of the trained model
  • We didn't consider alternative learning algorithms
  • We didn't understand the value of the solution, comparing benefits to costs, not all of which are technical


We were very lucky to have Aleksandra Osipova, a data scientist at Headforwards, to lead this session, encouraging us to think about these broader important questions in the context of a methodology for developing machine learning solutions.

Aleksandra has a master's in Complex System Modelling from King's College London, where she worked on computational neuroscience.


Overview

The following diagram summarises the key areas covered by Aleksandra's talk.


She started by outlining a set of key questions we should ask from the very beginning of a machine learning problem:

  • why - what's the problem
  • what - what are the possible solutions
  • how - machine learning techniques
  • how well - robustness, accuracy, scalability
  • value - benefits vs costs including operational and people
  • risks - data bias and incompleteness, ethics



The Problem

Aleksandra rightly emphasised a still too common enthusiasm for technical solutions which don't actually solve the problem at hand or, more fundamentally, don't address the users' needs.

User-centric methods, such as those encouraged by agile, can really help here by enforcing a solution-agnostic analysis of the problem.

Sometimes the best solution is not a complex machine learning model, and your process needs to give you the ability to find these simpler solutions.


Data

Data is how we train our machine learning models. Data is critically important because it is, to a very large extent, the only thing that our models can learn from.

Our analysis should identify which data we need to solve a problem. Often we don't have ideal or perfect data, and so the challenge is to understand how useful the data we do have is.

This is the step that carries most risk. Failure modes include:

  • poor data quality that isn't identified
  • insufficient data to actually learn from
  • bias in data


As with identifying the problem, Aleksandra underlined the importance of speaking with domain experts - people who understand or have experience of the problem domain.

Domain knowledge can help select data and shape the learning process, and can make intractable problems feasible, and difficult problems easier.

It is a common saying that "80% of a data scientist's work is understanding the data". We'd extend this to understanding the problem domain.

One area that Aleksandra picked up on, and which is too often overlooked by others, is that data itself can be dynamic. It is therefore important to explore how data behaves dynamically in different phases. A good example is EEG brain wave data: it is dynamic but, widening our view, we can see it can follow different categories of dynamic behaviour. Looking at individual data points would be too narrow a view.


Machine Learning

Before diving into a solution or technique, Aleksandra recommended we apply the discipline of developing and testing hypotheses.

Hypothesis testing is a mature, if not always well understood, field of statistics. It provides assurance that we haven't incorrectly selected a hypothesis when in fact there is statistical evidence to support alternative or inverse hypotheses. The wikipedia page provides good illustrations of hypothesis testing, with the criminal trial being particularly educational.

Staying broad in context, Aleksandra explored key questions around the machine learning step, many non-technical. For example:

  • is there a cost to obtaining and using the necessary data?
  • what is the computational cost of the given machine learning model?
  • does the machine learning model need to be updated online, even in real-time, or is it trained off-line?
  • is updating the model with new data possible, feasible?


As discussed before, data is key. Aleksandra again underlined the need to formalise the data preparation steps - collection, cleaning, perhaps normalisation, and transformation rules if needed. In theory these seem trivial, but in practical deployments they become key steps for monitoring data quality and for raising a fault when something is wrong with the incoming data.
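As a rough sketch of what a formalised preparation step might look like, the following cleans a single raw reading, normalises it to the range 0-1, and raises a fault if the incoming value is not usable. The function name prepare_reading and the expected range of 0-255 are illustrative assumptions, not something presented in the talk.

```python
def prepare_reading(raw, low=0.0, high=255.0):
    """Clean one raw reading and scale it to 0-1, or raise a fault on bad data."""
    # cleaning: convert text or integers to a float
    # (float() raises ValueError for non-numeric input)
    value = float(raw)

    # quality check: flag values outside the expected range (this also catches NaN)
    if not (low <= value <= high):
        raise ValueError(f"reading {value} is outside the expected range {low}-{high}")

    # normalisation: rescale to the range 0-1
    return (value - low) / (high - low)


print(prepare_reading("255"))   # 1.0
print(prepare_reading(0))       # 0.0
```

Because the rule is written down as code, the same check can run on every new batch of data in a deployment, not just once during exploration.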

During the data exploration phase, domain knowledge or other techniques can lead to a reduction of data, either by removal, or by combining variables to more meaningful ones which more effectively train a model.

During the training phase, it is a very well known approach to train on a subset of the data, and then to verify the model on a different subset of the data not previously seen by the model. This makes a lot of sense - don't test on data you've already seen.
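A minimal sketch of this hold-out idea, assuming the labelled records fit in a Python list. The function name split_train_test, the 80/20 ratio and the fixed seed are illustrative choices only, not a recommendation from the talk.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold back a fraction for testing."""
    shuffled = list(records)
    # a seeded generator makes the split repeatable
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # the first n_test shuffled records are held back, the rest are for training
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))
train_set, test_set = split_train_test(records)
print(len(train_set), len(test_set))   # 80 20
```

Shuffling before splitting avoids accidentally testing only on, say, the most recent records, although for some domains (time series, for example) a random split may not be appropriate.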

The group discussed strategies for how this split can be made, and the conversation was inconclusive. It likely depends on the domain and model more than any theory about an optimal split.

Predicting the scalability of training or using a model is difficult. At one level, we can understand the computational complexity of the algorithms, but in practice these estimates can be disrupted by hardware and software effects, for example the characteristics of network connected storage and saturation of a network. The best approach is to predict as best we can but to test on representative infrastructure to avoid surprises when the impact of a scaling failure would be significant.


Evaluation

A trained model can give us answers which are correct. This is insufficient assurance in real world scenarios.

In the real world, we often need to know how often a trained model is correct/incorrect. Better yet, we should have an idea of how far wrong or right the answers are.
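As a small illustration of the two questions, "how often" and "how far wrong", the following sketch computes the fraction of exactly correct answers and the average size of the error. The numbers are made up for illustration and the function names are our own.

```python
def accuracy(predictions, truths):
    """Fraction of predictions that exactly match the correct answer."""
    correct = sum(1 for p, t in zip(predictions, truths) if p == t)
    return correct / len(truths)


def mean_absolute_error(predictions, truths):
    """Average distance between each prediction and the correct answer."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)


# made-up example data
truths = [3, 5, 2, 8, 7]
predictions = [3, 5, 4, 8, 1]

print(accuracy(predictions, truths))             # 0.6
print(mean_absolute_error(predictions, truths))  # 1.6
```

Two models with the same accuracy can have very different average errors, which is why it helps to look at both.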

Aleksandra presented an excellent definition of precision and accuracy, both providing useful insights into how well a model works. The wikipedia page has a good illustration of the difference.


The group discussed the relative merits of the loss functions often used in different kinds of learning model.

At one level, the shape of the loss function drives the learning (down an error gradient), but at another level of sophistication, the choice of loss function can affect the efficiency or robustness of learning. In reality, even if the theory is robust, it is not a substitute for testing against your own domain and data. There was also a discussion on whether loss functions are the same as regression metrics like mean squared error.

The term cross-validation is used often in the field, and simply refers to strategies for splitting a limited labelled data set into training and test sets to check that a trained model hasn't over- or under-fitted. Over-fitting is when a model simply memorises the training data: it appears to perform very well, but performs badly against previously unseen data. A better model will generalise and perform well against unseen data even if it doesn't perform quite so well on the training data itself.
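One common splitting strategy is k-fold cross-validation, where the data is cut into k parts and each part takes a turn as the test set. The following is a minimal sketch that only produces the index lists for each fold; the function name k_fold_indices is our own, and real projects would normally shuffle the data first and use a library implementation.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_items))
    # spread any remainder over the first few folds
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every record is used for testing exactly once, so comparing the error across folds gives a hint of how well a model generalises.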

The following diagram from an accessible article on cross-validation illustrates the difference between validation error and training error.


As an aside, I remember ROC analysis as being very good at summarising various modes of failure in a simple visual form:




Value, Benefits and Costs

Aleksandra talked about the non-technical implications of a model. She explored how different approaches have different needs in terms of skills and roles, and this can be either a cost or an organisational inertia.

This discussion was in the broader context of value. Not only is value related to computation, infrastructure and people cost, it must also be related to how well a solution solves a problem.

Members of the group discussed examples of huge efforts leading to tiny gains or real world effects. The disconnect between pure statistical measures and practical value was highlighted, and led to the conclusion that the correctness of a solution can't be understood without the context of the real world impact.


Risks

The group discussion was vibrant and continued after the session. A key strand of discussion was around the risks around naive use of a trained model:

  • biased outputs from biased training data
  • biased outputs from the learning algorithm / model itself
  • ethical issues around loss of privacy from learned insights, which data subjects were unaware of when they provided their data
  • ethical questions around "just because you can, should you?"



Reference

Aleksandra recommended the Tour of The Most Popular Machine Learning Algorithms as a good overview of methods and the kinds of problems they can be used for:



There is also a semi-humorous collection of loss functions which demonstrate either faulty methodologies or very challenging training scenarios:





Friday, July 26, 2019

Hands-On Introduction to PyTorch

This month we ran a hands-on introductory tutorial on PyTorch.


The slides are online (link).


Machine Learning

We started with a quick overview of machine learning, and a very simple illustrative example.

The following shows a machine which is being trained to convert kilometres to miles. Although we know how to do that conversion, let's pretend we don't and that we've built a machine that needs to learn to do that from examples.

When we build that machine, we make some assumptions about the model. Here the assumption is that miles and kilometres are related by a multiplicative factor. We don't know what that parameter is, so we start with a randomly chosen number 0.5.


When we're training that machine, we need to know what the correct answer should be to a given question. It is with examples of correct pairs of kilometres and miles that we train the machine - the training data.

When we ask the machine to convert 100 km to miles, the answer it gives us is 100 * 0.5 = 50 miles. We know this is wrong.


The difference between the machine's answer 50 and the correct answer 62.137 is 12.137. That's the error.

We can use this error to guide how we adjust that parameter inside the machine. We need to increase that parameter to make the output larger and closer to the correct answer. Let's try 0.6.


This time the machine tells us that 100 km is 60 miles, which is much closer to the correct 62.137. The error has been reduced to 2.137.

If we repeat this process of trying different known-good examples, and after each one tune the parameter in response to the error, the machine should get better and better at doing the conversion. The error should fall.

This is essentially how many machine learning systems work, with key common elements being using known-good questions and answers as a training dataset, and using the error to guide the refinement of the machine learning model.
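The kilometres-to-miles machine can be sketched in a few lines of plain Python. The session only said that we tune the parameter in response to the error; the specific update rule used here (nudge the parameter by the error multiplied by the input and a small learning rate) and the learning rate itself are assumptions made for this sketch.

```python
# the "unknown" truth, used only to create known-good training examples
KM_TO_MILES = 0.62137

training_data = [(km, km * KM_TO_MILES) for km in (10, 50, 100, 200)]

parameter = 0.5          # starting guess, as in the example above
learning_rate = 0.00001  # assumed small step size

for epoch in range(20):
    for km, correct_miles in training_data:
        # error is the difference between the correct answer and the machine's answer
        error = correct_miles - parameter * km
        # nudge the parameter up if the answer was too small, down if too large
        parameter += learning_rate * error * km

print(parameter)
```

After a few passes through the examples the parameter should end up very close to 0.62137, even though the machine was never told that number directly.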


Neural Networks

Combining several small learning units intuitively allows more complex datasets to be learned.

Animal brains and nervous systems are also collections of neurons working together to learn and perform complex tasks.


Historically, research in machine learning took inspiration from nature, and the design of neural networks is indeed influenced by animal physiology.

The following shows a simple neural network, made of 3 layers, with each node connected to every node in the preceding and subsequent layers.


In modern neural networks the adjustable parameter is the strength of the connections between nodes, not a parameter inside the node like our first kilometres-to-miles example.

As signals pass through a network, they are combined when connections bring several signals into a node. As they emerge from a node, they are subject to a threshold function. This is how animal neurons work: they only pass on a signal once it reaches a certain strength. In artificial neural networks, the non-linearity of this threshold function, or activation function, is essential for a network's ability to learn complex data.
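A minimal sketch of what a single node does, assuming the incoming signals are combined as a weighted sum and that the activation function is the s-shaped sigmoid. The signal and weight values are made up for illustration.

```python
import math


def sigmoid(x):
    """Smooth s-shaped threshold which squashes any number into the range 0-1."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(signals, weights):
    # combine incoming signals, each moderated by the strength of its connection
    combined = sum(s * w for s, w in zip(signals, weights))
    # apply the activation function to the combined signal
    return sigmoid(combined)


output = node_output([0.9, 0.1, 0.8], [0.9, 0.3, 0.4])
print(output)
```

A weak combined signal gives an output near 0 and a strong one gives an output near 1, with a smooth transition in between.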

If we write a mathematical expression which relates the output of the network to the incoming signal and all the many link weights, it will be a rather scary function. Instead of trying to solve this scary function, we take a simpler more approximate approach.

The following shows the error (the difference between the correct answer and the actual output) as it relates to the link weights in a network. As the weights are varied, the error can go up or down. We want to vary the weights to minimise the error. We can do this by working out the local slope and taking small steps down it. Over many iterations we should find ourselves moving down the error to a minimum.


This is called gradient descent, and is ideal for learning iteratively from a large training dataset.
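The idea can be sketched with a single weight and a made-up error function. The error function, starting point and step size below are illustrative assumptions; a real network has many weights and a far more complicated error surface.

```python
def error(w):
    """Made-up error function with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0


def slope(w, h=1e-6):
    """Estimate the local slope by looking a tiny distance either side of w."""
    return (error(w + h) - error(w - h)) / (2 * h)


w = 0.0          # starting weight
step_size = 0.1  # how far to move down the slope each time

for _ in range(100):
    # move against the slope, so the error goes down
    w -= step_size * slope(w)

print(w, error(w))
```

The weight should settle very close to 3, the bottom of this particular error curve.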

We didn't spend a lot of time on how a neural network works and is trained, as the focus of the session was on using PyTorch. If you want to learn about how a neural network actually works, there are many good tutorials online and textbooks. Make Your Own Neural Network, which I wrote, is designed to be as accessible as possible.


PyTorch

We could write code to implement a neural network from scratch. That would be a good educational experience, and I do recommend you try it once for a very simple network. When you do that, you realise one of the most laborious steps is doing the algebra for working out the gradient descent for each of the many links in a neural network. If we change the design of the network, we have to do this all over again.

Luckily, frameworks like Tensorflow and PyTorch are designed to make this easy. One of the key things they do is to automate the gradient calculations when given a network architecture.

PyTorch is considered to be more community-led in its design and evolution, and also more pythonic in its idioms. Being pythonic, its error messages are also much easier to interpret, which is a real issue for new Tensorflow users.


PyTorch Variables


In the session, we started by seeing how PyTorch variables are familiar and do work like normal python variables. We also saw how PyTorch remembers how one variable depends on others, and uses these remembered relationships (called a computation graph) to automatically work out the derivatives of one variable with respect to another.

The following shows how a simple PyTorch variable, named x, is created and given a value of 3.5. You can see that we set an option to enable automatic gradient calculation.


import torch
x = torch.tensor(3.5, requires_grad=True)


PyTorch's variables are called tensors, and are similar to numpy arrays in Python.

The following defines a new variable y, which is calculated from x.


y = (x-1) * (x-2) * (x-3)


PyTorch will assign the value 1.8750 to y, which is a simple calculation using x = 3.5. But in addition to this, PyTorch will remember that y depends on x, and use the definition of y to work out the gradient of y with respect to x.

We can ask PyTorch to work out the gradient and print it out:


# work out gradients
y.backward()

# what is gradient at x = 3.5
x.grad


PyTorch correctly gives us a gradient of 5.75 at x=3.5.
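We can check that answer without PyTorch. The following sketch works out the gradient of the same y in two ways: by applying the product rule by hand, and by measuring the slope numerically over a tiny interval. The function names are our own.

```python
def y(x):
    return (x - 1) * (x - 2) * (x - 3)


def dy_dx(x):
    # product rule applied by hand to the three factors
    return (x - 2) * (x - 3) + (x - 1) * (x - 3) + (x - 1) * (x - 2)


def numerical_slope(x, h=1e-5):
    # slope measured over a tiny interval centred on x
    return (y(x + h) - y(x - h)) / (2 * h)


print(y(3.5))               # 1.875
print(dy_dx(3.5))           # 5.75
print(numerical_slope(3.5))
```

Both approaches agree with PyTorch, but both would need redoing by hand whenever the definition of y changes, which is the work that automatic gradient calculation saves us.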


You can try the code yourself:

This ability to do automatic gradient calculations is really helpful in training neural networks as we perform gradient descent down the error function.


GPU Acceleration with PyTorch

We then looked at how PyTorch makes it really easy to take advantage of GPU acceleration. GPUs are very good at performing massively parallel calculations. You can think of a CPU as a single-lane road which can allow fast traffic, but a GPU as a very wide motorway with many lanes, which allows even more traffic to pass.

Without GPU acceleration many applications of machine learning would not have been possible.

The only downside is that only Nvidia's GPUs and software framework, known as CUDA, are mature and well-adopted in both research and industry. Sadly the competition, for example from AMD, has been late and isn't as mature or sufficiently featured.

Twenty years ago, using GPUs to accelerate calculations was painful compared to how easy it is today. PyTorch makes it really easy.

The following code checks to see if cuda acceleration is available, and if so, sets the default tensor type to be a floating point tensor on the GPU:


if torch.cuda.is_available():
  torch.set_default_tensor_type(torch.cuda.FloatTensor)
  print("using cuda:", torch.cuda.get_device_name(0))
  pass


The following shows that Google's free colab hosted python notebook service gives us access to a Tesla T4, a very expensive GPU!


If we now create tensors, they are by default on the GPU. The following shows us testing where a tensor is located:


We can see the tensor x is on the GPU.

You can try the code yourself, and see an example of a matrix multiplication being done on the GPU:


MNIST Challenge

Having looked at the basics of PyTorch we then proceeded to build a neural network that we hoped would learn to classify images of hand-written digits.

People don't write in a precise and consistent way, and this makes recognising hand-written digits a difficult task, even for human eyes!


This challenge has become a standard test for machine learning researchers, and goes by the name MNIST.

Learning is often best done with physical action, rather than passive listening or reading. For this reason we spent a fair bit of time typing in code to implement the neural network.

We started by downloading the training and test datasets. The training dataset is used to train the neural network, and the test dataset is intentionally separate and used to test how well a trained neural network performs. The images in the test set will not have been seen by the neural network and so there can't be any memorising of images.

We started by exploring the contents of the dataset files, which are in csv format. We used the pandas framework to load the data and preview it. We confirmed the training data had 60,000 rows of 785 numbers.

The first number is the actual number the image represents and the remaining 784 numbers are the pixel values for the 28x28 images.

We used matplotlib's imshow() to view images of selected rows. The following shows the top of the training dataset and highlights the 5th row (index 4 because we start at 0). The label is 9, and further down we can see a 9 when we draw the pixel values as an image.


Having explored the data directly, we proceeded to build a PyTorch dataset class.


import torch
import pandas
import matplotlib.pyplot as plt

# dataset class

class MnistDataset(torch.utils.data.Dataset):
    
    def __init__(self, csv_file):
        self.data_df = pandas.read_csv(csv_file, header=None)
        pass
    
    def __len__(self):
        return len(self.data_df)
    
    def __getitem__(self, index):
        # image target (label)
        label = self.data_df.iloc[index,0]
        image_target = torch.zeros((10))
        image_target[label] = 1.0
        
        # image data, normalised from 0-255 to 0-1
        image_values = torch.FloatTensor(self.data_df.iloc[index,1:].values) / 255.0
        
        # return label, image data tensor and target tensor
        return label, image_values, image_target
    
    def plot_image(self, index):
        arr = self.data_df.iloc[index,1:].values.reshape(28,28)
        plt.title("label = " + str(self.data_df.iloc[index,0]))
        plt.imshow(arr, interpolation='none', cmap='Blues')
        pass
    
    pass


The dataset class is inherited from PyTorch and specialised for our own needs. In our example, we get it to load data from the csv file when an object is initialised. We do need to tell it how big our data is, which in our case is simply the number of rows of the csv file, or in our code, the length of the pandas dataframe. We also need to tell the class how to get an item of data from the dataset. In our example we use the opportunity to normalise the pixel values from 0-255 to 0-1. We return the label, these normalised pixel values and also something we call image_target.

That image_target is a 1-hot vector of size 10, all zero except one set to 1.0 at the position that corresponds to the label. So if the image is a 0 then the vector has all zeros except the first one. If the image is a 9 then the vector is all zeros except the last one.
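To make the 1-hot idea concrete, here is the same construction as a plain Python list instead of a PyTorch tensor. The function name one_hot is our own.

```python
def one_hot(label, size=10):
    """Return a list of zeros with a single 1.0 at the position of the label."""
    target = [0.0] * size
    target[label] = 1.0
    return target


print(one_hot(0))   # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot(9))   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```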

As an optional extra, I've added an image plotting function which will draw an image from the pixel values from a given record in the data set. The following shows a dataset object called mnist_dataset instantiated from the MnistDataset class we defined, and uses the plot_image function to draw the 11th record.


We then worked on the main neural network class, which is again inherited from PyTorch and key parts specified for our own needs.


import torch
import torch.nn as nn
import pandas

# classifier class

class Classifier(nn.Module):
    
    def __init__(self):
        # initialise parent pytorch class
        super().__init__()
        
        # define neural network layers
        self.model = nn.Sequential(
            nn.Linear(784, 200),
            nn.Sigmoid(),
            nn.Linear(200, 10),
            nn.Sigmoid()
        )
        
        # create error function
        self.error_function = torch.nn.BCELoss()

        # create optimiser, using simple stochastic gradient descent
        self.optimiser = torch.optim.SGD(self.parameters(), lr=0.01)
        
        # counter and accumulator for progress
        self.counter = 0
        self.progress = []
        pass
    
    
    def forward(self, inputs):
        # simply run model
        return self.model(inputs)
    
    
    def train(self, inputs, targets):
        # calculate the output of the network
        outputs = self.forward(inputs)
        
        # calculate error
        loss = self.error_function(outputs, targets)
        
        # increase counter and accumulate error every 10
        self.counter += 1
        if (self.counter % 10 == 0):
            self.progress.append(loss.item())
            pass
        if (self.counter % 10000 == 0):
            print("counter = ", self.counter)
            pass
        

        # zero gradients, perform a backward pass, and update the weights.
        self.optimiser.zero_grad()
        loss.backward()
        self.optimiser.step()

        pass
    
    
    def plot_progress(self):
        df = pandas.DataFrame(self.progress, columns=['loss'])
        df.plot(ylim=(0, 1.0), figsize=(16,8), alpha=0.1, marker='.', grid=True, yticks=(0, 0.25, 0.5))
        pass
    
    pass


The class needs to be initialised by calling the initialisation of its parent. Beyond that we can add our own initialisation. We create a neural network model and the key lines of code define a 3-layer network with the first layer having 784 nodes, a hidden middle layer with 200 nodes, and a final output layer with 10 nodes. The 784 matches the number of pixel values for each image. The 10 outputs correspond to each of the possible digits 0-9.
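As a quick piece of arithmetic on these dimensions, the following sketch counts how many adjustable numbers the network contains. It assumes each fully connected layer has one weight per link plus one bias per output node, which is how PyTorch's nn.Linear behaves by default.

```python
layer_sizes = [784, 200, 10]


def count_parameters(sizes):
    """Count link weights and biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        # one weight for every link between the two layers, plus one bias per output node
        total += n_in * n_out + n_out
    return total


print(count_parameters(layer_sizes))   # 159010
```

So even this very simple network has roughly 159,000 numbers being tuned during training.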

The following diagram summarises these dimensions.


The activation function is set as Sigmoid() which is a simple and fairly popular s-shaped threshold.

We also define an error function which summarises how far wrong the network's output is from what it should be. There are several options provided by PyTorch, and we've chosen the BCELoss which is often suitable for cases where only one of the output nodes should be 1, corresponding to a categorising network. Another common option is MSELoss which is a mean squared error.
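To get a feel for the two options, the following sketch computes both kinds of error by hand in plain Python for a made-up network output. The formulas are the standard binary cross entropy and mean squared error, each averaged over the 10 outputs; the example output values are invented for illustration.

```python
import math


def bce_loss(outputs, targets):
    """Binary cross entropy, averaged over all the output nodes."""
    terms = [-(t * math.log(o) + (1 - t) * math.log(1 - o))
             for o, t in zip(outputs, targets)]
    return sum(terms) / len(terms)


def mse_loss(outputs, targets):
    """Mean squared error, averaged over all the output nodes."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


# the correct answer is a 6
targets = [0.0] * 10
targets[6] = 1.0

# made-up output which is confidently right
confident_right = [0.01] * 10
confident_right[6] = 0.99

# made-up output which is confidently wrong (it thinks the image is a 4)
confident_wrong = [0.01] * 10
confident_wrong[4] = 0.99

print("BCE:", bce_loss(confident_right, targets), bce_loss(confident_wrong, targets))
print("MSE:", mse_loss(confident_right, targets), mse_loss(confident_wrong, targets))
```

Both errors are small for the right answer and larger for the wrong one, but the cross entropy punishes a confident wrong answer much more heavily than the squared error does.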

We also define a method to perform gradient descent steps. Here we've chosen the simple stochastic gradient descent, or SGD. Other options are available and you can find out more detail on the others here (link).

In the initialisation function I've also set a counter and an empty list progress which we'll use to monitor training progress.

We need to define a forward() function in our class as this is expected by PyTorch. In our case it is trivially simple: it passes the 784-sized data through the model and returns the 10-sized output.

We have defined a train() function which does the real training. It takes both pixel data and that 1-hot target vector we saw earlier coming from the dataset class representing what the output should be. It passes the pixel data input to the forward() function, and keeps the network's output as outputs. The chosen error function is used to calculate the error, often called a loss.

Finally the following key code calculates the error gradients in the network from this error and uses the chosen gradient descent method to take a step down the gradient for all the network weights. You'll see this key code in almost all PyTorch programs.


# zero gradients, perform a backward pass, and update the weights.
self.optimiser.zero_grad()
loss.backward()
self.optimiser.step()


You might be asking why the first line of code is needed, which appears to zero the gradients before they are calculated again. The reason is that PyTorch accumulates gradients by default: each call to backward() adds to any gradients already stored. This is useful for more complex models which combine gradients from multiple sources, but it means that in a simple training loop like ours we must zero the gradients ourselves before calculating new ones.
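A small sketch of this behaviour, reusing the x and y from the PyTorch Variables section. The helper name y_of is our own, and the sketch assumes PyTorch is installed.

```python
import torch

x = torch.tensor(3.5, requires_grad=True)


def y_of(x):
    return (x - 1) * (x - 2) * (x - 3)


# first backward pass: the gradient is stored in x.grad
y_of(x).backward()
first = x.grad.item()

# second backward pass without zeroing: the new gradient is added to the old one
y_of(x).backward()
accumulated = x.grad.item()

# zero the stored gradient, then do a fresh backward pass
x.grad.zero_()
y_of(x).backward()
after_zeroing = x.grad.item()

print(first, accumulated, after_zeroing)
```

The second value should be double the first, and the third should be back to the true gradient. In our classifier, optimiser.zero_grad() does this zeroing for all the network's weights in one call.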

In the train() function we've also added code to increase the counter at every call of the function, and then used this to add the current error (loss) to the list every 10 iterations. At every 10,000 iterations we print out the current count so we can visually see progress during training.

That's the core code. As an optional extra, I've added a function to plot a chart of the loss against training iteration so we can visualise how well the network trained.

The following simple code shows how easy it is to use this neural network class.


%%time

# create classifier

C = Classifier()

# train classifier

epochs = 3

for i in range(epochs):
    print('training epoch', i+1, "of", epochs)
    for label, image_data_tensor, target_tensor in mnist_dataset:
        C.train(image_data_tensor, target_tensor)
        pass
    pass


You can see we create an object from the PyTorch neural network class Classifier, here unimaginatively called C. We use a loop to work through all the data in the mnist_dataset 3 times, called epochs. You can see we take the label, pixel and target vector data for each image from the dataset and pass it to the C.train() function which does all the hard work we discussed above.

Working through the training dataset of 60,000 images, pushing the data through the neural network, calculating the error and using this to calculate the gradients inside the network, and taking a step down the gradient takes about 3 minutes on Google's infrastructure.

After training we plotted the chart of loss against training iteration. Everyone in the session had a plot where the error fell towards zero over time. A reducing error means the network is getting better at correctly classifying the images of digits.


Because our neural networks are initialised with random link weights, the plots for everyone in the room were slightly different in detail, but the same in the broad trend of diminishing error.

The following shows the trained neural network's forward() function being used to classify the 14th record in the dataset.


The plot shows it is a 6. The neural network's 10 outputs are shown as a bar chart. It is clear the node corresponding to 6 has a value close to 1.0 and the rest are very small. This means the network thinks the image is a 6 with high confidence.

There will be cases where an image is ambiguous and two or more of the network's outputs will have medium or high values. For example, some renditions of a 9 look like a 4.

We performed a very simplistic calculation of how well the network performs, keeping a score of how often our trained network correctly classifies images from the test dataset. All of us in the session had networks which scored about 90%. That is, about 9,000 of the 10,000 test images were correctly classified.
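The scoring idea can be sketched in plain Python as follows. The network's answer is taken to be the output node with the largest value, and the score is the fraction of images where that matches the label. The outputs below are made up for illustration; in the session they would come from the trained network's forward() function, and the function names here are our own.

```python
def predicted_digit(outputs):
    """Index of the largest output, taken as the network's answer."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def score(all_outputs, labels):
    """Fraction of images where the network's answer matches the label."""
    correct = sum(1 for outputs, label in zip(all_outputs, labels)
                  if predicted_digit(outputs) == label)
    return correct / len(labels)


# made-up outputs for three images, for illustration only
example_outputs = [
    [0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.95, 0.01, 0.02, 0.01],  # clearly a 6
    [0.02, 0.01, 0.03, 0.01, 0.45, 0.02, 0.01, 0.02, 0.01, 0.60],  # ambiguous, 4 or 9
    [0.90, 0.01, 0.02, 0.01, 0.01, 0.02, 0.03, 0.01, 0.02, 0.01],  # clearly a 0
]
example_labels = [6, 4, 0]

print(score(example_outputs, example_labels))
```

Here the ambiguous second image is really a 4 but the largest output is for 9, so the network scores two out of three.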

That's impressive for a neural network kept very simple for educational purposes.

The above code doesn't run on the GPU. As a final step we set the default tensor type to be on the GPU and re-ran the code. A small bit of code in the dataset class also needed to be changed to assert this tensor type on the pixel data, as the current version of PyTorch didn't seem to apply the newly set default. No change was needed to the neural network code at all.

This time the training took about 2 minutes. Given the Tesla T4 costs over £2,000 this didn't seem like much of a gain.

We discussed how GPUs only really shine when their massive parallelism is employed. If we increase the size of the data so it is larger than 28*28=784 for each record, or increase the size of the network, then we'll see significant improvements over a CPU.

The following graph shows the effect of increasing the middle layer of our neural network from 200 up to 5000. We can see that training time increases with a CPU but stays fairly constant with a GPU.


You can see and run the full code used for the tutorial online:


Thoughts

Most of the attendees had never written a machine learning application, and some had little coding experience.

Despite this, everyone felt that they had developed an intuition for how machine learning and neural networks in particular work. The hands-on working with code also helped us understand the mechanics of using PyTorch, something which can't easily be developed just by listening passively or reading about it.

A member asked if we had implemented "AI". The answer was emphatically, yes! We did discuss how the word AI has been overused and mystified, and how the session demystified AI and neural networks.

Demystifying AI aside, we did rightly appreciate how the relatively simple idea of a neural network and our simple implementation still managed to learn to identify images with such a high degree of accuracy.

Friday, June 28, 2019

Design for Data Visualisation, and Introduction to Maps and Spatial Data

This month we had a two-part session on the design principles for information visualisation, and an introduction to maps, spatial data and working with the leading open source tool QGIS.


The slides are here: (pdf).


Information Design

Caroline Robinson FRGS, an experienced designer and cartographer, introduced some of the principles for designing effective information visualisations.

She focussed on two key elements - colour and typography.

Caroline explained that colour fidelity is not easy to ensure: one person's display and printer will not necessarily output the same colours as another's. In fact, colour accuracy and reproducibility is a very broad and deep area.

She explained that, in the past when many displays only supported 256 colours, an informal standard of so-called web-safe colours emerged. Today, this is less of a challenge, but where high assurance is required, the industry standard is the Pantone colour scheme. By referring to named colours, different people, organisations and technologies can be assured they mean the same colour as printed in reference swatches, like this one which Caroline brought in to show the group.


Caroline also discussed the important issue of colour blindness, with about 1 in 10 people seeing colours in ways different to others. It was a surprise to me that a common colour combination, red and green, often chosen because the two colours appear opposite, is in fact difficult for many colour blind people to distinguish.

In terms of cognitive load, Caroline recommended that chart keys or labels do not grow beyond 7 colours.


Caroline recommended that the best approach to colour design is to test with a range of users, not just for the ability to distinguish colours, but for cultural interpretation too. For example, red in some cultures doesn't mean danger but is a sign of good luck.

On typography, Caroline explained that serif fonts are easier to read and are particularly suitable for body text. She explained that sans-serif variants are better suited to headings and titles.


She also explained that the size and clarity of the "round circles", shown above, are key in aiding the readability of text.



Using examples and props, Caroline got us to think about the human side of design - how some people can or can't see colours, how design choices can have different cultural significance, and also the broader challenge of bias in the data we use.


Geospatial Information and Maps

For the second part of the evening, Caroline led us through a hands-on introduction to maps and geospatial data using QGIS, a leading GIS tool which also happens to be open source.


One of the great developments over the last few years is the growth of open source maps and map data. OpenStreetMap is a leading example of collaborative contributions together building a free to use map of the world. Proprietary alternatives are either expensive or have constrained terms of use.

The following shows QGIS using OpenStreetMap to show the Penryn campus where we were meeting.


Caroline stressed several times the importance of making sure the right projection is used when using maps, and especially when combining multiple maps or data with maps. If this isn't done right, features will be placed at the wrong location.

The root of this complexity is that the earth is a sphere and most maps are flat surfaces, and there are many choices for how the earth's surface is projected onto flat maps. Some projections aim to preserve distance from a given point, others aim to preserve area, for example.

The first exercise was to add points to our map. The following dialog shows that such points have a rich set of metadata, and also options for user-defined data.


The following shows a set of points added to the map, shown as red dots.


These dots, which could represent trees for example, can be saved to a file. The format, called a shapefile, is industry standard, open and interchangeable. These are essential characteristics for using data across different systems and organisations, and they simplify working with the files programmatically.

Shapefiles can also include metadata for ownership and copyright information.

In addition to points, we can add lines and polygons. The following shows an open poly-line (fence) and a closed polygon (building).


The map shows how different shapes can have different colours, but also display attribute table data. In this case the numbers inside the polygons could refer to the building capacity.

Caroline also demonstrated the various options for importing and exporting map data.

To reiterate the importance of selecting the correct projection, the following shows a common map of the world, with Japan and Britain highlighted. They look comparable in size.


In the next map, Japan has been moved closer to Britain, preserving its area. It is clear that Japan is much larger than Britain, which wasn't apparent on the previous map.
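The size of this distortion can be estimated with a little arithmetic. The sketch below assumes the "common map of the world" uses the Mercator projection, which is an assumption on my part as the session did not name it. On a Mercator map, areas are inflated by a factor of 1/cos²(latitude). The two latitudes are rough mid-points chosen purely for illustration.

```python
import math

def mercator_area_inflation(latitude_deg):
    """Factor by which a Mercator map exaggerates area at a given latitude."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2

# Rough mid-latitudes, chosen for illustration only (assumptions)
japan_lat = 36.0
britain_lat = 54.0

japan_factor = mercator_area_inflation(japan_lat)
britain_factor = mercator_area_inflation(britain_lat)

print(f"Japan area inflation:   {japan_factor:.2f}x")
print(f"Britain area inflation: {britain_factor:.2f}x")
print(f"Britain is exaggerated {britain_factor / japan_factor:.2f} times more than Japan")
```

Under these assumptions Britain is inflated roughly 1.9 times more than Japan, which would go a long way towards explaining why the two look comparable in size on such a map.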





Thoughts

For me the field of geographical information systems and working with maps has become democratised through open source tools, maps and data.

The importance of this can't be overstated. The cost of the proprietary ArcGIS tool is beyond the reach of most individuals and organisations, and historically, maps and location data came with heavy licensing terms and costs.



Today, thanks to open source tools, maps and entity data, we have a rich and vibrant ecosystem of products, innovation and educational possibilities.

Saturday, June 15, 2019

Python Data Science for Kids Taster Workshops

During May and June I ran a series of taster workshops in several locations in Cornwall, designed to:

  • introduce some of the standard Python data science tools to children aged 7-17
  • provide some experience of methods like data loading, cleaning, visualisation, exploring, machine learning


The event page is here:

Data Science and Python

Data Science is a broad term which covers a range of valuable skills - from coding to machine learning, from data engineering to visualisation.

Python has become the leading tool for data scientists by far - and some of the tools in the Python ecosystem are not just de facto standards, but familiarity with them is pretty much expected. These standard tools include the Jupyter notebook and libraries like pandas and scikit-learn.


I think it is incredibly advantageous for children to have some experience with these tools, and I think it is important for them to practice some of the data science disciplines, such as data cleaning and visualisation.


Taster Workshops for Kids

The series of workshops was supported by a grant from NumFOCUS and the Jupyter Project. NumFOCUS is a charity whose mission is to promote open practices in research, data, and scientific computing. They support many of the open source data science tools you very likely already use. You can read a blog post announcing the supported projects here:


Mini Projects For Kids

It is always a challenge to create activities for children that are engaging and also meaningfully help children learn something new. 

Activities need to be small enough so they don't overwhelm, and of a duration that matches a child's comfortable attention span. 

It helps if the activities can be in the form of a story - to make the ideas more real and relatable.

Furthermore, in single workshops there is limited scope to take children through a lot of pre-requisite training in Python - so the activities need to have a lot of the boiler-plate work removed or already done. This means a carefully thought out balance between "pre-typed code" and instructions and questions which ask a child to experiment and explore, or solve a puzzle.

In my own experience, it helps to avoid any kind of technical complexity like installing and configuring software. Web-based tools that require no installation work best in the limited time, attention and diverse setting of a children's workshop.


With this in mind I came up with a series of projects at different levels of difficulty, all using Google's hosted Colab notebook service.


Demonstrating Python and the Jupyter Notebook

At the start of each workshop I talked briefly about the importance of data science at a global scale as well as its relevance to Cornwall.

I then demonstrated basic Python and the Jupyter notebook to show how it works, and to illustrate how easy coding with Python is. I showed how the notebook is just a web page with fields to fill in and run using the "play" button. Having no need to install any software and configure it was a major relief!

The basic Python was simply variables and print statements, progressing onto using a list of children's ages, and using operations on the list like max() and len(). The lack of a mean() or average() was a nice point to show that it is common to pull in extension libraries that implement features not part of the core Python. I showed how to import pandas, and demonstrated the dataframe, which does have a mean() function. I then showed how easy it was to plot a dataframe as a line chart, and change it to a bar chart, and then a histogram.
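A minimal sketch of what that demonstration might look like is shown here. The ages are made up for illustration, and the plotting calls are left as comments because they only produce charts inside a notebook with matplotlib available.

```python
import pandas as pd

# Hypothetical list of children's ages (made-up values for illustration)
ages = [7, 9, 12, 8, 15, 11, 9]

print(max(ages))   # age of the oldest child
print(len(ages))   # number of children

# Python lists have no mean() method, so we bring in the pandas library
df = pd.DataFrame({"age": ages})
print(df["age"].mean())

# In a notebook, the same dataframe can be charted in different ways:
# df.plot(kind="line"), df.plot(kind="bar") or df.plot(kind="hist")
```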

I emphasised the important point that learning all the instructions of a language or its libraries is not the aim. A more important skill is being able to search the documentation and reference sites to find how Python and its libraries can be used to achieve your task.



0 - Getting Started

This short worksheet helps children and their parents or carers get set up to use the Google hosted notebook service.

It makes sure they have a Google account, and helps them create one if needed, and tests access to a simple hosted notebook to check everything is working.




1 - Hands And Fingers

This is a project suitable for younger children. It focuses on measuring the length of fingers on each hand and collecting that data.


The idea of a DataFrame is introduced, and these are used to plot charts showing the lengths. Very simple statistics are explored - the minimum, maximum and mean of a column of data. Children are encouraged to explore how their left and right hands are different using the statistics, but also see how it is much easier to see when the data is visualised.
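As a rough sketch of the kind of code involved, and not the worksheet itself, the following uses made-up finger lengths to build a DataFrame and compare the two hands. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up finger lengths in centimetres, thumb to little finger (illustration only)
hands = pd.DataFrame(
    {"left": [6.1, 7.0, 7.8, 7.2, 5.6], "right": [6.3, 7.1, 7.9, 7.4, 5.8]},
    index=["thumb", "index", "middle", "ring", "little"],
)

# Simple statistics for each hand: one row per statistic, one column per hand
summary = hands.agg(["min", "max", "mean"])
print(summary)

# Difference between the hands, finger by finger
print(hands["right"] - hands["left"])

# In a notebook, hands.plot(kind="bar") draws the two hands side by side
```

The numbers in the summary table are correct but take effort to compare, whereas the side-by-side bar chart makes the difference between the hands visible at a glance.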

The following photo shows a bar chart comparing the finger lengths of the left and right hands, using a different colour for each hand.


The Hands and Fingers project and printable rulers are online:




2 - Garden Bug Detective

The next project starts simple and is set in a friendly story about a robot that collects items from the garden.


The robot doesn't know what it has picked up. It only knows how to measure the width, length and weight of the items.

The children are encouraged to visualise the data to get a high level view of it before diving into any further exploration. This time the first chart isn't very enlightening.


The idea of a histogram is introduced to see the data in a different way. The following photo shows a girl exploring a histogram which clearly shows that the data seems to have two groups - a good start to further exploration.


One child worked out how to show three data series on the same histogram chart!


Scatter charts were introduced next, and this visualisation revealed three definite clusters in the data.


With all these visualisations, the children were encouraged to vary what was plotted, and to use a search engine to find out what the code syntax should be.

The project then progresses to use the sklearn library to perform k-means clustering on the data.  The children were excited to be using the same software used by grown-up machine learning and AI researchers!

Seeing the computer identify the group clusters was exciting, and even more exciting was providing the trained model with new data to classify.
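The train-then-classify process can be sketched in a few lines. This is an illustration under stated assumptions, not the project notebook: the measurements are synthetic, and the three kinds of item and their sizes are invented. It uses scikit-learn's KMeans class.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_items(width, length, weight, n=30):
    """Return n noisy (width, length, weight) measurements around the given values."""
    centre = np.array([width, length, weight])
    return centre + rng.normal(scale=0.2, size=(n, 3))

# Three made-up kinds of garden item (mm, mm, grams); values are illustrative
data = np.vstack([
    make_items(3.0, 60.0, 4.0),    # long and thin
    make_items(6.0, 8.0, 1.0),     # small and light
    make_items(20.0, 25.0, 40.0),  # large and heavy
])

# Train: ask k-means to find three groups without telling it the answers
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Predict: which group does a newly measured item belong to?
new_item = np.array([[5.8, 8.3, 1.1]])
print(model.predict(new_item))
```

Note that k-means only returns a cluster number. Deciding which number corresponds to which kind of item is left to the person looking at the data.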


I felt it was important for the children to have seen this training and classification process at least once at first hand. I think it will stand them in good stead when they consider or see machine learning again in future.

It was great seeing children as young as eight using sklearn to train a model, and using it to predict whether a garden item was a worm, ladybird or stone!

The project is online:




2a - Secret Spy Messages

The next project focussed again on an engaging story to wrap an interesting data science concept.


One spy, Jane, is trying to get messages to another spy, John, but the messages arrive messed up by noise, probably caused by a baddie. Jane tries to send the message 20 times.


The children were asked to look at the 20 messages to see if they could spot the hidden message. The photo below shows a child looking at these noisy images.


This project introduces images as data, and encourages the children to explore mathematical or other operations on images.  The project also demonstrates getting data through a URL and opening the received zip file.
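For readers curious about the download step, here is a small standard-library sketch. The address in the comment is a hypothetical placeholder and not the project's real data location. So that the sketch runs offline, a tiny zip file is built in memory to stand in for the downloaded bytes.

```python
import io
import zipfile
from urllib.request import urlopen

def fetch_bytes(url):
    """Download the raw bytes at a URL."""
    with urlopen(url) as response:
        return response.read()

def read_zip(zip_bytes):
    """Open a zip archive held in memory and return {filename: contents}."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}

# Stand-in for a download: build a tiny zip in memory.
# With a real address you would instead write something like
# zip_bytes = fetch_bytes("https://example.org/messages.zip")  # hypothetical URL
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("message_00.txt", "hello")
zip_bytes = buffer.getvalue()

files = read_zip(zip_bytes)
print(sorted(files))
```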

The matplotlib library is extensively used to show bitmap images, which are 2d numpy arrays.

After the students tried subtracting images, and failed, clues encouraged them to add images. All the students discovered that adding more and more images seemed to reveal an image.


After that revelation, which seemed to excite the children, they were encouraged to think about why adding noisy images together seems to work.



The children, and especially the parents, found it very exciting to see a theoretical idea - average value of random noise being zero - applied in this useful and practical way.
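The idea can be tried out with a short numpy sketch. The "secret" image, the noise level and the random seed are all invented for illustration, and the project's own images are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up 8x8 "secret" image: a bright cross on a dark background
secret = np.zeros((8, 8))
secret[3:5, :] = 1.0
secret[:, 3:5] = 1.0

# Twenty noisy copies: the same image plus zero-mean random noise each time
noisy_messages = [secret + rng.normal(scale=2.0, size=secret.shape) for _ in range(20)]

# Adding the images and dividing by the count gives the average image
average = np.sum(noisy_messages, axis=0) / len(noisy_messages)

# Compare how far one noisy copy and the average are from the true image
error_single = np.abs(noisy_messages[0] - secret).mean()
error_average = np.abs(average - secret).mean()
print(f"typical error in one message:  {error_single:.2f}")
print(f"typical error after averaging: {error_average:.2f}")

# In a notebook, matplotlib's plt.imshow(average, cmap="gray") would display the result
```

Because the noise averages towards zero while the hidden image is the same in every copy, the image survives and the noise shrinks. For independent noise, the spread of the averaged noise falls roughly with the square root of the number of images.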

The project is online:




3 - Mysterious Space X-Rays

The next project is a significant challenge for the more confident, enthusiastic or able children.

It uses real data from a NASA space mission which measures radiation from space. Often the only way to identify objects in deep space is to look at the only thing that gets to us on Earth - radiation.


The Cygnus X-3 system is a mysterious object which behaves in ways which aren't like the standard kinds of stars or other space objects.


The children are encouraged to explore the data and use any idea they have to extract any insightful pattern from the data. Both the children and parents found it exciting that this task was genuinely at the cutting edge of human understanding, and that any idea they had stood a chance of making them famous!



The project itself started by describing steps to look at and identify anomalous data, and then take data cleansing steps. After that, it intentionally stopped prescribing analysis steps, encouraging the children to think up and try their own ideas, using an internet search engine to read about those ideas and how they might be implemented in code. I emphasised again that this skill is valuable.

I was pleasantly surprised by the great ideas that some of the students came up with - including removing small amplitudes as a way of removing noise, or only keeping the very peaks of the data as a way to keep "radiation events".
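Both of these ideas can be expressed in a few lines of numpy. The sketch below uses a synthetic signal, not the NASA data, and the threshold value is an arbitrary choice made for this made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up signal: low-level background noise with three large bursts added
counts = rng.normal(loc=0.0, scale=1.0, size=200)
burst_positions = [40, 95, 160]
counts[burst_positions] += 12.0

# Idea 1: treat small amplitudes as noise and set them to zero
threshold = 5.0  # arbitrary cut-off, chosen by eye for this synthetic data
cleaned = np.where(np.abs(counts) >= threshold, counts, 0.0)

# Idea 2: keep only the positions of the highest readings as "events"
event_positions = np.flatnonzero(counts >= threshold)
print(event_positions)
```

With real measurements the hard part is choosing the threshold, which is exactly the kind of open question the children were encouraged to explore.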


Overall, the more confident and able students really enjoyed working on a data challenge where there was no single correct answer. It was a huge contrast to the tasks they're set at school where there is only one correct answer, and an answer that has been found endlessly before.

The project is online:




Conclusion & Thanks

The motivation behind this touring series of taster workshops was to give children actual experience of using the same tools that are used by professionals across the globe, doing exciting and cutting edge work from AI to data journalism. I also wanted the children to practice some of the methods and disciplines of data science, such as visualising data to understand it better, data cleaning, and using different forms of visualisation to gain deeper insights.


A lesson that I learned was that a small number of children didn't follow the prompts to try things themselves or to think about solving some of the puzzles along the way. These prompts and puzzles were set deliberately because learning happens best when it is done actively rather than passively. I'm not sure there is a good solution to this that can work within the scope of a workshop - attitudes and values to learning come from a broader family environment.


I was really pleased to see some children found the projects genuinely exciting and left wanting to do more ... and I was rather surprised that the parents took as much interest in the projects as the children!

I'd like to thank all the groups that helped make this happen, including the Jupyter project, NumFOCUS, Cornubian Arts and Science Trust, the Royal Cornwall Museum, the Poly Falmouth, the Krowji Arts Centre and Falmouth University.