Friday, June 28, 2019

Design for Data Visualisation, and Introduction to Maps and Spatial Data

This month we had a two-part session: one on the design principles for information visualisation, and an introduction to maps, spatial data and working with the leading open source tool QGIS.


The slides are here: (pdf).


Information Design

Caroline Robinson FRGS, an experienced designer and cartographer, introduced some of the principles for designing effective information visualisations.

She focussed on two key elements - colour and typography.

Caroline explained that colour fidelity is not easy to ensure: one person's display and printer will not necessarily output the same colours as another's. In fact, colour accuracy and reproducibility is a very broad and deep area.

She explained that, in the past when many displays only supported 256 colours, an informal standard of so-called web-safe colours emerged. Today this is less of a challenge, but where high assurance is required, the industry standard is the Pantone colour scheme. By referring to named colours, different people, organisations and technologies can be assured they mean the same colour as printed in reference swatches, like the one Caroline brought in to show the group.


Caroline also discussed the important issue of colour blindness, with about 1 in 10 people seeing colours differently to others. It was a surprise to me that a common colour combination, red and green, often chosen because the two appear opposite, is in fact difficult for many colour blind people to distinguish.

In terms of cognitive load, Caroline recommended that chart keys or labels do not grow beyond 7 colours.


Caroline recommended that the best approach to colour design was to test with a range of users, not just for an ability to distinguish colours, but also for cultural interpretation too. For example, red in some cultures doesn't mean danger but is a sign of good luck.

On typography, Caroline explained that serif fonts are easier to read and are particularly suitable for body text. She explained that sans-serif variants are better suited to headings and titles.


She also explained that the size and clarity of the "round circles", shown above, are key in aiding the readability of text.



Using examples and props, Caroline got us to think about the human side of design - how some people can or can't see colours, how design choices can have different cultural significance, and also the broader challenge of bias in the data we use.


Geospatial Information and Maps

For the second part of the evening, Caroline led us through a hands-on introduction to maps and geospatial data using QGIS, a leading GIS tool which also happens to be open source.


One of the great developments over the last few years is the emergence of open source maps and map data. OpenStreetMap is a leading example of collaborative contributions together building a free-to-use map of the world. Proprietary alternatives are either expensive or have constrained terms of use.

The following shows QGIS using OpenStreetMap to display the Penryn campus where we were meeting.


Caroline stressed several times the importance of making sure the right projection is used when working with maps, and especially when combining multiple maps or layering data on a map. If this isn't done right, features will be placed at the wrong location.

The root of this complexity is that the earth is (approximately) a sphere while most maps are flat surfaces, and there are many choices for how the earth's surface is projected onto a flat map. Some projections aim to preserve distance from a given point, others aim to preserve area, for example.
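To make the idea concrete, here is a minimal sketch - assuming the pyproj library, Python bindings to the same PROJ library that QGIS uses - which reprojects a longitude/latitude point near the Penryn campus from WGS84 (EPSG:4326) into the British National Grid (EPSG:27700). The coordinates are illustrative only.

    from pyproj import Transformer

    # WGS84 longitude/latitude -> British National Grid eastings/northings
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

    lon, lat = -5.12, 50.17   # an illustrative point near Penryn, Cornwall
    easting, northing = transformer.transform(lon, lat)

    print(f"Easting: {easting:.0f} m, Northing: {northing:.0f} m")

If a layer is tagged with the wrong coordinate reference system, the same numbers are interpreted under a different projection and its features end up in the wrong place.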

The first exercise was to add points to our map. The following dialog shows that such points have a rich set of metadata, as well as options for user-defined data.


The following shows a set of points added to the map, shown as red dots.


These dots, which could represent trees for example, can be saved to a file. The format, called a shapefile, is an industry standard, open and interchangeable. These are essential characteristics for using data across different systems and organisations, and they also simplify working with the data programmatically.

Shapefiles can also include metadata for ownership and copyright information.
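As a rough sketch of what working with shapefiles programmatically can look like - assuming the geopandas library and a hypothetical trees.shp file - reading, inspecting and re-saving a layer takes only a few lines:

    import geopandas as gpd

    # read a (hypothetical) shapefile of tree locations into a GeoDataFrame
    trees = gpd.read_file("trees.shp")

    # each row is a feature: its geometry plus any attribute columns
    print(trees.crs)      # the layer's coordinate reference system (projection)
    print(trees.head())   # the first few features and their attributes

    # save a copy back out as another shapefile
    trees.to_file("trees_copy.shp")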

In addition to points, we can add lines and polygons. The following shows an open polyline (a fence) and a closed polygon (a building).


The map shows how different shapes can have different colours, and can also display data from the attribute table. In this case the numbers inside the polygons could refer to building capacity.

Caroline also demonstrated the various options for importing and exporting map data.

To reiterate the importance of selecting the correct projection, the following shows a common map of the world, with Japan and Britain highlighted. They look comparable in size.


In the next map, Japan has been moved closer to Britain, preserving its area. It is clear that Japan is much larger than Britain, which wasn't apparent on the previous map.





Thoughts

For me the field of geographical information systems and working with maps has become democratised through open source tools, maps and data.

The importance of this can't be overstated. The cost of the proprietary ArcGIS tool is beyond the reach of most individuals and organisations, and historically, maps and location data came with heavy licensing terms and costs.



Today, thanks to open source tools, maps and entity data, we have a rich and vibrant ecosystem of products, innovation and educational possibilities.

Saturday, June 15, 2019

Python Data Science for Kids Taster Workshops

During May and June I ran a series of taster workshops in several locations in Cornwall for children aged 7-17, designed to:

  • introduce some of the standard data science tools
  • provide some experience of methods like data loading, cleaning, visualisation, exploration and machine learning


The event page is here:

Data Science and Python

Data Science is a broad term which covers a range of valuable skills - from coding to machine learning, from data engineering to visualisation.

Python has become by far the leading tool for data scientists - and some of the tools in the Python ecosystem are not just de facto standards, but familiarity with them is pretty much expected. These standard tools include the Jupyter notebook and libraries like pandas and scikit-learn.


I think it is incredibly advantageous for children to have some experience with these tools, and I think it is important for them to practice some of the data science disciplines, such as data cleaning and visualisation.


Taster Workshops for Kids

The series of workshops was supported by a grant from NumFOCUS and the Jupyter Project. NumFOCUS is a charity whose mission is to promote open practices in research, data, and scientific computing. It supports many of the open source data science tools you very likely already use. You can read a blog post announcing the supported projects here:


Mini Projects For Kids

It is always a challenge to create activities for children that are engaging and also meaningfully help them learn something new.

Activities need to be small enough so they don't overwhelm, and of a duration that matches a child's comfortable attention span. 

It helps if the activities can be in the form of a story - to make the ideas more real and relatable.

Furthermore, in single workshops there is limited scope to take children through a lot of pre-requisite training in Python - so the activities need to have much of the boilerplate work removed or already done. This means a carefully thought out balance between "pre-typed code" and instructions and questions which ask a child to experiment and explore, or solve a puzzle.

In my own experience, it helps to avoid any kind of technical complexity like installing and configuring software. Web-based tools that require no installation work best in the limited time, attention and diverse setting of a children's workshop.


With this in mind I came up with a series of projects at different levels of difficulty, all using Google's hosted colab notebook service.


Demonstrating Python and the Jupyter Notebook

At the start of each workshop I talked briefly about the importance of data science at a global scale as well as its relevance to Cornwall.

I then demonstrated basic Python and the Jupyter notebook to show how it works, and to illustrate how easy coding with Python is. I showed how the notebook is just a web page with fields to fill in and run using the "play" button. Having no software to install or configure was a major relief!

The basic Python was simply variables and print statements, progressing to a list of children's ages and list operations like max() and len(). The lack of a mean() or average() was a nice point to show that it is common to pull in extension libraries that implement features not part of core Python. I showed how to import pandas, and demonstrated the dataframe, which does have a mean() function. I then showed how easy it was to plot a dataframe as a line chart, and change it to a bar chart, and then a histogram.
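A minimal sketch of the kind of notebook code demonstrated - the ages below are invented for illustration:

    import pandas as pd

    # plain Python: a list of children's ages and some built-in operations
    ages = [7, 9, 10, 12, 8, 11]
    print(max(ages))   # the largest age
    print(len(ages))   # how many ages there are

    # core Python has no mean(), so we pull in the pandas library
    df = pd.DataFrame({'age': ages})
    print(df['age'].mean())

    # the same data as a line chart, a bar chart, then a histogram
    df.plot()
    df.plot.bar()
    df.plot.hist()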

I emphasised the important point that learning all the instructions of a language or its libraries is not the aim. A more important skill is being able to search the documentation and reference sites to find how Python and its libraries can be used to achieve your task.



0 - Getting Started

This short worksheet helps children and their parents or carers get set up to use the Google hosted notebook service.

It makes sure they have a Google account, and helps them create one if needed, and tests access to a simple hosted notebook to check everything is working.




1 - Hands And Fingers

This is a project suitable for younger children. It focuses on measuring the length of fingers on each hand and collecting that data.


The idea of a DataFrame is introduced, and these are used to plot charts showing the lengths. Very simple statistics are explored - the minimum, maximum and mean of a column of data. Children are encouraged to explore how their left and right hands differ using the statistics, but also to see how much easier the differences are to spot when the data is visualised.

The following photo shows a bar chart comparing the finger lengths of the left and right hands, using a different colour for each hand.
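A rough sketch of the kind of code behind such a chart - the measurements below are invented, not real workshop data:

    import pandas as pd

    # finger lengths in centimetres, one column per hand (invented values)
    fingers = pd.DataFrame({
        'left':  [5.0, 6.5, 7.2, 6.8, 5.1],
        'right': [5.1, 6.4, 7.4, 6.9, 5.0],
    }, index=['thumb', 'index', 'middle', 'ring', 'little'])

    # very simple statistics for each hand
    print(fingers.min())
    print(fingers.max())
    print(fingers.mean())

    # a bar chart makes the left/right differences much easier to see
    fingers.plot.bar()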


The Hand and Fingers project and printable rulers are online:




2 - Garden Bug Detective

The next project starts simple and is set in a friendly story about a robot that collects items from the garden.


The robot doesn't know what it has picked up. It only knows how to measure the width, length and weight of the items.

The children are encouraged to visualise the data to get a high level view of it before diving into any further exploration. This time the first chart isn't very enlightening.


The idea of a histogram is introduced to see the data in a different way. The following photo shows a girl exploring a histogram which clearly shows that the data seems to have two groups - a good start to further exploration.


One child worked out how to show three data series on the same histogram chart!


Scatter charts were introduced next, and this visualisation revealed three definite clusters in the data.


With all these visualisations, the children were encouraged to vary what was plotted, and to use a search engine to find out what the code syntax should be.

The project then progresses to use the sklearn library to perform k-means clustering on the data.  The children were excited to be using the same software used by grown-up machine learning and AI researchers!

Seeing the computer identify the group clusters was exciting, and even more exciting was providing the trained model with new data to classify.
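A minimal sketch of that clustering step, using invented width and length measurements in place of the project's real robot data:

    import pandas as pd
    from sklearn.cluster import KMeans

    # invented measurements of garden items: width and length in cm
    items = pd.DataFrame({
        'width':  [0.3, 0.4, 0.5, 1.0, 1.1, 0.9, 2.8, 3.0, 3.1],
        'length': [1.0, 1.2, 0.9, 0.8, 0.9, 1.0, 3.0, 3.2, 2.9],
    })

    # ask k-means to find 3 clusters in the data
    model = KMeans(n_clusters=3, n_init=10)
    model.fit(items)
    print(model.labels_)   # which cluster each item was placed in

    # classify a brand new measurement with the trained model
    new_item = pd.DataFrame({'width': [0.35], 'length': [1.1]})
    print(model.predict(new_item))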


I felt it was important for the children to have seen this training and classification process at least once at first hand. I think it will stand them in good stead when they consider or see machine learning again in future.

It was great seeing children as young as eight using sklearn to train a model, and use it to predict whether a garden item was a worm, ladybird or stone!

The project is online:




2a - Secret Spy Messages

The next project focussed again on an engaging story to wrap an interesting data science concept.


One spy, Jane, is trying to get messages to another spy, John, but the messages arrive messed up by noise, probably caused by a baddie. Jane tries sending the message 20 times.


The children were asked to look at the 20 messages to see if they could spot the hidden message. The photo below shows a child looking at these noisy images.


This project introduces images as data, and encourages the children to explore mathematical or other operations on images.  The project also demonstrates getting data through a URL and opening the received zip file.

The matplotlib library is extensively used to show bitmap images, which are 2d numpy arrays.

After the students tried subtracting images, without success, clues encouraged them to add images instead. All the students discovered that adding more and more images seemed to reveal the hidden picture.
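A minimal sketch of the underlying idea, using a synthetic "message" image rather than the project's downloaded data:

    import numpy as np
    import matplotlib.pyplot as plt

    # a synthetic secret message: a 20x20 image containing a bright square
    message = np.zeros((20, 20))
    message[5:15, 5:15] = 1.0

    # 20 noisy copies, like the 20 messages Jane sends
    noisy_copies = [message + np.random.normal(0, 1.0, message.shape) for _ in range(20)]

    # any single copy is unreadable...
    plt.imshow(noisy_copies[0], cmap='gray')
    plt.show()

    # ...but adding (averaging) the copies cancels the noise and reveals the message
    average = sum(noisy_copies) / len(noisy_copies)
    plt.imshow(average, cmap='gray')
    plt.show()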


After that revelation, which seemed to excite the children, they were encouraged to think about why adding noisy images together seems to work.



The children, and especially the parents, found it very exciting to see a theoretical idea - average value of random noise being zero - applied in this useful and practical way.

The project is online:




3 - Mysterious Space X-Rays

The next project is a significant challenge for the more confident, enthusiastic or able children.

It uses real data from a NASA space mission which measures radiation from space. Often the only way to identify objects in deep space is to look at the only thing that gets to us on Earth - radiation.


The Cygnus X-3 system is a mysterious object which behaves in ways which aren't like the standard kinds of stars or other space objects.


The children are encouraged to explore the data and use any idea they have to extract any insightful pattern from the data. Both the children and parents found it exciting that this task was genuinely at the cutting edge of human understanding, and that any idea they had stood a chance of making them famous!



The project itself started by describing steps to look at and identify anomalous data, and then take data cleansing steps. After that, it intentionally stopped prescribing analysis steps, encouraging the children to think up and try their own ideas, using an internet search engine to read about those ideas and how they might be implemented in code. I emphasised again that this skill is valuable.

I was pleasantly surprised by the great ideas that some of the students came up with - including removing small amplitudes as a way of removing noise, or only keeping the very peaks of the data as a way to keep "radiation events".
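A rough sketch of that kind of thresholding idea - the column names and values here are invented, not the actual format of the NASA data:

    import pandas as pd

    # an invented time series of X-ray counts
    xrays = pd.DataFrame({
        'time':  range(10),
        'count': [2, 3, 1, 45, 2, 2, 60, 3, 1, 2],
    })

    # idea 1: treat small amplitudes as noise and remove them
    cleaned = xrays[xrays['count'] > 10]

    # idea 2: keep only the very highest peaks as "radiation events"
    threshold = xrays['count'].quantile(0.9)
    peaks = xrays[xrays['count'] >= threshold]

    print(cleaned)
    print(peaks)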


Overall, the more confident and able students really enjoyed working on a data challenge where there was no single correct answer. It was a huge contrast to the tasks they're set at school, where there is only one correct answer - an answer that has been found endlessly before.

The project is online:




Conclusion & Thanks

The motivation behind this touring series of taster workshops was to give children actual experience of using the same tools that are used by professionals across the globe, doing exciting and cutting edge work from AI to data journalism. I also wanted the children to practice some of the methods and discipline from data science, such as visualising data to understand it better, data cleaning, and using different forms of visualisation to gain deeper insights.


A lesson that I learned was that a small number of children didn't follow the prompts to try things themselves or to think about solving some of the puzzles along the way. These prompts were set deliberately because learning happens best when it is done actively rather than passively. I'm not sure there is a good solution to this that can work within the scope of a single workshop - attitudes and values around learning come from the broader family environment.


I was really pleased to see some children found the projects genuinely exciting and left wanting to do more ... and I was rather surprised that the parents took as much interest in the projects as the children!

I'd like to thank all the groups that helped make this happen, including the Jupyter project, NumFOCUS, the Cornubian Arts and Science Trust, the Royal Cornwall Museum, the Poly Falmouth, the Krowji Arts Centre and Falmouth University.



Thursday, May 30, 2019

Python First Steps - A Hands On Tutorial

This month we had a first-steps introduction to Python. It was arranged in response to feedback from members who felt a beginners introduction would be useful in helping them explore the Python data science ecosystem of tools and methods.


The slides for the talk are online [link].


Aim

The aim of the session was not to provide comprehensive coverage of Python as a language, nor an exhaustive tour of the ecosystem of libraries and tools.

The aim was to:

  • demonstrate enough of the basics of Python to see how it works,
  • write your own code,
  • practice the important skill of searching the internet for code syntax and how to use libraries
  • be able to understand a good amount of python code that others have written
  • and most importantly, develop the confidence to continue to learn and explore python.

Throughout we emphasised the point that today, an encyclopaedic knowledge of a language or a library's options is not what makes a good programmer. Languages and tools are now so numerous and so large that the ability to find the right tool and learn how to use it is a much more important skill. Added to this is the fact that tools change at an ever faster rate.


Why Python?

We briefly set the scene by looking at several recent charts showing Python as one of the fastest growing languages, already in the top 3 in most market analyses, and far ahead in the fields of data science and especially machine learning.



We pondered the fact that Python was not initially designed as a numerical language, but its ease of use accelerated its adoption in many fields, including data science.


Notebooks

In the last decade a key innovation has emerged that has made coding easier, friendlier, and avoids the technical setup that was previously necessary.


That innovation is the notebook. In essence, it is just a web page into which we write our instructions, and see the results of those instructions.

A web page is already very familiar to many people and reduces the barriers to coding.

Today notebooks are both simple, and also very capable, with the ability to show charts, include animations, and even include control widgets.

Github, and other code repositories, even support previewing uploaded notebooks - here's an example from one of our own meetups:




Getting / Using Python

Most users of python make extensive use of the healthy and vibrant ecosystem of libraries and tools. The official python distribution from python.org is fairly capable but doesn't include many of the now popular libraries.

Many data scientists and machine learning researchers use the Anaconda Python distribution which includes many of the common libraries used in these fields. They are fairly well tested to work together, and the distribution even includes performance optimisations for Intel CPUs. Anaconda Python also includes the standard jupyter notebook system.

Another good alternative is to use Google's hosted system called Colab. This makes using Python even easier as there is nothing to install - everything runs on Google's infrastructure, through a web browser. Despite being a relatively young service, it is robust and growing rapidly in popularity. Even better, the service is free, subject to some controls to avoid exploitative use. Most compelling to machine learning researchers is free access to otherwise very expensive GPUs for accelerating computation.


Python Basics

We worked through the following key python concepts - first discussing them, then seeing some examples, and finally having a go at solving some of the challenges which were designed to test our understanding of the theme, or our ability to find answers on the internet.

  • Variables and Lists
  • Loops and Logic
  • Functions
  • Objects and Classes
  • Visualisation


The slides include links to simple notebooks which you can open and explore, and even edit after saving your own copy. The following shows a snippet of the first notebook, introducing variables and lists:
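For readers without the slides to hand, a minimal sketch of the sort of content in that first notebook:

    # variables hold values we can reuse
    name = "Cornwall"
    year = 2019
    print(name, year)

    # a list holds several values in order
    temperatures = [12, 15, 14, 18, 16]
    print(temperatures[0])     # the first item
    print(len(temperatures))   # how many items there are
    print(max(temperatures))   # the largest value

    # a loop visits each item in turn
    for t in temperatures:
        print(t * 2)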




The class did very well, working through all the themes. Given that some had never coded before, this is quite impressive.

The only topic that caused some trouble was the more advanced topic of objects and classes, which could be a topic for an entire class itself.


Object oriented programming is considered an advanced topic, so it is still an achievement if attendees can recognise it in code they look at in future, even if not all the details are immediately clear.


Looking At Others' Code

To demonstrate that what we covered was indeed a large proportion of the basic elements from which real-world code is built, we looked at two different examples:

  • a generative adversarial network which uses neural networks that learn to generate faces
  • a web application server which runs a twitter-like service


We noted that the machine learning code was built from now-familiar elements such as variables, functions, imports, loops, classes and objects and visualising numerical arrays.

The web application code was striking in how small it was - given that the service was in essence the same as Twitter. The point of this was to show that, with libraries, many problems can be solved with a very small amount of Python - and very understandable Python at that.


Conclusion

Speaking with attendees afterwards, I was pleased that the session had:

  • demystified coding and python
  • given some the confidence to explore more, noting that what we covered in class is a large proportion of the foundations on which most code is built
  • underlined the importance of research skills over memorising python instructions and options


Monday, April 1, 2019

Machine Learning for Image Classification - Tensorflow Tutorial

This month's meetup was a tutorial on machine learning to do image classification with Tensorflow.
We also had a short talk looking deeper at the last session's sentiment analysis.


Barney's image classification slides are at: (pdf). David's notebook on sentiment analysis is at: (link).

A video of the talks is at: (youtube).


A Deeper Look At Sentiment Analysis

At the previous session we explored simple approaches to sentiment analysis, in particular the lexical approach of summing scores associated with words known to be positive or negative.
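A minimal sketch of that lexical approach, using a tiny made-up lexicon rather than a real one such as AFINN:

    # a tiny made-up lexicon: positive words score above 0, negative words below 0
    lexicon = {'good': 2, 'great': 3, 'happy': 2, 'bad': -2, 'awful': -3, 'sad': -2}

    def sentiment_score(text):
        """Sum the scores of any known words in the text."""
        return sum(lexicon.get(word, 0) for word in text.lower().split())

    print(sentiment_score("what a great and happy day"))   # positive overall
    print(sentiment_score("a bad and awful ending"))        # negative overall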



David dug deeper into that approach and found that documents given a positive or negative overall score actually contain many positively and negatively scored words within them. The following histogram illustrates this.


David's explorations remind us that it is important to:
  • understand your data and not apply analysis blindly
  • understand the limits or weaknesses of an algorithm
  • remember that a statistical answer isn't complete without a measure of "confidence"

You can find more of his code and results in his Bitbucket repository.


Image Classification - Automating Manual Processes

Barney started with a very compelling scenario: a business manually sorting paper - invoices, cash claims, letters. The work itself is boring, very slow, prone to fatigue and error, and not a good use of people's time.

A natural question occurred to him - could that manual process of classifying documents be automated?


Barney looked to modern neural network based machine learning methods which have proven very successful at image classification.

This illustration shows a neural network learning to classify images from a data set as one of three particular characters.


Neural networks learn by adjusting link weights between nodes that make up layers of nodes. Given a training example, the error in its prediction is used to update link weights by a small amount to try to improve that prediction. Over many training examples, a neural network can get better and better at classifying a given image.

Although it is tempting to build a neural network from scratch, in many industrial applications it makes sense to use architectures that have been proven suitable for a given task. The neural network architecture for image classification will likely be different to one for natural language prediction.

Barney discussed Google's Inception network, a large network optimised for image classification.


There are some excellent articles online that explain the history and rationale of Google's Inception networks:



The deep (and wide) Inception network is trained on 1.2 million images, and training it from scratch is only practical with the large compute power available to organisations like Google.

Barney explained that we don't need to train such large complex networks from scratch - we can take advantage of the training that Google has done and simply extend that training to our own data. This is called transfer learning.


In essence, the start (left part) of the network has learned to pick out features that help with the task of image classification. We retain this learning and train only the small end of the network, so it focuses on learning our own subset of images while still making use of the features learned from the huge training data.
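As a hedged illustration of the idea (not Barney's actual code), here is a minimal Keras sketch that reuses a pre-trained Inception V3 base and trains only a small new classification head; the three document classes are an invented example:

    import tensorflow as tf
    from tensorflow import keras

    # load Inception V3 pre-trained on ImageNet, without its final classifier layer
    base = keras.applications.InceptionV3(include_top=False, weights='imagenet',
                                          input_shape=(299, 299, 3), pooling='avg')
    base.trainable = False   # freeze the pre-trained feature extractor

    # add a small new head for our own classes (e.g. 3 document types)
    model = keras.Sequential([
        base,
        keras.layers.Dense(3, activation='softmax'),
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # model.fit(our_images, our_labels, epochs=5)   # trains only the new head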

You can read more about transfer learning here:



Barney's results were very promising and have generated significant excitement about automating the manual business process. He continues to develop and refine his solution.

Barney touched on an important aspect of automation - the impact on people and employment. His analysis is that businesses should focus on shifting people away from boring and low-skilled tasks towards more challenging and creative work - and the very same people previously employed can do this.


Tensorflow Walk-Through

Barney walked us through a python notebook which demonstrates the training of a simple network using the popular Tensorflow machine learning framework to classify images of fashion items from the MNIST fashion dataset.

The online colab notebook which you can run, and is very well commented, is here:




The central elements of the process, sketched in code below, are:

  • import Tensorflow and its higher-level API Keras, which makes describing and using neural networks easier
  • import the MNIST fashion data set of 60,000 training images and 10,000 test images
  • convert the grey-scale image data from the range 0-255 to 0-1
  • construct a 3-layer network whose input layer is 28x28, to match the image size
  • "compile" the network model with a loss function and a method for adjusting the weights, commonly a variant of gradient descent, of which there are many options
  • train the neural network (for 5 passes, or epochs) using the 60,000-image training set
  • after training, check how well the network performs by testing it on the 10,000-image test set
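A minimal sketch of those steps, closely following the standard Tensorflow/Keras fashion-MNIST tutorial on which the notebook is based (the exact layer sizes in Barney's notebook may differ):

    import tensorflow as tf
    from tensorflow import keras

    # the Fashion-MNIST dataset: 60,000 training and 10,000 test images
    (train_images, train_labels), (test_images, test_labels) = \
        keras.datasets.fashion_mnist.load_data()

    # scale grey-scale pixel values from 0-255 down to 0-1
    train_images, test_images = train_images / 255.0, test_images / 255.0

    # a simple 3-layer network: flatten the 28x28 image, one hidden layer, 10 outputs
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax'),
    ])

    # compile with a loss function and a gradient-descent style optimiser
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # train for 5 epochs on the training images
    model.fit(train_images, train_labels, epochs=5)

    # evaluate on the 10,000 unseen test images
    test_loss, test_acc = model.evaluate(test_images, test_labels)
    print('test accuracy:', test_acc)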

That score of 0.8778 means the neural network correctly classified about 88% of the 10,000 test images - an excellent initial result!

Barney also explored an important aspect of image classification. He first showed an example of an ankle boot and the result of a network prediction. It is clear that the network has a very high confidence that the boot image is indeed an ankle boot.


He then showed us a more interesting example. Here is the network confidently but incorrectly classifying a sneaker as a sandal. 


A different run shows the network correctly classifying the sneaker but the outputs of the network are high for both sandal and sneaker. The confidence isn't so clear cut.


Looking at this confidence is a useful enrichment to understanding the otherwise simple output of a network.



Overall Barney's walk through demonstrated the key stages of machine learning and highlighted some key issues, such as distinct training and test data, and understanding the confidence of a prediction.


Local Apps, AI in the Cloud

Barney then explained a useful architectural approach of having a lighter local app, perhaps a web app running on a smartphone, backed by a machine learning model hosted in the cloud which benefits from larger compute resources.


As a fun example, Barney gave a live demo of a smartphone web app that took a photo of a scone and used a cloud-hosted pre-trained model to determine whether it was a Cornish or Devon scone!



Conclusion

Barney succeeded in conveying the key steps applicable to most machine learning exercises, whilst also showing how easy modern tools and technology make this process.

Both Barney and David also highlighted that although the tools and algorithms appear impressive and confident, it is important to look beneath the simple outputs to understand the confidence of those answers. David did this with sentiment analysis and Barney illustrated this with image classification.


Quite a few members said they were inspired to try the tools themselves.