Saturday, September 29, 2018

Python for Data Science - Top 10 Tools

We held the first Data Science Cornwall meetup this week.

The main talk was an introductory overview of the main python tools and libraries used in data science.



The slides are at: https://goo.gl/Zydbrr

Code and data are at: https://github.com/datasciencecornwall/top10_python_libraries


Python for Data Science

Python has become the leading language for data science. The vibrant python ecosystem includes tools and libraries that have become very popular for data science.

Some of these have become, in effect, part of the de facto python data science stack.

Why python? Python was not designed as a numerical computing language like Fortran, R, SAS or the more recent Julia.

Python was designed to be easy to learn, and designed to be applicable to a wide variety of tasks. The low barrier to entry has led to wide adoption. Python is now very popular, being used to support global scale infrastructures as well as to program small low cost educational hardware such as the BBC micro:bit. Today, python is being used by primary school children and university students alike, and being learned by artists and engineers, further guaranteeing a healthy future.

Python itself isn't very competitive for numerical computing performance, but performant code can be wrapped and used from python as a supporting library. This gives users the best of both worlds - the ease of use of python, and the performance of optimised low level code.


Anaconda Python

The standard python distribution includes a fairly rich set of support libraries, supporting a broad range of tasks such as downloading data from the web, concurrent programming, and working with XML. But it doesn't contain many of the key libraries that are popular for data science today.

As a result, many data scientists use distributions that have these data science libraries included, or downloadable from repositories for easy inclusion. A key benefit of these distributions is that the libraries are tested to work together.


A leading distribution for data science is Anaconda python. Versions for Linux, Windows and MacOS can be downloaded from here.


Working with Notebooks

Traditionally computer programs have been written as text files, edited using text editors or IDEs, and run using command shells.

An alternative approach has become popular, particularly in data science, and that is an interactive, web-based notebook, capable of displaying richer media than a simple command terminal.


The web based notebook approach is far more familiar and user-friendly than a text editor and command shell, and the ability to output formatted data and graphics makes it suitable for more tasks without the need for additional software.

Being web-based they are easy to share and use, with viewing possible on any device with a browser, including tablets and smartphones.

Github will render any jupyter notebook that you upload to it. The example code for this talk is provided in the form of notebooks.

A simple notebook showing basic python code as well as markdown text, useful for commenting and documentation, is on github:
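The notebook itself isn't reproduced here, but as a rough sketch of what such a code cell might contain (the exact content of the notebook on github may differ):

```python
# a code cell: notebooks run python and display the result of the
# last expression automatically, without an explicit print
numbers = [1, 2, 3, 4, 5]
total = sum(numbers)
total
```

Markdown cells sit alongside the code cells and hold the commentary and documentation.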




Data

We've already mentioned that python on its own isn't ideal for working with large arrays of data. In fact python itself doesn't have a builtin array data structure. It only has lists. Of course, 2-dimensional arrays can be made out of lists.

The main issue is that each item in a python list is a high-level python object, so working with large numbers of them isn't very efficient, and hits performance problems well before languages designed for working with data would struggle.


The ubiquitous numpy library provides an array data structure, ndarray, which stores data in a simpler form, and manipulates it more directly in memory. In particular, whole-array operations become much faster without the overhead of working with high-level objects.

The example code illustrates how numpy operations are much faster than python list operations:
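The notebook isn't embedded here, but a minimal comparison along the same lines might look like the following (the array size and timings are illustrative, not the figures from the talk):

```python
import time
import numpy as np

size = 1_000_000
python_list = list(range(size))
numpy_array = np.arange(size)

# multiply every element by 2 using a plain python list comprehension
start = time.time()
doubled_list = [x * 2 for x in python_list]
list_time = time.time() - start

# the same whole-array operation in numpy, performed in optimised low-level code
start = time.time()
doubled_array = numpy_array * 2
numpy_time = time.time() - start

print(f"python list: {list_time:.4f}s, numpy array: {numpy_time:.4f}s")
```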



Numpy arrays are so common they are effectively a core python data type, and efforts are underway to add them to the core.

Numpy maintains a small scope, and allows other libraries to build further functionality on the basic array type.

A leading extension of the numpy array is the pandas dataframe. Pandas offers convenience extensions to the basic array: the ability to output nicely formatted data tables, add column and row names, slice and pivot the data, apply filters to the data, perform simple operations over the data, provide useful import/export, and simple but useful plotting functions.


The example code illustrates how easy it is to load data from a csv file, output it in a pleasant format, filter by column name and/or a numerical threshold, and produce a graph trivially:
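As a rough sketch of that workflow (the filename and column names here are made up for illustration, not the dataset from the talk):

```python
import pandas as pd

# load a csv file into a dataframe (filename is illustrative)
df = pd.read_csv("data.csv")

# notebooks render dataframes as nicely formatted tables
df.head()

# select a column by name, and filter rows by a numerical threshold
tall = df[df["height"] > 1.8]

# plotting is a one-liner, using matplotlib under the hood
df["height"].plot(kind="hist")
```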



A key library for working with data is scipy, which provides a wide range of scientific and mathematical functions - linear algebra, image and signal processing, spatial algorithms and statistics, for example.

The example code demonstrates using scipy to perform a fourier analysis of a signal to find the component frequencies within that data:
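A minimal version of that idea, using a synthetic signal built from two known frequencies rather than the talk's data, might be:

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# build a sample signal from two known frequencies (5 Hz and 12 Hz)
sample_rate = 100                      # samples per second
t = np.arange(0, 2, 1 / sample_rate)   # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# the fourier transform reveals the component frequencies
spectrum = np.abs(rfft(signal))
frequencies = rfftfreq(len(signal), 1 / sample_rate)

# the two strongest peaks should appear at 5 Hz and 12 Hz
peaks = frequencies[spectrum.argsort()[-2:]]
print(sorted(peaks))
```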




Text Analysis

Python has long been used for text processing, and more recently this has extended to modern text mining and natural language processing.

The NLTK natural language toolkit has been around a long time, and as such is well known, with plenty of online tutorials and examples based on it. The toolkit was designed as an education and research tool, and its age means it is relatively proven and mature. The toolkit provides common natural language functions like tokenisation, stemming, part of speech tagging, basic classification and a set of example text datasets (corpora).

The example code demonstrates using nltk to break some sample text into words (tokenisation), label them as parts of speech (verbs, nouns, etc), and extract entities such as names and places:
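A minimal sketch of those steps, using a made-up sentence rather than the talk's sample text (NLTK's downloadable resource names can vary slightly between versions):

```python
import nltk

# download the required models and corpora on first run
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Alice travelled from London to Paris to meet Bob."

# tokenisation - break the text into individual words
tokens = nltk.word_tokenize(text)

# part of speech tagging - label each token as noun, verb, etc
tagged = nltk.pos_tag(tokens)
print(tagged)

# named entity recognition - extract names and places
entities = nltk.ne_chunk(tagged)
print(entities)
```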



With nltk being focussed on research and education, a more modern library, spaCy, is becoming the choice for building products and services. Its main feature is that it is fast. It also provides support for several languages and provides pre-trained word vectors to support more modern methods for working with natural language. The organisation behind spaCy invests in performance and accuracy benchmarking and tuning.

The example code shows spaCy used to process a segment of Shakespeare, extracting people entities, and a simple demonstration of text similarity:
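A rough equivalent, using a short made-up sentence instead of the Shakespeare passage, and assuming the medium English model is installed so that word vectors are available:

```python
import spacy

# load the medium English model, which includes word vectors
# (install with: python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

doc = nlp("Romeo and Juliet were in Verona, not London.")

# the named entities spaCy found, with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

# similarity between two pieces of text, based on word vectors
print(nlp("king").similarity(nlp("queen")))
```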



Natural language text analysis has been in the news recently with notable advancements such as the oft-quoted king - man + woman = queen example. A key innovation behind these advancements is word vectors. These are multi-(but low)-dimensional vectors which are similar for words that are semantically similar. This allows clustering of similar texts, and can also allow a kind of algebra for text. Another notable example illustrates the biases and faults in our own culture, expressed in the text we produce and learned by algorithms: doctor - man + woman = nurse.

Gensim has become the leading library for providing and working with models based on word vectors. The example code demonstrates using word vectors learned from a 2014 snapshot of wikipedia and a large set of mostly news feed data to perform these concept calculations:
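A small sketch of the same calculation, assuming the pre-trained GloVe wikipedia+gigaword vectors available through gensim's downloader (the talk's notebook may load its vectors differently):

```python
import gensim.downloader

# fetch pre-trained GloVe vectors (wikipedia + gigaword news)
# this is a sizeable download on first use
model = gensim.downloader.load("glove-wiki-gigaword-100")

# the classic analogy: king - man + woman ≈ queen
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)
```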




Machine Learning

Machine learning is at the heart of recent advances in artificial intelligence. Improvements to both models and how they are trained have let us solve problems that previously weren't feasible.

The scikit-learn library can almost be considered the native python machine learning library, as it has grown within the community and is now well established and actively developed.

The example code demonstrates how it can be used to perform clustering on data - finding groups within data:
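A minimal sketch of clustering with scikit-learn, using a tiny made-up dataset rather than the data from the talk:

```python
import numpy as np
from sklearn.cluster import KMeans

# some simple 2-d data with two obvious groups
data = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
])

# ask k-means to find 2 clusters in the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

print(labels)                   # which cluster each point belongs to
print(kmeans.cluster_centers_)  # the centre of each cluster
```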


The ecosystem for machine learning is energetic, with both new and established tools. Given the commercial interest, companies like Facebook, Amazon and Microsoft are all keen to promote their own tools.


Tensorflow from Google can't be ignored. It has huge resources behind it, it is developing rapidly, and has a growing body of tutorials around it. The javascript version is also notable, tensorflow.js.

I suggested that users take a moment to think about the risks, if any, of lock-in to a product that is very much managed and developed by Google on their own terms.

I encouraged members to consider PyTorch. Pytorch is notable because of its design:

  • it is pythonic, in contrast to some frameworks which are clearly very thin wrappers around C/C++ or even Fortran code
  • it aims to make the design of machine learning models easy and dynamic - and was innovative in automatically calculating error gradients for you
  • trivially simple use of GPU hardware acceleration - historically, coding for GPUs has been painful


The following illustrates how easy it is to switch from using the CPU to the GPU.
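As a minimal sketch of that device switch (not the exact snippet from the talk):

```python
import torch

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# create two large random matrices on the chosen device
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)

# the same code runs on CPU or GPU - only the device changes
c = torch.matmul(a, b)
print(c.device)
```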


GPU acceleration has been instrumental in the advancement of modern machine learning, and specifically deep learning, which is not a new concept but has been made practical by the increasing accessibility of hardware acceleration.

Sadly, the machine learning ecosystem is dominated by Nvidia and its CUDA framework. AMD's efforts haven't become popular, and vendor independent frameworks like OpenCL haven't taken hold.

You can find an introduction to using PyTorch here. The tutorial implements a neural network to learn to classify the MNIST handwritten digits, a canonical machine learning challenge. Initial results suggested plain simple python performed twice as fast as the CUDA accelerated pytorch code.


Although this looks disheartening, the results were shown to make a point. The following graph shows performance as the neural network grows deeper.


The lesson here is that for very small tasks, the pytorch code, and likely many other high-level frameworks, will be slower than the simplest python code. However, as the data or complexity of the models grows, the hardware acceleration becomes very beneficial.


Visualisation

There are many choices available for visualising data and results with python. However a few key libraries stand out.

Matplotlib is the elder statesman of python graphing. It has been around a long time, and is very flexible. You can do almost anything you want with it. Have a look at the gallery for an overview of its broad capability - from simple charts to vector fields, from 2-d bitmaps to contour plots, from 3d-landscapes to multi-plots.

The demonstration code illustrates matplotlib basics - line plot, x-y plot, histogram, 2-d bitmap plot, contour plot, plotting a 3-d function, and even a fun xkcd-style plot! The code also demonstrates how pandas dataframes also provide convenient plotting functions, which also happen to use matplotlib.
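A small taste of the basics, as a sketch rather than the full demonstration notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# a simple line plot of a sine wave
plt.plot(x, np.sin(x))
plt.title("sine wave")
plt.show()

# a histogram of random data
plt.hist(np.random.randn(1000), bins=30)
plt.show()

# and just for fun, an xkcd-style plot
with plt.xkcd():
    plt.plot(x, np.cos(x))
    plt.show()
```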




Sadly matplotlib's syntax isn't the simplest, and we often have to keep a reference close when using it. Several other frameworks are built on matplotlib to provide nicer defaults and simpler syntax.

Notable alternates to matplotlib include:

  • ggplot - which implements the "grammar of graphics", an attempt to design a language for describing plots and charts. R's popular ggplot2 plotting library is based on this same design.
  • bokeh - draws plots using browser canvas elements, making for smoother visuals, and also enabling easy interactivity and dynamic manipulation.
  • plotly - a popular online service, with investment in documentation and quality, but some services cost money, and the risk of currently free services becoming paid in future needs to be considered.


Finally, we looked at visualising networks of linked data. This is an area that isn't very mature in that there isn't a very well established library that most data scientists rely on.

Networkx is a library for manipulating graph data and comes as part of the Anaconda distribution. It has a very simple capability to plot static views of graphs.

For dynamic exploration of graph data, which is often dense and requires interactivity to explore, a common solution is to use the javascript d3.js library. This isn't a python library, but data scientists often find themselves having to code their own wrapper around it to use from python.

The example code illustrates a simple networkx visualisation, as well as an example of data from a pandas dataframe being used to populate a d3.js force-directed graph - which is also interactive. The github rendering doesn't show the interactive plot so you'll have to run the code yourself.
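A minimal networkx sketch of a static graph plot (the nodes and edges here are made up for illustration, and the d3.js part is only in the notebook):

```python
import networkx as nx
import matplotlib.pyplot as plt

# build a small graph of linked nodes
graph = nx.Graph()
graph.add_edges_from([
    ("python", "numpy"), ("python", "pandas"),
    ("pandas", "numpy"), ("python", "matplotlib"),
])

# networkx can draw a simple static plot using matplotlib
nx.draw(graph, with_labels=True, node_color="lightblue")
plt.show()
```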



The code for this implementation was developed by myself, based heavily on the work of Mike Bostock, but made to work in a jupyter notebook. The original code was designed to visualise clusters of related text documents. For further explanation see the blog.



Conclusion & References

The talk seemed to be well received, but more importantly, a community has started to form, and the great discussions around the talk bode well for an active future.


A member suggested we look at Google's Colab, a system which includes a version of python notebooks that can be shared and remain editable by multiple users, much like documents in Google Drive: