Friday, February 1, 2019

Sentiment Analysis - A Hands-On Tutorial With Python

This month we had a hands-on tutorial taking us through simple sentiment analysis of natural language text.

The slides are at: [PDF]

Code and data are at: [github]

Natural Language and Sentiment Analysis

Natural language is everywhere - from legal documents to tweets, from corporate emails to historic literature, from customer discussions to public inquiry reports. The ability to automatically extract insight from text is a powerful one.

The challenge is that human language is hard to compute with. It was never designed to be consistent, precise and unambiguous - in fact, that is its beauty!

In the broad disciplines of natural language processing and text mining, sentiment analysis stands out as particularly common and useful to many organisations. Sentiment analysis aims to work out whether a piece of text is being positive or negative about the subject of discussion.

This sentiment analysis can result in a simple number, or an even simpler positive / negative label. Even this simplicity can be really useful, providing insights into large or rapidly emerging text, where it would not be feasible to read and assess the text manually.

We were lucky to have Peter give us an overview of sentiment analysis and lead a hands on tutorial using Python's venerable NLTK toolkit.

Two Approaches

Approaches to sentiment analysis roughly fall into two categories:
  • Lexical - using prior knowledge about specific words to establish whether a piece of text has positive or negative sentiment.
  • Machine Learning - training a model using examples of positive and negative texts. Often that model is probabilistic, that is, it learns the probability of positive or negative sentiment based on the combination of words present in the text.

Peter created two simplified tasks, one for each of these approaches.

Lexical Approach

A very simple lexical approach is to have a set of words which we know contribute a negative or positive sentiment.

This picture shows just five words.

The word poor indicates a negative sentiment. The word bad indicates a stronger negative sentiment. The word terrible indicates a really negative sentiment. The scores associated with these words reflect how strong that negative sentiment is.

Similarly, the word good suggests a positive sentiment, and the word great suggests stronger positive sentiment. The scores reflect this too.

This is just a very small sample of scored words, but researchers have created longer, more comprehensive, lists of such words. A good example is the VADER project's list of words and their contribution to sentiment: vader_lexicon.txt.

A particularly simple way of using these scored words is to simply add up the scores as we find the words in the text being analysed.

You can see in this very short film review we've found the words poor, terrible and good. Adding up the scores for those gives us a total of -5. The positive sentiment of good wasn't enough to outweigh the very negative sentiment from the first sentence.

This is a simple approach which serves to illustrate the lexical method for sentiment analysis.

You can see that in practice, if we want to compare scores across reviews, we'd need to adjust the scores so that very long sentences or texts aren't unfairly advantaged over shorter ones. A good way to do this is to divide the scores by the number of words in the text snippet. Even this might be improved by dividing by the number of words actually matched and scored; otherwise there is a risk of long passages diluting sentiment scores.

In our simple example, that score would be -5 / 3 = -1.67. The negative result indicates an overall negative sentiment.
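This lexical scoring can be sketched in a few lines of Python. The mini-lexicon and its scores below are made up for illustration; a real list like VADER's is far larger.

```python
# A tiny hypothetical lexicon; real lists like VADER's are far larger.
lexicon = {"poor": -1.0, "bad": -2.0, "terrible": -3.0,
           "good": 1.0, "great": 2.0}

def sentiment_score(text):
    # lowercase the text (the lexicon is lowercase) and strip punctuation
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    matched = [lexicon[w] for w in words if w in lexicon]
    if not matched:
        return 0.0
    # divide by the number of matched words so long texts don't dilute the score
    return sum(matched) / len(matched)

print(sentiment_score("A poor film. Terrible acting. Good ending though."))  # -1.0
```

Here poor (-1), terrible (-3) and good (+1) are matched, summing to -3 over 3 matched words.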

A key message that Peter underlined was that there is no perfect method; each approach, simple or sophisticated, has advantages and weaknesses.

Peter provided a data set of movie reviews and Steam game reviews, and introduced key elements of Python to help us write our own code to calculate sentiment scores for these reviews.

In trying this, some of us found that the review text needed to be lowercased because the VADER lexicon of sentiment scores was lowercase.

The class had great fun trying out this simple example, and it was great to see more experienced members helping those less experienced with Python coding.

Peter shared his own code, where he explores additional ideas like calculating the sentiment sentence by sentence:

Machine Learning: A Sentiment Classifier

We didn't get time in the session to try the second approach of training a model with examples of positive and negative text.

Peter did discuss a simple approach: training a Naive Bayes classifier.

The Bayes theorem is often difficult to understand when coming across it for the first time, so Peter pointed to an easy explainer on youtube. Essentially, it provides a way of calculating the probability of something given something else has happened (which also has its own probability). For example, what's the probability that it is raining, given my head is wet? You'll hear the term conditional probability to describe this idea.

How is this relevant to our task of sentiment analysis?

Well, we're trying to work out the probability of a piece of text having positive or negative sentiment. That probability depends on the occurrence of words in that piece of text. And each of those words has a likelihood of being in positive or negative texts.

Have a look at this simplified example.

If we look at a training set of negative documents, we might find that the probability of the word poor occurring is 0.8. We'd establish this by counting occurrences of the word. The word poor might also occur in documents assigned a positive sentiment, but that's less likely; in this example it has an occurrence probability of 0.1.

Similarly probabilities for the word good and apple can be established. No surprise that the probability of good in positive texts is 0.7, and a low 0.1 in negative samples.

So how do we use this to help classify a previously unseen document as positive or negative?

Imagine a new previously unseen document only contains the word poor. What's the probability that it is a positive document? What's the probability it is a negative document? Intuitively we know the document is negative, and looking at the numbers the probability of poor being in the negative classification is much larger than the positive.

That's the intuition - and it's not so complicated.

The Bayes theorem just helps us calculate the actual probabilities. Why do we need to calculate them, surely our intuition is enough? Well, that word poor was likely from a negative document, but there's a small chance it could have been from a positive document. That's why we need to take more care over the competing probabilities.

Let's take the key formula and apply it here:

P(negative given poor) = P(poor given negative) * P(negative) / P(poor)

We want to work out the probability of the document being negative given that it has the one word poor. That's the left hand side of the equation. Let's look at the right hand side of the equation:
  • The probability of poor given the document is negative. That's what we know from the training data. That's 0.8
  • The probability of a document being negative. We've assumed a half-half split of positive and negative documents in the training data but this might not be the case. It may be that negative documents are just more likely to occur, just as many reviews tend to be negative because that's when people are motivated to write them. For now let's assume an equal split so this is 0.5
  • The probability of the word poor itself occurring at all, irrespective of positive or negative document, is something we have to find from the data set itself. If the word is rare this probability is low. In our example, the probability of poor is (0.8 + 0.1)/2 = 0.45.

That means the probability of the document being negative is 0.8 * 0.5 / 0.45 = 0.889.

Doing a similar calculation for the probability of the document being positive if the only word it contained is poor, we get 0.1 * 0.5 / 0.45 = 0.111.

So having a document with the one word poor, the probability that it is a negative sentiment document is far higher than it being a positive sentiment document.
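The worked example above is easy to reproduce as code, using the same probabilities:

```python
# Bayes theorem applied to a document containing only the word "poor".
p_poor_given_neg = 0.8   # from the negative training documents
p_poor_given_pos = 0.1   # from the positive training documents
p_neg = 0.5              # assumed half-half split of the training data
p_pos = 0.5

# probability of "poor" occurring at all, across both classes
p_poor = (p_poor_given_neg + p_poor_given_pos) / 2   # 0.45

p_neg_given_poor = p_poor_given_neg * p_neg / p_poor
p_pos_given_poor = p_poor_given_pos * p_pos / p_poor

print(round(p_neg_given_poor, 3))  # 0.889
print(round(p_pos_given_poor, 3))  # 0.111
```

Note the two results sum to 1, as they must: the document is either positive or negative.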

This all looks overly complicated, but we do need this machinery when our training data has an uneven number of positive and negative documents, and when we extend the idea from one word to many.

What we've done is classification. And in fact we can use this very same idea to classify documents against different kinds of categories - spam versus not spam being a common example.

You can see Peter's own code that uses the NLTK Naive Bayes Classifier:
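As a flavour of that approach (this is a sketch, not Peter's code), NLTK's Naive Bayes classifier trains on labelled feature dictionaries. The tiny training set here is made up for illustration:

```python
# A minimal sketch of training NLTK's Naive Bayes classifier
# on a tiny made-up training set of labelled texts.
import nltk

def features(text):
    # bag-of-words features: mark each lowercased word as present
    return {word: True for word in text.lower().split()}

train_set = [
    (features("poor film terrible acting"), "negative"),
    (features("bad plot poor effects"), "negative"),
    (features("great film good acting"), "positive"),
    (features("good story great effects"), "positive"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("poor terrible film")))  # negative
```

In practice the training set would be hundreds or thousands of labelled reviews, and the text would need tokenising more carefully than a simple split.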


What we've looked at are simple ideas that may not perform very well without further preparation and optimisation.

A key reason for this is that natural language is not consistent, precise, and unambiguous. Natural language has constructs like "not bad at all" where considering the individual words might suggest an overall negative sentiment. Sarcasm and humour have been particularly challenging for algorithms to accommodate.

Improvements can include using "not" to negate the sentiment of the subsequent word or few words. Another approach is to consider word pairs, known as bigrams, as pairs of words often encapsulate meaning better than the individual words.
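Both ideas are easy to sketch with a tiny hypothetical lexicon. Here bigrams are just pairs of adjacent words, and a word directly following "not" has its score flipped:

```python
# Illustrative only: a two-word lexicon and naive negation handling.
lexicon = {"bad": -2.0, "good": 1.0}

def bigrams(words):
    # adjacent word pairs: ["not", "bad", "at", "all"] -> [("not", "bad"), ...]
    return list(zip(words, words[1:]))

def score(text):
    words = text.lower().split()
    total = 0.0
    skip = False
    for i, word in enumerate(words):
        if skip:
            skip = False
            continue
        if word == "not" and i + 1 < len(words) and words[i + 1] in lexicon:
            total -= lexicon[words[i + 1]]   # negation flips the next word's score
            skip = True
        else:
            total += lexicon.get(word, 0.0)
    return total

print(bigrams("not bad at all".split()))  # [('not', 'bad'), ('bad', 'at'), ('at', 'all')]
print(score("not bad at all"))            # 2.0
```

Simply flipping the sign is itself an oversimplification, which leads to the asymmetry issue below.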

Peter raised the issue of asymmetry in the lexical approach. The strength of "not bad" is not equal but opposite to "bad", and the same for "good" versus "not good".

In terms of assessing and comparing the performance of classifiers, Peter touched on the issue of precision, recall, and the F1 measure that combines them.
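These measures are simple ratios over the classifier's correct and incorrect predictions. A small worked example, with made-up counts:

```python
# Precision, recall and F1 from counts of true positives (tp),
# false positives (fp) and false negatives (fn).
def f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of the items we labelled positive, how many were?
    recall = tp / (tp + fn)      # of the truly positive items, how many did we find?
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(round(f1(8, 2, 4), 3))  # 0.727
```

F1 is the harmonic mean of precision and recall, so a classifier only scores well if it is strong on both.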


Peter succeeded in framing the complex challenge of natural language, introducing two simple methods that represent two different and important approaches in the world of text analysis, and also providing an opportunity for hands-on learning with supportive friends.

Further Reading

Friday, December 14, 2018

How To Design For Big Data

This month's meetup was on the topic of designing technology services for growing data.

We also had short talks on making home sensor data open for exploration, and proposing that AI improves game narratives.

A video of the talks is on the group youtube channel: [link]

Slides for Designing For Big Data are here: [link].

Slides for the Smartline project are here: [link].

Slides for the AI Game Narrative talk are here: [link].

Short Talks

Ian Mason from Exeter University explained the Smartline project, for-social-good research exploring sensor data from homes in Cornwall, and extended an invitation to participate in data mining hackathons.

The project is using data from sensors installed in selected Cornwall homes, and is investigating how that data can be used to improve the lives of residents, focussing on health and happiness. Examples include predicting house infrastructure faults, and monitoring energy consumption and cost.

The project is looking to open up the data for the wider community of data scientists and businesses for open-ended research to provide additional value. This work could lead to further funded research or the development of business products.

The next steps are a formal launch of the data [event link], and subsequent workshops organised by Data Science Cornwall.

John Moore's talk proposed that, although games have drastically improved visual detail, rich interactivity, audio effects and overall immersion, the core narratives and storylines have not improved significantly. He suggested the development of AI techniques to automate the development of such narratives and plot-lines. This is a unique idea, and worth exploring. A suggestion from the audience was to explore existing efforts for generating works of fiction as a similar challenge.

The Data Success Problem

A successful product will often see a growth in the data it stores and processes.

The downside of this success is that:
  • data storage can grow beyond the limits of a single traditional database, and 
  • data processing can outgrow the processing and memory limits of a single computer.

Rob Harrison is experienced in designing, and remedying, architectures which support larger and growing data.

Scaling Basics

In the past, a common approach to dealing with growing data and processing needs was to grow your computer - faster processor, larger storage and bigger memory - vertical scaling.

This can work up to a point, but very quickly the cost of large computers escalates, and they remain a single point of failure. Ultimately you won't be able to buy a big enough machine to match your data growth.

A better approach is to use multiple but relatively small compute and storage nodes. Collectively the solution can have a larger storage, processing or memory capacity, enough to meet your needs. Benefits of this horizontal scaling approach are:
  • cost - each unit is relatively cheap, and collectively cheaper than an equivalent single large machine.
  • resilience - if designed correctly, failure of some nodes isn't disastrous, and can often appear to end users as if there was no problem.
  • parallelism - lots of compute nodes give the opportunity to process data in parallel, improving performance for amenable tasks.
  • growth - the ability to incrementally add more (and just enough) nodes to meet growing demand.

This fundamental shift from vertical scaling to horizontal scaling has driven a wide range of changes in software and infrastructure architecture over recent decades, from the emergence of multi-core computing to distributed databases.

Assessing Products & Solutions

Rob presented a framework for assessing products and solution architectures. The key points can also be considered design features and principles for designing your own solutions where proportionate to your needs.

He noted that a good design doesn't limit storage - that would not be useful if your data grows. Furthermore, it should scale not just data reads but also writes. The latter is a tougher requirement and some products don't do this well.

Good architectures should be resilient to failure of individual components, and this is typically achieved through redundancy. Related to this is the ability to failover automatically. Too often, products remain off-line for hours as they fail over, which doesn't match modern customer expectations.

Rob made an interesting point about horizontally scaled solutions that don't work if there is any heterogeneity in technology versions across nodes. In some sectors it is a requirement to run several versions of a technology, so that a fault affecting one specific version doesn't cause total failure.

He also made a point about technology that is aware of its physical location, within a rack or a geographical zone, to better optimise data flows amongst nodes.

An important point made by Rob is that any solution shouldn't lock you into a single vendor's products, or indeed a single cloud. This requirement for sovereignty over your own data underlines the importance of open source technology and cloud vendor agnostic platforms, such as Docker.

State Then And Now

State is just a word that describes the information in, and configuration of, a system. Applied to technology products, the state often refers to user data. It is this data that grows as a service grows.

The complexity of state is the major driver of how complex a system is to manage and scale.

The standard architecture of many technology platforms is often described as in this diagram.

It shows a user device accessing a service over the internet, with requests being distributed to one of several web servers by a load balancer. We can add more web servers to meet demand if needed. So far that's fairly resilient and scalable.

The diagram shows all those web servers sharing a common database and perhaps a common file server. These are the single points of failure and potential bottlenecks.

Rob then shared a more complex architecture that more truthfully reflects the reality of today's services.

Often the application logic is not located in web/application servers, but in an app on the user's device. That device will have its own local data store. The device will connect to API servers over the internet, but that connection is rarely permanent, and sometimes poor. The server side datastore is often not a traditional relational database but a less-rigid data store that can store data structures that better match the application, JSON documents for example. The term NoSQL has emerged to describe a broad range of such data stores.

Many of these modern NoSQL data stores offer a range of choices for how they scale, allowing developers to trade a reduction in consistency for much easier scaling and resilience. Many applications don't need strict consistency, and can tolerate a write operation taking its time to asynchronously replicate to multiple nodes.

Rob's picture also refers to OS for object store. These store data in the form used by applications, rather than broken into fields and forced into tabular form only to be reconstituted when queried. He also refers to PNS for publish and subscribe - a pattern for publishing messages to be picked up by interested subscribers when they're ready. This loosely coupled approach works well for synchronising state across intermittent internet connections.

Another advantage of some modern NoSQL data stores is their flexibility with data schemas. That is, they don't firmly insist that stored data all conforms to the exact same schema. This makes the iterative, agile development of applications much easier. An up-front one-shot data design is unlikely to meet future needs as a product evolves.

Rob offered some wisdom in his analysis of the emergence of modern NoSQL data stores:
  • Traditional relational databases emerged from a time when storage was expensive, and so huge effort was put into normalising data to reduce duplication. Today storage is cheap, and the cost of normalising and denormalising, and the risks of a fixed schema, are no longer tolerable. 
  • Today's data stores are better matched to the needs of application developers. They are able to store objects in formats closer to those in the application, such as key-value pairs and JSON objects, binary objects and even graphs of linked entities. Flexible schemas support the reality of applications that iterate and evolve.
  • Today's data stores aim to meet internet-scale demand and the availability expectations of empowered customers. 

What's So Wrong With Traditional Databases?

Rob discussed the challenges of traditional databases throughout his talk so it is worth summarising the key points:

  • Relational Database Management Systems (RDBMS) were, and still are, the most common kind of database in use. Today's popular databases like PostgreSQL and MySQL, though extended with modern capabilities, have their roots in decades old design assumptions.
  • One assumption is that data storage is expensive, and that effort put into normalising data, linked by keys, is worth the effort of decomposition and reconstitution. This assumption isn't particularly valid today.
  • Although the data model of fields in tables linked by keys can be useful, it can sometimes lead to catastrophic performance failures when the right indexing isn't anticipated or the joining of such fields becomes complex.
  • Many traditional relational databases aim to be ACID compliant, which can be simplified as being always correct at the expense of latency. This can lead to performance bottlenecks as an entire table or even database is locked during an update. Many of today's applications don't need this level of transactional consistency, and many only require eventual consistency.
  • Traditional relational databases weren't initially designed to be horizontally scaled across nodes. This means that attempts to make them work this way are more of a retro-fit than an engineered solution. Issues and challenges include locking across nodes and inability to handle data inconsistencies caused by inevitable network failure, exacerbated when the links are across distant geographies. Aside from infrastructure, the challenges continue at the logical level, for example, keeping universally unique identifiers for database keys consistent is a challenge across multiple instances of a traditional database, particularly when rebuilding a failed database. 

A standard approach to scaling relational databases is sharding - splitting your queries amongst your nodes. The following picture shows the simple splitting of database queries so those related to user names beginning with A-D are processed by the first shard, and those with U-Z are processed by the last shard.

Although this seems like a fine idea, it fails when a shard can no longer meet demand and further scaling is needed. Unique database keys make migrating data to a new architecture very difficult without rebuilding all the data. Another problem with sharding is that the demand profile can change, leading to over-used and under-used shards. Again, rebalancing the queries requires a database rebuild, as we can't simply shift data from one shard to another.
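The range-based scheme in the picture might be sketched like this; the shard ranges here are illustrative:

```python
# Range-based sharding on the first letter of a username (illustrative ranges).
SHARD_RANGES = [("A", "D", 0), ("E", "J", 1), ("K", "T", 2), ("U", "Z", 3)]

def shard_for(username):
    first = username[0].upper()
    for lo, hi, shard in SHARD_RANGES:
        if lo <= first <= hi:
            return shard
    raise ValueError("no shard for " + username)

print(shard_for("alice"))  # 0
print(shard_for("zoe"))    # 3

# The catch: changing these ranges to rebalance load means physically
# migrating rows between shards, hence the rebuild problem.
```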

In summary, traditional relational databases were fine for the time they were initially designed. They are useful today, but we now have more choices for data storage and processing that better match modern internet-scale design goals and user expectations.

Modern Tech Stack

Rob then talked us through the technology stack from low-level memory up to file servers, comparing historic approaches and modern technologies.

Memory Is Faster Than Disk

An early method of accelerating traditional web applications was the use of memory-based caches. That is, temporary in-memory stores that very rapidly served responses to queries that had been seen before, and for which the response had been calculated once before.

Although the original setting for these caches was web queries and responses, the idea easily generalises to other kinds of queries and responses, including data store queries and responses.

Today these memory-based caches can coordinate across multiple nodes, allowing the scaling of the cache beyond the limits of a single machine.

Memcached and Redis are two popular choices for a memory-based cache for data stores, with Memcached being a better choice for simpler and smaller data structures, and Redis being better for more sophisticated data structures.

Such memory-based caches should be considered as accelerators reducing the load on data stores, and not as key methods for scaling the stores themselves.
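The usual pattern is cache-aside: check the fast cache first, and only fall back to the slow data store on a miss. A minimal sketch, with a plain dict standing in for Memcached or Redis:

```python
# Cache-aside in miniature: a plain dict stands in for memcached/redis.
cache = {}

def slow_query(key):
    # stand-in for an expensive database query
    return "value-for-" + key

def get(key):
    if key in cache:
        return cache[key]       # hit: the data store is never touched
    value = slow_query(key)     # miss: fall back to the data store
    cache[key] = value          # populate the cache for next time
    return value

print(get("user:42"))  # miss, queries the store
print(get("user:42"))  # hit, served from memory
```

A real cache would also need expiry and invalidation, which is where most of the difficulty lies.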


Rob proceeded to discuss examples of so-called NoSQL databases, built to different design goals from traditional relational databases, and which aim to be much more easily scalable and flexible for application developers.

To explain the terminology: SQL is the query language most commonly used with relational databases, and it was developed in the early 1970s! Modern data stores initially didn't use this language, so they became termed NoSQL, but today some do offer the ability to query them using SQL-like queries, and so NoSQL can also mean "not only SQL".

Rob explained how Google was an early leader with BigTable, which can be thought of as a two-dimensional key-value store, with one key being a row identifier and the other a column identifier. The contents of the value don't need to conform to a schema. Google describes its public BigTable service as suitable for petabyte scale data whilst still providing a performant service (sub-10ms latency). A number of Google's own services are thought to run over their own BigTable implementation.

Today BigTable supports a wide range of businesses, processing a wide range of data from financial time series data to user profile data.

Facebook open-sourced its implementation of a similar wide-column database, Apache Cassandra. Because it is open source, we can see how it works. Key design features include:
  • Distributed across nodes, where every node has the same role, if not the same data. There is no single point of failure, and no concept of a master or slave hierarchy.
  • Aim to scale both read and write fairly linearly as nodes are added.
  • Fault-tolerant to node failure, and designed for nodes to be replaced with no service downtime.
  • Tuneable consistency, from "issue write and don't worry when it completes" to "block everything until data written to at least 3 nodes, and they all confirm it".

These are typical objectives of several modern NoSQL data stores, and the contrast with traditional databases is stark!

Cassandra has been used successfully by organisations such as CERN, Apple, and Netflix.

Rob also mentioned Couchbase, which is often used to store JSON objects, ideal for application developers and supporting REST APIs. The following chart shows its write performance compared to other NoSQL databases. Couchbase remains performant at 20,000 writes per second compared to MongoDB which degrades at 6,000 on comparable systems.

You can read more about how Couchbase on Google Cloud infrastructure reached 1 million writes per second, of over 3 billion items, using just 50 nodes [link].

File / Object Storage

Sometimes our data just isn't particularly structured and would normally be stored on a file server.

Modern options for storing arbitrary files or objects include the popular Amazon S3 service. Although not particularly fast, such services are very cheap and very resilient, offering availability of 99.9%, which means it should not be down for more than 43 minutes per month.

After Amazon's lead, other providers have offered competing object storage. Almost all of them are designed to be accessed programmatically, offering easy to use REST APIs. Some even offer APIs compatible with Amazon's S3 to ease development, migration and multi-vendor or hybrid-cloud architectures.

The term object here just means an arbitrary bunch of data, like a file, rather than a structured object used by application code.

An illustrative use for such low-cost, if not very fast, object stores is for user generated photos.

Distributed Computation

Distributing the storage and retrieval of data from one node to many nodes can improve performance and resilience. The same can also work with computation.

Instead of a single node performing calculations on data, the computation can be spread over many nodes. Many, but not all, tasks are amenable to being split into smaller parallel tasks. The performance benefit of performing these tasks in parallel is very attractive.

A good illustration of the idea is the task of counting the number of words in a book. One node can count the words from the first to last page. Alternatively, many nodes could be working on a chapter each, working in parallel. If those nodes weren't sufficiently performant, those chapters could be further divided into paragraphs distributed to further nodes to count.
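The chapter-counting idea maps directly onto a minimal map-reduce, sketched here over a made-up "book":

```python
# Word counting as map-reduce: map each chapter to a partial count,
# then reduce the partial counts to a total.
from functools import reduce

chapters = ["the quick brown fox", "jumps over", "the lazy dog"]

# map: each chapter (potentially on its own node) produces a partial count
partial_counts = [len(chapter.split()) for chapter in chapters]

# reduce: combine the partial counts into the final total
total = reduce(lambda a, b: a + b, partial_counts)

print(partial_counts)  # [4, 2, 3]
print(total)           # 9
```

The map step is embarrassingly parallel: no chapter's count depends on any other, so the work can be distributed freely.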

The most famous example of this approach is Google's MapReduce, an idea which has been implemented by others. Apache's Mahout is a notable example of numerical and machine learning algorithms implemented over a map-reduce framework.

Rob's discussion led to Hadoop, probably the most recognisable name amongst the new wave of data storage and compute technologies. Hadoop is not one, but a collection of technologies including a distributed filesystem HDFS, a wide-column store HBase, a MapReduce engine, and task managers.

Hadoop is often lauded as the solution to many data problems, but its main focus is distributed computation, and other technologies may be better choices if your task is storing and retrieving data but not distributed computation.

Functional Programming

Rob also shared his thoughts on application development. One theme was particularly interesting.

It is a reality that many programming languages don't protect against accidental changing of data beyond the intended scope. As application code and logic gets bigger and more complex, the risk of these unintended side-effects grows. Add parallelism to this mix and we have a new class of potential errors.

Functional programming languages impose a restriction on functions such that they can't have any side-effects, except those that are explicitly intended, and even these are tightly controlled. This is achieved not through complexity but through simplicity. The benefit is that sophisticated and complex applications can be composed of these basic functions, and we can verify that the fuller code also doesn't have unintended side-effects.

As a bonus, these functions are required to be independent in operation, which makes them very easily parallelisable. Functional programs naturally benefit from multi-core and multi-node hardware without additional effort on the part of the developer.
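The idea can be illustrated even in Python: a pure function depends only on its inputs and touches no shared state, so the calls below could run in any order, or in parallel, without changing the result.

```python
# A pure function: no globals read or written, result depends only on x.
def square(x):
    return x * x

results = list(map(square, [1, 2, 3, 4]))
print(results)  # [1, 4, 9, 16]

# Because square is pure, this map could be swapped for a process
# pool's parallel map with no change in meaning.
```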

The last 5 years have seen a growth in demand for functional programmers in some sectors, because large complex distributed applications are easier and safer to build in functional languages.


Rob's talk was highly anticipated and well received. Many members commented that his talk had forced them to rethink their assumptions and designs, or broadened their options for future projects.

My own summary of Rob's message is that today's data technologies:
  • aim to support internet-scale services, and the high expectations of modern users.
  • are designed from the ground-up to scale horizontally.
  • allow application developers to choose their own balance between performance, consistency and availability. 

More Reading

Saturday, September 29, 2018

Python for Data Science - Top 10 Tools

We held the first Data Science Cornwall meetup this week.

The main talk was an introductory overview of the main python tools and libraries used in data science.

The slides are at:

Code and data is at:

Python for Data Science

Python has become the leading language for data science. The vibrant python ecosystem includes tools and libraries that have become very popular for data science.

Some of these have become, in effect, part of the de facto python data science stack.

Why python? Python was not designed as a numerical computing language like Fortran, R, SAS or the more recent Julia.

Python was designed to be easy to learn, and to be applicable to a wide variety of tasks. The low barrier to entry has led to wide adoption. Python is now very popular, being used to support global scale infrastructures as well as programming small low cost educational hardware such as the bbc:microbit. Today, python is being used by primary school children and university students alike, and being learned by artists and engineers, further guaranteeing a healthy future.

Python itself isn't very competitive for numerical computing performance, but performant code can be wrapped and used from python as a supporting library. This gives users the best of both worlds - the ease of use of python, and the performance of optimised low level code.

Anaconda Python

The standard python distribution includes a fairly rich set of support libraries, supporting a broad range of tasks such as downloading data from the web, concurrent programming, and working with XML. But it doesn't contain many of the key libraries that are popular for data science today.

As a result, many data scientists use distributions that have these data science libraries included, or downloadable from repositories for easy inclusion. A key benefit of these distributions is that the libraries are tested to work together.

A leading distribution for data science is Anaconda python. Versions for Linux, Windows and MacOS can be downloaded from here.

Working with Notebooks

Traditionally computer programs have been written as text files, edited using text editors or IDEs, and run using command shells.

An alternative approach has become popular, particularly in data science, and that is an interactive, web-based notebook, capable of displaying richer media than a simple command terminal.

The web based notebook approach is far more familiar and user-friendly than a text editor and command shell, and the ability to output formatted data and graphics makes it suitable for more tasks without the need for additional software.

Being web-based they are easy to share and use, with viewing possible on any device with a browser, including tablets and smartphones.

Github will render any jupyter notebook that you upload to it. The example code for this talk is provided in the form of notebooks.

A simple notebook showing very simple python code as well as markdown text, useful for commenting and documentation, is on github:


We've already mentioned that python on its own isn't ideal for working with large arrays of data. In fact python itself doesn't have a built-in array data structure - it only has lists. Of course, 2-dimensional arrays can be made out of lists of lists.

The main issue is that each item in a python list is a high-level python object, so working with large numbers of them isn't very efficient, and can lead to performance problems well before languages designed for working with data would struggle.

The ubiquitous numpy library provides an array data structure, ndarray, which stores data in a simpler form, and manipulates it more directly in memory. In particular, whole-array operations become much faster without the overhead of working with high-level objects.

The example code illustrates how numpy operations are much faster than python list operations:
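A rough sketch of that kind of comparison (the operation and sizes here are invented for illustration, and timings will vary by machine):

```python
import time
import numpy as np

n = 1_000_000
python_list = list(range(n))
numpy_array = np.arange(n)

# whole-array addition with a python list needs an explicit loop
start = time.perf_counter()
list_result = [x + 1 for x in python_list]
list_time = time.perf_counter() - start

# numpy performs the same operation in optimised low-level code
start = time.perf_counter()
array_result = numpy_array + 1
array_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s  numpy: {array_time:.4f}s")
```

On most machines the numpy version is many times faster, because the loop happens in compiled code rather than over high-level python objects.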

Numpy arrays are so common they are effectively a core python data type, and efforts are underway to add them to the core.

Numpy maintains a small scope, and allows other libraries to build further functionality on the basic array type.

A leading extension of the numpy array is the pandas dataframe. Pandas offers convenience extensions to the basic array: nicely formatted output of data tables, named columns and rows, slicing and pivoting, filtering, simple operations over the data, useful import/export, and simple but effective plotting functions.

The example code illustrates how easy it is to load data from a csv file, output it in pleasant format, filter by column name and or a numerical threshold, and trivially easy graphing:
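A minimal sketch of that workflow (the column names and values here are invented, and a small embedded csv stands in for a file on disk):

```python
import io
import pandas as pd

# stand-in for pd.read_csv("data.csv") - a small csv embedded as text
csv_text = """name,score
alice,81
bob,47
carol,92
"""
df = pd.read_csv(io.StringIO(csv_text))

# filter by a numerical threshold on a named column
high_scores = df[df["score"] > 50]

print(high_scores)
# plotting is one call away, e.g. df.plot(x="name", y="score", kind="bar")
```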

A key library for working with data is scipy, which provides a wide range of scientific and mathematical functions - linear algebra, image and signal processing, spatial analysis and statistics, for example.

The example code demonstrates using scipy to perform a Fourier analysis of a signal to find the component frequencies within that data:
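A sketch of the idea, using an invented signal with two known frequency components so the peak in the spectrum can be checked:

```python
import numpy as np
from scipy.fft import fft, fftfreq

# a signal sampled at 100 Hz containing a 5 Hz and a weaker 12 Hz component
sample_rate = 100
t = np.arange(0, 2, 1 / sample_rate)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# the fourier transform turns the signal into a spectrum -
# peaks in the spectrum reveal the component frequencies
spectrum = np.abs(fft(signal))
freqs = fftfreq(len(signal), 1 / sample_rate)

# keep only positive frequencies and find the strongest peak
positive = freqs > 0
peak_freq = freqs[positive][np.argmax(spectrum[positive])]
print(peak_freq)  # the dominant 5 Hz component
```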

Text Analysis

Python has long been used for text processing, and more recently this has extended to modern text mining and natural language processing.

The NLTK natural language toolkit has been around a long time, and as such is well known, with plenty of online tutorials and examples based on it. The toolkit was designed as an education and research tool, and its age means it is relatively proven and mature. It provides common natural language functions like tokenisation, stemming, part of speech tagging, basic classification, and a set of example text datasets (corpora).

The example code demonstrates using nltk to break some sample text into words (tokenisation), and labelling them as parts of speech (verbs, nouns, etc), and extracting entities such as names and places:

With nltk being focussed on research and education, the more modern spaCy library is becoming the choice for building products and services. Its main feature is speed. It also supports several languages and provides pre-trained word vectors to support more modern methods for working with natural language. The organisation behind spaCy invests in performance and accuracy benchmarking and tuning.

The example code shows spaCy used to process a segment of Shakespeare, extracting people entities, and a simple demonstration of text similarity:

Natural language text analysis has been in the news recently with notable advancements such as the oft-quoted king - man + woman = queen example. A key innovation behind these advancements is word vectors. These are numerical vectors, with multiple but relatively few dimensions, which are similar for words that are semantically similar. This allows clustering of similar texts, and also enables a kind of algebra for text. Another notable example illustrates the biases and faults in our own culture, expressed in the text we produce and learned by algorithms: doctor - man + woman = nurse.
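The vector algebra can be sketched with toy vectors. These three-dimensional vectors are invented purely for illustration - real word vectors have hundreds of dimensions learned from large corpora:

```python
import numpy as np

# hypothetical 3-d word vectors: dimensions loosely encode
# (royalty, maleness, femaleness)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    # similarity of direction, ignoring vector length
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```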

Gensim has become the leading library for providing and working with models based on word vectors. The example code demonstrates using word vectors learned from a snapshot of wikipedia 2014 and a large set of mostly news feed data, used to perform these concept calculations:

Machine Learning

Machine learning is at the heart of recent advances in artificial intelligence. Improvements to both models and how they are trained have let us solve problems that previously weren't feasible.

The scikit-learn library can almost be considered python's native machine learning library: it has grown within the community, and is now well established and actively developed.

The example code demonstrates how it can be used to perform clustering on data - finding groups within data:
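A minimal sketch of clustering with scikit-learn, using invented 2-d points with two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious groups of 2-d points (invented data)
points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # group around (1, 1)
    [8.0, 8.1], [7.9, 8.3], [8.2, 7.8],   # group around (8, 8)
])

# ask k-means to find 2 clusters in the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the discovered group centres
```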

The ecosystem for machine learning is energetic, with both new and established tools. Given the commercial interest, companies like Facebook, Amazon and Microsoft are all keen to promote their own tools.

Tensorflow from Google can't be ignored. It has huge resources behind it, it is developing rapidly, and has a growing body of tutorials around it. The javascript version is also notable, tensorflow.js.

I suggested that users take a moment to think about the risks, if any, of lock-in to a product that is very much managed and developed by Google on their own terms.

I encouraged members to consider PyTorch. Pytorch is notable because of its design:

  • it is pythonic, in contrast to some frameworks which are clearly very thin wrappers around C/C++ or even Fortran code
  • it aims to make the design of machine learning models easy and dynamic - and was innovative in automatically calculating error gradients for you
  • trivially simple use of GPU hardware acceleration - historically, coding for GPUs has been painful

The following illustrates how easy it is to switch from using the CPU to the GPU.
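A sketch of the pattern (this version falls back to the CPU when no CUDA device is present, so it runs anywhere):

```python
import torch

# choose the device once - the rest of the code is unchanged either way
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# moving a tensor (or a whole model) to the device is a single call
x = torch.rand(3, 3).to(device)
y = torch.rand(3, 3).to(device)

z = x @ y   # computed on the GPU if one is available
print(z.device)
```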

GPU acceleration has been instrumental in the advancement of modern machine learning, and specifically deep learning, which is not a new concept but has been made practical by the increasing accessibility of hardware acceleration.

Sadly, the machine learning ecosystem is dominated by Nvidia and its CUDA framework. AMD's efforts haven't become popular, and vendor independent frameworks like OpenCL haven't taken hold.

You can find an introduction to using PyTorch here. The tutorial implements a neural network to learn to classify the MNIST handwritten digits, a canonical machine learning challenge. Initial results suggested plain simple python performed twice as fast as the CUDA accelerated pytorch code.

Although this looks disheartening, the results were shown to make a point. The following graph shows performance as the neural network grows deeper.

The lesson here is that for very small tasks, the pytorch code, and likely many other high-level frameworks, will be slower than the simplest python code. However, as the data or complexity of the models grows, the hardware acceleration becomes very beneficial.


Visualising Data

There are many choices available for visualising data and results with python. However a few key libraries stand out.

Matplotlib is the elder statesman of python graphing. It has been around a long time, and is very flexible. You can do almost anything you want with it. Have a look at the gallery for an overview of its broad capability - from simple charts to vector fields, from 2-d bitmaps to contour plots, from 3d-landscapes to multi-plots.

The demonstration code illustrates matplotlib basics - line plot, x-y plot, histogram, 2-d bitmap plot, contour plot, plotting a 3-d function, and even a fun xkcd-style plot! The code also demonstrates how pandas dataframes also provide convenient plotting functions, which also happen to use matplotlib.
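A small sketch of the basics (the data is invented; the Agg backend is selected so this runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))          # simple line plot
ax1.set_title("sin(x)")
ax2.hist(np.random.default_rng(0).normal(size=500), bins=20)  # histogram
ax2.set_title("histogram")

fig.savefig("plots.png")
```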

Sadly matplotlib's syntax isn't the simplest, and we often have to keep a reference close when using it. Several other frameworks are built on matplotlib to provide nicer defaults and simpler syntax.

Notable alternates to matplotlib include:

  • ggplot - which implements the "grammar of graphics", an attempt to design a language for describing plots and charts. The plotting system of R is based on this same design.
  • bokeh - draws plot using browser canvas elements, making for smoother visuals, but also enables the possibility of easy interactivity and dynamic manipulation.
  • plotly - a popular online service, with investment in documentation and quality, but some services cost, and the risk of currently free services being charged in future needs to be considered.

Finally, we looked at visualising networks of linked data. This area isn't very mature, in that there is no single well-established library that most data scientists rely on.

Networkx is a library for manipulating graph data, and comes as part of the Anaconda distribution. It has a very simple capability to plot static plots of graphs.
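A minimal sketch of building a graph with networkx (the nodes and edges are invented):

```python
import networkx as nx

# a small graph built by hand
g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

print(g.number_of_nodes(), g.number_of_edges())

# a static plot is then a single call (requires matplotlib):
# nx.draw(g, with_labels=True)
```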

For dynamic exploration of graph data, which is often dense and requires interactivity to explore, a common solution is to use the javascript d3.js library. This isn't a python library, but data scientists often find themselves having to code their own wrapper around it to use from python.

The example code illustrates a simple networkx visualisation, as well as an example of data from a pandas dataframe being used to populate a d3.js force-directed graph - which is also interactive. The github rendering doesn't show the interactive plot so you'll have to run the code yourself.

The code for this implementation was developed by myself, based heavily on the work of Mike Bostock, but made to work in a jupyter notebook. The original code was designed to visualise clusters of related text documents. For further explanation see the blog.

Conclusion & References

The talk seemed to be well received, but more importantly, a community has started to form, and the great discussions around the talk bode well for an active future.

A member suggested we look at Google's Colab, a system which includes a version of python notebooks which can be shared and remain editable by multiple users, much like documents in Google Drive:

Friday, September 28, 2018


We held the first Data Science Cornwall meetup yesterday.


I'd like to thank the great tech community in Cornwall for the support and encouragement in making this happen, especially Brian at Falmouth University, Toby at Headforwards, Garry at Hertzian and Software Cornwall too.

Mission & Values

The aim of Data Science Cornwall is primarily to develop a community around data science in Cornwall. That means providing opportunities for people to connect, share knowledge, learn from each other, be inspired and find support. That community is for individuals, companies, newcomers to data science, beginners and experienced practitioners.

We aim to be an open, relaxed community - which means we'll be neutral from any company affiliation, essential for building trust and sharing knowledge.

We'll work primarily with open source technology, because we value open inclusive access to tools, and transparency into how they work.

We will retain a focus on being beginner-friendly. Even our advanced sessions will have something for beginners to benefit from.

We must be open and inclusive - so please do let us know if something is a barrier that we're not conscious of.

What We'll Do

We'll mostly have talks aiming to share knowledge or experience. We'll encourage members of our community to do these, and not just have travelling serial speakers.

We'll also run hands-on practical tutorials.

There is interest in hackathons so we'll organise these too.

We'll always build in time to talk and network.

Feedback Is Important

It is important that the community says what it likes and dislikes, what its needs are, and what it wants to see more and less of. That's the only way we can organise future meetups that meet our community's needs.