Friday, February 1, 2019

Sentiment Analysis - A Hands-On Tutorial With Python

This month we had a hands-on tutorial taking us through simple sentiment analysis of natural language text.


The slides are at: [PDF]

Code and data are at: [github]


Natural Language and Sentiment Analysis

Natural language is everywhere - from legal documents to tweets, from corporate emails to historic literature, from customer discussions to public inquiry reports. The ability to automatically extract insight from text is a powerful one.

The challenge is that human language is hard to compute with. It was never designed to be consistent, precise and unambiguous - in fact, that is its beauty!

In the broad disciplines of natural language processing and text mining, sentiment analysis stands out as particularly common and useful to many organisations. Sentiment analysis aims to work out whether a piece of text is being positive or negative about the subject of discussion.

Sentiment analysis can result in a simple number, or an even simpler positive / negative label. Even this simplicity can be really useful, providing insight into large or rapidly emerging bodies of text where it would not be feasible to read and assess the text manually.


We were lucky to have Peter give us an overview of sentiment analysis and lead a hands-on tutorial using Python's venerable NLTK toolkit.


Two Approaches

Approaches to sentiment analysis roughly fall into two categories:
  • Lexical - using prior knowledge about specific words to establish whether a piece of text has positive or negative sentiment.
  • Machine Learning - training a model using examples of positive and negative texts. Often that model is probabilistic, that is, it learns the probability of positive or negative sentiment based on the combination of words present in the text.

Peter created two simplified tasks for each of these approaches.


Lexical Approach

A very simple lexical approach is to have a set of words which we know contribute a negative or positive sentiment.

This picture shows just five words.


The word poor indicates a negative sentiment. The word bad indicates a stronger negative sentiment. The word terrible indicates a really negative sentiment. The scores associated with these words reflect how strong that negative sentiment is.

Similarly, the word good suggests a positive sentiment, and the word great suggests stronger positive sentiment. The scores reflect this too.

This is just a very small sample of scored words, but researchers have created longer, more comprehensive, lists of such words. A good example is the VADER project's list of words and their contribution to sentiment: vader_lexicon.txt.

A particularly simple way of using these scored words is to add up the scores of the words we find in the text being analysed.


You can see in this very short film review we've found the words poor, terrible and good. Adding up the scores for those gives us a total of -5. The positive sentiment of good wasn't enough to outweigh the very negative sentiment from the first sentence.

This is a simple approach which serves to illustrate the lexical method for sentiment analysis.

You can see that in practice, if we want to compare scores across reviews, we'd need to adjust the scores so that very long sentences or texts aren't unfairly advantaged over shorter ones. A good way to do this is to divide the total by the number of words in the text snippet. Better still, divide by the number of words actually matched and scored, otherwise long passages of unscored words dilute the sentiment score.

In our simple example, that score would be -5 / 3 = -1.67. The negative result indicates an overall negative sentiment.
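The add-up-and-normalise scheme above can be sketched in a few lines of Python. The word scores below are illustrative stand-ins (the actual values are on the slides), chosen so the worked example gives the same -5 total:

```python
# Illustrative word scores; the real values come from the slides / VADER.
LEXICON = {"poor": -2, "bad": -3, "terrible": -4, "good": 1, "great": 2}

def sentiment_score(text):
    """Sum lexicon scores for matched words, then normalise by match count."""
    words = text.lower().split()
    matched = [LEXICON[w] for w in words if w in LEXICON]
    total = sum(matched)
    average = total / len(matched) if matched else 0.0
    return total, average

total, average = sentiment_score("Poor plot and terrible effects, but good music.")
# total = -5, average = -5 / 3 ≈ -1.67
```

Dividing by the number of matched words (rather than all words) is what keeps long, mostly neutral reviews from drifting towards zero.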

A key message that Peter underlined was that there is no perfect method: each approach, simple or sophisticated, has its own advantages and weaknesses.

Peter provided a data set of movie reviews and Steam game reviews, and introduced key elements of Python to help us write our own code to calculate sentiment scores for these reviews.

In trying this, some of us found that the review text needed to be lowercased, because the VADER lexicon of sentiment scores is lowercase.


The class had great fun trying out this simple example, and it was great to see more experienced members helping those less experienced with Python coding.

You can see Peter's own code, where he explores additional ideas like calculating the sentiment sentence by sentence:



Machine Learning a Sentiment Classifier

We didn't get time in the session to try the second approach of training a model with examples of positive and negative text.

Peter did discuss a simple approach: training a Naive Bayes classifier.

Bayes' theorem is often difficult to understand when coming across it for the first time, so Peter pointed to an easy explainer on YouTube. Essentially, it provides a way of calculating the probability of something, given that something else (which has its own probability) has happened. For example, what's the probability that it is raining, given that my head is wet? You'll hear the term conditional probability used to describe this idea.

How is this relevant to our task of sentiment analysis?

Well, we're trying to work out the probability of a piece of text having positive or negative sentiment. That probability depends on the occurrence of words in that piece of text. And each of those words has a likelihood of being in positive or negative texts.

Have a look at this simplified example.


If we look at a training set of negative documents, we might find that the probability of the word poor occurring is 0.8. We'd establish this by counting occurrences of the word. The word poor might also occur in documents which are assigned a positive sentiment, but that's less likely; in this example it has an occurrence probability of 0.1.

Similarly, probabilities for the words good and apple can be established. No surprise that the probability of good in positive texts is 0.7, and a low 0.1 in negative samples.

So how do we use this to help classify a previously unseen document as positive or negative?

Imagine a new, previously unseen document that contains only the word poor. What's the probability that it is a positive document? What's the probability it is a negative document? Intuitively we know the document is negative, and looking at the numbers, the probability of poor appearing in a negative document is much larger than in a positive one.

That's the intuition - and it's not so complicated.

Bayes' theorem just helps us calculate the actual probabilities. Why calculate them at all, if our intuition is enough? Well, that word poor most likely came from a negative document, but there's a small chance it came from a positive one. That's why we need to take care over the competing probabilities.


Let's take the key formula and apply it here:

P(negative given poor) = P(poor given negative) * P(negative) / P(poor)

We want to work out the probability of the document being negative given that it has the one word poor. That's the left hand side of the equation. Let's look at the right hand side of the equation:
  • The probability of poor given the document is negative. That's what we know from the training data. That's 0.8
  • The probability of a document being negative. We've assumed a half-half split of positive and negative documents in the training data but this might not be the case. It may be that negative documents are just more likely to occur, just as many reviews tend to be negative because that's when people are motivated to write them. For now let's assume an equal split so this is 0.5
  • The probability of the word poor itself occurring at all, irrespective of positive or negative document, is something we have to find from the data set itself. If the word is rare this probability is low. In our example, the probability of poor is (0.8 + 0.1)/2 = 0.45.

This means the probability of the document being negative is 0.8 * 0.5 / 0.45 = 0.889.

Doing a similar calculation for the probability of the document being positive if the only word it contains is poor, we get 0.1 * 0.5 / 0.45 = 0.111.

So having a document with the one word poor, the probability that it is a negative sentiment document is far higher than it being a positive sentiment document.
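Plugging the example numbers into Bayes' theorem is straightforward to sketch:

```python
# Word-occurrence probabilities from the worked example above.
p_word_given_class = {
    "negative": {"poor": 0.8, "good": 0.1},
    "positive": {"poor": 0.1, "good": 0.7},
}
p_class = {"negative": 0.5, "positive": 0.5}  # assumed even split of classes

def posterior(word, label):
    """P(label | word) via Bayes' theorem."""
    # P(word), marginalised over both classes
    p_word = sum(p_word_given_class[c][word] * p_class[c] for c in p_class)
    return p_word_given_class[label][word] * p_class[label] / p_word

# posterior("poor", "negative") ≈ 0.889
# posterior("poor", "positive") ≈ 0.111
```

Note the two posteriors sum to 1, as they must: the denominator P(word) is exactly what normalises the competing hypotheses.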

This all looks overly complicated, but we do need this machinery when our training data has an uneven number of positive and negative documents, and when we extend the idea from one word to many.

What we've done is classification. And in fact we can use this very same idea to classify documents against different kinds of categories - spam versus not spam being a common example.

You can see Peter's own code that uses the NLTK Naive Bayes Classifier:
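As a stand-in for the real thing, here is a minimal sketch of how NLTK's NaiveBayesClassifier is typically trained on bag-of-words features; the tiny training set is made up purely for illustration:

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Bag-of-words features: each word present maps to True."""
    return {word: True for word in text.lower().split()}

# A made-up, tiny training set, purely for illustration.
train = [
    (features("poor story and terrible acting"), "negative"),
    (features("bad plot with poor dialogue"), "negative"),
    (features("great film and good fun"), "positive"),
    (features("good story with a great cast"), "positive"),
]

classifier = NaiveBayesClassifier.train(train)
label = classifier.classify(features("poor film"))  # → "negative"
```

In practice you'd train on hundreds of labelled reviews, hold some back for testing, and inspect `classifier.show_most_informative_features()` to see which words the model leans on.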



Care!

What we've looked at are simple ideas that may not perform very well without further preparation and optimisation.

A key reason for this is that natural language is not consistent, precise, and unambiguous. Natural language has constructs like "not bad at all" where considering the individual words might suggest an overall negative sentiment. Sarcasm and humour have been particularly challenging for algorithms to accommodate.

Improvements can include using "not" to negate the sentiment of the subsequent word or few words. Another approach is to consider word pairs, known as bigrams, as pairs of words often encapsulate meaning better than the individual words.
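A crude version of the negation idea can be sketched as below, reusing illustrative word scores (an assumption, not the slide values) and flipping a word's score when the previous word is "not":

```python
# Illustrative word scores, as before.
LEXICON = {"poor": -2, "bad": -3, "terrible": -4, "good": 1, "great": 2}

def score_with_negation(text):
    """Sum word scores, flipping the sign of any word that follows "not"."""
    words = text.lower().split()
    total = 0
    for i, word in enumerate(words):
        if word in LEXICON:
            score = LEXICON[word]
            if i > 0 and words[i - 1] == "not":
                score = -score  # crude: treat "not X" as the opposite of X
            total += score
    return total

score_with_negation("not bad at all")  # → 3
```

This symmetric flip is deliberately naive: it assumes "not bad" is exactly as positive as "bad" is negative, which real language rarely supports.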

Peter raised the issue of asymmetry in the lexical approach. The strength of "not bad" is not equal and opposite to that of "bad", and the same goes for "good" versus "not good".

In terms of assessing and comparing the performance of classifiers, Peter touched on the issue of precision, recall, and the F1 measure that combines them.


Conclusion

Peter succeeded in framing the complex challenge of natural language, introducing two simple methods that represent two different and important approaches in the world of text analysis, and also providing an opportunity for hands-on learning with supportive friends.


Further Reading