Saturday, April 11, 2020

Visualising High-Dimensional Data With t-SNE

This post is a beginner-friendly introduction to t-SNE, a method for visualising high-dimensional data in a lower-dimensional space, such as a 2-d plot, whilst preserving clusters in that data.

There is an accompanying video on the group's YouTube channel:


Slides used for the video are here: [link].

There is also sample Python code below illustrating how to use t-SNE.


Problem - Visualising High Dimensional Data

Often we're working with data that has many dimensions. For example, data about people might include features such as age, height, weight, walking speed, resting heart rate, average blood sugar level, hours of sleep, and so on.  Each of these is a dimension of the data.

Visualising data is always an excellent idea. Visualisations can tell us very quickly the broad nature of the data, and also reveal patterns and relations in that data. Visualisations are a good start to data investigation, not just the end point of an investigation.

However, visualising data with more than 2 or 3 dimensions is not easy. Data with 2 dimensions can be shown on a flat chart, a scatter plot for example. Data with 3 dimensions can be shown on a flat paper or computer display by simulating a 3D space, but these can risk obscuring data or misrepresenting relations in that data. Visualising data with 4 dimensions is much harder to do well.

Imagine a chart showing 5 dimensions (age, height, weight, heart rate and hours of sleep) all at the same time. That would be difficult to do well.

It would help if we could somehow show high-dimensional data on a 2-d plot. Reducing data down to 2-dimensions inevitably means losing information. So this reduction will only be useful if it preserves a desired element of the original data.

One useful element worth preserving is the closeness of data, so we can see clusters of related data points on that 2-d plot.

This is what t-SNE does.

The t-SNE method was published in 2008 by Laurens van der Maaten and Geoffrey Hinton, the latter of neural networks fame.


How t-SNE Works

A simplified explanation of how t-SNE works is as follows:
  • distances between points in high-dimensional space are calculated
  • data points are placed randomly in a small-dimensional space

.. then the following steps are repeated for each point ..
  • each point is subject to attractive and repulsive forces with other points
  • if two points have a short distance (in the high-dimensional space) they attract
  • if two points have a large distance (in the high-dimensional space) they repel
  • each point is moved a small amount according to the balance of these forces

Note that the smaller dimensional space is often 2-dimensional to make plots easier, but it doesn't have to be.

After many iterations the points that were placed randomly in the small-dimensional space will have moved so that each one sits close to the points it is also close to in the high-dimensional space.
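The steps above can be sketched in a few lines of numpy. This is a toy illustration, not the sklearn implementation: it assumes a single fixed Gaussian width (sigma) instead of the perplexity-based calibration real t-SNE uses, and plain gradient descent without the usual momentum or early exaggeration. The function name toy_tsne and all its parameters are made up for this sketch.

```python
import numpy as np

def toy_tsne(X, n_iter=300, learning_rate=10.0, sigma=2.0, seed=0):
    """Toy t-SNE-style layout (assumption: fixed sigma, plain gradient descent)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Step 1: pairwise squared distances in the high-dimensional space,
    # turned into similarities P (large for close pairs, near zero for far pairs).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # Step 2: place the points randomly (and very close together) in 2-d.
    Y = rng.normal(scale=1e-4, size=(n, 2))

    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]            # y_i - y_j for every pair
        inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))  # Student-t kernel in 2-d
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                             # similarities in the 2-d space

        # P acts as the attractive part and Q as the repulsive part;
        # summing over j gives the net force on each point i.
        grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)

        # Move every point a small amount according to the balance of forces.
        Y -= learning_rate * grad
    return Y

# Two made-up clusters of 20 points each in 5 dimensions.
rng = np.random.default_rng(1)
cluster_a = rng.normal(0.0, 0.5, size=(20, 5))
cluster_b = rng.normal(0.0, 0.5, size=(20, 5)) + 5.0
X = np.vstack([cluster_a, cluster_b])

Y = toy_tsne(X)

# Compare the spread inside cluster A with the gap between the two cluster centres.
within = np.linalg.norm(Y[:20] - Y[:20].mean(axis=0), axis=1).mean()
between = np.linalg.norm(Y[:20].mean(axis=0) - Y[20:].mean(axis=0))
print(f"spread within cluster A: {within:.2f}, gap between clusters: {between:.2f}")
```

The learning rate and sigma here were chosen to suit this small made-up data set; other data would likely need different values, which is one reason to prefer a tested library for real work.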

Because this method starts with a random initialisation, the resulting arrangement of points in the smaller 2-d space will generally be different every time we run the process, unless the random seed is fixed.
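A small sketch of this effect, assuming sklearn's TSNE with init="random" so that the starting positions really are random. The data is made up with make_blobs, and the helper name embed is hypothetical. The random_state parameter sets the seed, and passing the same integer each time is intended to make a run repeatable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Made-up data: 90 points in 5 dimensions around 3 centres.
X, _ = make_blobs(n_samples=90, centers=3, n_features=5, random_state=0)

def embed(seed):
    """Run t-SNE from a random starting layout chosen by the given seed."""
    tsne = TSNE(n_components=2, init="random", perplexity=20, random_state=seed)
    return tsne.fit_transform(X)

layout_seed_1 = embed(1)
layout_seed_2 = embed(2)

# The clusters should still be there, but the coordinates are not the same.
layouts_differ = not np.allclose(layout_seed_1, layout_seed_2)
print("different seeds give different layouts:", layouts_differ)
```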


Python Examples

Here we'll work through four simple Python notebooks, with each illustrating a key element of the t-SNE process.


Example 1
The first notebook starts with simple 2-dimensional data which happens to be clustered around 3 points. The data isn't high-dimensional but we're starting with it just to keep things simple, and to see the effect of t-SNE on simple data so we get a good intuition for it.


We can see 3 fairly well spread clusters transformed into 3 very tight clusters. This shows the exaggerating effect of t-SNE, which brings close data points even closer, and pushes data points that aren't close even further apart.

The Python code is here, and you can copy it and experiment yourself:

The t-SNE tool from the popular sklearn library doesn't take much code to use at all. We create a TSNE object, specifying the number of dimensions it will reduce data down to. We then simply apply it to the data to be processed.
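As a minimal sketch of that pattern (not the notebook's own code), the following creates a TSNE object and applies it to stand-in data. The data is generated with make_blobs purely for illustration, and the variable names are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Stand-in data: 150 two-dimensional points scattered around 3 centres.
data, labels = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)

# Create the TSNE object, saying how many dimensions to reduce down to.
tsne = TSNE(n_components=2, random_state=42)

# Apply it to the data; the result has one row per point and 2 columns.
embedded = tsne.fit_transform(data)
print(data.shape, "->", embedded.shape)
```

The embedded array can then be passed to any plotting library as x and y coordinates.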


Example 2
In this example we look at data that is 3-dimensional. Again, this isn't particularly high-dimensional. The following shows a 3-d plot of this data.


You may be able to see clusters in this data, but you can imagine similar data where the clusters are not distinct at all.

The following shows 2-d slices of the 3-dimensional data. Again, clusters may be visible but they aren't clear.


If we try to count the clusters, many of us would say there are 6 clusters.

The following shows the 3-dimensional data transformed by the t-SNE process.


The t-SNE visualisation shows very clear clusters. What's important is that there are 7 clusters, not 6.

The previous visualisations were obscuring the 7 clusters, and only very carefully chosen angles on the 3-d plot would have shown them. The t-SNE process shows its strength in separating them out in a 2-dimensional space.

The Python notebook is online:



Example 3
The third example uses data with 64 dimensions, much higher than the 2 and 3 dimensional data we were looking at before. The data is actually a set of 8 pixel by 8 pixel images of digits. Our code selects only 500 from the 1500 digits to keep our visualisations clear.

The following shows a digit "0". We can think of each digit as a 64-dimensional data point, with each pixel value being one of the 64 dimensions.
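To make the idea of an image as a 64-dimensional point concrete, here is a small sketch. It assumes the digits come from sklearn's bundled load_digits data set, which is an assumption: the post does not name its source, and that set holds 1,797 images rather than the 1,500 mentioned above.

```python
from sklearn.datasets import load_digits

# Assumption: sklearn's bundled 8x8 digits data set stands in for the post's data.
digits = load_digits()

first_image = digits.images[0]   # the digit as an 8 by 8 grid of pixel values
first_point = digits.data[0]     # the same digit flattened into 64 values

print("image shape:", first_image.shape)
print("data point shape:", first_point.shape)
print("label:", digits.target[0])
```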


The following shows a 2-d plot of dimensions 2 and 3, and also a 3-d plot of dimensions 2, 3 and 4.


None of these plots shows any clusters in the high-dimensional data. Try adjusting the code yourself to see if different 2-d and 3-d slices of the 64 dimensions reveal a more insightful chart.

Processing the data with t-SNE to reduce it down to 2 dimensions results in the following plot.


There are distinct clusters visible, which really shows the power of t-SNE. I count 9 clusters.

Plotting the data points with colours showing the actual digits confirms that the t-SNE clusters are very much correct.
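A rough numeric counterpart to the coloured plot is to ask how often a point's nearest neighbour in the 2-d t-SNE layout carries the same digit label. The sketch below is not from the notebook. It assumes sklearn's load_digits data, a random subset of 500 images, and a nearest-neighbour check chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# Assumption: sklearn's bundled digits data, with 500 images picked at random.
digits = load_digits()
rng = np.random.default_rng(0)
subset = rng.choice(len(digits.data), size=500, replace=False)
X, y = digits.data[subset], digits.target[subset]

# Reduce the 64 dimensions down to 2.
embedded = TSNE(n_components=2, random_state=0).fit_transform(X)

# For each point, find its nearest other point in the 2-d layout.
nn = NearestNeighbors(n_neighbors=2).fit(embedded)
_, idx = nn.kneighbors(embedded)
nearest = idx[:, 1]   # column 0 is the point itself

# Fraction of points whose nearest neighbour shows the same digit.
agreement = np.mean(y[nearest] == y)
print(f"nearest neighbour has the same digit for {agreement:.0%} of points")
```

A high fraction suggests the layout groups matching digits together, though it says nothing about how far apart the groups are.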


However, the plot does show one weakness. At the bottom (and elsewhere) some of the digits seem to have been merged into one cluster, or have points in what looks like a different cluster. This is very likely due to the fact that the digits themselves can sometimes be ambiguous.

The code is online:



Example 4
The final example is a test to push the t-SNE method with a deliberately challenging data set.

The data set consists of 2 clusters, both encompassed in a ring. It'll be interesting to see what t-SNE makes of the outer ring which does have points that are close together, but which span a much larger area than the 2 simple clusters.
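A data set of this shape could be generated along the following lines. This is a sketch with made-up sizes, radii and noise levels, not the notebook's actual data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Outer ring: 200 points at radius about 5, with a little noise on the radius.
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = rng.normal(5.0, 0.1, size=200)
ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# Two simple clusters of 50 points each, sitting inside the ring.
blob_a = rng.normal([-1.5, 0.0], 0.3, size=(50, 2))
blob_b = rng.normal([1.5, 0.0], 0.3, size=(50, 2))

data = np.vstack([ring, blob_a, blob_b])
group = np.repeat([0, 1, 2], [200, 50, 50])   # 0 = ring, 1 and 2 = clusters

# Apply t-SNE; colouring the result by group shows what it made of the ring.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(data)
print(data.shape, "->", embedded.shape)
```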


We can see that t-SNE has managed to separate out the 3 groups, and also retain the broad topology of the ring.

Impressive!

The code is at:



Tuning t-SNE

The example code specifies how many dimensions to reduce data down to. It also specifies another parameter called perplexity. This parameter tunes the balance of attractive and repulsive forces we discussed above, and in a rough way, specifies how many neighbours a point should have. The higher the value, the more the process will try to bring points together.
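One simple way to explore this is to run t-SNE several times with different perplexity values and compare the plots. The sketch below uses made-up data and three arbitrary perplexity values. Note that sklearn requires the perplexity to be smaller than the number of data points.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Made-up data: 150 points in 10 dimensions around 3 centres.
data, _ = make_blobs(n_samples=150, centers=3, n_features=10, random_state=0)

# Run t-SNE once per perplexity value and keep each 2-d layout.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(data)

for perplexity, layout in embeddings.items():
    print("perplexity", perplexity, "gives a layout of shape", layout.shape)
```

Plotting the layouts side by side makes it easier to see which features of a plot persist across perplexity values and which only appear for one setting.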

The following article explains the effect of changing this perplexity, and how it can mislead or result in very different clusters for the same data.

That article includes interactive demonstrations showing how different kinds of data react to different perplexity values, and is really worth experimenting with.

Importantly, the article shows how t-SNE can mislead, for example showing clusters from data which is actually random.


Further Reading