Friday, July 26, 2019

Hands-On Introduction to PyTorch

This month we ran a hands-on introductory tutorial on PyTorch.


The slides are online (link).


Machine Learning

We started with a quick overview of machine learning, and a very simple illustrative example.

The following shows a machine which is being trained to convert kilometres to miles. Although we know how to do that conversion, let's pretend we don't and that we've built a machine that needs to learn to do that from examples.

When we build that machine, we make some assumptions about the model. Here the assumption is that miles and kilometres are related by a multiplicative factor. We don't know what that parameter is, so we start with a randomly chosen number 0.5.


When we're training that machine, we need to know what the correct answer should be to a given question. It is with examples of correct pairs of kilometres and miles that we train the machine - the training data.

When we ask the machine to convert 100 km to miles, the answer it gives us is 100 * 0.5 = 50 miles. We know this is wrong.


The difference between the machine's answer 50 and the correct answer 62.137 is 12.137. That's the error.

We can use this error to guide how we adjust that parameter inside the machine. We need to increase that parameter to make the output larger and closer to the correct answer. Let's try 0.6.


This time the machine tells us that 100 km is 60 miles, which is much closer to the correct 62.137. The error has been reduced to 2.137.

If we repeat this process of trying different known-good examples, and after each one tune the parameter in response to the error, the machine should get better and better at doing the conversion. The error should fall.

This is essentially how many machine learning systems work, with key common elements being using known-good questions and answers as a training dataset, and using the error to guide the refinement of the machine learning model.
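The loop described above can be sketched in a few lines of plain Python. This is only an illustration: the update rule (moving the parameter by a fraction of the change that would remove the error) and the learning rate of 0.5 are assumptions made for this sketch, not something taken from the session.

```python
# training examples: (kilometres, correct miles)
training_data = [(100.0, 62.137), (50.0, 31.069), (10.0, 6.214)]

parameter = 0.5        # initial guess, as in the text
learning_rate = 0.5    # assumed value for this sketch

for step in range(20):
    for km, correct_miles in training_data:
        answer = km * parameter
        error = correct_miles - answer
        # error / km is the parameter change that would remove the error
        # completely; we only take a fraction of that step
        parameter += learning_rate * error / km

print(round(parameter, 4))
```

After a few passes over the examples the parameter settles close to 0.6214, the factor we pretended not to know.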


Neural Networks

Combining several small learning units intuitively allows more complex datasets to be learned.

Animal brains and nervous systems are also collections of neurons working together to learn and perform complex tasks.


Historically, research in machine learning took inspiration from nature, and the design of neural networks is indeed influenced by animal physiology.

The following shows a simple neural network, made of 3 layers, with each node connected to every other node in preceding and subsequent layers.


In modern neural networks the adjustable parameter is the strength of the connections between nodes, not a parameter inside the node like our first kilometres-to-miles example.

As signals pass through a network, they are combined when connections bring several signals into a node. As they emerge from a node, they are subject to a threshold function. This is how animal neurons work: they only pass on a signal once it reaches a certain strength. In artificial neural networks, the non-linearity of this threshold function, or activation function, is essential for a network's ability to learn complex data.
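As a rough sketch of what a single node does, the following plain Python combines incoming signals, each scaled by its link weight, and passes the sum through an s-shaped threshold. The signal and weight values are made up for illustration, and the sigmoid is just one possible choice of threshold function.

```python
import math

def sigmoid(x):
    """S-shaped threshold function: squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(signals, weights):
    """Combine incoming signals, each scaled by its link weight, then apply the threshold."""
    combined = sum(s * w for s, w in zip(signals, weights))
    return sigmoid(combined)

# hypothetical signals and link weights, chosen only for illustration
weak = node_output([0.1, 0.2, 0.1], [0.5, -1.0, 0.3])
strong = node_output([0.9, 0.8, 0.9], [2.0, 1.5, 2.5])

print(weak, strong)
```

The weakly driven node gives an output below 0.5, while the strongly driven node gives an output close to 1.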

If we write a mathematical expression which relates the output of the network to the incoming signal and all the many link weights, it will be a rather scary function. Rather than trying to solve this scary function, we take a simpler, more approximate approach.

The following shows the error (the difference between the correct answer and the actual output) as it relates to the link weights in a network. As the weights are varied, the error can go up or down. We want to vary the weights to minimise the error. We can do this by working out the local slope and taking small steps down it. Over many iterations we should find ourselves moving down the error to a minimum.


This is called gradient descent, and is ideal for learning iteratively from a large training dataset.
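The idea of taking small steps down the local slope can be sketched with a single weight. The error curve below is a made-up bowl shape with its minimum at a weight of 3, and the step size of 0.1 is an assumption for this sketch. A real network has many weights and a far more complicated error surface.

```python
def error(w):
    # hypothetical error curve with its minimum at w = 3
    return (w - 3.0) ** 2 + 1.0

def slope(w):
    # derivative of the error curve above with respect to w
    return 2.0 * (w - 3.0)

w = 0.0            # starting weight
step_size = 0.1    # assumed learning rate

for i in range(100):
    # move a small step in the downhill direction
    w -= step_size * slope(w)

print(w, error(w))
```

Each step moves the weight a little way downhill, and after enough steps it sits very close to the minimum.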

We didn't spend a lot of time on how a neural network works and is trained, as the focus of the session was on using PyTorch. If you want to learn about how a neural network actually works, there are many good online tutorials and textbooks. Make Your Own Neural Network, which I wrote, is designed to be as accessible as possible.


PyTorch

We could write code to implement a neural network from scratch. That would be a good educational experience, and I do recommend you try it once for a very simple network. When you do that, you realise one of the most laborious steps is doing the algebra for working out the gradient descent for each of the many links in a neural network. If we change the design of the network, we have to do this all over again.

Luckily, frameworks like TensorFlow and PyTorch are designed to make this easy. One of the key things they do is to automate the gradient calculations when given a network architecture.

PyTorch is considered to be more community-led in its design and evolution, and also more pythonic in its idioms. Being pythonic, its error messages are also much easier to interpret, a real issue for new TensorFlow users.


PyTorch Variables


In the session, we started by seeing how PyTorch variables are familiar and do work like normal python variables. We also saw how PyTorch remembers how one variable depends on others, and uses these remembered relationships (called a computation graph) to automatically work out the derivatives of one variable with respect to another.

The following shows how a simple PyTorch variable, named x, is created and given a value of 3.5. You can see that we set an option to enable automatic gradient calculation.


x = torch.tensor(3.5, requires_grad=True)


PyTorch's variables are called tensors, and are similar to python numpy arrays.

The following defines a new variable y, which is calculated from x.


y = (x-1) * (x-2) * (x-3)


PyTorch will assign the value 1.8750 to y, which is a simple calculation using x = 3.5. But in addition to this, PyTorch will remember that y depends on x, and use the definition of y to work out the gradient of y with respect to x.

We can ask PyTorch to work out the gradient and print it out:


# work out gradients
y.backward()

# what is gradient at x = 3.5
x.grad


PyTorch correctly gives us a gradient of 5.75 at x=3.5.
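The value 5.75 can be checked by hand without PyTorch. The sketch below applies the product rule to the three factors of y, and also estimates the slope numerically from two nearby points. The function names and the step h are choices made for this illustration.

```python
def y(x):
    return (x - 1) * (x - 2) * (x - 3)

def dy_dx(x):
    # product rule applied to the three factors
    return (x - 2) * (x - 3) + (x - 1) * (x - 3) + (x - 1) * (x - 2)

x = 3.5
h = 1e-5
# central difference estimate of the slope at x
numerical = (y(x + h) - y(x - h)) / (2 * h)

print(y(x))        # 1.875
print(dy_dx(x))    # 5.75
print(numerical)   # very close to 5.75
```

Both approaches agree with the gradient PyTorch reports, which is reassuring given that PyTorch works it out from the computation graph alone.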


You can try the code yourself:

This ability to do automatic gradient calculations is really helpful in training neural networks as we perform gradient descent down the error function.


GPU Acceleration with PyTorch

We then looked at how PyTorch makes it really easy to take advantage of GPU acceleration. GPUs are very good at performing massively parallel calculations. You can think of a CPU as a single-lane road which can allow fast traffic, but a GPU as a very wide motorway with many lanes, which allows even more traffic to pass.

Without GPU acceleration many applications of machine learning would not have been possible.

The only downside is that only Nvidia's GPUs and its software framework, known as CUDA, are mature and well-adopted in both research and industry. Sadly the competition, for example from AMD, has been late and isn't as mature or sufficiently featured.

Twenty years ago, using GPUs to accelerate calculations was painful compared to how easy it is today. PyTorch makes it really easy.

The following code checks to see if cuda acceleration is available, and if so, sets the default tensor type to be a floating point tensor on the GPU:


if torch.cuda.is_available():
  torch.set_default_tensor_type(torch.cuda.FloatTensor)
  print("using cuda:", torch.cuda.get_device_name(0))
  pass


The following shows that Google's free colab hosted python notebook service gives us access to a Tesla T4, a very expensive GPU!


If we now create tensors, they are by default on the GPU. The following shows us testing where a tensor is located:


We can see the tensor x is on the GPU.
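The screenshot isn't reproduced here, so the following is a minimal sketch of such a check rather than the exact code from the session. It assumes PyTorch is installed. Unlike the code above, it passes the device explicitly when creating the tensor instead of relying on the default tensor type, so it also runs on a machine without a GPU.

```python
import torch

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# create the tensor directly on the chosen device
x = torch.tensor(3.5, device=device)

print(x.device)    # where the tensor lives
print(x.is_cuda)   # True only when the tensor is on a CUDA GPU
```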

You can try the code yourself, and see an example of a matrix multiplication being done on the GPU:


MNIST Challenge

Having looked at the basics of PyTorch we then proceeded to build a neural network that we hoped would learn to classify images of hand-written digits.

People don't write in a precise and consistent way, and this makes recognising hand-written digits a difficult task, even for human eyes!


This challenge has become a standard test for machine learning researchers, and goes by the name MNIST.

Learning is often best done with physical action, rather than passive listening or reading. For this reason we spent a fair bit of time typing in code to implement the neural network.

We started by downloading the training and test datasets. The training dataset is used to train the neural network, and the test dataset is intentionally separate and used to test how well a trained neural network performs. The images in the test set will not have been seen by the neural network and so there can't be any memorising of images.

We started by exploring the contents of the dataset files, which are in csv format. We used the pandas framework to load the data and preview it. We confirmed the training data had 60,000 rows of 785 numbers.

The first number is the actual number the image represents and the rest of the 784 numbers are the pixel values for the 28x28 images.
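To make the row layout concrete, here is a small sketch in plain Python that splits one row into its label and a 28x28 grid of pixel values. The row is synthetic, made up for this illustration, and not a real MNIST record.

```python
# a synthetic row in the same layout: 1 label followed by 784 pixel values
row = [9] + [0] * 784
row[1 + 14 * 28 + 10] = 255   # light up the pixel at image row 14, column 10

label = row[0]
pixels = row[1:]

# fold the flat list of 784 values into 28 rows of 28 pixels
image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]

print(label, len(pixels), len(image), len(image[0]))
print(image[14][10])
```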

We used matplotlib's imshow() to view images of selected rows. The following shows the top of the training dataset and highlights the 5th row (index 4 because we start at 0). The label is 9, and further down we can see a 9 when we draw the pixel values as an image.


Having explored the data directly, we proceeded to build a PyTorch dataset class.


import torch
import pandas
import matplotlib.pyplot as plt

# dataset class

class MnistDataset(torch.utils.data.Dataset):
    
    def __init__(self, csv_file):
        self.data_df = pandas.read_csv(csv_file, header=None)
        pass
    
    def __len__(self):
        return len(self.data_df)
    
    def __getitem__(self, index):
        # image target (label)
        label = self.data_df.iloc[index,0]
        image_target = torch.zeros((10))
        image_target[label] = 1.0
        
        # image data, normalised from 0-255 to 0-1
        image_values = torch.FloatTensor(self.data_df.iloc[index,1:].values) / 255.0
        
        # return label, image data tensor and target tensor
        return label, image_values, image_target
    
    def plot_image(self, index):
        arr = self.data_df.iloc[index,1:].values.reshape(28,28)
        plt.title("label = " + str(self.data_df.iloc[index,0]))
        plt.imshow(arr, interpolation='none', cmap='Blues')
        pass
    
    pass


The dataset class is inherited from PyTorch and specialised for our own needs. In our example, we get it to load data from the csv file when an object is initialised. We do need to tell it how big our data is, which in our case is simply the number of rows of the csv file, or in our code, the length of the pandas dataframe. We also need to tell the class how to get an item of data from the dataset. In our example we use the opportunity to normalise the pixel values from 0-255 to 0-1. We return the label, these normalised pixel values and also something we call image_target.

That image_target is a 1-hot vector of size 10, all zero except one set to 1.0 at the position that corresponds to the label. So if the image is a 0 then the vector has all zeros except the first one. If the image is a 9 then the vector is all zeros except the last one.
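The 1-hot idea can be shown with a plain Python list instead of a tensor. The helper name one_hot is made up for this sketch.

```python
def one_hot(label, size=10):
    """Return a list of zeros with a single 1.0 at the position given by label."""
    target = [0.0] * size
    target[label] = 1.0
    return target

print(one_hot(0))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot(9))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```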

As an optional extra, I've added an image plotting function which will draw an image from the pixel values from a given record in the data set. The following shows a dataset object called mnist_dataset instantiated from the MnistDataset class we defined, and uses the plot_image function to draw the 11th record.


We then worked on the main neural network class, which is again inherited from PyTorch, with key parts specialised for our own needs.


import torch
import torch.nn as nn
import pandas

# classifier class

class Classifier(nn.Module):
    
    def __init__(self):
        # initialise parent pytorch class
        super().__init__()
        
        # define neural network layers
        self.model = nn.Sequential(
            nn.Linear(784, 200),
            nn.Sigmoid(),
            nn.Linear(200, 10),
            nn.Sigmoid()
        )
        
        # create error function
        self.error_function = torch.nn.BCELoss()

        # create optimiser, using simple stochastic gradient descent
        self.optimiser = torch.optim.SGD(self.parameters(), lr=0.01)
        
        # counter and accumulator for progress
        self.counter = 0
        self.progress = []
        pass
    
    
    def forward(self, inputs):
        # simply run model
        return self.model(inputs)
    
    
    def train(self, inputs, targets):
        # calculate the output of the network
        outputs = self.forward(inputs)
        
        # calculate error
        loss = self.error_function(outputs, targets)
        
        # increase counter and accumulate error every 10
        self.counter += 1
        if (self.counter % 10 == 0):
            self.progress.append(loss.item())
            pass
        if (self.counter % 10000 == 0):
            print("counter = ", self.counter)
            pass
        

        # zero gradients, perform a backward pass, and update the weights.
        self.optimiser.zero_grad()
        loss.backward()
        self.optimiser.step()

        pass
    
    
    def plot_progress(self):
        df = pandas.DataFrame(self.progress, columns=['loss'])
        df.plot(ylim=(0, 1.0), figsize=(16,8), alpha=0.1, marker='.', grid=True, yticks=(0, 0.25, 0.5))
        pass
    
    pass


The class needs to be initialised by calling the initialisation of its parent. Beyond that we can add our own initialisation. We create a neural network model and the key lines of code define a 3-layer network with the first layer having 784 nodes, a hidden middle layer with 200 nodes, and a final output layer with 10 nodes. The 784 matches the number of pixel values for each image. The 10 outputs correspond to each of the possible digits 0-9.

The following diagram summarises these dimensions.


The activation function is set as Sigmoid() which is a simple and fairly popular s-shaped threshold.

We also define an error function which summarises how far wrong the network's output is from what it should be. There are several options provided by PyTorch, and we've chosen the BCELoss which is often suitable for cases where only one of the output nodes should be 1, corresponding to a categorising network. Another common option is MSELoss which is a mean squared error.
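To give a feel for what these two error functions measure, the sketch below works them out by hand in plain Python, using the usual definitions of binary cross entropy and mean squared error averaged over the output nodes. The network outputs are made up for illustration, and the function names are not PyTorch's.

```python
import math

def bce_loss(outputs, targets):
    """Binary cross entropy, averaged over the output nodes."""
    terms = [t * math.log(o) + (1 - t) * math.log(1 - o)
             for o, t in zip(outputs, targets)]
    return -sum(terms) / len(terms)

def mse_loss(outputs, targets):
    """Mean squared error, averaged over the output nodes."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

target = [0.0] * 10
target[6] = 1.0   # the image is a 6

# hypothetical network outputs, made up for illustration
good = [0.02] * 10
good[6] = 0.95    # confident and correct
bad = [0.02] * 10
bad[3] = 0.95     # confident but wrong

print(bce_loss(good, target), bce_loss(bad, target))
print(mse_loss(good, target), mse_loss(bad, target))
```

Both measures are small for the correct answer and larger for the wrong one. The cross entropy penalises a confident wrong answer particularly heavily because of the logarithm.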

We also define a method to perform gradient descent steps. Here we've chosen the simple stochastic gradient descent, or SGD. Other options are available and you can find out more detail on the others here (link).

In the initialisation function I've also set a counter and an empty list progress which we'll use to monitor training progress.

We need to define a forward() function in our class as this is expected by PyTorch. In our case it is trivially simple: it passes the 784-sized data through the model and returns the 10-sized output.

We have defined a train() function which does the real training. It takes both pixel data and that 1-hot target vector we saw earlier coming from the dataset class representing what the output should be. It passes the pixel data input to the forward() function, and keeps the network's output as outputs. The chosen error function is used to calculate the error, often called a loss.

Finally the following key code calculates the error gradients in the network from this error and uses the chosen gradient descent method to take a step down the gradient for all the network weights. You'll see this key code in almost all PyTorch programs.


# zero gradients, perform a backward pass, and update the weights.
self.optimiser.zero_grad()
loss.backward()
self.optimiser.step()


You might be asking why the first line of code is needed, which appears to zero the gradients before they are calculated again. The reason is that PyTorch adds newly calculated gradients to any already stored, rather than overwriting them. This lets more complex models combine gradients from multiple sources, and it means we have to zero the gradients ourselves when we want a fresh calculation.
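PyTorch adds each newly calculated gradient to whatever is already stored, which is why the stored values are zeroed first. This can be seen with a small sketch that reuses the x and y from the earlier variables example. It assumes PyTorch is installed, and the variable names first, accumulated and after_zeroing are made up for this illustration.

```python
import torch

x = torch.tensor(3.5, requires_grad=True)

# first backward pass: x.grad holds the gradient at x = 3.5
y = (x - 1) * (x - 2) * (x - 3)
y.backward()
first = x.grad.item()

# second backward pass without zeroing: the new gradient is added to the old one
y = (x - 1) * (x - 2) * (x - 3)
y.backward()
accumulated = x.grad.item()

# zeroing resets the stored gradient before the next pass
x.grad.zero_()
y = (x - 1) * (x - 2) * (x - 3)
y.backward()
after_zeroing = x.grad.item()

print(first, accumulated, after_zeroing)
```

Without the zeroing step the stored gradient doubles to 11.5. With it, we get the expected 5.75 again.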

In the train() function we've also added code to increase the counter at every call of the function, and then used this to add the current error (loss) to the list every 10 iterations. At every 10,000 iterations we print out the current count so we can visually see progress during training.

That's the core code. As an optional extra, I've added a function to plot a chart of the loss against training iteration so we can visualise how well the network trained.

The following simple code shows how easy it is to use this neural network class.


%%time

# create classifier

C = Classifier()

# train classifier

epochs = 3

for i in range(epochs):
    print('training epoch', i+1, "of", epochs)
    for label, image_data_tensor, target_tensor in mnist_dataset:
        C.train(image_data_tensor, target_tensor)
        pass
    pass


You can see we create an object from the PyTorch neural network class Classifier, here unimaginatively called C. We use a loop to work through all the data in the mnist_dataset 3 times, called epochs. You can see we take the label, pixel and target vector data for each image from the dataset and pass it to the C.train() function which does all the hard work we discussed above.

Working through the training dataset of 60,000 images, pushing the data through the neural network, calculating the error, using this to calculate the gradients inside the network, and taking a step down the gradient takes about 3 minutes on Google's infrastructure.

After training we plotted the chart of loss against training iteration. Everyone in the session had a plot where the error fell towards zero over time. A reducing error means the network is getting better at correctly classifying the images of digits.


Because our neural networks are initialised with random link weights, the plots for everyone in the room were slightly different in detail, but the same in the broad trend of diminishing error.

The following shows the trained neural network's forward() function being used to classify the 14th record in the dataset.


The plot shows it is a 6. The neural network's 10 outputs are shown as a bar chart. It is clear the node corresponding to 6 has a value close to 1.0 and the rest are very small. This means the network thinks the image is a 6 with high confidence.

There will be cases where an image is ambiguous and two or more of the network's outputs will have medium or high values. For example, some renditions of a 9 look like a 4.

We performed a very simplistic calculation of how well the network performs, keeping a score of how often our trained network correctly classifies images from the test dataset. All of us in the session had networks which scored about 90%. That is, about 9,000 out of the 10,000 test images were correctly classified.
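The scoring idea can be sketched in plain Python: take the output node with the largest value as the network's answer and count how often it matches the label. The three records below are made-up stand-ins for test data and network outputs, and the helper name predicted_digit is hypothetical.

```python
def predicted_digit(outputs):
    """Index of the largest of the 10 network outputs."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

# hypothetical (label, outputs) pairs standing in for test records
records = [
    (6, [0.01, 0.02, 0.01, 0.03, 0.02, 0.04, 0.97, 0.01, 0.02, 0.01]),
    # an ambiguous 9 which the network mistakes for a 4
    (9, [0.01, 0.01, 0.02, 0.01, 0.55, 0.02, 0.01, 0.03, 0.02, 0.48]),
    (0, [0.92, 0.01, 0.03, 0.01, 0.02, 0.01, 0.05, 0.01, 0.02, 0.01]),
]

score = 0
for label, outputs in records:
    if predicted_digit(outputs) == label:
        score += 1

print(score, "of", len(records), "correct")
```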

That's impressive for a neural network kept very simple for educational purposes.

The above code doesn't run on the GPU. As a final step we set the default tensor type to be on the GPU and re-ran the code. A small bit of code in the dataset class also needed to be changed to assert this tensor type on the pixel data, as the current version of PyTorch didn't seem to apply the newly set default. No change was needed to the neural network code at all.

This time the training took about 2 minutes. Given the Tesla T4 costs over £2,000 this didn't seem like much of a gain.

We discussed how GPUs only really shine when their massive parallelism is employed. If we increase the size of the data so it is larger than 28*28=784 for each record, or increase the size of the network, then we'll see significant improvements over a CPU.

The following graph shows the effect of increasing the middle layer of our neural network from 200 up to 5000. We can see that training time increases with a CPU but stays fairly constant with a GPU.
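One way to see why a larger middle layer gives the GPU more to do is to count the adjustable parameters in the network. The sketch below assumes each fully connected layer has one weight per link plus one bias per output node, which is the default for nn.Linear. The function name parameter_count is made up for this illustration.

```python
def parameter_count(hidden_nodes, inputs=784, outputs=10):
    """Weights plus biases for two fully connected layers."""
    first_layer = inputs * hidden_nodes + hidden_nodes
    second_layer = hidden_nodes * outputs + outputs
    return first_layer + second_layer

for hidden in (200, 1000, 5000):
    print(hidden, parameter_count(hidden))
```

Going from 200 to 5000 middle nodes takes the network from about 159 thousand parameters to almost 4 million, all of which are updated at every training step.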


You can see and run the full code used for the tutorial online:


Thoughts

Most of the attendees had never written a machine learning application, and some had little coding experience.

Despite this, everyone felt that they had developed an intuition for how machine learning and neural networks in particular work. The hands-on working with code also helped us understand the mechanics of using PyTorch, something which can't easily be developed just by listening passively or reading about it.

A member asked if we had implemented "AI". The answer was emphatically, yes! We did discuss how the word AI has been overused and mystified, and how the session demystified AI and neural networks.

Demystifying AI aside, we did rightly appreciate how the relatively simple idea of a neural network and our simple implementation still managed to learn to identify images with such a high degree of accuracy.