Machine Learning Part 2
In a previous post I talked about finally making progress learning machine learning. This post is a guided tour of the neural network code from that post, giving a high-level explanation of how it works. I will assume you know a bit about neural networks, e.g. what layers are, what activation functions do, etc., so I won't go into too much detail about the maths.
In the last post I explained that I found a Rust project on Github that implemented a simple neural network. Guided by the "Programming Machine Learning" book1 I extended that code so it could process the MNIST dataset. This commit 381858eec622 is roughly where I was at if you want to follow along.
The code has a simple structure. The core modules are:
- Matrix (matrix/src/matrix.rs)
- Activation (neural-network/src/activations.rs)
- Network (neural-net-rs/neural-network/src/network.rs)
Matrix
This module contains a standard set of matrix operations, addition, dot_multiply, etc., plus some utility functions like random and a map for applying a function to each element of a matrix. In keeping with starting from first principles they are "hand coded" and don't use any external libraries or optimisation techniques.
Activation
This module defines the activation functions. The activation function is the "brains" of a neural network. At this point there's only one, a Sigmoid function. It comes in two flavours, scalar and vector, for operating on single values or on a whole matrix. This is the definition of the Sigmoid activation function and its derivative. The vector versions just use map to apply the scalar function to every element of the matrix.
pub const SIGMOID_VECTOR: Activation = Activation {
    // Scalar sigmoid: squashes any input into the range (0, 1).
    function: |x| 1.0 / (1.0 + E.powf(-x)),
    // Derivative written in terms of the sigmoid's output:
    // sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
    derivative: |x| x * (1.0 - x),
    // The vector flavours apply the same functions element-wise over a matrix.
    vector_function: Some(|m| m.map(|x| 1.0 / (1.0 + E.powf(-x)))),
    vector_derivative: Some(|m| m.map(|x| x * (1.0 - x))),
    activation_type: ActivationType::SigmoidVector,
};
Network
The network module is a runner for the network. It has a main loop which does three things
- Feed forward. For each node in the next layer, this multiplies the previous layer's nodes by their weights and sums them, then passes the weighted sum through the activation function. The result becomes the value of that node.
- Error calculation. Once we've calculated all the nodes in all the layers we have the output layer. We then calculate the error between the predicted (output) values and the target values.
- Back propagate. This step uses the error to adjust the weights for the next feed forward pass. It uses the derivative of the activation function to do a gradient descent search for a better fit. It works in reverse, applying the error adjustment to the weights from the output layer back towards the input.
The Network.train function runs these three operations in a loop for a pre-specified number of iterations (epochs).
pub fn train(&mut self, inputs: Vec<Vec<f64>>, targets: Vec<Vec<f64>>, epochs: u32) {
    for epoch in 1..=epochs {
        ...
        inputs.iter().zip(&targets).for_each(|(input, target)| {
            let outputs = self.feed_forward(Matrix::from(input.clone()));
            let error = Matrix::from(target.clone()).subtract(&outputs);
            ...
            self.back_propagate(outputs, Matrix::from(target.clone()));
        });
        ...
    }
}
Feed forward is pretty simple. It's just a few matrix calculations, a dot product and the activation function.
// Weighted sum of this layer's inputs.
let output = weight.dot_multiply(input);
// Use the vectorised activation if one is defined, otherwise map the
// scalar version over every element of the matrix.
if let Some(vector_fn) = activation.vector_function {
    vector_fn(&output)
} else {
    output.map(activation.function)
}
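That snippet handles a single layer; the full pass just repeats it for every layer in the network. Here's a standalone sketch of the same idea using plain vectors instead of the Matrix type (not the repo's actual code, just the shape of it).

fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// weights[l][j] holds the incoming weights for node j of layer l + 1.
fn feed_forward(weights: &[Vec<Vec<f64>>], input: &[f64]) -> Vec<f64> {
    let mut current = input.to_vec();
    for layer in weights {
        let mut next = Vec::with_capacity(layer.len());
        for node_weights in layer {
            // Weighted sum of the previous layer's values for this node,
            // squashed through the activation function.
            let sum: f64 = node_weights.iter().zip(&current).map(|(w, x)| w * x).sum();
            next.push(sigmoid(sum));
        }
        current = next;
    }
    current
}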
Calculating the error is a simple subtraction.
let mut errors = targets.subtract(&outputs);
Back propagation is the most complicated part (too much to post here) and involves some complex maths, a lot of which is still over my head. But the neural network code and the core algorithm itself are pretty simple.
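To give a flavour of it anyway, here is a heavily simplified, standalone sketch of the weight update for a single layer, using plain vectors and leaving out momentum. It's not the repo's implementation, just the shape of the calculation.

// errors is target - output for this layer (as in the train loop above),
// outputs are this layer's activations, inputs are the previous layer's values.
fn update_weights(
    weights: &mut [Vec<f64>], // weights[j][i]: weight from input i to node j
    inputs: &[f64],
    outputs: &[f64],
    errors: &[f64],
    learning_rate: f64,
) -> Vec<f64> {
    // The error to push back to the previous layer.
    let mut prev_errors = vec![0.0; inputs.len()];
    for (j, node_weights) in weights.iter_mut().enumerate() {
        // Chain rule: error times the sigmoid derivative at this node's output.
        let delta = errors[j] * outputs[j] * (1.0 - outputs[j]);
        for (i, w) in node_weights.iter_mut().enumerate() {
            prev_errors[i] += *w * delta;
            // Nudge the weight in the direction that reduces the error.
            *w += learning_rate * delta * inputs[i];
        }
    }
    prev_errors
}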
I made some changes to the algorithm in the original code.
Bias
I removed the separate bias vectors. The original code tracked the biases in their own Vec of matrices. This is unnecessary: we can add a constant bias term to the input vector, so the bias just becomes another weight and we can do away with the separate bias list. This simplifies some of the maths and eliminates the need to manage the biases. I learnt this technique in the book.
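The trick looks something like this sketch: append a constant 1.0 to the input vector and give each node one extra weight, which then behaves exactly like a bias.

// Append the constant bias "node" to an input vector.
fn with_bias(input: &[f64]) -> Vec<f64> {
    let mut v = input.to_vec();
    v.push(1.0); // always on, so its weight acts as the bias
    v
}

// The weighted sum now includes bias_weight * 1.0, so there's no separate
// bias vector to store, initialise or update during back propagation.
fn weighted_sum(node_weights: &[f64], input_with_bias: &[f64]) -> f64 {
    node_weights.iter().zip(input_with_bias).map(|(w, x)| w * x).sum()
}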
Momentum
Momentum is a technique used to accelerate the convergence of the gradient descent algorithm2 and to help it escape local minima by using past gradients to smooth out the updates to the weights. I think momentum is the reason my neural network is so much faster than the one in the book. My network converges in about 30 steps (epochs) whereas the book's network runs for 10,000. I can't remember where this idea came from. It's discussed in the book but in a later chapter.
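The update itself is small. Roughly, as a standalone sketch on plain f64 weights rather than the Matrix-based code: keep a "velocity" for each weight that remembers the previous update, and blend a fraction of it into the new one.

// One momentum-smoothed gradient descent step.
fn momentum_step(
    weights: &mut [f64],
    velocity: &mut [f64], // previous update for each weight
    gradients: &[f64],    // dE/dw for each weight
    learning_rate: f64,
    momentum: f64,        // e.g. 0.5, like the value passed to Network::new below
) {
    for ((w, v), g) in weights.iter_mut().zip(velocity.iter_mut()).zip(gradients) {
        // Blend the new gradient with a fraction of the previous update...
        *v = momentum * *v - learning_rate * g;
        // ...and apply the smoothed update to the weight.
        *w += *v;
    }
}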
The Model
Other changes made were to do with saving and loading the model data. Building a neural network model happens in two steps
- Training. You train the model on a sample dataset.
- Testing. You test the model on a different dataset that it hasn't seen before.
The original code did both of these steps in one. I split the training and testing into two distinct steps, and different binaries. To do this I had to save the model (the configuration of the network and the model weights) to a file after training, then load it for testing. If you have played with running LLMs locally you'll know that you need to download the "model" or the "weights". This is the same thing, just smaller: I only have about 100,000 weights, so saving them as JSON is fine. LLMs can have tens of billions of weights.
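Saving and loading is just serialising a struct to JSON. Something like this sketch, assuming serde and serde_json are available; the Model struct here is hypothetical, the real one holds the layer sizes, the weights and the other network parameters.

use serde::{Deserialize, Serialize};
use std::fs;

// Hypothetical shape of the saved model.
#[derive(Serialize, Deserialize)]
struct Model {
    layers: Vec<usize>,     // e.g. [784, 128, 10]
    weights: Vec<Vec<f64>>, // one flattened matrix per layer transition
    learning_rate: f64,
    momentum: f64,
}

fn save_model(model: &Model, path: &str) -> std::io::Result<()> {
    fs::write(path, serde_json::to_string_pretty(model).unwrap())
}

fn load_model(path: &str) -> std::io::Result<Model> {
    Ok(serde_json::from_str(&fs::read_to_string(path)?).unwrap())
}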
Training and Testing
The train binary loads the MNIST data, trains the neural network and saves the model data. There's not much to it. Here's the important bit:
// cargo run --bin train --release
let mut network = Network::new(vec![784, 128, 10], SIGMOID, 0.01, Some(0.5));
let inputs: Vec<Vec<f64>> = mnist_data.images().iter().map(|m| m.data.clone()).collect();
let targets: Vec<Vec<f64>> = mnist_data.labels().iter().map(|m| m.data.clone()).collect();
network.train(inputs, targets, 30);
The network is configured with three layers, vec![784, 128, 10]:
- the input layer. This is one node for each pixel in the image (28x28 = 784)
- the output layer. This is 10, one node for each of the 10 digits we're looking for
- one hidden layer. This has 128 nodes because it's a nice round number ¯\_(ツ)_/¯
There's just the one Sigmoid function, applied when moving from the input layer to the hidden layer and from the hidden layer to the output. The last two arguments are the learning rate and the initial momentum. The learning rate is a coefficient applied to the weight updates; basically how fast we want to adjust the weights in response to the error. 0.01 is much higher than the value suggested in the book but it seems to work ok.
The inputs for the model are the MNIST images (mnist_data.images()) and the targets are the corresponding digits, called labels in machine learning parlance (mnist_data.labels()).
Run it with the --release flag because we don't have all day.
The test binary loads the test images and the model. There's no Network setup; the network parameters are deserialised from the model file. It then runs the network's feed_forward function once per image to guess the digit it contains.
// cargo run --bin test --release
test_data
    .images()
    .iter()
    .zip(test_data.labels().iter())
    .for_each(|(image, label)| {
        let output = network.feed_forward(Matrix::new(784, 1, image.data.clone()));
        let predicted = get_actual_digit(&output);
        let actual = get_actual_digit(label);
        confusion_matrix[actual][predicted] += 1;
        progress_bar.inc(1);
    });
Most of the code in test is for the progress bar and to print out the pretty statistics to show how well the model did at guessing the digits.
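For the record, get_actual_digit is essentially an argmax: the index of the largest of the ten values is the digit. A sketch of that idea:

// The idea behind get_actual_digit: the index of the largest value is the digit.
fn argmax(values: &[f64]) -> usize {
    values
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(index, _)| index)
        .unwrap_or(0)
}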
MNIST
The rest of the extra code is to do with loading the MNIST data. It's a binary file that contains single-channel (grayscale) uncompressed image data, so it's pretty straightforward to parse. The values of the 784 pixels (28x28) become the input layer of the network.
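For reference, the image file is a big-endian header of four u32s (magic number, image count, rows, columns) followed by one unsigned byte per pixel. Here's a rough sketch of a reader, not the repo's loader; scaling the pixel bytes down to 0–1 is the usual preparation for a sigmoid network.

use std::fs;

fn read_images(path: &str) -> std::io::Result<Vec<Vec<f64>>> {
    let bytes = fs::read(path)?;
    // Big-endian u32 at a given offset in the header.
    let read_u32 = |offset: usize| u32::from_be_bytes(bytes[offset..offset + 4].try_into().unwrap());
    let count = read_u32(4) as usize;
    let pixels = (read_u32(8) * read_u32(12)) as usize; // 28 * 28 = 784
    Ok((0..count)
        .map(|i| {
            let start = 16 + i * pixels; // header is 16 bytes
            bytes[start..start + pixels]
                .iter()
                .map(|&p| p as f64 / 255.0) // scale each byte to 0.0..=1.0
                .collect()
        })
        .collect())
}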
Turns out coding an artificial neural network is pretty straightforward. We can easily extend this to have more activation functions and handle more hidden layers (we can already specify the number of nodes in a layer). And there is plenty of scope for optimisation. The Matrix maths can be improved, I'm using clone3 everywhere, and there are improvements that can be made to the training algorithms. While writing this I spotted a couple of unnecessary operations.
If you want to learn how to code a Rust neural network yourself I suggest you check out this commit af1ca648c310. This is the code after I cleaned up the original repo but before I started adding the MNIST code. It's much simpler at this point. If you want to learn by building something yourself maybe branch from here.
I am following the book "Programming Machine Learning" by Paolo Perrotta. https://pragprog.com/titles/pplearn/programming-machine-learning/
Gradient descent is an optimisation algorithm used in machine learning to minimise the error of a model. Imagine you are at the top of a hill and want to find the lowest point in the valley below. Gradient descent helps you do this by calculating the slope of the hill (the gradient) at your current position. It then takes a small step downhill in the direction of the steepest slope. By repeating this process, taking steps based on the slope, you gradually move closer to the lowest point, which represents the best set of parameters for your model. The size of the steps can be adjusted, and this is known as the learning rate. With each iteration, the model improves its predictions by reducing the error.
If you have looked at the source code for the neural network and you know a bit of Rust you might be wondering why aren't there any 'a's (lifetimes) anywhere? The answer is clone. Yes, clone is bad but I just wanted to get something working quickly. I think this is an approach you should take too if you are new to Rust. Make sure you come back later and clean them up if they are causing issues.