Machine Learning Part 2
In a previous post I talked about finally making progress learning machine learning. This post is a guided tour of the neural network code from that post, giving a high-level explanation of how it works. I will assume you know a bit about neural networks, e.g. what layers are, what activation functions do, etc., so I won't go into too much detail about the maths.
In the last post I explained that I found a Rust project on Github that implemented a simple neural network. Guided by the "Programming Machine Learning" book1 I extended that code so it could process the MNIST dataset. This commit 381858eec622 is roughly where I was at if you want to follow along.
The code has a simple structure. The core modules are:
- Matrix (matrix/src/matrix.rs)
- Activation (neural-network/src/activations.rs)
- Network (neural-net-rs/neural-network/src/network.rs)
Matrix
This module contains a standard set of matrix operations, addition, dot_multiply, etc., plus some utility functions like random and a map for applying a function to each element of a matrix. In keeping with starting from first principles they are "hand coded" and don't use any external libraries or optimisation techniques.
Activation
This module defines the activation functions. The activation function is the "brains" of a neural network. At this point there's only one, a Sigmoid function. It comes in two flavours, scalar and vector, for operating on single values or on a whole matrix. This is the definition of the Sigmoid activation function and its derivative. The vector versions just use map to apply the scalar function to every element of the matrix.
pub const SIGMOID_VECTOR: Activation = Activation {
    // Scalar sigmoid: squashes any input into the range (0, 1).
    function: |x| 1.0 / (1.0 + E.powf(-x)),
    // Derivative written in terms of the sigmoid's output:
    // sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
    derivative: |x| x * (1.0 - x),
    // The vector flavours apply the same functions element-wise over a matrix.
    vector_function: Some(|m| m.map(|x| 1.0 / (1.0 + E.powf(-x)))),
    vector_derivative: Some(|m| m.map(|x| x * (1.0 - x))),
    activation_type: ActivationType::SigmoidVector,
};
Network
The network module is a runner for the network. It has a main loop which does three things
- Feed forward. For each node in the next layer, this multiplies the previous layer's nodes by their weights and sums them, then passes the weighted sum through the activation function. The result becomes the value of that node.
- Error calculation. Once we've calculated all the nodes in all the layers we have the output layer. We then calculate the error between the predicted (output) values and the target values.
- Back propagate. This step uses the error to adjust the weights for the next feed forward pass. It uses the derivative of the activation function to do a gradient descent search for a better fit. It works in reverse, applying the error adjustment to the weights from the output layer back towards the input.
The Network.train function runs these three operations in a loop for a pre-specified number of iterations (epochs).
pub fn train(&mut self, inputs: Vec<Vec<f64>>, targets: Vec<Vec<f64>>, epochs: u32) {
    for epoch in 1..=epochs {
        ...
        inputs.iter().zip(&targets).for_each(|(input, target)| {
            let outputs = self.feed_forward(Matrix::from(input.clone()));
            let error = Matrix::from(target.clone()).subtract(&outputs);
            ...
            self.back_propagate(outputs, Matrix::from(target.clone()));
        });
        ...
    }
}
Feed forward is pretty simple. It's just a few matrix calculations, a dot product and the activation function.
// Weighted sum of this layer's inputs.
let output = weight.dot_multiply(input);
// Use the vectorised activation if one is defined, otherwise map the
// scalar version over every element of the matrix.
if let Some(vector_fn) = activation.vector_function {
    vector_fn(&output)
} else {
    output.map(activation.function)
}
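That snippet handles a single layer; the full pass just repeats it for every layer in the network. Here's a standalone sketch of the same idea using plain vectors instead of the Matrix type (not the repo's actual code, just the shape of it).

fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// weights[l][j] holds the incoming weights for node j of layer l + 1.
fn feed_forward(weights: &[Vec<Vec<f64>>], input: &[f64]) -> Vec<f64> {
    let mut current = input.to_vec();
    for layer in weights {
        let mut next = Vec::with_capacity(layer.len());
        for node_weights in layer {
            // Weighted sum of the previous layer's values for this node,
            // squashed through the activation function.
            let sum: f64 = node_weights.iter().zip(&current).map(|(w, x)| w * x).sum();
            next.push(sigmoid(sum));
        }
        current = next;
    }
    current
}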
Calculating the error is a simple subtraction.
let mut errors = targets.subtract(&outputs);
Back propagation is the most complicated part (too much to post here) and involves some complex maths, a lot of which is still over my head. But the neural network code and the core algorithm itself are pretty simple.
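To give a flavour of it anyway, here is a heavily simplified, standalone sketch of the weight update for a single layer, using plain vectors and leaving out momentum. It's not the repo's implementation, just the shape of the calculation.

// errors is target - output for this layer (as in the train loop above),
// outputs are this layer's activations, inputs are the previous layer's values.
fn update_weights(
    weights: &mut [Vec<f64>], // weights[j][i]: weight from input i to node j
    inputs: &[f64],
    outputs: &[f64],
    errors: &[f64],
    learning_rate: f64,
) -> Vec<f64> {
    // The error to push back to the previous layer.
    let mut prev_errors = vec![0.0; inputs.len()];
    for (j, node_weights) in weights.iter_mut().enumerate() {
        // Chain rule: error times the sigmoid derivative at this node's output.
        let delta = errors[j] * outputs[j] * (1.0 - outputs[j]);
        for (i, w) in node_weights.iter_mut().enumerate() {
            prev_errors[i] += *w * delta;
            // Nudge the weight in the direction that reduces the error.
            *w += learning_rate * delta * inputs[i];
        }
    }
    prev_errors
}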
I made some changes to the algorithm in the original code.
Bias
I removed the separate bias vectors. The original code tracked the biases in their own Vec of matrices. This is unnecessary: we can add a constant bias term to the input vector, so the bias just becomes another weight and we can do away with the separate bias list. This simplifies some of the maths and eliminates the need to manage the biases. I learnt this technique in the book.
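The trick looks something like this sketch: append a constant 1.0 to the input vector and give each node one extra weight, which then behaves exactly like a bias.

// Append the constant bias "node" to an input vector.
fn with_bias(input: &[f64]) -> Vec<f64> {
    let mut v = input.to_vec();
    v.push(1.0); // always on, so its weight acts as the bias
    v
}

// The weighted sum now includes bias_weight * 1.0, so there's no separate
// bias vector to store, initialise or update during back propagation.
fn weighted_sum(node_weights: &[f64], input_with_bias: &[f64]) -> f64 {
    node_weights.iter().zip(input_with_bias).map(|(w, x)| w * x).sum()
}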
Momentum
Momentum is a technique used to accelerate the convergence of the gradient descent algorithm2 and to help it escape local minima by using past gradients to smooth out the updates to the weights. I think momentum is the reason my neural network is so much faster than the one in the book. My network converges in about 30 steps (epochs) whereas the book's network runs for 10,000. I can't remember where this idea came from. It's discussed in the book but in a later chapter.
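The update itself is small. Roughly, as a standalone sketch on plain f64 weights rather than the Matrix-based code: keep a "velocity" for each weight that remembers the previous update, and blend a fraction of it into the new one.

// One momentum-smoothed gradient descent step.
fn momentum_step(
    weights: &mut [f64],
    velocity: &mut [f64], // previous update for each weight
    gradients: &[f64],    // dE/dw for each weight
    learning_rate: f64,
    momentum: f64,        // e.g. 0.5, like the value passed to Network::new below
) {
    for ((w, v), g) in weights.iter_mut().zip(velocity.iter_mut()).zip(gradients) {
        // Blend the new gradient with a fraction of the previous update...
        *v = momentum * *v - learning_rate * g;
        // ...and apply the smoothed update to the weight.
        *w += *v;
    }
}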
The Model
Other changes made were to do with saving and loading the model data. Building a neural network model happens in two steps
- Training. You train the model on a sample dataset.
- Testing. You test the model on a different dataset that it hasn't seen before.
The original code did both of these steps in one. I split the training and testing into two distinct steps, and different binaries. To do this I had to save the model (the configuration of the network and the model weights) to a file after training, then load it for testing. If you have played with running LLMs locally you'll know that you need to download the "model" or the "weights". This is the same thing, just smaller: I only have about 100,000 weights, so saving them as JSON is fine. LLMs can have tens of billions of weights.
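Saving and loading is just serialising a struct to JSON. Something like this sketch, assuming serde and serde_json are available; the Model struct here is hypothetical, the real one holds the layer sizes, the weights and the other network parameters.

use serde::{Deserialize, Serialize};
use std::fs;

// Hypothetical shape of the saved model.
#[derive(Serialize, Deserialize)]
struct Model {
    layers: Vec<usize>,     // e.g. [784, 128, 10]
    weights: Vec<Vec<f64>>, // one flattened matrix per layer transition
    learning_rate: f64,
    momentum: f64,
}

fn save_model(model: &Model, path: &str) -> std::io::Result<()> {
    fs::write(path, serde_json::to_string_pretty(model).unwrap())
}

fn load_model(path: &str) -> std::io::Result<Model> {
    Ok(serde_json::from_str(&fs::read_to_string(path)?).unwrap())
}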
Training and Testing
The train binary loads the MNIST data, trains the neural network and saves the model data. There's not much to it. Here's the important bit:
// cargo run --bin train --release
let mut network = Network::new(vec![784, 128, 10], SIGMOID, 0.01, Some(0.5));
let inputs: Vec<Vec<f64>> = mnist_data.images().iter().map(|m| m.data.clone()).collect();
let targets: Vec<Vec<f64>> = mnist_data.labels().iter().map(|m| m.data.clone()).collect();
network.train(inputs, targets, 30);
The network is configured with three layers, vec![784, 128, 10]:
- the input layer. This is one node for each pixel in the image (28x28 = 784)
- the output layer. This is 10, one node for each of the 10 digits we're looking for
- one hidden layer. This has 128 nodes because it's a nice round number ¯\_(ツ)_/¯
There's just the one Sigmoid function, applied when moving from the input layer to the hidden layer and from the hidden layer to the output. The last two arguments are the learning rate and the initial momentum. The learning rate is a coefficient applied to the weight updates; basically how fast we want to adjust the weights in response to the error. 0.01 is much higher than the value suggested in the book but it seems to work ok.
The inputs for the model are the MNIST images (mnist_data.images()) and the targets are the corresponding digits, called labels in machine learning parlance (mnist_data.labels()).
Run it with the --release flag because we don't have all day.
The test binary loads the test images and the model. There's no Network setup; the network parameters are deserialised from the model file. It then runs the network's feed_forward function once per image to guess the digit it contains.
// cargo run --bin test --release
test_data
    .images()
    .iter()
    .zip(test_data.labels().iter())
    .for_each(|(image, label)| {
        let output = network.feed_forward(Matrix::new(784, 1, image.data.clone()));
        let predicted = get_actual_digit(&output);
        let actual = get_actual_digit(label);
        confusion_matrix[actual][predicted] += 1;
        progress_bar.inc(1);
    });
Most of the code in test is for the progress bar and to print out the pretty statistics to show how well the model did at guessing the digits.
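For the record, get_actual_digit is essentially an argmax: the index of the largest of the ten values is the digit. A sketch of that idea:

// The idea behind get_actual_digit: the index of the largest value is the digit.
fn argmax(values: &[f64]) -> usize {
    values
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(index, _)| index)
        .unwrap_or(0)
}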
MNIST
The rest of the extra code is to do with loading the MNIST data. It's a binary file that contains single-channel (grayscale) uncompressed image data, so it's pretty straightforward to parse. The values of the 784 pixels (28x28) become the input layer of the network.
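For reference, the image file is a big-endian header of four u32s (magic number, image count, rows, columns) followed by one unsigned byte per pixel. Here's a rough sketch of a reader, not the repo's loader; scaling the pixel bytes down to 0–1 is the usual preparation for a sigmoid network.

use std::fs;

fn read_images(path: &str) -> std::io::Result<Vec<Vec<f64>>> {
    let bytes = fs::read(path)?;
    // Big-endian u32 at a given offset in the header.
    let read_u32 = |offset: usize| u32::from_be_bytes(bytes[offset..offset + 4].try_into().unwrap());
    let count = read_u32(4) as usize;
    let pixels = (read_u32(8) * read_u32(12)) as usize; // 28 * 28 = 784
    Ok((0..count)
        .map(|i| {
            let start = 16 + i * pixels; // header is 16 bytes
            bytes[start..start + pixels]
                .iter()
                .map(|&p| p as f64 / 255.0) // scale each byte to 0.0..=1.0
                .collect()
        })
        .collect())
}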
Turns out coding an artificial neural network is pretty straightforward. We can easily extend this to have more activation functions and handle more hidden layers (we can already specify the number of nodes in a layer). And there is plenty of scope for optimisation. The Matrix maths can be improved, I'm using clone3 everywhere, and there are improvements that can be made to the training algorithms. While writing this I spotted a couple of unnecessary operations.
If you want to learn how to code a Rust neural network yourself I suggest you check out this commit af1ca648c310. This is the code after I cleaned up the original repo but before I started adding the MNIST code. It's much simpler at this point. If you want to learn by building something yourself maybe branch from here.
I am following the book "Programming Machine Learning" by Paolo Perrotta. https://pragprog.com/titles/pplearn/programming-machine-learning/
Gradient descent is an optimisation algorithm used in machine learning to minimise the error of a model. Imagine you are at the top of a hill and want to find the lowest point in the valley below. Gradient descent helps you do this by calculating the slope of the hill (the gradient) at your current position. It then takes a small step downhill in the direction of the steepest slope. By repeating this process, taking steps based on the slope, you gradually move closer to the lowest point, which represents the best set of parameters for your model. The size of the steps can be adjusted, and this is known as the learning rate. With each iteration, the model improves its predictions by reducing the error.
If you have looked at the source code for the neural network and you know a bit of Rust you might be wondering why aren't there any 'a's (lifetimes) anywhere? The answer is clone. Yes, clone is bad but I just wanted to get something working quickly. I think this is an approach you should take too if you are new to Rust. Make sure you come back later and clean them up if they are causing issues.