Machine Learning Part 3
In Machine Learning Part 2 I talked about how the Rust neural network code works, but I forgot to include the stats that show how well it works. They were in my original post, but here they are again for reference. Training took 10 minutes and 13 seconds for 30 epochs and finished with 98.40% accuracy on the training set. Here are the test stats.
Confusion Matrix:

          Predicted
Actual    0     1     2     3     4     5     6     7     8     9
------+----------------------------------------------------------
  0   |  971     0     0     1     0     1     2     2     2     1
  1   |    0  1123     2     1     0     1     4     1     3     0
  2   |    9     3   996     6     1     0     1     9     7     0
  3   |    1     0     6   979     0     7     1     6     5     5
  4   |    0     0     2     0   958     0     6     0     0    16
  5   |    4     2     0     6     1   866     5     1     3     4
  6   |    7     3     0     0     4     9   932     0     3     0
  7   |    2    10    11     1     3     1     0   990     2     8
  8   |    5     1     1     3     4     5     5     2   945     3
  9   |    8     5     0     7    12     1     1     6     4   965
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.1% | 96.4% | 99.1% | 97.7%
1 | 98.9% | 97.9% | 98.9% | 98.4%
2 | 96.5% | 97.8% | 96.5% | 97.2%
3 | 96.9% | 97.5% | 96.9% | 97.2%
4 | 97.6% | 97.5% | 97.6% | 97.5%
5 | 97.1% | 97.2% | 97.1% | 97.1%
6 | 97.3% | 97.4% | 97.3% | 97.3%
7 | 96.3% | 97.3% | 96.3% | 96.8%
8 | 97.0% | 97.0% | 97.0% | 97.0%
9 | 95.6% | 96.3% | 95.6% | 96.0%
Overall Accuracy: 97.25%
The accuracy is a little lower than on the training set, which is expected. The model has the most difficulty with 7s and 9s.
In this post I'm going to show some enhancements:
- Softmax activation function
- Support for arbitrary numbers of hidden layers
- Network configuration file
Softmax
The original code had a single activation function. I have added a second: the Softmax function. A softmax function transforms a vector of raw scores into values that sum to one, allowing them to be interpreted as probabilities of each class. Sigmoid, on the other hand, squashes each value independently into the range (0, 1), so the outputs don't sum to one. Softmax also makes backpropagation easier, but I don't understand why yet. Sigmoid and softmax are both classification functions: they push values towards high or low instead of producing linear results, and sigmoid's curve is the familiar "S" shape. They are used on problems where you want to categorise or label a result.
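To make the shape of the calculation concrete, here is a minimal softmax sketch; this is an illustration, not the exact code from my implementation. Note that every output depends on all of the inputs, which is why it costs more than an element-wise function like sigmoid.

fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating for numerical stability;
    // the shift cancels out when the values are normalised.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    // Normalise so the outputs sum to one.
    exps.iter().map(|e| e / sum).collect()
}

(For what it's worth, the usual explanation for why softmax makes backpropagation easier is that, when paired with a cross-entropy loss, the output-layer gradient simplifies to just prediction minus target.)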
Here are the results using sigmoid on the hidden layer and softmax on the output layer. First, training.
Loading network configuration...
Creating network...
Training network...
Epoch 1 (32.89s): Average Error = 0.151774, Accuracy = 89.74%
Epoch 2 (33.08s): Average Error = 0.079143, Accuracy = 94.77%
Epoch 3 (33.09s): Average Error = 0.055417, Accuracy = 96.38%
...
Epoch 27 (32.82s): Average Error = 0.000539, Accuracy = 99.99%
Epoch 28 (32.91s): Average Error = 0.000473, Accuracy = 100.00%
Epoch 29 (32.95s): Average Error = 0.000417, Accuracy = 100.00%
Epoch 30 (33.36s): Average Error = 0.000369, Accuracy = 100.00%
Total training time: 16m 34s (994.98s)
Average time per epoch: 33s (33.17s)
Saving trained network...
Boom, 100%!
As you can imagine, softmax is a more expensive calculation because every output depends on all of its inputs (it has to normalise over the whole vector), and it shows in the time: 16 minutes 34 seconds versus about 10 minutes with just sigmoid.
One thing to note here is that it hit almost 90% on the first epoch. This surprised and bothered me quite a lot; it doesn't make a lot of sense that it was that accurate straight out of the gate. More on this a bit later. Here are the test stats.
Confusion Matrix:

          Predicted
Actual    0     1     2     3     4     5     6     7     8     9
------+----------------------------------------------------------
  0   |  972     0     2     1     0     3     0     2     0     0
  1   |    0  1125     3     1     0     1     2     1     2     0
  2   |    4     3  1012     1     1     1     1     4     4     1
  3   |    1     0     5   988     0     5     0     2     2     7
  4   |    0     0     2     1   968     0     4     1     0     6
  5   |    2     0     0     7     0   877     3     1     1     1
  6   |    6     3     2     0     3     8   935     0     1     0
  7   |    1     2    10     3     5     1     0   999     1     6
  8   |    4     1     6     6     5     6     1     3   938     4
  9   |    4     3     0     3     9     4     0     4     1   981
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.2% | 97.8% | 99.2% | 98.5%
1 | 99.1% | 98.9% | 99.1% | 99.0%
2 | 98.1% | 97.1% | 98.1% | 97.6%
3 | 97.8% | 97.7% | 97.8% | 97.8%
4 | 98.6% | 97.7% | 98.6% | 98.1%
5 | 98.3% | 96.8% | 98.3% | 97.6%
6 | 97.6% | 98.8% | 97.6% | 98.2%
7 | 97.2% | 98.2% | 97.2% | 97.7%
8 | 96.3% | 98.7% | 96.3% | 97.5%
9 | 97.2% | 97.5% | 97.2% | 97.4%
Overall Accuracy: 97.95%
Despite hitting (almost) 100% on the training data we got barely any improvement on the test data 🤔 This is a case of overfitting. Overfitting in machine learning refers to a modelling error that occurs when a machine learning algorithm "memorises" the training data instead of learning the underlying pattern. This results in a model that performs very well on the training dataset but poorly on test datasets. So after training, the network had a near-perfect score detecting which digit was in an image it had already seen, but when it saw a new set of images it didn't do so well. There are strategies for reducing overfitting. We'll get to those later.
For the Rust folks reading: I changed the ActivationFunction to use dynamic dispatch. This wasn't really necessary, but YOLO. It over-complicated some things without getting rid of the thing I'd hoped it would, i.e. testing for the type of ActivationFunction. Using a dynamic trait removed one check, but there's still one in calculate_gradients:

if matches!(activation.activation_type(), ActivationType::Softmax) {
    ...
}

There's a cost to dynamic dispatch, but I'll worry about that later. The ActivationFunction is declared with the dyn keyword.

// In the Network struct
activations: Vec<Box<dyn ActivationFunction>>,

// As a function argument
activation: &dyn ActivationFunction
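To make this concrete, here's roughly the shape such a trait could take. This is a sketch for illustration; apart from activation_type, the method names are my assumptions rather than the actual code.

// Tag for each concrete activation (this enum appears in the check above).
enum ActivationType { Sigmoid, Softmax }

trait ActivationFunction {
    // Vector in, vector out, so an element-wise function like sigmoid and a
    // whole-layer function like softmax can share one interface.
    fn activate(&self, inputs: &[f64]) -> Vec<f64>;
    // Type tag, still needed for the softmax special case in calculate_gradients.
    fn activation_type(&self) -> ActivationType;
}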
Layer
While implementing softmax I fixed the layer code. Despite the layer configuration being a vector, the system only worked properly for a single hidden layer. I fixed that, and now you can specify any number of hidden layers and the activation function for each layer.
Configuration
It was becoming a bit annoying re-compiling every time I changed a network parameter, so I added a configuration that can be loaded from a file.
pub struct NetworkConfig {
    /// Sizes of each layer in the network, including input and output layers.
    /// For example, `[784, 128, 10]` represents a network with:
    /// - 784 input neurons
    /// - 128 hidden neurons
    /// - 10 output neurons
    pub layers: Vec<usize>,

    /// Activation types for each layer transition.
    /// The length should be one less than the number of layers.
    /// Each activation function is applied to the output of its corresponding layer.
    pub activations: Vec<ActivationType>,

    /// Learning rate for gradient descent.
    /// Controls how much the weights are adjusted during training.
    pub learning_rate: f64,

    /// Optional momentum coefficient for gradient descent.
    /// When specified, helps accelerate training and avoid local minima.
    pub momentum: Option<f64>,

    /// Number of training epochs.
    /// One epoch represents one complete pass through the training dataset.
    pub epochs: usize,
}
And a sample config.json:
{
    "layers": [784, 200, 10],
    "activations": ["Sigmoid", "Softmax"],
    "learning_rate": 0.01,
    "epochs": 30
}
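Loading it is then just a few lines. Here's a minimal sketch, assuming NetworkConfig derives serde's Deserialize and serde_json is a dependency; the actual loading code may differ.

use std::fs;

// Read the JSON file and deserialize it into the NetworkConfig struct.
fn load_config(path: &str) -> Result<NetworkConfig, Box<dyn std::error::Error>> {
    let contents = fs::read_to_string(path)?;
    let config: NetworkConfig = serde_json::from_str(&contents)?;
    Ok(config)
}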
I made a couple of other miscellaneous changes.
Test
Previously the test binary was running the feed_forward function to predict the digits. I simplified this a bit by creating a predict function which doesn't collect the outputs of the intermediate layers. They aren't needed because we're not running the backpropagation step; we only want the final output layer.
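To illustrate the idea (this is a toy sketch over plain Vecs, not the actual Matrix-based code), a forward pass that only keeps the final layer can be a fold that discards each intermediate result as it goes:

// Toy forward pass: each fold step consumes the previous layer's activations,
// so intermediate outputs are dropped along the way (biases omitted for brevity).
fn predict(layers: &[Vec<Vec<f64>>], input: Vec<f64>) -> Vec<f64> {
    layers.iter().fold(input, |prev, weights| {
        weights
            .iter()
            .map(|neuron| {
                let z: f64 = neuron.iter().zip(&prev).map(|(w, a)| w * a).sum();
                1.0 / (1.0 + (-z).exp()) // sigmoid activation
            })
            .collect()
    })
}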
Save images
I wrote a binary (save_mnist_images) that saves the first 5 images in the training set and test set to PNGs so I had something to check visually. The file name includes the model's prediction.
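The conversion itself is simple. Here's a minimal sketch of saving one 28x28 digit, assuming the image crate (the binary may do it differently):

use image::GrayImage;

// Save one 28x28 MNIST digit as a PNG, with the prediction in the file name.
// `pixels` holds the 784 grayscale bytes for the image.
fn save_digit(pixels: Vec<u8>, index: usize, prediction: u8) -> Result<(), image::ImageError> {
    let img = GrayImage::from_raw(28, 28, pixels)
        .expect("expected 784 bytes for a 28x28 image");
    img.save(format!("digit_{index}_pred_{prediction}.png"))
}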
If you want to look at the code after these changes, check out this commit 7233f2c5c6af
First Epoch Accuracy
So, the first-epoch accuracy. This didn't make sense to me. Because the weights are initialised with random numbers, I'd expect the first-epoch accuracy to be around 10%. How can it be ~90% accurate first time through? It was either a bug in the training or a bug in the accuracy calculation. I spent a lot of time trying to figure out how this was the case. I even wrote a program to pick a random image from the dataset, save it as a PNG so I could look at it, and output which digit the network thought it was. It got it right every time.
But I think what is happening is both a problem and a feature. This is the train method:
for epoch in 1..=epochs {
    let epoch_start = std::time::Instant::now();
    let mut total_error = 0.0;
    let mut correct_predictions = 0;
    let total_samples = inputs.len();

    // Run the forward and backward pass once per image.
    inputs.iter().zip(&targets).for_each(|(input, target)| {
        let outputs = self.feed_forward(Matrix::from(input.clone()));
        let error = &Matrix::from(target.clone()) - &outputs;
        ...
        self.back_propagate(outputs, Matrix::from(target.clone()));
    });
}
Inside each epoch there is another loop that runs the network on each image individually, not on a matrix of the entire image set; the input is a 784 x 1 matrix. It's also calculating the error for each image and feeding that into back_propagate. So for one epoch it's actually running the network, and updating the weights, 60,000 times. That explains the first-epoch accuracy: by the time the epoch's accuracy is tallied, the network has already had tens of thousands of training updates. The Python and numpy code in the book is clearly processing the entire training set (a single 784 x 60,000 matrix) in a single operation, but when I was writing the Rust code I completely missed this. I'm not using a matrix library, so reasoning about and coding individual 784 x 1 matrix operations was simpler, and I never went back and reviewed my code against the code in the book.
The code should probably look like this, without the internal loop:
for epoch in 1..=epochs {
    ...
    let outputs = self.feed_forward(Matrix::from(inputs)); // <---- plural inputs
    let error = &Matrix::from(targets) - &outputs;
    ...
    self.back_propagate(outputs, Matrix::from(targets)); // <---- plural targets
}
I'm sure that if I switch to a purpose-built linear algebra library like ndarray and combine all the inputs into a single matrix, I'll get a big performance improvement per epoch. But what will happen to the accuracy? I guess we'll have to wait to find out.