Machine Learning Part 3
In Machine Learning Part 2 I talked about how the Rust neural network code works, but I forgot to include the stats that show how well it works. They were in my original post, but here they are again for reference. Training took 10 minutes and 13 seconds for 30 epochs and finished with 98.40% accuracy on the training set. Here are the test stats.
Confusion Matrix:

          Predicted
Actual    0     1     2     3     4     5     6     7     8     9
------+----------------------------------------------------------
  0   |  971     0     0     1     0     1     2     2     2     1
  1   |    0  1123     2     1     0     1     4     1     3     0
  2   |    9     3   996     6     1     0     1     9     7     0
  3   |    1     0     6   979     0     7     1     6     5     5
  4   |    0     0     2     0   958     0     6     0     0    16
  5   |    4     2     0     6     1   866     5     1     3     4
  6   |    7     3     0     0     4     9   932     0     3     0
  7   |    2    10    11     1     3     1     0   990     2     8
  8   |    5     1     1     3     4     5     5     2   945     3
  9   |    8     5     0     7    12     1     1     6     4   965
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.1% | 96.4% | 99.1% | 97.7%
1 | 98.9% | 97.9% | 98.9% | 98.4%
2 | 96.5% | 97.8% | 96.5% | 97.2%
3 | 96.9% | 97.5% | 96.9% | 97.2%
4 | 97.6% | 97.5% | 97.6% | 97.5%
5 | 97.1% | 97.2% | 97.1% | 97.1%
6 | 97.3% | 97.4% | 97.3% | 97.3%
7 | 96.3% | 97.3% | 96.3% | 96.8%
8 | 97.0% | 97.0% | 97.0% | 97.0%
9 | 95.6% | 96.3% | 95.6% | 96.0%
Overall Accuracy: 97.25%
The accuracy is a little lower than on the training set, which is expected. The model has the most difficulty with 7s and 9s.
In this post I'm going to show some enhancements:
- Softmax activation function
- Support for arbitrary numbers of hidden layers
- Network configuration file
Softmax
The original code had a single activation function. I have added a second: the Softmax function. A softmax function transforms a vector of raw scores into values that sum to one, allowing them to be interpreted as probabilities of each class. Sigmoid, on the other hand, squashes each value independently into the range (0, 1), so the outputs don't sum to one. Softmax also makes backpropagation easier, but I don't understand why yet. Sigmoid and softmax are both classification functions: they push values towards high or low instead of producing linear results, and sigmoid's curve is the familiar "S" shape. They are used on problems where you want to categorise or label a result.
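To make the shape of the calculation concrete, here is a minimal softmax sketch; this is an illustration, not the exact code from my implementation. Note that every output depends on all of the inputs, which is why it costs more than an element-wise function like sigmoid.

fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating for numerical stability;
    // the shift cancels out when the values are normalised.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    // Normalise so the outputs sum to one.
    exps.iter().map(|e| e / sum).collect()
}

(For what it's worth, the usual explanation for why softmax makes backpropagation easier is that, when paired with a cross-entropy loss, the output-layer gradient simplifies to just prediction minus target.)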
Here are the results using sigmoid on the hidden layer and softmax on the output layer. First, training.
Loading network configuration...
Creating network...
Training network...
Epoch 1 (32.89s): Average Error = 0.151774, Accuracy = 89.74%
Epoch 2 (33.08s): Average Error = 0.079143, Accuracy = 94.77%
Epoch 3 (33.09s): Average Error = 0.055417, Accuracy = 96.38%
...
Epoch 27 (32.82s): Average Error = 0.000539, Accuracy = 99.99%
Epoch 28 (32.91s): Average Error = 0.000473, Accuracy = 100.00%
Epoch 29 (32.95s): Average Error = 0.000417, Accuracy = 100.00%
Epoch 30 (33.36s): Average Error = 0.000369, Accuracy = 100.00%
Total training time: 16m 34s (994.98s)
Average time per epoch: 33s (33.17s)
Saving trained network...
Boom, 100%!
As you can imagine, softmax is a more expensive calculation because every output depends on all of its inputs (it has to normalise over the whole vector), and it shows in the time: 16 minutes 34 seconds versus about 10 minutes with just sigmoid.
One thing to note here is that it hit almost 90% on the first epoch. This surprised and bothered me quite a lot; it doesn't make a lot of sense that it was that accurate straight out of the gate. More on this a bit later. Here are the test stats.
Confusion Matrix:

          Predicted
Actual    0     1     2     3     4     5     6     7     8     9
------+----------------------------------------------------------
  0   |  972     0     2     1     0     3     0     2     0     0
  1   |    0  1125     3     1     0     1     2     1     2     0
  2   |    4     3  1012     1     1     1     1     4     4     1
  3   |    1     0     5   988     0     5     0     2     2     7
  4   |    0     0     2     1   968     0     4     1     0     6
  5   |    2     0     0     7     0   877     3     1     1     1
  6   |    6     3     2     0     3     8   935     0     1     0
  7   |    1     2    10     3     5     1     0   999     1     6
  8   |    4     1     6     6     5     6     1     3   938     4
  9   |    4     3     0     3     9     4     0     4     1   981
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.2% | 97.8% | 99.2% | 98.5%
1 | 99.1% | 98.9% | 99.1% | 99.0%
2 | 98.1% | 97.1% | 98.1% | 97.6%
3 | 97.8% | 97.7% | 97.8% | 97.8%
4 | 98.6% | 97.7% | 98.6% | 98.1%
5 | 98.3% | 96.8% | 98.3% | 97.6%
6 | 97.6% | 98.8% | 97.6% | 98.2%
7 | 97.2% | 98.2% | 97.2% | 97.7%
8 | 96.3% | 98.7% | 96.3% | 97.5%
9 | 97.2% | 97.5% | 97.2% | 97.4%
Overall Accuracy: 97.95%
Despite hitting (almost) 100% on the training data we got barely any improvement on the test data 🤔 This is a case of overfitting. Overfitting in machine learning refers to a modelling error that occurs when a machine learning algorithm "memorises" the training data instead of learning the underlying pattern. This results in a model that performs very well on the training dataset but poorly on test datasets. So after training, the network had a near-perfect score detecting which digit was in an image it had already seen, but when it saw a new set of images it didn't do so well. There are strategies for reducing overfitting. We'll get to those later.
For the Rust folks reading: I changed the ActivationFunction to use dynamic dispatch. This wasn't really necessary, but YOLO. It over-complicated some things without getting rid of the thing I'd hoped it would, i.e. testing for the type of ActivationFunction. Using a dynamic trait removed one check, but there's still one in calculate_gradients:

if matches!(activation.activation_type(), ActivationType::Softmax) {
    ...
}

There's a cost to dynamic dispatch, but I'll worry about that later. The ActivationFunction is declared with the dyn keyword.

// In the Network struct
activations: Vec<Box<dyn ActivationFunction>>,

// As a function argument
activation: &dyn ActivationFunction
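To make this concrete, here's roughly the shape such a trait could take. This is a sketch for illustration; apart from activation_type, the method names are my assumptions rather than the actual code.

// Tag for each concrete activation (this enum appears in the check above).
enum ActivationType { Sigmoid, Softmax }

trait ActivationFunction {
    // Vector in, vector out, so an element-wise function like sigmoid and a
    // whole-layer function like softmax can share one interface.
    fn activate(&self, inputs: &[f64]) -> Vec<f64>;
    // Type tag, still needed for the softmax special case in calculate_gradients.
    fn activation_type(&self) -> ActivationType;
}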
Layer
While implementing softmax I fixed the layer code. Despite the layer configuration being a vector, the system only worked properly for a single hidden layer. I fixed that, and now you can specify any number of hidden layers and the activation function for each layer.
Configuration
It was becoming a bit annoying re-compiling every time I changed a network parameter, so I added a configuration that can be loaded from a file.
pub struct NetworkConfig {
    /// Sizes of each layer in the network, including input and output layers.
    /// For example, `[784, 128, 10]` represents a network with:
    /// - 784 input neurons
    /// - 128 hidden neurons
    /// - 10 output neurons
    pub layers: Vec<usize>,

    /// Activation types for each layer transition.
    /// The length should be one less than the number of layers.
    /// Each activation function is applied to the output of its corresponding layer.
    pub activations: Vec<ActivationType>,

    /// Learning rate for gradient descent.
    /// Controls how much the weights are adjusted during training.
    pub learning_rate: f64,

    /// Optional momentum coefficient for gradient descent.
    /// When specified, helps accelerate training and avoid local minima.
    pub momentum: Option<f64>,

    /// Number of training epochs.
    /// One epoch represents one complete pass through the training dataset.
    pub epochs: usize,
}
And a sample config.json:
{
    "layers": [784, 200, 10],
    "activations": ["Sigmoid", "Softmax"],
    "learning_rate": 0.01,
    "epochs": 30
}
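Loading it is then just a few lines. Here's a minimal sketch, assuming NetworkConfig derives serde's Deserialize and serde_json is a dependency; the actual loading code may differ.

use std::fs;

// Read the JSON file and deserialize it into the NetworkConfig struct.
fn load_config(path: &str) -> Result<NetworkConfig, Box<dyn std::error::Error>> {
    let contents = fs::read_to_string(path)?;
    let config: NetworkConfig = serde_json::from_str(&contents)?;
    Ok(config)
}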
I made a couple of other miscellaneous changes.
Test
Previously the test binary was running the feed_forward function to predict the digits. I simplified this a bit by creating a predict function which doesn't collect the outputs of the intermediate layers. They aren't needed because we're not running the backpropagation step; we only want the final output layer.
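To illustrate the idea (this is a toy sketch over plain Vecs, not the actual Matrix-based code), a forward pass that only keeps the final layer can be a fold that discards each intermediate result as it goes:

// Toy forward pass: each fold step consumes the previous layer's activations,
// so intermediate outputs are dropped along the way (biases omitted for brevity).
fn predict(layers: &[Vec<Vec<f64>>], input: Vec<f64>) -> Vec<f64> {
    layers.iter().fold(input, |prev, weights| {
        weights
            .iter()
            .map(|neuron| {
                let z: f64 = neuron.iter().zip(&prev).map(|(w, a)| w * a).sum();
                1.0 / (1.0 + (-z).exp()) // sigmoid activation
            })
            .collect()
    })
}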
Save images
I wrote a binary (save_mnist_images) that saves the first 5 images in the training set and test set to PNGs so I had something to check visually. The file name includes the model's prediction.
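The conversion itself is simple. Here's a minimal sketch of saving one 28x28 digit, assuming the image crate (the binary may do it differently):

use image::GrayImage;

// Save one 28x28 MNIST digit as a PNG, with the prediction in the file name.
// `pixels` holds the 784 grayscale bytes for the image.
fn save_digit(pixels: Vec<u8>, index: usize, prediction: u8) -> Result<(), image::ImageError> {
    let img = GrayImage::from_raw(28, 28, pixels)
        .expect("expected 784 bytes for a 28x28 image");
    img.save(format!("digit_{index}_pred_{prediction}.png"))
}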
If you want to look at the code after these changes, check out this commit 7233f2c5c6af
First Epoch Accuracy
So, the first-epoch accuracy. This didn't make sense to me. Because the weights are initialised with random numbers, I'd expect the first-epoch accuracy to be around 10%. How can it be ~90% accurate first time through? It was either a bug in the training or a bug in the accuracy calculation. I spent a lot of time trying to figure out how this was the case. I even wrote a program to pick a random image from the dataset, save it as a PNG so I could look at it, and output which digit the network thought it was. It got it right every time.
But I think what is happening is both a problem and a feature. This is the train method:
for epoch in 1..=epochs {
    let epoch_start = std::time::Instant::now();
    let mut total_error = 0.0;
    let mut correct_predictions = 0;
    let total_samples = inputs.len();

    // Run the forward and backward pass once per image.
    inputs.iter().zip(&targets).for_each(|(input, target)| {
        let outputs = self.feed_forward(Matrix::from(input.clone()));
        let error = &Matrix::from(target.clone()) - &outputs;
        ...
        self.back_propagate(outputs, Matrix::from(target.clone()));
    });
}
Inside each epoch there is another loop that runs the network on each image individually, not on a matrix of the entire image set; the input is a 784 x 1 matrix. It's also calculating the error for each image and feeding that into back_propagate. So for one epoch it's actually running the network, and updating the weights, 60,000 times. That explains the first-epoch accuracy: by the time the epoch's accuracy is tallied, the network has already had tens of thousands of training updates. The Python and numpy code in the book is clearly processing the entire training set (a single 784 x 60,000 matrix) in a single operation, but when I was writing the Rust code I completely missed this. I'm not using a matrix library, so reasoning about and coding individual 784 x 1 matrix operations was simpler, and I never went back and reviewed my code against the code in the book.
The code should probably look like this, without the internal loop:
for epoch in 1..=epochs {
    ...
    let outputs = self.feed_forward(Matrix::from(inputs)); // <---- plural inputs
    let error = &Matrix::from(targets) - &outputs;
    ...
    self.back_propagate(outputs, Matrix::from(targets)); // <---- plural targets
}
I'm sure that if I switch to a purpose-built linear algebra library like ndarray and combine all the inputs into a single matrix, I'll get a big performance improvement per epoch. But what will happen to the accuracy? I guess we'll have to wait to find out.