
Machine Learning Part 5

After the ndarray diversion, I’m back to following the book¹.

This time I’m implementing mini-batch gradient descent and dataset standardisation.²

Mini-Batch Gradient Descent

Mini-batch gradient descent is an optimisation technique that, instead of using the entire dataset for each update, splits the data into small subsets (mini-batches), computes the gradient on each one, and updates the weights. Apparently this approach “balances the efficiency of stochastic gradient descent with the stability of batch gradient descent”.

At least, I think I have implemented it. The gradient descent code is the most complicated part of the neural network, so changing it has had the biggest impact on the code. I mentioned at the end of Part 3 that there was something odd going on with the training loop. It seems that instead of doing batch gradient descent, where you pass the entire training set through the system in one go, I was accidentally doing stochastic gradient descent, where you process each sample one by one. I now have a prepare_mini_batches function that chunks the inputs and targets into batch_size groups, which then go through the feed-forward and back-propagation loop, as sketched below. But I’m still processing each image in the batch individually. The algorithm is a mess at the moment 🙁
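
Roughly, the chunking step looks like this. It’s a simplified sketch rather than my actual code: I’ve swapped the matrix types for plain Vec<f64>s so it stands alone, but the idea is the same.

fn prepare_mini_batches<'a>(
    inputs: &'a [Vec<f64>],
    targets: &'a [Vec<f64>],
    batch_size: usize,
) -> Vec<(&'a [Vec<f64>], &'a [Vec<f64>])> {
    // chunks() yields slices of up to batch_size elements; zipping the two
    // iterators keeps each image paired with its one-hot target.
    inputs
        .chunks(batch_size)
        .zip(targets.chunks(batch_size))
        .collect()
}

The training loop then walks over these pairs, feeding each batch forward and back-propagating it before the next weight update.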

Here are the stats³:

Training network...
  [00:03:11] [################################################################################] 30/30 epochs | Accuracy: Training completed in 191.261431625s                       
Total training time: 3m 11s (191.37s)
Average time per epoch: 6s (6.38s)

Confusion Matrix:
      Predicted →
Actual     0    1    2    3    4    5    6    7    8    9
  ↓   +--------------------------------------------------
  0   |  967    0    2    2    0    5    1    2    1    0
  1   |    0 1117    3    2    0    1    4    3    5    0
  2   |    5    1 1004    4    3    0    3    5    6    1
  3   |    0    0    5  985    1    7    0    6    3    3
  4   |    1    0    2    1  951    0    5    2    2   18
  5   |    5    1    0    8    1  867    4    0    3    3
  6   |    6    3    2    0    5    7  933    0    2    0
  7   |    0    4   12    7    1    1    0  991    1   11
  8   |    4    1    5    7    4    4    4    3  936    6
  9   |    2    4    0    7    8    3    0    4    0  981

Per-digit Metrics:
Digit  | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |  98.7%   |   97.7%   |  98.7%  |   98.2%
   1   |  98.4%   |   98.8%   |  98.4%  |   98.6%
   2   |  97.3%   |   97.0%   |  97.3%  |   97.1%
   3   |  97.5%   |   96.3%   |  97.5%  |   96.9%
   4   |  96.8%   |   97.6%   |  96.8%  |   97.2%
   5   |  97.2%   |   96.9%   |  97.2%  |   97.0%
   6   |  97.4%   |   97.8%   |  97.4%  |   97.6%
   7   |  96.4%   |   97.5%   |  96.4%  |   97.0%
   8   |  96.1%   |   97.6%   |  96.1%  |   96.8%
   9   |  97.2%   |   95.9%   |  97.2%  |   96.6%

Overall Accuracy: 97.32%

3m 11s! That’s 2.6 times faster than the ndarray results.

I don’t know why this has sped up each epoch and the overall runtime. We are still processing all of the images, so it’s not as if we have skipped a third of the batches. All the literature I have read talks about speeding up convergence, which I understand: mini-batches average the error over small groups of samples, leading to less variance, so you don’t accidentally make the accuracy worse within an epoch, which is possible when you process each input separately. But I have no idea why it would also speed up the execution, so I need to spend some time figuring it out.

Standardisation

As per the footnote, standardising the training data is the process of scaling all your input variables so that they fall within a similar range. You want to do this so that no single variable has an overwhelming impact on the training. The book has a good explanation of this:

For example, the height of humans has a relatively low standard deviation because nobody is hundreds of times taller than anyone else. On the other hand, the height of plants has a high standard deviation because a plant can be as short as moss, or as tall as a redwood.

So if your training set has both tree and human heights, then multiplying the weights by a 10m-tall tree is going to obscure the impact of multiplying the weights by a 1.7m-tall human.
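
Standardisation fixes this by rescaling each input variable to a mean of 0 and a standard deviation of 1 (see footnote 2). Here is a minimal sketch of the idea, working on a single feature at a time rather than on whole matrices like my real code:

fn standardise(feature: &mut [f64]) {
    let n = feature.len() as f64;
    let mean = feature.iter().sum::<f64>() / n;
    let variance = feature.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    // Guard against constant features (e.g. pixels that are always black),
    // which have a standard deviation of zero.
    if std_dev > 0.0 {
        for x in feature.iter_mut() {
            *x = (*x - mean) / std_dev;
        }
    }
}

One caveat worth remembering: the mean and standard deviation should be computed from the training set only, and then reused to scale the test set.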

This isn’t really necessary for the MNIST dataset because each input node is a pixel, so it is in the range 0 to 255 (or 0.0 to 1.0); the inputs are already mostly standardised. I have implemented it anyway because it might be useful when I start to work on other datasets. Standardising will reduce the impact of all the black pixels (zeros), though. As expected, standardisation didn’t change the runtime or the training accuracy for this dataset.


  1. Programming Machine Learning by Paolo Perrotta.

  2. Dataset standardisation in machine learning is the process of transforming the features of a dataset so that they have a mean of 0 and a standard deviation of 1. This ensures that each feature contributes equally to the model’s performance, particularly when different features are measured on different scales. Standardisation helps improve the convergence of optimisation algorithms and can enhance the overall accuracy of the model by preventing certain features from disproportionately influencing the results.

  3. Yes, I have changed the progress indicator again.

#ai #machine_learning #rust