Machine Learning Part 6
As I have mentioned several times, there is a problem with the training algorithm: I have been processing each digit one by one. Instead, I should have been leveraging matrix math and processing batches of images in a single operation. It's now time to tackle this problem. But first, some bike shedding / procrastination: I gave NetworkConfig a work-over.
Layers
The way layers are defined in the network configuration has, up to now, been a bit meh. There has been a list of "layers", which is just a count of the neurons in each layer, and a separate list of activation functions.
pub struct NetworkConfig {
/// Sizes of each layer in the network, including input and output layers.
/// For example, `[784, 128, 10]` represents a network with:
/// - 784 input neurons
/// - 128 hidden neurons
/// - 10 output neurons
pub layers: Vec<usize>,
/// Activation types for each layer transition.
/// The length should be one less than the number of layers.
/// Each activation function is applied to the output of its corresponding layer.
pub activations: Vec<ActivationType>,
...
These have been combined. There is now a Layer struct that has the neuron count (called nodes, for some reason) and the activation function (ActivationType) for a layer.
pub struct Layer {
/// The number of nodes (neurons) in this layer.
pub nodes: usize,
/// The activation function applied to the output of this layer.
/// If `None`, no activation function is applied (e.g., for the output layer).
pub activation: Option<ActivationType>,
}
activation is an Option because the last layer, the output, doesn't have an activation function. I also added impl fmt::Display for Layer so layers are printed nicely; see the stats output later in this post. NetworkConfig now has a list of Layers
pub struct NetworkConfig {
/// A collection of [`Layer`]s in the network, including input and output layers.
/// For example,
/// vec![Layer { nodes: 784, activation: Some(ActivationType::Sigmoid) },
/// Layer { nodes: 128, activation: Some(ActivationType::Sigmoid) },
/// Layer { nodes: 10, activation: None }]
/// represents a network with:
/// - 784 input neurons (e.g., 28x28 pixels) with sigmoid activation
/// - 128 hidden neurons with sigmoid activation
/// - 10 output neurons with no activation
pub layers: Vec<Layer>,
...
The network configuration JSON file now looks like this
{
"layers": [
{ "nodes": 784, "activation": "Sigmoid" },
{ "nodes": 128, "activation": "Softmax" },
{ "nodes": 10 }
],
"learning_rate": 0.01,
"epochs": 30,
"batch_size": 32
}
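If you were loading a config shaped like this with serde and serde_json, a minimal loader would look roughly like the sketch below; the struct, field, and file names here are placeholders, not necessarily what's in the repo.
use serde::Deserialize;
use std::fs;

// Stand-ins for the real config types.
#[derive(Deserialize, Debug, Clone, Copy)]
enum Activation {
    Sigmoid,
    Softmax,
}

#[derive(Deserialize, Debug)]
struct LayerConfig {
    nodes: usize,
    activation: Option<Activation>, // absent in the JSON => None
}

#[derive(Deserialize, Debug)]
struct TrainingConfig {
    layers: Vec<LayerConfig>,
    learning_rate: f64,
    epochs: usize,
    batch_size: usize,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = fs::read_to_string("network_config.json")?;
    let config: TrainingConfig = serde_json::from_str(&text)?;
    println!("{config:#?}");
    Ok(())
}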
This was a partial refactor. Network still has the original separate lists; NetworkConfig has two functions that Network uses to get the layer info (a sketch of what they might look like is below). I'll add Layer to Network at a later time.
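The exact names may differ, but those two helper functions on NetworkConfig are presumably something along these lines (assuming ActivationType is Clone):
impl NetworkConfig {
    /// Node counts for every layer, in order - the old `layers: Vec<usize>` view.
    pub fn layer_sizes(&self) -> Vec<usize> {
        self.layers.iter().map(|layer| layer.nodes).collect()
    }

    /// Activations for each layer transition - the old `activations` view.
    /// The output layer's `None` is skipped, so this is one shorter than `layers`.
    pub fn layer_activations(&self) -> Vec<ActivationType> {
        self.layers
            .iter()
            .filter_map(|layer| layer.activation.clone())
            .collect()
    }
}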
Newtypes
I have created newtypes for the configuration parameters. Newtypes look like this.
pub struct Momentum(f64);
We wrap the bare type, the f64, in a tuple struct. Then we can either create a constructor and a value accessor, or implement TryFrom/From, which is a bit like a type cast (there's a sketch of the constructor approach further down).
impl TryFrom<f64> for Momentum {
type Error = &'static str;
fn try_from(value: f64) -> Result<Self, Self::Error> {
if (0.0..1.0).contains(&value) {
Ok(Self(value))
} else {
Err("Momentum must be between 0.0 and 1.0")
}
}
}
impl From<Momentum> for f64 {
fn from(momentum: Momentum) -> Self {
momentum.0
}
}
We define functions that convert from one type to the other, using TryFrom for the direction that could fail. We can also add methods to our type just like any other struct. The next trait impl is for the multiplication operator (*); this is how you define an operator in Rust. It always strikes me as a bit weird.
impl Mul<Momentum> for f64 {
type Output = f64;
fn mul(self, momentum: Momentum) -> Self::Output {
self * momentum.0
}
}
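For completeness, the constructor-and-accessor route mentioned earlier would look something like this instead (hypothetical; the code above is what's actually used):
impl Momentum {
    /// Constructor enforcing the same 0.0..1.0 range check.
    pub fn new(value: f64) -> Result<Self, &'static str> {
        if (0.0..1.0).contains(&value) {
            Ok(Self(value))
        } else {
            Err("Momentum must be between 0.0 and 1.0")
        }
    }

    /// Accessor for the wrapped value.
    pub fn value(&self) -> f64 {
        self.0
    }
}
With that you would write Momentum::new(0.5) and momentum.value() instead of the try_from / from pair.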
And you use these new types like this
let momentum: Momentum = Momentum::try_from(0.5).unwrap();
let my_float: f64 = f64::from(momentum);
This does a few things:
- Makes sure you don't create an invalid value. Momentum should be between 0.0 and 1.0; the code will fail if the value is outside this range.
- Makes sure the value you pass into a function is actually the momentum value and not some other f64. Rust doesn't have named arguments, so you could easily pass your arguments in the wrong order (see the sketch after this list).
- Makes sure your type only has the methods that make sense.
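Here's a small made-up example of that second point. With bare f64 arguments the swapped call compiles and silently trains badly; with newtypes it doesn't compile at all (LearningRate is another hypothetical newtype here):
pub struct Momentum(f64);
pub struct LearningRate(f64);

// Velocity update for SGD with momentum (in the real code you'd build
// these values via try_from rather than constructing them directly).
fn update_velocity(velocity: f64, gradient: f64, momentum: Momentum, lr: LearningRate) -> f64 {
    momentum.0 * velocity - lr.0 * gradient
}

fn main() {
    let momentum = Momentum(0.5);
    let lr = LearningRate(0.01);
    let v = update_velocity(0.0, 0.3, momentum, lr); // compiles
    // update_velocity(0.0, 0.3, lr, momentum);      // error[E0308]: mismatched types
    println!("{v}");
}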
This might seem like it's expensive converting between these types all the time, but (I'm pretty sure) the compiler removes all of this. The newtypes are there to ensure I don't mess up, and the compiler, once it's confirmed I haven't, will strip all these type conversions from the executable.
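One quick way to convince yourself the wrapper is free, at least in terms of layout: a tuple struct around an f64 is exactly the size of an f64.
use std::mem::size_of;

pub struct Momentum(f64);

fn main() {
    // The newtype adds no runtime representation of its own.
    assert_eq!(size_of::<Momentum>(), size_of::<f64>());
    println!("Momentum: {} bytes, f64: {} bytes", size_of::<Momentum>(), size_of::<f64>());
}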
This is the secondary benefit of this whole process. I'm getting a deeper understanding of Rust and it might be edging out Ruby as my favourite programming language.
Big Matrices
On to the big problem: processing the images in bulk.
Almost all of the code so far has been dealing with single-column matrices (vectors?). The MNIST code generates a big list of 784x1 matrices, and those get passed around and, as already discussed, eventually processed individually. What I really want are 784x(a large number) matrices that get passed into the feed-forward and back-propagation functions, letting ndarray operate on the multi-column matrices. Mathematically there's (almost) no difference between multiplying a 1-column matrix and a 1000-column one.
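Here's that claim in miniature with ndarray (just an illustration, not code from the project): the same dot call handles one column or a thousand, only the output width changes.
use ndarray::{arr2, Array2};

fn main() {
    // Weights for a 3-neuron layer fed by 2 inputs: shape (3, 2).
    let weights: Array2<f64> = arr2(&[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]);

    // One sample: a 2x1 column matrix.
    let single = arr2(&[[1.0], [2.0]]);
    // A batch of four samples: a 2x4 matrix.
    let batch = arr2(&[[1.0, 0.0, 2.0, 3.0], [2.0, 1.0, 0.0, 1.0]]);

    // Identical call in both cases; only the result shape differs.
    println!("{:?}", weights.dot(&single).dim()); // (3, 1)
    println!("{:?}", weights.dot(&batch).dim());  // (3, 4)
}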
It took quite a bit of head scratching, but in the end it was a fairly simple change. It only really impacted the process_batch function. Rather than iterating over the list of matrices like this
fn process_batch(&mut self, batch_inputs: &[&Matrix], batch_targets: &[&Matrix]) -> (f64, usize, Vec<Matrix>) {
let mut accumulated_gradients = Vec::new();
let mut batch_error = 0.0;
let mut batch_correct = 0;
// Process each sample in the batch
for (input, target) in batch_inputs.iter().zip(batch_targets.iter()) { // <--- BAD!
let outputs = self.feed_forward((*input).clone());
I concatenated the function's inputs into two big matrices and got rid of the inner loop.
fn process_batch(&mut self, batch_inputs: Vec<&Matrix>, batch_targets: Vec<&Matrix>) -> (f64, usize) {
// Combine batch inputs and targets into single matrices
let input_matrix = Matrix::concatenate(&batch_inputs, Axis(1));
let target_matrix = Matrix::concatenate(&batch_targets, Axis(1));
// Feed forward entire batch at once
let outputs = self.feed_forward_batch(input_matrix);
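Matrix here is the project's own type sitting on top of ndarray, and I won't reproduce its concatenate, but the heavy lifting with ndarray is roughly this (a guess at the shape of the helper, not the actual implementation):
use ndarray::{concatenate, Array2, ArrayView2, Axis};

/// Glue a slice of column matrices together side by side along Axis(1).
/// All inputs must have the same number of rows.
fn concatenate_columns(columns: &[&Array2<f64>]) -> Array2<f64> {
    let views: Vec<ArrayView2<f64>> = columns.iter().map(|m| m.view()).collect();
    concatenate(Axis(1), &views).expect("matrices must have the same number of rows")
}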
Then we pass these multi-column matrices through the functions that already took a single-column matrix. For example, feed_forward takes a Matrix as a parameter. Before, the matrix was a single column; now it's a 784xbatch_size matrix. I didn't need to change this code at all.
fn feed_forward(&mut self, inputs: Matrix) -> Matrix {
Eventually they get to ndarray, which figures out how to deal with multi-column matrices. I said in Part 3 that I expected this to improve performance. I was not wrong.
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 128, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 30
Batch Size: 32
Training network...
[00:01:01] [################################################################################] 30/30 epochs | Accuracy: Final accuracy: 100.00%
Training completed in 61.222866792s
Total training time: 1m 1s (61.22s)
Confusion Matrix:
Predicted →
Actual 0 1 2 3 4 5 6 7 8 9
↓ +--------------------------------------------------
0 | 972 0 2 1 0 0 2 1 1 1
1 | 0 1125 2 2 0 1 3 1 1 0
2 | 4 0 1008 7 3 0 0 5 5 0
3 | 0 0 2 996 0 2 0 4 2 4
4 | 2 1 0 0 961 1 5 0 2 10
5 | 3 0 0 7 0 874 2 0 4 2
6 | 3 2 0 1 3 5 941 1 2 0
7 | 2 3 6 4 0 0 0 1007 0 6
8 | 3 1 2 3 2 5 1 3 952 2
9 | 2 2 0 4 9 1 0 4 3 984
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.2% | 98.1% | 99.2% | 98.6%
1 | 99.1% | 99.2% | 99.1% | 99.2%
2 | 97.7% | 98.6% | 97.7% | 98.1%
3 | 98.6% | 97.2% | 98.6% | 97.9%
4 | 97.9% | 98.3% | 97.9% | 98.1%
5 | 98.0% | 98.3% | 98.0% | 98.1%
6 | 98.2% | 98.6% | 98.2% | 98.4%
7 | 98.0% | 98.1% | 98.0% | 98.1%
8 | 97.7% | 97.9% | 97.7% | 97.8%
9 | 97.5% | 97.5% | 97.5% | 97.5%
Overall Accuracy: 98.20%
Processing time is now down to one minute! And after hanging around 97% for this entire series, accuracy on the test set is consistently at 98%.
I said in Part 4 that I saw no evidence of multi-threading. Well, these changes have changed that too: training now gets more than 400% CPU utilisation, so those big matrices are being spread across the CPU cores, which is nice to see. I can also hear it, which is a bit disconcerting. There's a definite electrical hum from my iMac when I run the training.
This is very satisfying for a couple of reasons. When I started this series, the code took 20 minutes to produce the output shown in the first post. And now the code is looking much nicer.