Machine Learning Part 4

In Part 3 of this series I presented some stats for the neural network using a Softmax activation function on the hidden layer. Unfortunately I showed the wrong output: the output I showed was for a network with two hidden layers, not the single hidden layer I have been using for the other stats. The 16m 34s run time and the 100% accuracy were down to the extra hidden layer. These are the actual training and testing statistics.

Training network...
Epoch 1 (19.60s): Average Error = 0.154137, Accuracy = 89.72%
Epoch 2 (19.16s): Average Error = 0.082968, Accuracy = 94.54%
Epoch 3 (19.35s): Average Error = 0.060954, Accuracy = 96.00%
...
Epoch 28 (20.72s): Average Error = 0.001754, Accuracy = 99.96%
Epoch 29 (20.22s): Average Error = 0.001556, Accuracy = 99.97%
Epoch 30 (20.36s): Average Error = 0.001384, Accuracy = 99.97%
Total training time: 10m 11s (611.14s)
Average time per epoch: 20s (20.37s)

Confusion Matrix:
      Predicted →
Actual     0    1    2    3    4    5    6    7    8    9
  ↓   +--------------------------------------------------
  0   |  970    1    0    0    2    2    0    0    2    3
  1   |    0 1128    1    1    0    1    1    1    2    0
  2   |    7    2 1005    2    3    1    1    4    6    1
  3   |    0    1    5  982    1    8    1    2    4    6
  4   |    0    1    2    1  961    0    4    0    1   12
  5   |    4    0    1    7    1  871    4    1    1    2
  6   |    6    2    1    1    5   10  930    0    3    0
  7   |    2    6   12    1    8    0    0  983    3   13
  8   |    4    0    2   10    7    5    1    4  935    6
  9   |    2    4    0    2   12    3    0    4    1  981

Per-digit Metrics:
Digit  | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |  99.0%   |   97.5%   |  99.0%  |   98.2%
   1   |  99.4%   |   98.5%   |  99.4%  |   98.9%
   2   |  97.4%   |   97.7%   |  97.4%  |   97.5%
   3   |  97.2%   |   97.5%   |  97.2%  |   97.4%
   4   |  97.9%   |   96.1%   |  97.9%  |   97.0%
   5   |  97.6%   |   96.7%   |  97.6%  |   97.2%
   6   |  97.1%   |   98.7%   |  97.1%  |   97.9%
   7   |  95.6%   |   98.4%   |  95.6%  |   97.0%
   8   |  96.0%   |   97.6%   |  96.0%  |   96.8%
   9   |  97.2%   |   95.8%   |  97.2%  |   96.5%

Overall Accuracy: 97.46%

The network didn’t hit 100% in training, though it got close, and the test stats were roughly the same. If I run the same code with two Sigmoid functions instead of a Sigmoid and a Softmax it’s about 5% faster. I expected Softmax to be much slower, so its cost might be offset by the easier error calculation. Overall Softmax hasn’t really made much difference ¯\_(ツ)_/¯
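
The easier error calculation is, I think, because Softmax on the output layer is usually paired with a cross-entropy loss, and then the output-layer error term collapses to output minus target, so there is no extra derivative to apply. A minimal sketch of the idea (plain Vec code, not the actual Matrix code from this series):

fn softmax(z: &[f64]) -> Vec<f64> {
    // Subtract the max for numerical stability before exponentiating.
    let max = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = z.iter().map(|v| (v - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|v| v / sum).collect()
}

fn output_error(output: &[f64], target: &[f64]) -> Vec<f64> {
    // With Softmax + cross-entropy the gradient w.r.t. the pre-activation
    // is simply output - target: no activation derivative needed here.
    output.iter().zip(target).map(|(o, t)| o - t).collect()
}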


ndarray

The next change I made was switching to the ndarray crate for the matrix maths. I chose ndarray because that’s what came up first in a web search. I could also have picked nalgebra.

ndarray is a Rust array library that is kind of the Rust equivalent of numpy[1]. Considering they are supposed to be roughly equivalent, my Rust ndarray code looks nothing like the Python numpy code in the book[2].

The changes to the code to support ndarray were surprisingly minor. I was able to mostly keep the same interface to Matrix, so the network and activation code barely changes. The internals of Matrix did change: the struct definition has been simplified, and it’s now just a wrapper around ndarray’s Array2 struct.

// Before
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f64>,
}

// After
pub struct Matrix {
    pub data: Array2<f64>,
}
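
The ndarray-backed methods further down call self.cols() and other.rows(), so the old rows and cols fields become small accessors over the Array2. Something like this sketch (my guess at the shape of the code, not necessarily the exact implementation):

impl Matrix {
    // rows and cols are no longer stored fields; read them off the Array2.
    pub fn rows(&self) -> usize {
        self.data.nrows()
    }

    pub fn cols(&self) -> usize {
        self.data.ncols()
    }
}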

The most obvious changes can be seen in dot_multiply and transpose. The original code looped over each row and column of the matrix.

    pub fn dot_multiply(&self, other: &Matrix) -> Self {
        assert_eq!(
            self.cols, other.rows,
            "Invalid dimensions for matrix multiplication"
        );

        let mut result = vec![0.0; self.rows * other.cols];

        for i in 0..self.rows {
            for j in 0..other.cols {
                result[i * other.cols + j] = (0..self.cols)
                    .map(|k| self.data[i * self.cols + k] * other.data[k * other.cols + j])
                    .sum();
            }
        }

        Matrix {
            rows: self.rows,
            cols: other.cols,
            data: result,
        }
    }

    pub fn transpose(&self) -> Self {
        let mut result = vec![0.0; self.cols * self.rows];
        for i in 0..self.rows {
            for j in 0..self.cols {
                result[j * self.rows + i] = self.data[i * self.cols + j];
            }
        }

        Matrix {
            rows: self.cols,
            cols: self.rows,
            data: result,
        }
    }

And now we call ndarray functions.

    pub fn dot_multiply(&self, other: &Matrix) -> Self {
        assert_eq!(
            self.cols(),
            other.rows(),
            "Invalid dimensions for matrix multiplication"
        );
        Matrix {
            data: self.data.dot(&other.data),
        }
    }

    pub fn transpose(&self) -> Self {
        Matrix {
            data: self.data.t().to_owned(),
        }
    }

I’m still creating a new Matrix for every operation, which will be expensive. ndarray has some tricks to avoid this, e.g. it has an in-place transpose, but I’m not using them yet.
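
One option, if I do tackle it, is to let callers pass in a destination matrix and use ndarray::linalg::general_mat_mul, which writes the product into an existing array instead of allocating a new one. A rough sketch (the dot_multiply_into name is made up and this isn’t in my code yet):

use ndarray::linalg::general_mat_mul;

impl Matrix {
    // Hypothetical allocation-free multiply: write into a caller-provided
    // Matrix instead of building a new one on every call.
    pub fn dot_multiply_into(&self, other: &Matrix, out: &mut Matrix) {
        // out = 1.0 * (self . other) + 0.0 * out
        general_mat_mul(1.0, &self.data, &other.data, 0.0, &mut out.data);
    }
}

Transposition is similar: .t() is already just a borrowed view, and the copy only happens because my transpose() calls .to_owned() on it.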

The new Matrix struct could be simplified further by getting rid of the data field and making it a newtype, as in the sketch below. But if I were going to do that I might as well get rid of Matrix altogether and use the ndarray API directly. For now this is the safer refactoring.
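
For reference, the newtype version would look something like this (just a sketch, I haven’t applied it):

use ndarray::Array2;

// Newtype: no named field, just a tuple struct wrapping Array2 directly.
pub struct Matrix(pub Array2<f64>);

impl Matrix {
    pub fn transpose(&self) -> Self {
        Matrix(self.0.t().to_owned())
    }
}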

There was some faffing about to get ndarray working with BLAS[3]. I had to add linker flags for openblas in .cargo/config.toml:

[target.aarch64-apple-darwin]
rustflags = ["-L", "/opt/homebrew/opt/openblas/lib", "-l", "openblas"]

And I had to turn on the correct features in matrix/Cargo.toml:

ndarray = { version = "0.16", features = [
    "serde",
    "rayon",
    "blas",
    "matrixmultiply-threading",
] }

ndarray is using rayon, but I haven’t noticed any "multi-threading" going on. All the code I have written so far, including this change, has maxed out a single core and not much else.
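
As far as I can tell that’s expected: ndarray’s rayon feature doesn’t parallelise anything automatically, you have to opt in by calling the par_* variants of the element-wise methods. Something like this (not something I’m doing yet):

use ndarray::Array2;

// Element-wise ops only use rayon if you call the par_* variants explicitly.
fn sigmoid_inplace(m: &mut Array2<f64>) {
    // Single-threaded: m.mapv_inplace(|x| 1.0 / (1.0 + (-x).exp()));
    // Rayon-backed (needs the "rayon" feature enabled):
    m.par_mapv_inplace(|x| 1.0 / (1.0 + (-x).exp()));
}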

Here are the stats after switching the maths code to ndarray:

Training network...
Epoch 1/30: Error = 0.220905, Accuracy = 98.47%, Time = 15.67s
Epoch 2/30: Error = 0.130687, Accuracy = 99.20%, Time = 17.72s
Epoch 3/30: Error = 0.114105, Accuracy = 99.30%, Time = 17.56s
...
Epoch 28/30: Error = 0.027446, Accuracy = 99.76%, Time = 18.44s
Epoch 29/30: Error = 0.026475, Accuracy = 99.77%, Time = 18.16s
Epoch 30/30: Error = 0.025550, Accuracy = 99.78%, Time = 15.99s
Total training time: 8m 31s (511.86s)
Average time per epoch: 17s (17.06s)

Confusion Matrix:
      Predicted →
Actual     0    1    2    3    4    5    6    7    8    9
  ↓   +--------------------------------------------------
  0   |  967    0    1    1    0    7    1    1    2    0
  1   |    0 1121    4    2    0    1    2    2    3    0
  2   |    6    1 1004    4    3    0    3    4    6    1
  3   |    0    0    4  984    0    9    0    6    4    3
  4   |    0    0    7    1  956    0    1    2    2   13
  5   |    5    1    0    3    1  872    5    1    2    2
  6   |    9    3    4    1    6   11  919    0    5    0
  7   |    1    4   13    9    2    1    0  984    2   12
  8   |    3    2    3    8    4    4    2    4  939    5
  9   |    4    4    0    9   11    4    0    4    1  972

Per-digit Metrics:
Digit  | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |  98.7%   |   97.2%   |  98.7%  |   97.9%
   1   |  98.8%   |   98.7%   |  98.8%  |   98.7%
   2   |  97.3%   |   96.5%   |  97.3%  |   96.9%
   3   |  97.4%   |   96.3%   |  97.4%  |   96.9%
   4   |  97.4%   |   97.3%   |  97.4%  |   97.3%
   5   |  97.8%   |   95.9%   |  97.8%  |   96.8%
   6   |  95.9%   |   98.5%   |  95.9%  |   97.2%
   7   |  95.7%   |   97.6%   |  95.7%  |   96.7%
   8   |  96.4%   |   97.2%   |  96.4%  |   96.8%
   9   |  96.3%   |   96.4%   |  96.3%  |   96.4%

Overall Accuracy: 97.18%

At eight and a half minutes it’s about 15% faster. Accuracy is down marginally, but that might be a function of the random weights. The difference isn’t as big as I expected, but I think the main problem is that I’m processing each image individually, as I talked about at the end of the last post. There are several other optimisation strategies I can investigate, SIMD[4], using the GPU, and reducing the amount of cloning I’m doing, but I’m going to leave those until later[5]. The next chapter of the book talks about "mini-batch gradient descent" as an optimisation technique, so that’s what I’ll tackle next.
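
As a rough preview of the batching idea: instead of pushing one flattened 28x28 image through the network at a time, you stack a whole mini-batch into a single matrix so each layer’s forward pass is one big dot product. A sketch under my own assumptions about names and shapes (not the book’s code):

use ndarray::{Array1, Array2};

// Stack the flattened images as the rows of one matrix so a layer's forward
// pass is a single dot product for the whole mini-batch rather than one
// multiply per image.
fn forward_batch(images: &[Array1<f64>], weights: &Array2<f64>) -> Array2<f64> {
    // batch: (batch_size, 784), weights: (784, hidden)
    let batch = Array2::from_shape_fn((images.len(), 784), |(i, j)| images[i][j]);
    batch.dot(weights) // result: (batch_size, hidden)
}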


  1. ndarray for NumPy users.

  2. I might have a poke around the ndarray documentation and see if I can get it to look more similar. This is for another time though.

  3. BLAS (Basic Linear Algebra Subprograms), or more specifically OpenBLAS, an open-source implementation of the BLAS and LAPACK APIs with many hand-crafted optimisations for specific processor types. I installed openblas with Homebrew.

  4. Single instruction, multiple data

  5. Premature optimisation yadda yadda

#ai #machine_learning #rust