Machine Learning Part 4
In Part 3 of this series I presented some stats for the neural network using a Softmax activation function on the hidden layer. Unfortunately I showed the wrong output: the output I showed was for a network with two hidden layers, not the one hidden layer I have been using for the other stats. The 16m 34s run time and the 100% accuracy were down to the extra hidden layer. These are the actual training and testing statistics:
Training network...
Epoch 1 (19.60s): Average Error = 0.154137, Accuracy = 89.72%
Epoch 2 (19.16s): Average Error = 0.082968, Accuracy = 94.54%
Epoch 3 (19.35s): Average Error = 0.060954, Accuracy = 96.00%
...
Epoch 28 (20.72s): Average Error = 0.001754, Accuracy = 99.96%
Epoch 29 (20.22s): Average Error = 0.001556, Accuracy = 99.97%
Epoch 30 (20.36s): Average Error = 0.001384, Accuracy = 99.97%
Total training time: 10m 11s (611.14s)
Average time per epoch: 20s (20.37s)
Confusion Matrix:
                          Predicted →
Actual       0     1     2     3     4     5     6     7     8     9
   ↓   +------------------------------------------------------------
   0   |   970     1     0     0     2     2     0     0     2     3
   1   |     0  1128     1     1     0     1     1     1     2     0
   2   |     7     2  1005     2     3     1     1     4     6     1
   3   |     0     1     5   982     1     8     1     2     4     6
   4   |     0     1     2     1   961     0     4     0     1    12
   5   |     4     0     1     7     1   871     4     1     1     2
   6   |     6     2     1     1     5    10   930     0     3     0
   7   |     2     6    12     1     8     0     0   983     3    13
   8   |     4     0     2    10     7     5     1     4   935     6
   9   |     2     4     0     2    12     3     0     4     1   981
Per-digit Metrics:
 Digit | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |   99.0%  |   97.5%   |  99.0%  |  98.2%
   1   |   99.4%  |   98.5%   |  99.4%  |  98.9%
   2   |   97.4%  |   97.7%   |  97.4%  |  97.5%
   3   |   97.2%  |   97.5%   |  97.2%  |  97.4%
   4   |   97.9%  |   96.1%   |  97.9%  |  97.0%
   5   |   97.6%  |   96.7%   |  97.6%  |  97.2%
   6   |   97.1%  |   98.7%   |  97.1%  |  97.9%
   7   |   95.6%  |   98.4%   |  95.6%  |  97.0%
   8   |   96.0%  |   97.6%   |  96.0%  |  96.8%
   9   |   97.2%  |   95.8%   |  97.2%  |  96.5%
Overall Accuracy: 97.46%
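As a quick sanity check on those numbers: for digit 0, recall is the diagonal count over the row total (970 / 980 ≈ 99.0%), precision is the diagonal count over the column total (970 / 995 ≈ 97.5%), and F1 is the harmonic mean of the two (2 × 0.975 × 0.990 / (0.975 + 0.990) ≈ 98.2%), all of which match the table.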
The network didn't hit 100% in training, though it got close, and the test stats were roughly the same. If I run the same code with two Sigmoid functions versus a Sigmoid and a Softmax, the all-Sigmoid version is about 5% faster. I expected Softmax to be much slower, so its cost might be offset by the easier error calculation. Overall, Softmax hasn't really made much difference ¯\_(ツ)_/¯
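For anyone wondering where that "easier error calculation" comes from: paired with a cross-entropy loss, the Softmax derivative cancels out and the output-layer error is just the output minus the one-hot target. A minimal sketch, with illustrative names rather than the functions in my network code:
/// Numerically stable Softmax over the raw outputs of a layer.
fn softmax(logits: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating to avoid overflow.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// With cross-entropy, the gradient at the output layer reduces to
/// (softmax output - one-hot target); no extra derivative term is needed.
fn output_error(output: &[f64], target: &[f64]) -> Vec<f64> {
    output.iter().zip(target).map(|(o, t)| o - t).collect()
}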
ndarray
The next change I made is switching to the ndarray crate for the matrix maths. I chose ndarray because that's what came up first in a web search; I could also have picked nalgebra.
ndarray is a Rust n-dimensional array library, kind of the Rust equivalent of numpy¹. Considering how they are supposed to be roughly equivalent, my Rust ndarray code looks nothing like the Python numpy code in the book².
The changes to the code to support ndarray were surprisingly minor. I was able to mostly keep the same interface to Matrix, so the network and activation code barely changed. The internals of Matrix did change: the struct definition has been simplified and is now just a wrapper around the ndarray Array2 struct.
// Before
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f64>,
}

// After
pub struct Matrix {
    pub data: Array2<f64>,
}
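The constructors move over in the same spirit: rows and cols are now read from the array's shape rather than stored as fields, which is why the ndarray version of dot_multiply further down calls self.cols() and other.rows(). Roughly like this (the exact constructor in my code may differ):
use ndarray::Array2;

impl Matrix {
    // Build the wrapper from a flat row-major Vec; Array2 checks the length.
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f64>) -> Self {
        Matrix {
            data: Array2::from_shape_vec((rows, cols), data)
                .expect("data length must equal rows * cols"),
        }
    }

    // rows/cols are derived from the array's shape instead of being stored.
    pub fn rows(&self) -> usize {
        self.data.nrows()
    }

    pub fn cols(&self) -> usize {
        self.data.ncols()
    }
}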
The most obvious changes can be seen in dot_multiply and transpose. The original code looped over each row and column of the matrix.
pub fn dot_multiply(&self, other: &Matrix) -> Self {
    assert_eq!(
        self.cols, other.rows,
        "Invalid dimensions for matrix multiplication"
    );
    let mut result = vec![0.0; self.rows * other.cols];
    for i in 0..self.rows {
        for j in 0..other.cols {
            result[i * other.cols + j] = (0..self.cols)
                .map(|k| self.data[i * self.cols + k] * other.data[k * other.cols + j])
                .sum();
        }
    }
    Matrix {
        rows: self.rows,
        cols: other.cols,
        data: result,
    }
}

pub fn transpose(&self) -> Self {
    let mut result = vec![0.0; self.cols * self.rows];
    for i in 0..self.rows {
        for j in 0..self.cols {
            result[j * self.rows + i] = self.data[i * self.cols + j];
        }
    }
    Matrix {
        rows: self.cols,
        cols: self.rows,
        data: result,
    }
}
And now we just call ndarray functions.
pub fn dot_multiply(&self, other: &Matrix) -> Self {
    assert_eq!(
        self.cols(),
        other.rows(),
        "Invalid dimensions for matrix multiplication"
    );
    Matrix {
        data: self.data.dot(&other.data),
    }
}

pub fn transpose(&self) -> Self {
    Matrix {
        data: self.data.t().to_owned(),
    }
}
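The element-wise operations collapse in the same way, since ndarray overloads the arithmetic operators and provides mapv for applying a function to every element. A sketch of what they end up looking like inside the same impl Matrix block (my actual method names may differ):
pub fn add(&self, other: &Matrix) -> Self {
    Matrix {
        data: &self.data + &other.data, // element-wise addition
    }
}

pub fn multiply(&self, other: &Matrix) -> Self {
    Matrix {
        data: &self.data * &other.data, // element-wise (Hadamard) product
    }
}

pub fn map(&self, f: impl Fn(f64) -> f64) -> Self {
    Matrix {
        data: self.data.mapv(f), // e.g. applying an activation function
    }
}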
I'm still creating a new Matrix in every function, which will be expensive. ndarray has some tricks to avoid this, e.g. it has an in-place transpose, but I'm not using them yet.
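For reference, this is roughly what those tricks look like; a sketch only, not something the Matrix code does yet:
use ndarray::Array2;

// swap_axes flips the axis metadata in place; no elements are moved or copied.
fn transpose_in_place(m: &mut Array2<f64>) {
    m.swap_axes(0, 1);
}

// t() on its own is just a read-only view; the copy in transpose() above only
// happens because of the to_owned() call, so read-only callers can skip it.
fn sum_of_transpose(m: &Array2<f64>) -> f64 {
    m.t().sum()
}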
The new Matrix struct could be simplified further by getting rid of the data field and making it a newtype, as sketched below. But if I was going to do that I might as well get rid of Matrix altogether and use the ndarray API directly, so for now this is the safer refactoring.
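For illustration, the newtype version would look something like this (hypothetical, I haven't made this change):
use ndarray::Array2;

// No named field, just a thin tuple wrapper around Array2.
pub struct Matrix(pub Array2<f64>);

impl Matrix {
    pub fn transpose(&self) -> Self {
        Matrix(self.0.t().to_owned())
    }
}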
There was some faffing about to get ndarray to work with BLAS³. I had to add linker flags for openblas in .cargo/config.toml:
[target.aarch64-apple-darwin]
rustflags = ["-L", "/opt/homebrew/opt/openblas/lib", "-l", "openblas"]
And I had to turn on the correct features in matrix/Cargo.toml:
ndarray = { version = "0.16", features = [
    "serde",
    "rayon",
    "blas",
    "matrixmultiply-threading",
] }
ndarray is using rayon, but I haven't noticed any "multi-threading" going on. All the code I have written so far, including this change, has maxed out a single core and not much else.
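As far as I can tell that's expected: the rayon feature adds parallel methods that you have to call explicitly, it doesn't parallelise the normal operations, and the dot products go through OpenBLAS, which manages its own threading (I believe via OPENBLAS_NUM_THREADS), though the per-image matrices here are probably too small for that to matter. Opting in would look something like this sketch, which I'm not doing yet:
use ndarray::Array2;

// Apply the activation across all elements on rayon worker threads.
// par_mapv_inplace is only available with ndarray's "rayon" feature enabled.
fn sigmoid_parallel(m: &mut Array2<f64>) {
    m.par_mapv_inplace(|x| 1.0 / (1.0 + (-x).exp()));
}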
Here are the stats after switching the maths code to ndarray:
Training network...
Epoch 1/30: Error = 0.220905, Accuracy = 98.47%, Time = 15.67s
Epoch 2/30: Error = 0.130687, Accuracy = 99.20%, Time = 17.72s
Epoch 3/30: Error = 0.114105, Accuracy = 99.30%, Time = 17.56s
...
Epoch 28/30: Error = 0.027446, Accuracy = 99.76%, Time = 18.44s
Epoch 29/30: Error = 0.026475, Accuracy = 99.77%, Time = 18.16s
Epoch 30/30: Error = 0.025550, Accuracy = 99.78%, Time = 15.99s
Total training time: 8m 31s (511.86s)
Average time per epoch: 17s (17.06s)
Confusion Matrix:
                          Predicted →
Actual       0     1     2     3     4     5     6     7     8     9
   ↓   +------------------------------------------------------------
   0   |   967     0     1     1     0     7     1     1     2     0
   1   |     0  1121     4     2     0     1     2     2     3     0
   2   |     6     1  1004     4     3     0     3     4     6     1
   3   |     0     0     4   984     0     9     0     6     4     3
   4   |     0     0     7     1   956     0     1     2     2    13
   5   |     5     1     0     3     1   872     5     1     2     2
   6   |     9     3     4     1     6    11   919     0     5     0
   7   |     1     4    13     9     2     1     0   984     2    12
   8   |     3     2     3     8     4     4     2     4   939     5
   9   |     4     4     0     9    11     4     0     4     1   972
Per-digit Metrics:
 Digit | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |   98.7%  |   97.2%   |  98.7%  |  97.9%
   1   |   98.8%  |   98.7%   |  98.8%  |  98.7%
   2   |   97.3%  |   96.5%   |  97.3%  |  96.9%
   3   |   97.4%  |   96.3%   |  97.4%  |  96.9%
   4   |   97.4%  |   97.3%   |  97.4%  |  97.3%
   5   |   97.8%  |   95.9%   |  97.8%  |  96.8%
   6   |   95.9%  |   98.5%   |  95.9%  |  97.2%
   7   |   95.7%  |   97.6%   |  95.7%  |  96.7%
   8   |   96.4%  |   97.2%   |  96.4%  |  96.8%
   9   |   96.3%  |   96.4%   |  96.3%  |  96.4%
Overall Accuracy: 97.18%
At eight and a half minutes it's roughly 15% faster (611s down to 512s). Accuracy is down marginally, but that might be a function of the random weights. The difference isn't as big as I expected, but I think the main problem is that I'm processing each image individually, as I talked about at the end of the last post. There are several other optimisation strategies I could investigate, SIMD⁴, using the GPU, and the fact that I'm doing a lot of cloning, but I'm going to leave these until later⁵. The next chapter of the book talks about "mini-batch gradient descent" as an optimisation technique, so that's what I'll tackle next.
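To preview the idea: instead of feeding one image through at a time, mini-batching stacks a batch of input columns into a single matrix so one BLAS-backed dot product handles the whole batch. A rough sketch of the forward-pass shape, with illustrative names and layouts rather than the book's code:
use ndarray::Array2;

// weights: (hidden x inputs), batch: (inputs x batch_size), biases: (hidden x 1).
// One matrix product computes the pre-activations for every image in the batch;
// the (hidden x 1) biases broadcast across the batch columns.
fn forward_batch(weights: &Array2<f64>, biases: &Array2<f64>, batch: &Array2<f64>) -> Array2<f64> {
    let z = weights.dot(batch) + biases;
    z.mapv(|x| 1.0 / (1.0 + (-x).exp())) // sigmoid, element-wise
}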
I might have a poke around the ndarray documentation and see if I can get it to look more similar. This is for another time though.

BLAS (Basic Linear Algebra Subprograms), or more specifically OpenBLAS: an open-source implementation of the BLAS and LAPACK APIs with many hand-crafted optimisations for specific processor types. I installed openblas with Homebrew.

Premature optimisation yadda yadda.