Machine Learning Part 7
Early on I added a configuration file (config.json) that enabled me to play with the network parameters. In machine learning parlance these tunable settings, as opposed to the weights the network learns, are called hyper-parameters. Now that I have a complete and reasonably fast¹ neural network, I can play with the hyper-parameters to see if I can get a more accurate model.
Hyper-parameter tuning
TLDR: It seems I might have accidentally hit the perfect set of parameters. If I change anything, the network either runs slower, converges more slowly, has worse accuracy, or a combination of all three. These are the parameters I've been using during development:
{
  "layers": [
    { "nodes": 784, "activation": "Sigmoid" },
    { "nodes": 128, "activation": "Softmax" },
    { "nodes": 10 }
  ],
  "learning_rate": 0.01,
  "epochs": 30,
  "momentum": 0.5,
  "batch_size": 32
}
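For what it's worth, a config like this maps almost directly onto serde_json deserialization. The sketch below is illustrative rather than my exact code: the struct and field names simply mirror the JSON above, and in the real network the activation would be an enum rather than a string.

use serde::Deserialize;

// Sketch only: field names mirror config.json above, not necessarily the real structs.
#[derive(Deserialize, Debug)]
struct LayerConfig {
    nodes: usize,
    activation: Option<String>, // absent on the output layer, hence Option
}

#[derive(Deserialize, Debug)]
struct NetworkConfig {
    layers: Vec<LayerConfig>,
    learning_rate: f64,
    epochs: usize,
    momentum: f64,
    batch_size: usize,
}

fn load_config(path: &str) -> Result<NetworkConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}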
The network hits 100% training accuracy (rounded) before the 30th training epoch and the test accuracy is stable at around 98%. This is slightly annoying because I was hoping to get to 99%. In theory I should be able to tune the configuration to get better results.
Learning Rate
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 128, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0010
Momentum: 0.5000
Epochs: 100
Batch Size: 32
Training network...
[00:02:51] [################################################################################] 100/100 epochs | Accuracy: Training completed in 171.432437154s
Final accuracy: 100.00%
Total training time: 2m 51s (171.43s)
---
Overall Accuracy: 97.79%
Reducing the learning rate to 0.001: the network doesn't get to 100% within 30 epochs, so I bumped the epochs up to 100. Per-epoch time is about the same. If I crank up the epochs some more it might get better accuracy, though as I said previously it's probably suffering from overfitting².
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 128, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.1000
Momentum: 0.5000
Epochs: 100
Batch Size: 32
Training network...
[00:03:15] [################################################################################] 100/100 epochs | Accuracy: Training completed in 195.325258255s
Final accuracy: 97.03%
Total training time: 3m 15s (195.33s)
---
Overall Accuracy: 95.94%
This is really bad. While it's running you can see the accuracy bouncing up and down. A 0.1 learning rate causes gradient descent to overshoot the minimum error, so it never settles.
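For context, the weight update is plain mini-batch gradient descent with momentum. Per weight it looks roughly like the sketch below (the names are illustrative and the real code works on whole matrices); with a learning rate of 0.1 each step is big enough to jump straight over the minimum, which is why the accuracy bounces around.

// Rough sketch of an SGD-with-momentum update for a single weight.
// Names are illustrative; the actual implementation operates on matrices.
fn update_weight(weight: &mut f64, velocity: &mut f64, gradient: f64,
                 learning_rate: f64, momentum: f64) {
    // Keep a fraction of the previous step and add the new gradient step.
    *velocity = momentum * *velocity - learning_rate * gradient;
    // A large learning rate makes this step overshoot the minimum.
    *weight += *velocity;
}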
Hidden Layer Neurons
Doubling the neurons on the hidden layer makes training a little slower, as expected, but doesn't meaningfully improve accuracy.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 256, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 30
Batch Size: 32
Training network...
[00:01:19] [################################################################################] 30/30 epochs | Accuracy: Training completed in 79.112050226s
Final accuracy: 100.00%
Total training time: 1m 19s (79.11s)
Testing network predictions...
[00:00:01] [################################################################################] 10000/10000 (100%)
Confusion Matrix:
        Predicted →
Actual   0    1    2    3    4    5    6    7    8    9
  ↓ +--------------------------------------------------
  0 |  970    1    1    0    1    1    3    1    1    1
  1 |    0 1126    2    2    0    1    1    2    1    0
  2 |    2    1 1014    2    2    0    1    6    4    0
  3 |    0    0    5  995    0    1    0    4    3    2
  4 |    1    0    2    0  965    0    4    0    1    9
  5 |    4    0    1    8    2  869    4    1    1    2
  6 |    4    2    0    1    5    2  942    1    1    0
  7 |    1    2    8    2    1    0    0 1008    2    4
  8 |    3    0    2    4    3    1    2    3  952    4
  9 |    3    2    0    5   11    1    0    3    1  983
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.0% | 98.2% | 99.0% | 98.6%
1 | 99.2% | 99.3% | 99.2% | 99.3%
2 | 98.3% | 98.0% | 98.3% | 98.1%
3 | 98.5% | 97.6% | 98.5% | 98.1%
4 | 98.3% | 97.5% | 98.3% | 97.9%
5 | 97.4% | 99.2% | 97.4% | 98.3%
6 | 98.3% | 98.4% | 98.3% | 98.4%
7 | 98.1% | 98.0% | 98.1% | 98.0%
8 | 97.7% | 98.4% | 97.7% | 98.1%
9 | 97.4% | 97.8% | 97.4% | 97.6%
Overall Accuracy: 98.24%
Halving the neurons has the opposite effect. The network runs faster per epoch but takes longer to reach 100%, and accuracy is worse.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 64, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 60
Batch Size: 32
Training network...
[00:01:02] [################################################################################] 60/60 epochs | Accuracy: Training completed in 62.250689188s
Final accuracy: 100.00%
Total training time: 1m 2s (62.25s)
---
Overall Accuracy: 97.38%
Hidden Layers
Adding a second hidden layer (256 neurons, then 128) gives the best accuracy so far, at the cost of a slower training run.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 256, activation: Some(Sigmoid) }
Layer 2: { nodes: 128, activation: Some(Softmax) }
Layer 3: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 30
Batch Size: 32
Training network...
[00:01:35] [################################################################################] 30/30 epochs | Accuracy: Training completed in 95.849828317s
Final accuracy: 100.00%
Total training time: 1m 35s (95.85s)
Confusion Matrix:
        Predicted →
Actual   0    1    2    3    4    5    6    7    8    9
  ↓ +--------------------------------------------------
  0 |  972    0    1    1    0    0    3    2    1    0
  1 |    0 1128    1    1    0    1    2    1    1    0
  2 |    3    0 1017    1    1    0    2    5    3    0
  3 |    1    0    2  993    0    5    0    4    2    3
  4 |    1    0    4    0  961    0    2    3    0   11
  5 |    2    0    0    7    1  872    4    2    3    1
  6 |    2    2    1    0    2    4  945    0    2    0
  7 |    1    3    7    1    0    0    0 1012    0    4
  8 |    4    0    1    2    2    2    2    3  955    3
  9 |    2    2    0    3    9    3    1    4    3  982
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.2% | 98.4% | 99.2% | 98.8%
1 | 99.4% | 99.4% | 99.4% | 99.4%
2 | 98.5% | 98.4% | 98.5% | 98.5%
3 | 98.3% | 98.4% | 98.3% | 98.4%
4 | 97.9% | 98.5% | 97.9% | 98.2%
5 | 97.8% | 98.3% | 97.8% | 98.0%
6 | 98.6% | 98.3% | 98.6% | 98.5%
7 | 98.4% | 97.7% | 98.4% | 98.1%
8 | 98.0% | 98.5% | 98.0% | 98.3%
9 | 97.3% | 97.8% | 97.3% | 97.6%
Overall Accuracy: 98.37%
There are five hyper-parameters to play with, and maybe there is an ideal combination that's better than my defaults, but ¯\_(ツ)_/¯. I have thought about writing a program that nudges the hyper-parameters up or down by a small random amount, retrains the network, and keeps repeating until it finds the best configuration (sketched below). This sounds a bit like neural network inception.
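Something like the following, hill-climbing over the values in config.json. Everything here is hypothetical: train_and_evaluate stands in for building and training the real network, the nudge ranges are guesses, and it assumes the rand crate.

use rand::Rng; // assumes the rand crate (0.8-style API)

// Hypothetical mirror of the tunable values in config.json.
#[derive(Clone, Debug)]
struct HyperParams {
    hidden_nodes: usize,
    learning_rate: f64,
    momentum: f64,
    batch_size: usize,
}

// Placeholder: build a network from `params`, train it and return test accuracy.
fn train_and_evaluate(_params: &HyperParams) -> f64 {
    unimplemented!("hook this up to the real network")
}

// Nudge each hyper-parameter by a small random amount and keep the change
// only if test accuracy improves.
fn tune(mut best: HyperParams, iterations: usize) -> HyperParams {
    let mut rng = rand::thread_rng();
    let mut best_acc = train_and_evaluate(&best);

    for _ in 0..iterations {
        let mut candidate = best.clone();
        candidate.learning_rate *= rng.gen_range(0.8..1.25);
        candidate.momentum = (candidate.momentum + rng.gen_range(-0.05..0.05)).clamp(0.0, 0.99);
        candidate.hidden_nodes = ((candidate.hidden_nodes as f64) * rng.gen_range(0.9..1.1)) as usize;

        let acc = train_and_evaluate(&candidate);
        if acc > best_acc {
            best_acc = acc;
            best = candidate;
        }
    }
    best
}

The obvious downside is that every candidate needs a full training run, and at a couple of minutes each that adds up quickly.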
Performance
I know this is apples and oranges, but I'm surprised how much faster my network is compared to the Python network in the book³. The author talks about 10,000 epochs and running his network for hours. I don't know what the specs of his computer are so I can't really compare, but the difference in speed is dramatic. I was under the impression that numpy was fast. Maybe it's not.
Since the previous post I have done a bit of spot optimisation: removing a few unnecessary copies, Vec allocations and the like. My network is 20x faster than when I started.
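As an example of the kind of change I mean, here is a made-up before and after (not my actual code): allocating a fresh Vec on every forward pass versus writing into a buffer the layer already owns.

// Before (hypothetical): a new Vec is allocated on every call.
fn forward_alloc(weights: &[Vec<f64>], input: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .map(|row| row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>())
        .collect()
}

// After (hypothetical): reuse an output buffer owned by the layer.
fn forward_in_place(weights: &[Vec<f64>], input: &[f64], out: &mut [f64]) {
    for (o, row) in out.iter_mut().zip(weights) {
        *o = row.iter().zip(input).map(|(w, x)| w * x).sum();
    }
}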
What's next
Part 3 of the book starts to talk about deep networks, i.e. more than one hidden layer. I've been able to do that for a while; I accidentally posted the results of a network with two hidden layers in Part 3.
The book also drops numpy at this point and switches to a machine learning framework (Keras). I plan to do that at some point, but not yet. I still have plenty to learn about coding the low-level stuff, so I will have to figure out how to implement new features myself. To be honest the book hasn't helped much in that regard for a while, and now even less so.
1. Incidentally, running training without --release is about 30x slower.
2. Overfitting in machine learning refers to a modelling error that occurs when a machine learning algorithm "memorises" the training data instead of learning the underlying pattern. The result is a model that performs very well on the training dataset but poorly on production data.
3. Programming Machine Learning by Paolo Perrotta.