Machine Learning Part 7
Early on I added a configuration file (config.json) that enabled me to play with the network parameters. In machine learning parlance these tunable settings, as opposed to the weights the network learns, are called hyper-parameters. Now that I have a complete and reasonably fast¹ neural network, I can play with the hyper-parameters to see if I can get a more accurate model.
Hyper-parameter tuning
TLDR: It seems I might have accidentally hit the perfect set of parameters. If I change anything, the network either runs slower, converges more slowly, has worse accuracy, or a combination of all three. These are the parameters I've been using during development:
{
  "layers": [
    { "nodes": 784, "activation": "Sigmoid" },
    { "nodes": 128, "activation": "Softmax" },
    { "nodes": 10 }
  ],
  "learning_rate": 0.01,
  "epochs": 30,
  "momentum": 0.5,
  "batch_size": 32
}
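For what it's worth, a config like this maps almost directly onto serde_json deserialization. The sketch below is illustrative rather than my exact code: the struct and field names simply mirror the JSON above, and in the real network the activation would be an enum rather than a string.

use serde::Deserialize;

// Sketch only: field names mirror config.json above, not necessarily the real structs.
#[derive(Deserialize, Debug)]
struct LayerConfig {
    nodes: usize,
    activation: Option<String>, // absent on the output layer, hence Option
}

#[derive(Deserialize, Debug)]
struct NetworkConfig {
    layers: Vec<LayerConfig>,
    learning_rate: f64,
    epochs: usize,
    momentum: f64,
    batch_size: usize,
}

fn load_config(path: &str) -> Result<NetworkConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}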
The network hits 100% training accuracy (rounded) before the 30th training epoch and the test accuracy is stable at around 98%. This is slightly annoying because I was hoping to get to 99%. In theory I should be able to tune the configuration to get better results.
Learning Rate
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 128, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0010
Momentum: 0.5000
Epochs: 100
Batch Size: 32
Training network...
[00:02:51] [################################################################################] 100/100 epochs | Accuracy: Training completed in 171.432437154s
Final accuracy: 100.00%
Total training time: 2m 51s (171.43s)
---
Overall Accuracy: 97.79%
Reducing the learning rate to 0.001: the network doesn't get to 100% within 30 epochs, so I bumped the epochs up to 100. Per-epoch time is about the same. If I crank up the epochs some more it might get better accuracy, though as I said previously it's probably suffering from overfitting².
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 128, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.1000
Momentum: 0.5000
Epochs: 100
Batch Size: 32
Training network...
[00:03:15] [################################################################################] 100/100 epochs | Accuracy: Training completed in 195.325258255s
Final accuracy: 97.03%
Total training time: 3m 15s (195.33s)
---
Overall Accuracy: 95.94%
This is really bad. While it's running you can see the accuracy bouncing up and down. A 0.1 learning rate causes gradient descent to overshoot the minimum error, so it never settles.
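For context, the weight update is plain mini-batch gradient descent with momentum. Per weight it looks roughly like the sketch below (the names are illustrative and the real code works on whole matrices); with a learning rate of 0.1 each step is big enough to jump straight over the minimum, which is why the accuracy bounces around.

// Rough sketch of an SGD-with-momentum update for a single weight.
// Names are illustrative; the actual implementation operates on matrices.
fn update_weight(weight: &mut f64, velocity: &mut f64, gradient: f64,
                 learning_rate: f64, momentum: f64) {
    // Keep a fraction of the previous step and add the new gradient step.
    *velocity = momentum * *velocity - learning_rate * gradient;
    // A large learning rate makes this step overshoot the minimum.
    *weight += *velocity;
}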
Hidden Layer Neurons
Doubling the neurons on the hidden layer makes training a little slower, as expected, but doesn't meaningfully improve accuracy.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 256, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 30
Batch Size: 32
Training network...
[00:01:19] [################################################################################] 30/30 epochs | Accuracy: Training completed in 79.112050226s
Final accuracy: 100.00%
Total training time: 1m 19s (79.11s)
Testing network predictions...
[00:00:01] [################################################################################] 10000/10000 (100%)
Confusion Matrix:
        Predicted →
Actual   0    1    2    3    4    5    6    7    8    9
  ↓ +--------------------------------------------------
  0 |  970    1    1    0    1    1    3    1    1    1
  1 |    0 1126    2    2    0    1    1    2    1    0
  2 |    2    1 1014    2    2    0    1    6    4    0
  3 |    0    0    5  995    0    1    0    4    3    2
  4 |    1    0    2    0  965    0    4    0    1    9
  5 |    4    0    1    8    2  869    4    1    1    2
  6 |    4    2    0    1    5    2  942    1    1    0
  7 |    1    2    8    2    1    0    0 1008    2    4
  8 |    3    0    2    4    3    1    2    3  952    4
  9 |    3    2    0    5   11    1    0    3    1  983
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.0% | 98.2% | 99.0% | 98.6%
1 | 99.2% | 99.3% | 99.2% | 99.3%
2 | 98.3% | 98.0% | 98.3% | 98.1%
3 | 98.5% | 97.6% | 98.5% | 98.1%
4 | 98.3% | 97.5% | 98.3% | 97.9%
5 | 97.4% | 99.2% | 97.4% | 98.3%
6 | 98.3% | 98.4% | 98.3% | 98.4%
7 | 98.1% | 98.0% | 98.1% | 98.0%
8 | 97.7% | 98.4% | 97.7% | 98.1%
9 | 97.4% | 97.8% | 97.4% | 97.6%
Overall Accuracy: 98.24%
Halving the neurons has the opposite effect. The network runs faster per epoch but takes longer to reach 100%, and accuracy is worse.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 64, activation: Some(Softmax) }
Layer 2: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 60
Batch Size: 32
Training network...
[00:01:02] [################################################################################] 60/60 epochs | Accuracy: Training completed in 62.250689188s
Final accuracy: 100.00%
Total training time: 1m 2s (62.25s)
---
Overall Accuracy: 97.38%
Hidden Layers
Adding a second hidden layer (256 neurons, then 128) gives the best accuracy so far, at the cost of a slower training run.
Loading network configuration...
Creating network...
Network Configuration:
Layer 0: { nodes: 784, activation: Some(Sigmoid) }
Layer 1: { nodes: 256, activation: Some(Sigmoid) }
Layer 2: { nodes: 128, activation: Some(Softmax) }
Layer 3: { nodes: 10, activation: None }
Learning Rate: 0.0100
Momentum: 0.5000
Epochs: 30
Batch Size: 32
Training network...
[00:01:35] [################################################################################] 30/30 epochs | Accuracy: Training completed in 95.849828317s
Final accuracy: 100.00%
Total training time: 1m 35s (95.85s)
Confusion Matrix:
        Predicted →
Actual   0    1    2    3    4    5    6    7    8    9
  ↓ +--------------------------------------------------
  0 |  972    0    1    1    0    0    3    2    1    0
  1 |    0 1128    1    1    0    1    2    1    1    0
  2 |    3    0 1017    1    1    0    2    5    3    0
  3 |    1    0    2  993    0    5    0    4    2    3
  4 |    1    0    4    0  961    0    2    3    0   11
  5 |    2    0    0    7    1  872    4    2    3    1
  6 |    2    2    1    0    2    4  945    0    2    0
  7 |    1    3    7    1    0    0    0 1012    0    4
  8 |    4    0    1    2    2    2    2    3  955    3
  9 |    2    2    0    3    9    3    1    4    3  982
Per-digit Metrics:
Digit | Accuracy | Precision | Recall | F1 Score
-------|----------|-----------|---------|----------
0 | 99.2% | 98.4% | 99.2% | 98.8%
1 | 99.4% | 99.4% | 99.4% | 99.4%
2 | 98.5% | 98.4% | 98.5% | 98.5%
3 | 98.3% | 98.4% | 98.3% | 98.4%
4 | 97.9% | 98.5% | 97.9% | 98.2%
5 | 97.8% | 98.3% | 97.8% | 98.0%
6 | 98.6% | 98.3% | 98.6% | 98.5%
7 | 98.4% | 97.7% | 98.4% | 98.1%
8 | 98.0% | 98.5% | 98.0% | 98.3%
9 | 97.3% | 97.8% | 97.3% | 97.6%
Overall Accuracy: 98.37%
There are five hyper-parameters to play with, and maybe there is an ideal combination that's better than my defaults, but ¯\_(ツ)_/¯. I have thought about writing a program that nudges the hyper-parameters up or down by a small random amount, retrains the network, and keeps repeating until it finds the best configuration (sketched below). This sounds a bit like neural network inception.
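Something like the following, hill-climbing over the values in config.json. Everything here is hypothetical: train_and_evaluate stands in for building and training the real network, the nudge ranges are guesses, and it assumes the rand crate.

use rand::Rng; // assumes the rand crate (0.8-style API)

// Hypothetical mirror of the tunable values in config.json.
#[derive(Clone, Debug)]
struct HyperParams {
    hidden_nodes: usize,
    learning_rate: f64,
    momentum: f64,
    batch_size: usize,
}

// Placeholder: build a network from `params`, train it and return test accuracy.
fn train_and_evaluate(_params: &HyperParams) -> f64 {
    unimplemented!("hook this up to the real network")
}

// Nudge each hyper-parameter by a small random amount and keep the change
// only if test accuracy improves.
fn tune(mut best: HyperParams, iterations: usize) -> HyperParams {
    let mut rng = rand::thread_rng();
    let mut best_acc = train_and_evaluate(&best);

    for _ in 0..iterations {
        let mut candidate = best.clone();
        candidate.learning_rate *= rng.gen_range(0.8..1.25);
        candidate.momentum = (candidate.momentum + rng.gen_range(-0.05..0.05)).clamp(0.0, 0.99);
        candidate.hidden_nodes = ((candidate.hidden_nodes as f64) * rng.gen_range(0.9..1.1)) as usize;

        let acc = train_and_evaluate(&candidate);
        if acc > best_acc {
            best_acc = acc;
            best = candidate;
        }
    }
    best
}

The obvious downside is that every candidate needs a full training run, and at a couple of minutes each that adds up quickly.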
Performance
I know this is apples and oranges, but I'm surprised how much faster my network is compared to the Python network in the book³. The author talks about 10,000 epochs and running his network for hours. I don't know what the specs of his computer are so I can't really compare, but the difference in speed is dramatic. I was under the impression that numpy was fast. Maybe it's not.
Since the previous post I have done a bit of spot optimisation: removing a few unnecessary copies, Vec allocations and the like. My network is 20x faster than when I started.
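As an example of the kind of change I mean, here is a made-up before and after (not my actual code): allocating a fresh Vec on every forward pass versus writing into a buffer the layer already owns.

// Before (hypothetical): a new Vec is allocated on every call.
fn forward_alloc(weights: &[Vec<f64>], input: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .map(|row| row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>())
        .collect()
}

// After (hypothetical): reuse an output buffer owned by the layer.
fn forward_in_place(weights: &[Vec<f64>], input: &[f64], out: &mut [f64]) {
    for (o, row) in out.iter_mut().zip(weights) {
        *o = row.iter().zip(input).map(|(w, x)| w * x).sum();
    }
}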
What's next
Part 3 of the book starts to talk about deep networks, i.e. more than one hidden layer. I've been able to do that for a while; I accidentally posted the results of a network with two hidden layers in Part 3.
The book also drops numpy at this point and switches to a machine learning framework (Keras). I plan to do that at some point, but not yet. I still have plenty to learn about coding the low-level stuff, so I will have to figure out how to implement new features myself. To be honest the book hasn't helped much in that regard for a while, and now even less so.
1. Incidentally, running training without --release is about 30x slower.
2. Overfitting in machine learning refers to a modelling error that occurs when a machine learning algorithm "memorises" the training data instead of learning the underlying pattern. The result is a model that performs very well on the training dataset but poorly on production data.
3. Programming Machine Learning by Paolo Perrotta.