Machine Learning Part 7

Early on I added a configuration file (config.json) that enabled me to play with the network parameters. In machine learning parlance the parameters that control the model's architecture and training are called hyper-parameters. Now that I have a complete and reasonably fast[1] neural network I can play with the hyper-parameters to see if I can get a more accurate model.

Hyper-parameter tuning

TLDR: It seems I might have accidentally hit the perfect set of parameters for performance. If I change anything the network either runs slower, converges slower, has worse accuracy, or some combination of all three. These are the parameters I've been using during development.

{
  "layers": [
    { "nodes": 784, "activation": "Sigmoid" },
    { "nodes": 128, "activation": "Softmax" },
    { "nodes": 10 }
  ],
  "learning_rate": 0.01,
  "epochs": 30,
  "momentum": 0.5,
  "batch_size": 32
}
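For the curious, this is roughly how a config like that could be read in Rust. It's a minimal sketch assuming the serde and serde_json crates; the struct and field names are illustrative rather than my actual types.

use serde::Deserialize;

// Hypothetical mirror of config.json; my real types may differ.
#[derive(Debug, Deserialize)]
struct LayerConfig {
    nodes: usize,
    activation: Option<String>, // "Sigmoid", "Softmax", or absent
}

#[derive(Debug, Deserialize)]
struct NetworkConfig {
    layers: Vec<LayerConfig>,
    learning_rate: f64,
    epochs: usize,
    momentum: f64,
    batch_size: usize,
}

fn load_config(path: &str) -> Result<NetworkConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&text)?)
}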

The network hits 100% accuracy (rounded) before the 30th training epoch and the test accuracy is stable at around 98%. This is slightly annoying because I was hoping to get to 99%. In theory I should be able to tune the configuration to get better results.

[Figure: training_history_4_layer.svg]

Learning Rate

Loading network configuration...
Creating network...
Network Configuration:
  Layer 0: { nodes: 784, activation: Some(Sigmoid) }
  Layer 1: { nodes: 128, activation: Some(Softmax) }
  Layer 2: { nodes: 10, activation: None }
  Learning Rate: 0.0010
  Momentum:      0.5000
  Epochs:        100
  Batch Size:    32
Training network...
  [00:02:51] [################################################################################] 100/100 epochs | Accuracy: Training completed in 171.432437154s         
Final accuracy: 100.00%
Total training time: 2m 51s (171.43s)
---
Overall Accuracy: 97.79%

Reducing the learning rate to 0.001: the network doesn't get to 100% in 30 epochs, so I bumped the epochs up to 100. Per-epoch time is about the same. If I crank the epochs up some more it might get better accuracy, though as I said previously it's probably suffering from overfitting[2].

Loading network configuration...
Creating network...
Network Configuration:
  Layer 0: { nodes: 784, activation: Some(Sigmoid) }
  Layer 1: { nodes: 128, activation: Some(Softmax) }
  Layer 2: { nodes: 10, activation: None }
  Learning Rate: 0.1000
  Momentum:      0.5000
  Epochs:        100
  Batch Size:    32
Training network...
  [00:03:15] [################################################################################] 100/100 epochs | Accuracy: Training completed in 195.325258255s         
Final accuracy: 97.03%
Total training time: 3m 15s (195.33s)
---
Overall Accuracy: 95.94%

This is really bad. While it's running you can see the accuracy bouncing up and down. A 0.1 learning rate causes the gradient descent to overshoot the minimum error, so it never settles.

[Figure: training_history_hi_lr.svg]
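To make the overshooting concrete, this is the shape of the weight update I'm talking about. It's a minimal sketch of SGD with momentum over flat f32 slices; the function name and layout are illustrative, not my actual implementation.

// Sketch of an SGD-with-momentum step (illustrative, not my real code).
// The learning rate scales every step: at 0.1 each update is ten times
// larger than at 0.01, so near a minimum the weights keep jumping past it.
fn sgd_momentum_step(
    weights: &mut [f32],
    velocity: &mut [f32],
    gradients: &[f32],
    learning_rate: f32, // e.g. 0.01
    momentum: f32,      // e.g. 0.5
) {
    for ((w, v), g) in weights.iter_mut().zip(velocity.iter_mut()).zip(gradients) {
        *v = momentum * *v - learning_rate * g;
        *w += *v;
    }
}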

Hidden Layer Neurons

Doubling the neurons on the hidden layer makes training a little slower, as expected, but doesn't meaningfully improve accuracy.

Loading network configuration...
Creating network...
Network Configuration:
  Layer 0: { nodes: 784, activation: Some(Sigmoid) }
  Layer 1: { nodes: 256, activation: Some(Softmax) }
  Layer 2: { nodes: 10, activation: None }
  Learning Rate: 0.0100
  Momentum:      0.5000
  Epochs:        30
  Batch Size:    32
Training network...
  [00:01:19] [################################################################################] 30/30 epochs | Accuracy: Training completed in 79.112050226s            
Final accuracy: 100.00%
Total training time: 1m 19s (79.11s)

Testing network predictions...
  [00:00:01] [################################################################################] 10000/10000 (100%)
Confusion Matrix:
      Predicted →
Actual     0    1    2    3    4    5    6    7    8    9
  ↓   +--------------------------------------------------
  0   |  970    1    1    0    1    1    3    1    1    1
  1   |    0 1126    2    2    0    1    1    2    1    0
  2   |    2    1 1014    2    2    0    1    6    4    0
  3   |    0    0    5  995    0    1    0    4    3    2
  4   |    1    0    2    0  965    0    4    0    1    9
  5   |    4    0    1    8    2  869    4    1    1    2
  6   |    4    2    0    1    5    2  942    1    1    0
  7   |    1    2    8    2    1    0    0 1008    2    4
  8   |    3    0    2    4    3    1    2    3  952    4
  9   |    3    2    0    5   11    1    0    3    1  983

Per-digit Metrics:
Digit  | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |  99.0%   |   98.2%   |  99.0%  |   98.6%
   1   |  99.2%   |   99.3%   |  99.2%  |   99.3%
   2   |  98.3%   |   98.0%   |  98.3%  |   98.1%
   3   |  98.5%   |   97.6%   |  98.5%  |   98.1%
   4   |  98.3%   |   97.5%   |  98.3%  |   97.9%
   5   |  97.4%   |   99.2%   |  97.4%  |   98.3%
   6   |  98.3%   |   98.4%   |  98.3%  |   98.4%
   7   |  98.1%   |   98.0%   |  98.1%  |   98.0%
   8   |  97.7%   |   98.4%   |  97.7%  |   98.1%
   9   |  97.4%   |   97.8%   |  97.4%  |   97.6%

Overall Accuracy: 98.24%

Halving the neurons has the opposite effect. The network runs faster but takes more epochs (60 instead of 30) to reach 100%, and the test accuracy is worse.

Loading network configuration...
Creating network...
Network Configuration:
  Layer 0: { nodes: 784, activation: Some(Sigmoid) }
  Layer 1: { nodes: 64, activation: Some(Softmax) }
  Layer 2: { nodes: 10, activation: None }
  Learning Rate: 0.0100
  Momentum:      0.5000
  Epochs:        60
  Batch Size:    32
Training network...
  [00:01:02] [################################################################################] 60/60 epochs | Accuracy: Training completed in 62.250689188s            
Final accuracy: 100.00%
Total training time: 1m 2s (62.25s)
---
Overall Accuracy: 97.38%

Hidden Layers

Adding a second hidden layer (256 and 128 neurons) gives the best test accuracy so far.

Loading network configuration...
Creating network...
Network Configuration:
  Layer 0: { nodes: 784, activation: Some(Sigmoid) }
  Layer 1: { nodes: 256, activation: Some(Sigmoid) }
  Layer 2: { nodes: 128, activation: Some(Softmax) }
  Layer 3: { nodes: 10, activation: None }
  Learning Rate: 0.0100
  Momentum:      0.5000
  Epochs:        30
  Batch Size:    32
Training network...
  [00:01:35] [################################################################################] 30/30 epochs | Accuracy: Training completed in 95.849828317s            
Final accuracy: 100.00%
Total training time: 1m 35s (95.85s)

Confusion Matrix:
      Predicted →
Actual     0    1    2    3    4    5    6    7    8    9
  ↓   +--------------------------------------------------
  0   |  972    0    1    1    0    0    3    2    1    0
  1   |    0 1128    1    1    0    1    2    1    1    0
  2   |    3    0 1017    1    1    0    2    5    3    0
  3   |    1    0    2  993    0    5    0    4    2    3
  4   |    1    0    4    0  961    0    2    3    0   11
  5   |    2    0    0    7    1  872    4    2    3    1
  6   |    2    2    1    0    2    4  945    0    2    0
  7   |    1    3    7    1    0    0    0 1012    0    4
  8   |    4    0    1    2    2    2    2    3  955    3
  9   |    2    2    0    3    9    3    1    4    3  982

Per-digit Metrics:
Digit  | Accuracy | Precision | Recall  | F1 Score
-------|----------|-----------|---------|----------
   0   |  99.2%   |   98.4%   |  99.2%  |   98.8%
   1   |  99.4%   |   99.4%   |  99.4%  |   99.4%
   2   |  98.5%   |   98.4%   |  98.5%  |   98.5%
   3   |  98.3%   |   98.4%   |  98.3%  |   98.4%
   4   |  97.9%   |   98.5%   |  97.9%  |   98.2%
   5   |  97.8%   |   98.3%   |  97.8%  |   98.0%
   6   |  98.6%   |   98.3%   |  98.6%  |   98.5%
   7   |  98.4%   |   97.7%   |  98.4%  |   98.1%
   8   |  98.0%   |   98.5%   |  98.0%  |   98.3%
   9   |  97.3%   |   97.8%   |  97.3%  |   97.6%

Overall Accuracy: 98.37%

[Figure: training_history_4_layer 2.svg]

There are five hyper-parameters to play with, and maybe there is an ideal combination that's better than my defaults, but ¯\_(ツ)_/¯. I have thought about writing a program that nudges each hyper-parameter up or down by a small random amount, re-runs the network, and keeps repeating until it finds the best configuration (roughly as sketched below). This sounds a bit like neural network inception.
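A rough sketch of what that search program might look like. This is an assumption-laden outline, not working tuning code from the project: the HyperParams struct, the train_and_test callback, and the use of the rand crate (0.8-style API) are all hypothetical stand-ins for hooks into my real training code.

use rand::Rng; // assumes the rand crate, 0.8-style API

#[derive(Clone, Debug)]
struct HyperParams {
    hidden_nodes: usize,
    learning_rate: f32,
    momentum: f32,
    epochs: usize,
    batch_size: usize,
}

// Nudge the tunable values up or down by up to 10%.
fn perturb(p: &HyperParams, rng: &mut impl Rng) -> HyperParams {
    HyperParams {
        hidden_nodes: ((p.hidden_nodes as f32) * rng.gen_range(0.9f32..1.1f32)).round() as usize,
        learning_rate: p.learning_rate * rng.gen_range(0.9f32..1.1f32),
        momentum: (p.momentum * rng.gen_range(0.9f32..1.1f32)).min(0.99),
        epochs: p.epochs,
        batch_size: p.batch_size,
    }
}

// Simple hill-climb: keep a candidate only if it tests better than the best so far.
fn search(
    start: HyperParams,
    iterations: usize,
    train_and_test: impl Fn(&HyperParams) -> f32, // returns test accuracy
) -> (HyperParams, f32) {
    let mut rng = rand::thread_rng();
    let mut best_acc = train_and_test(&start);
    let mut best = start;
    for _ in 0..iterations {
        let candidate = perturb(&best, &mut rng);
        let acc = train_and_test(&candidate);
        if acc > best_acc {
            best_acc = acc;
            best = candidate;
        }
    }
    (best, best_acc)
}

Each iteration is a full training run, so even this simple hill-climb would take a while.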

Performance

I know this is apples and oranges, but I'm surprised how much faster my network is compared to the Python network in the book[3]. The author talks about 10,000 epochs and running his network for hours. I don't know the specs of his computer so I can't really compare, but the difference in speed is dramatic. I was under the impression that numpy was fast. Maybe it's not. Since the previous post I have done a bit of spot optimisation: removing a few unnecessary copies, Vec allocations, etc. My network is now 20x faster than when I started.
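As an illustration of the kind of spot optimisation I mean (the function names here are made up, not lifted from my code): reusing a caller-owned buffer instead of allocating a fresh Vec on every forward pass.

// Before: allocates a new output vector on every call.
fn forward_alloc(weights: &[f32], input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    (0..rows)
        .map(|r| (0..cols).map(|c| weights[r * cols + c] * input[c]).sum())
        .collect()
}

// After: writes into a caller-owned buffer, so the allocation happens once
// and is reused across batches and epochs.
fn forward_into(weights: &[f32], input: &[f32], rows: usize, cols: usize, out: &mut [f32]) {
    for r in 0..rows {
        out[r] = (0..cols).map(|c| weights[r * cols + c] * input[c]).sum();
    }
}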

What’s next

Part 3 of the book starts to talk about deep networks, i.e. more than one hidden layer. I've been able to do that for a while; I accidentally posted the results of a network with two hidden layers in Part 3.

The book also stops using numpy at this point and switches to a machine learning framework (Keras). I plan to do that at some point, but not yet. I still have plenty to learn about coding the low-level stuff, so I will have to figure out how to implement new features myself. To be honest the book hasn't helped much in that regard for a while, and now even less so.


  1. Incidentally, running training without --release is about 30x slower.

  2. Overfitting in machine learning refers to a modelling error that occurs when a machine learning algorithm "memorises" the training data instead of learning the underlying pattern. This results in a model that performs very well on the training dataset but poorly on production datasets.

  3. Programming Machine Learning by Paolo Perrotta.

#ai #machine_learning #rust