Enough neural networks to be dangerous

How do neural networks work?

Watch the 3blue1brown series before the lecture: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Video 1: But what is a neural network?
Video 2: Gradient descent, how neural networks learn
Video 3: What is backpropagation really doing?
Video 4 (optional): Backpropagation calculus

Moreover, read the Illustrated Guide to Recurrent Neural Networks by Michael Nguyen.

A quick recap

A simple NN. Source: towardsdatascience.com

Let’s review:

Neurons
Weights and biases
Activation function
Hidden layers
Feed forward
Back propagation and gradient descent
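
The concepts above can be tied together in a few lines of NumPy. This is a minimal sketch, not the code from the course notebooks: the XOR toy data, the layer sizes (2 inputs, 3 hidden neurons, 1 output), the learning rate, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Activation function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (XOR), an illustrative assumption: 4 samples, 2 features each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Weights and biases: 2 inputs -> 3 hidden neurons -> 1 output neuron.
W1 = rng.normal(size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1))
b2 = np.zeros(1)

def forward(X):
    # Feed forward: each layer computes activation(inputs @ weights + bias).
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    return h, out

def mse(out, y):
    return float(np.mean((out - y) ** 2))

learning_rate = 0.5
losses = []
for step in range(2000):
    h, out = forward(X)
    losses.append(mse(out, y))

    # Backpropagation: apply the chain rule from the output back to the input.
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: move every parameter a small step against its gradient.
    W2 -= learning_rate * (h.T @ d_out)
    b2 -= learning_rate * d_out.sum(axis=0)
    W1 -= learning_rate * (X.T @ d_h)
    b1 -= learning_rate * d_h.sum(axis=0)

print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

How far the loss drops depends on the random initialization, which is one reason the later sections stress experimenting.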

Different architectures

Feed Forward (FF)
Recurrent Neural Network (RNN)
Long Short-Term Memory (LSTM)
… and others that are not yet widely used in SE, such as Convolutional Neural Networks.

See the Neural Network Zoo by the Asimov Institute.

Example 1: FF for handwritten digit recognition

Our goal is to create a neural network that recognizes handwritten digits.
We use the MNIST dataset, which contains 60k training examples + 10k test examples.
Open the “feed-forward-nn-hand-written-recognition” Jupyter notebook.

RNNs and LSTMs

RNNs: when the order matters!
- RNNs may struggle to retain information from early in a long sequence (“vanishing gradients”).
- LSTM (Long Short-Term Memory): an RNN variant designed to retain information over longer sequences.
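
To see why order matters to an RNN, the sketch below runs a plain (non-LSTM) recurrent cell over the same three inputs in two different orders. The sizes, the random weights, and the names `W_xh`, `W_hh` and `rnn_forward` are illustrative assumptions, not taken from the course notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
input_size, hidden_size = 3, 4

# Randomly initialized (untrained) parameters, for illustration only.
W_xh = rng.normal(scale=0.5, size=(input_size, hidden_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_forward(sequence):
    # The hidden state h is carried from one time step to the next,
    # so each step sees the current input plus a summary of the past.
    h = np.zeros(hidden_size)
    for x_t in sequence:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h

sequence = np.eye(3)                 # three one-hot "tokens": A, B, C
h_abc = rnn_forward(sequence)        # order A, B, C
h_cba = rnn_forward(sequence[::-1])  # order C, B, A

print("A,B,C ->", np.round(h_abc, 3))
print("C,B,A ->", np.round(h_cba, 3))
```

A feed-forward network that simply summed the three inputs would produce the same result for both orders; the recurrent cell does not, because the hidden state depends on what came before.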

An RNN

LSTM. Source: https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm

Example 2: RNNs and LSTMs

Our goal is to create an RNN that writes songs like Freddie Mercury.
We use all of Queen’s songs as the dataset.
Open the “rnn-and-lstm-sing-like-freddy” Jupyter notebook.

“Empirical ML”

It is hard to know in advance the best architecture for your problem.
We have to experiment with different hyperparameters: number of layers, neurons per layer, learning rate, activation functions.
Machine learning is empirical!

Deciding the architecture

Too few layers/neurons: Underfitting. The problem might be too complex to be represented with so few neurons.
Too many layers/neurons: Overfitting. The network might just “memorize” the training data instead of learning patterns that generalize.
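
The same trade-off shows up in any model whose capacity can be varied. The sketch below uses polynomial fitting as a stand-in for a neural network, with the polynomial degree playing the role of "number of layers/neurons". This substitution, the noisy sine data, and the chosen degrees are all assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    # The underlying pattern the model should learn (an assumption).
    return np.sin(np.pi * x)

# Small noisy training set and a separate test set from the same pattern.
x_train = np.linspace(-1, 1, 12)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_fn(x_test) + rng.normal(scale=0.2, size=x_test.shape)

def fit_and_score(degree):
    # Higher degree = more capacity, analogous to more layers/neurons.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

for degree in (1, 3, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training error can only go down as capacity grows. The number to watch is the test error: a model that underfits scores poorly on both sets, while one that overfits scores well on the training set only.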

Activation functions

Choose a:

Linear function for regression problems.
Sigmoid for binary classification.
Softmax for probabilities and multi-class classification.
ReLU for the hidden layers.

Read a simple explanation of activation functions here.
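
For reference, the four functions listed above fit in a few lines of NumPy. This is a minimal sketch; deep learning libraries ship their own implementations, and the function names here are just illustrative.

```python
import numpy as np

def linear(z):
    # Identity: the output can be any real number (regression).
    return z

def sigmoid(z):
    # Squashes to (0, 1): read as the probability of the positive class.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1.
    # Subtracting the maximum avoids overflow and does not change the result.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def relu(z):
    # Keeps positive values and zeroes out negative ones.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print("linear :", linear(z))
print("sigmoid:", np.round(sigmoid(z), 3))
print("softmax:", np.round(softmax(z), 3))
print("relu   :", relu(z))
```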

Loss functions

Choose:

Binary Cross-entropy for binary problems
Cross-entropy for multi-class classification problems
Mean Squared Error for regression problems

Read this nice explanation on how to choose activation and loss functions.
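
The three losses above can be written directly from their definitions. The sketch below is an illustration in NumPy under assumed names (`binary_cross_entropy`, `categorical_cross_entropy`, `mean_squared_error`); the clipping constant `EPS` is an added safeguard against taking the logarithm of zero.

```python
import numpy as np

EPS = 1e-12  # assumption: small constant to keep log() away from zero

def binary_cross_entropy(y_true, p):
    # y_true holds 0/1 labels, p the predicted probability of class 1.
    p = np.clip(p, EPS, 1 - EPS)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def categorical_cross_entropy(y_onehot, probs):
    # y_onehot: one row per sample with a single 1 at the true class.
    # probs: one row per sample with predicted class probabilities.
    probs = np.clip(probs, EPS, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
print(categorical_cross_entropy(np.array([[0.0, 1.0, 0.0]]),
                                np.array([[0.1, 0.8, 0.1]])))
print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```

Both cross-entropy variants penalize confident wrong answers heavily, which is why they pair with sigmoid and softmax outputs respectively.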

Batch size and epochs

Batch size and the number of epochs are hyperparameters too, and should also be tuned.
Read the tradeoff batch size vs number of iterations to train a NN discussion on Stack Overflow.
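
The arithmetic behind the trade-off is simple: one epoch is one pass over the training data, and the batch size determines how many weight updates happen in that pass. The sketch below uses the 60k MNIST training examples mentioned earlier; the helper names and the batch sizes tried are illustrative assumptions.

```python
import math
import numpy as np

def iterations_per_epoch(n_samples, batch_size):
    # One iteration = one weight update computed from one mini-batch.
    return math.ceil(n_samples / batch_size)

def minibatches(n_samples, batch_size, rng):
    # Shuffle once per epoch, then yield the indices of each mini-batch.
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

n_samples = 60_000  # size of the MNIST training set
for batch_size in (32, 128, 1024):
    print(f"batch_size={batch_size:5d} -> "
          f"{iterations_per_epoch(n_samples, batch_size)} updates per epoch")

rng = np.random.default_rng(0)
batches = list(minibatches(10, 4, rng))  # 10 samples split into 4 + 4 + 2
print([len(batch) for batch in batches])
```

Smaller batches give more (but noisier) updates per epoch; larger batches give fewer, smoother updates.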

Dropout

Dropout is a technique used to reduce overfitting in neural networks.
Basically, during training a random fraction of the neurons in a particular layer (for example, half) is deactivated. This improves generalization because it forces the layer to learn the same “concept” with different neurons.
During the prediction phase, dropout is deactivated.
(Extracted from Leonardo Araujo Santos’s online book)
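
A minimal sketch of the idea in NumPy follows. It uses the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at prediction time; this variant, the function signature, and the 0.5 rate are assumptions for illustration, not necessarily what the quoted book or the notebooks use.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    # At prediction time dropout is off: activations pass through unchanged.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    # Each neuron is kept with probability keep_prob, independently.
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout: scale the kept values so the expected sum is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))  # pretend activations of one layer

train_out = dropout(layer_output, rate=0.5, training=True, rng=rng)
test_out = dropout(layer_output, rate=0.5, training=False, rng=rng)

print("fraction dropped during training:", (train_out == 0).mean())
print("unchanged at prediction time:", np.array_equal(test_out, layer_output))
```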

Validation

Training vs test
k-fold validation (really needed in Deep Learning?)
Accuracy, precision, recall
Comparison with a baseline
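
The sketch below shows why accuracy alone can mislead and why a baseline comparison helps. The labels, the predictions, and the majority-class baseline are made-up illustrative data, and `classification_metrics` is a hypothetical helper written from the standard definitions.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

    accuracy = (tp + tn) / len(y_true)
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Imbalanced toy data: only 2 of 10 samples are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
model_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
baseline_pred = [0] * 10  # baseline: always predict the majority class

print("model   :", classification_metrics(y_true, model_pred))
print("baseline:", classification_metrics(y_true, baseline_pred))
```

Both reach the same accuracy on this data, but the baseline never finds a positive example, which only precision and recall reveal.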

Copyright

The course contents are copyrighted (c) 2018 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.