Tensor: A relation between sets of algebraic objects
(numbers, vectors, etc.)
For our purposes: an N-dimensional array of numbers
We will be using tensors in our models (hence TensorFlow)
Algebra
Tensor operations
Addition and subtraction:
In algebra, we can add or subtract tensors with the same dimensions
The operation is done element by element
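For example, a minimal sketch in TensorFlow (the tensor values here are illustrative):

import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[10., 20.], [30., 40.]])
print(a + b)  # element-wise: [[11. 22.] [33. 44.]]
print(a - b)  # element-wise: [[-9. -18.] [-27. -36.]]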
Algebra
Tensor operations
Matrix multiplication (2D)
Follows algebra rules:
$$\mathbf{C} = \mathbf{AB}$$
The number of columns of $\mathbf{A}$ must equal the number of rows of $\mathbf{B}$
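A minimal sketch (shapes chosen for illustration): a (2, 3) matrix times a (3, 2) matrix gives a (2, 2) result, and the inner dimensions must match:

import tensorflow as tf

A = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])   # 2 rows, 3 columns
B = tf.constant([[1., 0.],
                 [0., 1.],
                 [1., 1.]])       # 3 rows, 2 columns
C = tf.matmul(A, B)               # or A @ B; result has shape (2, 2)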
Algebra
Neuron: linear combination of inputs with non-linear activation
Algebra
Tensor operations
TensorFlow also allows broadcasting, like NumPy
Element-wise operations are aligned by the last (trailing) dimensions
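A minimal sketch of broadcasting (values are illustrative): a vector of shape (3,) is aligned with the last dimension of a (2, 3) matrix and added to every row:

import tensorflow as tf

m = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])   # shape (2, 3)
v = tf.constant([10., 20., 30.])  # shape (3,)
print(m + v)                      # [[11. 22. 33.] [14. 25. 36.]]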
tf.matmul() also works on 3D tensors, in batch
Can be used to compute the product of a batch of 2D matrices
Example (from the TensorFlow matmul documentation):
In : a = tf.constant(np.arange(1, 13, dtype=np.int32), shape=[2, 2, 3])
In : b = tf.constant(np.arange(13, 25, dtype=np.int32), shape=[2, 3, 2])
In : c = tf.matmul(a, b)  # or a @ b (note: a * b would be element-wise)
Out: <tf.Tensor: id=676487, shape=(2, 2, 2), dtype=int32, numpy=
array([[[ 94, 100],
        [229, 244]],
       [[508, 532],
        [697, 730]]], dtype=int32)>
Algebra
Why is this important?
Our models will be based on these types of operations
Example batches will be tensors (2D or more)
Network layers can be matrices of weights (several neurons)
Loss functions will operate and aggregate on activations and data
In practice mostly hidden
When we use the Keras API we don't need to worry about this
But it's important to understand how things work
And necessary to work with basic Tensorflow operations
Training Neural Networks
Basic Example
Classify these data with two weights, sigmoid activation
Basic Example
Computing activation
Input is a matrix with data, two columns for the features, N rows
To compute $\sum\limits_{j=1}^{2} w_jx_j$ use matrix multiplication
For each example with 2 features we get one weighted sum
Then apply sigmoid function, one activation value per example
Thus, we get activations for a batch of examples
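A minimal sketch of this computation (the data and weight values are made up):

import numpy as np
import tensorflow as tf

X = tf.constant(np.random.rand(5, 2), dtype=tf.float32)  # N=5 examples, 2 features
w = tf.constant([[0.5], [-0.3]])                         # one weight per feature, shape (2, 1)
z = tf.matmul(X, w)                                      # one weighted sum per example, shape (5, 1)
a = tf.math.sigmoid(z)                                   # one activation value per example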
Training Neural Networks
Training (Backpropagation)
Training
Backpropagation
For weight $m$ on hidden layer $i$, propagate error backwards
Averaging over a set of examples gives a (slightly) better estimate of the gradient, improving convergence
(Note that the true gradient is for the mean loss over all points)
The main advantage of batches is in using multicore hardware (GPU, for example)
This is also the reason for power of 2 minibatch sizes (8, 16, 32, ...)
Smaller minibatches can improve generalization because of the noise (random error) in the gradient estimate
For this, a minibatch of 1 would be best, but it takes much longer to train
In practice, minibatch size will probably be limited by RAM.
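As a hedged illustration (not the slides' exact code), one gradient-descent step on a minibatch can be written with tf.GradientTape, which handles the backpropagation:

import tensorflow as tf

w = tf.Variable(tf.random.normal([2, 1], stddev=0.1))
X_batch = tf.random.normal([16, 2])    # a minibatch of 16 made-up examples
y_batch = tf.random.uniform([16, 1])   # made-up targets

learning_rate = 0.1
with tf.GradientTape() as tape:
    a = tf.math.sigmoid(tf.matmul(X_batch, w))
    loss = tf.reduce_mean(tf.square(a - y_batch))  # mean loss over the minibatch
grad = tape.gradient(loss, w)          # error propagated backwards
w.assign_sub(learning_rate * grad)     # gradient-descent update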
Training
(Figure: training runs with a minibatch of 10 vs. a minibatch of 1)
Note: the actual training time is much longer for a minibatch of 1
Training Neural Networks
Improving the model
Better Models
Our simple (pseudo) neuron lacks a bias
$$y = \sum\limits_{j=1}^{2} w_jx_j + \text{bias}$$
Better Models
Our simple (pseudo) neuron lacks a bias
This means that it is stuck at (0,0) (the decision boundary must pass through the origin)
(Figure: the same model without a bias input and with a bias input)
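A minimal sketch of the fix (illustrative values): adding a bias term lets the decision boundary move away from the origin:

import tensorflow as tf

X = tf.random.normal([5, 2])              # batch of 5 examples, 2 features
w = tf.Variable(tf.random.normal([2, 1]))
b = tf.Variable(tf.zeros([1]))            # the bias input
a = tf.math.sigmoid(tf.matmul(X, w) + b)  # b is broadcast to every example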
Better Models
And one neuron cannot properly separate these sets
We need a better model:
Better Models
Neural Networks stack nonlinear transformations
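A minimal sketch of stacking (the layer sizes are illustrative): a hidden layer of sigmoid neurons feeding a sigmoid output neuron:

import tensorflow as tf

X = tf.random.normal([5, 2])                # batch of 5 examples, 2 features
W1 = tf.Variable(tf.random.normal([2, 4]))  # hidden layer: 4 neurons
b1 = tf.Variable(tf.zeros([4]))
W2 = tf.Variable(tf.random.normal([4, 1]))  # output layer: 1 neuron
b2 = tf.Variable(tf.zeros([1]))

h = tf.math.sigmoid(tf.matmul(X, W1) + b1)  # first nonlinear transformation
y = tf.math.sigmoid(tf.matmul(h, W2) + b2)  # stacked on top of the first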
Training Neural Networks
Other Details
Other Details
Initialization
Weights: random values close to zero (Gaussian or uniform probability distribution); see the sketch below
Need to break symmetry between neurons (but bias can start the same)
Some activations (e.g. sigmoid) saturate rapidly away from zero
(There are other, more sophisticated methods)
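A minimal sketch of this simple initialization (the stddev is chosen for illustration):

import tensorflow as tf

W = tf.Variable(tf.random.normal([2, 4], stddev=0.05))  # small random values break symmetry
b = tf.Variable(tf.zeros([4]))                          # biases can all start the same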
Other Details
Convergence
Since weight initialization and order of examples is random, expect different runs to converge at different epochs
Other Details
Convergence
Standardize the inputs: $x_{new} = \frac{x -\mu(X)}{\sigma(X)}$
It is best to avoid features weighing differently just because of their scale
It is also best to avoid very large or very small values, which cause numerical problems
Shifting the mean of the inputs to 0 and scaling the different dimensions also improves the loss function "landscape"
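A minimal sketch of standardization (the data are made up):

import numpy as np

X = np.random.rand(100, 2) * 50 + 10          # 100 examples, 2 features, arbitrary scale
X_new = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit deviation per feature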
Other Details
Training schedules
Epoch: one full pass through the training data
Mini-batch: one batch with part of the training data
Generally needs many epochs to train
(the greater the data set, the fewer the epochs, other things being equal)
Other Details
Shuffle the data in each epoch
Otherwise some patterns will repeat
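A minimal sketch of per-epoch shuffling (the data and epoch count are illustrative):

import numpy as np

X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)
for epoch in range(10):
    perm = np.random.permutation(len(X))  # a new random order each epoch
    X_shuf, y_shuf = X[perm], y[perm]
    # ... iterate over minibatches of X_shuf, y_shuf ...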
Other Details
Take care with the learning rate
Too small and training takes too long
But if it is too large, convergence is poor at the end (the loss oscillates around the minimum)
Training Neural Networks
Tutorial: Keras Sequential API
Keras Sequential
Building a model with Keras
import numpy as np
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from t01_aux import plot_model  # auxiliary plotting function
Create a Sequential model and add layers
model = Sequential()
model.add(Dense(4, activation='sigmoid', input_shape=(inputs,)))
In this tutorial, inputs is 2 for the 2D dataset, but it can vary
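A possible continuation (the output layer, learning rate, and dummy data below are assumptions for illustration, not the tutorial's exact code):

X_train = np.random.rand(100, 2)                 # stand-in for the 2D dataset
y_train = (X_train.sum(axis=1) > 1).astype(int)  # stand-in labels

model.add(Dense(1, activation='sigmoid'))        # assumed output neuron
model.compile(optimizer=SGD(learning_rate=0.5), loss='binary_crossentropy')
model.fit(X_train, y_train, batch_size=16, epochs=100)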