What are autoencoders?
Network trained to output the input (unsupervised)
In the hidden layers, one layer learns a code describing the input
The encoder maps from input to latent space
The decoder maps from latent space back to input space
Encoder, $h = f(x)$, and decoder, $x = g(h)$
No need for labels, since the target is the input
Why learn $x = g\left(f\left(x\right)\right)$?
Latent representation can have advantages
Lower dimension
Capture structure in the data
Data generation
Autoencoders are (usually) feedforward networks
Can be trained with the same algorithms, such as backpropagation
But since the target is $x$, they are unsupervised learners
Need some "bottleneck" to force a useful representation
Otherwise the network just copies the input to the output
Cat images: Joaquim Alves Gaspar CC-SA
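A minimal sketch of this structure, assuming PyTorch (the slides do not prescribe a framework; the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_latent=32):
        super().__init__()
        # Encoder f: maps input x to latent code h
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, n_latent),
        )
        # Decoder g: maps latent code h back to input space
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.ReLU(),
            nn.Linear(128, n_inputs),
        )

    def forward(self, x):
        h = self.encoder(x)     # h = f(x)
        return self.decoder(h)  # g(h) = g(f(x))
```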
Autoencoders
Different types of autoencoders
Undercomplete Autoencoders
Autoencoder is undercomplete if $h$ is smaller than $x$
Forces the network to learn reduced representation of input
Trained by minimizing a loss function $$L(x,g(f(x)))$$ that penalizes the difference between $x$ and $g(f(x))$
If the encoder and decoder are linear, the autoencoder is similar to PCA (without the orthogonality constraint)
With nonlinear transformations, an undercomplete autoencoder can learn more powerful representations
However, we cannot overdo it
With too much capacity, the autoencoder can just index each training example and learn nothing useful:
$$f(x_i) = i,\quad g(i) = x_i$$
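A minimal training sketch for the undercomplete case, again assuming PyTorch and the illustrative Autoencoder class above; random data stands in for a real dataset:

```python
import torch
import torch.nn as nn

model = Autoencoder(n_inputs=784, n_latent=32)  # latent much smaller than input
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(256, 784)      # placeholder batch of inputs
for epoch in range(10):
    opt.zero_grad()
    x_hat = model(x)          # g(f(x))
    loss = loss_fn(x_hat, x)  # L(x, g(f(x))): the target is the input itself
    loss.backward()
    opt.step()
```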
Mitchell's autoencoder, hidden layer of 3 neurons
Manifold Learning
Manifold
A set of points such that the neighbourhood of each point is homeomorphic to a Euclidean space
Example: the surface of a sphere
Data may lie on a lower-dimensional manifold of the input space
Learn lower-dimensional embeddings of the data manifold
Undercomplete Autoencoders
Nonlinearity lets the dimensionality reduction adapt to the manifold
PCA vs. an autoencoder (layer sizes 6, 4, 2, 4, 6), UCI banknote dataset (4 features)
Manifold Learning
Manifold learning with autoencoders
This works because we pull the network in two opposing directions:
We demand the ability to reconstruct the input
But we also constrain how the network can encode the examples
Undercompleteness is just one way of doing this
Beware of overfitting.
If the autoencoder is sufficiently powerful, it can reconstruct the training data accurately but lose generalization power
In the extreme, all information about reconstructing the training set may be in the weights and the latent representation becomes useless
Regularized Autoencoders
An overcomplete autoencoder has $h$ larger than $x$
This, by itself, is a bad idea as $h$ will not represent anything useful
But we can restrict $h$ with regularization
This way the autoencoder also learns how restricted $h$ should be
Regularized Autoencoders
Sparse Autoencoder
Force $h$ to have few activations
Example: we want the probability of $h_i$ firing $$\hat p_i = \frac{1}{m} \sum_{j=1}^m h_i(x_j)$$ to be equal to $p$ (the sparseness parameter)
Include a penalization term in the loss function
Use the Kullback-Leibler divergence between Bernoulli variables as a regularization penalty $$L(x,g(f(x)))+\lambda \sum_i\left( p \log \frac{p}{\hat p_i} +(1-p) \log \frac{1-p}{1-\hat p_i}\right)$$
Other options include L1 or L2 regularization applied to the activations of the neurons
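A sketch of the KL penalty above, assuming PyTorch and sigmoid hidden activations so each $h_i$ lies in $(0,1)$ and its batch average can be read as a firing probability; the function name is hypothetical:

```python
import torch

def kl_sparsity_penalty(h, p=0.05, eps=1e-8):
    """Sum over hidden units of KL(Bernoulli(p) || Bernoulli(p_hat_i)).
    h: hidden activations, shape (batch, n_hidden)."""
    p_hat = h.mean(dim=0).clamp(eps, 1 - eps)  # average activation per unit
    kl = p * torch.log(p / p_hat) + (1 - p) * torch.log((1 - p) / (1 - p_hat))
    return kl.sum()

# Total loss: reconstruction term plus the weighted penalty, e.g.
# loss = loss_fn(x_hat, x) + lam * kl_sparsity_penalty(h, p=0.05)
```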
Sparse autoencoders make neurons specialize
Images (norm-bounded) that maximize each neuron's activation; 100 neurons in $h$, trained on 10x10 images (image: Andrew Ng)
Sparse autoencoders trained on MNIST, different sparsity penalties (25 neurons per filter; images correspond to the highest activations)
Niang et al., Empirical Analysis of Different Sparse Penalties..., IJCNN 2015
Regularized Autoencoders
Denoising Autoencoders
We can force $h$ to be learned with noisy inputs
Output the original $x$ from corrupted $\tilde x$: $L(x,g(f(\tilde x)))$
Image: Adil Baaj, Keras Tutorial on DAE
This forces the autoencoder to remove the noise by learning the underlying distribution of $x$
Algorithm:
Sample $x_i$ from $\mathcal{X}$
Sample a corrupted version $\tilde x_i \sim C(\tilde x_i \mid x_i)$
Train with the pair $(\tilde x_i, x_i)$, minimizing $L(x_i, g(f(\tilde x_i)))$
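One training step of this algorithm, sketched in PyTorch with additive Gaussian noise as one possible choice of the corruption $C$ (the Autoencoder class is the illustrative one from earlier):

```python
import torch
import torch.nn as nn

model = Autoencoder(n_inputs=784, n_latent=32)
loss_fn = nn.MSELoss()

def corrupt(x, noise_std=0.3):
    """Sample x_tilde ~ C(x_tilde | x): here, additive Gaussian noise."""
    return x + noise_std * torch.randn_like(x)

x = torch.rand(256, 784)   # placeholder batch of clean inputs
x_tilde = corrupt(x)       # corrupted input fed to the network
x_hat = model(x_tilde)     # g(f(x_tilde))
loss = loss_fn(x_hat, x)   # penalize distance to the *clean* x
```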
Stochastic Autoencoders
We can also use autoencoders to learn probabilities
Just like with other ANNs (e.g. a softmax classifier)
The decoder is modelling a conditional probability $p_{decoder}(x\ |\ h)$
where $h$ is given by the encoder part of the autoencoder
The decoder output units can be chosen as before:
Linear for estimating the mean of Gaussian distributions
Sigmoid for Bernoulli (binary)
Softmax for discrete categories
We can think of encoder and decoder as modelling conditional probabilities $$p_{encoder}(h\ |\ x) \qquad p_{decoder}(x\ |\ h)$$
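For instance, a decoder with sigmoid outputs models $p_{decoder}(x \mid h)$ as independent Bernoullis, and maximizing its log-likelihood amounts to minimizing binary cross-entropy; a PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())

h = torch.randn(256, 32)          # codes from some encoder
x = torch.rand(256, 784).round()  # placeholder binary targets
x_prob = decoder(h)               # one Bernoulli mean per output unit
# mean negative log-likelihood -log p_decoder(x | h):
nll = nn.functional.binary_cross_entropy(x_prob, x)
```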
Autoencoders
Generating Data
Can we use autoencoders to generate new examples?
Cat images: Joaquim Alves Gaspar CC-SA
Autoencoders create a latent representation from the data
And then decode to recreate the data from this representation
Can we use the decoder to generate new examples?
Discriminative vs Generative
A discriminative model tries to approximate a function $p(y \mid x)$
E.g. logistic regression or a softmax ANN predicts the probability of each class given the features
A generative model approximates $p(x,y)$ and then finds $p(y\mid x)$: $$p(x,y) = p(y\mid x)p(x) $$
This is generative because, knowing $p(x,y)$, we can sample from the distribution
With autoencoders
We decode from $h$, so we need to find its distribution in order to generate examples from $p(h,y) = p(y\mid h)p(h)$
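A naive sketch of this idea, assuming PyTorch and the illustrative Autoencoder class from earlier: fit a simple distribution (here, independent Gaussians) to the training codes, sample from it, and decode. This only works if $p(h)$ is captured well, which motivates the approaches below:

```python
import torch

model = Autoencoder(n_inputs=784, n_latent=32)  # assume already trained
x_train = torch.rand(1000, 784)                 # placeholder training data

with torch.no_grad():
    h = model.encoder(x_train)              # codes of the training data
    mu, std = h.mean(dim=0), h.std(dim=0)   # fit independent Gaussians
    h_new = mu + std * torch.randn(16, 32)  # sample new codes from p(h)
    x_new = model.decoder(h_new)            # decode into candidate examples
```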
Generating Data
Intuition: we need to sample the right part of the latent space
Intuition: if we sample outside the right region, the result is garbage
Generative adversarial networks
Fix the latent space to some chosen distribution
At first, the result will be garbage because the network is not trained
Train a network to distinguish the real examples from fakes
One network (the generator) creates examples from the given distribution
The other (the discriminator) distinguishes real from fake
Train both, alternating, so each becomes increasingly better
Ian Goodfellow
As a result, the generator learns to map our fixed initial distribution to the space of our target examples.
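A minimal sketch of this alternating scheme in PyTorch (architectures, sizes, and learning rates are illustrative; a fixed random batch stands in for real data):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                  nn.Linear(128, 784), nn.Tanh())   # generator
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784) * 2 - 1   # placeholder real batch in [-1, 1]
for step in range(1000):
    # 1) train D to separate real examples from fakes
    z = torch.randn(64, 32)          # fixed, known latent distribution
    fake = G(z).detach()             # do not backpropagate into G here
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) train G so that D labels its fakes as real
    z = torch.randn(64, 32)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```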
Generating Data
Can we use autoencoders to generate new examples?
Yes, if we know the "shape" of the latent space
Variational Autoencoders
Train the autoencoder to encode into a given distribution
E.g. mixture of independent Gaussians
This way we learn the distribution for generating examples of each type
How do we backpropagate through random sampling?
Reparametrize: $z$ is deterministic apart from a normally distributed error, e.g. $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
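A sketch of the trick in PyTorch: sample the noise outside the graph, so $z$ is a differentiable function of the encoder outputs $\mu$ and $\log \sigma^2$ (stand-in tensors here):

```python
import torch

mu = torch.randn(64, 32, requires_grad=True)       # encoder's predicted mean
log_var = torch.randn(64, 32, requires_grad=True)  # encoder's predicted log-variance

eps = torch.randn_like(mu)               # noise sampled outside the graph
z = mu + torch.exp(0.5 * log_var) * eps  # z ~ N(mu, sigma^2), differentiable
z.sum().backward()                       # gradients reach mu and log_var
```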