A generative model learns a probability distribution from data (possibly combined with prior knowledge) and produces new samples, such as images, by drawing from the learned distribution.

Key choices
Representation
There are two main choices for the learned representation: factorized models and latent variable models.
A factorized model writes the probability distribution as a product of simpler terms via the chain rule, as in the example below.
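For instance, ordering an image's pixels as $x_1, \dots, x_n$, the standard chain-rule decomposition (not specific to any one paper) factorizes the joint distribution as
$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),$$
which is exactly the form modeled by autoregressive models such as PixelRNN/PixelCNN and WaveNet.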

A latent variable model defines a latent space, much smaller than the data space, that captures the core information of the data.

Learning
Max Likelihood Estimation
- fully-observed graphical models: PixelRNN & PixelCNN -> PixelCNN++, WaveNet (audio)
- latent-variable models: VAE -> VQ-VAE
- latent-variable invertible models (flow-based): NICE, Real NVP -> MAF, IAF, Glow
Adversarial Training
- GANs: Vanilla GAN -> improved GAN, DCGAN, cGAN -> WGAN, ProGAN -> SAGAN, StyleGAN, BigGAN
Comparison of GAN, VAE and Flow-based Models

VAE: Variational AutoEncoder
Auto-Encoding Variational Bayes - Kingma - ICLR 2014
- Title: Auto-Encoding Variational Bayes
- Task: Image Generation
- Author: D. P. Kingma and M. Welling
- Date: Dec. 2013
- Arxiv: 1312.6114
- Published: ICLR 2014
Highlights
- A reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods
- For i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator
The key idea: approximate the intractable true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ with a simpler, tractable distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$.

Figure: The graphical model of the variational autoencoder. Solid lines denote the generative model $p_\theta(\mathbf{z})\,p_\theta(\mathbf{x} \mid \mathbf{z})$; dashed lines denote the variational approximation $q_\phi(\mathbf{z} \mid \mathbf{x})$ to the intractable posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$.

Loss Function: ELBO using the KL Divergence
The evidence lower bound (ELBO) is defined as
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big),$$
and the training loss is the negative ELBO, $-\mathcal{L}(\theta, \phi; \mathbf{x})$. By minimizing this loss we maximize the lower bound on the probability of generating real data samples.
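A minimal PyTorch sketch of this negative ELBO, assuming a Bernoulli decoder (binary cross-entropy reconstruction term) and an encoder that outputs `mu` and `logvar` for a diagonal-Gaussian $q_\phi(\mathbf{z} \mid \mathbf{x})$; the function and variable names are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO for a VAE with a Bernoulli decoder and a
    diagonal-Gaussian approximate posterior q_phi(z|x) = N(mu, diag(sigma^2))."""
    # Reconstruction term: -E_q[log p_theta(x|z)], estimated with a single sample.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```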
The Reparameterization Trick
The expectation term in the loss function requires generating samples from $q_\phi(\mathbf{z} \mid \mathbf{x})$. Sampling is a stochastic process, so we cannot backpropagate the gradient through it. To make it trainable, the reparameterization trick is introduced: it is often possible to express the random variable $\mathbf{z}$ as a deterministic variable $\mathbf{z} = \mathcal{T}_\phi(\mathbf{x}, \boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon}$ is an auxiliary independent random variable and the transformation $\mathcal{T}_\phi$, parameterized by $\phi$, converts $\boldsymbol{\epsilon}$ to $\mathbf{z}$.
For example, a common choice for $q_\phi(\mathbf{z} \mid \mathbf{x})$ is a multivariate Gaussian with a diagonal covariance structure: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\odot$ refers to the element-wise product.
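A matching sketch of the Gaussian reparameterization, again with illustrative names (`mu`, `logvar` as above):

```python
import torch

def reparameterize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives in eps, so gradients flow through mu and sigma."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)     # auxiliary noise, independent of phi
    return mu + std * eps
```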

(VQ-VAE)Neural Discrete Representation Learning - van den Oord - NIPS 2017
- Title: Neural Discrete Representation Learning
- Task: Image Generation
- Author: A. van den Oord, O. Vinyals, and K. Kavukcuoglu
- Date: Nov. 2017
- Arxiv: 1711.00937
- Published: NIPS 2017
- Affiliation: Google DeepMind
Highlights
- Discrete representation for data distribution
- The prior is learned rather than fixed
Vector Quantisation (VQ)
Vector quantisation (VQ) is a method to map continuous vectors onto a finite set of "code" vectors (the codebook). The encoder output $\mathbf{z}_e(\mathbf{x})$ goes through a nearest-neighbor lookup to match it to one of the codebook embedding vectors, and this matched code vector then becomes the input to the decoder:
$$\mathbf{z}_q(\mathbf{x}) = \mathbf{e}_k, \quad \text{where } k = \arg\min_j \|\mathbf{z}_e(\mathbf{x}) - \mathbf{e}_j\|_2 .$$
The dictionary items are updated using Exponential Moving Averages (EMA), which is similar to EM-style methods such as K-means.
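A minimal sketch of the nearest-neighbor lookup and the EMA codebook update, assuming a `codebook` tensor of shape (K, D) and flattened encoder outputs `z_e` of shape (N, D); names, buffers, and the decay value are illustrative rather than taken from the paper's implementation:

```python
import torch
import torch.nn.functional as F

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.
    z_e: (N, D) encoder outputs; codebook: (K, D) embedding vectors."""
    dist = torch.cdist(z_e, codebook) ** 2   # squared L2 distances, shape (N, K)
    indices = dist.argmin(dim=1)             # nearest-neighbor index per vector
    z_q = codebook[indices]                  # quantized vectors, shape (N, D)
    return z_q, indices

def ema_update(codebook, cluster_size, ema_sum, z_e, indices, decay=0.99):
    """Exponential-moving-average codebook update (akin to a soft K-means step).
    cluster_size: (K,) running assignment counts; ema_sum: (K, D) running sums."""
    one_hot = F.one_hot(indices, codebook.shape[0]).float()          # (N, K)
    cluster_size.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)   # EMA of counts
    ema_sum.mul_(decay).add_(one_hot.t() @ z_e, alpha=1 - decay)     # EMA of sums
    codebook.copy_(ema_sum / (cluster_size.unsqueeze(1) + 1e-5))     # smoothed cluster means
    return codebook
```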

Loss Design
- Reconstruction loss
- VQ loss: The L2 error between the embedding space and the encoder outputs.
- Commitment loss: A measure to encourage the encoder output to stay close to the embedding space and to prevent it from fluctuating too frequently from one code vector to another.
Putting these together, the training objective is
$$L = \log p(\mathbf{x} \mid \mathbf{z}_q(\mathbf{x})) + \| \text{sg}[\mathbf{z}_e(\mathbf{x})] - \mathbf{e} \|_2^2 + \beta \, \| \mathbf{z}_e(\mathbf{x}) - \text{sg}[\mathbf{e}] \|_2^2,$$
where sg[.] is the stop-gradient operator (see the sketch below).
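A sketch of these loss terms in PyTorch, using `detach()` as the sg[.] operator; `x_recon`, `z_e`, and `z_q` are assumed to come from a decoder/encoder/quantizer like the sketch above (illustrative, not the official implementation):

```python
import torch.nn.functional as F

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """Reconstruction + VQ (codebook) + commitment losses."""
    recon_loss = F.mse_loss(x_recon, x)           # reconstruction term
    vq_loss = F.mse_loss(z_q, z_e.detach())       # ||sg[z_e] - e||^2, moves the codebook
    commit_loss = F.mse_loss(z_e, z_q.detach())   # ||z_e - sg[e]||^2, weighted by beta
    return recon_loss + vq_loss + beta * commit_loss
```

In the forward pass, a common way to realize the paper's straight-through gradient copy is `z_q = z_e + (z_q - z_e).detach()`, so reconstruction gradients reach the encoder despite the non-differentiable lookup.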
By training PixelCNN and WaveNet priors over the learned latent space (for images and audio, respectively), the VQ-VAE model avoids the "posterior collapse" problem that VAEs suffer from.
Generating Diverse High-Fidelity Images with VQ-VAE-2 - Razavi - 2019
- Title: Generating Diverse High-Fidelity Images with VQ-VAE-2
- Task: Image Generation
- Author: A. Razavi, A. van den Oord, and O. Vinyals
- Date: Jun. 2019
- Arxiv: 1906.00446
- Affiliation: Google DeepMind
Highlights
- Diverse generated results
- A multi-scale hierarchical organization of VQ-VAE
- Self-attention mechanism over autoregressive model

Stage 1: Training the hierarchical VQ-VAE
The design of hierarchical latent variables intends to separate local patterns (i.e., texture) from global information (i.e., object shapes). The training of the larger, bottom-level codebook is conditioned on the smaller, top-level code as well, so that it does not have to learn everything from scratch.

Stage 2: Learning a prior over the discrete latent codebook
Learning the prior ensures that, at generation time, the decoder receives input vectors sampled from a distribution similar to the one it saw during training. A powerful autoregressive model enhanced with multi-headed self-attention layers is used to capture correlations between spatial locations that are far apart in the image, thanks to the larger receptive field.

Related
- Deep Generative Models (Part 2): Flow-based Models (including PixelCNN)
- Deep Generative Models (Part 3): GANs
- Image to Image Translation (1): pix2pix, S+U, CycleGAN, UNIT, BicycleGAN, and StarGAN
- Image to Image Translation (2): pix2pixHD, MUNIT, DRIT, vid2vid, SPADE, INIT, and FUNIT
- Gated PixelCNN: Conditional Image Generation with PixelCNN Decoders - van den Oord - NIPS 2016
- PixelRNN & PixelCNN: Pixel Recurrent Neural Networks - van den Oord - ICML 2016
- Glow: Generative Flow with Invertible 1x1 Convolutions - Kingma & Dhariwal - NIPS 2018
- From Classification to Panoptic Segmentation: 7 years of Visual Understanding with Deep Learning