Deep generative networks (2020)
Published:
Blog post based on an essay on Deep Generative Newtorks. Elaborated for the DD2412 Deep Learning, Advanced Course, at KTH Royal Institute of Technology.
Introduction
Generative modeling is a major task in machine learning. Its main objectives are to learn real-world models and identify meaningful features in the input data without requiring any human supervision \cite{kingma2018glow}. The immediate applications being able to potentially capture all the patterns that are present in the data are very diverse, such as speech synthesis, text analysis, and image generation.
This essay aims to make a comparison between the most recent approaches in deep generative modeling. More specifically, we will analyze three approaches using different methods that achieve decent results in this task. These are the Noveau VAE (NVAE) \cite{vahdat2020nvae}, Glow \cite{kingma2018glow} and StyleGAN2 \cite{karras2019analyzing}. The first two approaches correspond to likelihood-based frameworks, namely hierarchical variational auto-encoders (VAE) and flow-based generative models respectively. The latter corresponds to an extension of a progressive generative adversarial network (GAN) \cite{karras2017progressive}. We will overview the key aspects of each approach and comment on the formulation that is used to generate data from raw observations. Our interest in this brief literature study is to learn about the different methods used in this task and identify the weaknesses and strengths of each method.
Methods
Flow-based Generative models
Given a model $p_{\mathbf{\theta}}$ with parameters $\mathbf{\theta}$ and a dataset $\mathcal{D}$ the log-likelihood objective of a general flow-based model is equivalent to minimizing:
\[\mathcal{L}(\mathcal{D}) = \frac{1}{N}\sum^N_{i=1}-\log p_{\mathbf{\theta}}(\textbf{x}^{(i)})\]Regarding flow-based generative models, the generative process \cite{dinh2014nice, dinh2016density} is defined as:
\[\textbf{z} \sim p_{\mathbf{\theta}}(\textbf{z}), \quad \textbf{x} = \textbf{g}_{\mathbf{\theta}}(\textbf{z})\]where $\textbf{z}$ denotes a sample from the latent space and $p_{\mathbf{\theta}}(\textbf{z})$ is its density, which for simplicity is usually defined as a spherical multivariate Gaussian distribution $p_{\mathbf{\theta}}(\textbf{z}) = \mathcal{N}(\textbf{z}; 0, \textbf{I})$. The decoding function $\textbf{g}_{\theta}(\cdot)$ is invertible, i.e. one can infer the mapping in the latent space from an input variable as follows: $\textbf{z} = \textbf{f}_{\mathbf{\theta}}(\textbf{x}) = \textbf{g}^{-1}_{\mathbf{\theta}}(\textbf{x})$. In flow-based models, the invertible function $\textbf{f}_{\mathbf{\theta}}$ (or $\textbf{g}_{\mathbf{\theta}}$) is composed by a sequence of invertible transformations: $\textbf{f}_{\mathbf{\theta}} = \textbf{f}_{1} \circ \textbf{f}_{2} \circ … \circ \textbf{f}_K$. Such sequence of transformations is named normalizing flow \cite{rezende2015variational}.
Using this assumptions on the encoder-decoder function $\textbf{f}_{\mathbf{\theta}}$, one can rewrite the probability density of a flow-based model given a datapoint using the change of variables rule.
\[\log p_{\mathbf{\theta}}(\textbf{x}) = \log p_{\mathbf{\theta}}(\textbf{z}) + \sum_{i=1}^K \log |\det(d\textbf{h}\_i/d\textbf{h}\_{i-1})|\]$d\textbf{h}_i/d\textbf{h}_{i-1}$ consists on the Jacobian matrix of the i-th transformation $\textbf{f}_i$. In general, we choose transformations that lead to simple expressions of the resulting Jacobian matrices so that the probability density estimation is easier to compute. Our interest here is to notice that the flow-based formulation leads to an immediate calculation of the likelihood of our model given the data, which can be used as a quantitative measure of the results.
Glow \cite{kingma2018glow} consists of a flow-based generative model that builds upon the previous formulation and extends from two previous flow models, NICE \cite{dinh2014nice} and RealNVP \cite{dinh2016density}. The approach consists of a sequence of steps of flow that are combined in a multi-scaled architecture. Each step of flow consists of actnorm, an invertible 1x1 convolution and an affine coupling layer. A sketch of the model is represented in Figure \ref{fig:glow-arch}. Refer to Kingma and Dhariwal \cite{kingma2018glow} for more details of this multi-scale arquitecture.
Hierarchical Variational Auto-encoders
Variational auto-encoders (VAEs) \cite{kingma2013autoencoding} aim to train a generative model by means of the joint probability density $p(\textbf{x}, \textbf{z}) = p(\textbf{x} | \textbf{z})p(\textbf{z})$, where $p(\textbf{z})$ is a prior distribution over the latent space z and $p(\textbf{x} | \textbf{z})$ is the likelihood of the data given a point in the latent space. In this formulation, z is viewed as the set of factors that generate x. Therefore, we are also interested in modelling the posterior distribution $p(\textbf{z} | \textbf{x})$ so as to infer the latent factors given a data point x. Unfortunately, these posterior is usually intractable and we are required to define an approximate posterior distribution $q(\textbf{z} | \textbf{x})$ using \textit{variational inference}. |
In order to increase the expressiveness of the variational posterior and the prior \cite{vahdat2020nvae}, the latent variables z are divided into $L$ disjoined groups: $\textbf{z} = {\textbf{z}_1, …, \textbf{z}_L}$. Then, for each disjoint group $\textbf{z}_l$, dependence on the previous disjoint groups $\textbf{z}_{< l}$ is assumed. This resulting formulation which modifies the topology of the latent space is known as hierarchical VAE. Consequently, our prior and approximate posterior will be represented by $p(\textbf{z}) = \prod_{l} p(\textbf{z}_l | \textbf{z}_{< l})$ and $q(\textbf{z} | \textbf{x})=\prod_{l} q(\textbf{z}_l | \textbf{x}, \textbf{z}_{< l})$ respectively. |
This objective extends from the original VAE formulation from Kingma and Welling \cite{kingma2013autoencoding}. Notice that both the likelihood of the data $p(\textbf{x} | \textbf{z})$ and the variational posterior $q(\textbf{z} | \textbf{x})$ can be modelled using deep neural networks. These networks correspond to the decoder and the encoder respectively. The gradient of this objective can be computed using the reparametrization trick \cite{kingma2013autoencoding} and simple Monte Carlo estimation. |
In figure \ref{fig:nvae-arch} we sketch the NVAE architecture used to implement the hierarchical structure assumed in the latent space. The generative model uses a top-down network that generates the parameters for each conditional. At each stage, the sampling of the latent variables is combined with a residual deterministic map to infer the next group. Inferring the variational posterior $q(\textbf{z} | \textbf{x})$ requires a bottom-up network to extract the representation from the input \textbf{x} and an additional top-down network to infer the latent variables hierarchically. To save computational costs, the top-down networks used to infer the hierarchical latent space are shared. |
Differing from other approaches using hierarchical latent encodings, NVAE proposes a novel neural network architecture using deep residual networks and depthwise convolutions. Moreover, it uses normalizing flows \cite{rezende2015variational} to obtain more expressive approximate posteriors. Refer to \textit{Vahdat and Kautz} \cite{vahdat2020nvae} for more details.
Generative Adversarial Networks
Since StyleGAN2 \cite{karras2019analyzing} relies on the formulation of generative adversarial networks, we will briefly describe its expression. Generative adversarial networks are composed by a generator $G(\textbf{z})$ and a discriminator $D(\textbf{x})$. The first one maps samples from a prior distribution p(\textbf{z}) to input data points. The latter distinguishes between real data points $x$, and fake samples produced by the generator $G(\textbf{z})$. As stated by \textit{Goodfellow et al.} \cite{goodfellow2014generative}, these components $G$ and $D$ play a mini-max game defined by the following objective function:
\[\min\_G \max\_D \mathbb{E}\_{\textbf{x}\sim p(\textbf{x})}\Big[\log\ D(\textbf{x})\Big] + \mathbb{E}\_{\textbf{z} \sim p(\textbf{z})}\Big[\log (1 - D(G(\textbf{z})))\Big]\]Notice that this formulation transforms the generative task into a classification one. Contrary to the previous approaches, this framework does not allow data likelihood estimation.
The objective of the StyleGAN2 approach is to fix image quality issues in its predecessor, StyleGAN \cite{karras2018stylebased}. The model proposes an unconventional generator architecture. First, the latent samples z are mapped into an intermediate latent space w. Using affine transforms, different styles are produced which are forwarded at different stages of the synthesis network g (Figure \ref{fig:latent}). The generator architecture is shown in Figure \ref{fig:styleGAN2}. Instance normalization (used in StyleGAN) is replaced by a demodulation operation, which is applied to the weights associated with each convolutional layer.
To achieve more stable training for high-resolution images StyleGAN uses the progressive growth introduced by Karas et al. \cite{karras2017progressive}. However, it was shown to produce artifacts for higher resolutions. Therefore, StyleGAN2 searches for alternative designs that allow network design with great depth and good training stability. For more details, refer to the original implementation \cite{karras2019analyzing}.
\begin{figure} \begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{glow.png}
\caption{}
\label{fig:glow-arch}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{encoder-decoder.png}
\caption{}
\label{fig:nvae-arch}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\includegraphics[width=\textwidth]{latent_space.png}
\caption{}
\label{fig:latent}
\end{subfigure}
\begin{subfigure}{0.2\textwidth}
\centering
\includegraphics[width=\textwidth]{weight_demodulation.png}
\caption{}
\label{fig:styleGAN2}
\end{subfigure}
\caption{a) Multi-scale flow-based architecture \cite{dinh2016density}. b) Hierarchical encoder and decoder used in NVAE \cite{vahdat2020nvae}. c) Latent space mapping network to generate styles A \cite{karras2018stylebased}. d) StyleGAN2 architecture: A is a \textit{style} input and B is noise input \cite{karras2019analyzing}.}
\end{figure}
Comparison
In this section, we are interested in outlining the advantages (or disadvantages) of choosing one of the three approaches described previously. We will do so by taking into account the main objectives that are pursued in the generative modeling task.
Likelihood estimation
As we have previously seen, flow-based generative models optimize the generative task towards maximizing the exact log-likelihood of the data. Therefore, this formulation lets us infer the likelihood of the data exactly without any approximation. This situation is not possible in hierarchical VAEs, where the formulation allows only to calculate an approximate value, namely the evidence lower bound \cite{kingma2013autoencoding}. In the case of GANs, we have no encoder to infer the latent variables given a data point. Therefore, an estimation of the log-likelihood of the data is not possible in this case.
Being able to infer the latent space variables from an observation in the data is a very useful tool. Latent variable inference allows us to perform interpolations between datapoints and make modifications of existing observations. This is possible both for Glow and NVAE as being likelihood-based methods.
Furthermore, computation (or estimation) of the log-likelihood of the data brings a direct quantitative measure. This can be used to assess the performance of the generative task. Table \ref{tab:generative} shows the results of Glow \cite{kingma2018glow} and NVAE \cite{vahdat2020nvae} in a variety of datasets. Notice that NVAE outperforms Glow in the generative task according to the likelihood estimation measure.
\begin{table}[] \centering \begin{tabular}{c|c|c|c} \Longstack{ \textbf{Model} \ } & \Longstack{ \textbf{CIFAR-10} \ 32x32} & \Longstack{ \textbf{ImageNet} \ 32x32} & \Longstack{ \textbf{CelebA HQ} \ 256x256}
\hline NVAE w/o flow \cite{vahdat2020nvae} & 2.93 & - & -
NVAE w/ flow \cite{vahdat2020nvae} & \textbf{2.91} & \textbf{3.92} & \textbf{0.70}
GLOW \cite{kingma2018glow} & 3.35 & 4.09 & 1.03 \
\end{tabular}
\caption{Comparison between the two likelihood-based models presented, Glow and NVAE. The performance is measured in bits/dimension (bpd). Results extracted from \cite{vahdat2020nvae}.}
\label{tab:generative} \end{table}
Other quantitative measures
Not being able to infer the latent variables in a generative model is a very noticeable disadvantage since one loses a direct quantitative measure to compare with previous approaches. To overcome these issues, \textit{Karras et al.} \cite{karras2019analyzing} propose several quantitative measures that can be used as an alternative to the likelihood estimation. Some examples are Fréchet inception distance, Precision and Recall, and perceptual path length. In their quantitative analysis, they find that StyleGAN2 reports state-of-the-art performance in generating high-resolution images. Unfortunately, neither Glow nor NVAE report results regarding these measures. Therefore, a quantitative comparison between GAN formulation and likelihood-based models is not possible.
Qualitative evaluation
Figure \ref{fig:qualitative} shows some samples generated by the three aforementioned approaches on different datasets. When comparing NVAE and Glow samples generated on CelebA HQ dataset, we observe that NVAE produces more diverse and high-quality samples. One can notice that NVAE outputs richer backgrounds and can generate detailed and more accurate hair. When comparing NVAE and StyleGAN2 on FFHQ dataset, we find it difficult to assess which model performs better. Both models can output very realistic high-quality data. Notice that StyleGAN2 can generate face wrinkles in the samples which brings more realism in the generated data, contrary to NVAE which generates faces that show very smooth skin patterns.
\begin{figure} \centering \begin{subfigure}{0.3\textwidth} \centering \includegraphics[width=\textwidth]{celebaHQ-glow.PNG} \caption{Glow samples on CelebA HQ.} \label{fig:my_label} \end{subfigure} \begin{subfigure}{0.3\textwidth} \centering \includegraphics[width=\textwidth]{styleGAN.PNG} \caption{StyleGAN2 samples on FFHQ.} \label{fig:my_label} \end{subfigure}
\begin{subfigure}{0.3\textwidth} \centering \includegraphics[width=\textwidth]{celebahq-nvae.PNG} \caption{NVAE samples on CelebA HQ.} \label{fig:my_label} \end{subfigure} \begin{subfigure}{0.3\textwidth} \centering \includegraphics[width=\textwidth]{ffhq-nvae.PNG} \caption{NVAE samples on FFHQ.} \label{fig:my_label} \end{subfigure} \caption{Samples drawn from the presented approaches. Extracted from \cite{vahdat2020nvae} and \cite{karras2019analyzing}.} \label{fig:qualitative} \end{figure}
Conclusion
We have presented three different generative modeling approaches: Glow, NVAE, and StyleGAN2. Each of them uses a different formulation to generate data from latent variables. We have analyzed the advantages and disadvantages of each approach and compared their performance in the generative task on high-resolution image datasets. Since qualitative evaluation between the state-of-the-art models StyleGAN2 and NVAE is not easy, we argue that future work on this essay would include reporting NVAE performance on the quantitative measures that are presented in \textit{Karras et al.} \cite{karras2019analyzing}.
\bibliographystyle{plain} \bibliography{bibtex}