1. Motivation

We assume that there is a true data distribution \(p_{data}(x)\), which is only accessible through \(\lbrace x_1, x_2, ..., x_N \rbrace\) that are sampled from \(p_{data}(x)\). The goal of generative models is to find an approximation of \(p_{data}(x)\):

\[p_{\theta}(x) \approx p_{data}(x).\]

A generative model consists of an architecture and parameters \(\theta\). The architecture reflects our assumptions about what \(p_{data}(x)\) looks like, while \(\theta\) selects a particular distribution within that family (a minimal sketch is given after the list below). Generative models have many applications, including:

  • generation of new samples
  • anomaly and outlier detection
  • denoising and missing-value imputation
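
To make "architecture plus parameters" concrete, here is a minimal sketch (not from the original seminar; the class name and interface are my own, chosen purely for illustration). Choosing a one-dimensional Gaussian family is the architecture, and \(\theta = (\mu, \sigma)\) are the parameters that pick out one distribution from that family.

```python
import numpy as np

class GaussianModel:
    """A toy generative model p_theta(x): a one-dimensional Gaussian.

    The choice of the Gaussian family is the "architecture";
    theta = (mu, sigma) are the parameters.
    """

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma

    def log_prob(self, x):
        # log p_theta(x) of the Gaussian density
        return (-0.5 * np.log(2.0 * np.pi * self.sigma ** 2)
                - 0.5 * ((x - self.mu) / self.sigma) ** 2)

    def sample(self, n, seed=0):
        # draw new samples x ~ p_theta(x)
        rng = np.random.default_rng(seed)
        return rng.normal(self.mu, self.sigma, size=n)

model = GaussianModel(mu=2.0, sigma=0.5)
x_new = model.sample(5)            # application: generation of new samples
scores = model.log_prob(x_new)     # application: low log p_theta(x) can flag outliers
```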

2. Learning

As stated in Section 1, the goal is to learn a \(p_{\theta}(x)\) that approximates \(p_{data}(x)\). There are two issues in achieving this goal.

  • Issue 1: \(p_{data}(x)\) is unknown.
  • Issue 2: It is unclear how to measure the “distance” between \(p_{data}(x)\) and \(p_{\theta}(x)\).

The first issue can be addressed by approximating \(p_{data}(x)\) with the empirical distribution \(\hat{p}_{data}^{N}(x) \mathrel{\vcenter{:}}= \frac{1}{N}\sum_{i=1}^{N}\delta(x-x_i)\). The second can be addressed by adopting the KL divergence \(D_{KL}[p(x) \Vert q(x)] \mathrel{\vcenter{:}}= \mathbb{E}_{p(x)}[\log\frac{p(x)}{q(x)}]\) as the distance. The learning objective is then to find the following \(\hat{\theta}\):

\[\begin{aligned} \hat{\theta} &\mathrel{\vcenter{:}}= \underset{\theta}{\text{argmin}} D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmin}} \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log \hat{p}_{data}^{N}(x)] - \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmax}} \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmax}} \frac{1}{N} \sum_{i=1}^{N}\log p_{\theta}(x_i) \\ \bigl( &= \underset{\theta}{\text{argmax}} \prod_{i=1}^{N}p_{\theta}(x_i) \bigr) \\ \end{aligned}\]
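
To see this equivalence in action, the following sketch (my own illustration under the toy assumption that \(p_{\theta}(x)\) is a one-dimensional Gaussian) fits \(\theta\) by minimizing the negative average log-likelihood \(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(x_i)\), which by the derivation above differs from \(D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)]\) only by a \(\theta\)-independent constant, and compares the result with the closed-form Gaussian maximum likelihood estimate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # stand-in for samples from p_data(x)

def neg_avg_log_lik(theta):
    # -(1/N) * sum_i log p_theta(x_i) for a Gaussian p_theta
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                    # keep sigma > 0 by optimizing log sigma
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   + 0.5 * ((x - mu) / sigma) ** 2)

# Minimizing the negative average log-likelihood is exactly minimizing
# D_KL[p_hat^N || p_theta] up to the theta-independent entropy term.
result = minimize(neg_avg_log_lik, x0=np.zeros(2))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form Gaussian MLE for comparison: sample mean and (1/N)-variance std.
print(mu_hat, sigma_hat)
print(x.mean(), x.std())
```

Both routes should give (approximately) the same \(\hat{\mu}\) and \(\hat{\sigma}\), since they optimize the same objective.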

The above derivation shows that minimizing the KL divergence between \(\hat{p}_{data}^{N}(x)\) and \(p_{\theta}(x)\) is equivalent to maximum likelihood estimation. Thus, whenever we perform ordinary maximum likelihood estimation, we should keep in mind that we are implicitly using the KL divergence as the distance metric. Because the KL divergence is asymmetric, this choice can have undesirable effects on the learned result; for example, minimizing the forward KL \(D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)]\) encourages \(p_{\theta}(x)\) to cover all regions where the data have mass, even at the cost of placing probability where the data are scarce. In addition, we use \(\hat{p}_{data}^{N}(x)\) instead of \(p_{data}(x)\), so maximum likelihood estimation does not necessarily lead to generalization: if the number of training samples \(N\) is small, the model can over-fit.
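
As a small illustration of this last point (again a toy Gaussian sketch of my own, not part of the seminar material), the snippet below fits \(p_{\theta}(x)\) by maximum likelihood on small and large training sets drawn from the same \(p_{data}(x)\) and evaluates the average log-likelihood on held-out samples. With small \(N\), the training score typically exceeds the held-out score, i.e., the model fits \(\hat{p}_{data}^{N}(x)\) rather than \(p_{data}(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_log_lik(x, mu, sigma):
    # (1/N) * sum_i log N(x_i | mu, sigma^2)
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   - 0.5 * ((x - mu) / sigma) ** 2)

x_test = rng.normal(0.0, 1.0, size=10000)        # fresh samples from p_data = N(0, 1)
for n in [5, 5000]:
    x_train = rng.normal(0.0, 1.0, size=n)       # training samples from the same p_data
    mu_hat, sigma_hat = x_train.mean(), x_train.std()   # Gaussian MLE (1/N variance)
    print(n,
          avg_log_lik(x_train, mu_hat, sigma_hat),   # fit to p_hat^N (training)
          avg_log_lik(x_test, mu_hat, sigma_hat))    # generalization to p_data (held out)
```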

3. Reference

  • Summer seminar “Deep Generative Models” provided by Matsuo Lab