1. Motivation

We assume that there is a true data distribution \(p_{data}(x)\), which is only accessible through \(\lbrace x_1, x_2, ..., x_N \rbrace\) that are sampled from \(p_{data}(x)\). The goal of generative models is to find an approximation of \(p_{data}(x)\):

\[p_{\theta}(x) \approx p_{data}(x).\]

A generative model consists of an architecture and parameters \(\theta\). The architecture reflects our assumptions about what \(p_{data}(x)\) looks like, while \(\theta\) selects a particular distribution within that family (a minimal sketch is given after the list below). Generative models have many applications, including:

  • generation of new samples
  • anomaly and outlier detection
  • denoising and missing-value imputation
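
To make "architecture plus parameters" concrete, here is a minimal sketch (not from the original seminar; the class name and interface are my own, chosen purely for illustration). Choosing a one-dimensional Gaussian family is the architecture, and \(\theta = (\mu, \sigma)\) are the parameters that pick out one distribution from that family.

```python
import numpy as np

class GaussianModel:
    """A toy generative model p_theta(x): a one-dimensional Gaussian.

    The choice of the Gaussian family is the "architecture";
    theta = (mu, sigma) are the parameters.
    """

    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma

    def log_prob(self, x):
        # log p_theta(x) of the Gaussian density
        return (-0.5 * np.log(2.0 * np.pi * self.sigma ** 2)
                - 0.5 * ((x - self.mu) / self.sigma) ** 2)

    def sample(self, n, seed=0):
        # draw new samples x ~ p_theta(x)
        rng = np.random.default_rng(seed)
        return rng.normal(self.mu, self.sigma, size=n)

model = GaussianModel(mu=2.0, sigma=0.5)
x_new = model.sample(5)            # application: generation of new samples
scores = model.log_prob(x_new)     # application: low log p_theta(x) can flag outliers
```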

2. Learning

As stated in Section 1, the goal is to learn a \(p_{\theta}(x)\) that approximates \(p_{data}(x)\). There are two issues in achieving this goal.

  • Issue 1: \(p_{data}(x)\) is unknown.
  • Issue 2: It is unclear how to measure the “distance” between \(p_{data}(x)\) and \(p_{\theta}(x)\).

The first issue can be addressed by approximating \(p_{data}(x)\) with the empirical distribution \(\hat{p}_{data}^{N}(x) \mathrel{\vcenter{:}}= \frac{1}{N}\sum_{i=1}^{N}\delta(x-x_i)\). The second can be addressed by adopting the KL divergence \(D_{KL}[p(x) \Vert q(x)] \mathrel{\vcenter{:}}= \mathbb{E}_{p(x)}[\log\frac{p(x)}{q(x)}]\) as the distance. The learning objective is then to find the following \(\hat{\theta}\):

\[\begin{aligned} \hat{\theta} &\mathrel{\vcenter{:}}= \underset{\theta}{\text{argmin}} D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmin}} \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log \hat{p}_{data}^{N}(x)] - \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmax}} \mathbb{E}_{\hat{p}_{data}^{N}(x)}[\log p_{\theta}(x)] \\ &= \underset{\theta}{\text{argmax}} \frac{1}{N} \sum_{i=1}^{N}\log p_{\theta}(x_i) \\ \bigl( &= \underset{\theta}{\text{argmax}} \prod_{i=1}^{N}p_{\theta}(x_i) \bigr) \\ \end{aligned}\]
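
To see this equivalence in action, the following sketch (my own illustration under the toy assumption that \(p_{\theta}(x)\) is a one-dimensional Gaussian) fits \(\theta\) by minimizing the negative average log-likelihood \(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(x_i)\), which by the derivation above differs from \(D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)]\) only by a \(\theta\)-independent constant, and compares the result with the closed-form Gaussian maximum likelihood estimate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # stand-in for samples from p_data(x)

def neg_avg_log_lik(theta):
    # -(1/N) * sum_i log p_theta(x_i) for a Gaussian p_theta
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                    # keep sigma > 0 by optimizing log sigma
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   + 0.5 * ((x - mu) / sigma) ** 2)

# Minimizing the negative average log-likelihood is exactly minimizing
# D_KL[p_hat^N || p_theta] up to the theta-independent entropy term.
result = minimize(neg_avg_log_lik, x0=np.zeros(2))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form Gaussian MLE for comparison: sample mean and (1/N)-variance std.
print(mu_hat, sigma_hat)
print(x.mean(), x.std())
```

Both routes should give (approximately) the same \(\hat{\mu}\) and \(\hat{\sigma}\), since they optimize the same objective.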

The above derivation shows that minimizing the KL divergence between \(\hat{p}_{data}^{N}(x)\) and \(p_{\theta}(x)\) is equivalent to maximum likelihood estimation. Thus, whenever we perform ordinary maximum likelihood estimation, we should keep in mind that we are implicitly using the KL divergence as the distance metric. Because the KL divergence is asymmetric, this choice can have undesirable effects on the learned result; for example, minimizing the forward KL \(D_{KL}[\hat{p}_{data}^{N}(x) \Vert p_{\theta}(x)]\) encourages \(p_{\theta}(x)\) to cover all regions where the data have mass, even at the cost of placing probability where the data are scarce. In addition, we use \(\hat{p}_{data}^{N}(x)\) instead of \(p_{data}(x)\), so maximum likelihood estimation does not necessarily lead to generalization: if the number of training samples \(N\) is small, the model can over-fit.
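
As a small illustration of this last point (again a toy Gaussian sketch of my own, not part of the seminar material), the snippet below fits \(p_{\theta}(x)\) by maximum likelihood on small and large training sets drawn from the same \(p_{data}(x)\) and evaluates the average log-likelihood on held-out samples. With small \(N\), the training score typically exceeds the held-out score, i.e., the model fits \(\hat{p}_{data}^{N}(x)\) rather than \(p_{data}(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_log_lik(x, mu, sigma):
    # (1/N) * sum_i log N(x_i | mu, sigma^2)
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   - 0.5 * ((x - mu) / sigma) ** 2)

x_test = rng.normal(0.0, 1.0, size=10000)        # fresh samples from p_data = N(0, 1)
for n in [5, 5000]:
    x_train = rng.normal(0.0, 1.0, size=n)       # training samples from the same p_data
    mu_hat, sigma_hat = x_train.mean(), x_train.std()   # Gaussian MLE (1/N variance)
    print(n,
          avg_log_lik(x_train, mu_hat, sigma_hat),   # fit to p_hat^N (training)
          avg_log_lik(x_test, mu_hat, sigma_hat))    # generalization to p_data (held out)
```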

3. Reference

  • Summer seminar “Deep Generative Models” provided by Matsuo Lab