
1. Motivation

We assume that there is a true data distribution $p_{\mathrm{data}}(x)$, which is accessible only through samples $\{x_1, x_2, \ldots, x_N\}$ drawn from it. The goal of a generative model is to find an approximation of $p_{\mathrm{data}}(x)$:

$$p_\theta(x) \approx p_{\mathrm{data}}(x).$$

A generative model consists of an architecture and parameters $\theta$. The architecture reflects our assumptions about what $p_{\mathrm{data}}(x)$ looks like, and the parameters $\theta$ determine the rest. Generative models have many applications, including:

  • generation of new samples
  • anomaly detection, outlier detection
  • denoising, missing-value completion

2. Learning

As stated in Section 1, the goal is to learn a $p_\theta(x)$ that approximates $p_{\mathrm{data}}(x)$. Two issues arise in achieving this goal:

  • Issue 1: $p_{\mathrm{data}}(x)$ is unknown.
  • Issue 2: It is unclear how to measure the “distance” between $p_{\mathrm{data}}(x)$ and $p_\theta(x)$.

The first issue can be solved by approximating $p_{\mathrm{data}}(x)$ with the empirical distribution $\hat{p}_{\mathrm{data}}^N(x) := \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$. The second issue can be solved by introducing the KL divergence $D_{\mathrm{KL}}[p(x) \,\|\, q(x)] := \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]$. As a result, the learning objective is to derive the following $\hat{\theta}$:
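
As a concrete illustration of these two ingredients, here is a minimal sketch (our own example, not from the seminar; it assumes discrete data over a small finite support so that both distributions can be tabulated) that builds the empirical distribution from samples and evaluates the KL divergence to a candidate model:

```python
import numpy as np

def empirical_dist(samples, support):
    """Empirical distribution p_hat^N_data: relative frequency of each value in `support`."""
    counts = np.array([np.sum(samples == s) for s in support], dtype=float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """D_KL[p || q] = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy setting: p_data is a loaded three-sided die, observed only through N samples.
rng = np.random.default_rng(0)
support = np.array([0, 1, 2])
p_data = np.array([0.6, 0.3, 0.1])
samples = rng.choice(support, size=1000, p=p_data)

p_hat = empirical_dist(samples, support)      # \hat{p}^N_data(x)
p_theta = np.array([1 / 3, 1 / 3, 1 / 3])     # a (poor) uniform model p_theta(x)
print(kl_divergence(p_hat, p_theta))          # the quantity the learning objective minimizes
```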

$$
\begin{aligned}
\hat{\theta} &:= \operatorname*{arg\,min}_\theta D_{\mathrm{KL}}\!\left[\hat{p}_{\mathrm{data}}^N(x) \,\middle\|\, p_\theta(x)\right] \\
&= \operatorname*{arg\,min}_\theta \left( \mathbb{E}_{\hat{p}_{\mathrm{data}}^N(x)}\!\left[\log \hat{p}_{\mathrm{data}}^N(x)\right] - \mathbb{E}_{\hat{p}_{\mathrm{data}}^N(x)}\!\left[\log p_\theta(x)\right] \right) \\
&= \operatorname*{arg\,max}_\theta \, \mathbb{E}_{\hat{p}_{\mathrm{data}}^N(x)}\!\left[\log p_\theta(x)\right] \\
&= \operatorname*{arg\,max}_\theta \, \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i) \\
&\left( = \operatorname*{arg\,max}_\theta \, \prod_{i=1}^N p_\theta(x_i) \right)
\end{aligned}
$$
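
To see the last equivalence in action, the following sketch (again our own illustration; the Gaussian model family and the grid-search optimizer are assumptions made for simplicity) maximizes the average log-likelihood $\frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i)$ for a Gaussian $p_\theta(x) = \mathcal{N}(x;\mu,\sigma^2)$ and compares the result with the closed-form maximum likelihood estimates (sample mean and standard deviation):

```python
import numpy as np

def gaussian_avg_log_likelihood(x, mu, sigma):
    """Average log-likelihood (1/N) * sum_i log N(x_i; mu, sigma^2)."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # samples standing in for p_data

# Grid search over theta = (mu, sigma): argmax of the average log-likelihood.
mus = np.linspace(0.0, 4.0, 201)
sigmas = np.linspace(0.5, 3.0, 126)
scores = np.array([[gaussian_avg_log_likelihood(x, m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmax(scores), scores.shape)

print("grid-search MLE: mu = %.3f, sigma = %.3f" % (mus[i], sigmas[j]))
print("closed-form MLE: mu = %.3f, sigma = %.3f" % (x.mean(), x.std()))
```

Up to the grid resolution, the two estimates agree, which is exactly what the derivation above predicts.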

From the above derivation, we see that minimizing the KL divergence between $\hat{p}_{\mathrm{data}}^N(x)$ and $p_\theta(x)$ is equivalent to maximum likelihood estimation. Thus, when performing ordinary maximum likelihood estimation, we should keep in mind that we are implicitly using the KL divergence as the distance measure. Because the KL divergence is asymmetric, this choice can have undesirable effects on the learned result. In addition, we use $\hat{p}_{\mathrm{data}}^N(x)$ instead of $p_{\mathrm{data}}(x)$, so maximum likelihood estimation does not necessarily lead to generalization: for example, if the number of training samples $N$ is small, the model can over-fit.
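
The asymmetry mentioned above is easy to check numerically; the short sketch below (a toy example of ours, with two arbitrary discrete distributions) shows that $D_{\mathrm{KL}}[p \,\|\, q]$ and $D_{\mathrm{KL}}[q \,\|\, p]$ generally differ, which is why the direction of the divergence matters for what the learned model tends to do:

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])
print(kl(p, q), kl(q, p))  # the two directions give different values: KL is not symmetric
```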

3. Reference

  • Summer seminar “Deep Generative Models” provided by Matsuo Lab