The Fastest Way to Diffusion Model Theory - III
Recap

The previous section introduced the Forward Process and the Backward Process of the Denoising Diffusion Probabilistic Model (DDPM).

Forward Process

$$
dX_t = -\frac{1}{2} X_t \, dt + dW_t, \qquad t \in [0, T],
$$

where $t$ is the forward diffusion time. This process describes a gradual noising operation that transforms clean images into Gaussian noise.

Backward Process

$$
d\bar{X}_s = \left[ \frac{1}{2} \bar{X}_s + \nabla_x \log p_{T-s}(\bar{X}_s) \right] ds + d\bar{W}_s, \qquad s \in [0, T],
$$

where $s = T - t$ is the backward diffusion time, and $\nabla_x \log p_t(x)$ is the score function of the density $p_t$ of $X_t$ in the forward process.

In this section, we will show how to train a neural network that models the score function $\nabla_x \log p_t(x)$.

Prerequisites: Calculus.

Implementation of the Denoising Diffusion Probabilistic Model (DDPM)#

Numerical Implementation of the Forward Process#

To numerically simulate the forward diffusion process, we divide the time range $[0, T]$ into $N$ intervals of length $\beta_k$, where $\sum_{k=1}^{N} \beta_k = T$. We denote the intermediate times as $t_k = \sum_{j=1}^{k} \beta_j$, so that $t_0 = 0$ and $t_N = T$.

The vanilla discretization of the forward process is given by:

$$
X_{t_k} = \left( 1 - \frac{\beta_k}{2} \right) X_{t_{k-1}} + \sqrt{\beta_k} \, \epsilon_k,
$$

where we approximate $dt$ as $\beta_k$, and $\epsilon_k \sim \mathcal{N}(0, I)$ is a standard Gaussian random variable (refer to the previous section).

A more subtle but equivalent implementation is the variance-preserving (VP) form [1]:

$$
X_{t_k} = \sqrt{1 - \beta_k} \, X_{t_{k-1}} + \sqrt{\beta_k} \, \epsilon_k.
$$

This formulation ensures that if $X_{t_0}$ is initialized with unit variance, then the variance of $X_{t_k}$ remains equal to 1, since $(1 - \beta_k) \cdot 1 + \beta_k = 1$. It agrees with the vanilla discretization to first order, because $\sqrt{1 - \beta_k} \approx 1 - \beta_k / 2$ for small $\beta_k$. The process adds a small amount of Gaussian noise to the image at each time step $t_k$, gradually contaminating the image until $X_{t_N}$ is approximately pure Gaussian noise.
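To make the iteration concrete, here is a minimal NumPy sketch of the VP forward process; the constant schedule `betas` and the flattened stand-in image `x0` are placeholder assumptions for illustration, not part of the derivation above.

```python
import numpy as np

def forward_noising(x0, betas, rng):
    """Iterate X_{t_k} = sqrt(1 - beta_k) X_{t_{k-1}} + sqrt(beta_k) eps_k."""
    x = x0.copy()
    for beta in betas:
        eps = rng.standard_normal(x.shape)   # eps_k ~ N(0, I)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

rng = np.random.default_rng(0)
betas = np.full(1000, 0.005)                 # hypothetical constant schedule, T = 5
x0 = rng.standard_normal(4096)               # stand-in "image" with unit variance
x_T = forward_noising(x0, betas, rng)
print(np.var(x0), np.var(x_T))               # both close to 1 (variance preserving)
```

Running this, the sample variance stays near 1 at every step, which is exactly the variance-preserving property claimed above.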

WARNING

Note that our interpretation of $\beta_k$ differs from that in [1]: we treat $\beta_k$ as a varying time-step size used to solve the autonomous SDE (Eq. 1.5, the OU process) rather than as the coefficient of a time-dependent SDE. Our interpretation greatly simplifies the later analysis, but it holds only if every $\beta_k$ is sufficiently small.

Instead of expressing the iterative relationship between $X_{t_k}$ and $X_{t_{k-1}}$, we can directly represent the dependency of $X_{t_k}$ on the clean image $X_{t_0}$ using the following forward relation:

$$
X_{t_k} = \sqrt{\bar{\alpha}_k} \, X_{t_0} + \sqrt{1 - \bar{\alpha}_k} \, \epsilon,
$$

where $\bar{\alpha}_k = \prod_{j=1}^{k} (1 - \beta_j)$ denotes the contamination weight, and $\epsilon \sim \mathcal{N}(0, I)$ represents standard Gaussian noise.

TIP

A useful property we shall exploit later is that for infinitesimal time steps $\beta_k$, the contamination weight is the exponential of the (negative) diffusion time:

$$
\bar{\alpha}_k = \prod_{j=1}^{k} (1 - \beta_j) \approx \exp\left( -\sum_{j=1}^{k} \beta_j \right) = e^{-t_k}.
$$
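This approximation is easy to verify numerically. The sketch below (reusing the same hypothetical schedule as before) also shows how the forward relation produces $X_{t_k}$ in a single step instead of $k$ iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.full(1000, 0.005)                   # hypothetical constant schedule
alpha_bar = np.cumprod(1.0 - betas)            # contamination weights prod_j (1 - beta_j)
t = np.cumsum(betas)                           # intermediate times t_k = sum_j beta_j
print(np.max(np.abs(alpha_bar - np.exp(-t))))  # ~1e-3: alpha_bar_k ≈ exp(-t_k)

# One-shot sampling of X_{t_k} via the forward relation:
x0 = rng.standard_normal(4096)
k = 500
x_k = np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * rng.standard_normal(4096)
```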

Numerical Implementation of the Backward Process#

The backward diffusion process is used to sample from the DDPM by removing the noise from an image step by step. It is the time-reversed version of the OU process: starting at $X_{t_N} \sim \mathcal{N}(0, I)$, we simulate the reverse diffusion process of the OU process (Eq. 1.5).

The vanilla discretization of the backward process is given by:

$$
X_{t_{k-1}} = \left( 1 + \frac{\beta_k}{2} \right) X_{t_k} + \beta_k \, \nabla_x \log p_{t_k}(X_{t_k}) + \sqrt{\beta_k} \, \epsilon_k,
$$

where $\beta_k$ represents the backward time step, and $X_{t_k}$ is the image at the $k$-th step with time $t_k$.

A more common discretization is:

$$
X_{t_{k-1}} = \frac{1}{\sqrt{1 - \beta_k}} \left( X_{t_k} + \beta_k \, \nabla_x \log p_{t_k}(X_{t_k}) \right) + \sqrt{\beta_k} \, \epsilon_k.
$$

This formulation is equivalent to the vanilla discretization when $\beta_k$ is small, since $1 / \sqrt{1 - \beta_k} \approx 1 + \beta_k / 2$. The score function $\nabla_x \log p_t(x)$ is typically modeled by a neural network trained with a denoising objective.
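For concreteness, here is a minimal NumPy sketch of this sampler; `score_fn(x, k)` is a placeholder for whatever model of $\nabla_x \log p_{t_k}(x)$ is available (we construct one from a trained network later in this section).

```python
import numpy as np

def sample_backward(score_fn, betas, shape, rng):
    """Iterate X_{t_{k-1}} = (X_{t_k} + beta_k * score) / sqrt(1 - beta_k) + sqrt(beta_k) * eps_k."""
    x = rng.standard_normal(shape)             # start from X_{t_N} ~ N(0, I)
    for k in reversed(range(len(betas))):
        beta = betas[k]
        score = score_fn(x, k)                 # estimate of grad_x log p_{t_k}(x)
        x = (x + beta * score) / np.sqrt(1.0 - beta)
        if k > 0:                              # common convention: no noise on the last step
            x += np.sqrt(beta) * rng.standard_normal(shape)
    return x
```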

Training the Score Function#

Training the score function $\nabla_x \log p_t(x)$ requires a training objective. We will show that the score function can be trained with a denoising objective.

DDPM is trained to remove the noise $\epsilon$ from $X_{t_k}$ in the forward diffusion process, by training a denoising neural network $\epsilon_\theta(x, t)$ to predict and remove the noise $\epsilon$. This means that DDPM minimizes the denoising objective [2]:

$$
L = \mathbb{E}_{k, X_{t_0}, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(X_{t_k}, t_k) \right\|^2 \right],
$$

where $X_{t_k} = \sqrt{\bar{\alpha}_k} \, X_{t_0} + \sqrt{1 - \bar{\alpha}_k} \, \epsilon$ is determined from $X_{t_0}$ according to the forward relation.
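As an illustration, a single evaluation of this objective might look like the following PyTorch sketch; the network `eps_theta(x, k)` and its interface are assumptions made for the example, not a prescribed architecture.

```python
import torch

def ddpm_loss(eps_theta, x0, alpha_bar):
    """Monte Carlo estimate of E || eps - eps_theta(X_{t_k}, k) ||^2 over a batch.

    alpha_bar: 1-D tensor of contamination weights prod_j (1 - beta_j).
    """
    batch = x0.shape[0]
    k = torch.randint(0, len(alpha_bar), (batch,))         # random diffusion step per sample
    a = alpha_bar[k].view(batch, *([1] * (x0.dim() - 1)))  # broadcast to image shape
    eps = torch.randn_like(x0)                             # the noise the network must predict
    x_k = a.sqrt() * x0 + (1.0 - a).sqrt() * eps           # forward relation in one step
    return ((eps - eps_theta(x_k, k)) ** 2).mean()
```

Minimizing this loss over batches of clean images $X_{t_0}$ with a standard optimizer is the entire DDPM training loop.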

Now we show that $\epsilon_\theta$ trained with the above objective is proportional to the score function $\nabla_x \log p_t(x)$. There are two important properties regarding the relationship between the noise $\epsilon$ and the score function:

  1. Gaussian Distribution of $X_{t_k}$:
    According to the forward relation, the distribution of $X_{t_k}$ given $X_{t_0}$ is a Gaussian distribution, expressed as:

    $$
    p(X_{t_k} \mid X_{t_0}) = \mathcal{N}\left( X_{t_k}; \ \sqrt{\bar{\alpha}_k} \, X_{t_0}, \ (1 - \bar{\alpha}_k) I \right).
    $$

  2. Proportionality of Noise to Score Function:
    The noise $\epsilon$ is directly proportional to a score function (see the short derivation after this list), given by:

    $$
    \epsilon = -\sqrt{1 - \bar{\alpha}_k} \ \nabla_{X_{t_k}} \log p(X_{t_k} \mid X_{t_0}),
    $$

    where $\nabla_{X_{t_k}} \log p(X_{t_k} \mid X_{t_0})$ represents the score of the conditional probability density at $X_{t_k}$.
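To verify property 2, differentiate the Gaussian log-density of property 1 with respect to $X_{t_k}$:

$$
\nabla_{X_{t_k}} \log p(X_{t_k} \mid X_{t_0}) = -\frac{X_{t_k} - \sqrt{\bar{\alpha}_k} \, X_{t_0}}{1 - \bar{\alpha}_k} = -\frac{\sqrt{1 - \bar{\alpha}_k} \, \epsilon}{1 - \bar{\alpha}_k} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_k}},
$$

where the second equality uses the forward relation $X_{t_k} - \sqrt{\bar{\alpha}_k} \, X_{t_0} = \sqrt{1 - \bar{\alpha}_k} \, \epsilon$. Rearranging gives property 2.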

These properties indicate that the noise $\epsilon$ is directly proportional to a conditional score function. It remains to connect the conditional score function to the marginal score function $\nabla_x \log p_t(x)$, which we do next.

Now we are very close to our target. The conditional score function is connected to the score function through the following equation:

$$
\mathbb{E}_{X_{t_0}, X_{t_k}} \left[ f(X_{t_k})^\top \, \nabla_{X_{t_k}} \log p(X_{t_k} \mid X_{t_0}) \right] = \mathbb{E}_{X_{t_k}} \left[ f(X_{t_k})^\top \, \nabla_{X_{t_k}} \log p(X_{t_k}) \right],
$$

where $f$ is an arbitrary function and $\nabla_{X_{t_k}} \log p(X_{t_k})$ is the score function of the marginal probability density of $X_{t_k}$. The identity holds because averaging the conditional score over $X_{t_0} \mid X_{t_k}$ yields the marginal score: $\mathbb{E}_{X_{t_0} \mid X_{t_k}} \left[ \nabla_{X_{t_k}} \log p(X_{t_k} \mid X_{t_0}) \right] = \nabla_{X_{t_k}} \log p(X_{t_k})$.

Substituting the forward relation into the denoising objective, expanding the squares, and utilizing the above equation, we can derive that the denoising objective is equivalent (up to a constant independent of the network) to a denoising score matching objective:

$$
L = \mathbb{E}_{k, X_{t_k}} \left[ \left\| \epsilon_\theta(X_{t_k}, t_k) + \sqrt{1 - \bar{\alpha}_k} \, \nabla_{X_{t_k}} \log p(X_{t_k}) \right\|^2 \right] + \text{const}.
$$

This objective says that the denoising neural network $\epsilon_\theta$ is trained to approximate a scaled score function [3]:

$$
\epsilon_\theta(x, t_k) \approx -\sqrt{1 - \bar{\alpha}_k} \, \nabla_x \log p_{t_k}(x).
$$
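This is also how the trained network plugs into the backward sampler sketched earlier: rearranging the relation above gives a score estimate. The helper below assumes the hypothetical `eps_theta` and `alpha_bar` from the previous sketches.

```python
import numpy as np

def make_score_fn(eps_theta, alpha_bar):
    """Convert a trained noise predictor into a score estimate for sample_backward."""
    def score_fn(x, k):
        # grad_x log p_{t_k}(x) ≈ -eps_theta(x, k) / sqrt(1 - alpha_bar_k)
        return -eps_theta(x, k) / np.sqrt(1.0 - alpha_bar[k])
    return score_fn
```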

Summary#

We have covered all aspects of the DDPM theory. You can now find a suitable dataset, perform the forward process, train a denoising neural network using the denoising objective, and subsequently generate new samples with the backward process and the trained score function.
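Assembled from the hypothetical sketches above, the whole pipeline reads roughly as follows.

```python
# Sketch of the end-to-end DDPM pipeline, reusing the helpers defined above:
# betas, alpha_bar = ...                                   # choose a noise schedule
# train eps_theta by minimizing ddpm_loss(eps_theta, x0_batch, alpha_bar)
# score_fn = make_score_fn(eps_theta, alpha_bar)           # scaled-score conversion
# samples = sample_backward(score_fn, betas, shape, rng)   # backward diffusion sampling
```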

What is Next#

In the next section, we will discuss an alternative version of the backward diffusion process: ordinary differential equation (ODE) based backward sampling. This approach serves as the foundation for several modern architectures, such as rectified flow diffusion models.

Stay tuned for the next installment!

Discussion#

If you have questions, suggestions, or ideas to share, please visit the discussion post.


Footnotes#

  1. Yang Song, et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ArXiv (2020).

  2. Jonathan Ho, et al. “Denoising Diffusion Probabilistic Models.” ArXiv (2020).

  3. Ling Yang, et al. “Diffusion Models: A Comprehensive Survey of Methods and Applications.” ACM Computing Surveys (2022).

Author: Candi Zheng · Published: 2025-06-23 · License: CC BY-NC-SA 4.0