When I first learned about diffusion models, I was introduced to them as a type of variational autoencoder (VAE) applied to a series of quantities. Deriving the forward and reverse processes required derivations spanning multiple pages, dense with priors, posteriors, applications of Bayes' theorem, and other mathematical intricacies. Later, I encountered the stochastic differential equation (SDE) perspective, which frames diffusion models through the Fokker-Planck and Kolmogorov backward equations, concepts no simpler to grasp than the VAE approach.
Instead, this blog series aims to deliver a concise, self-contained, and rigorous introduction to diffusion models. While some concepts may be challenging, I believe this approach offers the fastest and most straightforward pathway to understanding diffusion model theory. We will focus exclusively on fundamental SDE principles and calculus to intuitively derive the core theory, uncovering its intrinsic structure without relying on advanced machinery.
In this section, we cover the basics of Stochastic Differential Equations (SDEs), focusing on two fundamental concepts:
- Brownian noise ($d\mathbf{W}$): The core random process driving SDE dynamics
- Langevin Dynamics: The basic SDE to generate samples from a probability distribution.
Prerequisites: Calculus, particularly series expansions and vector calculus (gradients, Laplacians).
Diffusion Process
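The diffusion process forms the mathematical foundation of diffusion models, describing a system’s evolution through deterministic drift and stochastic noise. Here we consider a diffusion process given by a stochastic differential equation (SDE) of the following form:

$$d\mathbf{x}_t = \boldsymbol{\mu}(\mathbf{x}_t, t)\,dt + \sigma(\mathbf{x}_t, t)\,d\mathbf{W}$$

where the drift term $\boldsymbol{\mu}(\mathbf{x}_t, t)\,dt$ governs deterministic motion, while $\sigma(\mathbf{x}_t, t)\,d\mathbf{W}$ adds Brownian noise.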
The Brownian noise $d\mathbf{W}$ is a key characteristic of SDEs, capturing their stochastic nature. It represents the accumulation of infinitesimal Gaussian noise. A good way to understand it is through a formal definition:

$$d\mathbf{W} = \lim_{n \to \infty} \sqrt{\frac{dt}{n}} \sum_{i=1}^{n} \boldsymbol{\epsilon}_i$$

where $\boldsymbol{\epsilon}_i$ are independent standard Gaussian noises with mean $\mathbf{0}$ and identity covariance matrix $\mathbf{I}$. The limit in this definition shows that $d\mathbf{W}$ is not just a single Gaussian random variable with mean $\mathbf{0}$, but rather the cumulative effect of infinitely many independent Gaussian increments. This accumulation allows us to compute the covariance of $d\mathbf{W}$ as a vector product:

$$d\mathbf{W}\,d\mathbf{W}^T = \mathbf{I}\,dt$$

where $\mathbf{I}$ is the identity matrix.
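As a rough numerical illustration of this covariance, the sketch below builds many samples of a Brownian increment by summing a finite number `n` of Gaussian sub-increments, each scaled by $\sqrt{dt/n}$, and averages the outer product over samples. The values of `dt`, `n`, the dimension `d`, and the sample count are arbitrary choices made here for illustration; a finite `n` only approximates the limit.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n, d, n_samples = 0.01, 50, 3, 20_000   # illustrative values, not from the text

# Each sample of dW sums n independent Gaussian sub-increments, scaled by sqrt(dt / n).
eps = rng.standard_normal((n_samples, n, d))
dW = np.sqrt(dt / n) * eps.sum(axis=1)

# Sample average of the outer product dW dW^T; it should be close to dt * I.
cov = dW.T @ dW / n_samples
print(np.round(cov / dt, 2))
```

Dividing by `dt` before printing makes the result directly comparable to the identity matrix, up to sampling error.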
NOTE: When no quadratic terms of $d\mathbf{W}$ are involved, $d\mathbf{W}$ can often be roughly treated as $\sqrt{dt}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is a standard Gaussian random variable.
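Treating the Brownian increment as a square-root-of-`dt` multiple of a standard Gaussian leads to a simple way to simulate a diffusion process on a computer, usually called the Euler-Maruyama scheme. The following is a minimal sketch under that approximation. The function name `euler_maruyama` and the example drift and noise level (a pull toward the origin with unit noise) are assumptions made here for illustration, not part of the original text.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, t_end, n_steps, rng):
    """Simulate dx = mu(x, t) dt + sigma(x, t) dW with a fixed step dt."""
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dW = np.sqrt(dt) * rng.standard_normal(x.shape)  # dW treated as sqrt(dt) * eps
        x = x + mu(x, t) * dt + sigma(x, t) * dW
    return x

rng = np.random.default_rng(0)
# Hypothetical example: 10,000 particles in 2D, all starting at (3, 3).
x0 = np.full((10_000, 2), 3.0)
x_end = euler_maruyama(lambda x, t: -x, lambda x, t: 1.0, x0,
                       t_end=5.0, n_steps=500, rng=rng)
print(x_end.mean(axis=0), x_end.var(axis=0))
```

For this particular drift and noise level, the particles should forget their starting point and settle near zero mean with a per-coordinate variance of about one half, up to discretization and sampling error.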
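The Brownian noise $d\mathbf{W}$ scales as $\sqrt{dt}$, which fundamentally alters the rules of calculus for SDEs. Rescaling time by $t' = \alpha t$ gives $dt' = \alpha\,dt$ in ordinary calculus, but the Brownian noise over the rescaled interval is $\sqrt{\alpha}\,d\mathbf{W}$. Moreover, the differential of a function $f(\mathbf{x}, t)$ is $df = \frac{\partial f}{\partial t}\,dt + \nabla f \cdot d\mathbf{x}$ in ordinary calculus, but for an SDE it follows Itô’s lemma:

$$df = \left( \frac{\partial f}{\partial t} + \boldsymbol{\mu} \cdot \nabla f + \frac{\sigma^2}{2} \nabla^2 f \right) dt + \sigma\, \nabla f \cdot d\mathbf{W}$$

This is derived by expanding $df$ as a Taylor series to second order in $d\mathbf{x}$, substituting the SDE and the covariance of $d\mathbf{W}$, and keeping all terms up to order $dt$ (note that $d\mathbf{W}$ scales as $\sqrt{dt}$, so the term quadratic in $d\mathbf{W}$ contributes at order $dt$). The emergence of the second-order Laplacian term is the key distinction from ordinary calculus. We will later use this lemma to analyze the evolution of the distribution of $\mathbf{x}_t$.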
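The effect of the Laplacian term can be checked numerically in a simple case. The sketch below assumes a one-dimensional process with zero drift and constant noise level `sigma`, started at zero, and the test function $f(x) = x^2$. Itô's lemma then predicts that the expectation of $f$ grows as $\sigma^2 t$, while the ordinary chain rule $df = 2x\,dx$ would predict no growth in expectation. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, t_end, n_steps, n_paths = 0.8, 1.0, 200, 50_000   # illustrative values
dt = t_end / n_steps

x = np.zeros(n_paths)                      # all paths start at x_0 = 0
for _ in range(n_steps):
    x += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

ito_prediction = sigma**2 * t_end          # comes from the (sigma^2 / 2) * f'' term
chain_rule_prediction = 0.0                # ordinary calculus misses that term
print(np.mean(x**2), ito_prediction, chain_rule_prediction)
```

The simulated mean of $x^2$ should land close to the Itô prediction and clearly away from zero.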
Langevin Dynamics
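Langevin dynamics is a special diffusion process that aims to generate samples from a distribution $p(\mathbf{x})$. It is defined as:

$$d\mathbf{x}_t = \mathbf{s}(\mathbf{x}_t)\,dt + \sqrt{2}\,d\mathbf{W}$$

where $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$ is the score function.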
This dynamics is often used as a Monte Carlo sampler to draw samples from $p(\mathbf{x})$, since $p(\mathbf{x})$ is its stationary distribution: the distribution that $\mathbf{x}_t$ converges to as $t \to \infty$, regardless of the initial distribution of $\mathbf{x}_0$. In particular, stationarity means that if an ensemble of particles at positions $\mathbf{x}_t$ evolves according to the given SDE, and their initial positions follow the distribution $p(\mathbf{x})$, then their positions will continue to be distributed according to $p(\mathbf{x})$ at all future times $t$.
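To verify stationarity, we will show that after evolving from time $t$ to $t + dt$, the distribution of $\mathbf{x}_{t+dt}$ is still $p(\mathbf{x})$. Consider a test function $f(\mathbf{x})$ and initial positions $\mathbf{x}_t \sim p(\mathbf{x})$; stationarity can be assessed by tracking the change in the expectation $\mathbb{E}[f(\mathbf{x}_t)]$. Using Itô’s lemma and noting that $\mathbb{E}[\nabla f \cdot d\mathbf{W}] = 0$ for any distribution of $\mathbf{x}_t$, we compute:

$$
\begin{aligned}
d\,\mathbb{E}[f(\mathbf{x}_t)] &= \mathbb{E}\left[ \mathbf{s} \cdot \nabla f + \nabla^2 f \right] dt + \sqrt{2}\,\mathbb{E}\left[ \nabla f \cdot d\mathbf{W} \right] \\
&= dt \int p(\mathbf{x}) \left( \mathbf{s}(\mathbf{x}) \cdot \nabla f + \nabla^2 f \right) d\mathbf{x} \\
&= dt \int \left( \nabla p \cdot \nabla f + p\, \nabla^2 f \right) d\mathbf{x} \\
&= dt \int \nabla \cdot \left( p\, \nabla f \right) d\mathbf{x} = 0
\end{aligned}
$$

where the third line is obtained by substituting $p\,\mathbf{s} = p\,\nabla \log p = \nabla p$, and the last integral vanishes by the divergence theorem, assuming $p\,\nabla f$ decays at infinity. Because $d\,\mathbb{E}[f(\mathbf{x}_t)] = 0$ for any test function $f$, the distribution of $\mathbf{x}_{t+dt}$ must be the same as $p(\mathbf{x})$.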
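Stationarity can also be observed in a small simulation. The sketch below assumes a one-dimensional Gaussian target, for which the score is known in closed form, and discretizes the Langevin dynamics with a small fixed step. It evolves one ensemble that already follows the target and one that starts far from it. The helper name `langevin`, the target parameters `m` and `std`, and the step size are illustrative assumptions.

```python
import numpy as np

def langevin(x, score, dt, n_steps, rng):
    """Discretized Langevin dynamics: x += s(x) dt + sqrt(2 dt) * eps."""
    for _ in range(n_steps):
        x = x + score(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
m, std = 1.0, 0.5                                  # hypothetical target p = N(m, std^2)
score = lambda x: -(x - m) / std**2                # s(x) = d/dx log p(x)

x_p = m + std * rng.standard_normal(20_000)        # ensemble already distributed as p
x_far = -3.0 + 2.0 * rng.standard_normal(20_000)   # ensemble started far from p

x_stat = langevin(x_p, score, dt=1e-3, n_steps=2000, rng=rng)
x_conv = langevin(x_far, score, dt=1e-3, n_steps=2000, rng=rng)
print(x_stat.mean(), x_stat.std())   # should stay near (m, std)
print(x_conv.mean(), x_conv.std())   # should approach (m, std)
```

The first ensemble illustrates the invariance shown above, and the second illustrates convergence for this well-behaved unimodal target. A finite step size introduces a small bias, so the match is only approximate.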
Alternative form of the Langevin Dynamics:
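Recall that the $d\mathbf{W}$ term in Langevin dynamics scales as $\sqrt{dt}$. We can reformulate the Langevin dynamics by substituting $dt$ with $\frac{1}{2} g(t)^2\,dt$, so that $\sqrt{2}\,d\mathbf{W}$ becomes $g(t)\,d\mathbf{W}$, resulting in the alternative form of the Langevin dynamics:

$$d\mathbf{x}_t = \frac{1}{2} g(t)^2\, \mathbf{s}(\mathbf{x}_t)\,dt + g(t)\,d\mathbf{W}$$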
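A time-rescaled Langevin dynamics should leave the target distribution unchanged, since rescaling time only changes how fast the particles move. The sketch below checks this under the assumption that the rescaling is expressed through a positive function, written `g(t)` here, so that the drift is multiplied by half of `g(t)` squared and the noise by `g(t)`. The standard Gaussian target and the particular schedule `g` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
score = lambda x: -x                      # standard Gaussian target, s(x) = -x
g = lambda t: 0.5 + t                     # hypothetical positive schedule

dt, n_steps = 1e-3, 2000
x = rng.standard_normal(20_000)           # ensemble starts distributed as p = N(0, 1)
for k in range(n_steps):
    t = k * dt
    noise = np.sqrt(dt) * rng.standard_normal(x.shape)
    x = x + 0.5 * g(t)**2 * score(x) * dt + g(t) * noise
print(x.mean(), x.std())                  # should remain near (0, 1)
```

Changing `g` changes the speed of mixing, but in this sketch the ensemble statistics stay close to those of the target throughout.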
IMPORTANT: Langevin Dynamics as ‘Identity’
The stationarity of $p(\mathbf{x})$ is very important: the Langevin dynamics for $p(\mathbf{x})$ acts as an “identity” operation on the distribution, transforming samples from $p(\mathbf{x})$ into new samples from the same distribution. This property enables what I believe is the simplest way to derive the forward and backward diffusion processes in diffusion models.
Langevin Dynamics as Monte Carlo Sampler
Langevin dynamics can be used to generate samples from a distribution $p(\mathbf{x})$, given its score function $\mathbf{s}(\mathbf{x})$. But its success hinges on two critical factors. First, the method is highly sensitive to initialization: a poorly chosen $\mathbf{x}_0$ may trap the sampling process in local likelihood maxima, so that it fails to explore the full distribution. Second, inaccuracies in the score estimation, particularly near $\mathbf{x}_0$, can prevent convergence altogether. These limitations led to the development of diffusion models, which eliminate the difficulty in choosing $\mathbf{x}_0$: all samples are generated by gradually denoising pure Gaussian noise.
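The sensitivity to initialization can be illustrated with a toy example. The sketch below assumes a one-dimensional target made of two equally weighted, well-separated Gaussian modes, with an exact score, and starts every particle near the right-hand mode. A correct sampler would put about half of the samples in each mode. The mixture parameters, the starting point, and the helper name `mixture_score` are hypothetical choices.

```python
import numpy as np

def mixture_score(x, means=np.array([-4.0, 4.0]), std=0.5):
    """Score of an equal-weight 1D Gaussian mixture with a shared std."""
    logp = -0.5 * ((x[:, None] - means) / std) ** 2   # component log-densities (up to a constant)
    logp -= logp.max(axis=1, keepdims=True)           # stabilize the softmax
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)           # responsibility of each mode
    return (resp * (means - x[:, None])).sum(axis=1) / std**2

rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 3000
x = np.full(5_000, 3.0)                               # poorly chosen x_0: all near one mode
for _ in range(n_steps):
    x = x + mixture_score(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)

frac_right = np.mean(x > 0)
print(frac_right)                                     # a faithful sampler would give about 0.5
```

In this setup the particles stay in the mode they started in for the whole run, so the samples miss half of the target distribution even though the score is exact.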
What is Next
In the next section, we will use Langevin dynamics as a stepping stone to derive the forward and backward diffusion processes. We will examine their mathematical formulation and how they form a dual pair—each reversing the other’s evolution.
Stay tuned for the next installment!
Discussion
If you have questions, suggestions, or ideas to share, please visit the discussion post.