When I first learned about diffusion models, I was introduced to them as a type of variational autoencoder (VAE) applied to a series of quantities $x_{0}, \dots, x_{T}$ . Deriving the forward and reverse processes required lengthy derivations spanning multiple pages, dense with priors, posteriors, applications of Bayes' theorem, and mathematical intricacies. Later, I encountered the stochastic differential equation (SDE) perspective, which frames diffusion models through the Fokker-Planck and Kolmogorov backward equations, concepts no simpler to grasp than the VAE approach.

This blog series aims to provide a concise, self-contained, and rigorous introduction to diffusion models, specifically from the perspective of Langevin dynamics. The key to understanding diffusion models lies in understanding a triangle relation that connects the forward diffusion process, the backward diffusion process, and Langevin dynamics.

While some concepts may be challenging, I believe this approach offers the fastest and most straightforward pathway to understanding diffusion model theory. We will focus exclusively on fundamental principles of stochastic differential equations (SDEs) and calculus to intuitively derive the core theory, revealing its intrinsic structure without the need for advanced machinery.

Prelude: Langevin Dynamics as Identity#

In this section, we cover the basics of Stochastic Differential Equations (SDEs), focusing on two fundamental concepts:

Brownian noise ( $d W$ ): The core random process driving SDE dynamics
Langevin Dynamics: The basic SDE to generate samples from a probability distribution.

By the end of this section, you will grasp one edge of the triangle relation: the “identity” property of Langevin dynamics on its stationary distribution $p (x)$ .

Prerequisites: Calculus, particularly series expansions and vector calculus (gradients, Laplacians).

Diffusion Process#

The diffusion process forms the mathematical foundation of diffusion models, describing a system’s evolution through deterministic drift and stochastic noise. Here we consider a diffusion process described by a stochastic differential equation (SDE) of the following form:

$$d x_{t} = μ (x_{t}, t) d t + σ (x_{t}, t) d W_{t},$$

where the drift term $μ (x_{t}, t) d t$ governs deterministic motion, while the diffusion term $σ (x_{t}, t) d W_{t}$ adds Brownian noise.

NOTE
When no quadratic terms of $d W_{t}$ are involved, $d W_{t}$ can often be roughly treated as $\sqrt{d t} ϵ$ ,
$$d W_{t} \approx \sqrt{d t} ϵ,$$
where $ϵ \sim N (0, I)$ is a standard Gaussian random variable.
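The approximation in the note is what a basic numerical simulation of an SDE uses. Below is a minimal Euler-Maruyama sketch, assuming a fixed step size and NumPy; the function name `euler_maruyama` and the pure-Brownian test case are my own choices for illustration, not part of the derivation.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, t_end, n_steps, rng):
    """Simulate dx = mu(x, t) dt + sigma(x, t) dW with a fixed step size."""
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)   # eps ~ N(0, I)
        dW = np.sqrt(dt) * eps               # discrete Brownian increment
        x = x + mu(x, t) * dt + sigma(x, t) * dW
        t += dt
    return x

rng = np.random.default_rng(0)

# Illustrative case: pure Brownian motion (mu = 0, sigma = 1),
# 20000 independent particles started at 0 and run to t = 2.
x_T = euler_maruyama(
    mu=lambda x, t: 0.0 * x,
    sigma=lambda x, t: 1.0,
    x0=np.zeros(20000),
    t_end=2.0,
    n_steps=200,
    rng=rng,
)

print("mean:", x_T.mean())      # should be close to 0
print("variance:", x_T.var())   # should be close to t_end = 2
```

Each step adds noise of size $\sqrt{d t}$ , so the variance after time $t$ accumulates to roughly $t$ rather than $t^{2}$ .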

Brownian noise, denoted as $d W$ , is a core feature of stochastic differential equations (SDEs), highlighting their random behavior. It acts like tiny bursts of Gaussian random noise.

To grasp it better, consider this formal definition:

$$d W_{t} = \sqrt{d t} \lim_{n \to \infty} \sum_{i = 1}^{n} \sqrt{\frac{1}{n}} ϵ_{i},$$

where each $ϵ_{i}$ is an independent standard Gaussian random variable with mean $0$ and covariance matrix $I$ (the identity matrix).

This limit emphasizes that $d W$ isn’t simply one Gaussian random variable with mean $0$ . Instead, it’s the buildup from countless tiny, independent Gaussian steps. This buildup lets us calculate its covariance as a vector product:

$$d W_{t} \cdot d W_{t}^{T} = \mathrm{Cov} (d W_{t}, d W_{t}) = I d t,$$

where $I$ is the identity matrix. This covariance structure reveals that $d W$ has variance proportional to the infinitesimal time increment $d t$ , so $d W$ itself scales like $\sqrt{d t}$ . This is what sets stochastic calculus apart from ordinary calculus.
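The covariance relation can be checked numerically. The sketch below, assuming NumPy and arbitrary illustrative values for `dt`, `n`, and the dimension, builds each increment from many small independent Gaussian steps (normalized so that the variance of the sum is $d t$ ) and compares the empirical covariance with $I d t$ .

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n, dim, n_samples = 0.01, 50, 2, 20000   # illustrative values

# Build each dW from n small independent Gaussian steps.
eps = rng.standard_normal((n_samples, n, dim))
dW = np.sqrt(dt) * np.sqrt(1.0 / n) * eps.sum(axis=1)   # shape (n_samples, dim)

# Empirical E[dW dW^T]; it should be close to I * dt.
cov = dW.T @ dW / n_samples

print(cov)
print(dt * np.eye(dim))
```

The diagonal entries come out near `dt` and the off-diagonal entries near zero, whatever value of `n` is used.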

Itô’s lemma#

Because Brownian noise $d W_{t}$ scales like $\sqrt{d t}$ , it bends ordinary calculus rules for SDEs.

For instance, rescaling time from $t$ to $s$ in regular calculus gives $d s = \frac{d s}{d t} d t$ . For Brownian noise, however, since $d W_{t}$ scales like $\sqrt{d t}$ , the transformation becomes $d W_{s} = \sqrt{\frac{d s}{d t}} d W_{t}$ to preserve that scaling.

Similarly, differentiating a function $f (t, x_{t})$ in ordinary calculus yields $df (t, x_{t}) = \partial_{t} f d t + \nabla_{x} f \cdot d x_{t}$ . But when $x_{t}$ follows an SDE, the differential instead obeys Itô’s lemma:

$$
\begin{aligned}
df (t, x_{t}) &= \partial_{t} f d t + \nabla_{x} f \cdot d x_{t} + \underbrace{\frac{σ ^{2}}{2} \nabla_{x}^{2} f d t}_{\text{stochastic effect}} \\
&= \partial_{t} f d t + \nabla_{x} f \cdot (μ (x_{t}, t) d t + σ (x_{t}, t) d W_{t}) + \underbrace{\frac{σ ^{2}}{2} \nabla_{x}^{2} f d t}_{\text{stochastic effect}}
\end{aligned}
$$

We won’t derive it step by step here. Intuitively, it comes from a Taylor expansion: plug in the SDE, and since $(d W_{t})^{2} \approx d t$ , the second-order Laplacian term $\nabla_{x}^{2} f$ persists as a first-order $d t$ contribution. This Laplacian highlights the key difference from deterministic calculus. We’ll later use this lemma to analyze how the probability distribution of $x_{t}$ evolves.
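As a quick sanity check of the lemma, here is a small worked case (my own choice of example, not taken from the derivation above). Take a scalar $x_{t}$ and $f (x) = x^{2}$ , so that $\partial_{t} f = 0$ , $\nabla_{x} f = 2 x$ and $\nabla_{x}^{2} f = 2$ . Itô’s lemma then gives

$$
d (x_{t}^{2}) = 2 x_{t} d x_{t} + σ ^{2} d t = (2 x_{t} μ + σ ^{2}) d t + 2 x_{t} σ d W_{t} .
$$

For pure Brownian motion ( $μ = 0$ , $σ = 1$ , $x_{0} = 0$ , so $x_{t} = W_{t}$ ) this reads $d (W_{t}^{2}) = d t + 2 W_{t} d W_{t}$ . Taking the average removes the noise term and leaves $E [W_{t}^{2}] = t$ , which is the familiar variance of Brownian motion. Ordinary calculus would drop the $d t$ term and wrongly predict that $E [W_{t}^{2}]$ stays at zero.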

Langevin Dynamics#

Langevin Dynamics is a special diffusion process that aims to generate samples from a probability distribution $p (x)$ . It is defined as:

$$d x_{t} = s (x_{t}) d t + \sqrt{2} d W_{t},$$

where $s (x) = \nabla_{x} \log p (x)$ is the score function of $p (x)$ . This dynamics is often used as a Monte Carlo sampler to draw samples from $p (x)$ , since $p (x)$ is its stationary distribution: the distribution that $x_{t}$ converges to and remains at as $t \to \infty$ , regardless of the initial distribution of $x_{0}$ .
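To make the sampler concrete, here is a minimal sketch of discretized Langevin dynamics (often called the unadjusted Langevin algorithm), assuming NumPy and a target whose score is known in closed form. The Gaussian target $N (m, v)$ , the step size, and the helper name `langevin_sample` are assumptions chosen for illustration.

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Discretized Langevin dynamics: x <- x + s(x) * dt + sqrt(2 * dt) * eps."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = x + score(x) * step + np.sqrt(2.0 * step) * eps
    return x

# Assumed target for illustration: p(x) = N(m, v), whose score is -(x - m) / v.
m, v = 3.0, 0.25

def score(x):
    return -(x - m) / v

rng = np.random.default_rng(0)
x0 = rng.uniform(-10.0, 10.0, size=20000)   # deliberately not distributed as p
samples = langevin_sample(score, x0, step=1e-3, n_steps=5000, rng=rng)

print("mean:", samples.mean())      # should approach m
print("variance:", samples.var())   # should approach v
```

With a finite step size the discretized chain only approximates $p (x)$ ; the small bias shrinks as the step size goes to zero.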

NOTE
Stationary distribution#
$p (x)$ is the stationary distribution of the Langevin dynamics. This means: If you start with particles whose initial positions $\{x_{0}^{(1)}, x_{0}^{(2)}, \dots, x_{0}^{(N)}\}$ already follow $p (x)$ (like sampling $x_{0}$ from $p (x)$ ), then when you evolve those same particles using Langevin dynamics, their positions $\{x_{t}^{(1)}, x_{t}^{(2)}, \dots, x_{t}^{(N)}\}$ at any future time $t > 0$ will still follow $p (x)$ . The distribution doesn’t change over time.

If you’re comfortable assuming that $p (x)$ is the stationary distribution for the Langevin dynamics, that’s fine. If not, here’s a short proof.

To check stationarity, we show that after a small time step from 0 to $Δ t$ , the distribution of $x_{Δ t}$ remains $p (x)$ .

Pick any smooth test function $f$ . Start with initial points $x_{0}$ drawn from $p (x)$ . We track the change in the average value of $f$ at $x_{Δ t}$ , i.e., $E_{x_{0} \sim p (x)} [f (x_{Δ t})]$ .

Using Itô’s lemma (substitute $d t$ with $Δ t$ ) we have, to first-order accuracy,

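$$
\begin{aligned}
f (x_{Δ t}) - f (x_{0}) &\approx \underbrace{\partial_{t} f}_{= 0} Δ t + \nabla_{x} f \cdot \Big( \underbrace{μ}_{= s (x)} Δ t + \underbrace{σ d W_{t}}_{\text{zero average}} \Big) + \underbrace{\frac{σ ^{2}}{2}}_{σ = \sqrt{2}} \nabla_{x}^{2} f Δ t \\
&= Δ t (\nabla_{x} f \cdot s + \nabla_{x}^{2} f) + \sqrt{2} \nabla_{x} f \cdot d W_{t}
\end{aligned}
$$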

and noting that the noise term averages to zero ( $E [d W_{t}] = 0$ ), we get:

$$
\begin{aligned}
E_{x_{0} \sim p (x)} [f (x_{Δ t}) - f (x_{0})] &= Δ t \int p (x) (\nabla_{x} f \cdot s + \nabla_{x}^{2} f) d x && \text{(to first order)} \\
&= Δ t \int f (x) (- \nabla_{x} \cdot (p s) + \nabla_{x}^{2} p) d x && \text{(integration by parts)} \\
&= Δ t \int f (x) \nabla_{x} \cdot (- p s + \nabla_{x} p) d x = 0,
\end{aligned}
$$

where the zero comes from plugging in $s = \nabla_{x} \log p$ , which gives $p s = \nabla_{x} p$ and makes the term inside the divergence vanish.

Since this average change is zero for any test function $f$ , the distribution must stay unchanged.
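A concrete case may help here. As an illustration of my own (not part of the proof), take the one-dimensional standard Gaussian $p (x) \propto e^{- x^{2} / 2}$ . Its score is $s (x) = \nabla_{x} \log p (x) = - x$ , and $\nabla_{x} p = - x p$ , so

$$
- p s + \nabla_{x} p = x p - x p = 0 .
$$

The term inside the divergence vanishes pointwise, as the proof requires. In this case the Langevin dynamics reads $d x_{t} = - x_{t} d t + \sqrt{2} d W_{t}$ : the drift pulls samples back toward the origin at exactly the rate needed to balance the spreading caused by the noise.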

Alternative form of the Langevin Dynamics:#

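Recall that the term $d W_{t}$ in Langevin dynamics scales as $\sqrt{d t}$ . We can therefore reformulate the Langevin dynamics by substituting $d t$ with $\frac{1}{2} d t$ : the drift picks up a factor $\frac{1}{2}$ , while the noise coefficient is multiplied by $\sqrt{\frac{1}{2}}$ . This results in the alternative form of the Langevin dynamics: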

$$d x_{t} = \frac{1}{2} s (x_{t}) d t + d W_{t} .$$
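The two forms differ only by a change of clock: running the alternative form for time $2 T$ should give the same distribution as running the original form for time $T$ . The sketch below checks this numerically, assuming NumPy and, for illustration only, a standard Gaussian target with score $s (x) = - x$ ; the helper `simulate` and all parameter values are hypothetical.

```python
import numpy as np

def simulate(drift_scale, noise_scale, x0, t_end, n_steps, rng):
    """Euler-Maruyama for dx = drift_scale * s(x) dt + noise_scale * dW, with s(x) = -x."""
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = x + drift_scale * (-x) * dt + noise_scale * np.sqrt(dt) * eps
    return x

rng = np.random.default_rng(0)
x0 = np.full(50000, 2.0)   # all particles start at x = 2, far from equilibrium
T = 0.3

# Original form, dx = s dt + sqrt(2) dW, run for time T.
x_a = simulate(1.0, np.sqrt(2.0), x0, t_end=T, n_steps=300, rng=rng)
# Alternative form, dx = (1/2) s dt + dW, run for time 2T.
x_b = simulate(0.5, 1.0, x0, t_end=2.0 * T, n_steps=300, rng=rng)

print("means:", x_a.mean(), x_b.mean())
print("variances:", x_a.var(), x_b.var())
```

The means and variances agree up to Monte Carlo error, which suggests that the two forms describe the same process on different time scales.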

IMPORTANT
Langevin Dynamics as ‘Identity’#
The stationarity of $p (x)$ is very important: The Langevin dynamics for $p (x)$ acts as an “identity” operation on the distribution, transforming samples from $p (x)$ into new samples from the same distribution. This property enables a simple way to derive the forward and backward diffusion processes of diffusion models.

Langevin Dynamics as Monte Carlo Sampler#

Langevin dynamics can be used to generate samples from a distribution $p (x)$ , given its score function $s$ . But its success hinges on two critical factors. First, the method is highly sensitive to initialization: a poorly chosen $x_{0}$ may trap the sampling process in local likelihood maxima, so that it fails to explore the full distribution in any practical amount of time. Second, inaccuracies in the score estimation, particularly near $x_{0}$ , can prevent convergence altogether. These limitations led to the development of diffusion models, which eliminate the difficulty in choosing $x_{0}$ : all samples are generated by gradually denoising pure Gaussian noise.
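The sensitivity to initialization is easy to reproduce. The sketch below, assuming NumPy, uses a toy target of my own choosing: an equal mixture of $N (- a, 1)$ and $N (a, 1)$ with $a = 6$ , whose score works out to $- x + a \tanh (a x)$ . Every chain starts in the left mode.

```python
import numpy as np

a = 6.0  # half-distance between the two modes (assumed for illustration)

def score(x):
    """Score of the mixture 0.5 * N(-a, 1) + 0.5 * N(a, 1)."""
    return -x + a * np.tanh(a * x)

def langevin(x0, step, n_steps, rng):
    """Discretized Langevin dynamics with the mixture score above."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = x + score(x) * step + np.sqrt(2.0 * step) * eps
    return x

rng = np.random.default_rng(0)
x0 = np.full(5000, -a)                      # every chain starts in the left mode
x = langevin(x0, step=1e-2, n_steps=2000, rng=rng)

frac_right = np.mean(x > 0)                 # the target puts mass 0.5 on x > 0
print("fraction of samples in the right mode:", frac_right)
print("sample mean:", x.mean())             # the target mean is 0
```

Although the target places half of its mass on each side, the chains stay near $- a$ for the whole run, because crossing the low-density region between the modes is exponentially unlikely. The samples look perfectly reasonable locally while missing half of the distribution.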

What is Next#

In the next section, we will use Langevin dynamics as a stepping stone to derive the forward and backward diffusion processes. We will examine their mathematical formulation and how they form a dual pair—each reversing the other’s evolution.

Stay tuned for the next installment!

Discussion#

If you have questions, suggestions, or ideas to share, please visit the discussion post.

Cite this blog#

This blog is a reformulation of the appendix of the following paper.

@misc{zheng2025lanpainttrainingfreediffusioninpainting,
      title={LanPaint: Training-Free Diffusion Inpainting with Asymptotically Exact and Fast Conditional Sampling},
      author={Candi Zheng and Yuan Lan and Yang Wang},
      year={2025},
      eprint={2502.03491},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2502.03491},
}