Diffusion Model Notes

Diffusion Process

Starting from the original data distribution $x_0 \sim q(x_0)$, Gaussian noise is added step by step until the distribution of $x_T$ becomes an isotropic Gaussian.

  • Definition of the forward diffusion process

    $q(x_t|x_{t-1})=N(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)$

    $q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1})$ (a Markov chain)

  • Using the reparameterization trick, $q(x_t|x_0)$ at any timestep can be derived in closed form, with no need for iteration

    $x_t=\sqrt{\alpha_t}x_{t-1}+\sqrt{1-\alpha_t}z_{t-1}=\cdots=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z$

    where $\alpha_t=1-\beta_t$ and $\overline{\alpha}_t=\prod_{i=1}^t\alpha_i$; the reparameterization shows up in $\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_{t-1}}z_{t-2})+\sqrt{1-\alpha_t}z_{t-1}$: the term $\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}z_{t-2}+\sqrt{1-\alpha_t}z_{t-1}$ is a sum of two independent Gaussians, so it can be reparameterized as $\sqrt{1-\alpha_t\alpha_{t-1}}\overline{z}_{t-2}$

  • The variance $\beta_t$ of the noise added at each timestep is given in advance and increases with $t$

  • The scaling of the mean at each timestep also depends on $\beta_t$ (the mean of $q(x_t|x_{t-1})$ is $\sqrt{1-\beta_t}x_{t-1}$), chosen so that $x_T$ converges stably to $N(0,I)$

  • From $\mathbf{x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z}$ we obtain

    • $\mathbf{q(x_t|x_0)=N(x_t;\sqrt{\overline{\alpha}_t}x_0,(1-\overline{\alpha}_t)I)}$

    • As noise keeps being added, $x_t$ gradually approaches pure Gaussian noise

    • $\mathbf{x_0=\frac{1}{\sqrt{\overline{\alpha}_t}}(x_t-\sqrt{1-\overline{\alpha}_t}z_t)}$

  • The posterior conditional probability $q(x_{t-1}|x_t,x_0)$ of the diffusion process has a closed form: given $x_t$ and $x_0$, the distribution of $x_{t-1}$ can be computed

    Assuming $\beta_t$ is sufficiently small, $\mathbf{q(x_{t-1}|x_t,x_0)=N(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_tI)}$

    Using the Gaussian probability density function, Bayes' rule, and reading off the mean and variance of the resulting quadratic form (see the sketch after this list for the omitted steps), $\mathbf{\tilde{\mu}_t=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_t)}$

    That is, conditioned on $x_0$, the posterior distribution can be computed from $x_t$ and $z_t$
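
A sketch of the derivation omitted above, using only facts already stated (this is the standard DDPM result). By Bayes' rule,

    $q(x_{t-1}|x_t,x_0)\propto q(x_t|x_{t-1})\,q(x_{t-1}|x_0)$

and matching the quadratic and linear terms in $x_{t-1}$ of the Gaussian exponents gives

    $\tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t,\qquad\tilde{\mu}_t(x_t,x_0)=\frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t$

Substituting $x_0=\frac{1}{\sqrt{\overline{\alpha}_t}}(x_t-\sqrt{1-\overline{\alpha}_t}z_t)$ then yields the expression for $\tilde{\mu}_t$ above.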

Reverse Diffusion Process

Starting from Gaussian noise $x_T$, the original data $x_0$ is recovered step by step; this is likewise a Markov chain. Since the true reverse distribution $q(x_{t-1}|x_t)$ is intractable, a network with parameters $\theta$ is trained to approximate it.

  • $\mathbf{p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))}$

    $p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)$

Objective Function

Apply the variational lower bound (VLB) to the negative log-likelihood $L=E_{q(x_0)}[-\log p_\theta(x_0)]$, then derive and simplify to obtain the final loss

  • $\mathbf{L_t^{simple}=E_{t,x_0,\epsilon}[||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2]}$

  • During the derivation, the loss reduces to the KL divergence between the two Gaussians $q(x_{t-1}|x_t,x_0)=N(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_tI)$ and $p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$ (see the decomposition after this list); substituting the formulas for $\mu$ and $x_t$ rewrites the loss in terms of $\epsilon$, $x_0$, and $t$

  • The DDPM authors chose to predict the random variable (the noise), rather than directly predicting the posterior mean or the original data

  • The DDPM authors replace the variance $\Sigma_{\theta}(x_t,t)$ with the fixed $\beta_t$ or $\tilde{\beta}_t$, so trainable parameters appear only in the mean; this makes training more stable
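
For reference, the VLB decomposition that produces these KL terms (a standard result from the DDPM paper; derivation details omitted here as well):

    $L_{VLB}=E_q\Big[\underbrace{D_{KL}(q(x_T|x_0)\|p(x_T))}_{L_T}+\sum_{t=2}^T\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\|p_\theta(x_{t-1}|x_t))}_{L_{t-1}}\underbrace{-\log p_\theta(x_0|x_1)}_{L_0}\Big]$

$L_T$ contains no trainable parameters and $L_0$ is handled separately, so training focuses on the $L_{t-1}$ terms, which are exactly the Gaussian KL divergences above.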

Training Procedure

  1. Take original data $x_0 \sim q(x_0)$

  2. Sample $t \sim \mathrm{Uniform}(\{1,\dots,T\})$

  3. Sample noise $\epsilon \sim N(0,I)$ from the standard Gaussian

  4. Optimize the objective $||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2$ by gradient descent (see the loop sketch below)
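
A minimal training-loop sketch of steps 1-4. Assumptions: a hypothetical noise-prediction network `model(xt, t)`, a `dataloader` yielding batches of shape (batch, dim), and the schedule constants defined in the code section below.

    import torch

    def train_one_epoch(model, dataloader, optimizer, num_timesteps, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
        for x0 in dataloader:                                    # step 1: x0 ~ q(x0)
            t = torch.randint(0, num_timesteps, (x0.shape[0],))  # step 2: t ~ Uniform({1, ..., T})
            eps = torch.randn_like(x0)                           # step 3: eps ~ N(0, I)
            xt = sqrt_alphas_cumprod[t][:, None] * x0 + sqrt_one_minus_alphas_cumprod[t][:, None] * eps
            loss = (eps - model(xt, t)).square().mean()          # step 4: ||eps - eps_theta(xt, t)||^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()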

Inference Procedure

  1. At each timestep, compute $p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$ from $x_t$ and $t$

  2. Sample $x_{t-1}$ from $p_{\theta}(x_{t-1}|x_t)$ via reparameterization (written out explicitly below)

  3. Iterate until $x_0$ is obtained
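
Written out via the reparameterization, with $\Sigma_{\theta}=\sigma_t^2I$ and $\sigma_t^2=\beta_t$ or $\tilde{\beta}_t$, one reverse step is

    $x_{t-1}=\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(x_t,t)\Big)+\sigma_t z,\qquad z\sim N(0,I)\ \text{for}\ t>1,\quad z=0\ \text{for}\ t=1$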

Code Implementation

  • Define the constants needed in the formulas: the number of timesteps, $\beta_t$, $\sqrt{\overline{\alpha}_t}$, and so on

    $\mathbf{x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z}$

    $\mathbf{\mu_\theta=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_\theta(x_t,t))}$

    $\mathbf{\Sigma_{\theta}(x_t,t)=\sigma_t^2I,\quad\sigma_t^2=\beta_t\ \text{or}\ \tilde{\beta}_t}$

    In the DDPM paper, the authors set the number of timesteps $T$ to 1000 and $\beta_t$ to a linear interpolation between 0.0001 and 0.02

    import numpy as np
    import torch

    num_timesteps = 1000
    schedule_low = 1e-4
    schedule_high = 0.02
    # Linear beta schedule from 1e-4 to 0.02, as in the DDPM paper
    betas = torch.tensor(np.linspace(schedule_low, schedule_high, num_timesteps), dtype=torch.float32)

    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)                    # alpha_bar_t
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)                 # sqrt(alpha_bar_t)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alphas_cumprod)   # sqrt(1 - alpha_bar_t)
    reciprocal_sqrt_alphas = torch.sqrt(1 / alphas)                  # 1 / sqrt(alpha_t)
    betas_over_sqrt_one_minus_alphas_cumprod = betas / sqrt_one_minus_alphas_cumprod
    sqrt_betas = torch.sqrt(betas)                                   # sigma_t, with sigma_t^2 = beta_t
  • Forward diffusion process

    $\mathbf{x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z}$

    $\mathbf{||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||}$

    def forward_diffusion_process(model, x0, num_timesteps, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
        batch_size = x0.shape[0]
        # Sample a timestep uniformly for each example in the batch
        t = torch.randint(0, num_timesteps, size=(batch_size,))
        noise = torch.randn_like(x0)
        # x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) z  (x0 assumed to have shape (batch, dim))
        xt = sqrt_alphas_cumprod[t][:, None] * x0 + sqrt_one_minus_alphas_cumprod[t][:, None] * noise
        estimated_noise = model(xt, t)
        # Simplified loss: MSE between the true and the predicted noise
        loss = (noise - estimated_noise).square().mean()
        return loss
  • Reverse diffusion process

    $\mathbf{p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))}$

    $\mathbf{\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_\theta(x_t,t))}$

    $\mathbf{\Sigma_{\theta}(x_t,t)=\sigma_t^2I,\quad\sigma_t^2=\beta_t\ \text{or}\ \tilde{\beta}_t}$

    def reverse_diffusion_process(model, shape, num_timesteps, reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
        # Start from pure Gaussian noise x_T and denoise step by step
        current_x = torch.randn(shape)
        x_seq = [current_x]
        for t in reversed(range(num_timesteps)):
            current_x = sample(model, current_x, t, shape[0], reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas)
            x_seq.append(current_x)
        return x_seq

    def sample(model, xt, t, batch_size, reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
        ts = torch.full((batch_size,), t)
        estimated_noise = model(xt, ts)
        # mu_theta = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta(x_t, t))
        mean = reciprocal_sqrt_alphas[t] * (xt - betas_over_sqrt_one_minus_alphas_cumprod[t] * estimated_noise)
        if t > 0:
            z = torch.randn_like(xt)
        else:
            # No noise is added at the final step (t = 0)
            z = 0
        # x_{t-1} = mu_theta + sigma_t * z
        return mean + sqrt_betas[t] * z
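
A minimal usage sketch; the trained `model` here is hypothetical, and any network that takes $(x_t,t)$ and returns a tensor shaped like $x_t$ would work:

    # Sample 16 two-dimensional points starting from pure noise x_T
    with torch.no_grad():
        x_seq = reverse_diffusion_process(model, (16, 2), num_timesteps, reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas)
    x0_samples = x_seq[-1]  # the last element approximates samples from q(x0)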
