Part A: The Power of Diffusion Models!
In this project, I explored the potential of diffusion models, specifically DeepFloyd, to generate and manipulate images. I started by experimenting with pre-trained models, implementing noise addition and denoising processes. I then moved on to more advanced tasks like image inpainting, iterative denoising, and even creating optical illusions.
Part 0: Setup
In this section, I used DeepFloyd's stage_1 and stage_2 models to generate images from various text prompts. By adjusting parameters such as num_inference_steps, I explored how different settings influenced the detail and quality of the output images. This helped me better understand the model's capabilities in generating images that align with the given descriptions.
I used a random seed of 180 for all experiments to ensure consistent results across the project.
The text prompts I selected for this part include: "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship". The generated images are displayed below:
The model did a decent job of generating images that matched the prompts. For each prompt, the outputs captured the overall theme and were recognizable, though some finer details varied slightly with different parameter settings. Increasing num_inference_steps generally improved the clarity and structure of the images. While the results weren't perfect, they were consistent with the expectations for each prompt.
Part 1: Sampling Loops
In this part of the project, I implemented my own "sampling loops" using the pretrained DeepFloyd denoisers to produce high-quality images. I then modified these loops to tackle different tasks, including inpainting and generating optical illusions.
1.1 Implementing the Forward Process
In this section, I implemented the forward process of a diffusion model, which progressively adds noise to a clean image. The process is defined by the equation:
\( q(x_t | x_0) = \mathcal{N}(x_t ; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)\mathbf{I}) \)
This means that given a clean image \( x_0 \), we generate a noisy image \( x_t \) at timestep \( t \) using the following equation:
\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \quad \text{where}~ \epsilon \sim \mathcal{N}(0, \mathbf{I}) \)
I used the `alphas_cumprod` variable to add noise at different timesteps and visualized the results for noise levels \( t = 250, 500, 750 \), showing progressively noisier images.
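In code, the forward step looks roughly like the following sketch (a hypothetical helper, assuming `alphas_cumprod` is a 1-D tensor indexed by timestep and `im` is a clean image tensor in the range the model expects):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to produce x_t (illustrative helper)."""
    abar_t = alphas_cumprod[t]                      # cumulative product of alphas at timestep t
    eps = torch.randn_like(im)                      # epsilon ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```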
1.2 Classical Denoising
As a classical baseline, I applied a Gaussian blur filter to the noisy images in an attempt to denoise them. As expected, blurring suppresses some of the high-frequency noise but also smears away image detail, so the results fall well short of learned denoising.
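For reference, this baseline is just a fixed low-pass filter; a minimal sketch with torchvision (the kernel size and sigma here are illustrative, and `noisy_im` stands for one of the noisy images above):

```python
import torchvision.transforms.functional as TF

# Classical "denoising": blur away high frequencies, which removes some noise
# but also removes image detail. Hyperparameters are illustrative.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```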
1.3 One-Step Denoising
In this section, I used a pretrained diffusion model, specifically the UNet found in `stage_1.unet`, to denoise images. The model, trained on a large dataset of image pairs, allows us to estimate and remove Gaussian noise from the noisy images, recovering an approximation of the original ones.
Since the model is conditioned on text prompts, I used the provided embedding for "a high quality photo" to guide the denoising process. This ensures that the output conforms to the intended quality and structure of a real image.
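A minimal sketch of the one-step estimate (assuming the diffusers-style call signature for `stage_1.unet`, that `prompt_embeds` holds the "a high quality photo" embedding, and that the first three output channels are the noise prediction):

```python
import torch

with torch.no_grad():
    # Predict the noise present in x_t, conditioned on the text embedding.
    model_out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    noise_est = model_out[:, :3]   # keep the noise channels of the prediction

# Solve the forward equation for x_0: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
abar_t = alphas_cumprod[t]
x0_est = (x_t - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()
```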
1.4 Iterative Denoising
In this part, I implemented an iterative denoising process to gradually clean up noisy images. Instead of running the diffusion model for all 1000 timesteps, we create a shorter list of `strided_timesteps` to skip steps and speed up the process. This allows us to move from the noisiest image to a cleaner one more efficiently.
At each step, we transition from timestep \( t \) to \( t' \) using the following formula:
\( x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \)
Where:
- \( x_t \) is the image at timestep \( t \)
- \( x_{t'} \) is the noisy image at timestep \( t' \), where \( t' < t \) (less noisy)
- \( \bar\alpha_t \) is defined by `alphas_cumprod`, as explained above
- \( \alpha_t = \bar\alpha_t / \bar\alpha_{t'} \)
- \( \beta_t = 1 - \alpha_t \)
- \( x_0 \) is our current estimate of the clean image, obtained by solving the forward-process equation from section 1.1 for \( x_0 \), just like in section 1.3
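One stride of this update looks roughly like the following sketch (the \( v_\sigma \) noise term is omitted for brevity, and `x0_est` is the one-step estimate from section 1.3):

```python
def ddpm_step(x_t, x0_est, t, t_prime, alphas_cumprod):
    """Move from the noisier x_t to the less-noisy x_{t'} (illustrative; v_sigma omitted)."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp       # alpha_t = abar_t / abar_{t'}
    beta_t = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
```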
Below are the images generated at each step of the denoising process, showing the gradual reduction of noise.
When comparing the iterative denoising results with the single-step method, both produce good quality images. However, the iterative approach tends to capture finer details more effectively, resulting in a clearer and more polished image.
1.5 Generating Images from Scratch
In this part, I used the diffusion model's iterative denoising function to generate images from scratch. By setting `i_start = 0` and starting from pure noise, the model progressively denoises random noise into an image. Below are five results generated from the text prompt "a high quality photo."
1.6 Classifier-Free Guidance
In this section, I implemented the classifier-free guidance (CFG) technique to improve the quality of generated images. CFG works by combining a noise estimate conditioned on a text prompt (\( \epsilon_c \)) and an unconditional noise estimate (\( \epsilon_u \)). The final noise estimate is calculated as:
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \)
Here, \( \gamma \) adjusts the strength of the guidance, with values greater than 1 leading to higher-quality images. I used a CFG scale of \( \gamma = 7 \) to generate five images of "a high quality photo," showing significant improvements over the previous results.
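A minimal sketch of this combination (assuming `uncond_embeds` is the null-prompt embedding and the UNet is called as in section 1.3):

```python
import torch

with torch.no_grad():
    eps_c = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]   # conditional estimate
    eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]   # unconditional estimate

gamma = 7.0                             # CFG scale; gamma > 1 strengthens the text guidance
eps = eps_u + gamma * (eps_c - eps_u)   # combined estimate used in the update step
```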
1.7 Image-to-Image Translation
In this part, I applied the diffusion model to image-to-image translation by adding noise to a test image and then denoising it with the `iterative_denoise_cfg` function. The more noise we add, the more the model "hallucinates" changes to the image, effectively editing it. I ran the process with different starting noise levels (1, 3, 5, 7, 10, 20) and observed how the model progressively restored the image to the natural image manifold. Below are the results, showing gradual edits to the original image.
Tests on my own images:
1.7.1 Editing Hand-Drawn and Web Images
In this part, I experimented with projecting non-realistic images, such as hand-drawn sketches or images from the web, onto the natural image manifold using noise and denoising steps. By applying different noise levels (1, 3, 5, 7, 10, 20), the model creatively "hallucinates" edits that make the images look more realistic. Below are the results for one web image and two hand-drawn images, progressively edited at various noise levels.
Web Image:
Hand-Drawn:
1.7.2 Inpainting
In this part, I implemented an inpainting function using the diffusion model, based on the technique from the RePaint paper. The goal is to fill in parts of an image based on a binary mask. For each timestep, we update the noisy image but "force" the pixels outside the mask to match the original image. This is done using the following equation:
\( x_t \leftarrow \textbf{m} x_t + (1 - \textbf{m}) \text{forward}(x_{orig}, t) \)
Here, \( \textbf{m} \) is the binary mask, \( x_t \) is the noisy image at timestep \( t \), and \( \text{forward}(x_{orig}, t) \) adds noise to the original image at timestep \( t \). This ensures that the content outside the mask remains unchanged, while the model generates new content inside the mask. Below are the results for inpainting the top of the Campanile and two additional images using custom masks.
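The masked update that runs inside each denoising iteration is tiny; a sketch (assuming `mask` is 1 where new content should be generated and `forward` is the hypothetical noising helper from section 1.1):

```python
# Keep the model's output inside the mask, and re-noise the original image
# to the current timestep everywhere else so it stays pinned to x_orig.
x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```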
1.7.3 Text-Conditional Image-to-Image Translation
In this part, I used text prompts to guide the image-to-image translation process. By incorporating text guidance during denoising, the model not only restores the image but also aligns it with the prompt. Below are the results for the test image and two additional images, progressively edited with noise levels (1, 3, 5, 7, 10, 20).
"a rocket ship" on Campanile
"an oil painting of people around a campfire" on Golden Gate Bridge
"a lithograph of a skull" on pumpkin
1.8 Visual Anagrams
In this part, I implemented Visual Anagrams using a diffusion model. The goal is to create an optical illusion where the image looks like prompt1 in one orientation, but reveals prompt2 when flipped upside down. To achieve this, I denoise the image twice—once using the first prompt and once using the flipped image with the second prompt. The noise estimates are averaged, and the reverse diffusion step is performed using the combined noise estimate.
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \]
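A minimal sketch of the combined estimate for one timestep (assuming `p1_embeds`/`p2_embeds` are the two prompt embeddings and an upside-down flip over the height axis):

```python
import torch

flip = lambda im: torch.flip(im, dims=[-2])   # flip the image upside down

with torch.no_grad():
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample[:, :3]
    eps2 = flip(unet(flip(x_t), t, encoder_hidden_states=p2_embeds).sample[:, :3])

eps = (eps1 + eps2) / 2   # averaged estimate drives the reverse diffusion step
```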
Below are examples of the visual anagram illusions, where flipping the image reveals different scenes.
1.10 Hybrid Images
In this part, I implemented a hybrid image generation technique using diffusion models. The method involves creating a composite noise estimate by combining low frequencies from one noise estimate and high frequencies from another, each conditioned on different text prompts. The algorithm is as follows:
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) \]
Here, \( f_\text{lowpass} \) is a Gaussian blur filter that keeps the low frequencies, and \( f_\text{highpass} \) keeps the complementary high frequencies (the estimate minus its blurred version); \( p_1 \) and \( p_2 \) represent two different text prompt embeddings. The final noise estimate \( \epsilon \) is then used in the diffusion process to generate a hybrid image. I used a Gaussian blur with a kernel size of 33 and sigma 2.
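A minimal sketch of building the composite estimate (reusing a Gaussian blur as the low-pass filter and taking the residual as the high-pass; the prompt embeddings and the UNet call are as in the earlier sections):

```python
import torch
import torchvision.transforms.functional as TF

def lowpass(im, kernel_size=33, sigma=2.0):
    return TF.gaussian_blur(im, kernel_size=kernel_size, sigma=sigma)

with torch.no_grad():
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample[:, :3]   # prompt for the low frequencies
    eps2 = unet(x_t, t, encoder_hidden_states=p2_embeds).sample[:, :3]   # prompt for the high frequencies

eps = lowpass(eps1) + (eps2 - lowpass(eps2))   # low frequencies from eps1, high frequencies from eps2
```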
Below are examples of hybrid images:
Part B: Diffusion Models from Scratch!
In this part, we will train our own diffusion model on MNIST.
Part 1: Training a Single-Step Denoising UNet
In this part, we will build and train a simple UNet-based denoiser. The goal is to map a noisy image to its clean version by minimizing the L2 loss between the denoised and clean images. The UNet architecture involves downsampling, upsampling, and skip connections to preserve spatial information.
We will train the model on the MNIST dataset, adding noise to the images during training. The model will be optimized using the Adam optimizer, and the training will run for 5 epochs.
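One training step under this setup looks roughly like the following sketch (the learning rate, the noise level `sigma`, and the names `unet`/`train_loader` are illustrative):

```python
import torch
import torch.nn.functional as F

sigma = 0.5                                          # training noise level (illustrative)
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)   # Adam optimizer, illustrative learning rate

for x, _ in train_loader:                            # MNIST images; labels are unused here
    z = x + sigma * torch.randn_like(x)              # add Gaussian noise to the clean image
    loss = F.mse_loss(unet(z), x)                    # L2 loss between denoised output and clean image
    opt.zero_grad()
    loss.backward()
    opt.step()
```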
Below is a visualization of the noising process using σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
Training
Training Loss Curve
Here are some results on the test set after the 1st and 5th epochs:
Out-of-Distribution Testing
Here are the results on the test set with out-of-distribution noise levels after the model is trained:
Part 2: Training a Diffusion Model
In this part, we extend the UNet to iteratively denoise images using the Denoising Diffusion Probabilistic Model (DDPM) framework. Instead of predicting the clean image directly, the UNet is trained to predict the noise added to the image. This reformulates the objective into minimizing the noise prediction error, as given by the equation:
\[ L = \mathbb{E}_{x_0, \epsilon, t}\left[\|\epsilon_\theta(x_t, t) - \epsilon\|^2\right] \]
To model the diffusion process, we introduce a variance schedule \( \beta_t \), where \( \beta_0 = 0.0001 \) and \( \beta_T = 0.02 \), and use it to iteratively add noise to clean images \( x_0 \), producing noisy images \( x_t \). This process follows:
\[ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \]
Here, \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \) control the noise schedule. The goal is to denoise \( x_t \) step-by-step, starting from pure noise \( x_T \sim \mathcal{N}(0, I) \), back to a clean image \( x_0 \). Training involves normalizing timesteps \( t \) to the range [0, 1] and conditioning the UNet on \( t \) to handle varying noise levels.
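One DDPM training step under these equations looks roughly like this sketch (the number of timesteps and the names `unet`/`train_loader`/`opt` are illustrative; the UNet takes the normalized timestep as a second argument):

```python
import torch
import torch.nn.functional as F

T = 300                                               # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)                 # variance schedule beta_t
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # abar_t

for x0, _ in train_loader:
    t = torch.randint(0, T, (x0.shape[0],))           # a random timestep for each image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps  # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)  # predict the added noise, conditioned on t in [0, 1]
    opt.zero_grad()
    loss.backward()
    opt.step()
```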
Adding Time Conditioning to UNet
This iterative denoising framework enables higher-quality image generation compared to single-step denoising approaches. For further reading, refer to the original paper, Denoising Diffusion Probabilistic Models (Ho et al., 2020).
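How the time signal enters the network is an implementation choice; the sketch below assumes the normalized timestep is passed through a small fully connected block and broadcast onto an intermediate feature map inside the UNet:

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the scalar timestep to a per-channel offset (illustrative)."""
    def __init__(self, out_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, out_channels))

    def forward(self, t):
        return self.net(t.view(-1, 1))   # (B, out_channels)

# Inside the UNet's forward pass (sketch): add the time embedding to a feature map,
# broadcasting it over the spatial dimensions:
#   feat = feat + self.t_block(t).view(feat.shape[0], -1, 1, 1)
```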
Training
Sampling
Adding Class-Conditioning to UNet
In this section, we enhance the UNet by introducing class-conditioning, allowing the model to generate images conditioned on digit classes (0-9). This provides better control over image generation and improves the quality of results.
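A minimal sketch of the conditioning signal (assuming a one-hot class vector that is occasionally zeroed during training so the model also retains an unconditional mode; the dropout probability is illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode digit labels, randomly zeroing some rows so the model
    also sees a "no class" signal during training (values illustrative)."""
    c = F.one_hot(labels, num_classes).float()                     # (B, 10)
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_uncond).float()
    return c * (1.0 - drop).unsqueeze(1)                           # zeroed rows act as unconditional
```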