Part A: The Power of Diffusion Models!
In this project, I explored the potential of diffusion models, specifically DeepFloyd, to generate and manipulate images. I started by experimenting with pre-trained models, implementing noise addition and denoising processes. I then moved on to more advanced tasks like image inpainting, iterative denoising, and even creating optical illusions.
Part 0: Setup
In this section, I used DeepFloyd's stage_1 and stage_2 models to generate images from various text prompts. By adjusting parameters such as num_inference_steps, I explored how different settings influenced the detail and quality of the output images. This helped me better understand the model's capabilities in generating images that align with the given descriptions.
I used a random seed of 180 for all experiments to ensure consistent results across the project.
The text prompts I selected for this part include: "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship". The generated images are displayed below:
The model did a decent job of generating images that matched the prompts. For each prompt, the outputs captured the overall theme and were recognizable, though some finer details varied slightly with different parameter settings. Increasing num_inference_steps generally improved the clarity and structure of the images. While the results weren't perfect, they were consistent with the expectations for each prompt.
Part 1: Sampling Loops
In this part of the project, I implemented my own "sampling loops" using the pretrained DeepFloyd denoisers to produce high-quality images. I then modified these loops to tackle different tasks, including inpainting and generating optical illusions.
1.1 Implementing the Forward Process
In this section, I implemented the forward process of a diffusion model, which progressively adds noise to a clean image. The process is defined by the equation:
\( q(x_t | x_0) = \mathcal{N}(x_t ; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)\mathbf{I}) \)
This means that given a clean image \( x_0 \), we generate a noisy image \( x_t \) at timestep \( t \) using the following equation:
\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \quad \text{where}~ \epsilon \sim \mathcal{N}(0, \mathbf{I}) \)
I used the `alphas_cumprod` variable to add noise at different timesteps and visualized the results for noise levels \( t = 250, 500, 750 \), showing progressively noisier images.
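In code, the forward step looks roughly like the following sketch (a hypothetical helper, assuming `alphas_cumprod` is a 1-D tensor indexed by timestep and `im` is a clean image tensor in the range the model expects):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to produce x_t (illustrative helper)."""
    abar_t = alphas_cumprod[t]                      # cumulative product of alphas at timestep t
    eps = torch.randn_like(im)                      # epsilon ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```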
1.2 Classical Denoising
As a classical baseline, I applied a Gaussian blur filter to the noisy images in an attempt to denoise them. As expected, blurring suppresses some of the high-frequency noise but also smears away image detail, so the results fall well short of learned denoising.
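For reference, this baseline is just a fixed low-pass filter; a minimal sketch with torchvision (the kernel size and sigma here are illustrative, and `noisy_im` stands for one of the noisy images above):

```python
import torchvision.transforms.functional as TF

# Classical "denoising": blur away high frequencies, which removes some noise
# but also removes image detail. Hyperparameters are illustrative.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```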
1.3 One-Step Denoising
In this section, I used a pretrained diffusion model, specifically the UNet found in `stage_1.unet`, to denoise images. The model, trained on a large dataset of image pairs, allows us to estimate and remove Gaussian noise from the noisy images, recovering an approximation of the original ones.
Since the model is conditioned on text prompts, I used the provided embedding for "a high quality photo" to guide the denoising process. This ensures that the output conforms to the intended quality and structure of a real image.
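A minimal sketch of the one-step estimate (assuming the diffusers-style call signature for `stage_1.unet`, that `prompt_embeds` holds the "a high quality photo" embedding, and that the first three output channels are the noise prediction):

```python
import torch

with torch.no_grad():
    # Predict the noise present in x_t, conditioned on the text embedding.
    model_out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    noise_est = model_out[:, :3]   # keep the noise channels of the prediction

# Solve the forward equation for x_0: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
abar_t = alphas_cumprod[t]
x0_est = (x_t - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()
```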
1.4 Iterative Denoising
In this part, I implemented an iterative denoising process to gradually clean up noisy images. Instead of running the diffusion model for all 1000 timesteps, we create a shorter list of `strided_timesteps` to skip steps and speed up the process. This allows us to move from the noisiest image to a cleaner one more efficiently.
At each step, we transition from timestep \( t \) to \( t' \) using the following formula:
\( x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma \)
Where:
- \( x_t \) is the image at timestep \( t \)
- \( x_{t'} \) is the noisy image at timestep \( t' \), where \( t' < t \) (less noisy)
- \( \bar\alpha_t \) is defined by `alphas_cumprod`, as explained above
- \( \alpha_t = \bar\alpha_t / \bar\alpha_{t'} \)
- \( \beta_t = 1 - \alpha_t \)
- \( x_0 \) is our current estimate of the clean image, obtained by solving the forward-process equation from section 1.1 for \( x_0 \), just like in section 1.3
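One stride of this update looks roughly like the following sketch (the \( v_\sigma \) noise term is omitted for brevity, and `x0_est` is the one-step estimate from section 1.3):

```python
def ddpm_step(x_t, x0_est, t, t_prime, alphas_cumprod):
    """Move from the noisier x_t to the less-noisy x_{t'} (illustrative; v_sigma omitted)."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp       # alpha_t = abar_t / abar_{t'}
    beta_t = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
```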
Below are the images generated at each step of the denoising process, showing the gradual reduction of noise.
When comparing the iterative denoising results with the single-step method, both produce good quality images. However, the iterative approach tends to capture finer details more effectively, resulting in a clearer and more polished image.
1.5 Generating Images from Scratch
In this part, I used the diffusion model's iterative denoising function to generate images from scratch. By setting `i_start = 0` and starting from pure noise, the model progressively denoises random noise into an image. Below are five results generated from the text prompt "a high quality photo."
1.6 Classifier-Free Guidance
In this section, I implemented the classifier-free guidance (CFG) technique to improve the quality of generated images. CFG works by combining a noise estimate conditioned on a text prompt (\( \epsilon_c \)) and an unconditional noise estimate (\( \epsilon_u \)). The final noise estimate is calculated as:
\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \)
Here, \( \gamma \) adjusts the strength of the guidance, with values greater than 1 leading to higher-quality images. I used a CFG scale of \( \gamma = 7 \) to generate five images of "a high quality photo," showing significant improvements over the previous results.
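A minimal sketch of this combination (assuming `uncond_embeds` is the null-prompt embedding and the UNet is called as in section 1.3):

```python
import torch

with torch.no_grad():
    eps_c = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]   # conditional estimate
    eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]   # unconditional estimate

gamma = 7.0                             # CFG scale; gamma > 1 strengthens the text guidance
eps = eps_u + gamma * (eps_c - eps_u)   # combined estimate used in the update step
```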
1.7 Image-to-Image Translation
In this part, I applied the diffusion model to image-to-image translation by adding noise to a test image and then denoising it with the `iterative_denoise_cfg` function. The more noise we add, the more the model "hallucinates" changes to the image, effectively editing it. I ran the process with different starting noise levels (1, 3, 5, 7, 10, 20) and observed how the model progressively restored the image to the natural image manifold. Below are the results, showing gradual edits to the original image.
Tests on my own images:
1.7.1 Editing Hand-Drawn and Web Images
In this part, I experimented with projecting non-realistic images, such as hand-drawn sketches or images from the web, onto the natural image manifold using noise and denoising steps. By applying different noise levels (1, 3, 5, 7, 10, 20), the model creatively "hallucinates" edits that make the images look more realistic. Below are the results for one web image and two hand-drawn images, progressively edited at various noise levels.
Web Image:
Hand-Drawn:
1.7.2 Inpainting
In this part, I implemented an inpainting function using the diffusion model, based on the technique from the RePaint paper. The goal is to fill in parts of an image based on a binary mask. For each timestep, we update the noisy image but "force" the pixels outside the mask to match the original image. This is done using the following equation:
\( x_t \leftarrow \textbf{m} x_t + (1 - \textbf{m}) \text{forward}(x_{orig}, t) \)
Here, \( \textbf{m} \) is the binary mask, \( x_t \) is the noisy image at timestep \( t \), and \( \text{forward}(x_{orig}, t) \) adds noise to the original image at timestep \( t \). This ensures that the content outside the mask remains unchanged, while the model generates new content inside the mask. Below are the results for inpainting the top of the Campanile and two additional images using custom masks.
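The masked update that runs inside each denoising iteration is tiny; a sketch (assuming `mask` is 1 where new content should be generated and `forward` is the hypothetical noising helper from section 1.1):

```python
# Keep the model's output inside the mask, and re-noise the original image
# to the current timestep everywhere else so it stays pinned to x_orig.
x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```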
1.7.3 Text-Conditional Image-to-Image Translation
In this part, I used text prompts to guide the image-to-image translation process. By incorporating text guidance during denoising, the model not only restores the image but also aligns it with the prompt. Below are the results for the test image and two additional images, progressively edited with noise levels (1, 3, 5, 7, 10, 20).
"a rocket ship" on Campanile
"an oil painting of people around a campfire" on Golden Gate Bridge
"a lithograph of a skull" on pumpkin
1.8 Visual Anagrams
In this part, I implemented Visual Anagrams using a diffusion model. The goal is to create an optical illusion where the image looks like prompt1 in one orientation, but reveals prompt2 when flipped upside down. To achieve this, I denoise the image twice—once using the first prompt and once using the flipped image with the second prompt. The noise estimates are averaged, and the reverse diffusion step is performed using the combined noise estimate.
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} \]
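A minimal sketch of the combined estimate for one timestep (assuming `p1_embeds`/`p2_embeds` are the two prompt embeddings and an upside-down flip over the height axis):

```python
import torch

flip = lambda im: torch.flip(im, dims=[-2])   # flip the image upside down

with torch.no_grad():
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample[:, :3]
    eps2 = flip(unet(flip(x_t), t, encoder_hidden_states=p2_embeds).sample[:, :3])

eps = (eps1 + eps2) / 2   # averaged estimate drives the reverse diffusion step
```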
Below are examples of the visual anagram illusions, where flipping the image reveals different scenes.
1.10 Hybrid Images
In this part, I implemented a hybrid image generation technique using diffusion models. The method involves creating a composite noise estimate by combining low frequencies from one noise estimate and high frequencies from another, each conditioned on different text prompts. The algorithm is as follows:
\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) \]
Here, \( f_\text{lowpass} \) is a Gaussian blur filter that keeps the low frequencies, and \( f_\text{highpass} \) keeps the complementary high frequencies (the estimate minus its blurred version); \( p_1 \) and \( p_2 \) represent two different text prompt embeddings. The final noise estimate \( \epsilon \) is then used in the diffusion process to generate a hybrid image. I used a Gaussian blur with a kernel size of 33 and sigma 2.
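A minimal sketch of building the composite estimate (reusing a Gaussian blur as the low-pass filter and taking the residual as the high-pass; the prompt embeddings and the UNet call are as in the earlier sections):

```python
import torch
import torchvision.transforms.functional as TF

def lowpass(im, kernel_size=33, sigma=2.0):
    return TF.gaussian_blur(im, kernel_size=kernel_size, sigma=sigma)

with torch.no_grad():
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample[:, :3]   # prompt for the low frequencies
    eps2 = unet(x_t, t, encoder_hidden_states=p2_embeds).sample[:, :3]   # prompt for the high frequencies

eps = lowpass(eps1) + (eps2 - lowpass(eps2))   # low frequencies from eps1, high frequencies from eps2
```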
Below are examples of hybrid images:
Part B: Diffusion Models from Scratch!
In this part, we will train our own diffusion model on MNIST.
Part 1: Training a Single-Step Denoising UNet
In this part, we will build and train a simple UNet-based denoiser. The goal is to map a noisy image to its clean version by minimizing the L2 loss between the denoised and clean images. The UNet architecture involves downsampling, upsampling, and skip connections to preserve spatial information.
We will train the model on the MNIST dataset, adding noise to the images during training. The model will be optimized using the Adam optimizer, and the training will run for 5 epochs.
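One training step under this setup looks roughly like the following sketch (the learning rate, the noise level `sigma`, and the names `unet`/`train_loader` are illustrative):

```python
import torch
import torch.nn.functional as F

sigma = 0.5                                          # training noise level (illustrative)
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)   # Adam optimizer, illustrative learning rate

for x, _ in train_loader:                            # MNIST images; labels are unused here
    z = x + sigma * torch.randn_like(x)              # add Gaussian noise to the clean image
    loss = F.mse_loss(unet(z), x)                    # L2 loss between denoised output and clean image
    opt.zero_grad()
    loss.backward()
    opt.step()
```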
Below is a visualization of the noising process using σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
Training
Training Loss Curve
Here are some results on the test set after the 1st and 5th epochs:
Out-of-Distribution Testing
Here are the results on the test set with out-of-distribution noise levels after the model is trained:
Part 2: Training a Diffusion Model
In this part, we extend the UNet to iteratively denoise images using the Denoising Diffusion Probabilistic Model (DDPM) framework. Instead of predicting the clean image directly, the UNet is trained to predict the noise added to the image. This reformulates the objective into minimizing the noise prediction error, as given by the equation:
\[ L = \mathbb{E}_{x_0, \epsilon, t}\left[\|\epsilon_\theta(x_t, t) - \epsilon\|^2\right] \]
To model the diffusion process, we introduce a variance schedule \( \beta_t \), where \( \beta_0 = 0.0001 \) and \( \beta_T = 0.02 \), and use it to iteratively add noise to clean images \( x_0 \), producing noisy images \( x_t \). This process follows:
\[ x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \]
Here, \( \alpha_t = 1 - \beta_t \) and \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \) control the noise schedule. The goal is to denoise \( x_t \) step-by-step, starting from pure noise \( x_T \sim \mathcal{N}(0, I) \), back to a clean image \( x_0 \). Training involves normalizing timesteps \( t \) to the range [0, 1] and conditioning the UNet on \( t \) to handle varying noise levels.
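One DDPM training step under these equations looks roughly like this sketch (the number of timesteps and the names `unet`/`train_loader`/`opt` are illustrative; the UNet takes the normalized timestep as a second argument):

```python
import torch
import torch.nn.functional as F

T = 300                                               # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)                 # variance schedule beta_t
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # abar_t

for x0, _ in train_loader:
    t = torch.randint(0, T, (x0.shape[0],))           # a random timestep for each image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps  # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)  # predict the added noise, conditioned on t in [0, 1]
    opt.zero_grad()
    loss.backward()
    opt.step()
```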
Adding Time Conditioning to UNet
This iterative denoising framework enables higher-quality image generation compared to single-step denoising approaches. For further reading, refer to the original paper, Denoising Diffusion Probabilistic Models (Ho et al., 2020).
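How the time signal enters the network is an implementation choice; the sketch below assumes the normalized timestep is passed through a small fully connected block and broadcast onto an intermediate feature map inside the UNet:

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the scalar timestep to a per-channel offset (illustrative)."""
    def __init__(self, out_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, out_channels))

    def forward(self, t):
        return self.net(t.view(-1, 1))   # (B, out_channels)

# Inside the UNet's forward pass (sketch): add the time embedding to a feature map,
# broadcasting it over the spatial dimensions:
#   feat = feat + self.t_block(t).view(feat.shape[0], -1, 1, 1)
```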
Training
Sampling
Adding Class-Conditioning to UNet
In this section, we enhance the UNet by introducing class-conditioning, allowing the model to generate images conditioned on digit classes (0-9). This provides better control over image generation and improves the quality of results.
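A minimal sketch of the conditioning signal (assuming a one-hot class vector that is occasionally zeroed during training so the model also retains an unconditional mode; the dropout probability is illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode digit labels, randomly zeroing some rows so the model
    also sees a "no class" signal during training (values illustrative)."""
    c = F.one_hot(labels, num_classes).float()                     # (B, 10)
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_uncond).float()
    return c * (1.0 - drop).unsqueeze(1)                           # zeroed rows act as unconditional
```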