CS 180 Programming Project 5
Part A
Sampling Loops
In the first section, we learned how noise is added to images in a diffusion model: during training, we start with a clean image x_0 and progressively add noise to it. Part A focuses on the inference process, where we begin with random noise and denoise it into a clean image. Our goal is to start at timestep T and denoise back to the initial state x_0.
1.1 Implementing the Forward Process
In the first section, we implemented the algorithm for adding noise. At each timestep t, a different amount of noise is mixed into the clean image. We use alphas_cumprod (the cumulative product of the alphas, written $\bar{\alpha}_t$) as the noise schedule. Different diffusion models may use different noise schedules, which can significantly impact the model's performance.
The formula is as follows:
$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$
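A minimal sketch of this forward step in PyTorch (the function name and argument layout are our own; alphas_cumprod is assumed to be a 1-D tensor indexed by timestep):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I)
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
```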
Deliverables:
Test image at noise levels t = [250, 500, 750]

1.2 Classical Denoising
In this section, we try using Gaussian blur filtering to remove the noise. Here's the result:

As we can see, the result is not satisfactory, especially when the noise level is high. Simple Gaussian blur filtering has no model of the underlying image content, so it smooths away real detail along with the noise.
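A sketch of this classical baseline using torchvision (the kernel size and sigma here are illustrative, not the values used in our experiments):

```python
import torchvision.transforms.functional as TF

# Gaussian blur as a classical denoiser: a larger sigma removes more noise
# but also destroys more image detail.
denoised = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```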
1.3 One-Step Denoising
In this section, we implemented one-step denoising. Since our pretrained denoiser (a UNet) can estimate the noise in a noisy image, we can invert the forward equation and jump to an estimate of the clean image in a single step:
$$ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}}{\sqrt{\bar{\alpha}_t}} $$
We tested this on the noisy Campanile image at t = [250, 500, 750].
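A sketch of this estimate (the unet call signature is an assumption; it is taken to return the predicted noise):

```python
import torch

def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Invert the forward equation using the predicted noise to estimate x0."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t)  # predicted noise (assumed signature)
    return (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```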

1.4 Iterative Denoising
As we can see from the results, one-step denoising performs reasonably well at earlier timesteps t. However, as the image becomes noisier, the one-step denoiser can no longer produce a high-quality image.
To address this, we denoise iteratively, stepping from a noisier timestep t to a less noisy timestep t'. The update formula is (reference: project website):
$$ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + v_\sigma $$
where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current estimate of the clean image, and $v_\sigma$ is random noise added back in.
As we can see in the images, the iterative approach produces higher-quality results (more detail and less blur).
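A sketch of the loop (the strided timestep list, call signatures, and the exact form of $v_\sigma$ are assumptions; here we use simple $\sqrt{\beta_t}$-scaled Gaussian noise):

```python
import torch

def iterative_denoise(unet, x, timesteps, alphas_cumprod):
    """Apply the update above along a decreasing list of timesteps."""
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):  # t > t_next
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha_t = abar_t / abar_next
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t)  # predicted noise (assumed signature)
        x0_hat = (x - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)

        noise = torch.randn_like(x) if t_next > 0 else torch.zeros_like(x)
        x = (torch.sqrt(abar_next) * beta_t / (1 - abar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1 - abar_next) / (1 - abar_t)) * x \
            + torch.sqrt(beta_t) * noise  # v_sigma (assumed form)
    return x
```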
1.5 Diffusion Model Sampling
To sample images from scratch, we start at i_start = 0 with pure random noise and run the full iterative denoising loop, as in the usage sketch below. Five results for the prompt "high-quality photo" follow.
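Assuming the iterative_denoise helper sketched above (the image shape here is a placeholder):

```python
import torch

x_T = torch.randn(1, 3, 64, 64)  # start from pure Gaussian noise
sample = iterative_denoise(unet, x_T, timesteps, alphas_cumprod)
```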

1.6 Classifier-Free Guidance (CFG)
Although we use the prompt "high-quality photo," the resulting images still do not look satisfactory. Classifier-Free Guidance (CFG) is a technique for improving image quality and making the output better aligned with the prompt.
The general idea is to run the denoiser with both a conditional prompt and an unconditional prompt, then amplify the difference between the two noise estimates. This pushes the sample more strongly in the direction implied by the prompt, producing better-looking results (at the cost of reduced image diversity).
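A sketch of the guided noise estimate (the UNet call signature and the guidance scale gamma are assumptions):

```python
def cfg_noise(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """eps = eps_uncond + gamma * (eps_cond - eps_uncond); gamma > 1 amplifies the prompt."""
    eps_cond = unet(x_t, t, cond_emb)
    eps_uncond = unet(x_t, t, uncond_emb)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```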

As we can see, the image quality is much higher than without CFG.
1.7 Image-to-image Translation
To edit an image (in other words, to perform image-to-image translation), we need to encode information from the original image and use it to condition the process. In this section, we add noise to existing images and then denoise them from their noisy versions, as in the sketch below. The lower the noise level we add (i.e., the higher the i_start), the closer the resulting images look to the originals.
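A sketch combining the helpers above (SDEdit-style; the function names are our own):

```python
def image_to_image(unet, x_orig, i_start, timesteps, alphas_cumprod):
    """Noise the original image to timesteps[i_start], then denoise from there."""
    x_noisy = forward(x_orig, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(unet, x_noisy, timesteps[i_start:], alphas_cumprod)
```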
Here's the result:

Own testing image:


1.7.1 Hand-Drawn And Web Images

Hand-drawing


1.7.2 Inpainting
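Inpainting runs the same iterative denoising loop, except that after every update we force the region outside the mask back to a correspondingly-noised copy of the original image, so only the masked region is regenerated. A sketch of that per-step correction (variable names are our own; m is 1 where new content should be generated):

```python
def inpaint_step(x, x_orig, m, t, alphas_cumprod):
    """Keep the original image outside the mask: x <- m * x + (1 - m) * forward(x_orig, t)."""
    return m * x + (1 - m) * forward(x_orig, t, alphas_cumprod)
```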
Test Image:

Own choices:


1.7.3 Text-Conditional Image-to-image Translation
Test Image:

Own choices:


1.8 Visual Anagrams
In visual anagrams, the goal is to create optical illusions by combining the noise estimates the UNet predicts for two prompts. The essential idea is that when we view the image in its normal orientation, we should see the features corresponding to Prompt 1; when we flip the image, it should reveal the features of Prompt 2.
As mentioned in the assignment note, we can use the formula
$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$
$$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) $$
$$ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$
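A sketch of this combination (the UNet call signature and the use of a vertical flip are assumptions):

```python
import torch

def anagram_noise(unet, x_t, t, p_1, p_2):
    """Average the noise estimate for p_1 with the flipped-back estimate for p_2 on the flipped image."""
    eps_1 = unet(x_t, t, p_1)
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, p_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```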
Visual anagram (testing)

Flip it back:

Visual anagram (own example)
Example 1: We use the prompts "a photo of dinosaur" and "a photo of cat" in our testing.
Normal Direction | Flip
Example 2: We use the prompts "an oil painting of a snowy mountain village" and "a man wearing a hat" in our testing.
Normal Direction | Flip
During our testing, we observed that an equal weighting of the two noise estimates does not always produce the best results. We manually tried different ratios and found that 0.73:0.27 worked best for our dinosaur example. However, even a slight change in this ratio (as little as 1%) led to significantly worse results. We believe this may be because CFG amplifies the guidance signal, making the ratio more sensitive.
In our test image we used a ratio of 0.7:0.3, while in our second example we used 0.8:0.2.
1.9 / 1.10 Hybrid Images
As in the previous assignment, we aim to build hybrid images, this time by combining the low-frequency components of the noise estimate for Prompt 1 with the high-frequency components of the noise estimate for Prompt 2.
To implement the algorithm, we first estimate the noise for both prompts. Then, we apply a low-pass filter to Noise 1 and a high-pass filter to Noise 2 and sum them. Formula from the note:
$$ \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) $$
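A sketch using a Gaussian blur as the low-pass filter (the kernel size and sigma are assumptions; the high-pass is taken as the residual of the same blur):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, p_1, p_2, kernel_size=33, sigma=2.0):
    """eps = lowpass(eps_1) + highpass(eps_2), with highpass(x) = x - lowpass(x)."""
    eps_1 = unet(x_t, t, p_1)
    eps_2 = unet(x_t, t, p_2)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```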
An image that looks like a skull from far away and a waterfall from close up:

An image that looks like a pencil from far away and a rocket from close up:

An image that looks like a cat from far away and a rocket from close up:
Part B
1.1 Implementing the UNet
In the first part, we implemented the essential components of the UNet. To start, we built blocks such as Conv, DownConv, Flatten, Unflatten, and others using the PyTorch library.
The key idea behind the Conv block is to apply convolutional kernels to the input image, with the number of kernels equal to the number of output channels. Additionally, we apply batch normalization and use GELU as the activation function.
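A sketch of such a block (the kernel size, padding, and exact layer ordering are assumptions):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv2d -> BatchNorm2d -> GELU, preserving spatial size via padding."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```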
1.2 Using the UNet to Train a Denoiser
Various Levels of Noise
We can use the formula $$ z = x + \sigma \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I) $$ to add different levels of noise; the value of $\sigma$ determines the noise level.
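In code, this is a one-liner (the sigma value here is illustrative):

```python
import torch

sigma = 0.5
z = x + sigma * torch.randn_like(x)  # z = x + sigma * eps, eps ~ N(0, I)
```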
Different Noise Levels:

Training Loss:

Left (original); middle (noisy); right (predicted)
1 epoch

5 epochs

2.1 Adding Time Condition to UNet
To train a UNet that incorporates t as a condition, we need to create a new operator, FCBlock, which is essentially a fully connected block that embeds the conditioning signal in a form the model can better use; a sketch follows.
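A sketch of FCBlock (the layer sizes and ordering are assumptions):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Linear -> GELU -> Linear; embeds a scalar or small-vector condition."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, x):
        return self.net(x)
```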
As we can see, the result is not very good, because the condition t alone is insufficient to generate high-quality output. The details of the digits are blurry, and some digits are not distinguishable.
Deliverables:
- Training loss curve
- Sampling results after 5 epochs (above) and 20 epochs (below)


2.4 Adding Class-Conditioning to UNet
In this section, to further improve the results, we add class conditioning to the UNet. We use one-hot encoding to represent the class and inject it as a condition in a manner similar to the previous section.
To preserve diversity (a common issue in generative models) and let the model also learn unconditional generation, we apply 10% dropout on the class condition by setting it to an all-zero vector.
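A sketch of the one-hot conditioning with dropout (batch handling and names are assumptions):

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_drop=0.1):
    """One-hot encode the labels, zeroing the condition for ~p_drop of the batch."""
    c = F.one_hot(labels, num_classes=num_classes).float()
    drop = torch.rand(c.shape[0], 1) < p_drop  # per-sample dropout mask
    return torch.where(drop, torch.zeros_like(c), c)
```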
Deliverables:
- Training loss curve
- Sampling results after 5 epochs (above) and 20 epochs (below)