CS 180 Programming Project 5

Part A

Sampling Loops

During training, a diffusion model starts from a clean image x_0 and adds noise to it over many timesteps; the model learns to undo this noise. Part A focuses on the inference (sampling) process, where we begin with pure random noise and denoise it back into a clean image. In other words, our goal is to start at timestep T and walk the process back to a clean x_0.

1.1 Implementing the Forward Process

In this section, we implemented the forward process, i.e., the algorithm for adding noise. At each timestep t, a different amount of Gaussian noise is mixed with the clean image, with alphas_cumprod (the cumulative products $\bar{\alpha}_t$) serving as the noise schedule. Different diffusion models may use different noise schedules, which can significantly impact performance.

The formula is as follows:

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) $$
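A minimal sketch of this noising step, assuming `alphas_cumprod` is the 1-D schedule tensor taken from the pretrained model's scheduler (the variable names here are ours):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image `im` so that it matches timestep `t`."""
    alpha_bar = alphas_cumprod[t]                 # cumulative product at timestep t
    eps = torch.randn_like(im)                    # epsilon ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```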

Deliverables:

Test image at noise levels t = [250, 500, 750]

Image

1.2 Classical Denoising

In this section, we try using Gaussian blur filtering to remove the noise. Here's the result:

Image

As we can see, the result is unsatisfactory, especially when the noise level is high: a simple Gaussian blur smooths away high-frequency detail along with the noise and has no prior knowledge of what the underlying image should look like.
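For reference, a minimal sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma below are illustrative choices):

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Classical baseline: blur away high-frequency noise (and, unavoidably, detail)."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```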

1.3 One-Step Denoising

In this section, we implemented one-step denoising. Since the pretrained UNet denoiser can estimate the noise present at a given timestep, we can use a single estimate to recover the clean Campanile image directly from its noisy version. We tested t = 250, 500, and 750.
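Concretely, rearranging the forward-process formula from 1.1 gives the one-step estimate of the clean image from the predicted noise $\hat{\epsilon}$:

$$ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}}{\sqrt{\bar{\alpha}_t}} $$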

Image

1.4 Iterative Denoising

As we can see from the results, one-step denoising performs reasonably well at low noise levels (small t). However, as the image becomes noisier, the one-step denoiser can no longer produce a high-quality image.

To address this, we denoise the image iteratively, stepping from a noisier timestep t to a less noisy timestep t'. The update can be expressed as follows (reference: project website):

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma $$
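A minimal sketch of one such update, assuming `alphas_cumprod` is the same schedule tensor as in 1.1 and `x0_est` is the one-step clean-image estimate from 1.3; the noise term $v_\sigma$ is omitted for brevity:

```python
def iterative_denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod):
    """One update from timestep t to the less-noisy timestep t' < t."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp               # per-step alpha
    beta_t = 1 - alpha_t

    coef_x0 = alpha_bar_tp.sqrt() * beta_t / (1 - alpha_bar_t)
    coef_xt = alpha_t.sqrt() * (1 - alpha_bar_tp) / (1 - alpha_bar_t)
    return coef_x0 * x0_est + coef_xt * x_t            # + v_sigma in the full update
```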

Image

As we can see in the image, the iterative approach produces a higher-quality result (more detail and less blur).

1.5 Diffusion Model Sampling

To sample images from scratch, we start at i_start = 0 with pure random noise and the prompt "high-quality photo". The following are 5 sampled results:

Image

1.6 Classifier-Free Guidance (CFG)

Although we use the prompt "high-quality photo," the resulting images still do not look satisfactory. Classifier-Free Guidance (CFG) is a technique for improving image quality and making the output more aligned with the prompt.

The general idea is to compute one noise estimate with the conditioning prompt and another with an unconditional (empty) prompt, then amplify the difference between them. Extrapolating past the conditional estimate pushes the sample more strongly toward the prompt, leading to better-looking results (though at the cost of reduced image diversity).
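A minimal sketch of the CFG combination, where `eps_cond` and `eps_uncond` are the noise estimates from the prompted and empty-prompt UNet passes, and the guidance scale `gamma` is an illustrative value:

```python
def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    """Extrapolate from the unconditional estimate toward (and past) the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```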

Image

As we can see, the image quality is much higher than without CFG.

1.7 Image-to-image Translation

To edit an image (in other words, perform image-to-image translation), we need to carry over information from the original image as a starting condition. In this section, we add noise to an existing image and then denoise it from that noisy version. As we can see, the lower the added noise level (i.e., the higher i_start), the closer the result stays to the original image.
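A minimal sketch of the procedure, where `forward` is the noising function from 1.1 and `iterative_denoise_cfg` is a stand-in name for the CFG denoising loop from 1.6:

```python
def image_to_image(im, i_start, strided_timesteps, alphas_cumprod):
    """Noise the original image to timestep strided_timesteps[i_start], then
    denoise it from there; larger i_start -> less noise -> closer to the original."""
    t = strided_timesteps[i_start]
    x_t = forward(im, t, alphas_cumprod)          # noising step from section 1.1
    return iterative_denoise_cfg(x_t, i_start)    # placeholder for the loop from 1.6
```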

Here's the result:

Image

Own testing image:

Image

Image

1.7.1 Hand-Drawn And Web Images

Image

Hand-drawn images:

Image Image

1.7.2 Inpainting

Test Image:

Image

Own choices:

Image Image

1.7.3 Text-Conditional Image-to-image Translation

Test Image: Image

Own choices: Image Image

1.8 Visual Anagrams

In visual anagrams, the goal is to create optical illusions by combining noise estimates predicted by the UNet. The idea is that when we view the image in its normal orientation, we should see the features corresponding to Prompt 1; when we flip the image upside down, it should instead reveal the features corresponding to Prompt 2.

As described in the assignment notes, we use the following formulas:

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$

$$ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) $$

$$ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$
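A minimal sketch of this combination, where `unet(x, t, emb)` stands in for the pretrained noise predictor and `emb1`, `emb2` are the embeddings of the two prompts:

```python
import torch

def anagram_noise(unet, x_t, t, emb1, emb2):
    """Average the upright estimate for prompt 1 with the un-flipped estimate
    computed on the vertically flipped image for prompt 2."""
    eps1 = unet(x_t, t, emb1)
    eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
    return (eps1 + eps2) / 2
```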

Visual anagram (testing)

Image Flip it back: Image

Visual anagram (own example)

Example 1: We use the prompts "a photo of dinosaur" and "a photo of cat" in our testing.

Dinosaur (normal direction): Image   Cat (flipped): Image

Example 2: We use the prompts "an oil painting of a snowy mountain village" and "a man wearing a hat" in our testing.

Snowy mountain village (normal direction): Image   Man wearing a hat (flipped): Image

During our testing, we observed that a plain 50/50 average of the two noise estimates does not always produce the best results. We manually tried different weightings and found that a ratio of 0.73:0.27 worked best for our dinosaur example. However, even a slight change in this ratio (as little as 1%) led to noticeably worse results. We believe this is because CFG amplifies the guidance signal, making the weighting more sensitive.

In our test image, we used a ratio of 0.7:0.3, while in our second example, we used 0.8:0.2.

1.9 / 1.10 Hybrid Images

As in the previous assignment, we aim to build hybrid images, this time by combining the low-frequency components of the noise estimate for Prompt 1 with the high-frequency components of the noise estimate for Prompt 2, so that Prompt 1 dominates from far away and Prompt 2 up close.

To implement the algorithm, we first estimate the noise for both prompts. Then we apply a low-pass filter to Noise 1 and a high-pass filter to Noise 2, as in the formula from the notes:

$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$
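A minimal sketch of this combination, using a Gaussian blur as the low-pass filter (and original minus low-pass as the high-pass); the kernel size and sigma are illustrative:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(eps1, eps2, kernel_size=33, sigma=2.0):
    """Low-pass of noise 1 (prompt seen from far) + high-pass of noise 2 (seen up close)."""
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```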

An image that looks like a skull from far away and a waterfall from up close:

Image

An image that looks like a pencil from far away and a rocket from up close:

Image

An image that looks like a cat from far away and a rocket from up close:

Cat Image

Part B

1.1 Implementing the UNet

In the first part, we implemented the essential building blocks of the UNet, such as Conv, DownConv, Flatten, and Unflatten, using the PyTorch library.

The key idea behind the Conv block is to apply a convolution to the input image, with the number of kernels equal to the number of output channels, followed by batch normalization and a GELU activation.
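A minimal sketch of such a block, assuming a 3x3 kernel with stride 1 and padding 1 so the spatial size is preserved:

```python
import torch.nn as nn

class Conv(nn.Module):
    """Basic Conv block: 3x3 convolution, batch norm, then GELU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```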

1.2 Using the UNet to Train a Denoiser

Varying Noise Levels

We add noise using the formula $$z = x + \sigma \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I),$$ and the value of $\sigma$ determines the noise level.
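A minimal sketch of the noising function used to build (clean, noisy) training pairs:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I); sigma sets the noise level."""
    return x + sigma * torch.randn_like(x)
```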

Different noise levels:

Image

Training Loss:

Image

Left (original); middle (noisy); right (predicted)

After 1 epoch: Image

After 5 epochs: Image

2.1 Adding Time Condition to UNet

To train a UNet that incorporates t as a condition, we need a new module, FCBlock, a small fully connected block that embeds the scalar timestep t into a feature vector that can be injected into the UNet.
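A minimal sketch of FCBlock (two linear layers with a GELU in between; the exact dimensions depend on where the embedding is injected):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Fully connected block used to embed the (normalized) timestep t."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)
```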

As we can see from the samples below, the result is not very good, because the condition t alone is insufficient to generate high-quality output: the digits are blurry, and some numbers are not distinguishable.

Deliverable:

  1. Training loss curve: Image

  2. Sampling results after 5 epochs (above) and 20 epochs (below)

Image

Image

2.4 Adding Class-Conditioning to UNet

In this section, to further improve the results, we add class conditioning to the UNet. We use one-hot encoding to represent the class information and inject it as a condition in the same way as the time condition in the previous section.

We also apply 10% class-conditioning dropout by setting the class condition to an all-zero vector, so the model learns to denoise without class information as well; this helps with diversity, a common issue in generative models.
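A minimal sketch of how the class condition can be built, with one-hot encoding and 10% dropout of the condition (the function name and batch handling are illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the labels, then zero out the condition for ~10% of the batch
    so the model also learns to denoise without class information."""
    c = F.one_hot(labels, num_classes).float()                               # (B, 10)
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep                                       # dropped rows become all zeros
```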

Deliverable:

  1. Training loss curve: Image

  2. Sampling results after 5 epochs (above) and 20 epochs (below)

Image

Image