Introduction to Image Generation

Hello everybody! Today, I’m going to provide an introduction to diffusion models, a family of models that have recently shown tremendous promise in the image generation space.

While many approaches have been explored for image generation, some of the more promising model families over time have been:

  • Variational Autoencoders (VAEs): These models encode images into a compressed representation and then decode them back to the original size, learning the distribution of the data in the process.
  • Generative Adversarial Networks (GANs): They pit two neural networks against each other: one creates images (the generator) while the other predicts if the image is real or fake (the discriminator). Over time, the discriminator gets better at distinguishing between real and fake images, while the generator improves its ability to create realistic fakes.
  • Autoregressive Models: These generate images by treating an image as a sequence of pixels. The modern approach to autoregressive models actually draws much of its inspiration from how LLMs handle text.

In this talk, the focus will be on diffusion models. Diffusion models draw inspiration from physics, specifically thermodynamics. While they were first introduced for image generation in 2015, it took a few years for the idea to gain traction. However, within the last few years, we have seen a massive increase in research and industry applications of these models.

Diffusion models underpin many state-of-the-art image generation models that you may be familiar with today.

Applications of Diffusion Models

Diffusion models show promise across several different use cases:

  • Unconditioned Diffusion Models: These models have no additional input or instruction and can be trained on images of a specific category to generate new images of that category. For example, they can create realistic faces or enhance image quality through super resolution.
  • Conditioned Diffusion Models: These enable functionalities such as text-to-image generation (generating an image from a text prompt), image editing, or customizing an image using a text prompt.

How Diffusion Models Work

Let's dive into diffusion models and discuss at a high level how they actually work. For simplicity, let’s focus on unconditioned diffusion.

This approach works by first destroying the structure of a data distribution through an iterative forward diffusion process, and then learning a reverse diffusion process that restores that structure to the data.

The goal is to train a model to denoise images, enabling it to take pure noise and create a novel image from it.

Starting with a large dataset of images, we take a single image, which we will denote as x0, and begin the forward diffusion process. We add a small amount of noise to the image, then repeat the step, progressively adding more noise over many iterations to produce a sequence of increasingly noisy images.

Each noising step is drawn from a distribution, denoted q, that depends only on the previous step. As we iterate this process, we ideally reach a state of pure noise after a sufficient number of iterations; the original implementation used T = 1000 steps.
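To make this concrete, here is a minimal sketch of the forward process in PyTorch. It assumes flattened image vectors and a linearly spaced noise schedule over T = 1000 steps; the schedule values and helper names are illustrative, not those of any particular production model.

```python
import torch

# Illustrative noise schedule: linearly spaced betas over T = 1000 steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # amount of noise added at each step
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative signal retained up to step t

def forward_diffuse(x0, t):
    """Sample a noisy image x_t given the clean image x0 and time step(s) t.

    Uses the closed-form result that composing t Gaussian noising steps is
    itself Gaussian: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    """
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1)       # reshape for broadcasting over a batch
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```

By step T the signal term has essentially vanished, which is exactly the state of pure noise described above.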

Next, we want to reverse this process—going from a noisy image xt back to a slightly less noisy image xt-1 at each step. This involves training a machine learning model that takes in the noisy image and the current time step t and predicts the noise that was added.
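The exact architecture is not the point here (in practice a U-Net is common); the toy sketch below, continuing the snippet above with hypothetical names and sizes, just shows the interface: the model takes the noisy image and the time step and outputs a noise prediction of the same shape.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Toy stand-in for the denoising network; real systems typically use a U-Net."""

    def __init__(self, image_dim: int, time_embed_dim: int = 64):
        super().__init__()
        self.time_embed = nn.Embedding(T, time_embed_dim)   # one embedding per time step
        self.net = nn.Sequential(
            nn.Linear(image_dim + time_embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, image_dim),
        )

    def forward(self, x_t, t):
        # Conditioning on t tells the model how much noise to expect at this step.
        h = torch.cat([x_t, self.time_embed(t)], dim=-1)
        return self.net(h)   # predicted noise, same shape as the flattened image
```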

Visualization of the Training Step

To visualize a training step:

  • We have our initial image x0.
  • We sample a time step t and add the corresponding amount of noise to create a noisy image.
  • A denoising model is trained to predict the noise by minimizing the difference between the predicted noise and the actual noise added to the image, as sketched in the code below.
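Putting these pieces together, a single training step might look like the following simplified sketch, which builds on the snippets above (the optimizer and batch handling are placeholders):

```python
import torch.nn.functional as F

def training_step(model, x0, optimizer):
    """One simplified training step: noise an image, predict the noise,
    and minimize the mean squared error between predicted and actual noise."""
    t = torch.randint(0, T, (x0.shape[0],))   # a random time step for each image
    x_t, noise = forward_diffuse(x0, t)       # forward process from the earlier sketch
    predicted_noise = model(x_t, t)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```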

Once the model is trained, we can start with pure noise and send it through the denoising model. By applying this process iteratively, we eventually produce a generated image. Essentially, the model learns the real distribution of the images it has seen and samples from that learned distribution to create new, novel images.
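A simplified DDPM-style sampling loop, again building on the sketches above, might look like this; the exact update rule and the amount of noise re-injected at each step vary across papers and implementations.

```python
@torch.no_grad()
def sample(model, shape):
    """Start from pure noise and iteratively denoise to produce a new image."""
    x = torch.randn(shape)                     # x_T: pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = model(x, t_batch)
        # Remove the noise the model predicts for this step ...
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * predicted_noise) / alphas[t].sqrt()
        # ... and, except at the final step, re-inject a small amount of fresh noise.
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```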

Recent Advancements

As we all know, there have been many advancements in the field in just the last few years. Many exciting new image generation technologies on Vertex AI are underpinned by diffusion models, and a lot of work has gone into generating images faster and with more precision.

We have also seen impressive results from combining the power of diffusion models with LLMs for incredibly context-aware, photorealistic image generation. A prime example of this is Imagen from Google Research, which integrates an LLM with several diffusion-based models.

This is indeed an exciting space and I am thrilled to witness how this wonderful technology is making its way into enterprise-grade products on Vertex AI.

Thank you for watching! Please feel free to check out our other videos for more topics like this one.