Exploring the Underlying Principles of Image Generation in GANs, VAEs, ViTs, and Stable Diffusion Models
In artificial intelligence, generative models have emerged as a powerful tool for creating realistic images, video, audio, and text.
This article examines four prominent generative AI model architectures: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Vision Transformers (ViTs), and Stable Diffusion (SD).
It explains how each model works, their strengths and weaknesses, and their impact on the broader AI landscape.
Most common AI Model Architectures for Generative AI
Stable Diffusion is probably the most prominent AI architecture for creating images, but it is far from the only one. The most common architectures are:
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
- Vision Transformers (ViTs)
- Stable Diffusion (SD)
All generative AI models are inherently probabilistic, which means results can vary considerably from one inference to the next; this variability is a source of creativity.
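As a toy illustration of this probabilistic behavior, the snippet below draws different random latent noise for different seeds; it assumes PyTorch and a hypothetical trained `generator` model, which is why the final call is left as a comment.

```python
# Toy illustration: the same trained generator produces different images
# for different random seeds, because the latent input is random noise.
import torch

for seed in (0, 1, 2):
    torch.manual_seed(seed)        # a different seed gives different noise
    noise = torch.randn(1, 64)     # latent input vector
    # image = generator(noise)     # hypothetical model: each seed yields a different image
```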
How GANs work
Generative Adversarial Networks (GANs) are a class of deep learning models used for generating realistic images, among other tasks. GANs consist of two neural networks, the generator and the discriminator, which are trained together in a process that can be summarized in the following steps (a minimal code sketch follows the list):
- Generator network: The generator’s primary goal is to create synthetic images that resemble those in the training dataset. It takes random noise as input and produces an image as output. The generator can be considered a counterfeit artist trying to create convincing forgeries.
- Discriminator network: The discriminator’s task is to distinguish between real images from the training dataset and fake images created by the generator. The discriminator can be considered an art expert trying to identify forgeries.
- Training process: The generator and discriminator are trained simultaneously in a two-player minimax game:
- The generator creates a batch of fake images using random noise as input.
- A batch of real images is sampled from the training dataset.
- The discriminator is trained to correctly classify the real images as real and fake images as fake. This is typically done using binary cross-entropy loss.
- The generator’s parameters are updated to maximize the discriminator’s error in classifying the generated images as fake. In other words, the generator is trained to “fool” the discriminator.
- Iterative process: The training steps above are repeated for a predefined number of iterations or until a convergence criterion is met. As training progresses, the generator improves at producing realistic images, while the discriminator becomes better at identifying fake images. Ideally, the generator eventually produces images the discriminator cannot distinguish from real ones.
- Generating new images: Once training is complete, the generator can create new, realistic images by taking random noise as input.
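Below is a minimal sketch of this adversarial training loop in PyTorch. The network sizes, optimizer settings, and flattened 28x28 image shape are illustrative assumptions, not taken from any particular GAN paper.

```python
# Minimal GAN training loop sketch (PyTorch assumed; sizes are illustrative).
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28

# Generator: maps random noise to a flattened fake image.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
# Discriminator: maps a flattened image to a real/fake probability.
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                  # real_images: (batch, img_dim)
    batch = real_images.size(0)
    noise = torch.randn(batch, latent_dim)
    fake_images = G(noise)

    # Train the discriminator: real images -> 1, generated images -> 0.
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Train the generator to fool the discriminator: generated images -> 1.
    g_loss = bce(D(fake_images), torch.ones(batch, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```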
GANs have been highly successful in generating realistic images across various domains. However, training GANs can be challenging due to mode collapse, vanishing gradients, and instability during training.
Researchers have proposed numerous GAN variants and training techniques to address these challenges and improve the performance of GANs for image creation. GigaGAN shows that the GAN approach still has a lot of potential.
How Stable Diffusion works
Stable Diffusion became a prevalent AI model for creating images in 2022. Diffusion model training comprises two parts: the forward diffusion process, which adds noise to an image, and the reverse diffusion process, which removes noise from the image. While a Stable Diffusion model is trained, it learns to remove noise, and by learning to remove noise, it learns how to create images.
The diffusion model training for image generation can be divided into the forward and reverse diffusion processes. This approach is used in deep learning models like denoising score matching and diffusion probabilistic models to generate high-quality images.
- Forward diffusion process: This model part adds controlled noise to an image over several time steps. The idea is to transform the original image into pure noise gradually. The forward process can be seen as the “corrupting” step, where the model learns to understand the data distribution and the structure of the noise added to the image at each step.
- Reverse diffusion process: This part of the model aims to remove the noise from the noisy image, recovering its original or close approximation. The model is trained to learn the conditional distribution of the clean image given the noisy image at each time step. During this process, the model learns to generate images by removing noise, essentially “denoising” the images.
When training a stable diffusion model, it learns to remove noise effectively by understanding the data distribution and structure. As the model becomes better at denoising, it simultaneously learns the underlying structure and features of the images in the dataset. This allows the model to generate new, high-quality images by reversing the forward diffusion process, starting from a random noise image and gradually removing noise to reveal a coherent image.
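The following sketch illustrates the core idea with a DDPM-style training step in PyTorch. It assumes a simple linear noise schedule and a user-supplied noise-prediction network `model` (for example, a U-Net conditioned on the time step); the exact schedule, latent-space setup, and architecture used by Stable Diffusion differ.

```python
# DDPM-style diffusion training sketch (PyTorch assumed; schedule is illustrative).
import torch

T = 1000                                    # number of diffusion time steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative signal retention

def forward_diffuse(x0, t, noise):
    """Forward process: corrupt clean images x0 to noise level t."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def training_loss(model, x0):
    """The reverse process is learned by predicting the noise that was added."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = forward_diffuse(x0, t, noise)
    predicted_noise = model(x_t, t)         # assumed noise-prediction network
    return torch.mean((predicted_noise - noise) ** 2)
```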
Research into new models beyond Stable Diffusion continues; the primary objectives are inference performance and training accuracy.
How Vision Transformers work
ViT (Vision Transformer) is a model architecture designed for image recognition rather than image creation. It applies the transformer architecture, originally developed for natural language processing tasks, to computer vision problems. ViTs divide input images into fixed-size, non-overlapping patches and process these patches as sequences, similar to how transformer models process text sequences.
However, you can still use a transformer-based architecture for image creation tasks by employing “image-to-image translation.” One such example is the architecture called Pix2Pix, which can be adapted to use transformers instead of traditional convolutional neural networks (CNNs).
Here is a high-level overview of how a transformer-based model like ViT can be adapted for image creation (a minimal code sketch follows the list):
- Preprocessing: Divide the input image into fixed-size non-overlapping patches and linearly embed each patch into a flat vector. Then, create a sequence of patch vectors that the transformer will process.
- Positional encoding: Add positional encoding to the patch vectors to provide information about the spatial position of each patch within the image.
- Transformer architecture: Replace the convolutional layers in the original image-to-image translation model with a transformer architecture. This will consist of multiple layers of multi-head self-attention and feed-forward networks, along with layer normalization and residual connections.
- Training: Train the model on pairs of input-output images. For example, if the task is to convert black and white pictures to color images, the input-output pairs would consist of black and white images and their corresponding color versions. During training, the model will learn the mapping between the input and output image domains.
- Inference: Once the model is trained, it can be used for image creation tasks by providing a new input image, converting it into a sequence of patch vectors, and processing it through the transformer model. The output will be a sequence of patch vectors representing the generated image. These patch vectors can then be reconstructed into the final rendered image.
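As a rough illustration of these steps, the sketch below embeds image patches, adds a learned positional encoding, and runs them through a transformer encoder to produce output patch vectors. It assumes PyTorch, 32x32 RGB inputs, and 4x4 patches; real ViT or Pix2Pix-style models are considerably larger and trained differently.

```python
# Patch-based transformer sketch for image-to-image translation (assumptions above).
import torch
import torch.nn as nn

patch, dim, img = 4, 128, 32
num_patches = (img // patch) ** 2            # 64 patches per image
patch_pixels = 3 * patch * patch             # flattened RGB patch size

class PatchTranslator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(patch_pixels, dim)                  # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # positional encoding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(dim, patch_pixels)              # back to pixel values

    def forward(self, x):                    # x: (batch, 3, 32, 32)
        # Split into non-overlapping patches and flatten each one.
        p = x.unfold(2, patch, patch).unfold(3, patch, patch)
        p = p.permute(0, 2, 3, 1, 4, 5).reshape(x.size(0), num_patches, -1)
        tokens = self.embed(p) + self.pos    # embed patches, add positions
        out = self.encoder(tokens)           # self-attention over the patch sequence
        return self.to_pixels(out)           # patch vectors of the generated image
```

The returned patch vectors would then be folded back into an image grid, as described in the inference step above.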
While ViTs are not primarily designed for image creation, adapting them can produce exciting results. However, it’s worth noting that other architectures, like GANs or Variational Autoencoders (VAEs), are more commonly used for image-generation tasks.
How Variational Autoencoders (VAEs) work
Variational Autoencoders (VAEs) are generative models used for various tasks, including image generation, unsupervised learning, and representation learning. VAEs are based on the principles of Bayesian inference and are designed to learn the underlying data distribution by combining deep learning techniques with probabilistic modeling.
A VAE consists of two main components: the encoder and the decoder. Here is a high-level overview of how VAEs work:
- Encoder: The encoder network takes input data (e.g., an image) and compresses it into a lower-dimensional latent space representation. The encoder learns to map the input data to a set of parameters (mean and standard deviation) of a probabilistic distribution, often a Gaussian distribution, in the latent space.
- Latent space: The latent space is a lower-dimensional representation that captures the essential characteristics of the input data. It is designed to model the underlying structure and variations in the data.
- Sampling: A random sample is drawn from the distribution defined by the mean and standard deviation parameters produced by the encoder. This sampling step introduces a stochastic element, encouraging the model to learn a smooth and continuous latent space.
- Decoder: The decoder network takes the sampled latent vector and reconstructs the input data (e.g., an image). The decoder learns to map the latent space back to the original data space, effectively “decoding” the compressed representation.
- Training: VAEs are trained using a combination of two loss functions:
- Reconstruction loss: This measures the difference between the input data and the reconstructed data produced by the decoder. The goal is to minimize this loss to ensure accurate reconstructions.
- KL-divergence loss: This measures the difference between the learned distribution in the latent space and a predefined prior distribution (usually a standard Gaussian distribution). The goal is to minimize this loss so that the latent space distribution stays close to the prior, enabling smooth interpolation and generalization in the latent space.
Combining these loss functions results in a VAE that can generate realistic data samples by learning a smooth and continuous latent space representation of the input data.
Once the VAE is trained, it can be used for various tasks, including image generation, by sampling latent vectors from the prior distribution and passing them through the decoder to obtain new samples in the original data space. VAEs can also be used for data denoising, dimensionality reduction, and representation learning.
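The sketch below shows a minimal VAE in PyTorch with the reparameterization trick and the combined reconstruction and KL-divergence loss. The layer sizes and flattened 28x28 image shape are illustrative assumptions.

```python
# Minimal VAE sketch (PyTorch assumed; sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

img_dim, latent_dim = 28 * 28, 16

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(img_dim, 256)
        self.mu = nn.Linear(256, latent_dim)       # mean of the latent distribution
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Sigmoid())

    def forward(self, x):                          # x: (batch, img_dim), values in [0, 1]
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")  # reconstruction loss
    # Closed-form KL divergence between the learned Gaussian and the standard Gaussian prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, new images come from sampling the prior:
# z = torch.randn(1, latent_dim); image = model.dec(z)
```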
Conclusions
The generative AI model landscape is vast and diverse, with each architecture offering distinct strengths and weaknesses.
GANs have successfully produced high-quality images but are susceptible to training instability. VAEs offer a smooth latent space and probabilistic modeling, but their images tend to be less sharp than those produced by GANs.
Although primarily intended for image recognition, Vision Transformers can be adapted for image generation with promising results.
Finally, through forward and reverse diffusion processes, Stable Diffusion models denoise and generate high-quality images.
Understanding these generative model architectures is critical for maximizing their potential in various AI applications.
Sources:
- https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
- https://arxiv.org/abs/2202.06709
- https://www.vegaitglobal.com/media-center/knowledge-base/what-is-stable-diffusion-and-how-does-it-work
- https://towardsdatascience.com/demystifying-gans-cc1ac011355
- https://mingukkang.github.io/GigaGAN/