How to Develop Generative AI Models?
There are multiple types of generative models, and combining the positive attributes of each results in the ability to create even more powerful models.
Below is a breakdown:
- Diffusion models: Also known as denoising diffusion probabilistic models (DDPMs), diffusion models are generative models that determine vectors in latent space through a two-step process during training. The two steps are forward diffusion and reverse diffusion. The forward diffusion process slowly adds random noise to training data, while the reverse process removes that noise step by step to reconstruct the data samples. Novel data can be generated by running the reverse denoising process starting from entirely random noise.
Figure 2: The diffusion and denoising process.
A diffusion model can take longer to train than a variational autoencoder (VAE), but thanks to this two-step process, hundreds of layers, and in principle an unlimited number, can be trained. As a result, diffusion models generally offer the highest-quality output among generative AI models.
Diffusion models are also categorized as foundation models, because they are large-scale, offer high-quality outputs, are flexible, and are considered best for generalized use cases. However, because the reverse sampling process runs over many denoising steps, generating output with these models is slow.
Learn more about the mathematics of diffusion models in this blog post.
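The forward and reverse steps described above can be sketched in a few lines. This is a minimal illustration, not a full DDPM: the linear noise schedule, its default values, and the function names (`make_alpha_bars`, `forward_diffuse`, `recover_x0`) are assumptions made for this example. In a real model a trained neural network predicts the noise; here the true noise stands in for that prediction.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_diffuse(x0, t, alpha_bars, rng):
    """Forward diffusion: jump from clean data x0 to the noisy sample at step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, t, alpha_bars, predicted_noise):
    """Reverse direction: estimate the clean data from xt and a noise estimate."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -1.0, 2.0]                      # a toy "training sample"
xt, noise = forward_diffuse(x0, 500, alpha_bars, rng)
x0_estimate = recover_x0(xt, 500, alpha_bars, noise)  # perfect noise estimate
```

With a perfect noise estimate the clean sample is recovered exactly; training a diffusion model amounts to learning a network whose noise estimate is good enough for this to work when starting from pure noise.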
- Variational autoencoders (VAEs): VAEs consist of two neural networks typically referred to as the encoder and decoder.
When given an input, an encoder converts it into a smaller, more dense representation of the data. This compressed representation preserves the information that’s needed for a decoder to reconstruct the original input data, while discarding any irrelevant information. The encoder and decoder work together to learn an efficient and simple latent data representation. This allows the user to easily sample new latent representations that can be mapped through the decoder to generate novel data.
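The encode, sample, and decode flow described above might look like the following sketch. The linear encoder and decoder, the helper names, and the choice to have the encoder output a mean and log-variance per latent dimension are assumptions made for this illustration; real VAEs use trained, nonlinear networks.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(x, w_mu, w_logvar):
    """Map an input to the mean and log-variance of a smaller latent vector."""
    return matvec(w_mu, x), matvec(w_logvar, x)

def sample_latent(mu, logvar, rng):
    """Draw a latent vector: mean plus standard deviation times Gaussian noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def decode(z, w_dec):
    """Map a latent vector back to the data space."""
    return matvec(w_dec, z)

def generate(w_dec, latent_dim, rng):
    """Generate novel data by decoding a latent vector drawn from a standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return decode(z, w_dec)

rng = random.Random(0)
# Hypothetical untrained weights: 4-dimensional data, 2-dimensional latent space.
w_mu = [[0.5, 0.0, 0.0, 0.5], [0.0, 1.0, -1.0, 0.0]]
w_logvar = [[0.0] * 4, [0.0] * 4]
w_dec = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

mu, logvar = encode([1.0, 2.0, 3.0, 4.0], w_mu, w_logvar)
reconstruction = decode(sample_latent(mu, logvar, rng), w_dec)
new_sample = generate(w_dec, latent_dim=2, rng=rng)
```

Because generation needs only one pass through the decoder, sampling is fast compared with the many denoising steps of a diffusion model.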
While VAEs can generate outputs such as images faster, the images they generate are not as detailed as those of diffusion models.
- Generative adversarial networks (GANs): Introduced in 2014, GANs were considered to be the most commonly used methodology of the three before the recent success of diffusion models. GANs pit two neural networks against each other: a generator that generates new examples and a discriminator that learns to classify content as either real (from the domain) or fake (generated).
The two models are trained together and get smarter as the generator produces better content and the discriminator gets better at spotting the generated content. This procedure repeats, pushing both to continually improve after every iteration until the generated content is indistinguishable from the existing content.
While GANs can provide high-quality samples and generate outputs quickly, their sample diversity is weak, which makes GANs better suited for domain-specific data generation.
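The adversarial objective behind this training loop can be sketched with two loss functions. This is an illustrative assumption rather than a full training setup: the inputs are the discriminator's probabilities that each sample is real, and the generator loss shown is the commonly used non-saturating variant. The function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, generated ones near 0."""
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator scores fakes as real."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)

# Early in training: the discriminator easily spots generated content.
early_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.05, 0.1])
early_g = generator_loss(d_fake=[0.05, 0.1])

# Late in training: generated content is indistinguishable, so scores hover at 0.5.
late_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
late_g = generator_loss(d_fake=[0.5, 0.5])
```

Each training iteration lowers one loss at the expense of the other, which is the push and pull described above.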
Another factor in the development of generative models is the architecture underneath. One of the most popular is the transformer network. It is important to understand how it works in the context of generative AI.
Transformer networks: Like recurrent neural networks, transformers are designed to handle sequential input data. Unlike recurrent networks, however, they process that data non-sequentially.
Two mechanisms make transformers particularly adept at text-based generative AI applications: self-attention and positional encodings. Together, these mechanisms help represent the order of the input and allow the algorithm to focus on how words relate to each other over long distances.
Figure 3: Image from a presentation by Aidan Gomez, one of eight co-authors of the 2017 paper that defined transformers (source).
A self-attention layer assigns a weight to each part of an input. The weight signifies the importance of that part in the context of the rest of the input. Positional encoding is a representation of the order in which input words occur.
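Both mechanisms can be illustrated with a short sketch. The positional encoding below is the sinusoidal form from the 2017 transformer paper. The attention function is a simplification assumed for this example: it uses the inputs directly as queries, keys, and values, whereas a real layer first applies learned projections.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: each position gets a distinct pattern of sines and cosines."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product attention with x used as queries, keys, and values."""
    d = len(x[0])
    weights = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        weights.append(softmax(scores))
    outputs = [[sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
               for row in weights]
    return outputs, weights

# Three toy token embeddings of width 4, with position information added.
embeddings = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(seq_len=3, d_model=4)
x = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
outputs, weights = self_attention(x)
```

Each row of `weights` holds one token's attention over every token in the sequence, so the row sums to 1 and a larger entry marks a more relevant token.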
A transformer is made up of multiple transformer blocks, also known as layers. For example, a transformer has self-attention layers, feed-forward layers, and normalization layers, all working together to interpret and predict streams of tokenized data, which could include text, protein sequences, or even patches of images.
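How these layers fit together in one block can be sketched as follows. This is a simplified, single-head illustration under stated assumptions: biases and the learned scale and shift of layer normalization are omitted, normalization is applied after each residual connection, and the weights are random placeholders rather than trained values. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum, used for residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and (roughly) unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with projection matrices."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k-transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    """Position-wise two-layer network with a ReLU in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def transformer_block(x, p):
    """Attention, then feed-forward, each followed by residual add and normalization."""
    x = layer_norm(add(x, self_attention(x, p["w_q"], p["w_k"], p["w_v"])))
    return layer_norm(add(x, feed_forward(x, p["w1"], p["w2"])))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
params = {"w_q": random_matrix(4, 4, rng), "w_k": random_matrix(4, 4, rng),
          "w_v": random_matrix(4, 4, rng), "w1": random_matrix(4, 8, rng),
          "w2": random_matrix(8, 4, rng)}
tokens = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
# Stacking blocks: the output of one block is the input to the next.
hidden = transformer_block(transformer_block(tokens, params), params)
```

Because a block returns the same shape it receives, blocks can be stacked to any depth; a real model would give each block its own trained weights instead of reusing one set.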