While hype in the AI landscape has been building steadily for years, it has reached a fever pitch thanks to Generative AI.
The advances in technology driving this phenomenon have happened in the last decade.
In this article, I will talk through the lens of three Machine Learning frameworks that have laid the foundation for the field of Generative AI.
A Generative Adversarial Network (GAN) is a deep learning model that employs a game-theoretic approach: two models are pitted against each other in a minimax two-player game. [1]
At a high level, GANs consist of two components -
Generator network - It generates new content resembling the source data.
Discriminator network - It determines whether a given example is real (drawn from the training set) or fake (produced by the Generator).
The goal of the Generator is to trick the Discriminator into making a mistake. Training keeps adjusting both networks against the loss function until the system reaches a near equilibrium, as the sketch below illustrates.
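To make the minimax game concrete, here is a minimal training-loop sketch in PyTorch. The network shapes, optimizer settings, and the toy stand-in for real training data are illustrative assumptions, not details from the original paper.

```python
import torch
import torch.nn as nn

latent_dim = 64
# Toy stand-in for real training data: random "images" flattened to 784 values.
data_loader = [torch.rand(32, 784) * 2 - 1 for _ in range(10)]

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Tanh())
discriminator = nn.Sequential(
    nn.Linear(784, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for real_batch in data_loader:
    batch = real_batch.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator step: learn to label real samples 1 and generated ones 0.
    fake_batch = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), real_labels)
              + loss_fn(discriminator(fake_batch), fake_labels))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the Discriminator label fakes as real.
    g_loss = loss_fn(discriminator(generator(torch.randn(batch, latent_dim))),
                     real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```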
GANs pioneered the Generative AI field and paved the way for the two breakthroughs that followed.
The Transformer architecture was presented initially in a 2017 paper [2].
Transformer models use an encoder-decoder architecture and have become the de facto standard for state-of-the-art deep learning on sequential data.
Along with groundbreaking results in NLP, Transformer models have also found uses in genomic sequencing, time series prediction, image processing, and computer vision.
Famous implementations include BERT and GPT-3, the model powering the ChatGPT application. Transformer models are among the largest ML models built to date; GPT-3, for example, has 175 billion parameters.
The figure shows the Transformer architecture, with a single Encoder on the left and a single Decoder on the right, each marked with “Nx”. In actual implementations, multiple encoder blocks are stacked on top of each other, as are decoder blocks; the original paper uses Nx = 6. [2]
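As a brief illustration of this stacking, PyTorch ships building blocks that compose Nx identical encoder layers. The dimensions below follow the original paper, while the input tensor is a made-up example.

```python
import torch
import torch.nn as nn

d_model, n_heads, num_layers = 512, 8, 6  # values used in the original paper [2]
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)  # Nx = 6 stacked blocks

tokens = torch.randn(1, 10, d_model)  # (batch, sequence length, embedding size)
encoded = encoder(tokens)             # same shape, now contextualized by self-attention
print(encoded.shape)                  # torch.Size([1, 10, 512])
```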
Transformers have largely replaced the earlier models that were used for NLP - RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory networks), and GRUs (Gated Recurrent Units).
Unlike RNNs or LSTMs, Transformers use neither recurrent units that feed back into the network nor the convolutions seen in CNNs (Convolutional Neural Networks). Instead, a component called positional encoding works together with the input embeddings (vector representations of each word) to establish the relative positions of the words in a sequence.
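Here is a minimal sketch of the sinusoidal positional encoding from the Transformer paper [2]; the sequence length and embedding size are arbitrary examples. Each position gets a unique vector of sines and cosines that is added to the word embedding.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: one d_model-sized vector per position."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# Added to the input embeddings, this gives the model word-order information.
pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)  # (10, 512)
```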
In recent months, the surreal imagery mixing real portraits and fantastical art has created a frenzy.
Through “prompt engineering”, these models perform text-to-image and image-to-image synthesis.
The underlying architecture is called the Latent Diffusion Model (LDM); it was first made widely available by Stability AI through its open-source offering, Stable Diffusion.
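For a sense of what this looks like in practice, here is a usage sketch with the open-source diffusers library; the checkpoint name and prompt are assumptions for illustration, not part of the article.

```python
# Text-to-image synthesis with Stable Diffusion via Hugging Face diffusers.
# The model checkpoint and prompt below are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a renaissance portrait of an astronaut, oil on canvas"  # prompt engineering
image = pipe(prompt).images[0]  # runs the full latent diffusion pipeline
image.save("portrait.png")
```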
Pure diffusion models consist of two distinct phases:
Forward Diffusion - Gaussian noise is added to an image step by step until it becomes indistinguishable from pure random noise (a minimal sketch of this phase follows the list).
Reverse Diffusion - Through “de-noising”, the data is recovered from the Gaussian noise: a model predicts the noise at each step of a Markov chain and removes it.
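The forward phase has a convenient closed form: the noisy image at step t can be sampled directly from the original image. The linear noise schedule and step count below are common choices, assumed here for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def forward_diffuse(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise

image = torch.rand(3, 64, 64)          # a toy "image" with values in [0, 1]
noisy = forward_diffuse(image, T - 1)  # at the last step, nearly pure Gaussian noise
```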
A Latent Diffusion Model combines the detail-preservation capability of a diffusion network with the perceptual power of GANs. Moreover, it does so with a very small memory footprint compared to a pure diffusion model, because it operates in a latent space rather than directly on pixels.
The real power of LDMs lies in their ability to incorporate the encoding-decoding characteristic of the Transformer model. Doing so allows them to fully capture the semantic relationships between various aspects of the content. [3]
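To show why the latent space saves memory, here is a structural sketch with toy stand-in modules; real LDMs use a variational autoencoder and a U-Net denoiser with cross-attention, so every module and constant below is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real LDMs use a VAE encoder/decoder and a U-Net denoiser.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # pixels -> compact latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latent -> pixels
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)         # predicts latent noise

pixels = torch.randn(1, 3, 512, 512)  # 512x512 RGB image: ~786k values
latents = encoder(pixels)             # 4x64x64 latent: ~16k values, ~48x smaller

for t in range(50, 0, -1):                 # reverse diffusion runs in latent space
    noise_pred = denoiser(latents)         # real denoisers also see t and a text embedding
    latents = latents - 0.02 * noise_pred  # heavily simplified de-noising update

image = decoder(latents)  # decode back to pixel space only once, at the end
print(latents.shape, image.shape)  # [1, 4, 64, 64] and [1, 3, 512, 512]
```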
The interest level in AI startups has risen steadily over the past few years.
The sudden leaps in Generative AI technology built on these AI models are poised to bring about a Cambrian explosion in the startup landscape in the coming months and years.
Footnotes:
[1] Research paper that introduced Generative Adversarial Networks (GANs) to the world:
https://arxiv.org/abs/1406.2661
[2] “Attention Is All You Need” - Paper that describes the Transformer architecture
https://arxiv.org/abs/1706.03762
[3] How does the Latent Diffusion Model work? Read this paper
https://arxiv.org/abs/2112.10752
Ajay Bam is the CEO and Co-founder of Vyrill, a first-of-its-kind video intelligence company launched in 2017 through UC Berkeley’s SkyDeck incubator program. Vyrill helps brands and shoppers find the “moments that matter” inside videos. Its AI-powered “In-Video” search technology analyzes and shares insights hidden within videos to improve personalization, SEO, and conversion. Before Vyrill, Ajay launched the Boston-based mobile shopping app company Modiv Media. He is a proven and accomplished product management professional, entrepreneurial thinker, and innovator with more than 13 years of experience leading startups and world-class brands.