AI has made significant strides in recent years, and one of the areas that has seen considerable growth is generative AI. Generative AI is a subfield of AI and DL that focuses on generating new content, such as images, text, music, and video, using algorithms and models that have been trained on existing data with ML techniques.
In order to better understand the relationship between AI, ML, DL, and generative AI, consider AI as the foundation, while ML, DL, and generative AI represent increasingly specialized and focused areas of study and application:
- AI represents the broad field of creating systems that can perform tasks requiring human-like intelligence and that can interact with their surrounding ecosystem.
- ML is a branch of AI that focuses on creating algorithms and models that enable those systems to learn and improve over time through training. ML models learn from existing data and automatically update their parameters as they are exposed to more of it.
- DL is a sub-branch of ML that encompasses deep ML models. Those deep models are called neural networks and are particularly suitable in domains such as computer vision or Natural Language Processing (NLP). When we talk about ML and DL models, we typically refer to discriminative models, whose aim is to make predictions or infer patterns from data.
- And finally, we get to generative AI, a further sub-branch of DL, which doesn’t use deep neural networks to cluster, classify, or make predictions on existing data: it uses those powerful models to generate brand-new content, from images to natural language, from music to video.
The following figure shows how these areas of research are related to each other:
Figure 1.1 – Relationship between AI, ML, DL, and generative AI
Generative AI models can be trained on vast amounts of data and can then generate new examples from scratch based on the patterns learned from that data. This generative process differs from that of discriminative models, which are trained to predict the class or label of a given example.
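To make this distinction concrete, here is a deliberately minimal, hypothetical sketch in plain Python (the data, class means, and `predict` helper are all made up for illustration). The generative view models how each class produces its data and can therefore sample brand-new examples; the discriminative view only learns a boundary between classes and can only assign labels:

```python
import random
import statistics

random.seed(0)

# Toy data: two 1-D classes drawn from different Gaussians.
x0 = [random.gauss(-2.0, 1.0) for _ in range(500)]  # class 0
x1 = [random.gauss(+2.0, 1.0) for _ in range(500)]  # class 1

# Generative view: model how each class produces its data (fit a
# Gaussian per class), then *sample* brand-new examples from it.
mu0, sd0 = statistics.mean(x0), statistics.stdev(x0)
mu1, sd1 = statistics.mean(x1), statistics.stdev(x1)
new_examples = [random.gauss(mu1, sd1) for _ in range(5)]  # fresh "class 1" data

# Discriminative view: model only the boundary between the classes
# (here, simply the midpoint of the two class means) and predict labels.
threshold = (mu0 + mu1) / 2.0

def predict(x):
    return 1 if x > threshold else 0
```

Real generative AI models play the same role as the per-class Gaussians here, just at a vastly larger scale: they learn a model of how the data is produced, which is what lets them sample content that did not exist before.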
Domains of generative AI
In recent years, generative AI has made significant advancements and has expanded its applications to a wide range of domains, such as art, music, fashion, architecture, and many more. In some of them, it is indeed transforming the way we create, design, and understand the world around us. In others, it is improving and making existing processes and operations more efficient.
The fact that generative AI is used in many domains also implies that its models can deal with different kinds of data, from natural language to audio or images. Let us understand how generative AI models address different types of data and domains.
Text generation
One of the greatest applications of generative AI—and the one we are going to cover the most throughout this book—is its capability to produce new content in natural language. Indeed, generative AI algorithms can be used to generate new text, such as articles, poetry, and product descriptions.
For example, a language model such as GPT-3, developed by OpenAI, can be trained on large amounts of text data and then used to generate new, coherent, and grammatically correct text in different languages (both in terms of input and output), as well as extract relevant features from text such as keywords, topics, or full summaries.
Here is an example of ChatGPT, which is powered by GPT models, at work:
Figure 1.2 – Example of ChatGPT responding to a user prompt, also adding references
Next, we will move on to image generation.
Image generation
One of the earliest and most well-known examples of generative AI in image synthesis is the Generative Adversarial Network (GAN) architecture, introduced in the 2014 paper Generative Adversarial Networks by I. Goodfellow et al. The purpose of GANs is to generate synthetic images that are indistinguishable from real ones. This capability has several interesting business applications, such as generating synthetic datasets for training computer vision models, generating realistic product images, and generating realistic images for virtual reality and augmented reality applications.
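The adversarial idea behind GANs can be sketched in a few lines of plain Python. The following is a hypothetical, deliberately tiny toy, not the architecture from the paper: real GANs pit two deep neural networks against each other over images, whereas here a two-parameter linear generator learns to mimic a target Gaussian over single numbers, while a logistic discriminator tries to tell real samples from generated ones:

```python
import math
import random

random.seed(42)
REAL_MU, REAL_SD = 4.0, 1.0            # the "real" data distribution

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t)) if t > -60 else 0.0

def mean(xs):
    return sum(xs) / len(xs)

# Generator G(z) = a*z + b and discriminator D(x) = sigmoid(w*x + c):
# the smallest possible "networks" that can still play the GAN game.
a, b = 1.0, 0.0                         # generator parameters
w, c = 0.1, 0.0                         # discriminator parameters
lr, steps, batch = 0.02, 3000, 64

for _ in range(steps):
    real = [random.gauss(REAL_MU, REAL_SD) for _ in range(batch)]
    z = [random.gauss(0.0, 1.0) for _ in range(batch)]
    fake = [a * zi + b for zi in z]

    # Discriminator step (gradient ascent): push D(real) -> 1, D(fake) -> 0.
    d_real = [sigmoid(w * x + c) for x in real]
    d_fake = [sigmoid(w * f + c) for f in fake]
    w += lr * (mean([(1 - dr) * x for dr, x in zip(d_real, real)])
               - mean([df * f for df, f in zip(d_fake, fake)]))
    c += lr * (mean([1 - dr for dr in d_real]) - mean(d_fake))

    # Generator step (non-saturating loss): push D(fake) -> 1.
    d_fake = [sigmoid(w * f + c) for f in fake]
    a += lr * mean([(1 - df) * w * zi for df, zi in zip(d_fake, z)])
    b += lr * mean([(1 - df) * w for df in d_fake])

# After training, the generator turns random noise into brand-new
# samples whose distribution should resemble the real one.
samples = [a * random.gauss(0.0, 1.0) + b for _ in range(1000)]
```

The key point survives even at this scale: the generator never sees the real data directly; it improves only by fooling the discriminator, and that competition is what drives it toward producing realistic output.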
Here is an example of faces of people who do not exist since they are entirely generated by AI:
Figure 1.3 – Imaginary faces generated by GAN StyleGAN2 at https://this-person-does-not-exist.com/en
Then, in 2021, a new generative AI model in this field was introduced by OpenAI: DALL-E. Unlike GANs, which take a random noise vector as input, DALL-E is designed to generate images from descriptions in natural language, and it can produce a wide range of images that may not look realistic but still depict the desired concepts.
DALL-E has great potential in creative industries such as advertising, product design, and fashion, among others, to create unique and creative images.
Here, you can see an example of DALL-E generating four images starting from a request in natural language:
Figure 1.4 – Images generated by DALL-E with a natural language prompt as input
Note that text and image generation can be combined to produce brand-new materials. In recent years, new AI tools that use this combination have become widespread.
An example is Tome AI, a generative storytelling tool that, among other capabilities, can create slide shows from scratch, leveraging models such as DALL-E and GPT-3.
Figure 1.5 – A presentation about generative AI entirely generated by Tome, using an input in natural language
As you can see, the preceding AI tool was perfectly able to create a draft presentation just based on my short input in natural language.
Music generation
The first approaches to generative AI for music generation trace back to the 1950s, with research in the field of algorithmic composition, a technique that uses algorithms to generate musical compositions. In fact, in 1957, Lejaren Hiller and Leonard Isaacson created the Illiac Suite for String Quartet (https://www.youtube.com/watch?v=n0njBFLQSk8), widely regarded as the first piece of music composed by a computer. Since then, generative AI for music has been the subject of ongoing research for several decades. Among recent developments, new architectures and frameworks have reached the general public, such as the WaveNet architecture introduced by Google in 2016, which can generate high-quality audio samples, and the Magenta project, also developed by Google, which uses Recurrent Neural Networks (RNNs) and other ML techniques to generate music and other forms of art. Then, in 2020, OpenAI announced Jukebox, a neural network that generates music, with the possibility to customize the output in terms of musical and vocal style, genre, reference artist, and so on.
Those and other frameworks became the foundations of many AI composer assistants for music generation. An example is Flow Machines, developed by Sony CSL Research. This generative AI system was trained on a large database of musical pieces to create new music in a variety of styles. It was used by French composer Benoît Carré to compose an album called Hello World (https://www.helloworldalbum.net/), which features collaborations with several human musicians.
Here, you can see an example of a track generated entirely by Music Transformer, one of the models within the Magenta project:
Figure 1.6 – Music Transformer allows users to listen to musical performances generated by AI
Another incredible application of generative AI within the music domain is speech synthesis. It is indeed possible to find many AI tools that can create audio based on text inputs in the voices of well-known singers.
For example, if you have always wondered how your songs would sound if Kanye West performed them, well, you can now fulfill your dreams with deep fake text-to-speech tools such as FakeYou.com (https://fakeyou.com/) or UberDuck.ai (https://uberduck.ai/).
Figure 1.7 – Text-to-speech synthesis with UberDuck.ai
I have to say, the result is really impressive. If you want to have fun, you can also try voices from all your favorite cartoon characters, such as Winnie the Pooh...
Next, we will look at generative AI for videos.
Video generation
Generative AI for video generation shares a similar development timeline with image generation. In fact, one of the key developments in the field of video generation has been the advent of GANs. Thanks to their success in producing realistic images, researchers began to apply these techniques to video generation as well. One of the most notable examples of GAN-based video generation is DeepMind’s Motion to Video, which generated high-quality videos from a single image and a sequence of motions. Another great example is NVIDIA’s Video-to-Video Synthesis (Vid2Vid) DL-based framework, which uses GANs to synthesize high-quality videos from input videos.
The Vid2Vid system can generate temporally consistent videos, meaning that they maintain smooth and realistic motion over time. The technology can be used to perform a variety of video synthesis tasks, such as the following:
- Converting videos from one domain into another (for example, converting a daytime video into a nighttime video or a sketch into a realistic image)
- Modifying existing videos (for example, changing the style or appearance of objects in a video)
- Creating new videos from static images (for example, animating a sequence of still images)
In September 2022, Meta’s researchers announced the general availability of Make-A-Video (https://makeavideo.studio/), a new AI system that allows users to convert their natural language prompts into video clips. Behind such technology, you can recognize many of the models we have mentioned for other domains so far: language understanding for the prompt, image and motion generation for the frames, and background music composed by AI.
Overall, generative AI has been impacting many domains for years, and some AI tools already consistently support artists, organizations, and general users. The future seems very promising; however, before jumping to the latest models available on the market today, we first need a deeper understanding of the roots of generative AI, its research history, and the recent developments that eventually led to the current OpenAI models.