Designing Decoder-only Transformer Models like ChatGPT

  • 9 min read
  • 28 Aug 2023


Introduction

Embark on a journey into the ChatGPT stack, a remarkable feat in AI-driven language generation. Tracing its evolution from a plain language model to a proficient AI assistant, we delve into decoder-only transformers, the architecture behind models that can craft Shakespearean verse as well as informative responses.
Throughout this exploration, we dissect the four stages that constitute the ChatGPT stack. From large-scale pretraining to supervised fine-tuning, we see how reward modeling and reinforcement learning refine response generation to align with context and user intent.
In this blog, we will briefly get acquainted with the ChatGPT stack and then implement a simple decoder-only transformer to train on Shakespeare.

Creating ChatGPT models consists of four main stages:
1. Pretraining
2. Supervised fine-tuning
3. Reward modeling
4. Reinforcement learning

The pretraining stage takes most of the computational time, since we train the language model on trillions of tokens. The following table shows the data mixture used to pretrain Meta's LLaMA models [0]:

[Image: table of the LLaMA pretraining data mixture, listing each dataset with its sampling proportion, number of epochs, and size]

The datasets are mixed together according to their sampling proportions to create the pretraining data. The table lists each dataset along with its sampling proportion (what fraction of the pretraining data it contributes), its number of epochs (how many times the model is trained on that dataset), and its size. Notice that high-quality datasets such as Wikipedia and Books are trained on for more epochs, so the model grasps them better.
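To make the idea of a sampling proportion concrete, here is a minimal sketch (with made-up dataset names and weights, not the real LLaMA numbers) of drawing pretraining documents from several sources according to a weighting:

import random

# Hypothetical sources and sampling proportions (placeholder numbers,
# not the actual LLaMA mixture).
mixture = {
    "web_crawl": 0.60,
    "code":      0.20,
    "books":     0.15,
    "wikipedia": 0.05,
}

sources = list(mixture.keys())
weights = list(mixture.values())

def sample_source():
    # Choose which dataset the next training document is drawn from,
    # proportionally to its sampling weight.
    return random.choices(sources, weights=weights, k=1)[0]

# Draw the source for five documents.
print([sample_source() for _ in range(5)])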
After the dataset is ready, the next step before training is tokenization. Tokenizing means mapping all of the text into a large list of integers. In language modeling repositories, we usually keep two dictionaries for mapping tokens to integers and back (a token is a subword; for example, 'wait' and 'ing' are two tokens). Here is an example:

In [1]: text = "it is obvious that the epoch of high .."
In [2]: tokens = list(set(text.split()))  # note: set ordering is arbitrary, so the exact ids vary between runs
In [3]: stoi = {s:i for i,s in enumerate(tokens)}
In [4]: itos = {i:s for s,i in stoi.items()}
In [5]: stoi['it']
Out[5]: 22
In [6]: itos[22]
Out[6]: 'it'

Now, we can tokenize texts with the following functions:

In [7]: encode = lambda text: [stoi[x] for x in text.split()]
In [8]: decode = lambda encoded: ' '.join([itos[x] for x in encoded])
In [9]: tokenized = encode(text)
In [10]: tokenized
Out[10]: [22,  19,  18,  5, ...]
In [11]: decode(tokenized)
Out[11]: 'it is obvious that the epoch of high ..'
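Note that this toy example splits on whitespace and builds its vocabulary from a single sentence; real models use learned subword tokenizers. As a rough illustration, assuming the tiktoken package is installed, a GPT-2-style BPE tokenizer can be used like this:

import tiktoken  # assumed installed: pip install tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2 byte-pair-encoding vocabulary
ids = enc.encode("it is obvious that the epoch of high ..")
print(ids)              # a list of integer token ids
print(enc.decode(ids))  # recovers the original text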

Suppose the tokenized variable contains all the tokens converted to integers (say, 1 billion tokens). We randomly select 3 chunks from the list, each containing 10 tokens, and feed them into a transformer language model to predict the next token. The model's input has shape (3, 10), where 3 is the batch size and 10 is the context length. The model predicts the next token for each chunk independently; processing 3 chunks at once simply speeds up training, like running the model on 3 pieces of data in parallel. You can increase the batch size and context length depending on your requirements and resources. Here's an example:

[Image: a batch of three chunks of ten tokens each, with token indices shown alongside the corresponding tokens]
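A minimal sketch of this chunk selection, assuming PyTorch and assuming tokenized now holds the ids of a long corpus (the names data and get_batch are just illustrative):

import torch

batch_size = 3       # number of chunks processed in parallel
context_length = 10  # tokens per chunk

# 1-D tensor of token ids for the whole corpus
data = torch.tensor(tokenized, dtype=torch.long)

def get_batch():
    # pick batch_size random starting positions
    ix = torch.randint(len(data) - context_length - 1, (batch_size,))
    x = torch.stack([data[i:i + context_length] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + context_length + 1] for i in ix])  # targets, shifted by one
    return x, y

x, y = get_batch()
print(x.shape)  # torch.Size([3, 10])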

For convenience, the token indices are written alongside the corresponding tokens. For each chunk or sequence, the model makes a prediction at every position. Let's see how this works:

[Image: step 1, context 'it', target 'is'; future tokens shown on a gray background]

By seeing the first token (it), the model predicts the next token (is). The context token is 'it' and the target token is 'is'. If the model fails to predict the target token, we backpropagate and adjust the model parameters so that it can predict correctly.
During this process, we mask out (hide) the future tokens so that the model cannot access them; otherwise it would be cheating. We want the model to predict the future by seeing only the past tokens. That makes sense, right? That's why the future tokens have a gray background in the figures: the model is not able to see them.
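Inside the attention layers this hiding is implemented with a causal mask; a minimal sketch in PyTorch (illustrative, not the article's implementation):

import torch

context_length = 5
# toy attention scores of shape (context_length, context_length)
scores = torch.randn(context_length, context_length)

# lower-triangular mask: position i may only attend to positions <= i
mask = torch.tril(torch.ones(context_length, context_length)).bool()

# future positions get -inf so that softmax gives them zero weight
masked_scores = scores.masked_fill(~mask, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # each row sums to 1; the upper triangle is all zeros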

[Image: step 2, context 'it is', target 'obvious']

After predicting the second token, we have two tokens [it, is] as context to predict the next token in the sequence, which is the third token (obvious).

[Image: step 3, context 'it is obvious', target 'that']

Using the three previous tokens [it, is, obvious], the model needs to predict the fourth token (that). As usual, we hide the future tokens (in this case 'the').

[Image: step 4, context 'it is obvious that', target 'the']

We give [it, is, obvious, that] to the model as context in order to predict 'the'. Finally, we give the whole sequence [it, is, obvious, that, the] as context to predict the next token.
So we have five predictions for a sequence of length five.
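In code, these per-position predictions are trained with a single cross-entropy loss over all positions at once. A sketch, assuming model returns logits of shape (batch, time, vocab_size) and optimizer is a standard PyTorch optimizer:

import torch.nn.functional as F

def training_step(model, optimizer, x, y):
    # x, y: (batch, time) integer tensors from get_batch(); y is x shifted by one
    logits = model(x)                         # (batch, time, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        y.reshape(-1),                        # one target token per position
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()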
After training the model on many randomly selected sequences from the pretraining dataset, it should be able to autocomplete your sequence: give it a sequence of tokens and it predicts the next token; then, based on the predicted token plus the previous tokens, it predicts the next one, and so on, one token at a time. This is why we call it an autoregressive model. That's it.

[Image: the full sequence 'it is obvious that the' used as context to predict the next token]
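A sketch of this autoregressive sampling loop, again assuming model returns next-token logits of shape (batch, time, vocab_size):

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, context_length):
    # idx: (batch, time) tensor of token ids used as the prompt
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]                # crop to the context window
        logits = model(idx_cond)[:, -1, :]                 # logits for the last position only
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat([idx, next_id], dim=1)             # append and repeat
    return idx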


But at this stage, the model is not an AI assistant or a chatbot. It only receives a sequence and tries to complete it, because that is how we trained it. We did not train it to answer questions or follow instructions; we give it context tokens and it predicts the next token based on that context.

You give it this:

“In order to be irrational, you first need to”

And the model continues the sequence:

“In order to be irrational, you first need to abandon logical reasoning and disregard factual evidence.”

Sometimes, you give it an instruction:


“Write a function to count from 1 to 100.”

And instead of writing such a function, the model simply continues with more instructions of the same kind:

“Write a program to sort an array of integers in ascending order.”

“Write a script to calculate the factorial of a given number.”

“Write a method to validate a user's input and ensure it meets the specified criteria.”

“Write a function to check if a string is a palindrome or not.”

That's where prompt engineering came in: people used tricks in the prompt to get the answer to a question out of the model.

Give the model the following prompt:

“London is the capital of England.
Copenhagen is the capital of Denmark.
Oslo is the capital of”

The model answers like this:

“Norway.”

So, with prompt engineering, we managed to get something useful out of the model. But we don't want to provide examples every time; we want to ask a question and receive an answer. To prepare the model to be an AI assistant, we need a further training stage called Supervised Fine-Tuning.
In the Supervised Fine-Tuning stage, we make the model instructional. To achieve this, the model is trained on a high-quality dataset of roughly 15K-100K prompt-and-response pairs.

Here’s an example of it:
{
   "instruction": "When was the last flight of Concorde?",
   "context": "",
   "response": "On 26 November 2003",
   "category": "open_qa"
}

This example is taken from databricks-dolly-15k, an open-source dataset for supervised/instruction fine-tuning [1]. Instructions fall into seven categories: brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. The variety is there because we want to train the model on different kinds of tasks. For instance, the instruction above is open QA, meaning it is a general knowledge question that does not require reasoning over a provided context; it teaches the model to answer general questions. Closed QA, by contrast, supplies a context passage and requires the model to reason over it.
During instruction fine-tuning, nothing changes algorithmically: we follow the same process as in the pretraining stage. We give the instruction as context tokens and train the model to continue the sequence with the response.
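Concretely, one common way to turn such a record into a training sequence is to concatenate the instruction and response with a simple prompt template and then tokenize the result like any other pretraining text. The template below is hypothetical, not the official Dolly format:

example = {
    "instruction": "When was the last flight of Concorde?",
    "context": "",
    "response": "On 26 November 2003",
}

def format_example(ex):
    # hypothetical template; in practice the loss is often computed
    # only on the response tokens
    prompt = f"### Instruction:\n{ex['instruction']}\n"
    if ex["context"]:
        prompt += f"### Context:\n{ex['context']}\n"
    prompt += "### Response:\n"
    return prompt + ex["response"]

full_text = format_example(example)
# full_text is then tokenized and trained on exactly like a pretraining sequence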

We continue this process for thousands of examples, and then the model is ready to follow instructions. But that's not the end of the story of the model behind ChatGPT. OpenAI then trains a supervised reward model that assigns a score to the sequences produced by the base model for a given prompt. They give the model a prompt and run it, say, four times to obtain four different answers for the same prompt (the answers differ because of the sampling method used for generation). The reward model then receives the prompt together with the generated answers and produces a reward score for each one: the better the answer, the higher the score. Training the reward model requires ground-truth preferences, and these came from human labelers working for OpenAI, who were shown a prompt and the model's responses and ranked them from best to worst.
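The reward model itself is typically trained with a pairwise ranking loss: for two answers to the same prompt, the one the labelers ranked higher should get the higher score. A minimal sketch of such a loss (a simplification of the objective described in the InstructGPT paper):

import torch
import torch.nn.functional as F

def ranking_loss(reward_chosen, reward_rejected):
    # reward_chosen / reward_rejected: scores the reward model assigns
    # to the preferred and the less-preferred answer for the same prompt
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy scores for a batch of four comparisons
chosen = torch.tensor([2.1, 0.3, 1.5, 0.9])
rejected = torch.tensor([1.0, 0.8, 0.2, 0.7])
print(ranking_loss(chosen, rejected))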

In the final stage, ChatGPT uses Reinforcement Learning from Human Feedback (RLHF) to generate responses that score highly according to the reward model. RL is a learning framework that tries to find the best way of achieving a goal; the goal can be a checkmate in chess or the best answer to an input prompt. The learning process amounts to taking actions, receiving rewards or penalties for them, and learning to avoid the actions that lead to penalties. RLHF is a big part of what made ChatGPT so good:

[Image: win rates of the PPO-ptx (RLHF) model compared with the SFT model, GPT with prompt engineering, and the GPT base model]

The PPO-ptx curve shows the win rate of GPT with RLHF compared with the SFT (supervised fine-tuned) model, GPT with prompt engineering, and the GPT base model.
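ChatGPT uses PPO for this stage. As a much simpler illustration of the underlying idea, reinforcing answers that the reward model scores highly, here is a REINFORCE-style sketch; policy.generate_with_logprobs and reward_model are hypothetical interfaces, and this is not the algorithm OpenAI actually used:

def rlhf_step(policy, reward_model, prompts, optimizer):
    # hypothetical interfaces: policy.generate_with_logprobs returns the
    # sampled responses and the log-probability of each response;
    # reward_model returns one scalar score per (prompt, response) pair
    responses, log_probs = policy.generate_with_logprobs(prompts)
    rewards = reward_model(prompts, responses)
    baseline = rewards.mean()                    # simple variance-reduction baseline
    # increase the log-probability of responses with above-average reward
    loss = -((rewards - baseline).detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()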

Conclusion

In summary, the ChatGPT stack exemplifies the potent fusion of AI and language generation. From a plain pretrained model to a proficient AI assistant, we have traversed its core stages: pretraining, supervised fine-tuning, reward modeling, and reinforcement learning, and seen how decoder-only transformers can bring everything from Shakespearean verse to informative answers to life.
We also looked at the role tokenization plays in enabling this pipeline. ChatGPT's ascent highlights AI's potential to emulate human-like language understanding, and with ongoing refinement, the future promises versatile conversational AI that bridges artificial intelligence and the artistry of language, fostering human-AI understanding.

Author Bio

Saeed Dehqan trains language models from scratch. His current work centers on language models for text generation, and he has a strong understanding of the underlying concepts of neural networks. He is proficient in using optimizers such as genetic algorithms to fine-tune network hyperparameters and has experience with neural architecture search (NAS) using reinforcement learning (RL). He implements models end to end, from data gathering through monitoring to deployment on mobile, web, cloud, etc.