Introduction
Transformers use powerful techniques to process tokens before feeding them into a neural network that helps select the next token. At the top of the transformer sits a simple neural network, the transformer head. The text generator model processes input tokens and produces a probability distribution over the next token. The number of input tokens, known as the context length or block size, is a hyperparameter. The model's primary goal is to predict the next token based on the input tokens (also referred to as the context tokens or context window). Given n tokens, our goal is to predict the token that best follows the previous ones; we rely on these n tokens to anticipate the next.
As humans, we try to grasp the context of a conversation - where we are and a rough sense of where it is heading. Once we have gathered the relevant cues, fitting words come to mind while irrelevant ones fade, and we can choose the next word with precision. We occasionally err, but we can backtrack, a luxury transformers lack: if they predict incorrectly (an irrelevant token), they keep going, though there are exceptions, such as beam search. Unlike us, transformers cannot look ahead. Revisiting the n previous tokens, we humans inspect them individually and discern relationships from different angles. By prioritizing pivotal tokens and disregarding superfluous ones, we evaluate tokens within various contexts. We scrutinize all n previous tokens individually, ready to make a prediction. This is the essence of the multihead attention mechanism in transformers.
Consider a context window with 5 tokens. Each wears a distinct mask, predicting its respective next token: "To discern the void amidst, we must first grasp the fullness within." To understand what token is missing, we must first identify what we are and what we possess. We need communication between tokens: the tokens don't know each other yet, and in order to predict their own next token, they first need to know each other well and pair up so that tokens with similar characteristics stay near each other (technically, tokens with similar vectors). Each token has three vectors that represent:
● What tokens it is looking for (known as the query)
● What it really has (known as the key)
● What it is (known as the value)
Each token uses its query to look for similar keys; tokens find each other and get to know one another by adding up their values. Similar tokens find each other, and if a token is somehow dissimilar - here, Token 4 - the other tokens don't consider it much. But note that every token has some effect, large or small, on the other tokens. Also, in self-attention, all tokens ask all other tokens, via their queries and keys, to find familiar tokens - but not the future tokens. This is called masked self-attention: we prohibit tokens from communicating with future tokens. After exchanging information and mixing up their values, similar tokens become even more similar: the color of similar tokens becomes more similar (in practice, their vectors become more similar). Since the tokens in the group wear masks, we cannot access the true tokens' values; we only know and distinguish them by their mask (value). This is because every token has different characteristics in different contexts and does not show its true essence.
So far so good; we have finished the self-attention process, and now the group is ready to predict its next tokens.
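Before going further, here is a tiny, illustrative sketch of this intuition in PyTorch. It is a toy example with random vectors, not the learned query/key/value projections we build later: similar vectors get high scores, future tokens are masked out, and each token ends up as a weighted mix of the tokens it attends to.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tokens = torch.randn(5, 4)                              # 5 context tokens, each a 4-dimensional vector
scores = tokens @ tokens.T                              # pairwise similarity between tokens
mask = torch.tril(torch.ones(5, 5))                     # lower triangular: no peeking at future tokens
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)                     # each row becomes a probability distribution
mixed = weights @ tokens                                # weighted sum: each token mixes in similar tokens' values
print(mixed.shape)                                      # torch.Size([5, 4])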
Because the tokens now know each other very well, each one can guess its next token better. Next, each token separately goes through a nonlinear network and then through the transformer head to predict its own next token. We ask each token, separately, for its opinion on the probability of what token comes next. Finally, we collect the probability distributions of all the tokens in the context window. A probability distribution sums to 100%, or, in practice, to 1; we assign a probability to every token in the model's vocabulary. The simplest way to extract the next token from a probability distribution is to select the one with the highest probability (a minimal sketch of this greedy selection follows the imports below). Each token goes to the neural network, and the network returns a probability distribution. The result is the following sentence: "It looks like a bug". Voila! We managed to go through a simple transformer model.
Let's recap everything we've said. A transformer receives n tokens as input, does some work (self-attention, layer normalization, etc.), and feeds them forward into a neural network to get probability distributions over the next token. Each token goes to the neural network separately; if the number of tokens is 10, there are 10 probability distributions.
At this point, you know intuitively how the main building blocks of a transformer work. But let us understand them better by implementing a transformer model.
Clone the tiny-transformer repository:
git clone https://github.com/saeeddhqan/tiny-transformer
Execute simple_model.py in the repository if you simply want to run the model for training.
Create a new file, and import the necessary modules:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
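As promised above, here is a minimal, illustrative sketch of greedy selection over a toy probability distribution (unrelated to the model we are about to build): we simply take the index with the highest probability.
probs = torch.tensor([0.05, 0.70, 0.10, 0.15])  # toy probability distribution over a 4-token vocabulary
next_token = torch.argmax(probs).item()         # greedy decoding: pick the most probable token
print(next_token)                               # 1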
Load the dataset and write the tokenizer:
with open('shakespeare.txt') as fp:
    text = fp.read()
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {c:i for i,c in enumerate(chars)}
itos = {i:c for c,i in stoi.items()}
encode = lambda s: [stoi[x] for x in s]
decode = lambda e: ''.join([itos[x] for x in e])
● Open the dataset and define chars, a list of all unique characters in the text.
● The set function splits the text character by character and removes duplicates, just like sets in set theory; list(set(myvar)) is a common way to remove duplicates from a list or string.
● vocab_size is the number of unique characters (here 65).
● stoi is a dictionary whose keys are characters and whose values are their indices.
● itos is used to convert indices back to characters.
● The encode function receives a string and returns the indices of its characters.
● The decode function receives a list of indices and returns a string.
Split the dataset into train and test sets and write a function that returns data for training:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
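# Optional sanity check for the tokenizer (illustrative; not part of the original script):
# encoding a string and decoding it back should return the original string.
assert decode(encode('hello world')) == 'hello world'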
torch.manual_seed(1234)
data = torch.tensor(encode(text), dtype=torch.long).to(device)
train_split = int(0.9 * len(data))
train_data = data[:train_split]
test_data = data[train_split:]
def get_batch(split='train', block_size=16, batch_size=1):
    """Create a random batch and return it along with the targets."""
    data = train_data if split == 'train' else test_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
● Choose a suitable device.
● Set a seed to make the training reproducible.
● Convert the text into a long list of indices with the encode function.
● Since the character indices are integers, we use the torch.long data type to make the data suitable for the model.
● Use 90% of the data for training and 10% for testing.
● If batch_size is 10, get_batch selects 10 chunks (sequences) from the dataset and stacks them up so they can be processed simultaneously.
● If batch_size is 1, get_batch selects 1 random chunk (block_size consecutive characters) from the dataset and returns x and y, where x contains the indices of 16 characters and y contains the target characters for x.
The shape, value, and decoded version of the selected chunk are as follows:
shape x: torch.Size([1, 16])
shape y: torch.Size([1, 16])
value x: tensor([[41, 43, 6, 1, 60, 47, 50, 50, 39, 47, 52, 2, 1, 52, 43, 60]])
value y: tensor([[43, 6, 1, 60, 47, 50, 50, 39, 47, 52, 2, 1, 52, 43, 60, 43]])
decoded x: ce, villain! nev
decoded y: e, villain! neve
We usually process multiple chunks or sequences at once with batching in order to speed up training. For each character, we have an equivalent target, which is its next token. The target for 'c' is 'e', for 'e' it is ',', for 'v' it is 'i', and so on.
Let us talk a bit about the input and output shapes of tensors in a transformer model. The model receives a list of token indices like the above (called a sequence, or chunk) and maps them to their corresponding vectors.
● The input shape is (batch_size, block_size).
● After mapping indices to vectors, the data shape becomes (batch_size, block_size, embed_size).
● Through the multihead attention and feed-forward layers, the data shape does not change.
● Finally, the data with shape (batch_size, block_size, embed_size) goes to the transformer head (a simple neural network) and the output shape becomes (batch_size, block_size, vocab_size). vocab_size is the number of unique characters that can come next (for the Shakespeare dataset, 65).
Self-attention
The communication between tokens happens in the head class. We define a scores variable to store the similarity between vectors; the higher the score, the more two vectors have in common. We then use these scores to compute a weighted sum of all the vectors:
class head(nn.Module):
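    """A single self-attention head: project the input to query, key, and value,
    compute masked scaled dot-product scores, and return the weighted sum of the values."""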
    def __init__(self, embeds_size=32, block_size=16, head_size=8):
        super().__init__()
        self.key = nn.Linear(embeds_size, head_size, bias=False)
        self.query = nn.Linear(embeds_size, head_size, bias=False)
        self.value = nn.Linear(embeds_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        B, T, C = x.shape
        # What am I looking for?
        q = self.query(x)
        # What do I have?
        k = self.key(x)
        # What is the representation value of me?
        # Or: what's my personality in the group?
        # Or: what mask do I have when I'm in a group?
        v = self.value(x)
        scores = q @ k.transpose(-2, -1) * (1 / math.sqrt(C))  # (B,T,head_size) @ (B,head_size,T) --> (B,T,T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        scores = F.softmax(scores, dim=-1)
        scores = self.dropout(scores)
        out = scores @ v
        return out
We use three linear layers to transform each vector into a key, query, and value, but with a smaller dimension (here equal to head_size).
Q, K, V: Q and K are used to find similar tokens. We calculate the similarity between vectors with a dot product: q @ k.transpose(-2, -1). The shape of scores is (batch_size, block_size, block_size), which means we have the similarity scores between all pairs of vectors in the block. V is used for the weighted sum.
Scores: Pure dot-product scores tend to be very large, which is not suitable for softmax because it makes the resulting distribution too peaked. Therefore, we rescale the results by a factor of (1 / math.sqrt(C)), where C is the embedding size. This is called a scaled dot product.
register_buffer: We use register_buffer to register the lower triangular tensor. This way, when you save and load the model, this tensor also becomes part of the model (without being a trainable parameter).
Masking: After calculating the scores, we replace the future scores with -inf to shut them off so that the vectors do not have access to future tokens. These scores effectively become zero after applying the softmax function, resulting in a probability of zero for the future tokens. This process is referred to as masking. Here's an example of masked scores with a block size of 4:
[[-0.1710, -inf, -inf, -inf],
[ 0.2007, -0.0878, -inf, -inf],
[-0.0405, 0.2913, 0.0445, -inf],
[ 0.1328, -0.2244, 0.0796, 0.1719]]
Softmax: It converts a vector into a probability distribution that sums to 1. Here are the scores after softmax:
[[1.0000, 0.0000, 0.0000, 0.0000],
 [0.5716, 0.4284, 0.0000, 0.0000],
 [0.2872, 0.4002, 0.3127, 0.0000],
 [0.2712, 0.1897, 0.2571, 0.2820]]
The scores of the future tokens are zero; after the weighted sum, the future vectors contribute nothing, and each vector receives no information from future vectors (n * 0 = 0).
Dropout: Dropout is a regularization technique. It randomly drops some of the numbers in the vectors. Dropout helps the model generalize instead of memorizing the dataset. We don't want the model to memorize the Shakespeare dataset, right? We want it to create new texts that resemble the dataset.
Weighted sum: The weighted sum combines different representations or embeddings based on their importance. The scores measure the relevance or similarity between each pair of vectors and are obtained by applying a scaled dot product between the query and key vectors, which are learned during training. The resulting weighted sum emphasizes the more important elements and reduces the influence of less relevant ones, allowing the model to focus on the most salient information. We multiply the scores with the values, and the result is the output of self-attention.
Output: Since the embedding size and head size are 32 and 8 respectively, if the input shape is (batch_size, block_size, 32), the output has the shape (batch_size, block_size, 8).
Multihead self-attention
"I have multiple personalities (v), tendencies and needs (q), and valuable things (k) in different spaces," the vectors said.
We transform the vectors into smaller dimensions and then run self-attention on them; we did this in the previous class. In multihead self-attention, we call the head class four times and then concatenate the smaller vectors to recover the input shape. For instance, if the shape of the input data is (1, 16, 32), we transform it into four (1, 16, 8) tensors and run self-attention on each of them. Why four times? Because 4 * 8 equals the initial embedding size. By running self-attention multiple times in different spaces, we let the model consider different aspects of the vectors. That's all! Here is the code:
class multihead(nn.Module):
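    """Multihead self-attention: run several heads in parallel in smaller subspaces,
    concatenate their outputs, and project the result back with a linear layer."""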
    def __init__(self, num_heads=4, head_size=8):
        super().__init__()
        # Pass head_size by keyword so that embeds_size keeps its default value.
        self.multihead = nn.ModuleList([head(head_size=head_size) for _ in range(num_heads)])
        self.output_linear = nn.Linear(embeds_size, embeds_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, hidden_state):
        hidden_state = torch.cat([head(hidden_state) for head in self.multihead], dim=-1)
        hidden_state = self.output_linear(hidden_state)
        hidden_state = self.dropout(hidden_state)
        return hidden_state
● self.multihead: Creates four heads; we do this with nn.ModuleList.
● self.output_linear: A linear projection we apply at the end of the multihead self-attention process.
● self.dropout: Applies dropout to the final results.
● hidden_state 1: Concatenates the outputs of the heads so that we recover the same shape as the input. The heads transform the data into different spaces with smaller dimensions and then do the self-attention.
● hidden_state 2: After the tokens have communicated through self-attention, we use the self.output_linear projection to let the model adjust the vectors further, based on the gradients that flow through this layer.
● dropout: Applies dropout to the output of the projection, with a 10% probability of turning off values (setting them to zero) in the vectors.
Transformer block
There are two new techniques here, layer normalization and residual connections, that need to be explained:
class transformer_block(nn.Module):
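    """A transformer block: multihead self-attention followed by a feed-forward network,
    each preceded by layer normalization and wrapped in a residual connection."""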
    def __init__(self, embeds_size=32, num_heads=8):
        super().__init__()
        self.head_count = embeds_size // num_heads
        self.n_heads = multihead(num_heads, self.head_count)
        self.ffn = nn.Sequential(
            nn.Linear(embeds_size, 4 * embeds_size),
            nn.ReLU(),
            nn.Linear(4 * embeds_size, embeds_size),
            nn.Dropout(drop_prob),
        )
        self.ln1 = nn.LayerNorm(embeds_size)
        self.ln2 = nn.LayerNorm(embeds_size)

    def forward(self, hidden_state):
        hidden_state = hidden_state + self.n_heads(self.ln1(hidden_state))
        hidden_state = hidden_state + self.ffn(self.ln2(hidden_state))
        return hidden_state
● self.head_count: Calculates the head size. The embedding size should be divisible by the number of heads so that we can concatenate the outputs of the heads.
● self.n_heads: The multihead self-attention layer.
● self.ffn: This is the first time we have non-linearity in the model. Non-linearity helps the model capture complex relationships and patterns in the data. By introducing non-linearity through activation functions such as ReLU or GELU, the model can fit more intricate mappings of the input data. Non-linearity acts like a gate: "pass this on to the next layer", "don't pass this on", or "create y from x for the next layer". The recommended hidden layer size is four times the embedding size; that's why we use 4 * embeds_size. You can also try SwiGLU as the activation function instead of ReLU.
● self.ln1 and self.ln2: Layer normalization makes the model more robust and also helps it converge faster. It rescales the data so that the mean is zero and the standard deviation is one.
● hidden_state 1: Normalize the vectors with self.ln1 and pass them to the multihead attention. Next, we add the input to the output of the multihead attention. This helps the model in two ways:
○ First, the model keeps some information from the original vectors.
○ Second, when the model becomes deep, the gradients reaching the earlier layers during backpropagation become weak and the model converges too slowly. This effect is known as vanishing gradients. Adding the input helps enrich the gradients and mitigate the vanishing; this is known as a residual connection.
● hidden_state 2: hidden_state 1 goes through a layer normalization and then through the nonlinear network. The output is added back to the hidden state, again with the aim of preserving gradients for all layers.
The model
All the necessary parts are ready; let us stack them up to make the full model:
class transformer(nn.Module):
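    """The full model: token and positional embeddings, a stack of transformer blocks,
    a final layer normalization, and a linear head over the vocabulary."""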
    def __init__(self):
        super().__init__()
        self.stack = nn.ModuleDict(dict(
            tok_embs=nn.Embedding(vocab_size, embeds_size),
            pos_embs=nn.Embedding(block_size, embeds_size),
            dropout=nn.Dropout(drop_prob),
            blocks=nn.Sequential(
                transformer_block(),
                transformer_block(),
                transformer_block(),
                transformer_block(),
                transformer_block(),
            ),
            ln=nn.LayerNorm(embeds_size),
            lm_head=nn.Linear(embeds_size, vocab_size),
        ))
● self.stack: A container for all the necessary layers.
● tok_embs: A learnable lookup table that receives a list of indices and returns their vectors.
● pos_embs: Just like tok_embs, it is a learnable lookup table, but for positional embeddings. It receives a list of positions and returns their vectors.
● dropout: A dropout layer.
● blocks: We stack multiple transformer blocks sequentially.
● ln: A layer normalization.
● lm_head: The transformer head receives a token representation and returns the probabilities (logits) of the next token. To turn the model into a classifier or a sentiment analysis model, we just need to change this layer and remove the masking from the self-attention layer.
The forward method of the transformer class:
    def forward(self, seq, targets=None):
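        # seq has shape (batch_size, block_size) and holds token indices.
        # The method returns logits of shape (batch_size, block_size, vocab_size)
        # and, if targets are given, the cross-entropy loss.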
        B, T = seq.shape
        tok_emb = self.stack.tok_embs(seq)  # (batch, block_size, embed_dim) (B,T,C)
        pos_emb = self.stack.pos_embs(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.stack.dropout(x)
        x = self.stack.blocks(x)
        x = self.stack.ln(x)
        logits = self.stack.lm_head(x)  # (B, block_size, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
● tok_emb: Converts token indices into vectors. Given an input of shape (B, T), the output is (B, T, C), where C is embeds_size.
● pos_emb: Given the positions of the tokens in the context window (up to block_size), it returns the positional embedding for each position.
● x 1: Adds up the token embeddings and position embeddings. A little lossy, but it works just fine.
● x 2: Runs dropout on the embeddings.
● x 3: The embeddings go through all the transformer blocks and their multihead self-attention. The input is (B, T, C) and the output is (B, T, C).
● x 4: The outcome of the transformer blocks goes to the layer normalization.
● logits: The unnormalized values coming out of the language model head are usually called logits.
● if-else block: If targets are specified, we calculate the cross-entropy loss; otherwise, the loss is None. Before calculating the loss in the else block, we reshape the tensors as the cross-entropy function expects.
● Output: The method returns the logits with shape (batch_size, block_size, vocab_size) and the loss, if any.
For generating text, add this method to the transformer class:
    def autocomplete(self, seq, _len=10):
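        # Autoregressive generation: repeatedly feed the last block_size tokens
        # to the model, sample the next token, and append it to the sequence.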
        for _ in range(_len):
            seq_crop = seq[:, -block_size:]  # crop it
            logits, _ = self(seq_crop)
            logits = logits[:, -1, :]  # we care about the last token
            probs = F.softmax(logits, dim=-1)
            next_char = torch.multinomial(probs, num_samples=1)
            seq = torch.cat((seq, next_char), dim=1)
        return seq
● autocomplete: Given a tokenized sequence and the number of tokens to create, this method generates _len new tokens.
● seq_crop: Selects the last block_size tokens in the sequence to give to the model. The sequence can become longer than block_size, which would cause an error if we didn't crop it.
● logits 1: Forwards the sequence through the model to get the logits.
● logits 2: Selects the last logit, which is used to choose the next token.
● probs: Runs softmax on the logits to get a probability distribution.
● next_char: torch.multinomial draws one sample from probs. The higher a token's probability, the higher its chance of being selected.
● seq: Appends the selected character to the sequence.
Training
The rest of the code covers downstream tasks such as the training loop. The code provided here differs slightly from the tiny-transformer repository. I trained the model with the following hyperparameters:
block_size = 256
learning_rate = 9e-4
eval_interval = 300 # Every n step, we do an evaluation.
iterations = 5000 # Like epochs
batch_size = 64
embeds_size = 195
num_heads = 5
num_layers = 5
drop_prob = 0.15
And here's the generated text:
If you need to improve the quality, increase embeds_size, num_layers, and num_heads.
Conclusion
The article explores the role of transformers in text generation, detailing how tokens are processed through self-attention and the transformer head. Transformers predict the next token, with the context length as a hyperparameter. A parallel is drawn with human context comprehension: relevant words emerge and irrelevant ones fade, allowing precise word selection, although transformers lack human foresight and backtracking. The key components (self-attention, multihead self-attention, and transformer blocks) are explained and supported by code snippets. Token and positional embeddings, layer normalization, and residual connections are detailed, and the model's text generation is demonstrated via the autocomplete method. Training parameters and ways to improve text quality are addressed, showcasing the potential of transformers.
Author Bio
Saeed Dehqan trains language models from scratch. Currently, his work is centered around language models for text generation, and he has a strong understanding of the underlying concepts of neural networks. He is proficient in using optimizers such as genetic algorithms to fine-tune network hyperparameters and has experience with neural architecture search (NAS) using reinforcement learning (RL). He implements models end to end, from data gathering to monitoring and deployment on mobile, web, cloud, etc.