
Fine-Tuning LLaMA 2

  • 9 min read
  • 06 Nov 2023



Introduction

Large Language Models have recently become the talk of the town. I am sure you must have heard of ChatGPT. Yes, that’s an LLM, and that’s exactly what I am talking about. Every few weeks, we see newer, better, but not necessarily larger LLMs being released, either as open-source or closed-source. This is probably the best time to learn about them and make these powerful models work for your specific use case.

In today’s blog, we will look into one of the recent open-source models, called Llama 2, and try to fine-tune it on the standard NLP task of recognizing entities in text. We will first look into what large language models are, what open-source and closed-source models are, and some examples of each. We will then move on to learning about Llama 2 and why it is so special. After that, we describe our NLP task and dataset. Finally, we get into coding.

About Large Language Models (LLMs)

Language models are artificial intelligence systems that have been trained to understand and generate human language. Large Language Models (LLMs) like GPT-3, ChatGPT, GPT-4, Bard, and similar models can perform a diverse set of tasks out of the box. Often, the quality of their output is highly dependent on the finesse of the prompt given by the user.

These language models are trained on vast amounts of text data from the Internet, ranging from books and articles to websites and social media posts. Most of them are trained in an auto-regressive way, i.e., they try to maximize the probability of the next word based on the words they have produced or seen in the past. Language models have a wide range of applications, including chatbots, virtual assistants, content generation, and more, and can be used in industries like customer service, healthcare, finance, and marketing.

Since these models are trained on enormous amounts of data, they are already good at zero-shot inference and can be steered to perform better with few-shot examples. Zero-shot is a setup in which a model can handle tasks or categories it has not explicitly seen during training. In a few-shot setting, the goal is to make predictions for new classes based on a few labeled examples provided to the model at inference time.
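To make the distinction concrete, here is a minimal sketch of how the two prompting styles differ for an entity-extraction question; the sentences and wording below are purely illustrative and not from the original post.

# Zero-shot: ask the model to do the task with no solved examples
zero_shot_prompt = (
    "Extract the person names from the sentence below.\n"
    "Sentence: Alice met Bob in Paris.\n"
    "Person names:"
)

# Few-shot: prepend a couple of solved examples before the actual query
few_shot_prompt = (
    "Extract the person names from each sentence.\n"
    "Sentence: John called Mary yesterday.\n"
    "Person names: John, Mary\n"
    "Sentence: Alice met Bob in Paris.\n"
    "Person names:"
)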

Despite their amazing text-generation capabilities, these humongous models come with a few limitations that must be considered when building an LLM-based production pipeline, such as hallucinations, biases, and more.

Closed and Open-source Language Models

Closed-source large language models are those developed and operated by companies and not readily accessible to the public. The training data for these models is typically kept private. While they can be highly sophisticated, this limits transparency, potentially leading to concerns about bias and data privacy.

In contrast, open-source models such as Llama 2 or BLOOM are designed to be freely available to researchers and developers. These models are often trained on extensive, publicly documented datasets, allowing for a degree of transparency and collaboration.

The decision between closed- and open-source language models depends on several factors, such as the project's goals, the need for transparency, and more.


About Llama 2

Meta's open-source LLM is called Llama 2. It was trained on 2 trillion tokens from publicly available sources like Wikipedia, Common Crawl, and books from Project Gutenberg. It comes in three parameter sizes: 7 billion, 13 billion, and 70 billion. Each size is available in two flavors: a general completion model and a chat-tuned model. The chat-tuned models, fine-tuned for chatbot-like dialogue, are denoted by the suffix '-chat'. We will use Meta's general 7B Llama 2 Hugging Face model as the base model that we fine-tune. Feel free to use any other version of Llama2-7b.
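For reference, a minimal sketch of the Hugging Face Hub checkpoint names for the Llama 2 family (access to the meta-llama repositories must be requested on the Hub before downloading; the dictionary below is just an illustrative listing):

# Base (general completion) vs. chat-tuned checkpoints on the Hugging Face Hub
llama2_checkpoints = {
    "7b":  {"base": "meta-llama/Llama-2-7b-hf",  "chat": "meta-llama/Llama-2-7b-chat-hf"},
    "13b": {"base": "meta-llama/Llama-2-13b-hf", "chat": "meta-llama/Llama-2-13b-chat-hf"},
    "70b": {"base": "meta-llama/Llama-2-70b-hf", "chat": "meta-llama/Llama-2-70b-chat-hf"},
}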

Also, if you are interested, there are several threads that you can go through to understand how good the Llama family is compared to the GPT family - source, source, source.

About Named Entity Recognition

As a component of information extraction, named-entity recognition (NER) locates and categorizes specific entities within unstructured text by assigning them to pre-defined groups, such as individuals, organizations, locations, measures, and more. NER offers a quick way to grasp the core idea or content of a lengthy text.
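As a quick illustration of what NER output looks like, here is a made-up example (the sentence and labels are hypothetical, chosen only for demonstration):

# Input sentence and the entities an NER system would pull out of it
text = "Tim Cook announced new Apple offices in Bangalore."
entities = {
    "Tim Cook": "PERSON",
    "Apple": "ORGANIZATION",
    "Bangalore": "LOCATION",
}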

There are many ways of extracting entities from a given text. In this blog, we will specifically delve into fine-tuning Llama2-7b using PEFT (Parameter-Efficient Fine-Tuning) techniques in a Colab notebook.

We will transform the SMSSpamCollection classification dataset into an NER dataset. Pretty interesting 😀

We search through all 10-letter words and tag them as 10_WORDS_LONG, and this is the entity that we want our Llama to extract. But why this bizarre formulation? I chose it intentionally to show that this is something the pre-trained model would not have seen during the pre-training stage, so it becomes essential to fine-tune it and make it work for our use case 👍. But we can certainly attach some logic to this formulation - think of these words as probable outliers/noisy words: the longer a word is, the higher the chance of it being noise/OOV. However, you will have to come up with the exact letter count after looking at the word-length distribution. Please note that the code is generic enough for fine-tuning on any number of entities; only the data preparation step needs to change to slice out the relevant entities. A small worked example of this formulation is shown below.
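For instance, a single hypothetical SMS line would be turned into the following input/output pair (the message text is made up; only words with exactly 10 letters end up in the output):

# Hypothetical example of the formulation used in this blog
input_text = "congratulation you have won a guaranteed prize call now"
output_text = {"guaranteed": "10_WORDS_LONG"}   # the only 10-letter word in the message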

Code for Fine-tuning Llama2-7b

# Importing Libraries
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer, LlamaForCausalLM
import torch
from datasets import Dataset
import transformers
import pandas as pd
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_int8_training, get_peft_model_state_dict, PeftModel
from sklearn.utils import shuffle

Data Preparation Phase

# Load the tab-separated SMSSpamCollection file (label \t message)
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
all_text = df[1].str.lower().tolist()

input_texts, output_entities = [], []
for text in all_text:
    input_texts.append(text)
    # tag every word that is exactly 10 characters long
    output_entities.append({word: '10_WORDS_LONG' for word in text.split() if len(word) == 10})

df = pd.DataFrame([input_texts, output_entities]).T
df.rename({0: 'input_text', 1: 'output_text'}, axis=1, inplace=True)
# store the entity dict as a plain string so it serializes cleanly into a Hugging Face Dataset
df['output_text'] = df['output_text'].astype(str)
print(df.head(5))
# Shuffle and split into train/test sets
total_ds = shuffle(df, random_state=42)
total_train_ds = total_ds.head(4000)
total_test_ds = total_ds.tail(1500)

total_train_ds_hf = Dataset.from_pandas(total_train_ds)
total_test_ds_hf = Dataset.from_pandas(total_test_ds)

# generate_and_tokenize_prompt is defined in the fine-tuning section below;
# run these two lines after that definition
tokenized_tr_ds = total_train_ds_hf.map(generate_and_tokenize_prompt)
tokenized_te_ds = total_test_ds_hf.map(generate_and_tokenize_prompt)
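After the map call, each record should contain the tokenized fields alongside the original columns. A quick sanity check (illustrative only):

# Expect input_text, output_text, input_ids, attention_mask, and labels
print(tokenized_tr_ds.column_names)
print(tokenized_tr_ds[0]['input_ids'][:10])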

Fine-tuning Phase

# Loading Model

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
 
 
def create_peft_config(m):
    # LoRA configuration: adapt only the attention query/value projections
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=['q_proj', 'v_proj'],
    )
    m = prepare_model_for_int8_training(m)
    m.enable_input_require_grads()
    m = get_peft_model(m, peft_config)
    m.print_trainable_parameters()
    return m, peft_config

model, lora_config = create_peft_config(model)
 
def generate_prompt(data_point):
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Extract entity from the given input:
### Input:
{data_point["input_text"]}
### Response:
{data_point["output_text"]}"""
 
tokenizer.pad_token_id = 0
def tokenize(prompt, add_eos_token=True):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=128,
        padding=False,
        return_tensors=None,
    )
    # append the EOS token if the sequence was not truncated
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < 128
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    # for causal LM fine-tuning, the labels are the input ids themselves
    result["labels"] = result["input_ids"].copy()
    return result
 
 
 
def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt
 
 
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=4e-05,
    logging_steps=100,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=100,
    save_steps=100,
    output_dir="saved_models/"
)
data_collator = transformers.DataCollatorForSeq2Seq(tokenizer)
trainer = transformers.Trainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_tr_ds,
    eval_dataset=tokenized_te_ds,
    args=training_arguments,
    data_collator=data_collator
)

with torch.autocast("cuda"):
    trainer.train()
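Once training finishes, the LoRA adapter weights need to be saved so that the inference step below can load them. A minimal sketch, assuming the placeholder path "saved_model_path" used in the inference code:

# Persist the LoRA adapter (and tokenizer) for later inference
model.save_pretrained("saved_model_path")
tokenizer.save_pretrained("saved_model_path")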

Inference

loaded_tokenizer = LlamaTokenizer.from_pretrained(model_name)
loaded_model = LlamaForCausalLM.from_pretrained(model_name, load_in_8bit=True, torch_dtype=torch.float16, device_map='auto')
loaded_model = PeftModel.from_pretrained(loaded_model, "saved_model_path", torch_dtype=torch.float16)
loaded_model.config.pad_token_id = loaded_tokenizer.pad_token_id = 0
loaded_model.eval()

def extract_entity(text):
    # Build the same instruction prompt used during training, with an empty response section
    prompt = generate_prompt({"input_text": text, "output_text": ""})
    inp = loaded_tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.no_grad():
        p_ent = loaded_tokenizer.decode(loaded_model.generate(**inp, max_new_tokens=128)[0], skip_special_tokens=True)
        # keep only the text after the 'Response:' marker
        int_idx = p_ent.find('Response:')
        p_ent = p_ent[int_idx + len('Response:'):]
    return p_ent.strip()

extracted_entity = extract_entity(text)  # pass any SMS message string as text
print(extracted_entity)

Conclusion

In this blog post, we covered the process of fine-tuning the Llama2-7b model for the Named Entity Recognition task; for that matter, it could be any task that you are interested in. The core concept to take away from this blog is PEFT-based training of large language models. Additionally, since pre-trained LLMs might not always perform well on your specific task, it is often best to fine-tune them.

Author Bio

Prakhar Mishra has a Master’s in Data Science and over 4 years of industry experience across various sectors such as Retail, Healthcare, and Consumer Analytics. His research interests include Natural Language Understanding and generation, and he has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn.