Introduction to LLaMA

It seems like everyone and their grandmother is discussing Large Language Models (LLMs) these days. These models have been riding a wave of hype since ChatGPT’s release in late 2022. The average user might get lost in acronyms such as GPT, PaLM, or LLaMA, and that’s understandable. This article will shed some light on why you should care about LLMs and what exactly they bring to the table.

By the end of this article, you’ll understand the fundamentals of the LLaMA model and how it compares to other large language models, and you’ll have the 7B flavor of LLaMA running locally on your machine.

There’s no time to waste, so let’s dive straight in!

The Purpose of LLaMA and Other Large Language Models

The main idea behind LLMs is to understand and generate human-like text based on the input you feed into them. Ask a question in plain language and you’ll get a human-like response back. You know what we’re talking about if you’ve ever tried ChatGPT.

These models are typically trained on huge volumes of text, sometimes as much as everything written on the Internet over some time span. The training is unsupervised, meaning no hand-labeled examples are needed: the model learns words and the relationships between them directly from the raw text.

Large Language Models can be generic or domain-specific. You can take a generic LLM and fine-tune it for a certain task, similar to what OpenAI did with Codex (an LLM for programming).

As the end-user, you can benefit from LLMs in several ways:

Content generation – You can use LLMs to generate content for personal or professional purposes, such as articles, emails, social media posts, and so on.
Information retrieval – LLMs help you find relevant information quickly and often do a better job than a traditional web search. Just be aware of the model’s training data cutoff, since it might not do as well on recent events.
Language assistance and translation – These models can detect spelling and grammar mistakes, suggest writing improvements, offer synonyms and idioms, and even provide a meaningful translation from one language to another.

At the end of the day, probably everyone can find a helpful use case in a large language model.

But which one should you choose? There are many publicly available models, but the one that stands out recently is LLaMA. Let’s see why and how it works next.

What Is LLaMA and How Does It Work?

LLaMA stands for “Large Language Model Meta AI” and is a large language model published by – you guessed it – Meta AI. It was released in February 2023 in a variety of flavors, ranging from 7 billion to 65 billion parameters.

A LLaMA model uses the Transformer architecture and works by generating probability distributions over sequences of words (or tokens). In plain English, this means the LLaMA model predicts the most likely next word given the sequence of input words.
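To make the idea concrete, here is a minimal Python sketch of the last step of next-word prediction. The four-word vocabulary and the scores are invented for illustration; a real LLaMA model computes scores for a far larger vocabulary using a Transformer. The sketch only shows how raw scores become a probability distribution and how the most likely token is picked.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates for the word after "The cat sat on the"
vocab = ["mat", "moon", "car", "idea"]
logits = [3.2, 1.1, 0.3, -1.0]  # invented scores, not real model output

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice: highest probability

for token, p in zip(vocab, probs):
    print(f"{token:>5}: {p:.3f}")
print("Predicted next token:", next_token)
```

Always taking the top token is only one strategy. Text generators often sample from the distribution instead, which is one reason the same prompt can produce different answers.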

It’s interesting to point out that LLaMA-13B (13 billion parameters) outperforms GPT-3 on most benchmarks, even though GPT-3 has roughly 13 times as many parameters (175 billion). The largest LLaMA (65B parameters) is on par with the best large language models we have available today, according to the official paper by Meta AI.

In fact, let’s take a look at these performance differences for ourselves. The following table from the official paper summarizes them well:

introduction-to-llama-img-0

Figure 1 - LLaMA performance comparison with other LLMs

Generally speaking, the more parameters the LLaMA model contains, the better it performs. Interestingly, even the 7B version is comparable to, and in some cases outperforms, models with significantly more parameters.

The 7B model performs reasonably well, so how can you try it out? In the next section, you’ll have LLaMA running locally with only two shell commands.

How to Run LLaMA Locally

You’ll need a couple of things to run LLaMA locally – decent hardware (doesn’t have to be the newest), a lot of hard drive space, and a couple of software dependencies installed. It doesn’t matter which operating system you’re using, as the implementation we’re about to show you is cross-platform.

For reference, we ran the 7B parameter model on an M1 Pro MacBook with 16 GB of RAM. The model occupied 31 GB of storage, and you can expect this amount to grow if you choose a LLaMA flavor with more parameters.
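If you want a feel for how size scales with parameter count, the following back-of-the-envelope sketch may help. It assumes the weights are stored as 16-bit numbers (2 bytes per parameter), which is an assumption on our part and not something we measured. The real on-disk footprint can be larger than this estimate, as our 31 GB figure for 7B shows, since it depends on which files and formats the tooling keeps around.

```python
def estimate_weights_gb(n_params, bytes_per_param=2):
    """Rough size of the raw weights in gigabytes (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit storage; this is an illustrative assumption.
    """
    return n_params * bytes_per_param / 1e9

# Parameter counts mentioned in this article
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("65B", 65e9)]:
    size_gb = estimate_weights_gb(n_params)
    print(f"{name}: about {size_gb:.0f} GB of raw weights at 2 bytes per parameter")
```

Treat the numbers as a lower bound for planning purposes, not as a prediction of what the installer will write to disk.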

Regarding software dependencies, you’ll need a recent version of Node. We used version 18.16.0 with npm version 9.5.1.
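Before installing anything, you may want to confirm that your machine meets these requirements. The Python sketch below is one way to do it, under a few assumptions: it treats Node 18 as the minimum only because that is the version we tested, it uses the 31 GB we measured as the storage threshold, and it checks the drive that holds your home directory, which may not be where the model ends up on your system. The helper names are our own.

```python
import shutil
import subprocess
from pathlib import Path

REQUIRED_FREE_GB = 31   # what the 7B install occupied on our machine
MIN_NODE_MAJOR = 18     # assumption: the major version we tested with

def parse_node_major(version_text):
    """Extract the major version from text such as 'v18.16.0'."""
    return int(version_text.strip().lstrip("v").split(".")[0])

def node_version():
    """Return the output of `node --version`, or None if Node isn't on PATH."""
    node_path = shutil.which("node")
    if node_path is None:
        return None
    result = subprocess.run([node_path, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None

def free_disk_gb(path):
    """Free space on the drive holding `path`, in gigabytes (1 GB = 1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

version = node_version()
if version is None:
    print("Node was not found on PATH")
elif parse_node_major(version) < MIN_NODE_MAJOR:
    print(f"Node {version} is older than the version we tested")
else:
    print(f"Node {version} looks fine")

free_gb = free_disk_gb(Path.home())
status = "enough" if free_gb >= REQUIRED_FREE_GB else "probably not enough"
print(f"{free_gb:.0f} GB free: {status} for the 7B model")
```

Adjust `REQUIRED_FREE_GB` upward if you plan to install a flavor with more parameters.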

Once you have Node installed, open up a new Terminal/CMD window and run the following command. It will install the 7B LLaMA model:

npx dalai llama install 7B

You might get a prompt to install dalai first, so just type y into the console. Once Dalai is installed, it will proceed to download the model weights. You should see something similar to the following during this process:

introduction-to-llama-img-1

Figure 2 - Downloading LLaMA 7B model weights

It will take some time, depending on your Internet speed. Once done, you’ll have the 7B model available in the Dalai web UI. Launch it with the following shell command:

npx dalai serve

This is the output you should see:

introduction-to-llama-img-2

Figure 3 - Running dalai web UI locally

The web UI is now running locally on port 3000. As soon as you open http://localhost:3000, you’ll be presented with the interface that allows you to choose the model, tweak the parameters, and select a prompting template.
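If the page doesn’t load, a quick way to tell whether the server is actually listening is to request the URL from a script. The following is a small sketch using only Python’s standard library; the function name is ours, and it only checks that something answers with HTTP 200 at that address, not that the model itself is ready.

```python
import urllib.request

def ui_is_up(url="http://localhost:3000", timeout=3):
    """Return True if an HTTP server answers with status 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections
        # are all subclasses of OSError.
        return False

print("Web UI reachable:", ui_is_up())
```

A result of `False` usually means the serve command is not running, or that it is listening on a different port than the one shown in its output.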

For reference, we’ve selected the chatbot template and left every setting as default.

The prompt we entered was “What is machine learning?” Here’s what the LLaMA model with 7B parameters returned:

introduction-to-llama-img-3

Figure 4 - Dalai user interface

The answer is mostly correct, but the LLaMA response started looking like a blog post toward the end (“In this article…”). As with all large language models, you can use it to draw insights, but only after some human intervention.

And that’s how you can run a large language model locally! Let’s make a brief recap next.

Conclusion

It’s getting easier and cheaper to train large language models, which means the number of options you’ll have is only going to grow over time.

LLaMA was only recently released to the public, and today you’ve learned what it is, gotten a high-level overview of how it works, and seen how to get it running locally. You might want to tweak the settings of the 7B version if you’re not getting the desired response, or opt for a version with more parameters (if your hardware allows it). Either way, have fun!

Author Bio:

Dario Radečić is a Senior Data Scientist at Neos, Croatia. Book author: "Machine Learning Automation with TPOT". Owner of betterdatascience.com. You can follow him on Medium: https://medium.com/@radecicdario