How to use Llama3 with Groq

May 13, 2024

A simple tutorial for running Llama3 on the fastest inference engine on the market, at the best price.

Meta has recently unveiled LLaMA3, the latest iteration of its cutting-edge language model. According to benchmarks, the 70B version is as powerful as GPT-4, or even slightly better. It is also open-source, so anyone can deploy the model and run it.

That is what Groq is doing: running the model on its “LPUs” (Language Processing Units), chips designed specifically to run LLMs!

As a result, the inference rate is more than 10 times higher on Groq. At the time of writing this article, I just got a whopping 310 tokens per second using llama3-70b, while I got 13 tokens per second with GPT-4. So, about 24 times faster. This can be quite useful when you run applications that do a lot of post-processing on large texts, such as my own creation, youtubesummary.com, where several prompts are chained to generate a summary of a YouTube video.
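To make those numbers a bit more tangible, here is a quick back-of-the-envelope calculation. The two throughput figures are the ones I measured above; the workload (5 chained prompts of 400 output tokens each) is purely hypothetical and only there for illustration.

```python
# Throughput figures quoted above, in tokens per second.
groq_tps = 310
gpt4_tps = 13

# Ratio between the two measured rates.
speedup = groq_tps / gpt4_tps
print(f"Speedup: {speedup:.1f}x")  # prints "Speedup: 23.8x"

# Hypothetical workload: 5 chained prompts producing 400 tokens each.
total_tokens = 5 * 400

# Generation time only; network latency and queueing are ignored.
groq_seconds = total_tokens / groq_tps
gpt4_seconds = total_tokens / gpt4_tps
print(f"Groq:  {groq_seconds:.1f} s")
print(f"GPT-4: {gpt4_seconds:.1f} s")
```

For a chain like that, the difference is between waiting a few seconds and waiting a couple of minutes.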

Groq created an HTTP API that you can use to run the model. This way, you don’t have to run it yourself: you hand them the hard work, and they simply answer back using the good old HTTP protocol.

I’m going to show you two ways of calling that API using Python. The first uses “raw” HTTP requests, and the second uses a specialized package.

Here we go!

Level 1 - Raw HTTP requests.

Prerequisites:

You have a Groq account and have created a new project.
You have Python installed on your computer (preferably Python 3.7 or later).
You have the requests library installed in your Python environment. You can install it using pip install requests.

Step 1: Get your API key

Log in to your Groq account and go to your API Keys dashboard.
Click on “Create API Key”
Give it a nice name
Copy the API key

Step 2: Install the required libraries

Open a new terminal or command prompt.
Install the requests library if you haven’t already: pip install requests

Step 3: Set up your Python script

Create a new Python file (e.g., llama3_groq_level1.py) and add the following code:

import requests

# Replace with your API key
api_key = "YOUR_API_KEY_HERE"


def call_llama3(prompt):
    # Set the API endpoint URL
    url = "https://api.groq.com/openai/v1/chat/completions"

    # Set the headers with your API key, and set the content type to JSON.
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Format the payload manually, injecting the prompt inside.
    payload = {
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "model": "llama3-70b-8192"
    }

    # Make the HTTP Call using a POST request, and sending the payload as a JSON body
    response = requests.post(url, headers=headers, json=payload)

    # Process the response
    if response.status_code == 200:
        return response.json()

    # On failure, show the status code and the raw body (it may not be valid JSON)
    print("Error:", response.status_code)
    print("Error:", response.text)
    return None


if __name__ == "__main__":
    response = call_llama3("What is the capital of France?")
    print(response)

Replace YOUR_API_KEY_HERE with your actual API key.
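If you would rather not paste the key into the file (handy if the script ends up in a Git repository), one option is to read it from an environment variable. This is only a sketch: load_api_key is a helper name I made up, and the variable name GROQ_API_KEY is an assumption, so use whatever name you exported in your shell.

```python
import os


def load_api_key(env_var="GROQ_API_KEY"):
    """Return the API key stored in the given environment variable.

    The helper name and the default variable name are illustrative choices.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

In the script above, you would then write api_key = load_api_key() instead of the hard-coded string.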

Step 4: Run the code and see the response!

Running python3 llama3_groq_level1.py will output something like this (pretty-printed here for readability):

{
  "id": "chatcmpl-089...",
  "object": "chat.completion",
  "created": 1715628879,
  "model": "llama3-70b-8192",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "prompt_time": 0.01,
    "completion_tokens": 7,
    "completion_time": 0.023,
    "total_tokens": 24,
    "total_time": 0.033
  },
  "system_fingerprint": "fp_...",
  "x_groq": {
    "id": "req_01..."
  }
}

As you can see, it is raw JSON. It contains lots of info, such as the number of tokens consumed, the time it took to respond, and the timestamp of the request. To find the response from the LLM, we have to dive into the JSON: get the choices list, then the first element of that list, then its message property, and finally its content.

Which gives us:

The capital of France is Paris.
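The walk through the JSON described above can be sketched like this. The sample dictionary is a trimmed copy of the response shown earlier; extract_answer and completion_speed are helper names I made up for the example.

```python
def extract_answer(data):
    """Follow choices -> first element -> message -> content."""
    return data["choices"][0]["message"]["content"]


def completion_speed(data):
    """Tokens generated per second, computed from the usage block."""
    usage = data["usage"]
    return usage["completion_tokens"] / usage["completion_time"]


# Trimmed copy of the JSON response shown above.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {
        "prompt_tokens": 17,
        "completion_tokens": 7,
        "completion_time": 0.023,
        "total_tokens": 24,
    },
}

print(extract_answer(sample))
print(round(completion_speed(sample)), "tokens per second")
```

With the real script, you would pass the dictionary returned by call_llama3 instead of sample (after checking that it is not empty, since the function only returns data on a 200 status).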

Level 2 - Using a specialized package.

As demonstrated, we can use basic Python and HTTP requests to interact with Groq’s inference API. However, this approach means formatting the payload by hand, which requires knowing the message structure, and parsing the response requires knowing the shape of the returned JSON just as well. Groq’s API is designed to be compatible with OpenAI’s (an easy switch if you move from OpenAI 😈), and the community is becoming increasingly familiar with its structure, but the underlying JSON-based API is still a raw, unabstracted interface.

Step 1: Install the required dependency: groq!
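To give an idea of what “formatting the payload by hand” looks like once you go beyond a single user message, here is a small sketch. build_payload is a hypothetical helper, and the optional "system" message assumes the API accepts that role the same way OpenAI’s chat format does.

```python
import json


def build_payload(prompt, system=None, model="llama3-70b-8192"):
    """Assemble the JSON body for the chat completions endpoint.

    The "system" role is an assumption based on the OpenAI-style chat format.
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages}


payload = build_payload(
    "What is the capital of France?",
    system="Answer in one short sentence.",
)
print(json.dumps(payload, indent=2))
```

Every key and role name here has to be exactly right, which is the kind of detail a dedicated package takes care of for you.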

Yeah, no surprise, Groq made their own API wrapper in Python.
pip install groq

Step 2: Set up your Python script

Create a new Python file (e.g., llama3_groq_level2.py) and add the following code:

from groq import Groq

# Replace with your API key
api_key = "YOUR_API_KEY_HERE"


def call_llama3(prompt):
    client = Groq(api_key=api_key)

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3-70b-8192",
    )

    return chat_completion


if __name__ == "__main__":
    response = call_llama3("What is the capital of France?")
    print(response)

You should spot some similarities with the previous code. Some of the HTTP fluff has been absorbed by the library, which leaves the function only focused on one thing: sending that prompt to the API.

Step 3: Execute the code

Same here, a simple Python command: python3 llama3_groq_level2.py

The response here is:

ChatCompletion(choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChoiceMessage(content='The capital of France is Paris!', role='assistant', tool_calls=None))], id='chatcmpl-20dc6941-434c-464e-a895-04c053d25c68', created=1715629847, model='llama3-70b-8192', object='chat.completion', system_fingerprint='fp_abd29e8833', usage=Usage(completion_time=0.022, completion_tokens=7, prompt_time=0.009, prompt_tokens=17, queue_time=None, total_time=0.031, total_tokens=24), x_groq={'id': 'req_01hxsr3wswec7vjd12z1dna4c0'})

So, not so different from the previous JSON one, eh?

Yeah, except that now it’s an object, which I find a bit easier to walk through, even though it’s very similar to the JSON one.

Here’s the code to extract the response from the LLM to our prompt:

print(response.choices[0].message.content)

Which gives us:

The capital of France is Paris!
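If all you ever want is the text of the answer, the call and the extraction can be folded into one small function. This is a sketch, not something the groq package ships: ask and make_fake_client are names I made up, and the fake client only exists so the example runs without an API key or a network connection.

```python
from types import SimpleNamespace


def ask(client, prompt, model="llama3-70b-8192"):
    """Send one user prompt and return only the text of the answer."""
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
    )
    return completion.choices[0].message.content


def make_fake_client(reply):
    """Offline stand-in that mimics the attributes used by ask()."""

    def create(messages, model):
        message = SimpleNamespace(role="assistant", content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(index=0, message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


fake = make_fake_client("The capital of France is Paris!")
print(ask(fake, "What is the capital of France?"))
```

With the real package, you would pass Groq(api_key=api_key) as the client instead of the fake one.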

You’ve seen two ways of using the Groq inference API

The goal of this article was to show you how easy it is to run a prompt using Groq’s inference API. If you want to go further and see what you can do using Groq and Llama3, check out the examples here: https://console.groq.com/docs/examples

Don’t hesitate to share this tutorial with fellow developers, AI enthusiasts, or anyone interested in exploring the capabilities of LLaMA3 and Groq!