Drop-In Replacement for GPT with Llama 2 for OpenAI API

Author: Roanak Baviskar
In our last blog post, we showed how to deploy a FastChat OpenAI API endpoint. Here, we'll show how to send requests to that endpoint from Python using the OpenAI, LlamaIndex, and LangChain client libraries.
Setup
First, create the serving endpoint. This can be done with a single llm-atc serve command. Afterwards, you'll need the public IP address assigned to the endpoint in order to send it requests.
# Llama 2 7B can be served on a single V100 on AWS
$ llm-atc serve --name meta-llama/Llama-2-7b-chat-hf --accelerator V100:1 -c servecluster --cloud aws --region us-east-2 --envs "HF_TOKEN=<HuggingFace_token>"
# Get the ip address of the created endpoint
$ grep -A1 "Host servecluster" ~/.ssh/config | grep "HostName" | awk '{print $2}'
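Before pointing any client at the server, it can help to confirm the endpoint is reachable. FastChat's OpenAI-compatible server exposes the standard model-listing route, so a plain curl works (the IP below is a placeholder for the address printed by the grep command above):

```shell
# Placeholder: substitute the IP address retrieved above.
ENDPOINT_IP="<YOUR ENDPOINT IP>"

# A healthy endpoint returns a JSON model list that should
# include the model you served (Llama-2-7b-chat-hf).
curl "http://${ENDPOINT_IP}:8000/v1/models"
```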
OpenAI Python API
import openai
# to get proper authentication, make sure to use a valid key that's listed in
# the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
openai.api_key = "EMPTY"
openai.api_base = "http://<YOUR ENDPOINT IP>:8000/v1"
model = "Llama-2-7b-chat-hf"
prompt = "Once upon a time"
# create a completion
completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)
# create a chat completion
completion = openai.ChatCompletion.create(
    model=model, messages=[{"role": "user", "content": "Hello! Who are you?"}]
)
# print the completion
print(completion.choices[0].message.content)
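The endpoint also supports token streaming through the standard stream=True flag of the legacy openai 0.x client used above. A minimal sketch, assuming the same endpoint IP and model name as in the previous snippet:

```python
import openai

openai.api_key = "EMPTY"
openai.api_base = "http://<YOUR ENDPOINT IP>:8000/v1"

# Stream the chat completion token by token instead of
# waiting for the full response.
stream = openai.ChatCompletion.create(
    model="Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; the "content" key is absent
    # on the initial role chunk and the final stop chunk.
    print(chunk.choices[0].delta.get("content", ""), end="", flush=True)
print()
```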
LlamaIndex
import openai
from llama_index.llms import ChatMessage, OpenAI
from llama_index.llms.base import LLMMetadata
FASTCHAT_IP = "<YOUR ENDPOINT IP>"
openai.api_base = f"http://{FASTCHAT_IP}:8000/v1"
openai.api_key = "EMPTY"
class FastChatLlama2(OpenAI):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=4000,
            num_output=self.max_tokens or -1,
            is_chat_model=self._is_chat_model,
            is_function_calling_model=False,
            model_name=self.model,
        )
messages = [
    ChatMessage(role="system", content="You are a pirate with a colorful personality"),
    ChatMessage(role="user", content="What is your name"),
]
resp = FastChatLlama2(model="Llama-2-7b-chat-hf", max_tokens=4000).chat(messages)
print(resp)
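The same subclass also handles plain text completion, since FastChatLlama2 inherits the generic complete() method from LlamaIndex's OpenAI wrapper. A sketch, assuming the endpoint configuration and class definition from the snippet above:

```python
import openai

# Same endpoint configuration as above.
openai.api_base = "http://<YOUR ENDPOINT IP>:8000/v1"
openai.api_key = "EMPTY"

# FastChatLlama2 is the subclass defined in the previous snippet;
# max_tokens here is an arbitrary illustrative value.
llm = FastChatLlama2(model="Llama-2-7b-chat-hf", max_tokens=256)
resp = llm.complete("Once upon a time")
print(resp)
```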
LangChain
import os
FASTCHAT_IP = "<YOUR ENDPOINT IP>"
os.environ["OPENAI_API_BASE"] = f"http://{FASTCHAT_IP}:8000/v1"
os.environ["OPENAI_API_KEY"] = "EMPTY"
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
chat = ChatOpenAI(model="Llama-2-7b-chat-hf")
messages = [
    SystemMessage(content="You are a pirate with a colorful personality"),
    HumanMessage(content="What is your name"),
]
print(chat(messages))
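Because ChatOpenAI only needs the base URL and key overrides, the same instance plugs into the rest of LangChain as usual. For example, it can be combined with a prompt template in an LLMChain (a sketch, assuming the environment variables set above):

```python
import os

# Same endpoint configuration as above.
os.environ["OPENAI_API_BASE"] = "http://<YOUR ENDPOINT IP>:8000/v1"
os.environ["OPENAI_API_KEY"] = "EMPTY"

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate

chat = ChatOpenAI(model="Llama-2-7b-chat-hf")

# Template with a {question} slot filled in at run time.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a pirate with a colorful personality"),
    ("human", "{question}"),
])
chain = LLMChain(llm=chat, prompt=prompt)
print(chain.run(question="What is your name"))
```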