How Does the OpenAI Standard Work?
The OpenAI API standard defines a simple and powerful way for applications to communicate with language models, whether running in the cloud or locally.
It uses a multi-role chat format, which includes three types of messages:
| Role | Description |
|---|---|
| System | Gives high-level instructions to the model. For example: "You are a helpful assistant," or "You are a coding tutor that only explains using examples." It sets the tone, rules, or available tools for the model. |
| User | Messages from the application to the model. These often come from the end user (e.g., a typed prompt like "Explain black holes."). |
| Assistant | Responses generated by the model and returned to the application. These are the answers to the user prompts. |
This structure makes it easy to build multi-turn conversations with consistent behavior.
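As a concrete illustration, here is what a single request body looks like in this format (the model name is just a placeholder for whichever model your server exposes):

# A minimal chat payload in the OpenAI multi-role format.
# The model name below is only a placeholder.
payload = {
    "model": "llama3.2:1B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},  # rules and tone
        {"role": "user", "content": "Explain black holes."},            # end-user prompt
        # The server replies with an "assistant" message, which you append
        # to this list before sending the next user turn.
    ],
}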
Developer Support
OpenAI provides official libraries in multiple programming languages to help developers follow the standard easily: Python, JavaScript, .NET, Java, and Go.
These libraries make it easy to send prompts, receive completions, and integrate with local or cloud-based OpenAI-compatible servers.
Quick Test: Use OpenAI SDK with FastFlowLM in Python
You can try this instantly in any Python environment, including Jupyter Notebook. Follow the steps below by copying each block into a notebook cell.
Step 0: Start FastFlowLM in Server Mode
Open PowerShell or a terminal and launch the model server:
flm serve llama3.2:1B
This loads the model and starts the FastFlowLM OpenAI-compatible API at http://localhost:11434/v1.
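Optionally, you can check that the server is reachable before installing anything. This is a minimal sketch using only the Python standard library; it assumes FastFlowLM exposes the standard /v1/models route (as most OpenAI-compatible servers do), which is worth verifying on your setup:

# Optional sanity check: confirm the server is up and answering.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/v1/models", timeout=5) as resp:
        print(json.dumps(json.load(resp), indent=2))
except OSError as err:
    print(f"Server not reachable yet: {err}")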
Step 1: Install the OpenAI Python SDK
!pip install --upgrade openai
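To confirm the install, import the package and print its version (any recent 1.x release of the SDK should work with the examples below):

import openai

print(openai.__version__)  # expect a 1.x version string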
Step 2: Send a Chat Request to FastFlowLM
# Quick Start
from openai import OpenAI

# Connect to the local FastFlowLM server
client = OpenAI(
    base_url="http://localhost:11434/v1",  # FastFlowLM's local API endpoint
    api_key="flm"                          # Dummy key (FastFlowLM doesn't require authentication)
)

# Send a chat-style prompt using the OpenAI API format
response = client.chat.completions.create(
    model="llama3.2:1B",  # Replace with any model you've launched with `flm serve`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"}
    ]
)

# Show the model's response
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
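Beyond the message text, a chat completion response carries metadata such as the finish reason and token usage. Whether FastFlowLM populates every field is an assumption worth checking, so this sketch guards against missing values:

# Inspect standard response metadata (availability may vary by server)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)

print("model:", response.model)
print("finish reason:", response.choices[0].finish_reason)
if response.usage is not None:
    print("prompt tokens:", response.usage.prompt_tokens)
    print("completion tokens:", response.usage.completion_tokens)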
Notes
- You can replace `"llama3.2:1B"` with any other model available via `flm run` or `flm pull` (a sketch for querying the server's model list follows this list).
- Make sure the FastFlowLM server is running in the background (`flm serve ...`).
- No real API key is needed; just pass `"flm"` as a placeholder.
- FastFlowLM runs fully offline and is optimized for AMD Ryzen AI NPUs.
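If FastFlowLM implements the standard /v1/models listing (many OpenAI-compatible servers do; treat this as an assumption), you can query it to see which model names are valid here. A minimal sketch:

# List the models the server reports
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
for model in client.models.list():
    print(model.id)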
This setup is perfect for quick offline LLM testing using standard OpenAI tooling.
More Examples
Ah, that was easy, right?
Now let's kick things up a notch with some next-level examples!
Example: Multi-turn Chat (Conversation History)
Use this pattern when you want the model to remember previous turns in the conversation:
# Multi-turn conversation
from openai import OpenAI

messages = [
    {"role": "system", "content": "You are a creative writing assistant."},
    {"role": "user", "content": "Write the beginning of a fantasy story."},
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
print(response.choices[0].message.content)

# Add the assistant response and continue the conversation
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "Continue the story with a twist."})

response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
The OpenAI API (and FastFlowLM server mode) is stateless: you must resend the full conversation on every request, and no KV cache is kept between turns.
This means all previous messages are reprocessed (prefill), which adds latency for long chats; a small wrapper that manages the history for you is sketched below.
FastFlowLM's CLI mode uses a real KV cache, making multi-turn responses much faster, especially in long conversations.
FastFlowLM is optimized for long sequences with large KV caches, ideal for 32k-128k context windows.
We're working on adding a stateful KV cache to server mode. Stay tuned!
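Because the server is stateless, a small wrapper that stores the history and resends it on every turn keeps application code tidy. This is only an illustrative sketch; the Conversation class and its ask method are not part of FastFlowLM or the OpenAI SDK:

from openai import OpenAI


class Conversation:
    """Keeps the full message history and resends it on every turn."""

    def __init__(self, model, system_prompt, base_url="http://localhost:11434/v1"):
        self.client = OpenAI(base_url=base_url, api_key="flm")
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model=self.model, messages=self.messages
        )
        reply = response.choices[0].message.content
        # Store the assistant turn so the next call resends the whole history.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


chat = Conversation("llama3.2:1B", "You are a creative writing assistant.")
print(chat.ask("Write the beginning of a fantasy story."))
print(chat.ask("Continue the story with a twist."))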
Example: Streamed Output (Real-Time Response)
Display the model's output as it is generated, token by token:
# Streaming
from openai import OpenAI
import gc, sys

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="flm")

stream = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a fast, concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}
    ],
    stream=True
)

print("=== STREAMING BEGIN ===")
try:
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print("\n=== STREAMING END ===")
except Exception as e:
    sys.stderr.write(f"\n[STREAM ERROR] {e}\n")

# Cleanup
del stream, client
gc.collect()
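If you also need the complete text after streaming (for example, to append it to a conversation history), collect the chunks as you print them. A small variation on the block above:

# Stream and accumulate the full response text
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
stream = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}],
    stream=True,
)

parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        parts.append(delta)
        print(delta, end="", flush=True)

full_text = "".join(parts)  # reusable later, e.g. as an "assistant" message
print(f"\n\n({len(full_text)} characters streamed)")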
Example: Use a File as the Prompt
You can load a full `.txt` file as a prompt, which is useful for long documents or testing large context windows.
Download the sample prompt
Download it to your Downloads folder. The file contains over 38k tokens, so prompt processing (prefill) may take longer. FastFlowLM supports the full context length (32k-128k), making it ideal for processing long documents like this one.
# Use a text file as the prompt
from openai import OpenAI

with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    user_prompt = f.read()

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a precise assistant."},
        {"role": "user", "content": user_prompt}
    ]
)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
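Before prompting with a very long file, it can help to sanity-check its size against the context window. This guide does not include an official token counter, so the sketch below uses a rough 4-characters-per-token heuristic, which is only an approximation:

# Rough size check before prompting with a long document.
# The 4-characters-per-token ratio is a common rule of thumb, not exact.
with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    text = f.read()

approx_tokens = len(text) // 4
context_window = 128_000  # upper end of the 32k-128k range mentioned above

print(f"~{approx_tokens:,} tokens (rough estimate)")
if approx_tokens > context_window:
    print("Warning: this may exceed the model's context window.")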
Example: Batch Requests (Multiple Prompts)
Loop over a list of prompts and generate answers, which is useful for evaluations or bulk testing.
# Batched prompts
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
    "What are the key themes in 'To Kill a Mockingbird'?",
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

for prompt in prompts:
    response = client.chat.completions.create(
        model="llama3.2:1B",
        messages=[
            {"role": "system", "content": "You are a concise academic tutor."},
            {"role": "user", "content": prompt}
        ]
    )
    print(f"\nPrompt: {prompt}\nAnswer: {response.choices[0].message.content}")

# Cleanup
del response, client
import gc
gc.collect()
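For evaluation runs it is often useful to persist each prompt/answer pair instead of only printing them. A minimal sketch that writes results to a JSONL file (the filename is arbitrary):

# Save batch results to a JSONL file for later inspection
import json
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

with open("batch_results.jsonl", "w", encoding="utf-8") as out:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="llama3.2:1B",
            messages=[
                {"role": "system", "content": "You are a concise academic tutor."},
                {"role": "user", "content": prompt},
            ],
        )
        record = {"prompt": prompt, "answer": response.choices[0].message.content}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")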
Example: Use Temperature, Top-p, and Presence Penalty
Control randomness and creativity for brainstorming or open-ended tasks.
# Change hyperparameters
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a creative brainstorming partner."},
        {"role": "user", "content": "Give me 5 startup ideas that combine AI and education."}
    ],
    temperature=0.9,       # More randomness
    top_p=0.95,            # Nucleus sampling
    presence_penalty=0.5,  # Encourage novelty
)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
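To get a feel for these knobs, run the same prompt at a low and a high temperature and compare the outputs. A quick sketch:

# Compare low vs. high temperature on the same prompt
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

for temperature in (0.1, 1.0):
    response = client.chat.completions.create(
        model="llama3.2:1B",
        messages=[{"role": "user", "content": "Name a startup idea combining AI and education."}],
        temperature=temperature,
    )
    print(f"\n--- temperature={temperature} ---")
    print(response.choices[0].message.content)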