How Does the OpenAI Standard Work?
The OpenAI API standard defines a simple and powerful way for applications to communicate with language models, whether running in the cloud or locally.
It uses a multi-role chat format, which includes three types of messages:
| Role | Description |
|---|---|
| System | Gives high-level instructions to the model. For example: "You are a helpful assistant," or "You are a coding tutor that only explains using examples." It sets the tone, rules, or available tools for the model. |
| User | Messages from the application to the model. These often come from the end user (e.g., a typed prompt like "Explain black holes."). |
| Assistant | Responses generated by the model and returned to the application. These are the answers to the user prompts. |
This structure makes it easy to build multi-turn conversations with consistent behavior.
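As a concrete illustration, here is what a single request body looks like in this format (the model name is just a placeholder for whichever model your server exposes):

# A minimal chat payload in the OpenAI multi-role format.
# The model name below is only a placeholder.
payload = {
    "model": "llama3.2:1B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},  # rules and tone
        {"role": "user", "content": "Explain black holes."},            # end-user prompt
        # The server replies with an "assistant" message, which you append
        # to this list before sending the next user turn.
    ],
}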
Developer Support
OpenAI provides official libraries in multiple programming languages to help developers follow the standard easily: Python, JavaScript, .NET, Java, and Go.
These libraries make it easy to send prompts, receive completions, and integrate with local or cloud-based OpenAI-compatible servers.
Quick Test: Use OpenAI SDK with FastFlowLM in Python
You can try this instantly in any Python environment, including Jupyter Notebook. Follow the steps below by copying each block into a notebook cell.
Step 0: Start FastFlowLM in Server Mode
Open PowerShell or a terminal and launch the model server:
flm serve llama3.2:1B
This loads the model and starts the FastFlowLM OpenAI-compatible API at http://localhost:11434/v1.
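Optionally, you can check that the server is reachable before installing anything. This is a minimal sketch using only the Python standard library; it assumes FastFlowLM exposes the standard /v1/models route (as most OpenAI-compatible servers do), which is worth verifying on your setup:

# Optional sanity check: confirm the server is up and answering.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/v1/models", timeout=5) as resp:
        print(json.dumps(json.load(resp), indent=2))
except OSError as err:
    print(f"Server not reachable yet: {err}")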
Step 1: Install the OpenAI Python SDK
!pip install --upgrade openai
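To confirm the install, import the package and print its version (any recent 1.x release of the SDK should work with the examples below):

import openai

print(openai.__version__)  # expect a 1.x version string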
Step 2: Send a Chat Request to FastFlowLM
# Quick Start
from openai import OpenAI

# Connect to the local FastFlowLM server
client = OpenAI(
    base_url="http://localhost:11434/v1",  # FastFlowLM's local API endpoint
    api_key="flm"                          # Dummy key (FastFlowLM doesn't require authentication)
)

# Send a chat-style prompt using the OpenAI API format
response = client.chat.completions.create(
    model="llama3.2:1B",  # Replace with any model you've launched with `flm serve`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"}
    ]
)

# Show the model's response
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
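Beyond the message text, a chat completion response carries metadata such as the finish reason and token usage. Whether FastFlowLM populates every field is an assumption worth checking, so this sketch guards against missing values:

# Inspect standard response metadata (availability may vary by server)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)

print("model:", response.model)
print("finish reason:", response.choices[0].finish_reason)
if response.usage is not None:
    print("prompt tokens:", response.usage.prompt_tokens)
    print("completion tokens:", response.usage.completion_tokens)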
Notes
- You can replace `"llama3.2:1B"` with any other model available via `flm run` or `flm pull` (a sketch for querying the server's model list follows this list).
- Make sure the FastFlowLM server is running in the background (`flm serve ...`).
- No real API key is needed; just pass `"flm"` as a placeholder.
- FastFlowLM runs fully offline and is optimized for AMD Ryzen AI NPUs.
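If FastFlowLM implements the standard /v1/models listing (many OpenAI-compatible servers do; treat this as an assumption), you can query it to see which model names are valid here. A minimal sketch:

# List the models the server reports
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
for model in client.models.list():
    print(model.id)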
This setup is perfect for quick offline LLM testing using standard OpenAI tooling.
More Examples
Ah, that was easy, right?
Now let's kick things up a notch with some next-level examples!
Example: Multi-turn Chat (Conversation History)
Use this pattern when you want the model to remember previous turns in the conversation:
# Multi-turn conversation
from openai import OpenAI

messages = [
    {"role": "system", "content": "You are a creative writing assistant."},
    {"role": "user", "content": "Write the beginning of a fantasy story."},
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
print(response.choices[0].message.content)

# Add the assistant response and continue the conversation
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "Continue the story with a twist."})

response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
The OpenAI API (and FastFlowLM server mode) is stateless: you must resend the full conversation on every request, and no KV cache is kept between turns.
This means all previous messages are reprocessed (prefill), which adds latency for long chats; a small wrapper that manages the history for you is sketched below.
FastFlowLM's CLI mode uses a real KV cache, making multi-turn responses much faster, especially in long conversations.
FastFlowLM is optimized for long sequences with large KV caches, ideal for 32k-128k context windows.
We're working on adding a stateful KV cache to server mode. Stay tuned!
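Because the server is stateless, a small wrapper that stores the history and resends it on every turn keeps application code tidy. This is only an illustrative sketch; the Conversation class and its ask method are not part of FastFlowLM or the OpenAI SDK:

from openai import OpenAI


class Conversation:
    """Keeps the full message history and resends it on every turn."""

    def __init__(self, model, system_prompt, base_url="http://localhost:11434/v1"):
        self.client = OpenAI(base_url=base_url, api_key="flm")
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model=self.model, messages=self.messages
        )
        reply = response.choices[0].message.content
        # Store the assistant turn so the next call resends the whole history.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


chat = Conversation("llama3.2:1B", "You are a creative writing assistant.")
print(chat.ask("Write the beginning of a fantasy story."))
print(chat.ask("Continue the story with a twist."))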
Example: Streamed Output (Real-Time Response)
Display the model's output as it is generated, token by token:
# Streaming
from openai import OpenAI
import gc, sys

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="flm")

stream = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a fast, concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}
    ],
    stream=True
)

print("=== STREAMING BEGIN ===")
try:
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print("\n=== STREAMING END ===")
except Exception as e:
    sys.stderr.write(f"\n[STREAM ERROR] {e}\n")

# Cleanup
del stream, client
gc.collect()
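If you also need the complete text after streaming (for example, to append it to a conversation history), collect the chunks as you print them. A small variation on the block above:

# Stream and accumulate the full response text
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
stream = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}],
    stream=True,
)

parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        parts.append(delta)
        print(delta, end="", flush=True)

full_text = "".join(parts)  # reusable later, e.g. as an "assistant" message
print(f"\n\n({len(full_text)} characters streamed)")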
Example: Use a File as the Prompt
You can load a full `.txt` file as a prompt, which is useful for long documents or testing large context windows.
Download the sample prompt
Download it to your Downloads folder. The file contains over 38k tokens, so prompt processing (prefill) may take longer. FastFlowLM supports the full context length (32k-128k), making it ideal for processing long documents like this one.
# Use a text file as the prompt
from openai import OpenAI

with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    user_prompt = f.read()

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a precise assistant."},
        {"role": "user", "content": user_prompt}
    ]
)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
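Before prompting with a very long file, it can help to sanity-check its size against the context window. This guide does not include an official token counter, so the sketch below uses a rough 4-characters-per-token heuristic, which is only an approximation:

# Rough size check before prompting with a long document.
# The 4-characters-per-token ratio is a common rule of thumb, not exact.
with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    text = f.read()

approx_tokens = len(text) // 4
context_window = 128_000  # upper end of the 32k-128k range mentioned above

print(f"~{approx_tokens:,} tokens (rough estimate)")
if approx_tokens > context_window:
    print("Warning: this may exceed the model's context window.")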
Example: Batch Requests (Multiple Prompts)
Loop over a list of prompts and generate answers, which is useful for evaluations or bulk testing.
# Batched prompts
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
    "What are the key themes in 'To Kill a Mockingbird'?",
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

for prompt in prompts:
    response = client.chat.completions.create(
        model="llama3.2:1B",
        messages=[
            {"role": "system", "content": "You are a concise academic tutor."},
            {"role": "user", "content": prompt}
        ]
    )
    print(f"\nPrompt: {prompt}\nAnswer: {response.choices[0].message.content}")

# Cleanup
del response, client
import gc
gc.collect()
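For evaluation runs it is often useful to persist each prompt/answer pair instead of only printing them. A minimal sketch that writes results to a JSONL file (the filename is arbitrary):

# Save batch results to a JSONL file for later inspection
import json
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

with open("batch_results.jsonl", "w", encoding="utf-8") as out:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="llama3.2:1B",
            messages=[
                {"role": "system", "content": "You are a concise academic tutor."},
                {"role": "user", "content": prompt},
            ],
        )
        record = {"prompt": prompt, "answer": response.choices[0].message.content}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")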
Example: Use Temperature, Top-p, and Presence Penalty
Control randomness and creativity for brainstorming or open-ended tasks.
# Change hyperparameters
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a creative brainstorming partner."},
        {"role": "user", "content": "Give me 5 startup ideas that combine AI and education."}
    ],
    temperature=0.9,       # More randomness
    top_p=0.95,            # Nucleus sampling
    presence_penalty=0.5,  # Encourage novelty
)
print(response.choices[0].message.content)

# Cleanup
del response, client
import gc
gc.collect()
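To get a feel for these knobs, run the same prompt at a low and a high temperature and compare the outputs. A quick sketch:

# Compare low vs. high temperature on the same prompt
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

for temperature in (0.1, 1.0):
    response = client.chat.completions.create(
        model="llama3.2:1B",
        messages=[{"role": "user", "content": "Name a startup idea combining AI and education."}],
        temperature=temperature,
    )
    print(f"\n--- temperature={temperature} ---")
    print(response.choices[0].message.content)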