📡 How Does the OpenAI Standard Work?

The OpenAI API standard defines a simple and powerful way for applications to communicate with language models, whether running in the cloud or locally.

It uses a multi-role chat format, which includes three types of messages:

  • 🛠 System: Used to give high-level instructions to the model. For example: "You are a helpful assistant," or "You are a coding tutor that only explains using examples." It sets the tone, rules, or available tools for the model.
  • 👤 User: Messages from the application to the model. These often come from the end user (e.g., a typed prompt like "Explain black holes.").
  • 🤖 Assistant: Responses generated by the model and returned to the application. These are answers to the user prompts.

This structure makes it easy to build multi-turn conversations with consistent behavior.
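
In practice, a conversation is just an ordered list of these role-tagged messages. The short sketch below shows that structure as a plain Python list; the message wording is purely illustrative.

# A chat request carries an ordered list of role-tagged messages (illustrative content)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},      # sets tone and rules
    {"role": "user", "content": "Explain black holes."},                # prompt from the end user
    {"role": "assistant", "content": "A black hole is a region of..."}, # earlier model reply
    {"role": "user", "content": "Now explain it to a five-year-old."}   # follow-up turn
]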


📚 Developer Support

OpenAI provides official libraries in multiple programming languages to help developers follow the standard easily: Python, JavaScript, .NET, Java, and Go.

These libraries make it easy to send prompts, receive completions, and integrate with local or cloud-based OpenAI-compatible servers.
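
Because every compatible server speaks the same protocol, the same client code can target either the hosted OpenAI API or a local server; only the base_url and api_key change. A minimal sketch (the local endpoint shown matches the FastFlowLM setup used later on this page):

# Same SDK, different endpoints
from openai import OpenAI

# Hosted OpenAI API: the SDK reads OPENAI_API_KEY from the environment by default
# cloud_client = OpenAI()

# Local OpenAI-compatible server (e.g., FastFlowLM): just point base_url at it
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")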


🚀 Quick Test: Use the OpenAI SDK with FastFlowLM in Python

You can try this instantly in any Python environment, including Jupyter Notebook. Follow the steps below by copying each block into a notebook cell.


✅ Step 0: Start FastFlowLM in Server Mode

Open PowerShell or a terminal and launch the model server:

flm serve llama3.2:1B

🧠 This loads the model and starts the FastFlowLM OpenAI-compatible API at http://localhost:11434/v1.


✅ Step 1: Install the OpenAI Python SDK

!pip install --upgrade openai

✅ Step 2: Send a Chat Request to FastFlowLM

# Quick Start
from openai import OpenAI

# Connect to local FastFlowLM server
client = OpenAI(
    base_url="http://localhost:11434/v1",  # FastFlowLM's local API endpoint
    api_key="flm"  # Dummy key (FastFlowLM doesnโ€™t require authentication)
)

# Send a chat-style prompt using OpenAI API format
response = client.chat.completions.create(
    model="llama3.2:1B",  # Replace with any model you've launched with `flm serve`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"}
    ]
)

# Show the model's response
print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

📌 Notes

  • 🧠 You can replace "llama3.2:1B" with any other model available via flm run or flm pull.
  • 🖥 Make sure the FastFlowLM server is running in the background (flm serve ...); a quick way to check is shown below.
  • 🔒 No real API key is needed; just pass "flm" as a placeholder.
  • ⚡ FastFlowLM runs fully offline and is optimized for AMD Ryzen AI NPUs.

✅ This setup is perfect for quick offline LLM testing using standard OpenAI tooling.
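
If you want to confirm the server is reachable before sending any prompts, you can list the models it exposes through the same endpoint. This is a minimal sketch and assumes the server implements the standard /v1/models route, as most OpenAI-compatible servers do:

# Optional sanity check: list the models served at the local endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
for model in client.models.list():
    print(model.id)  # e.g., the model started with `flm serve`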


🧪 More Examples

🚀 Ah, that was easy, right?
Now let's kick things up a notch with some awesome next-level examples!


💬 Example: Multi-turn Chat (Conversation History)

Use this pattern when you want the model to remember previous turns in the conversation:

# Multi-turn conversation
from openai import OpenAI

messages = [
    {"role": "system", "content": "You are a creative writing assistant."},
    {"role": "user", "content": "Write the beginning of a fantasy story."},
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
print(response.choices[0].message.content)

# Add the assistant response and continue the conversation
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "Continue the story with a twist."})
response = client.chat.completions.create(model="llama3.2:1B", messages=messages)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

โš ๏ธ The OpenAI API (and FastFlowLM server mode) is stateless โ€” you must resend the full conversation each time. No KV cache is kept between turns.

๐ŸŒ€ This means all previous messages are reprocessed (prefill), which adds latency for long chats.

โšก FastFlowLMโ€™s CLI mode uses a real KV cache, making multi-turn responses much faster โ€” especially with long conversations.

๐Ÿง  FastFlowLM is optimized for long sequences with large KV caches, ideal for 32kโ€“128k context windows.

🔧 We're working on adding a stateful KV cache to server mode. Stay tuned!
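
You can observe the prefill cost yourself by timing each turn of a growing conversation. The sketch below is only a rough illustration using Python's time.perf_counter; it assumes the server and model from the earlier steps, and the absolute numbers depend on your hardware.

# Rough latency check: every turn resends the full history, so prefill work grows
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")
messages = [{"role": "system", "content": "You are a terse assistant."}]

questions = ["Name a prime number.", "Name another one.", "And one more."]
for i, question in enumerate(questions, start=1):
    messages.append({"role": "user", "content": question})
    start = time.perf_counter()
    response = client.chat.completions.create(model="llama3.2:1B", messages=messages)
    elapsed = time.perf_counter() - start
    print(f"Turn {i}: {elapsed:.2f}s with {len(messages)} messages in the request")
    messages.append({"role": "assistant", "content": response.choices[0].message.content})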


๐Ÿ” Example: Streamed Output (Real-Time Response)

Display the model's output as it is generated, token by token:

# Streaming
from openai import OpenAI
import gc, sys

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

stream = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a fast, concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}
    ],
    stream=True
)

print("=== STREAMING BEGIN ===")
try:
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print("\n=== STREAMING END ===")
except Exception as e:
    sys.stderr.write(f"\n[STREAM ERROR] {e}\n")

# cleanup
del stream, client
gc.collect()

📄 Example: Use a File as the Prompt

You can load a full .txt file as a prompt, useful for long documents or testing large context windows.

👉 Download the sample prompt

Save it to your Downloads folder. The file contains over 38k tokens, so the prompt may take longer to process. FastFlowLM supports the full context length (32k–128k), making it ideal for processing long documents like this.

# Use a text file to prompt
from openai import OpenAI

with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    user_prompt = f.read()

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a precise assistant."},
        {"role": "user", "content": user_prompt}
    ]
)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

📊 Example: Batch Requests (Multiple Prompts)

Loop over a list of prompts and generate answers, useful for evaluation or bulk testing.

# Batched prompts
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
    "What are the key themes in โ€˜To Kill a Mockingbirdโ€™?",
]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

for prompt in prompts:
    response = client.chat.completions.create(
        model="llama3.2:1B",
        messages=[
            {"role": "system", "content": "You are a concise academic tutor."},
            {"role": "user", "content": prompt}
        ]
    )
    print(f"\n๐Ÿ“ Prompt: {prompt}\n๐Ÿ” Answer: {response.choices[0].message.content}")

# cleanup
del response, client
import gc
gc.collect()

🧬 Example: Use Temperature, Top-p, and Presence Penalty

Control randomness and creativity for brainstorming or open-ended tasks.

# Adjust sampling hyperparameters
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

response = client.chat.completions.create(
    model="llama3.2:1B",
    messages=[
        {"role": "system", "content": "You are a creative brainstorming partner."},
        {"role": "user", "content": "Give me 5 startup ideas that combine AI and education."}
    ],
    temperature=0.9,      # More randomness
    top_p=0.95,           # Nucleus sampling
    presence_penalty=0.5, # Encourage novelty
)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()