πŸ“‘ How Does the OpenAI Standard Work?

The OpenAI API standard defines a simple and powerful way for applications to communicate with language models β€” whether running in the cloud or locally.

It uses a multi-role chat format, which includes three types of messages:

  • πŸ›  System: Used to give high-level instructions to the model. For example: β€œYou are a helpful assistant,” or β€œYou are a coding tutor that only explains using examples.” It sets the tone, rules, or available tools for the model.
  • πŸ‘€ User: Messages from the application to the model. These often come from the end-user (e.g., a typed prompt like: β€œExplain black holes.”).
  • πŸ€– Assistant: Responses generated by the model and returned to the application. These are answers to the user prompts.

This structure makes it easy to build multi-turn conversations with consistent behavior.
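
In code, this structure is simply a list of role/content pairs. Here is a minimal sketch of a multi-turn message list in Python (the wording of each turn is only illustrative):

# Multi-role chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},       # sets tone and rules
    {"role": "user", "content": "Explain black holes."},                 # prompt from the end-user
    {"role": "assistant", "content": "A black hole is a region of..."},  # earlier model reply, kept for context
    {"role": "user", "content": "Now explain it to a 10-year-old."}      # follow-up turn
]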


πŸ“š Developer Support

OpenAI provides official libraries in multiple programming languages to help developers follow the standard easily: Python, JavaScript, .NET, Java, and Go.

These libraries make it easy to send prompts, receive completions, and integrate with local or cloud-based OpenAI-compatible servers.


πŸš€ Quick Test: Use OpenAI SDK with FastFlowLM in Python

You can try this instantly in any Python environment β€” including Jupyter Notebook. Follow the steps below by copying each block into a notebook cell.


βœ… Step 0: Start FastFlowLM in Server Mode

Open PowerShell or a terminal and launch the model server:

flm serve llama3.2:1b

🧠 This loads the model and starts the FastFlowLM OpenAI-compatible API at http://localhost:52625/v1.


βœ… Step 1: Install the OpenAI Python SDK

pip install --upgrade openai
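
With the SDK installed, you can optionally verify that the FastFlowLM server from Step 0 is reachable. This is a minimal sketch that assumes the server exposes the standard OpenAI /v1/models listing endpoint; if it does not, skip straight to Step 2:

# Optional: check that the server is up
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

# List the models the server reports (assumes the standard /v1/models endpoint is available)
for model in client.models.list():
    print(model.id)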

βœ… Step 2: Send a Chat Request to FastFlowLM

# Quick Start
from openai import OpenAI

# Connect to local FastFlowLM server
client = OpenAI(
    base_url="http://localhost:52625/v1",  # FastFlowLM's local API endpoint
    api_key="flm"  # Dummy key (FastFlowLM doesn’t require authentication)
)

# Send a chat-style prompt using OpenAI API format
response = client.chat.completions.create(
    model="llama3.2:1b",  # Replace with any model you've launched with `flm serve`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"}
    ]
)

# Show the model's response
print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

πŸ“Œ Notes

  • 🧠 You can replace "llama3.2:1b" with any other model available via flm run or flm pull.
  • πŸ–₯ Make sure the FastFlowLM server is running in the background (flm serve ...).
  • πŸ”’ No real API key is needed β€” just pass "flm" as a placeholder.
  • ⚑ FastFlowLM runs fully offline and is optimized for AMD Ryzenβ„’ AI NPUs.

βœ… This setup is perfect for quick offline LLM testing using standard OpenAI tooling.


πŸ§ͺ More Examples

πŸš€ Ah β€” that was easy, right?
Now let’s kick things up a notch with some awesome next-level examples!


πŸ’¬ Example: Multi-turn Chat (Conversation History)

Use this pattern when you want the model to remember previous turns in the conversation:

# Multi-turn conversation
from openai import OpenAI

messages = [
    {"role": "system", "content": "You are a creative writing assistant."},
    {"role": "user", "content": "Write the beginning of a fantasy story."},
]

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")
response = client.chat.completions.create(model="llama3.2:1b", messages=messages)
print(response.choices[0].message.content)

# Add the assistant response and continue the conversation
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "Continue the story with a twist."})
response = client.chat.completions.create(model="llama3.2:1b", messages=messages)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

⚠️ The OpenAI API (and FastFlowLM server mode) is stateless β€” you must resend the full conversation each time. No KV cache is kept between turns.
πŸŒ€ This means all previous messages are reprocessed (prefill), which adds latency for long chats.
⚑ FastFlowLM’s CLI mode uses a real KV cache, making multi-turn responses much faster β€” especially with long conversations.
🧠 FastFlowLM is optimized for long sequences with large KV caches, ideal for 32k–256k context windows.
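
If you want the resend-the-whole-history pattern wrapped in a reusable shape, here is a small illustrative sketch (the helper name chat_turn and the system prompt are just examples, not part of FastFlowLM or the SDK):

# Multi-turn helper (illustrative sketch)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")
history = [{"role": "system", "content": "You are a creative writing assistant."}]

def chat_turn(user_text):
    """Append the user turn, resend the full history, and store the assistant reply."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model="llama3.2:1b", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})  # keep context for the next turn
    return reply

print(chat_turn("Write the beginning of a fantasy story."))
print(chat_turn("Continue the story with a twist."))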


πŸ” Example: Streamed Output (Real-Time Response)

Display the model’s output as it is generated, token by token:

# Streaming
from openai import OpenAI
import gc, sys

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

stream = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You are a fast, concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in 3 sentences."}
    ],
    stream=True
)

print("=== STREAMING BEGIN ===")
try:
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print("\n=== STREAMING END ===")
except Exception as e:
    sys.stderr.write(f"\n[STREAM ERROR] {e}\n")

# cleanup
del stream, client
gc.collect()

πŸ“„ Example: Use a File as the Prompt

You can load a full .txt file as a prompt β€” useful for long documents or testing large context windows.

πŸ‘‰ Download the sample prompt

Save it to your Downloads folder. The file contains over 38k tokens, so prompt processing (prefill) may take a while. FastFlowLM supports the full context length (32k–128k), making it ideal for processing long documents like this one.
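
If you would like a rough idea of a file's token count before sending it, you can estimate it locally. This is a sketch using the tiktoken package (install it separately with pip install tiktoken; its cl100k_base encoding belongs to OpenAI models, so the number is only an approximation for Llama-family tokenizers):

# Optional: rough token-count estimate
import tiktoken

with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    text = f.read()

encoding = tiktoken.get_encoding("cl100k_base")  # approximation only; not the model's own tokenizer
print(f"Approximate token count: {len(encoding.encode(text))}")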

# Use a text file to prompt
from openai import OpenAI

with open("C:\\Users\\<username>\\Downloads\\alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    user_prompt = f.read()

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You are a story rewriter."},
        {"role": "user", "content": user_prompt}
    ]
)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

πŸ“Š Example: Batch Requests (Multiple Prompts)

Loop over a list of prompts and generate answers β€” useful for eval or bulk testing.

# Batched prompts
from openai import OpenAI

prompts = [
    "Summarize the causes of World War I.",
    "Describe how a transistor works.",
    "What are the key themes in β€˜To Kill a Mockingbird’?",
]

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

for prompt in prompts:
    response = client.chat.completions.create(
        model="llama3.2:1b",
        messages=[
            {"role": "system", "content": "You are a concise academic tutor."},
            {"role": "user", "content": prompt}
        ]
    )
    print(f"\nπŸ“ Prompt: {prompt}\nπŸ” Answer: {response.choices[0].message.content}")

# cleanup
del response, client
import gc
gc.collect()

🧬 Example: Use Temperature, Top-p, and Presence Penalty

Control randomness and creativity β€” for brainstorming or open-ended tasks.

# Adjust sampling hyperparameters
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52625/v1", api_key="flm")

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You are a creative brainstorming partner."},
        {"role": "user", "content": "Give me 5 startup ideas that combine AI and education."}
    ],
    temperature=0.9,      # More randomness
    top_p=0.95,           # Nucleus sampling
    presence_penalty=0.5, # Encourage novelty
)

print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()

πŸ“· Example: Multi-Image Input

Send multiple images together with a text prompt.
The model will analyze and compare them, then respond with its reasoning.

import base64
from openai import OpenAI

# Paths to your local images
image_path_0 = r"C:\Users\info\OneDrive\Desktop\FLM\image_test\image0.jpg"
image_path_1 = r"C:\Users\info\OneDrive\Desktop\FLM\image_test\image1.png"

# Read and encode the images as Base64 strings (required for API input)
with open(image_path_0, "rb") as image_file:
    image_0 = base64.b64encode(image_file.read()).decode("utf-8")
with open(image_path_1, "rb") as image_file:
    image_1 = base64.b64encode(image_file.read()).decode("utf-8")

# Connect to your local FLM/OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:52625/v1", api_key="dummykey")

# Create a chat completion request with text + two images
response = client.chat.completions.create(
    model="gemma3:4b",  # Vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the images. Which one you like better and why?"},
                
                {"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{image_0}"}}, # standard OpenAI-compatible API
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_1}"}}, 
            ],
        }
    ],
)

# Print the model's answer
print(response.choices[0].message.content)

# cleanup
del response, client
import gc
gc.collect()