⚡ CLI Mode
FLM CLI mode offers a terminal-based interactive experience, similar to Ollama, but fully offline and accelerated exclusively on AMD NPUs. The sections below describe the commands and setup for CLI mode usage.
🔧 Pre-Run PowerShell Commands
📖 Show Help
flm help
🚀 Run a Model
Run a model interactively from the terminal:
flm run llama3.2:1b
flm is short for FastFlowLM. If the model isn't available locally, it will be downloaded automatically. This launches FastFlowLM in CLI mode.
⬇️ Pull a Model (Download Only)
Download a model from Hugging Face without launching it:
flm pull llama3.2:3b
To force a re-download of the model, overwriting the current version:
flm pull llama3.2:3b --force
⚠️ Use --force only if the model file is corrupted (e.g., incomplete download). Proceed with caution.
📦 List Supported and Downloaded Models
Display all available models and locally downloaded models:
flm list
Filter flag:
# Show everything
flm list --filter all
# Only models already installed
flm list --filter installed
# Only models not yet installed
flm list --filter not-installed
Quiet mode:
# Default view (pretty, with icons)
flm list
# Quiet view (no emoji / minimal)
flm list --quiet
# Show everything
flm list --filter all --quiet
# Only models already installed
flm list --filter installed --quiet
# Only models not yet installed
flm list --filter not-installed --quiet
❌ Remove a Downloaded Model
Delete a model from local storage:
flm remove llama3.2:3b
🔧 Start Server Mode (Local)
Launch FastFlowLM as a local REST API server (the OpenAI API is also supported):
flm serve llama3.2:1b
Show Server Port
Show the current default FLM port in PowerShell:
flm port
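With the server running, any OpenAI-style client can call it over HTTP on that port. The sketch below uses plain PowerShell and assumes the standard OpenAI chat completions path (/v1/chat/completions); 11434 is only a placeholder port, so substitute the value reported by flm port:
# Placeholder port; replace 11434 with the value shown by `flm port`
$body = @{
    model    = "llama3.2:1b"
    messages = @(@{ role = "user"; content = "Hello from PowerShell" })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body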
⚡ NPU Power Mode
By default, FLM runs in the performance NPU power mode. You can switch to other NPU power modes (powersaver, balanced, or turbo) using the --pmode flag:
For CLI mode:
flm run gemma3:4b --pmode balanced
For Server mode:
flm serve gemma3:4b --pmode balanced
🎚️ Set Context Length at Launch
The default context length for each model can be found here.
Set the context length with --ctx-len (or -c).
In PowerShell, run:
For CLI mode:
flm run llama3.2:1b --ctx-len 8192
For Server mode:
flm serve llama3.2:1b --ctx-len 8192
- Internally, FLM enforces a minimum context length of 512. If you specify a smaller value, it will automatically be adjusted up to 512.
- If you enter a context length that is not a power of 2, FLM automatically rounds it up to the nearest power of 2. For example: input 8000 → adjusted to 8192. Both adjustments are illustrated below.
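For instance, following the two rules above, out-of-range values would be adjusted automatically (illustrative commands only):
# 3000 is not a power of 2; the context length is rounded up to 4096
flm run llama3.2:1b --ctx-len 3000
# 256 is below the 512 minimum; the context length is raised to 512
flm serve llama3.2:1b --ctx-len 256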
Set Server Port at Launch
Set a custom port at launch:
flm serve llama3.2:1b --port 8000
flm serve llama3.2:1b -p 8000
⚠️ --port (-p) only affects the current run; it won't change the default port.
🌐 Cross-Origin Resource Sharing (CORS)
CORS lets browser apps hosted on a different origin call your FLM server safely.
- Enable CORS
flm serve --cors 1
- Disable CORS
flm serve --cors 0
⚠️ Default: CORS is enabled.
🔒 Security tip: Disable CORS (or restrict it at your proxy) if your server is exposed beyond localhost.
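To check the behavior from a terminal, send a request with an Origin header and look for an Access-Control-Allow-Origin header in the response. This sketch assumes the OpenAI-style /v1/models path and a placeholder port of 11434 (use the value from flm port):
# curl.exe bypasses PowerShell's curl alias; replace the placeholder port if needed
curl.exe -i -H "Origin: http://localhost:3000" http://localhost:11434/v1/models
# With --cors 1 the response should include Access-Control-Allow-Origin; with --cors 0 it should not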
Preemption
Preemption allows high-priority tasks to interrupt ongoing NPU jobs, improving responsiveness for critical workloads. To enable preemption:
For CLI mode:
flm run llama3.2:1b --preemption 1
For Server mode:
flm serve llama3.2:1b --preemption 1
⚠️ Note: Preemption is for engineering testing/optimization only. It requires a special driver + toolkit and is not for public use.
💻 Commands Inside CLI Mode
Once inside the CLI, use the following commands. System commands always start with / (e.g., /help).
📖 Help
/?
Displays all available interactive system commands. Highly recommended for first-time users.
🪪 Model Info
/show
View the model architecture, size, max context size (adjustable; see the bottom of this page), and more.
🔄 Change Model
/load [model_name]
Unload the current model and load a new one. KV cache will be cleared.
💾 Save Conversation
/save
Save the current conversation history to disk.
🧹 Clear Memory
/clear
Clear the KV cache (model memory) for a fresh start.
📊 Show Runtime Stats
/status
Display runtime statistics like token count, throughput, etc.
🕰️ Show History
/history
Review the current session's conversation history.
🔊 Toggle Verbose Mode
/verbose
Enable detailed performance metrics per turn. Run again to disable.
📦 List Models
Display all available models and locally downloaded models:
/list
👋 Quit CLI Mode
/bye
Exit the CLI.
🧠 Think Mode Toggle
Type /think to toggle Think Mode on or off interactively in the CLI.
💡 Note: This feature is only supported on certain models, such as Qwen3.
📂 Load a Local Text File in CLI Mode
Use any file that can be opened in Notepad (like .txt, .json, .csv, etc.).
Format (in CLI mode):
/input "<file_path>" prompt
Example:
/input "C:\Users\Public\Desktop\alice_in_wonderland.txt" Summarize it into 200 words
Notes:
- Use quotes only around the file path
- No quotes around the prompt
- File must be plain text (readable in Notepad)
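To try /input without hunting for a file, you can first create a small plain-text file from PowerShell (standard cmdlet, nothing FLM-specific; the path is only an example):
# Write a small plain-text file that /input can read
Set-Content -Path "C:\Users\Public\Desktop\test_note.txt" -Value "FastFlowLM runs large language models locally on AMD NPUs."
Then, inside the CLI:
/input "C:\Users\Public\Desktop\test_note.txt" Summarize this in one sentence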
📥 Download a sample prompt (around 40k tokens)
⚠️ Caution: a model's supported context length is limited by available DRAM capacity. For example, with 32 GB of DRAM, LLaMA 3.1:8B cannot run beyond a 32K context length. For the full 128K context, we recommend a system with more memory.
If DRAM is heavily used by other programs while running FastFlowLM, you may encounter errors due to insufficient memory, such as:
[XRT] ERROR: Failed to submit the command to the hw queue (0xc01e0200):
Even after the video memory manager split the DMA buffer, the video memory manager
could not page-in all of the required allocations into video memory at the same time.
The device is unable to continue.
🤔 Interested in checking the DRAM usage?
- Press Ctrl + Shift + Esc (or Ctrl + Alt + Del and select Task Manager).
- Go to the Performance tab.
- Click Memory to see total, used, and available DRAM, as well as usage percentage.
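If you prefer the terminal to Task Manager, a standard PowerShell query (not an FLM command) reports the same numbers:
# Total and free physical memory in GB (CIM reports these values in KB)
Get-CimInstance Win32_OperatingSystem |
    Select-Object @{n='TotalGB';e={[math]::Round($_.TotalVisibleMemorySize/1MB,1)}},
                  @{n='FreeGB'; e={[math]::Round($_.FreePhysicalMemory/1MB,1)}}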
🖼️ Loading Images in CLI Mode (for VLMs only, e.g. gemma3:4b)
Supports .png and .jpg formats.
/input "<image_path>" prompt
Example:
/input "C:\Users\Public\Desktop\cat.jpg" describe this image
Notes:
- Make sure the model you are using is a vision model (VLM) (e.g., gemma3:4b)
- Put quotes only around the file path
- Do not use quotes around the prompt
- Image must be in .jpg or .png format
⚙️ Set Variables
/set
Customize decoding parameters like top_k, top_p, temperature, context length (max), generate limit, etc.
⚠️ Note: Providing invalid or extreme hyperparameter values may cause inference errors.
generate limit sets an upper limit on the number of tokens that can be generated for each response. Example:
/set gen-lim 128
🗂️ Others
📏 Change Default Context Length (max)
You can find more information about available models here:
C:\Program Files\flm\model_list.json
You can also change the default_context_length setting.
⚠️ Note: Be cautious! The system reserves DRAM space based on the context length you set.
Setting a longer default context length may cause errors on systems with less DRAM. Also, each model has its own context length limit (examples below).
- qwen3-tk:4b β up to 256k tokens
- gemma3:4b β up to 128k tokens
- gemma3:1b β up to 32k tokens
- llama3.x β up to 128k tokens
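Because model_list.json sits under C:\Program Files, saving edits usually requires administrator rights. One way to open it for editing (a sketch that assumes Notepad; any elevated editor works) is:
# Launch an elevated Notepad so the edited model_list.json can be saved in place
Start-Process notepad '"C:\Program Files\flm\model_list.json"' -Verb RunAs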