CLI (Interactive Mode)
FastFlowLM offers a terminal-based interactive experience, similar to Ollama, but fully offline and accelerated exclusively on AMD NPUs.
Pre-Run PowerShell Commands
Show Help
flm help
Run a Model
Run a model interactively from the terminal:
flm run llama3.2:1b
flm is short for FastFlowLM. If the model isn't available locally, it will be downloaded automatically. This launches FastFlowLM in CLI mode.
Pull a Model (Download Only)
Download a model from Hugging Face without launching it:
flm pull llama3.2:3b
The following command forces a re-download of the model, overwriting the current version:
flm pull llama3.2:3b --force
--force is only needed when a major flm update is released and installed. Proceed with caution.
List Supported and Downloaded Models
Display all available models and locally downloaded models:
flm list
Remove a Downloaded Model
Delete a model from local storage:
flm remove llama3.2:3b
Run with a Text File
Load input from a local text file:
flm run llama3.2:1b "C:\Users\Public\Desktop\alice_in_wonderland.txt"
Specify the maximum generation length directly after the file path. For example:
flm run llama3.2:1b "C:\Users\Public\Desktop\alice_in_wonderland.txt" 1024
Note: a model's supported context length is limited by available DRAM capacity. For example, with 32 GB of DRAM, LLaMA 3.1:8B cannot run beyond a 32K context length. For the full 128K context, we recommend a system with more memory.
If DRAM is heavily used by other programs while running FastFlowLM, you may encounter errors due to insufficient memory, such as:
[XRT] ERROR: Failed to submit the command to the hw queue (0xc01e0200):
Even after the video memory manager split the DMA buffer, the video memory manager
could not page-in all of the required allocations into video memory at the same time.
The device is unable to continue.
Interested in checking DRAM usage?
Method 1 – Task Manager (Quick View)
- Press Ctrl + Shift + Esc (or Ctrl + Alt + Del and select Task Manager).
- Go to the Performance tab.
- Click Memory to see total, used, and available DRAM, as well as usage percentage.
Method 2 – Resource Monitor (Detailed View)
- Press Windows + R.
- Type resmon and press Enter.
- Go to the Memory tab to view detailed DRAM usage and a per-process breakdown.
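Method 3 – PowerShell (Command Line)
If you prefer a quick command-line check, the standard Windows Get-CimInstance cmdlet reports physical memory. This is plain PowerShell, not an flm command, and the values are in kilobytes:
Get-CimInstance Win32_OperatingSystem | Select-Object TotalVisibleMemorySize, FreePhysicalMemory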
Load a Local Text File in CLI Mode (Preferred Method for Attaching a Text File)
Use any file that can be opened in Notepad (such as .txt, .json, .csv, etc.).
Format (in CLI mode):
/input "<file_path>" prompt
Example:
/input "C:\Users\Public\Desktop\alice_in_wonderland.txt" Can you rewrite the story in 2 sentences?
Notes:
- Use quotes only around the file path
- No quotes around the prompt
- File must be plain text (readable in Notepad)
Loading Images in CLI Mode (for VLMs only, e.g., gemma3:4b)
Supports .png and .jpg formats.
Usage:
/input "<image_path>" prompt
Example:
/input "C:\Users\Public\Desktop\cat.jpg" describe this image
Notes:
- Make sure the model you are using is a vision model (VLM)
- Put quotes only around the file path
- Do not use quotes around the prompt
- Image must be in .jpg or .png format
Start Server Mode
Launch FastFlowLM as a local REST API server (an OpenAI-compatible API is also supported):
flm serve llama3.2:1b
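As a quick sanity check, here is a minimal sketch of calling the chat endpoint from PowerShell. It assumes the server is reachable at http://localhost:11434 and exposes an OpenAI-compatible /v1/chat/completions route; check the output printed by flm serve for the actual address, port, and path.
# Build an OpenAI-style chat request body
$body = @{
    model    = "llama3.2:1b"
    messages = @(@{ role = "user"; content = "Hello!" })
} | ConvertTo-Json -Depth 5
# Host, port, and route below are assumptions; adjust them to match your server output
Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body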
Commands Inside Interactive Mode
Once inside the CLI, use the following commands:
Help
/?
Displays all available interactive commands. Highly recommended for first-time users.
Model Info
/show
View model architecture, size, cache path, and more.
Change Model
/load [model_name]
Unload the current model and load a new one. KV cache will be cleared.
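For example, to switch to the 3B LLaMA model (any model name shown by flm list works here):
/load llama3.2:3b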
Save Conversation
/save
Save the current conversation history to disk.
Clear Memory
/clear
Clear the KV cache (model memory) for a fresh start.
Show Runtime Stats
/status
Display runtime statistics like token count, throughput, etc.
Show History
/history
Review the current session's conversation history.
Toggle Verbose Mode
/verbose
Enable detailed performance metrics per turn. Run again to disable.
List Models
Display all available models and locally downloaded models:
/list
Quit Interactive Mode
/bye
Exit the CLI.
Think Mode Toggle
/think
Toggle Think Mode on or off interactively in the CLI.
Note: This feature is only supported on certain models, such as Qwen3.
Set Variables
/set
Customize decoding parameters such as top_k, top_p, temperature, context length, generate_limit, etc.
Note: Providing invalid or extreme hyperparameter values may cause inference errors.
generate_limit sets an upper limit on the number of tokens that can be generated for each response.
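A hypothetical example of the syntax (the space-separated name/value form is an assumption; run /? inside the CLI to confirm exactly how /set expects its arguments):
/set generate_limit 512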
Change Default Context Length
You can find more information about available models here:
C:\Program Files\flm\model_list.json
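To open the file for a quick look, any text editor works, for example:
notepad "C:\Program Files\flm\model_list.json"
Note that saving changes to a file under C:\Program Files typically requires running the editor as Administrator.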
You can also change the default_context_length setting.
Note: Be cautious! The system reserves DRAM space based on the context length you set. Setting a longer default context length may cause errors on systems with less DRAM. Also, each model has its own context length limit (examples below).
- Qwen3 – up to 32k tokens
- LLaMA 3.1 / 3.2 – up to 128k tokens