This is an old revision of the document!

LLM

For Production Deployment:

Optimization Strategies:

Hardware	Llama 3.3 8B (tokens/sec)	Llama 3.3 70B (tokens/sec)	Llama 3.2
RTX 4090	89.2	12.1
RTX 3090	67.4	8.3
A100 40GB	156.7	45.2
M3 Max 128GB	34.8	4.2
Strix Halo 128GB ollama		5.1	85.02
Strix Halo 128GB llama.cpp			90
RTX 3060			131.76

ROCM

model	capabilities	size	context	quantization	eval rate [token/s]	prompt eval rate [token/s]
llama3.2	completion tools	“3.2B”	131072	“Q4KM”	52.78	1957.30
qwen3:30b-a3b	completion tools thinking	“30.5B”	262144	“Q4KM”	50.19	803.06
gpt-oss:20b	completion tools thinking	“20.9B”	131072	“MXFP4”	45.37	519.90
qwen3-coder	completion tools	“30.5B”	262144	“Q4KM”	52.10	776.55
qwen3:8b	completion tools thinking	“8.2B”	40960	“Q4KM”	32.68	890.98
qwen3-coder-next	completion tools	“79.7B”	262144	“Q4KM”	33.06	380.21
glm-4.7-flash	completion tools thinking	“29.9B”	202752	“Q4_K_M”	38.46	485.27

VULKAN

model	capabilities	size	context	quantization	eval rate [token/s]	prompt eval rate [token/s]
qwen3-coder	completion tools	“30.5B”	262144	“Q4KM”	54.03	805.43
llama3.2	completion tools	“3.2B”	131072	“Q4_K_M”	52.54	1838.82
gpt-oss:20b	completion tools thinking	“20.9B”	131072	“MXFP4”	43.36	475.60