LLM

For Production Deployment:

Optimization Strategies:

| Hardware | Llama 3.3 8B (tokens/sec) | Llama 3.3 70B (tokens/sec) | Llama 3.2 (tokens/sec) |
|---|---|---|---|
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB (ollama) | | 5.1 | 85.02 |
| Strix Halo 128GB (llama.cpp) | | | 90 |
| RTX 3060 | | | 131.76 |
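The per-hardware figures above match the "eval rate" that Ollama's verbose timing output reports. As a reference point, here is a minimal sketch of how such numbers can be reproduced against a local Ollama server (assumed to be running at the default http://localhost:11434; the model names and prompt are placeholders): the non-streaming `/api/generate` response carries `eval_count`, `eval_duration`, `prompt_eval_count`, and `prompt_eval_duration` in nanoseconds, which give generation and prompt-processing rates in tokens/sec.

```python
# Minimal throughput probe against a local Ollama server (assumed at
# http://localhost:11434). Reports the same "eval rate" / "prompt eval rate"
# figures that `ollama run --verbose` prints.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def measure(model: str, prompt: str = "Explain KV caching in two sentences.") -> dict:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # All durations are reported in nanoseconds. prompt_eval_* may be tiny or
    # absent when the prompt is already cached, so guard the division.
    prompt_rate = data.get("prompt_eval_count", 0) / max(data.get("prompt_eval_duration", 1), 1) * 1e9
    eval_rate = data["eval_count"] / data["eval_duration"] * 1e9
    return {"model": model, "eval_rate": eval_rate, "prompt_eval_rate": prompt_rate}


if __name__ == "__main__":
    for model in ("llama3.2", "llama3.3:70b-instruct-q4_K_M"):
        r = measure(model)
        print(f'{r["model"]}: {r["eval_rate"]:.2f} tok/s generation, '
              f'{r["prompt_eval_rate"]:.2f} tok/s prompt eval')
```

A single short run underestimates prompt-processing speed once the prompt is cached, so repeated measurements on fresh prompts give more representative numbers.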
| Model | Capabilities | Size | Context (tokens) | Quantization | Eval rate (tokens/s) | Prompt eval rate (tokens/s) |
|---|---|---|---|---|---|---|
| llama3.2 | completion, tools | 3.2B | 131072 | Q4_K_M | 88.14 | 715.43 |
| ministral-3:14b | completion, vision, tools | 13.9B | 262144 | Q4_K_M | 23.78 | 302.07 |
| qwen3-coder:30b | completion, tools | 30.5B | 262144 | Q4_K_M | 73.75 | 72.41 |
| llama3:70b | completion | 70.6B | 8192 | Q4 | 5.55 | 9.72 |
| llava | completion, vision | 7B | 32768 | Q4 | 49.92 | 207.27 |
| deepseek-coder-v2:16b | completion, insert | 15.7B | 163840 | Q4 | 84.44 | 111.71 |
| bjoernb/qwen3-coder-30b-1m:latest | completion, tools | 30.5B | 1048576 | Q4_K_M | 74.23 | 94.84 |
| freehuntx/qwen3-coder:8b | completion, tools | 8.2B | 40960 | Q4_K_M | 37.97 | 565.68 |
| networkjohnny/deepseek-coder-v2-lite-base-q4km-gguf:latest | completion, tools | 3.2B | 131072 | Q4_K_M | 86.02 | 1124.53 |
| phi4-mini | completion, tools | 3.8B | 131072 | Q4_K_M | 72.24 | 31.37 |
| qwen2.5:7b | completion, tools | 7.6B | 32768 | Q4_K_M | 42.98 | 153.34 |
| llama3.3:70b-instruct-q4_K_M | completion, tools | 70.6B | 131072 | Q4_K_M | 5.06 | 15.50 |
| functiongemma | completion, tools | 268.10M | 32768 | Q8 | 364.21 | 240.50 |
| danielsheep/Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth | completion, tools | 30.5B | 1048576 | Q4_K_M | 71.60 | 33.14 |
| gpt-oss:20b | completion, tools, thinking | 20.9B | 131072 | MXFP4 | 47.32 | 402.47 |
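The size, context, quantization, and capability columns can be collected programmatically rather than copied by hand. Below is a hedged sketch using Ollama's `/api/show` endpoint, under the same local-server assumption as above; note that the `capabilities` field only appears on newer Ollama releases, and the context-length key in `model_info` is prefixed by the model's architecture family (e.g. `llama.context_length`).

```python
# Sketch for collecting the metadata columns (size, context, quantization,
# capabilities) via Ollama's /api/show endpoint; assumes a local server at
# http://localhost:11434 and that the listed models are already pulled.
import requests

SHOW_URL = "http://localhost:11434/api/show"


def model_info(name: str) -> dict:
    data = requests.post(SHOW_URL, json={"model": name}, timeout=30).json()
    details = data.get("details", {})
    info = data.get("model_info", {})
    # The context-length key is prefixed by the architecture family,
    # e.g. "llama.context_length" or "qwen3.context_length".
    context = next((v for k, v in info.items() if k.endswith(".context_length")), None)
    return {
        "model": name,
        "capabilities": " ".join(data.get("capabilities", [])),  # newer Ollama only
        "size": details.get("parameter_size"),
        "context": context,
        "quantization": details.get("quantization_level"),
    }


if __name__ == "__main__":
    for name in ("llama3.2", "qwen3-coder:30b", "gpt-oss:20b"):
        row = model_info(name)
        print("{model}\t{capabilities}\t{size}\t{context}\t{quantization}".format(**row))
```

Combining this with the throughput probe above is enough to regenerate the whole table for any set of locally pulled models.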