LLM

For Production Deployment:

  • Primary Choice: DeepSeek-R1 32B for reasoning-heavy applications
  • Coding Tasks: Qwen2.5-Coder 7B for optimal balance of capability and efficiency
  • General Purpose: Llama 3.3 70B for maximum versatility
  • Edge Computing: Phi-4 14B for resource-constrained environments
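
The recommendations above can be captured as a small lookup table. The following is a minimal sketch, assuming the models are served through Ollama. The task names (`reasoning`, `coding`, `general`, `edge`) are hypothetical labels, and the model tags are assumed Ollama library names that should be checked against what is actually installed.

```python
# Hypothetical task categories mapped to assumed Ollama model tags.
# Verify each tag with `ollama list` before relying on it.
RECOMMENDED_MODELS = {
    "reasoning": "deepseek-r1:32b",   # reasoning-heavy applications
    "coding": "qwen2.5-coder:7b",     # coding tasks
    "general": "llama3.3:70b",        # general purpose
    "edge": "phi4:14b",               # resource-constrained environments
}


def pick_model(task: str) -> str:
    """Return the recommended tag for a task, falling back to the general-purpose model."""
    return RECOMMENDED_MODELS.get(task, RECOMMENDED_MODELS["general"])
```

For example, `pick_model("coding")` returns the coder model tag, and any unknown task falls back to the general-purpose entry.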

Optimization Strategies:

  • Always enable Flash Attention and KV-cache quantization
  • Use Q4_K_M quantization for production deployments
  • Implement caching for repeated queries
  • Monitor GPU memory usage and implement automatic model swapping
  • Use load balancing for high-throughput applications
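
Caching repeated queries could look roughly like the sketch below. It is only an illustration: `ResponseCache` and the `generate` callable are hypothetical names, and `generate` stands in for whatever function actually calls the model. Caching like this mainly makes sense when sampling is deterministic (for example temperature 0), since otherwise repeated queries are expected to differ.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class ResponseCache:
    """Small LRU cache for model responses, keyed on (model, prompt)."""

    def __init__(self, generate: Callable[[str, str], str], max_entries: int = 1024):
        self._generate = generate          # hypothetical backend: (model, prompt) -> text
        self._max_entries = max_entries
        self._entries = OrderedDict()      # key -> cached response, oldest first

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash model and prompt together so long prompts do not bloat the key.
        return hashlib.sha256(f"{model}\0{prompt}".encode("utf-8")).hexdigest()

    def query(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        response = self._generate(model, prompt)
        self._entries[key] = response
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return response
```

Wrapping the real inference call in `ResponseCache(generate).query(model, prompt)` means an identical prompt for the same model reaches the GPU only once while its entry stays in the cache.
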

Hardware                    Llama 3.3 8B (tokens/sec)  Llama 3.3 70B (tokens/sec)  Llama 3.2 (tokens/sec)
RTX 4090                    89.2                       12.1                        -
RTX 3090                    67.4                       8.3                         -
A100 40GB                   156.7                      45.2                        -
M3 Max 128GB                34.8                       4.2                         -
Strix Halo 128GB ollama     -                          5.1                         85.02
Strix Halo 128GB llama.cpp  -                          -                           90
RTX 3060                    -                          -                           131.76

model                                                       capabilities               size     context  quantization  eval rate [token/s]  prompt eval rate [token/s]
llama3.2                                                    completion, tools          3.2B     131072   Q4_K_M        88.14                715.43
ministral-3:14b                                             completion, vision, tools  13.9B    262144   Q4_K_M        23.78                302.07
qwen3-coder:30b                                             completion, tools          30.5B    262144   Q4_K_M        73.75                72.41
llama3:70b                                                  completion                 70.6B    8192     Q4            5.55                 9.72
llava                                                       completion, vision         7B       32768    Q4            49.92                207.27
deepseek-coder-v2:16b                                       completion, insert         15.7B    163840   Q4            84.44                111.71
bjoernb/qwen3-coder-30b-1m:latest                           completion, tools          30.5B    1048576  Q4_K_M        74.23                94.84
freehuntx/qwen3-coder:8b                                    completion, tools          8.2B     40960    Q4_K_M        37.97                565.68
networkjohnny/deepseek-coder-v2-lite-base-q4km-gguf:latest  completion, tools          3.2B     131072   Q4_K_M        86.02                1124.53
phi4-mini                                                   completion, tools          3.8B     131072   Q4_K_M        72.24                31.37
qwen2.5:7b                                                  completion, tools          7.6B     32768    Q4_K_M        42.98                153.34
llama3.3:70b-instruct-q4_K_M                                completion, tools          70.6B    131072   Q4_K_M        5.06                 15.50
functiongemma                                               completion, tools          268.10M  32768    Q8_0          364.21               240.50
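
The two rate columns can be recomputed from the timing fields of a generation response. The sketch below assumes the figures come from Ollama, whose generate response reports `eval_count`, `eval_duration`, `prompt_eval_count` and `prompt_eval_duration`, with durations in nanoseconds. The helper name `rates_from_response` is hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def rates_from_response(response: dict) -> dict:
    """Compute token rates (tokens/s) from an Ollama-style response dictionary.

    Assumes durations are given in nanoseconds. A rate is None when the
    corresponding duration is missing or zero.
    """

    def rate(count_key: str, duration_key: str):
        duration = response.get(duration_key, 0)
        if not duration:
            return None
        return response.get(count_key, 0) * NS_PER_SECOND / duration

    return {
        "eval_rate": rate("eval_count", "eval_duration"),
        "prompt_eval_rate": rate("prompt_eval_count", "prompt_eval_duration"),
    }
```

Generation speed (eval rate) and prompt processing speed (prompt eval rate) are kept separate because they can differ widely for the same model, as several rows in the table show.
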
tips/llm.txt · Last modified: by sscipioni