====== LLM ======

- https://collabnix.com/best-ollama-models-in-2025-complete-performance-comparison/

For Production Deployment:
  * Primary choice: DeepSeek-R1 32B for reasoning-heavy applications
  * Coding tasks: Qwen2.5-Coder 7B for an optimal balance of capability and efficiency
  * General purpose: Llama 3.3 70B for maximum versatility
  * Edge computing: Phi-4 14B for resource-constrained environments

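The model recommendations above can be fetched from the ollama library. This is a sketch; the exact tags (''deepseek-r1:32b'', ''qwen2.5-coder:7b'', ''llama3.3:70b'', ''phi4:14b'') are assumptions based on common library naming, so check the ollama model library for the current names:

```shell
# Pull the recommended models (tags are assumed, verify in the ollama library)
ollama pull deepseek-r1:32b    # reasoning-heavy applications
ollama pull qwen2.5-coder:7b   # coding tasks
ollama pull llama3.3:70b       # general purpose; needs roughly 40+ GB RAM/VRAM at Q4
ollama pull phi4:14b           # edge / resource-constrained environments
```

`ollama list` afterwards shows what was downloaded and the on-disk size of each model.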
Optimization Strategies:
  * Always enable **Flash Attention** and KV-cache quantization
  * Use **Q4_K_M** quantization for production deployments
  * Implement caching for repeated queries
  * Monitor GPU memory usage and implement automatic model swapping
  * Use load balancing for high-throughput applications

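The first two points are controlled through environment variables on the ollama server. A minimal sketch, assuming a systemd-free manual start; ''q8_0'' is one common choice for the cache type, not the only valid one:

```shell
# Enable Flash Attention and quantize the KV cache before starting the server.
# OLLAMA_KV_CACHE_TYPE accepts f16 (default), q8_0, or q4_0;
# q8_0 roughly halves KV-cache memory with minimal quality loss.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```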
^ Hardware ^ Llama 3.3 8B (tokens/sec) ^ Llama 3.3 70B (tokens/sec) ^ Llama 3.2 (tokens/sec) ^
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB ollama | | 5.1 | 85.02 |
| Strix Halo 128GB llama.cpp | | | 90 |
| RTX 3060 | | | 131.76 |

===== ROCm =====
^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [token/s] ^ prompt eval rate [token/s] ^
| llama3.2 | completion tools | 3.2B | 131072 | Q4_K_M | 52.78 | 1957.30 |
| qwen-strixhalo | completion tools | 30.5B | 262144 | Q4_K_M | 53.54 | 1056.37 |
| qwen3-coder | completion tools | 30.5B | 262144 | Q4_K_M | 52.10 | 776.55 |
| qwen3:30b-a3b | completion tools thinking | 30.5B | 262144 | Q4_K_M | 50.19 | 803.06 |
| gpt-oss:20b | completion tools thinking | 20.9B | 131072 | MXFP4 | 45.37 | 519.90 |
| glm-4.7-flash | completion tools thinking | 29.9B | 202752 | Q4_K_M | 41.54 | 470.09 |
| qwen3:8b | completion tools thinking | 8.2B | 40960 | Q4_K_M | 32.68 | 890.98 |
| qwen3-coder-next | completion tools | 79.7B | 262144 | Q4_K_M | 33.06 | 380.21 |
| qwen2.5-coder:14b-instruct-q4_K_M | completion tools insert | 14.8B | 32768 | Q4_K_M | 17.25 | 527.74 |

===== Vulkan =====
^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [token/s] ^ prompt eval rate [token/s] ^
| qwen3-coder | completion tools | 30.5B | 262144 | Q4_K_M | 54.03 | 805.43 |
| llama3.2 | completion tools | 3.2B | 131072 | Q4_K_M | 52.54 | 1838.82 |
| gpt-oss:20b | completion tools thinking | 20.9B | 131072 | MXFP4 | 43.36 | 475.60 |
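The eval-rate and prompt-eval-rate figures in the tables above can be reproduced with ollama's verbose mode, which prints timing statistics after each response. A sketch; the model tag and prompt are just examples:

```shell
# --verbose makes ollama report "eval rate" (generation tokens/s)
# and "prompt eval rate" (prompt-processing tokens/s) after the reply.
ollama run llama3.2 --verbose "Write a haiku about GPUs."
```

Run the same prompt a couple of times and discard the first result, since the initial run includes model load time.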

Ollama Modelfile example:
<code>
FROM qwen3-coder

# STRIX HALO AGENTIC TUNING
PARAMETER num_ctx 128000
PARAMETER num_batch 1024
PARAMETER num_predict 4096

SYSTEM """
You are a Strix Halo Optimized Coding Agent.
Always use asynchronous patterns and favor memory-efficient algorithms.
"""
</code>
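Assuming the Modelfile above is saved as ''Modelfile'' in the current directory, the tuned model can be built and tested as follows (the model name ''qwen3-coder-strix'' is an arbitrary example):

```shell
# Build a local model from the Modelfile
ollama create qwen3-coder-strix -f Modelfile

# Confirm the parameters and system prompt took effect
ollama show qwen3-coder-strix

# Interactive test run
ollama run qwen3-coder-strix
```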
  
tips/llm.1766126549.txt.gz · Last modified: by sscipioni