tips:llm

Differences

This shows you the differences between two versions of the page.

tips:llm [2025/12/08 15:32] – [vision tasks] sscipioni
tips:llm [2025/12/26 15:35] (current) sscipioni
Line 1: Line 1:
 ====== LLM ======
  
-===== under 16GB =====
 +- https://collabnix.com/best-ollama-models-in-2025-complete-performance-comparison/ 
 + 
 +For Production Deployment: 
 +  * Primary Choice: DeepSeek-R1 32B for reasoning-heavy applications 
 +  * Coding Tasks: Qwen2.5-Coder 7B for optimal balance of capability and efficiency 
 +  * General Purpose: Llama 3.3 70B for maximum versatility 
 +  * Edge Computing: Phi-4 14B for resource-constrained environments 
 + 
 +Optimization Strategies: 
 +  * Always enable **Flash Attention** and KV-cache quantization 
 +  * Use **Q4_K_M** quantization for production deployments 
 +  * Implement caching for repeated queries 
 +  * Monitor GPU memory usage and implement automatic model swapping 
 +  * Use load balancing for high-throughput applications 
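
The first strategy above can be sketched for an Ollama server. This is a minimal sketch, assuming Ollama is the runtime (the table below mentions it) and that the server reads the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables; the helper name `ollama_server_env` and the `q8_0` default are illustrative choices, not something these notes prescribe.

```python
import os

# KV-cache types assumed to be accepted by Ollama: f16 (no quantization), q8_0, q4_0.
VALID_KV_CACHE_TYPES = ("f16", "q8_0", "q4_0")


def ollama_server_env(kv_cache_type="q8_0", base_env=None):
    """Return a copy of the environment with Flash Attention and
    KV-cache quantization switched on for `ollama serve`.

    `ollama_server_env` is a hypothetical helper written for this note.
    """
    if kv_cache_type not in VALID_KV_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"           # enable Flash Attention
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type   # quantize the KV cache
    return env


if __name__ == "__main__":
    env = ollama_server_env()
    # Launching the server is left commented out so the sketch has no side effects:
    # import subprocess; subprocess.run(["ollama", "serve"], env=env)
    print(env["OLLAMA_FLASH_ATTENTION"], env["OLLAMA_KV_CACHE_TYPE"])
```

A lower-precision cache such as `q4_0` saves more memory at some cost in quality, so the right value likely depends on the model and context length in use.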
 + 
 + 
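
"Implement caching for repeated queries" could look roughly like the following. It is a sketch under stated assumptions: `PromptCache` and the `generate` callable are hypothetical names, the cache lives in memory only, and reusing answers is only sensible when generation is deterministic (for example temperature 0).

```python
import hashlib
import json


class PromptCache:
    """In-memory cache keyed on (model, prompt, options).

    `generate` is any callable `generate(model, prompt, **options) -> str`;
    it stands in for whatever client actually talks to the model.
    """

    def __init__(self, generate):
        self._generate = generate
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, options):
        # Serialize with sorted keys so option order does not change the key.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "options": options},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def ask(self, model, prompt, **options):
        key = self._key(model, prompt, options)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._generate(model, prompt, **options)
        self._store[key] = answer
        return answer


if __name__ == "__main__":
    def fake_generate(model, prompt, **options):
        # Placeholder backend used only to demonstrate the cache.
        return f"[{model}] {prompt}"

    cache = PromptCache(fake_generate)
    cache.ask("llama3.2", "hello", temperature=0)
    cache.ask("llama3.2", "hello", temperature=0)  # served from the cache
    print(cache.hits, cache.misses)
```

A production setup would probably add eviction and persistence; the point here is only that the model name and options belong in the cache key alongside the prompt.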
 + 
 +^ Hardware ^ Llama 3.3 8B (tokens/sec) ^ Llama 3.3 70B (tokens/sec) ^ Llama 3.2 (tokens/sec) ^ 
 +| RTX 4090 | 89.2 | 12.1 | | 
 +| RTX 3090 | 67.4 | 8.3 | | 
 +| A100 40GB | 156.7 | 45.2 | | 
 +| M3 Max 128GB | 34.8 | 4.2 | | 
 +| Strix Halo 128GB ollama | | 5.1 | 85.02 | 
 +| Strix Halo 128GB llama.cpp | |  | 90 | 
 +| RTX 3060 | | | 131.76 | 
 + 
 + 
 + 
 +^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [tokens/s] ^ prompt eval rate [tokens/s] ^
 +| llama3.2               | completion tools         | "3.2B"   | 131072   | "Q4_K_M"                                                                          | 88.14                | 715.43                      | 
 +| ministral-3:14b        | completion vision tools  | "13.9B"  | 262144   | "Q4_K_M"                                                                          | 23.78                | 302.07                      | 
 +| qwen3-coder:30b        | completion tools         | "30.5B"  | 262144   | "Q4_K_M"                                                                          | 73.75                | 72.41                       | 
 +| llama3:70b | completion   | "70.6B"  | 8192     | "Q4" | 5.55 | 9.72 |  
 +| llava  | completion vision | "7B"   | 32768    | "Q4"  | 49.92                | 207.27                      | 
 +| deepseek-coder-v2:16b  | completion insert        | "15.7B"  | 163840   | "Q4"                                                                            | 84.44                | 111.71                      | 
 +| bjoernb/qwen3-coder-30b-1m:latest  | completion tools | "30.5B"   | 1048576    | "Q4_K_M" | 74.23 | 94.84 | 
 +| freehuntx/qwen3-coder:8b  | completion tools | "8.2B"   | 40960    | "Q4_K_M" | 37.97 | 565.68 | 
 +| networkjohnny/deepseek-coder-v2-lite-base-q4_k_m-gguf:latest  | completion tools | "3.2B"   | 131072    | "Q4_K_M" | 86.02 | 1124.53 | 
 +| phi4-mini  | completion tools | "3.8B"   | 131072    | "Q4_K_M" | 72.24 | 31.37 | 
 +| qwen2.5:7b  | completion tools | "7.6B"   | 32768    | "Q4_K_M" | 42.98 | 153.34 | 
 +| llama3.3:70b-instruct-q4_K_M  | completion tools | "70.6B"   | 131072    | "Q4_K_M" | 5.06 | 15.50 | 
 +| functiongemma  | completion tools | "268.10M" | 32768 | "Q8" | 364.21 | 240.50 | 
 +| danielsheep/Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth  | completion tools | "30.5B"   | 1048576    | "Q4_K_M" | 71.60 | 33.14 | 
 +| gpt-oss:20b  | completion tools thinking | "20.9B"   | 131072    | "MXFP4" | 47.32 | 402.47 |
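
The notes do not say how the two rate columns were measured. One plausible way, assuming the numbers come from Ollama, is to derive them from the `eval_count`, `eval_duration`, `prompt_eval_count` and `prompt_eval_duration` fields of an Ollama generate response (durations are in nanoseconds). The helper names and the sample values below are made up for illustration.

```python
def tokens_per_second(token_count, duration_ns):
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return token_count / (duration_ns / 1e9)


def rates_from_response(response):
    """Compute both table columns from one Ollama-style response dict."""
    return {
        "eval_rate": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
        "prompt_eval_rate": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
    }


if __name__ == "__main__":
    # Invented sample: 500 tokens generated in 10 s, 100 prompt tokens in 0.5 s.
    sample = {
        "eval_count": 500,
        "eval_duration": 10_000_000_000,
        "prompt_eval_count": 100,
        "prompt_eval_duration": 500_000_000,
    }
    print(rates_from_response(sample))
```

Prompt eval rate depends heavily on prompt length and on whether the prompt was already cached, which may explain why that column varies far more between models than the eval rate does.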
  
-- vision: **llama3.2-vision** 
-- coding and agentic: **deepseek-coder-v2:lite** 
-- general reasoning: **llama3.1:8b** 
tips/llm.1765204346.txt.gz · Last modified: by sscipioni