====== LLM ======

  - https://
| + | For Production Deployment: | ||
| + | * Primary Choice: DeepSeek-R1 32B for reasoning-heavy applications | ||
| + | * Coding Tasks: Qwen2.5-Coder 7B for optimal balance of capability and efficiency | ||
| + | * General Purpose: Llama 3.3 70B for maximum versatility | ||
| + | * Edge Computing: Phi-4 14B for resource-constrained environments | ||
| + | |||
| + | Optimization Strategies: | ||
| + | * Always enable **Flash Attention** and KV-cache quantization | ||
| + | * Use **Q4_K_M** quantization for production deployments | ||
| + | * Implement caching for repeated queries | ||
| + | * Monitor GPU memory usage and implement automatic model swapping | ||
| + | * Use load balancing for high-throughput applications | ||
| + | |||
| + | |||
| + | |||
| + | ^ Hardware ^ Llama 3.3 8B (tokens/ | ||
| + | | RTX 4090 | 89.2 | 12.1 | | | ||
| + | | RTX 3090 | 67.4 | 8.3 | | | ||
| + | | A100 40GB | 156.7 | 45.2 | | | ||
| + | | M3 Max 128GB | 34.8 | 4.2 | | | ||
| + | | Strix Halo 128GB ollama | | 5.1 | 85.02 | | ||
| + | | Strix Halo 128GB llama.cpp | | | 90 | | ||
| + | | RTX 3060 | | | 131.76 | | ||
| + | |||
| + | |||
^ model ^ capabilities ^
| phi4-mini | |
| qwen2.5: | |
| llama3.3: | |
| functiongemma | |
| danielsheep/ | |
| gpt-oss: | |