References:
- https://

This report outlines the deployment of the **Ollama LLM runtime** on **Arch Linux**, specifically tailored for the **AMD Ryzen AI Max+ 395 APU**. The primary focus is optimizing performance by leveraging the integrated **Radeon 8060S iGPU** through the **Vulkan** backend, and considering the potential of the **XDNA 2 NPU** for heterogeneous acceleration.

```bash
sudo pacman -S mesa vulkan-radeon lib32-vulkan-radeon vulkan-headers
```

| - | |||
### 3. ROCm (Optional but Recommended)

```bash
# Install essential ROCm packages
yay -S rocm-core
yay -S rocm-hip-sdk
# Allow the current user to access the GPU devices (the video group is assumed
# alongside render; log out and back in for the change to take effect)
sudo usermod -a -G render,video $USER

# xdna: NPU kernel driver and XRT runtime from the AUR
yay -S amdxdna-driver-bin xrt-npu-git
```
| + | |||
| + | **IMPORTANT**: | ||
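Before moving on, it is worth checking that both compute units are actually visible to the system. A minimal sketch, assuming `rocminfo` was pulled in by the ROCm packages above and that the machine has been rebooted so the `amdxdna` module is loaded:

```bash
# ROCm should report the iGPU as an agent (a gfx11xx ISA for the Radeon 8060S)
rocminfo | grep -i gfx

# The XDNA driver registers the NPU with the kernel's accel subsystem
lsmod | grep amdxdna
ls /dev/accel/

# The usermod group change above only takes effect after logging out and back in
groups
```
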
### 4. Memory Configuration
| ----- | ----- | ||

## ⚙️ Performance Tuning Notes
| + | |||
| + | ====== lemonade-server ====== | ||
| + | |||
| + | <code | download> | ||
| + | yay -S lemonade-server | ||
| + | </ | ||
| + | |||
| + | oga-hybrid mode: this splits the work, the NPU handles the prefill (prompt processing), | ||
| + | <code | download> | ||
| + | lemonade-server run Qwen3-Coder-30B-A3B-Instruct-GGUF --recipe oga-hybrid --llamacpp rocm | ||
| + | </ | ||
| + | |||
| + | <code | download> | ||
| + | curl http:// | ||
| + | -H " | ||
| + | -d '{ | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | }' | ||
| + | curl http:// | ||
| + | </ | ||
====== Benchmark ======
<code bash>
# Use CMAKE to enable Vulkan
cmake -DGGML_VULKAN=ON -B build
cmake --build build --config Release

# This command downloads the 4.92 GB model file directly.
wget https://
https://

# bench
./
./
</code>
| + | |||
| + | ollama | ||
| + | <code bash> | ||
| + | ollama run llama3.2 | ||
| + | </ | ||
| + | |||
| + | |||
| + | < | ||
| + | yay -S python-huggingface-hub | ||
| + | hf download Qwen/ | ||
| </ | </ | ||