This repository contains optimised settings for running KoboldCpp on AMD GPUs, tested with an AMD Radeon RX 6700 XT and the Llama 2 7B model.
This setup was developed and tested on an AMD Radeon RX 6700 XT (RDNA2 architecture). Whilst newer RDNA3 and RDNA4 GPUs might handle dual-GPU setups better with Vulkan, RDNA2 GPUs can experience issues when running alongside NVIDIA GPUs. This solution provides a reliable way to:
- Ensure stable operation on RDNA2 GPUs
- Avoid Vulkan-related conflicts in dual GPU setups
- Provide consistent performance regardless of GPU architecture
This setup is particularly useful for systems with multiple GPUs, especially when both AMD and NVIDIA GPUs are installed. In such configurations, other applications might try to use both GPUs with Vulkan, which can lead to conflicts and failures. By using the ROCm version of KoboldCpp, we ensure that the application:
- Specifically targets the AMD GPU
- Avoids conflicts with NVIDIA GPU operations
- Prevents Vulkan-related issues in dual-GPU setups
- Provides stable performance on the AMD GPU
Requirements:
- AMD GPU with ROCm support (tested with an RX 6700 XT)
- Windows 10/11
- Python 3.x
- KoboldCpp ROCm version
Installation:
- Clone the repository:
  git clone https://gitlab.com/CodenameCookie/koboldcpp-amd-rdna2.git
- Navigate to the project directory:
  cd koboldcpp-amd-rdna2
  Alternatively, you can open the project in Visual Studio Code:
  - Open Visual Studio Code
  - Go to File > Open Folder
  - Navigate to where you cloned the repository (e.g., C:\Users\YourUsername\Documents\koboldcpp-amd-rdna2)
  - Click "Select Folder"
  - Open the integrated terminal in VS Code using Ctrl + ` or View > Terminal
- Check if ROCm is already installed:
  - Open PowerShell and run rocm-smi
  - If the command is recognized, ROCm is already installed
  - If not, proceed with ROCm installation (a small Python sketch for scripting this check follows)
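If you prefer to script this check, the following is a minimal sketch (not part of the repository). It only tests whether rocm-smi is reachable on the PATH; a PATH-based install is an assumption here, not an official detection method.

import shutil
import subprocess

# Minimal sketch: look for rocm-smi on the PATH and, if present, print GPU status.
# Assumes the ROCm tools were added to PATH during installation.
rocm_smi = shutil.which("rocm-smi")
if rocm_smi is None:
    print("rocm-smi not found on PATH - install ROCm before continuing")
else:
    print(f"Found rocm-smi at {rocm_smi}")
    subprocess.run([rocm_smi], check=False)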
- Install ROCm for Windows (if not already installed):
  - Download and install ROCm from AMD's official website
  - Follow the installation guide for Windows
  - Make sure your GPU is supported by the installed ROCm version
- Download KoboldCpp ROCm:
  - Download the latest release from YellowRoseCx/koboldcpp-rocm
  - For Windows, download koboldcpp_rocm.exe (single file) or koboldcpp_rocm_files.zip
  - If using the zip file, extract it to your desired location
  - Place koboldcpp_rocm.exe in the root directory of this project
- Download the Llama 2 7B Chat model:
  - Create a models directory if it doesn't exist:
    mkdir -Force models
  - Download the GGUF version of Llama 2 7B Chat using PowerShell:
    Invoke-WebRequest -Uri "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf" -OutFile "models\llama-2-7b-chat.gguf"
  - Note: the download is approximately 4 GB and may take some time depending on your internet connection
  - Alternative: you can manually download the model from TheBloke's HuggingFace repository and place it in the models folder (a scripted alternative is sketched below)
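As a scripted alternative to the manual download, here is a minimal sketch using the huggingface_hub Python package (an extra dependency, installed with pip install huggingface_hub). It is an illustration rather than part of this repository.

from huggingface_hub import hf_hub_download

# Minimal sketch: fetch the same GGUF file as the Invoke-WebRequest command above.
# Note the downloaded file keeps its original name (llama-2-7b-chat.Q4_K_M.gguf),
# so either rename it to models\llama-2-7b-chat.gguf or point --model at this path.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models",
)
print(f"Model downloaded to {path}")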
Model details:
- Model: Llama 2 7B Chat
- Format: GGUF
- Size: 3.80 GiB
- Context Size: 2048
- Total Layers: 32
After testing various configurations, we found the optimal settings for the RX 6700 XT:
.\koboldcpp_rocm.exe --model .\models\llama-2-7b-chat.gguf --host 127.0.0.1 --port 5001 --contextsize 2048 --gpulayers 30 --blasbatchsize 2048 --blasthreads 4 --highpriority --usecublas mmq
- --gpulayers 30: Offloads 30 layers to the GPU (optimal for this 32-layer model)
- --blasbatchsize 2048: Maximum batch size for better GPU utilization
- --blasthreads 4: Reduced thread count to prevent CPU bottlenecks
- --highpriority: Improves CPU allocation
- --usecublas mmq: Enables Matrix Multiplication Quantization (MMQ) through hipBLAS
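If you want to script the launch, for example to compare different --gpulayers values as in the benchmarks below, a minimal Python sketch follows. It simply rebuilds the command line above and is not part of the repository.

import subprocess

def start_koboldcpp(gpu_layers: int = 30) -> subprocess.Popen:
    # Rebuilds the optimised command above; only --gpulayers is configurable here.
    cmd = [
        r".\koboldcpp_rocm.exe",
        "--model", r".\models\llama-2-7b-chat.gguf",
        "--host", "127.0.0.1",
        "--port", "5001",
        "--contextsize", "2048",
        "--gpulayers", str(gpu_layers),
        "--blasbatchsize", "2048",
        "--blasthreads", "4",
        "--highpriority",
        "--usecublas", "mmq",
    ]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    server = start_koboldcpp(30)  # same settings as the command above
    server.wait()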
Previous configurations:
- 43 layers: 17.69s, 2.94 tokens/s
- 27 layers: 15.76s, 3.05 tokens/s
- 20 layers: 16.54s, 3.14 tokens/s
Optimized configuration:
- 30 layers with MMQ: 6.29s, 7.79 tokens/s
Usage:
- Stop any existing KoboldCpp processes (only needed if you have already run it):
  Get-Process -Name koboldcpp_rocm -ErrorAction SilentlyContinue | Stop-Process -Force
- Start KoboldCpp with the optimised settings:
  .\koboldcpp_rocm.exe --model .\models\llama-2-7b-chat.gguf --host 127.0.0.1 --port 5001 --contextsize 2048 --gpulayers 30 --blasbatchsize 2048 --blasthreads 4 --highpriority --usecublas mmq
- Test the performance:
  python test_inference.py
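The contents of test_inference.py are not reproduced here; the sketch below shows roughly what such a timing test can look like, assuming the server above is listening on 127.0.0.1:5001 and exposes KoboldCpp's KoboldAI-compatible /api/v1/generate endpoint. Treat it as an illustration, not the repository's actual script.

import json
import time
import urllib.request

# Illustrative timing test (not the repository's actual test_inference.py).
URL = "http://127.0.0.1:5001/api/v1/generate"
payload = {
    "prompt": "Explain what a GGUF model file is in one paragraph.",
    "max_length": 128,
    "temperature": 0.7,
}

start = time.time()
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
elapsed = time.time() - start

print(result["results"][0]["text"])
# max_length / elapsed is only an upper bound on generation speed,
# since the reply may be shorter and elapsed includes prompt processing.
print(f"Elapsed: {elapsed:.2f}s (<= {payload['max_length'] / elapsed:.2f} tokens/s)")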
Notes:
- The model requires approximately 3.80 GiB of VRAM
- The optimised settings use hipBLAS for better GPU utilisation
- High priority mode is recommended for better CPU allocation
- The context size of 2048 provides a good balance between performance and memory usage
We welcome contributions to improve this setup! Here's how you can help:
Reporting issues:
- Please check if the issue has already been reported
- Include your system specifications (GPU model, ROCm version, etc.)
- Provide detailed steps to reproduce the issue
- Include any error messages or logs
Submitting changes:
- Fork the repository
- Create a new branch for your feature (git checkout -b feature/amazing-feature)
- Make your changes
- Test thoroughly with your AMD GPU setup
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Code guidelines:
- Follow the existing code style
- Keep code comments clear and concise
- Update documentation for any new features
Testing:
- Test changes with different AMD GPU models
- Verify performance improvements
- Check compatibility with different ROCm versions
- Ensure no regressions in existing functionality
By contributing, you agree that your contributions will be licensed under the same terms as the project.