Docker Method

Setting up Docker on Windows

You need to enable --net=host; follow this guide so that you can easily access the services running in the container. A WSL kernel of version 6.1.x or newer is recommended; otherwise, you may hit a blocking issue before the model is loaded onto the GPU.
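A quick way to confirm which kernel your WSL distribution is running (a minimal sketch; anything reporting 6.1.x or newer should be fine):

# Inside the WSL distribution:
uname -r
# Or, from a Windows prompt, check the kernel shipped with WSL:
wsl.exe --version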

# This image will be updated every day
docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest

# For Windows WSL users:
# To map the XPU into the container, specify --device=/dev/dri when starting the container, and change /path/to/models to mount your models. On WSL, also add --privileged and mount /usr/lib/wsl into the container.

#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
-v /root/ggml/models:/models \
-v /usr/lib/wsl:/usr/lib/wsl \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
-e bench_model="mistral-7b-v0.1.Q4_0.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
# After the container is booted, you could get into the container through docker exec.
docker exec -it ipex-llm-inference-cpp-xpu-container /bin/bash
# To verify the device is successfully mapped into the container, run sycl-ls to check the result. On a machine with an Arc A770, a sample output is:
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]

Running llama.cpp inference with IPEX-LLM on Intel GPU

cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
# mount models and change the model_path in `start-llama-cpp.sh`
bash start-llama-cpp.sh

The example output looks like:

llama_print_timings:        load time =    xxx ms
llama_print_timings: sample time = xxx ms / 32 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens

Please refer to this documentation for more details.

Running Ollama serving with IPEX-LLM on Intel GPU

Run Ollama in the background; you can check the log at /root/ollama/ollama.log

cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
bash start-ollama.sh
# Ctrl+C to exit; ollama serve will keep running in the background

Sample output:

time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:697 msg="total blobs: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-05-16T10:45:33.537+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama751325299/runners
time=2024-05-16T10:45:33.565+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-16T10:45:33.565+08:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-16T10:45:33.566+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
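If you want to keep following this log after start-ollama.sh returns to the prompt, a minimal sketch:

# Follow the Ollama serve log inside the container
tail -f /root/ollama/ollama.log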

Run Ollama models (interactive)

cd /llm/ollama
# create a file named Modelfile
FROM /models/mistral-7b-v0.1.Q4_0.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64

# create example and run it on console
./ollama create example -f Modelfile
./ollama run example

You can then interact with the model interactively via ollama run example.
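Alternatively, you can send a single prompt to the running serve process over the REST API; a minimal sketch (same endpoint and payload format as the curl example later in this guide, with example being the model just created):

curl http://localhost:11434/api/generate -d '{
  "model": "example",
  "prompt": "Why is the sky blue?",
  "stream": false
}'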

Running Open WebUI with Intel GPU

Start Ollama and load the model first, then use Open WebUI to chat. If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add export HF_ENDPOINT=https://hf-mirror.com before running bash start.sh.

cd /llm/scripts/
bash start-open-webui.sh

Sample output:

INFO:     Started server process [1055]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

For how to log in and other guidance, please refer to this documentation for more details.

Install Prerequisites

  1. Install/update the Intel® Arc™ & Iris® Xe Graphics - Windows* GPU driver

  2. Install the GPU driver on Linux

    sudo apt-get install -y gpg-agent wget
    # Add the online network package repository.
    . /etc/os-release
    if [[ ! " jammy " =~ " ${VERSION_CODENAME} " ]]; then
    echo "Ubuntu version ${VERSION_CODENAME} not supported"
    else
    wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
    sudo gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" | \
    sudo tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
    sudo apt update
    fi
    # Install kernel and Intel® XPU System Management Interface (XPU-SMI) packages on a bare metal system.
    # Installation on the host is sufficient for hardware management and support of the runtimes in containers and bare metal.
    sudo apt install -y \
    linux-headers-$(uname -r) \
    linux-modules-extra-$(uname -r) \
    flex bison \
    intel-fw-gpu intel-i915-dkms xpu-smi
    sudo reboot

    # Install packages responsible for computing and media runtimes.
    sudo apt install -y gawk libc6-dev udev \
    intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
    libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
    libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
    mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
    # Install development packages.
    sudo apt install -y \
    libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
    # The user must belong to specific groups to access certain GPU functionality
    stat -c "%G" /dev/dri/render*
    groups ${USER}
    # Add your user to the render node group.
    sudo gpasswd -a ${USER} render
    # Change the group ID of the current shell.
    newgrp render

    # Verify the device is working with i915 driver
    sudo apt-get install -y hwinfo
    hwinfo --display

    # Check
    sycl-ls
  3. Download and install Intel® oneAPI Base Toolkit

    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
    sudo apt update
    # sudo apt install intel-basekit
    sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
    intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
    intel-oneapi-diagnostics-utility=2024.0.0-49093 \
    intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
    intel-oneapi-dpcpp-ct=2024.0.0-49381 \
    intel-oneapi-mkl=2024.0.0-49656 \
    intel-oneapi-mkl-devel=2024.0.0-49656 \
    intel-oneapi-mpi=2021.11.0-49493 \
    intel-oneapi-mpi-devel=2021.11.0-49493 \
    intel-oneapi-dal=2024.0.1-25 \
    intel-oneapi-dal-devel=2024.0.1-25 \
    intel-oneapi-ippcp=2021.9.1-5 \
    intel-oneapi-ippcp-devel=2021.9.1-5 \
    intel-oneapi-ipp=2021.10.1-13 \
    intel-oneapi-ipp-devel=2021.10.1-13 \
    intel-oneapi-tlt=2024.0.0-352 \
    intel-oneapi-ccl=2021.11.2-5 \
    intel-oneapi-ccl-devel=2021.11.2-5 \
    intel-oneapi-dnnl-devel=2024.0.0-49521 \
    intel-oneapi-dnnl=2024.0.0-49521 \
    intel-oneapi-tcm-1.0=1.0.0-435

    # You can uninstall the package by running the following command:
    sudo apt autoremove intel-oneapi-common-vars

Set Up the Python Environment

Visit the Miniforge installation page, download the Miniforge installer for Windows, and follow the instructions to complete the installation.

# Install Miniforge on Windows
scoop install miniforge

# Download on Linux
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# Miniforge3-Linux-x86_64.sh
bash Miniforge3-$(uname)-$(uname -m).sh

# The installer asks whether to add conda to the PATH environment variable. If you want to use conda from the command line, selecting this option is recommended.
# You can undo this by running `conda init --reverse $SHELL`? [yes|no]
# If you chose "no" during installation but want to initialize afterwards, you can do so with the following commands:
source /usr/local/src/anaconda3/bin/activate
conda init

## re-open your current shell
exec $SHELL

## List virtual environments
conda env list
conda info --envs

## Remove an environment
conda remove -n <env_name> --all

## Activate the base environment
conda activate

## List packages installed in the environment
pip list

# Don't auto-activate the base environment when starting a shell.
conda config --set auto_activate_base false

Configure conda Channels

# First, check which channels conda is currently configured with
conda info
# Then remove the channels key to restore the default channels
conda config --remove-key channels

# Add specific channels
conda config --add channels <channel_url>  # replace with the channel you want to add
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
# Show the channel URL when installing packages (recommended)
conda config --set show_channel_urls yes
# Remove a specific channel
conda config --remove channels <channel_url>  # replace with the channel you want to remove
# Clear the index cache
conda clean -i
# Run the following commands in the Miniforge Prompt with administrator privileges. This generates the .condarc file.
conda config --set show_channel_urls yes
# A .condarc file is generated in the user environment, e.g. under /home/sun/miniforge3
vim /home/sun/miniforge3/.condarc
# Copy the following into the .condarc file; the last two lines change the default path where conda creates environments.
channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  deepmodeling: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/
envs_dirs:
  - D:\ProgramData\miniconda3\envs

# After saving the file, run the following command to clear the index cache so that the index from the mirror is used.
conda clean -i
# To restore the default conda channels, delete the contents of .condarc or comment them out with #
cd ~/.config/pip

conda config --remove-key channels

On Windows, you need to grant modify permission on the path where conda creates environments, as sketched below.
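For example, a hypothetical sketch using icacls to grant the current user modify rights on the envs_dirs path configured above (run from an elevated prompt, and adjust the path to your own setup):

# Grant modify permission, inherited by subfolders and files
icacls "D:\ProgramData\miniconda3\envs" /grant %USERNAME%:(OI)(CI)M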

Change the pip Index

# Tsinghua mirror (global configuration)
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# If you need to run pip as root
vim /root/.config/pip/pip.conf
[global]
allow-root = true

Changing the pip install path:

If the virtual environment was created with conda, you can find the site.py file under the environment's path. The same applies to environments installed in other ways; you need to locate site.py. Open the file and search for USER_SITE, around line 88.
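To locate the site.py used by the currently active environment, a minimal sketch:

# Print the path of the site module used by this interpreter
python -c "import site; print(site.__file__)"
# python -m site also prints the current USER_SITE and USER_BASE values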

# USER_SITE is the package installation path
# USER_BASE is the path where pip-installed scripts are placed
# My changes are as follows
USER_SITE = "D:\ProgramData\miniconda3\envs\python3.8\Lib\site-packages"
USER_BASE = "D:\ProgramData\miniconda3\envs\python3.8\Scripts"
# Save and exit to complete the change. Check the result with the following command
python -m site
# Restore the default index
pip config set global.index-url https://pypi.org/simple

Create the Python Environment

After installation, open the Miniforge Prompt and create a new Python environment named llm

# Create the Python environment llm
conda create -n llm python=3.11 libuv
# Activate the newly created llm environment:
conda activate llm
# Deactivate
conda deactivate

Activate the oneAPI Environment

# Run on Linux
conda activate llm
export PYTHONUSERBASE=~/intel/oneapi
pip install dpcpp-cpp-rt==2023.2.0 mkl-dpcpp==2023.2.0 onednn-cpu-dpcpp-gpu-dpcpp==2023.2.0 --user
conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/lib -n llm
conda deactivate
# Activate the environment
conda activate llm
# Check
sycl-ls

Install ipex-llm

With the llm environment active, use pip to install ipex-llm

conda activate llm
# For US:
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# For CN:
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

Verify the Installation

  • Open the Miniforge Prompt and activate the Python environment llm you created earlier

  • For Intel iGPU:

    set SYCL_CACHE_PERSISTENT=1
    set BIGDL_LLM_XMX_DISABLED=1
  • For Intel Arc™ A770:

    set SYCL_CACHE_PERSISTENT=1
  • Run the Python code

    # Type python in the Miniforge Prompt window and press Enter to start the Python interactive shell.
    python
    # Copy the following code into the Miniforge Prompt line by line, pressing Enter after each line.
    import torch
    from ipex_llm.transformers import AutoModel,AutoModelForCausalLM
    tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
    tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
    print(torch.matmul(tensor_1, tensor_2).size())

    # The final output is as follows:
    torch.Size([1, 1, 40, 40])

    # To exit the Python interactive shell, press Ctrl+Z and then Enter (or type exit() and then press Enter).

Set Up for Running llama.cpp

Create a directory to use for llama.cpp

# Run the following commands in the Miniforge Prompt with administrator privileges.
conda activate llm
pip install --pre --upgrade ipex-llm[cpp] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
mkdir c:\llama-cpp
cd c:\llama-cpp
# Initialize llama.cpp with IPEX-LLM
init-llama-cpp.bat
# You can now use these executables just like standard llama.cpp.

To use GPU acceleration, several environment variables are required or recommended before running llama.cpp.

# Run the following commands in the Miniforge Prompt.
set SYCL_CACHE_PERSISTENT=1
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

Example: Run a GGUF Model

# Download or copy a community GGUF model into the current directory
# Run the following commands in the Miniforge Prompt with administrator privileges.
conda activate llm
# Run the quantized model. For more details on what each parameter means, use main -h.
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color

How to set the -ngl parameter: -ngl is the number of layers to keep in VRAM. If your VRAM is sufficient, we recommend putting all layers on the GPU; you can simply set -ngl to a large number such as 999 to achieve this. If -ngl is set to 0, the entire model runs on the CPU. If -ngl is greater than 0 and less than the number of model layers, it is a hybrid GPU + CPU scheme; see the sketch below.
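For illustration, reusing the command above (a sketch; the exact layer count depends on your model and VRAM):

# All layers offloaded to the GPU (VRAM permitting)
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time" -t 8 -e -ngl 999 --color
# Entire model on the CPU
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time" -t 8 -e -ngl 0 --color
# Hybrid GPU + CPU, e.g. only the first 20 layers offloaded
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time" -t 8 -e -ngl 20 --color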

If your machine has multiple GPUs, llama.cpp uses all of them by default, which may slow down inference for models that fit on a single GPU. You can add -sm none to the command to use only one GPU. You can also set ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id] before running the command to select a device; more details can be found here.
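For example (a sketch; replace 0 with the gpu_id reported by sycl-ls):

# Use only one GPU instead of splitting the model across all of them
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time" -t 8 -e -ngl 999 --color -sm none
# Or pin execution to a specific Level Zero device before running the command
set ONEAPI_DEVICE_SELECTOR=level_zero:0
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time" -t 8 -e -ngl 999 --color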

Run Ollama

Set the environment variable OLLAMA_NUM_GPU to 999 to make sure all layers of the model run on the Intel GPU; otherwise, some layers may run on the CPU.

To allow the service to accept connections from all IP addresses, start it with OLLAMA_HOST=0.0.0.0 instead of the plain ./ollama serve.
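For example, in the same Miniforge Prompt (a sketch):

# Bind the serve process to all interfaces before starting it
set OLLAMA_HOST=0.0.0.0
ollama serve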

# Run the following commands in the Miniforge Prompt with administrator privileges.
conda activate llm
cd c:\llama-cpp
init-ollama.bat
# Run Ollama Serve
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
ollama serve

# Pull a model. Keep the Ollama service running and open another terminal. For example, dolphin-phi:latest:
./ollama pull <model_name>

# Use Ollama
curl http://localhost:11434/api/generate -d "
{
\"model\": \"<model_name>\",
\"prompt\": \"Why is the sky blue?\",
\"stream\": false
}"


Run GGUF Models with Ollama

Assuming you have already downloaded mistral-7b-instruct-v0.1.Q4_K_M.gguf, create a file named Modelfile

FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64

Then you can create the model in Ollama

set no_proxy=localhost,127.0.0.1
ollama create example -f Modelfile
ollama run example