NVIDIA+ollama

安装 GPU 驱动程序

这是您唯一需要安装的驱动程序。不要在 WSL 中安装任何 Linux 显示驱动程序。

下载并安装支持 NVIDIA CUDA 的 WSL 驱动程序，以用于现有的 CUDA ML 工作流。

安装 WSL2

安装CUDA Toolkit

从这一点开始，您应该能够运行任何需要 CUDA 的现有 Linux 应用程序。不要在 WSL 环境中安装任何驱动程序。要构建 CUDA 应用程序，您将需要 CUDA 工具包。

在系统上安装 Windows NVIDIA GPU 驱动程序后，CUDA 将在 WSL 2 中可用。安装在 Windows 主机上的 CUDA 驱动程序将在 WSL 2 中存根，因此用户不得在 WSL 2 中安装任何 NVIDIA GPU Linux 驱动程序。这里必须非常小心，因为默认的 CUDA 工具包附带驱动程序，并且很容易使用默认安装覆盖 WSL 2 NVIDIA 驱动程序。建议开发人员使用单独的 CUDA Toolkit for WSL 2 （Ubuntu）下载页面中提供的 CUDA Toolkit，以避免此覆盖。

首先，删除旧的 GPG 密钥：

sudo apt-key del 7fa2af80
sudo apt install gcc dkms -y
## 方式一 Installer Type runfile
wget https://developer.download.nvidia.com/compute/cuda/12.5.1/local_installers/cuda_12.5.1_555.42.06_linux.run
sudo sh cuda_12.5.1_555.42.06_linux.run


## 方式二 deb方式
sudo cp cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo dpkg -i cuda-repo-wsl-ubuntu-12-5-local_12.5.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5

WSL2 安装 Docker

参考历史文章 WSL2安装Docker

安装 NVIDIA 容器工具包

配置生产存储库：

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

从存储库更新软件包列表：
1
sudo apt-get update

安装 NVIDIA Container Toolkit 软件包：

1	sudo apt-get install -y nvidia-container-toolkit

使用以下命令配置容器运行时：nvidia-ctk

# 该命令修改主机上的/etc/docker/daemon.json，以便 Docker 可以使用 NVIDIA 容器运行时。
sudo nvidia-ctk runtime configure --runtime=docker
# 重新启动 Docker 守护程序：
sudo systemctl restart docker

测试

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
sudo docker run --rm  --gpus=all ubuntu nvidia-smi -L
sudo docker run -it --gpus=all --rm nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
sudo docker run --gpus=all --rm nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
sudo docker run --rm --runtime nvidia --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime nvidia-smi

docker-desktop

如果使用的是WSL+dockerdesktop

{
  "builder": {
    "gc": {
      "defaultKeepStorage": "20GB",
      "enabled": true
    }
  },
  "experimental": false,
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

问题1

1
2
3

root@DESKTOP-8DE2K5S:/mnt/c/Users/fulsun# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system
Failed to properly shut down NVML: GPU access blocked by the operating system

解决方案：在 WSL2 的 docker 容器中：操作系统阻止的 GPU 访问 ·问题 #9962 ·微软/WSL ·GitHub上

1	在文件 /etc/nvidia-container-runtime/config.toml 中，将 no-cgroups 从 true 更改为 false

问题2

root@DESKTOP-8DE2K5S:/mnt/g/大模型/nvida/cuda# nvidia-smi
Sat Jul 20 01:37:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
Segmentation fault

错误问题 nvidia-smi segmentation fault in wsl2 but not in Windows · Issue #11277 · microsoft/WSL · GitHub

下载安装537版本的驱动：https://developer.nvidia.com/downloads/vulkan-beta-53796-windows

Ollama 安装

https://ollama.com/
https://github.com/ollama/ollama/blob/main/docs/linux.md
https://github.com/ollama/ollama/blob/main/docs/faq.md

Docker方式安装

docker pull ollama/ollama
# 默认存放在宿主机的 /var/lib/docker/volumes/数据卷名称/_data
docker volume create ollama
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Nvidia GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --env OLLAMA_KEEP_ALIVE=60m --name ollama ollama/ollama
# AMD GPU
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
# 本地运行模型
docker exec -it ollama ollama run gemma:2b
docker exec -it ollama /bin/bash
# 设置自动启动
docker update --restart=always 容器ID(或者容器名)


## 环境变量查看
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
2f2104f18c08f37b4e79e3a5aa35f2b9f4240659f6a1786e6d4b504aeb5b2e03
docker exec 2f2104f1 env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2f2104f18c08
OLLAMA_HOST=0.0.0.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_VISIBLE_DEVICES=all
HOME=/root


# 导出导入
# docker save保存的是镜像（image），docker export保存的是容器（container）；
# docker load用来载入镜像包，docker import用来载入容器包，但两者都会恢复为镜像；
# docker load不能对载入的镜像重命名，而docker import可以为镜像指定新名称。
cd /mnt/h/大模型/docker/
docker save -o ollama.tar ollama/ollama
docker load -i ollama.tar

在Linux上设置环境变量(非docker)

OLLAMA_HOST：这个变量定义了Ollama监听的网络接口。通过设置OLLAMA_HOST=0.0.0.0，我们可以让Ollama监听所有可用的网络接口，从而允许外部网络访问。
OLLAMA_MODELS：这个变量指定了模型镜像的存储路径。通过设置OLLAMA_MODELS=E:\OllamaCache，我们可以将模型镜像存储在E盘，避免C盘空间不足的问题。
OLLAMA_KEEP_ALIVE：这个变量控制模型在内存中的存活时间。设置OLLAMA_KEEP_ALIVE=24h可以让模型在内存中保持24小时，提高访问速度。
OLLAMA_PORT：这个变量允许我们更改Ollama的默认端口。例如，设置OLLAMA_PORT=8080可以将服务端口从默认的11434更改为8080。
OLLAMA_NUM_PARALLEL：这个变量决定了Ollama可以同时处理的用户请求数量。设置OLLAMA_NUM_PARALLEL=4可以让Ollama同时处理两个并发请求。
OLLAMA_MAX_LOADED_MODELS：这个变量限制了Ollama可以同时加载的模型数量。设置OLLAMA_MAX_LOADED_MODELS=4可以确保系统资源得到合理分配。

sudo vi /etc/systemd/system/ollama.service
# 添加以下内容
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=60m"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
ExecStart=/usr/local/bin/ollama serve

# 重载 systemd 并重启 Ollama：
systemctl daemon-reload
sudo systemctl start ollama

# 检查
curl http://localhost:11434
显示 Ollama is running

# 下载模型
ollama pull gemma:2b
ollama list
ollama run gemma:2b

导入模型

导入 (GGUF)

首先创建一个Modelfile。此文件是模型的蓝图，用于指定权重、参数、提示模板等。

cp /mnt/h/大模型/模型/gguf/qwen2-7b-instruct-q4_k_s.gguf /var/lib/docker/volumes/ollama/_data/
docker exec -it ollama /bin/bash
cd /root/.ollama/
vi Modelfile
FROM ./qwen2-7b-instruct-q4_k_s.gguf
# （可选）许多聊天模型需要提示模板才能正确回答
FROM ./qwen2-7b-instruct-q4_k_s.gguf
TEMPLATE "[INST] {{ .Prompt }} [/INST]"

创建 Ollama 模型

1 2	ollama create qwen2-7b-instruct-q4_k_s.gguf -f Modelfile transferring model data ⠋

使用以下命令测试模型：ollama run

1	ollama run qwen2-7b-instruct-q4_k_s.gguf "What is your favourite condiment?"

导入（PyTorch 和 Safetensors）

从 PyTorch 和 Safetensors 导入的过程比从 GGUF 导入要长。使它更容易的改进正在进行中。

首先，克隆存储库：ollama/ollama

1 2	git clone git@github.com:ollama/ollama.git ollama cd ollama

然后获取其子模块：llama.cpp

1 2	git submodule init git submodule update llm/llama.cpp

接下来，安装 Python 依赖项：

1
2
3

python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt

然后构建工具：quantize
1
make -C llm/llama.cpp quantize

如果模型当前托管在 HuggingFace 存储库中，请首先克隆该存储库以下载原始模型。（可选）

1
2
3

# 安装Git LFS，验证它是否已安装，然后克隆模型的存储库：
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 model

转换模型 注意：某些模型架构需要使用特定的转换脚本。例如，Qwen 模型需要运行convert-hf-to-gguf.py而不是convert.py
1
python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin

量化模型

1	llm/llama.cpp/quantize converted.bin quantized.bin q4_0

为您的模型创建一个：Modelfile

1 2	FROM quantized.bin TEMPLATE "[INST] {{ .Prompt }} [/INST]"

创建 Ollama 模型
1
ollama create example -f Modelfile

运行模型

1	ollama run example "What is your favourite condiment?"

调用模型

Curl调用模型

1	curl http://localhost:11434/api/chat -d '{"model": "gemma:2b","messages": [{ "role": "user", "content": "who are you" }]}'

Python调用模型

（Python代码调用，建议在conda下运行)

vim test.py

import requests
import json

def send_message_to_ollama(message, port=11434):
url = f"http://localhost:{port}/api/chat"
payload = {
"model": "gemma:2b",
"messages": [{"role": "user", "content": message}]
}
response = requests.post(url, json=payload)
if response.status_code == 200:
response_content = ""
for line in response.iter_lines():
if line:
response_content += json.loads(line)["message"]["content"]
return response_content
else:
return f"Error: {response.status_code} - {response.text}"

if __name__ == "__main__":
user_input = "who are you"
response = send_message_to_ollama(user_input)
print("Ollama's response:")
print(response)

在 conda环境中运行Python脚本

conda activate XXX
pip install requests
python test.py
conda deactivate

Open WebUI 部署

https://openwebui.com/
https://github.com/open-webui/open-webui

Docker方式安装

docker images
docker pull ghcr.io/open-webui/open-webui:main

cd /mnt/h/大模型/docker/
docker save -o open-webui.tar ghcr.io/open-webui/open-webui:main
docker load -i open-webui.tar

# Ollama 在同一台电脑
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
# host模式
docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui88 --restart always ghcr.io/open-webui/open-webui:main

# Ollama 在不同一台电脑
docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=https://example.com -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

# 更新 open-webui
docker run --rm --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --run-once open-webui

部署后启动（时间有点长）打开网站 http://127.0.0.1:3000先注册，选择模式（正常情况，会显示出刚才ollama已经下载好的模型）

Installation with `pip` (Beta)

For users who prefer to use Python’s package manager pip, Open WebUI offers a installation method. Python 3.11 is required for this method.

Install Open WebUI: Open your terminal and run the following command:
1
pip install open-webui
Start Open WebUI: Once installed, start the server using:
1
open-webui serve

This method installs all necessary dependencies and starts Open WebUI, allowing for a simple and efficient setup. After installation, you can access Open WebUI at http://localhost:8080. Enjoy! 😄

Install from Open WebUI Github Repo

INFO

Open WebUI consists of two primary components: the frontend and the backend (which serves as a reverse proxy, handling static frontend files, and additional features). Both need to be running concurrently for the development environment.

Requirements 📦

🐰 Node.js >= 20.10
🐍 Python >= 3.11

Build and Install 🛠️

Run the following commands to install:

git clone https://github.com/open-webui/open-webui.git
cd open-webui/

# Copying required .env file
cp -RPp .env.example .env

# Building Frontend Using Node
npm i
npm run build

# Serving Frontend with the Backend
cd ./backend
pip install -r requirements.txt -U
bash start.sh

You should have Open WebUI up and running at http://localhost:8080/. Enjoy! 😄

ollama环境参数

OLLAMA_HOST：这个变量定义了Ollama监听的网络接口。通过设置OLLAMA_HOST=0.0.0.0，我们可以让Ollama监听所有可用的网络接口，从而允许外部网络访问。
OLLAMA_MODELS：这个变量指定了模型镜像的存储路径。通过设置OLLAMA_MODELS=F:\OllamaCache，我们可以将模型镜像存储在E盘，避免C盘空间不足的问题。
OLLAMA_KEEP_ALIVE：这个变量控制模型在内存中的存活时间。设置OLLAMA_KEEP_ALIVE=24h可以让模型在内存中保持24小时，提高访问速度。
OLLAMA_PORT：这个变量允许我们更改Ollama的默认端口。例如，设置OLLAMA_PORT=8080可以将服务端口从默认的11434更改为8080。
OLLAMA_NUM_PARALLEL：这个变量决定了Ollama可以同时处理的用户请求数量。设置OLLAMA_NUM_PARALLEL=4可以让Ollama同时处理两个并发请求。
OLLAMA_MAX_LOADED_MODELS：这个变量限制了Ollama可以同时加载的模型数量。设置OLLAMA_MAX_LOADED_MODELS=4可以确保系统资源得到合理分配。

安装 GPU 驱动程序

安装 WSL2

安装CUDA Toolkit

WSL2 安装 Docker

安装 NVIDIA 容器工具包

docker-desktop

问题1

问题2

Ollama 安装

Docker方式安装

在Linux上设置环境变量(非docker)

导入模型

导入 (GGUF)

导入（PyTorch 和 Safetensors）

调用模型

Curl调用模型

Python调用模型

Open WebUI 部署

Docker方式安装

Installation with pip (Beta)

Install from Open WebUI Github Repo

Requirements 📦

Build and Install 🛠️

ollama环境参数

Installation with `pip` (Beta)