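2. Installation and Setup

This chapter gets Triton installed and runs a first example. If you have already installed PyTorch, Triton most likely came along with it; running import triton is enough to confirm.

2.1 Requirements

2.1.1 Operating system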

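System	Support
Linux (Ubuntu 22.04+ and other mainstream distributions)	✅ Fully supported by the official mainline
Windows	⚠️ Not supported by mainline; use the community fork `triton-windows` or WSL2
macOS (Intel / Apple Silicon)	❌ Not supported; use a remote Linux GPU machine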
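
As a rough illustration of the table above, a setup script could branch on the value of platform.system(). This is only a sketch: the function name triton_os_hint and the hint strings are made up for this example and simply paraphrase the table.

```python
import platform

# Hints paraphrased from the support table; the wording is illustrative.
SUPPORT = {
    "Linux": "supported by upstream Triton",
    "Windows": "use the triton-windows fork or WSL2",
    "Darwin": "not supported; use a remote Linux GPU machine",
}


def triton_os_hint(system=None):
    """Return a one-line hint for a platform.system() value."""
    system = system or platform.system()
    return SUPPORT.get(system, "unknown platform; check the official docs")


print(f"{platform.system()}: {triton_os_hint()}")
```

Inside WSL2, platform.system() reports Linux, so the Linux branch applies there.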

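What about Mac users?

No NVIDIA or AMD GPU backend is available on a Mac. Options:

Use Google Colab (free T4; A100 available on paid plans)
Use Kaggle Notebooks (30 hours of free GPU time per week)
Build or rent a Linux GPU server and develop locally with VSCode + Remote-SSH
While writing code, set TRITON_INTERPRET=1 to run in CPU interpreter mode for debugging (very slow; for troubleshooting only)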

2.1.2 Python version

CPython 3.10 – 3.14 is officially supported. 3.11 or 3.12 is strongly recommended, since the surrounding ecosystem is most stable there.
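
A minimal sketch of checking the running interpreter against that range. The bounds are copied from the sentence above, and the helper name python_in_supported_range is hypothetical.

```python
import sys

# Bounds taken from the text above; adjust if the official docs change.
MIN_PY = (3, 10)
MAX_PY = (3, 14)


def python_in_supported_range(version=None):
    """True if (major, minor) falls inside the documented range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor <= MAX_PY


current = ".".join(str(part) for part in sys.version_info[:2])
status = "ok" if python_in_supported_range() else "outside the documented range"
print(f"Python {current}: {status}")
```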

Version conflict note

Some tutorials from early 2024 list Python 3.8–3.12. That information is outdated; follow the official installation documentation.

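2.1.3 GPU and driver

NVIDIA GPU:

Compute Capability ≥ 7.5 (Turing and later: RTX 20 series, T4, Quadro RTX, and so on)
Since Triton 3.3, architectures older than 7.5 (Pascal, Volta) are officially no longer supported
Ampere (8.0+) or newer is recommended (A100, RTX 30 series, RTX 40 series, H100, RTX 50 series) for the most complete feature support
NVIDIA driver: must be recent enough to support CUDA 12.x. The CUDA Toolkit itself is not required (Triton wheels bundle the pieces they need, such as ptxas, and load compiled kernels through the driver)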

AMD GPU:

Requires ROCm 5.7 or newer
Data center cards: MI210 / MI250 / MI300 series
Consumer cards: limited support through the "ROCm on Radeon" path

How do I check my GPU's compute capability?

bash

nvidia-smi --query-gpu=name,compute_cap --format=csv
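
The CSV that this command prints can be checked against the 7.5 minimum programmatically. The sketch below parses a hard-coded sample string; the sample rows are illustrative rather than captured output, and parse_compute_caps is a hypothetical helper name.

```python
import csv
import io

MIN_CC = (7, 5)  # minimum compute capability quoted in this section


def parse_compute_caps(csv_text):
    """Parse `nvidia-smi --query-gpu=name,compute_cap --format=csv` output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    gpus = []
    for name, cap in rows[1:]:  # rows[0] is the header line
        major, minor = cap.strip().split(".")
        gpus.append((name.strip(), (int(major), int(minor))))
    return gpus


# Illustrative sample only; replace with the real command output.
sample = "name, compute_cap\nTesla T4, 7.5\nNVIDIA A100-SXM4-40GB, 8.0\n"
for name, cc in parse_compute_caps(sample):
    print(name, cc, "ok" if cc >= MIN_CC else "below the minimum")
```

Comparing (major, minor) tuples avoids the trap of comparing version strings or floats directly.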

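2.2 Installing with pip

2.2.1 Recommended: install alongside PyTorch

Most people never need to install Triton separately. It is pulled in automatically as a dependency when you install PyTorch.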

bash

# CUDA 12.8 (pairs with PyTorch 2.8 + Triton 3.4)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Verify that Triton came along with it
python -c "import triton; print(triton.__version__)"
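
The same check can be done without importing the packages, which helps when an import itself is what fails. This is a sketch using only the standard library. The alternative distribution names passed for triton are an assumption about how some builds are packaged, not something this chapter states.

```python
from importlib import metadata, util


def installed_version(module_name, dist_names=None):
    """Return a version string, "unknown", or None if the module is absent."""
    if util.find_spec(module_name) is None:
        return None
    for dist in dist_names or (module_name,):
        try:
            return metadata.version(dist)
        except metadata.PackageNotFoundError:
            continue
    return "unknown"  # importable, but no matching distribution metadata


print("torch :", installed_version("torch"))
# Assumed alternative distribution names for the triton module.
print("triton:", installed_version(
    "triton", ("triton", "pytorch-triton", "pytorch-triton-rocm", "triton-windows")
))
```

A result of None for triton next to a real version for torch suggests a PyTorch build that does not bundle Triton, for example a CPU-only wheel.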

2.2.2 Installing the latest stable release on its own

bash

pip install triton

PyTorch ↔ Triton version pairing

PyTorch	Triton
2.6	3.2
2.7	3.3
2.8	3.4
2.9	3.5
2.10	3.6

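Avoid upgrading the Triton package on its own; keeping it paired with PyTorch is the safest choice. If you do have to upgrade it separately, run your full test suite afterwards.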
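
The pairing table can be turned into a small check. This sketch hard-codes the table above; the helper names major_minor and is_matched are hypothetical, and the version strings in the example are illustrative.

```python
# PyTorch -> Triton pairs copied from the table above.
PAIRS = {"2.6": "3.2", "2.7": "3.3", "2.8": "3.4", "2.9": "3.5", "2.10": "3.6"}


def major_minor(version):
    """Reduce a version such as '2.8.0+cu128' to '2.8'."""
    parts = version.split("+")[0].split(".")
    return ".".join(parts[:2])


def is_matched(torch_version, triton_version):
    """True if the two versions form a pair listed in the table."""
    expected = PAIRS.get(major_minor(torch_version))
    return expected is not None and expected == major_minor(triton_version)


print(is_matched("2.8.0+cu128", "3.4.0"))
print(is_matched("2.8.0+cu128", "3.5.0"))
```

In a real environment the inputs would come from torch.__version__ and triton.__version__.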

2.2.3 Windows users: use the fork

powershell

pip install triton-windows

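Since triton-windows 3.2.0.post11, the wheel bundles a minimal CUDA toolchain, so there is no need to install the CUDA Toolkit manually
Since triton-windows 3.2.0.post13, the wheel bundles TinyCC, so there is no need to install a C compiler manually
Requires CUDA 12+ (CUDA 11 is not supported)

WSL2 is another route: install WSL2 + Ubuntu on Windows, then follow the Linux steps. The experience is nearly identical to native Linux.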

2.2.4 AMD ROCm users

Install it together with the ROCm build of PyTorch:

bash

pip install torch --index-url https://download.pytorch.org/whl/rocm6.2

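2.3 Building from source

Building from source is only necessary if:

You want to modify the Triton compiler itself
You need a feature on the main branch that has not been released yet
You are adding a backend for new hardware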

bash

git clone https://github.com/triton-lang/triton.git
cd triton
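pip install ninja cmake wheel pybind11    # build dependencies
# Editable install. On recent checkouts where setup.py sits at the repository root, use `pip install -e .` instead.
pip install -e python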

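Build time

A first build from source usually takes 30 minutes to an hour, even on a multi-core machine, largely because of the LLVM-dependent C++ build. If the system has no LLVM, setup.py automatically downloads the official prebuilt LLVM static libraries.

Run the tests to verify the build: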

bash

# Requires a GPU
pip install pytest
pytest python/test/unit

2.4 Verifying the installation

The quickest "is it installed?" check:

python

import torch
import triton
import triton.language as tl

print(f"Triton version: {triton.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU name:       {torch.cuda.get_device_name(0)}")

# Run the simplest possible kernel: write 2 * index into each element
@triton.jit
def hello_kernel(x_ptr, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)
    tl.store(x_ptr + offsets, offsets * 2)

x = torch.empty(16, dtype=torch.int32, device='cuda')
hello_kernel[(1,)](x, BLOCK_SIZE=16)  # a grid of one program instance
print(f"Kernel output: {x.tolist()}")
# Expected: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]

If the snippet above prints the expected result, your environment is ready.

2.5 Troubleshooting

Error: PTX was compiled with an unsupported toolchain

Cause: the NVIDIA driver is too old to accept the newer PTX that Triton generates. Fix: upgrade the NVIDIA driver to a version that supports CUDA 12.x (usually 550+).
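
To see which driver is installed, one option is to ask nvidia-smi for the driver_version field and compare its major number with the 550 rule of thumb above. This is a sketch: the helper names are hypothetical, and the threshold is only the approximate figure quoted in this section.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 550  # rule of thumb from the text above, not a hard limit


def driver_major(version_text):
    """'550.54.15' -> 550 (uses the first line if several GPUs are listed)."""
    first_line = version_text.strip().splitlines()[0]
    return int(first_line.split(".")[0])


def query_driver_version():
    """Return nvidia-smi's driver version string, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


version = query_driver_version()
if version is None:
    print("nvidia-smi is missing or reported an error")
elif driver_major(version) >= MIN_DRIVER_MAJOR:
    print(f"driver {version.splitlines()[0]}: ok")
else:
    print(f"driver {version.splitlines()[0]}: consider upgrading")
```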

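Error: RuntimeError: Triton requires CUDA

Cause: the current environment has no GPU, or PyTorch is the CPU-only build. Fix:

Run nvidia-smi to confirm that a GPU is present and the driver is working
Reinstall the CUDA build of PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu128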

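Error: libcuda.so.1: cannot open shared object file

Cause: the CUDA driver library cannot be found (common in containers and CI environments). Fix:

In Docker, make sure the container is started with --gpus all
Or confirm that /usr/lib/x86_64-linux-gnu/libcuda.so.1 exists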
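
Both checks can be scripted. The sketch below first asks the dynamic linker through ctypes.util.find_library, then falls back to the explicit path mentioned above. The function name find_libcuda is hypothetical, and the fallback path only applies to x86_64 Debian-style layouts.

```python
import ctypes.util
import os

# Explicit path from the text above; other distributions may differ.
DEFAULT_PATHS = ("/usr/lib/x86_64-linux-gnu/libcuda.so.1",)


def find_libcuda(extra_paths=DEFAULT_PATHS):
    """Return a name or path for the CUDA driver library, or None."""
    found = ctypes.util.find_library("cuda")
    if found:
        return found
    for path in extra_paths:
        if os.path.exists(path):
            return path
    return None


location = find_libcuda()
print(location if location else "libcuda not found; check the driver or container flags")
```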

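Conda environment conflicts

A CUDA runtime installed by Conda through cudatoolkit can conflict with the system driver. Recommended approach: create a clean venv or Conda environment, install PyTorch + Triton with pip only, and do not mix in CUDA packages from conda-forge.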

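Is the first kernel launch slow?

That is expected. The first time Triton sees a given (dtype, BLOCK_SIZE, ...) combination it JIT-compiles the kernel, which takes anywhere from a few hundred milliseconds to several seconds. The result is cached in ~/.triton/cache/, so from the second launch onward startup takes microseconds.

To warm the cache in CI, call the kernel once before the main workload.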
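
To see whether a CI job starts with a warm cache, it can help to look at the cache directory itself. This sketch assumes the TRITON_CACHE_DIR environment variable overrides the default location, and it assumes each cached kernel occupies its own subdirectory; both helper names are hypothetical.

```python
import os
from pathlib import Path


def triton_cache_dir(env=None):
    """Cache location: TRITON_CACHE_DIR if set, else ~/.triton/cache."""
    env = os.environ if env is None else env
    override = env.get("TRITON_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".triton" / "cache"


def count_cache_entries(cache_dir):
    """Count subdirectories (assumed: one per cached kernel)."""
    if not cache_dir.is_dir():
        return 0
    return sum(1 for entry in cache_dir.iterdir() if entry.is_dir())


cache = triton_cache_dir()
print(f"{cache}: {count_cache_entries(cache)} entries")
```

Persisting this directory between CI runs is one way to avoid recompiling the same kernels every time.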

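Want to debug Triton code on the CPU?

With this environment variable set, Triton simulates execution on the CPU (extremely slow; only for tracking down logic errors):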

bash

TRITON_INTERPRET=1 python your_script.py

In this mode you can debug with print() and single-step with pdb.
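
The variable can also be set from inside a script or test file. The sketch below assumes the flag is read when kernels are defined, so it sets the variable before triton is imported as the conservative choice. The helper name enable_interpreter is hypothetical.

```python
import os
import sys


def enable_interpreter(env=None, loaded_modules=None):
    """Set TRITON_INTERPRET=1; return False if triton was already imported."""
    env = os.environ if env is None else env
    loaded = sys.modules if loaded_modules is None else loaded_modules
    already_imported = "triton" in loaded
    env["TRITON_INTERPRET"] = "1"
    return not already_imported


if not enable_interpreter():
    print("warning: triton was imported first; the flag may be ignored")

# Import triton and define kernels only after this point.
```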

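Chapter summary

Triton works best on Linux + NVIDIA (CC ≥ 7.5); on Windows, use the fork or WSL2; macOS is not supported.
Any Python from 3.10 to 3.14 works; 3.11 or 3.12 is recommended.
The most reliable way to install Triton is to let PyTorch bring it in, and to keep the two versions paired.
Verify the install with a minimal kernel of about 10 lines; most common errors come down to the driver version or the CUDA environment.
A slow first launch is expected because of JIT compilation; the result is cached in ~/.triton/cache/.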

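With the environment ready, the next chapter moves on to Triton's programming model. The concepts of program instance (program), grid, and block introduced there underpin every kernel that follows.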

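Exercises

A colleague's laptop has only integrated graphics, with no NVIDIA or AMD GPU. They want to learn Triton syntax and debug kernel logic locally, without caring about performance for now. What would you recommend?
Your team's CI times out on every test run because Triton JIT compilation takes too long. What are two ways to reduce that cost?
A colleague asks: "I already have PyTorch 2.8 installed. Do I still need to pip install triton?" How do you answer? They follow up: "What if I want to use Triton 3.5?" How do you answer that?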
