Notes on PyTorch

A record of my process of learning PyTorch.

Configuring PyTorch

Setting up and Configuring CUDA, CUDNN and PYTorch for Python Machine Learning.: explains CUDA and cuDNN, with detailed installation steps

Why torch.cuda.is_available() returns False even after installing pytorch with cuda?: explains the nvidia-smi output, e.g. that CUDA Version is the highest CUDA version supported by the installed GPU driver

Getting your NVIDIA Virtual GPU Software Version: The NVIDIA Virtual GPU Manager version appears in the first line of text after the date, immediately after the text NVIDIA-SMI
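As a rough illustration of the last two references, the three version numbers in the nvidia-smi header line can be pulled apart with a few regular expressions. This is only a sketch: the function name parse_smi_header is hypothetical, and the sample line is copied from the nvidia-smi output recorded later in these notes.

```python
import re

def parse_smi_header(line: str) -> dict:
    """Extract the three version fields from the nvidia-smi header line."""
    patterns = {
        "nvidia_smi": r"NVIDIA-SMI\s+([\d.]+)",
        "driver": r"Driver Version:\s*([\d.]+)",
        # Highest CUDA version the driver supports, not the installed toolkit.
        "cuda_max": r"CUDA Version:\s*([\d.]+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, line)
        result[key] = match.group(1) if match else None
    return result

header = "| NVIDIA-SMI 515.57 Driver Version: 516.59 CUDA Version: 11.7 |"
print(parse_smi_header(header))
```

The "cuda_max" field is the value that tends to be mistaken for the installed toolkit version; the toolkit version is what nvcc -V reports.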

Windows

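  1. Install CUDA

    • Confirm that the GPU supports CUDA

    • Install the CUDA Toolkit

    • Check the installed toolkit version: nvcc -V

    • Check the driver and GPU status: nvidia-smi

  2. Install PyTorch

    • The official installation method may fail during download

      conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
    • To avoid slow PyTorch downloads, download the matching package directly from Anaconda Cloud and install it locally

      conda install --use-local pytorch-1.7.1-py3.8_cuda110_cudnn8_0.tar.bz2
    • Check the installed version

      import torch
      print(torch.__version__)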
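The version check can be extended slightly to show whether the installed build was compiled with CUDA and whether a GPU is actually usable. The sketch below assumes nothing beyond the standard torch attributes; the helper name torch_summary is hypothetical, and it reports a missing installation instead of raising.

```python
import importlib.util

def torch_summary() -> dict:
    """Report the installed PyTorch build, or note that torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch
    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the binary was built with; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }

print(torch_summary())
```

If "built_with_cuda" is None, a CPU-only package was installed, which is one common reason for torch.cuda.is_available() returning False.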

Ubuntu on Windows

First Try

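Spring semester 2022: installing on the university's 3090.

Detailed installation commands can be found at CUDA Toolkit 11.7 Downloads. After installation, nvcc -V reported that the command was not found and suggested installing it with sudo apt install nvidia-cuda-toolkit. Once that was installed, the version turned out to be 10.1, so I looked for a way to switch versions.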

The command sudo update-alternatives --display cuda from How to change CUDA version shows the path of the installed 11.7 version. Following Multiple CUDA versions on machine nvcc -V confusion, open the file with vim ~/.bashrc and append the following three lines:

export CUDA_HOME="/usr/local/cuda-11.7"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-11.7/bin:$PATH"
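The three exports work because they put the CUDA 11.7 directories in front of whatever is already on the search paths. The same effect can be sketched in Python on a copy of an environment mapping, for example to launch a single subprocess with a chosen toolkit; the helper name with_cuda is hypothetical.

```python
import os

def with_cuda(env: dict, cuda_home: str) -> dict:
    """Return a copy of env with cuda_home placed first on the search paths."""
    new_env = dict(env)
    new_env["CUDA_HOME"] = cuda_home
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib64")):
        entry = f"{cuda_home}/{subdir}"
        old = env.get(var, "")
        # Prepending makes this toolkit win over any other installed version.
        new_env[var] = entry + (os.pathsep + old if old else "")
    return new_env

env = with_cuda({"PATH": "/usr/bin"}, "/usr/local/cuda-11.7")
print(env["PATH"])
```

Passing with_cuda(os.environ, "/usr/local/cuda-11.7") as the env argument of subprocess.run would apply the override to one command without editing ~/.bashrc.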

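PyTorch installed for CUDA 11.3 could not detect the 11.7 toolkit, so I first removed the toolkit with sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"; see How to remove cuda completely from ubuntu?.

Some problems came up and I had to reinstall from scratch. This time sudo apt-get -y install cuda failed with the error you have held broken packages; see Problem while installing cuda toolkit in ubuntu 18.04. The commands that fixed it are: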

sudo apt clean
sudo apt update
sudo apt purge "nvidia-*"
sudo apt autoremove
sudo apt install -y cuda

WSL2- $nvidia-smi command not running

The problem was that nvidia-smi reported the following errors:

- Failed to initialize NVML: GPU access blocked by the operating system
- Failed to properly shut down NVML: GPU access blocked by the operating system

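I tried rellik's reply: nvidia-smi.exe produces normal output, but that output presumably comes from the Windows host system.

Solved: the cause was that Windows had not been updated; see Why does nvidia-smi return "GPU access blocked by the operating system" in WSL2 under Windows 10 21H2 [closed].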
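A small check along these lines can tell the "blocked by the operating system" case apart from a missing or otherwise failing nvidia-smi. This is a sketch only: the function name check_nvidia_smi and the returned messages are hypothetical, and the hint about Windows updates simply restates what resolved the case described here.

```python
import shutil
import subprocess

BLOCKED = "GPU access blocked by the operating system"

def check_nvidia_smi() -> str:
    """Run nvidia-smi if present and classify the result."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi not found on PATH"
    proc = subprocess.run([exe], capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if BLOCKED in output:
        # In the case recorded in these notes, updating Windows resolved this.
        return "blocked: check for pending Windows updates"
    return "ok" if proc.returncode == 0 else f"failed with code {proc.returncode}"

print(check_nvidia_smi())
```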

PyTorch not recognizing GPU on WSL - installed cudnn and cuda #73487

  • To install a package with pip inside a conda environment: /anaconda/envs/venv_name/bin/pip install package_name; see Using Pip to install packages to Anaconda Environment.
  • To inspect the torch environment: python -m torch.utils.collect_env; see Quick Tips #1: How to obtain environment information using PyTorch.
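An alternative to spelling out the environment's pip path is to run pip as a module of the interpreter that is currently active, which targets that interpreter's environment. The sketch below only builds the command lists; both helper names are hypothetical.

```python
import sys

def pip_install_command(package: str) -> list:
    """Build a pip command bound to the interpreter that runs this script."""
    return [sys.executable, "-m", "pip", "install", package]

def collect_env_command() -> list:
    """Build the command that prints PyTorch's environment report."""
    return [sys.executable, "-m", "torch.utils.collect_env"]

print(pip_install_command("package_name"))
print(collect_env_command())
```

Either list can be handed to subprocess.run(cmd, check=True) from a script started with the conda environment's Python.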

Trouble installing torch with CUDA using conda on WSL2 suggests installing PyTorch 1.8.1+cu111. Installing it in a freshly created conda environment worked; the environment report at that point was:

Collecting environment information...
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.9 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 512.77
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[conda] numpy 1.22.3 pypi_0 pypi
[conda] torch 1.8.1+cu111 pypi_0 pypi
[conda] torchaudio 0.8.1 pypi_0 pypi
[conda] torchvision 0.9.1+cu111 pypi_0 pypi
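When comparing reports like this one across machines, it can help to turn the "key: value" lines into a dictionary. The following is a minimal sketch with a hypothetical helper name, parse_env_report, fed with a few lines copied from the report above.

```python
def parse_env_report(text: str) -> dict:
    """Turn 'key: value' lines of a collect_env report into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        # Skip headings without a value and the '[pip3]' / '[conda]' rows.
        if sep and not key.startswith("["):
            info[key.strip()] = value.strip()
    return info

report = """PyTorch version: 1.8.1+cu111
CUDA used to build PyTorch: 11.1
Is CUDA available: True
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 512.77
"""
info = parse_env_report(report)
print(info["PyTorch version"], info["Is CUDA available"])
```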

Second Try

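Summer vacation 2022: installing on a laptop with an RTX 2070 Max-Q.

  • Using PyTorch with CUDA on WSL2 - Christian Mills: points out the known issues with using the GPU on WSL2

  • Enable NVIDIA CUDA on WSL 2 - Windows - Microsoft Docs: covers the overall procedure

  • CUDA on WSL User Guide - NVIDIA Documentation Center: the official guide, covers the main steps

Specifically, follow 2.1. Step 1: Install NVIDIA Driver for GPU Support on Windows and 3. CUDA Support for WSL 2 inside WSL2, then run some checks (none of these commands work before the installation, even if CUDA was previously installed on Windows):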

(base) carlos@LAPTOP-00000000:~/Downloads$ nvcc --version

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

(base) carlos@LAPTOP-00000000:~/Downloads$ nvidia-smi
Sun Jul 17 01:08:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P8     3W /  N/A |    263MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Since nvcc --version did not work, I followed Nvcc –version returns nothing despite correct install: open the file with vim ~/.bashrc and append the following lines:

export CUDA_HOME=/usr/local/cuda-11.7
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Then reload the file with source ~/.bashrc and check again:

(base) carlos@LAPTOP-00000000:~/Downloads$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
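To confirm from a script which toolkit release is active, the "release" field of the nvcc -V output can be parsed. This is a sketch with a hypothetical helper name, nvcc_release; the sample line is taken from the output above.

```python
import re

def nvcc_release(output: str):
    """Return the (major, minor) release found in `nvcc -V` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None

sample = "Cuda compilation tools, release 11.7, V11.7.64"
print(nvcc_release(sample))  # (11, 7)
```

Returning integers rather than a string keeps later comparisons numeric, so a release such as 11.10 would not sort below 11.7.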

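First, switch to a mirror:

  • Anaconda mirror usage help - Tsinghua University Open Source Software Mirror Site
  • PyPI mirror usage help - Tsinghua University Open Source Software Mirror Site

Install by running conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch.

Inspect the environment with python -m torch.utils.collect_env: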

Collecting environment information...
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070 with Max-Q Design
Nvidia driver version: 516.59
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.12.0
[pip3] torchaudio==0.12.0
[pip3] torchvision==0.13.0
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 11.3.1 h2bc3f7f_2 defaults
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640 defaults
[conda] mkl-service 2.4.0 py39h7f8727e_0 defaults
[conda] mkl_fft 1.3.1 py39hd3c417c_0 defaults
[conda] mkl_random 1.2.2 py39h51133e4_0 defaults
[conda] numpy 1.22.3 py39he7a7128_0 defaults
[conda] numpy-base 1.22.3 py39hf524024_0 defaults
[conda] pytorch 1.12.0 py3.9_cuda11.3_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.12.0 py39_cu113 pytorch
[conda] torchvision 0.13.0 py39_cu113 pytorch
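In this report PyTorch was built against CUDA 11.3 while the driver reports CUDA 11.7 as its maximum, and CUDA is still available. A conservative way to express that relationship is to check that the driver's maximum version is at least the build version. The helper names below are hypothetical, and the check is a sufficient condition only, not a statement of NVIDIA's full compatibility rules.

```python
def version_tuple(version: str) -> tuple:
    """Convert '11.3' into (11, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def driver_supports(build_cuda: str, driver_max_cuda: str) -> bool:
    """True if the driver's maximum CUDA version covers the build's version."""
    return version_tuple(build_cuda) <= version_tuple(driver_max_cuda)

print(driver_supports("11.3", "11.7"))  # True
print(driver_supports("11.7", "10.1"))  # False
```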

Afterwards, install packages with conda install -c conda-forge jupyterlab and pip install pettingzoo respectively; both work the same way as on Windows.

Fix First Try

Fall 2022: back on the 3090, trying to move the environment to the latest CUDA and PyTorch versions.

Nivida-smi shows different mesages under wsl and windows

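Using the GPU

The following function prints GPU information and returns the selected device together with the device count; see How to check if pytorch is using the GPU?.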

import torch

def get_device():
    """Print GPU information and return (device, device_count)."""
    print('torch.cuda.is_available():', torch.cuda.is_available())
    device_count = torch.cuda.device_count()
    print('torch.cuda.device_count():', device_count)
    device_idxes = list(range(device_count))
    print('device_idxes:', device_idxes)
    devices = [torch.cuda.device(i) for i in device_idxes]
    print('devices:', devices)
    device_names = [torch.cuda.get_device_name(i) for i in device_idxes]
    print('device_names:', device_names, '\n')

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Using device:', device, '\n')

    if device.type == 'cuda':
        # Query the current device only when CUDA is usable; on a CPU-only
        # machine torch.cuda.current_device() raises an error.
        current_device = torch.cuda.current_device()
        print('torch.cuda.current_device():', current_device)
        print('torch.cuda.device(current_device):', devices[current_device])
        print('torch.cuda.get_device_name(current_device):', device_names[current_device], '\n')

        print('Memory Usage:')
        print('Allocated:', round(torch.cuda.memory_allocated(0) / 1024 ** 3, 1), 'GB')
        print('Cached:   ', round(torch.cuda.memory_reserved(0) / 1024 ** 3, 1), 'GB')

    return device, device_count

Call .to(device) on both the model and its inputs; see Porting PyTorch code from CPU to GPU.
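A minimal sketch of that pattern, using a small nn.Linear layer as a stand-in model (the layer sizes and batch shape are arbitrary choices for illustration):

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Module.to moves the parameters of the module itself.
model = nn.Linear(4, 2).to(device)
# Tensor.to returns a tensor on the target device, so keep the result.
inputs = torch.randn(8, 4).to(device)

outputs = model(inputs)
print(outputs.shape, outputs.device)
```

Forgetting to reassign the result of Tensor.to is a common source of "expected all tensors to be on the same device" errors, since the original tensor stays where it was.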

For using multiple GPUs, see How to use multiple GPUs in pytorch?.
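One simple option is nn.DataParallel, sketched below with a stand-in nn.Linear model; it is only applied when more than one GPU is visible, so the snippet also runs on a single GPU or on CPU. The PyTorch documentation recommends DistributedDataParallel for serious multi-GPU training, so treat this as the quickest way to try it rather than the preferred setup.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the visible GPUs.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

batch = torch.randn(8, 4, device=device)
print(model(batch).shape)
```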

Use watch -n 2 nvidia-smi to monitor the usage of all GPUs; see How to check if pytorch is using the GPU?.

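Installing Multiple CUDA Versions

During installation, the CUDA installer failed with the error you already have a newer version of the nvidia frameview sdk installed. After uninstalling the following software in order, the installation could continue: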

  1. PhysX
  2. NVIDIA GeForce Experience
  3. NVIDIA FrameView SDK

See WIndows 10 CUDA installation failure solved and CUDA installation problem.

In-Place Operation

>>> x = torch.rand(1)
>>> y = torch.rand(1)
>>> x
tensor([0.2738])
>>> id(x)
140736259305336
>>> x = x + y # Normal operation
>>> id(x)
140726604827672 # New location
>>> x += y
>>> id(x)
140726604827672 # Existing location used (in-place)

Deepali Patel, What is in-place operation?
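The session above compares Python object identities with id(). A complementary check, sketched here, compares the address of the underlying tensor memory with data_ptr(), and also shows the trailing-underscore method form of an in-place operation.

```python
import torch

x = torch.rand(1)
y = torch.rand(1)

ptr = x.data_ptr()
x += y                      # in-place: writes into the existing storage
assert x.data_ptr() == ptr
x.add_(y)                   # trailing underscore marks an in-place method
assert x.data_ptr() == ptr

old = x                     # keep the old tensor alive so its memory is not reused
x = x + y                   # out-of-place: allocates a new tensor
assert x.data_ptr() != old.data_ptr()
print("in-place kept the storage; x + y allocated new storage")
```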

Which is faster?

.expand().clone() or .repeat()?

Keep in mind though that if you plan on changing this expanded tensor inplace, you will need to use .clone() on it before so that it actually is a full tensor (with memory for each element). But even .expand().clone() should be faster than .repeat() I think.

albanD, Torch.repeat and torch.expand which to use?
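The difference described in the quote can be observed directly: expand returns a view that shares memory with the original tensor (stride 0 along the expanded dimension), while repeat allocates a full copy. A small sketch with arbitrary shapes:

```python
import torch

a = torch.arange(3).unsqueeze(1)      # shape (3, 1)

expanded = a.expand(3, 4)             # view: no new memory
repeated = a.repeat(1, 4)             # copy: 12 elements allocated

print(expanded.stride())              # (1, 0): stride 0 along the expanded dim
print(expanded.data_ptr() == a.data_ptr())
print(torch.equal(expanded, repeated))

writable = expanded.clone()           # full tensor, safe to modify in place
writable[0, 0] = 42
print(a[0, 0].item())                 # a is unaffected by the write
```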

.unsqueeze(dim=1).expand(-1, 2).clone().view(-1) or .repeat_interleave(2)

import torch
a = torch.arange(3)  # tensor([0, 1, 2])
a.unsqueeze(dim=1).expand(-1, 2).clone().view(-1)  # tensor([0, 0, 1, 1, 2, 2])
a.repeat_interleave(2)  # tensor([0, 0, 1, 1, 2, 2])
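Which of the two is faster may depend on tensor size, device, and PyTorch version, so it is worth measuring rather than assuming. The sketch below uses the standard timeit module on a 1000-element CPU tensor (an arbitrary size); the function names are hypothetical, and for GPU tensors a synchronizing timer would be needed instead.

```python
import timeit
import torch

a = torch.arange(1000)

def via_expand():
    return a.unsqueeze(dim=1).expand(-1, 2).clone().view(-1)

def via_repeat_interleave():
    return a.repeat_interleave(2)

# Both approaches must agree before their timings are worth comparing.
assert torch.equal(via_expand(), via_repeat_interleave())

for fn in (via_expand, via_repeat_interleave):
    seconds = timeit.timeit(fn, number=1000)
    # Total seconds for 1000 calls equals milliseconds per call * 1, i.e. 1e3 us.
    print(f"{fn.__name__}: {seconds * 1e3:.2f} us per call")
```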