TensorFlow Linux GPU + jupyterlab environment installation (Docker) (Ubuntu Deepin Manjaro)

Copyright statement: This article is neucrack's original article and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement for reprinting.
Original link: https://neucrack.com/p/116

Using docker keeps the installation simpler and more stable: you only need to install the NVIDIA driver on the host, not cuda, so you don't have to worry about the cuda version.
You can also run several containers at the same time, for example several jupyterlab instances for different people to use.

Install docker

The docker version must be 19.03 or above (check it with docker --version). With an older version, the nvidia-docker setup later will fail and docker will report that it does not recognize the --gpus all parameter.
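The version requirement above can also be checked from a script. The following is a minimal Python sketch, not part of the original setup: the helper names parse_docker_version and docker_supports_gpus are made up for illustration, and it assumes docker --version prints a line containing a dotted version number such as 19.03.8.

```python
import re
import subprocess

MIN_DOCKER = (19, 3)  # --gpus all needs docker 19.03 or newer


def parse_docker_version(text):
    """Extract (major, minor) from the output of `docker --version`."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def docker_supports_gpus(version_text):
    """Return True if the reported version is at least 19.03."""
    return parse_docker_version(version_text) >= MIN_DOCKER


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["docker", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not run docker: {exc}")
    else:
        verdict = "ok" if docker_supports_gpus(out) else "too old for --gpus"
        print(out.strip(), "->", verdict)
```

If the script reports that the version is too old, upgrade docker before continuing with the GPU steps.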

Installation

  • On Manjaro, simply run yay -S docker
  • Other distributions:

See the official tutorial: https://docs.docker.com/install/linux/docker-ce/debian/

deepin is based on debian 9.0 (stretch), so two extra steps are needed on deepin.
First, change unstable to stable in /usr/share/python-apt/templates/Deepin.info (for example with sudo vim /usr/share/python-apt/templates/Deepin.info).
Then add the repository with the command sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian stretch stable"

Set up proxy

If downloads are slow, you may need to set up a proxy, or use a mirror hosted in China instead of the official registry, such as the daocloud mirror acceleration service

Docker proxy setting reference: https://neucrack.com/p/286

Setting the proxy while pulling an image makes the pull faster. It is recommended to remove the proxy again before creating the container

Allow the current (non-root) user to access docker

Reference here: https://docs.docker.com/install/linux/linux-postinstall/

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker # or reopen the terminal; if it still does not take effect, reboot

Common commands

docker images: list images
docker run [options] image_name [command]: create a new container from an image
docker ps: list running containers
docker ps -a: list all containers, including those that are not running
docker rm container_name: delete a container
docker rmi image_name: delete an image
docker start container_name: start a container
docker attach container_name: attach to a container
docker exec container_name [command]: execute a command in a container
docker logs container_name: view a container's log output

docker build -t image_name .: build an image from Dockerfile

docker run common parameters

-it: enable an interactive terminal
--rm: remove the container as soon as it exits instead of keeping it
--gpus all: enable all GPUs
-p port1:port2: map a host port to a container port; port1 is the host port
-v volume1:volume2: map a host directory into the container; volume1 is the host directory, for example mapping /home/${USER}/notes to /tf/notes
--name name: give the container a name; without this parameter the name is randomly generated
--device device:container_device: pass a host device through to the container, for example /dev/ttyUSB0:/dev/ttyUSB0
--network=host: use the host's network
--restart: restart policy, which also makes the container start automatically. If you forgot to set it at docker run, update it afterwards with docker update --restart=always container_name. Possible values:

no: do not restart the container automatically (default)
on-failure: restart the container if it exits due to an error (non-zero exit status)
unless-stopped: like always, except that a container that was stopped manually is not restarted, even after the Docker daemon restarts
always: always restart the container when it stops; a manually stopped container is restarted only when the Docker daemon restarts or the container is started again manually

Install graphics card driver

Driver installation is covered in a separate article; see "Linux Nvidia graphics card installation"

Install the image

Refer to the official document: https://www.tensorflow.org/install/docker

Just follow the installation guide in the readme. Be sure to read the current documentation, since the steps may have been updated. For example, on my Ubuntu:

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

On deepin, set the distribution string by hand instead of reading it from /etc/os-release:

distribution="ubuntu18.04"
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
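
To make the distribution step above more concrete, here is a small Python sketch of what the shell one-liner computes: it joins ID and VERSION_ID from /etc/os-release, and falls back to ubuntu18.04 on deepin as described above. The function names are hypothetical, and the check for ID=deepin is an assumption about what deepin writes to /etc/os-release; verify it on your own system.

```python
def parse_os_release(text):
    """Parse KEY=value lines (values may be quoted) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def nvidia_docker_distribution(text):
    """Return the string used in the nvidia-docker repository URL."""
    fields = parse_os_release(text)
    # Assumption: deepin identifies itself with ID=deepin (any letter case).
    if fields.get("ID", "").lower() == "deepin":
        return "ubuntu18.04"
    return fields.get("ID", "") + fields.get("VERSION_ID", "")


if __name__ == "__main__":
    try:
        with open("/etc/os-release", encoding="utf-8") as handle:
            print(nvidia_docker_distribution(handle.read()))
    except OSError as exc:
        print(f"could not read /etc/os-release: {exc}")
```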

On Manjaro, simply run yay -S nvidia-docker. (If downloads are slow, you can use poipo to set up a global proxy; see "Terminal proxy setting method".)

  • Test whether nvidia-docker and cuda can be used

Use the nvidia/cuda image for this test. It is only needed for testing and can be deleted afterwards. If you have no proxy set up and don't want to spend time pulling an extra image, skip it and run the test directly with the tensorflow/tensorflow:latest-gpu-py3 image or with the neucrack/tensorflow-gpu-py3-jupyterlab image (or daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab). The latter is recommended: it is built on the former, adds jupyterlab, and handles user permissions better.

lspci | grep -i nvidia
docker run --gpus all --rm nvidia/cuda nvidia-smi

For example:

➜ ~ sudo docker run --gpus all --rm nvidia/cuda nvidia-smi
Tue Mar 10 15:57:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   39C    P0    27W / 120W |    310MiB /  6075MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Wed Mar 11 02:04:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   41C    P5    25W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
| 39%   36C    P5    19W / 250W |      0MiB / 11178MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If the driver version is too old, you will be prompted to update the driver.

Also note that the cuda version reported here is 10.2, while tensorflow may only support 10.1. If tensorflow were installed directly on the host, it would fail with an unsupported-version error. This is where docker pays off: the cuda libraries come from the image, so you only need to make sure the driver is installed on the host.

On deepin, the following error may appear:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout:, stderr: nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\\\n\\\"\"": unknown .

The fix, taken from https://github.com/NVIDIA/nvidia-docker/issues/614, is:

sudo ln -s /sbin/ldconfig /sbin/ldconfig.real

If docker reports the error nvidia-container-cli: initialization error: cuda error: unknown error, rebooting the system resolves it.

Run tensorflow with GPU

Pull the image:

docker pull neucrack/tensorflow-gpu-py3-jupyterlab
# docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter
# docker pull tensorflow/tensorflow
# docker pull tensorflow/tensorflow:latest-gpu

The image is also available on daocloud, which is faster from within China:

docker pull daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab

Execute the test statement:

docker run --gpus all -it --rm neucrack/tensorflow-gpu-py3-jupyterlab python -c "import tensorflow as tf; print('-----version:{}, gpu:{}, 1+2={}'.format(tf.__version__, tf.test.is_gpu_available(), tf.add(1, 2).numpy()))"

If daocloud is used, the image name needs to be changed to daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab

If everything works, the following output appears, surrounded by a lot of debugging information and possibly some warnings, which are worth reading through:

-----version:2.1.0, gpu:True, 1+2=3
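
In newer tensorflow 2.x releases, tf.test.is_gpu_available() is marked as deprecated in favour of tf.config.list_physical_devices('GPU'). A hedged sketch of the same check using that call follows. The function name gpu_report is made up here, and the script is meant to be run inside the container (it returns None where tensorflow is not installed).

```python
def gpu_report():
    """Describe the GPUs tensorflow can see.

    Returns None if tensorflow cannot be imported in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "version": tf.__version__,
        "gpu_count": len(gpus),
        "gpu_names": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    print(gpu_report())
```

A gpu_count of 0 inside the container usually points back at the driver or at a missing --gpus all parameter.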

Jupyterlab

docker run --gpus all --name jupyterlab_gpu -it -p 8889:8889 -e USER_NAME=$USER -e USER_ID=`id -u $USER` -e GROUP_NAME=`id -gn $USER` -e GROUP_ID=`id -g $USER` -v /home/${USER}:/tf neucrack/tensorflow-gpu-py3-jupyterlab

If daocloud is used, the image name needs to be changed to daocloud.io/neucrack/tensorflow-gpu-py3-jupyterlab

You can then open jupyterlab in a browser at http://127.0.0.1:8889/. Its working directory corresponds to the mapped /home/${USER} directory on the host
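
The long docker run command above is easier to read when its parts are spelled out. The sketch below rebuilds the same argument list in Python for illustration only. jupyterlab_run_command and current_user_settings are hypothetical helper names, and the home directory is taken from the password database instead of hard-coding /home/${USER}, which is an assumption that differs slightly from the command above.

```python
import grp
import os
import pwd
import shlex

IMAGE = "neucrack/tensorflow-gpu-py3-jupyterlab"


def jupyterlab_run_command(user_name, user_id, group_name, group_id, home,
                           name="jupyterlab_gpu", port=8889, image=IMAGE):
    """Build the `docker run` argument list for one user."""
    return [
        "docker", "run", "--gpus", "all",
        "--name", name,
        "-it",
        "-p", f"{port}:{port}",          # host port : container port
        "-e", f"USER_NAME={user_name}",
        "-e", f"USER_ID={user_id}",
        "-e", f"GROUP_NAME={group_name}",
        "-e", f"GROUP_ID={group_id}",
        "-v", f"{home}:/tf",             # host directory : container directory
        image,
    ]


def current_user_settings():
    """Look up the calling user's names and ids (Unix only)."""
    user = pwd.getpwuid(os.getuid())
    group = grp.getgrgid(user.pw_gid)
    return {
        "user_name": user.pw_name,
        "user_id": user.pw_uid,
        "group_name": group.gr_name,
        "group_id": group.gr_gid,
        "home": user.pw_dir,
    }


if __name__ == "__main__":
    try:
        settings = current_user_settings()
    except KeyError as exc:
        print(f"current user or group not found: {exc}")
    else:
        # Print the command instead of running it, so it can be reviewed first.
        print(shlex.join(jupyterlab_run_command(**settings)))
```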


Press Ctrl+C to exit.
Once created, the container stays on the computer. You can see it with docker ps -a and start it again the next time you need it:

docker start jupyterlab_gpu

You can also attach to the container:

docker attach jupyterlab_gpu

Stop the container:

docker stop jupyterlab_gpu

Delete the container:

docker rm jupyterlab_gpu

Set the user and root passwords inside the container so that you can use the sudo command:

docker exec -it jupyterlab_gpu /bin/bash
passwd $USER
passwd root

If you would rather create a fresh container every time and have it deleted when you are done, add the --rm parameter to the run command

Other problems

  • Error when running a program: ResourceExhaustedError: OOM when allocating tensor with shape[784,128]

Use nvidia-smi to check video memory usage.

By default, tensorflow allocates (almost) all video memory at once:

➜ ~ nvidia-smi
Fri Mar 20 09:18:48 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0  On |                  N/A |
|  0%   48C    P2    60W / 250W |  10726MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:81:00.0 Off |                  N/A |
|  0%   47C    P2    58W / 250W |    197MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3099      G   /usr/lib/xorg/Xorg                            21MiB |
|    0     40037      C   /usr/bin/python3                           10693MiB |
|    1     40037      C   /usr/bin/python3                             185MiB |
+-----------------------------------------------------------------------------+

Too many processes may be using video memory; exit some of them as appropriate.
It is also possible that the memory was requested more than once; try restarting the container to solve it
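
If the large up-front allocation itself is the problem, one option that is not covered above is to ask tensorflow to allocate GPU memory on demand. The sketch below uses tf.config.experimental.set_memory_growth, which has to be called before the GPU is first used. The wrapper name enable_memory_growth is made up for illustration, and whether it helps depends on your workload.

```python
def enable_memory_growth():
    """Ask tensorflow to grow GPU memory on demand.

    Returns the number of GPUs configured, or None if tensorflow cannot be
    imported in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is initialized; afterwards tensorflow
        # raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs configured:", enable_memory_growth())
```

Call it at the very start of the program or notebook, before building any model.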

  • The program keeps running without producing any result

Restarting the docker container solves it. In general, when something behaves strangely, try a restart first.

  • Error: could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED

This can happen when multiple processes are used. A new process directly copies the environment of the current process, which leads to the error. The fix is to avoid importing tensorflow at global (module) level: import it in the parent process only where it is needed, and import it separately inside the child process when it is needed there. Reference: https://abcdabcd987.com/python-multiprocessing/
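
A minimal sketch of the import pattern described above, assuming a Linux host where the fork start method is available. math stands in for tensorflow so that the example runs anywhere; worker and run_in_children are hypothetical names.

```python
import multiprocessing as mp


def worker(x):
    # Import the heavy library inside the child, not at module level, so the
    # parent has not initialized it when the child process is created.
    # `math` is only a stand-in; a real program would import tensorflow here.
    import math
    return math.sqrt(x)


def run_in_children(values, processes=2):
    """Run worker() on each value in separate child processes."""
    # "fork" copies the parent process, which is the situation described
    # above; the parent itself never imports the heavy library.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(worker, values)


if __name__ == "__main__":
    print(run_in_children([1.0, 4.0, 9.0]))
```

The point is only where the import statement lives: nothing at the top of the file touches the GPU library, so each child initializes it on its own.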

  • ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Install the missing library:

apt install libgl1-mesa-glx

  • Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

This usually means the graphics card memory is insufficient. Check whether other programs are occupying it. If there are several graphics cards, set the environment variable CUDA_VISIBLE_DEVICES to choose which one to use. For example, with three cards indexed 0, 1 and 2, set it to 2 to select the third card:

import os

# Set this before tensorflow first touches the GPU (safest: before importing tensorflow)
os.environ["CUDA_VISIBLE_DEVICES"] = '2'

Reference