Conda on Spiedie (intermediate)


This tutorial illustrates using the Conda environment manager together with a GPU compute node on Spiedie.

If you’d like to run the same job using Singularity instead of Conda, click here!

Click here for more information on when to use Conda vs. Singularity vs. Modules.

Things covered in this guide:

  1. Enabling and using Conda
  2. Accessing a GPUCompute partition
  3. Running a GPU-enabled workload

Requirements to complete the guide:

  1. Familiarity with Spiedie (try the quick start if you haven’t)
  2. Familiarity with shell commands and Python

Enabling Conda on Spiedie

For this tutorial, we will be enabling the Conda package and environment manager on Spiedie to set up our environment and install packages.

To enable Conda at login on Spiedie, log in and run the following command:

/cm/shared/apps/miniconda/bin/conda init

Then, close the window and log back in.

Your prompt should now include the base environment tag, like:

(base)[username@spiedie81 ~]:

Note: You do not need to repeat this step every time you want to use Conda; the base environment will be activated automatically every time you log in.

Note: If you would like to keep Conda enabled but stop the base environment from activating automatically at login (you can still activate it manually with conda activate base), run:

conda config --set auto_activate_base false

Install packages

We will be using GPU-enabled TensorFlow to test our Python GPU code. To create a new virtual environment called spiedie_tf_gpu with tensorflow-gpu installed in it, run:

conda create --name spiedie_tf_gpu tensorflow-gpu
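
You can confirm the environment was created by listing your Conda environments:

conda env list

The new spiedie_tf_gpu environment should appear in the output.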

Access GPU partition

For this tutorial, we will be working in an interactive shell on a GPUCompute node. To request an interactive shell session through SLURM, run:

srun --partition=gpucompute --gres=gpu:1 --pty bash

Since we will only be running a simple GPU test program, we will not request additional memory or resources.
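
Once the session starts, you can verify that a GPU has been allocated to you with:

nvidia-smi

The allocated GPU(s) should be listed in the output.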

Running the program

Once assigned to a GPU node, switch to the new virtual environment using:

conda activate spiedie_tf_gpu

We will be running a simple 5000-element dot product on the P100 GPUs and logging the device placement. You can download the source code here.

Download the source code and transfer it to your home directory. For instructions, click here.
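
If you would rather write the script yourself, below is a minimal sketch of what tf_gpu.py might look like, assuming the TensorFlow 1.x API (which matches the log output shown later); the exact downloadable script may differ:

# tf_gpu.py - minimal sketch: 5000-element dot product with device placement logging
import tensorflow as tf

# Two constants whose matrix product is a 5000-element dot product
a = tf.constant([1.0] * 5000, shape=[1, 5000], name='a')
b = tf.constant([2.0] * 5000, shape=[5000, 1], name='b')
c = tf.matmul(a, b, name='MatMul')

# log_device_placement=True makes TensorFlow report which device runs each op
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))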

Before running the code, we must load the CUDA toolkit module by running:

module load cuda10.1/toolkit/10.1.105

Note: We have used CUDA 10.1 for the purposes of this tutorial.
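
Optionally, you can check that TensorFlow can see the GPU before running the full script. This uses tf.test.is_gpu_available, the TensorFlow 1.x check:

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

This should print True.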

Once the module is loaded, we can run the program with:

python tf_gpu.py 

You should see the following output:

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:83:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device
2019-07-15 13:42:49.767029: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:83:00.0, compute capability: 6.0
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-07-15 13:42:49.768241: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-07-15 13:42:49.768275: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-07-15 13:42:49.768300: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0

The above workflow can be modified to work with sbatch and for general-purpose TensorFlow GPU usage.
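
For example, a batch submission might look like the following sketch. The job name, output file name, and the path to conda.sh are assumptions (the conda.sh path follows the standard Miniconda layout under the install prefix shown earlier); adjust them for your setup:

#!/bin/bash
#SBATCH --job-name=tf_gpu_test
#SBATCH --partition=gpucompute
#SBATCH --gres=gpu:1
#SBATCH --output=tf_gpu_%j.out

# Make 'conda activate' available in a non-interactive batch shell
# (assumed path, based on the Miniconda prefix used above)
source /cm/shared/apps/miniconda/etc/profile.d/conda.sh
conda activate spiedie_tf_gpu

module load cuda10.1/toolkit/10.1.105
python tf_gpu.py

Save this as, for example, tf_gpu.sh and submit it with sbatch tf_gpu.sh.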