GPU software installation and configuration for cmslpc developers

How to connect to the CMS LPC GPU nodes:

  • Make sure your ~/.ssh/config applies the same rules to cmslpcgpu*.fnal.gov as to cmslpc-sl6.fnal.gov; the easiest approach is a single Host cmslpc*.fnal.gov stanza. Also set up your Kerberos as described on the CMS LPC CAF How to Connect web page.
  • Authenticate with Kerberos as usual:
    kinit username@FNAL.GOV
  • Pick a node and connect with ssh (a sketch follows this list).
  • Keep in mind that these are interactive nodes, so allocate only the GPU resources you think you need. In case of a conflict, please email the lpc-gpu@fnal.gov mailing list.
  • The nodes run SL7 and do not have access to the condor batch queues.
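
A minimal connection sketch, assuming GSSAPI-based Kerberos login and using cmslpcgpu1.fnal.gov (the node that appears in the examples below) as the chosen node:

    # ~/.ssh/config: one stanza covering the login and GPU nodes alike
    Host cmslpc*.fnal.gov
        GSSAPIAuthentication yes
        GSSAPIDelegateCredentials yes

    kinit username@FNAL.GOV            # authenticate with Kerberos
    ssh username@cmslpcgpu1.fnal.gov   # log in to a GPU node
    nvidia-smi                         # check current GPU usage before claiming the card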

Monitoring

  • lpc-gpu mailing list
  • LPC Computing Discussions (CMS login required): see the slides and minutes

LPC cvmfs area information and gpu setup scripts

  • /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu
    • Setup.csh and Setup.sh
  • Note that bash is recommended; if tcsh is your default shell, switch to bash by hand before sourcing the setup script (a sketch follows this list). Be aware that the eos aliases will no longer work after changing the environment by hand.
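
A minimal setup sketch, assuming tcsh is your default login shell and you switch to bash by hand:

    bash -l                                                      # switch to bash (eos aliases stop working from here on)
    source /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu/Setup.sh   # bash version; Setup.csh is the tcsh equivalent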

GPU software installed on cvmfs/nodes

  • See the Anaconda Environments section below for how to set up Anaconda environments to use the software.
  • If you want another set of software installed, please contact Alexx Perloff, Marguerite Tonjes, and the lpc-gpu mailing list. It may take some time for the software to be synchronized to the LPC cvmfs area, so please keep in mind that the changes won't be instant.

Centrally Installed

| package name | version | where | notes |
| CUDA Toolkit and Driver | 9.2 | GPU nodes | may want to downgrade this for compatibility reasons |
| CUDA Toolkit and Driver | 9.0.176 | cvmfs | VM used for TensorFlow installation |
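
To confirm which CUDA toolkit and driver a GPU node actually has, a quick check (a sketch; /usr/local/cuda is the install path used by the deviceQuery output below):

    nvidia-smi                        # driver version and current GPU status
    cat /usr/local/cuda/version.txt   # toolkit version file, if present on the node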

Major Packages Installed

| Person | package | CUDA version needed | other dependencies | notes | installed (y/n) | environment |
| Alexx | cuDNN 7.0.5 | >=8.0 | installed using Anaconda 3 | version based on compatibility with currently installed CUDA version | y | central/mlenv2 |
| Alexx | TensorFlow 1.7.0 | 9.0 | CUDA Toolkit 9.0, cuDNN 7.0 | Install using pip | y | mlenv1 |
| Zhenbin | pytorch 0.3.1 | 7.5, 8.0, 9.0, or 9.1 | cuDNN 6.0 for CUDA<=8.0, cuDNN 7.0 for CUDA>=8.0 | | y | mlenv2 |
| Alexx | Anaconda 4.5.0-py36_0 | N/A | None | | y | N/A |
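
To verify the versions a given environment provides, list the relevant packages after activating it (a sketch, assuming the mlenv2 row from the table above):

    source /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu/Setup.sh
    source activate mlenv2
    conda list | grep -iE 'pytorch|cudnn|cudatoolkit'   # show the installed versions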

Anaconda Environments

  • To enter an environment, use the following commands (note that bash is recommended; if tcsh is your default shell, change the environment by hand with bash, and be aware that the eos aliases will then no longer work). See How to change login shell default at cmslpc. A complete session sketch follows this list.
source /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu/Setup.(c)sh # Needed once per session; preferably use the version in the cvmfs area described above
source activate <environment name>
  • To leave an environment, use the following command (this works in bash, but may need modification, e.g. conda deactivate, if you insist on using tcsh as your shell):
source deactivate
  • In some cases conda activate <environment name> and conda deactivate can also be used.
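
A minimal end-to-end session sketch in bash, assuming the mlenv2 (PyTorch) environment from the table below:

    source /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu/Setup.sh                       # once per session
    source activate mlenv2                                                           # enter the environment
    python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'    # quick GPU sanity check
    source deactivate                                                                # leave the environment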

| Environment Name | mlenv0 | mlenv1 | mlenv2 (LTS) | mlenv3 | mlenv4 | mlenv5 |
| Notes | Developer environment | TensorFlow/Theano environment | PyTorch environment | TensorFlow environment (LTS) | PyTorch environment | HGCal (PyTorch) environment |
| bazel | DONE (0.18.0) | No | No | No | No | No |
| blaze | DONE (0.11.3) | DONE | DONE | DONE | DONE | No |
| cuda | DONE (9.0) | No (central installation) | DONE (9.0) | No (central installation) | No (central installation) | No (central installation) |
| cudatoolkit | DONE (9.2) | No (central installation) | DONE (8.0) | DONE (9.2) | DONE (8.0) | 8.0 |
| cudnn | DONE (7.2.1) | No (central installation) | DONE (7.0.5) | DONE (7.2.1) | DONE (7.0.5) | 7.0.5 |
| dask | DONE (0.20.0) | DONE | DONE | DONE (0.19.1) | DONE | No |
| dev | No | DONE | DONE | No | DONE | No |
| EnergyFlow | DONE (0.10.5) | No | No | No | No | No |
| graphviz | DONE (2.40.1) | DONE | DONE | DONE | DONE | DONE (2.40.1) |
| h5py | DONE (2.8.0) | DONE | DONE | DONE (2.8.0) | DONE | DONE (2.9.0) |
| keras | DONE (2.2.4) | DONE (2.1.5) | DONE | DONE (2.2.2) | DONE | No |
| matplotlib | DONE (3.0.1) | DONE | DONE | DONE | DONE | DONE (3.0.3) |
| mkl | DONE (2019.3) | No | DONE | DONE | DONE | DONE (2019.3) |
| nb_conda | DONE (2.2.1) | DONE | DONE | DONE | DONE | DONE (2.2.1) |
| numba | DONE (0.40.0) | DONE | DONE | DONE | DONE | DONE (0.43.1) |
| numpy | DONE (1.16.2) | DONE | DONE | DONE | DONE | DONE (1.15.4) |
| pandas | DONE (0.24.2) | DONE | DONE | DONE | DONE | DONE (0.24.2) |
| pip | DONE (18.1) | DONE | DONE | DONE | DONE | DONE (19.0.3) |
| pycuda | DONE (2017.1) | No | DONE | No | DONE | No |
| pydot | DONE (1.2.4) | DONE | DONE | DONE | DONE | No |
| pygpu | DONE (0.7.6) | No | DONE | No | DONE | No |
| python | DONE (3.6.8) | DONE (3.6) | DONE (3.6) | DONE (2.7) | DONE (3.6) | DONE (3.7) |
| PyTorch | DONE (0.4.1) | No | DONE (0.3.1) | No | DONE (0.4.1) | DONE (1.0.1) |
| root | No | No | No | No | No | DONE (6.16/00) |
| rootpy | No | No | No | No | No | No |
| scikit-image | DONE (0.14.0) | DONE | DONE | DONE | DONE | No |
| scikit-learn | DONE (0.20.0) | DONE | DONE | DONE | DONE | No |
| scipy | DONE (1.1.0) | DONE | DONE | DONE | DONE | No |
| TensorFlow (GPU) | DONE (1.12.0) | DONE (1.7.0) | No | DONE (1.10.0) | No | No |
| Theano | DONE (1.0.3) | No | DONE | No | DONE | No |
| torchvision | DONE (0.2.1) | No | DONE (0.2.0) | No | DONE (0.2.1) | DONE (0.2.2) |
| uproot | DONE (3.4.18) | DONE | DONE | DONE | DONE (3.1.1) | DONE (3.4.18) |
| wheel | DONE (0.32.2) | DONE | No | DONE | DONE | No |

  • NOTE: Environment names shown in red on the wiki page are planned, but not yet installed.
  • NOTE: In each environment there are many other packages which are installed automatically by Anaconda, but which are not listed here.
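
To inspect everything an environment contains, including those automatically installed packages (a sketch, assuming mlenv1 from the table):

    source activate mlenv1
    conda list                                                   # full package list, not just the rows tabulated above
    python -c 'import tensorflow as tf; print(tf.__version__)'   # should report 1.7.0 per the table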

Alexx Perloff Notes

List of tested TensorFlow configurations and dependencies: https://www.tensorflow.org/install/install_sources#tested_source_configurations

Requirements to run TensorFlow on an NVIDIA GPU:

https://www.tensorflow.org/install/install_linux

  • The P100 has a high enough compute capability:
https://developer.nvidia.com/cuda-gpus
CUDA-Enabled Tesla Products
===========================
Tesla Data Center Products
--------------------------
GPU   Compute Capability
----  ------------------
P100  6.0
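
To confirm the compute capability directly on a node (a sketch, using the deviceQuery binary whose full output appears in the Tech Specs section below):

    /usr/local/cuda/extras/demo_suite/deviceQuery | grep 'CUDA Capability'
    # expected on these nodes: CUDA Capability Major/Minor version number: 6.0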

  • We have the CUDA® Toolkit 9.1 and only need 9.0

  • We have CUPTI as seen here /usr/local/cuda/extras/CUPTI/
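
A quick way to verify both prerequisites on a node (a sketch; the paths are the ones quoted above):

    /usr/local/cuda/bin/nvcc --version   # CUDA toolkit release (TensorFlow 1.7.0 needs 9.0)
    ls /usr/local/cuda/extras/CUPTI/     # confirm CUPTI is present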

Tech Specs

From an email from David Fagan:

1U rack chassis, 1400W Platinum Level power supply.
3 x hot-swap 3.5" SAS/SATA drive bays. 2 PCIe 3.0 x16 slots (supporting double-width GPUs), 1 PCIe 3.0 x16 low-profile slot.
Intel C621 chipset, single socket P (LGA 3647) supporting Intel Xeon Scalable Processors, TDP 70-205W.
6 DIMM slots, up to 768GB 3DS LRDIMM, 192GB ECC RDIMM, DDR4-2666MHz.
Onboard Intel X550 dual-port 10GBase-T, IPMI 2.0 with virtual media over LAN and KVM-over-LAN support, ASPEED AST2500 graphics, 6 SATA3 (6Gbps) ports.
Dual M.2 Mini-PCIe 2280 support.
Intel Xeon Silver 4140 8-core 2.1GHz 11MB cache 85W processor, DDR4-2400MHz
32GB DDR4-2400 ECC LRDIMM
NVIDIA Tesla P100 12GB PCIe 3.0 passive GPU
HGST 2TB SATA 6Gb/s 128M 7200RPM 3.5" HDD

from deviceQuery

[tonjes@cmslpcgpu1 gpu]$ /usr/local/cuda/extras/demo_suite/deviceQuery 
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P100-PCIE-12GB"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 12198 MBytes (12790923264 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1329 MHz (1.33 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              3072-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 101 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1, Device0 = Tesla P100-PCIE-12GB
Result = PASS

CPU and node information (from the login banner)

       Fermilab  policy  and  rules for computing, including appropriate
       use, may be found at http://www.fnal.gov/cd/main/cpolicy.html
------------------------------------------------------------------------------
                     ..::Powered by CMS-LPC::..                      

   Hostname: cmslpcgpu1.fnal.gov         OS Release: SLF 7.5 (Nitrogen)        
         IP: 131.225.188.177                 Subnet: 255.255.252.0             

     Kernel: 3.10.0-862.3.2                    Arch: x86_64                    
        RAM: 187.38 GiB                         Swap: 32.00 GiB                 
      Cores: 16                             Virtual: physical                  

GPUs on CMS Connect and other cvmfs versions of TensorFlow

Add your own Notes

-- MargueriteTonjes - 2018-03-09
