High Performance Computing Cluster "HPC1"
- Overview: General
- Overview: Intel Xeon E5-2670 Sandy Bridge Processors
- Overview: Intel Xeon Phi Coprocessors
- Overview: GPU
- Overview: Compilers and MPI
- Logging In to the Login Node
- Accessing Software on HPC1
- Running Jobs: Using Batch
- Running Jobs: Sample MPI Batch Jobs on Intel Xeon E5-2670 Processors
- Running Jobs: Sample Interactive Jobs on Intel Xeon Phi Coprocessors
- Running Jobs: Sample Batch Job on GPU
- Scratch Disk Space
- Compiler Documentation: Intel
- Compiler Documentation: GPU
- Compiler Documentation: Portland
- Compiler Documentation: GNU
Overview
HPC1, the Code Center cluster, is reserved primarily for development work, not production work.
The login node hostname is hpc1.csc.bnl.gov.
In addition to the login node, the cluster has 16 compute nodes, though only 15 of them are accessible to users: node01, node02, ..., node10, node12, ..., node16. node11 is not accessible.
HPC1 provides the Ganglia system-monitoring tool.
Intel Xeon E5-2670 Processors
Each of node01, node02, ..., node10, node12, ..., node16 has two Intel E5-2670 "Sandy Bridge" processors with an approximate clock speed of 2.594 GHz and 8 cores per processor, plus 128 GB of memory, meaning that 16 threads can run on each node.
GPU
Additionally, node01, node02, ..., node08 each have a CUDA-capable NVIDIA board with one Tesla K20Xm GPU, compute capability 3.5, with about 6.04 GB of global memory, 49152 bytes of shared memory per block, and a 732 MHz processor clock rate.
Here are some other specs for each GPU:
warp size = 32, registers/block = 65536, max threads/block = 1024, max thread dimensions = 1024 x 1024 x 64, max grid size = 2147483647 x 65535 x 65535, multiprocessor count = 14, max threads/multiprocessor = 2048.
Note that GPUs do not have any vector registers. They are SIMT ("Single Instruction, Multiple Threads"), not SIMD ("Single Instruction, Multiple Data").
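These limits come into play when choosing a kernel launch configuration. As a minimal sketch (the helper name is ours, not part of CUDA), the usual ceiling division for picking a grid size under the 1024-threads-per-block limit:

```c
/* Number of blocks needed to cover n elements given a
   threads-per-block choice (at most 1024 on the K20Xm).
   Ceiling division: the last block may be partially full. */
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For n = 1,000,000 elements and 256 threads per block this gives 3907 blocks, comfortably under the 2147483647 maximum grid dimension.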
Intel Xeon Phi Knights Corner Coprocessors
node09, node10, node12, node13, ..., node16 are the Phi host nodes, from which one can use the launcher to run a program on each node's Intel Phi "Knights Corner" coprocessor, a model 5110P with 61 cores (one for the OS and 60 for computing), 4-way hardware threading, an approximate clock speed of 1.053 GHz, and 8 GB of memory.
One can also log in to a Phi coprocessor directly to run programs, by doing ssh from the HPC1 login node to one of node09-mic0, node10-mic0, node12-mic0, ..., node16-mic0. Each of these is an Intel Phi card and runs BusyBox.
Both of the above methods of running on the Intel Phi coprocessors are referred to as Native Mode: the coprocessor functions as a standalone multicore Linux SMP computer. The -mmic flag must be specified to compile for native mode; if the source code includes OpenMP directives, specify -mmic -openmp.
Another way to run on the Intel Phi coprocessors is referred to as Offload Mode: data and instructions for source code compiled on the host (the HPC1 login node) are automatically sent to the coprocessor for execution, and the host waits for completion. (If a coprocessor is not available, an E5-2670-compiled version of the application's offload regions is run on one of the Intel E5-2670 processors instead.)
Regions of the application source code that are to be run in offload mode
are specified using compiler directives in the source code: pragmas in
C/C++, directives in Fortran.
Variables used in an offload region of code that are declared outside the
scope of that region are by default copied to the coprocessor before
execution on it, and copied back to the host upon completion. This is called
the Explicit Memory Copy Model.
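A minimal C sketch of the Explicit Memory Copy Model (the function and variable names are ours; with a non-Intel compiler the unrecognized pragma is simply ignored and the region runs on the host, mirroring the fallback behavior described above):

```c
/* Sum an array in an offload region.  a and sum are declared
   outside the region, so they are copied to the coprocessor
   before execution and copied back on completion; the in/inout
   clauses make the directions explicit. */
double offload_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma offload target(mic) in(a:length(n)) inout(sum)
    {
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    return sum;
}
```

Compiling with icc on the login node enables the actual offload; the pragma clause syntax is covered in the Intel compiler documentation listed below.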
One can instead use the Implicit Memory Copy Model, by specifying Cilk keywords in one's source code. Variables marked to be shared between the host and coprocessor (with the _Cilk_shared keyword) are placed at the same virtual address on both machines, and their values are synchronized at the start and end of offload function calls (marked with _Cilk_offload). Note that this is not shared memory: there is no hardware mapping a portion of coprocessor memory to the host; the two memory subsystems are independent. This model is simply a way of copying data between the memory subsystems, and the copying is implicit in that the data to be copied is not specified at the synchronization points: the runtime determines what data has changed between the host and coprocessor.
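A minimal C sketch of the Implicit Memory Copy Model (the names are ours, and the #ifndef fallback is our assumption so the sketch also builds with non-Intel compilers, in which case everything simply runs on the host):

```c
#ifndef __INTEL_OFFLOAD
/* Non-Intel compilers: make the Cilk keywords vanish so the
   code compiles and runs entirely on the host. */
#define _Cilk_shared
#define _Cilk_offload
#endif

/* Shared data is placed at the same virtual address on host and
   coprocessor; the runtime synchronizes its value around offload
   calls without explicit in/out clauses. */
_Cilk_shared int data[4] = {1, 2, 3, 4};

_Cilk_shared int shared_sum(void) {
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += data[i];
    return s;
}

int run_offload(void) {
    /* _Cilk_offload runs the call on the coprocessor when one
       is available, synchronizing shared variables before and
       after the call. */
    return _Cilk_offload shared_sum();
}
```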
One needs to learn how to achieve good vectorization to get good performance on the Intel Phi coprocessor. Each coprocessor core has a vector arithmetic unit capable of executing SIMD vector instructions and can execute up to four hardware threads simultaneously. The vector unit can execute 16 single-precision or 8 double-precision floating-point operations per cycle (32 and 16, respectively, if fused multiply-add is used).
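As an illustration (the function is ours), a loop of the kind the compiler can vectorize well on the Phi, where each iteration maps naturally onto a fused multiply-add:

```c
/* y[i] = a*x[i] + y[i]: the restrict qualifiers let the compiler
   prove there is no aliasing, so it can emit SIMD fused
   multiply-add instructions for the whole loop. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On the Phi's 512-bit vector unit such a loop processes 16 floats per vector instruction; the Intel compiler's vectorization report options show whether a given loop actually vectorized.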
Logging In to the Login Node
From the BNL site: See the Windows or Unix instructions, whichever is the operating system of the computer from which you will be logging in.
From offsite: Follow the Windows or Unix instructions for creating an SSH keypair, whichever is the operating system from which you will be logging in. But in those instructions:
- For Windows, use PuTTY to log in to the host ssh.bnl.gov (rather than hpc1.csc.bnl.gov) using your SecurID. Once on ssh.bnl.gov, create your SSH keypair there (rather than on your local computer) and email only its public key to itdhelp@bnl.gov with the subject line: Please deploy public key on hpc1.csc.bnl.gov cluster. Once you are notified that the public key has been deployed, type ssh hpc1.csc.bnl.gov (while on ssh.bnl.gov) and enter your SSH passphrase when prompted.
- For Unix, ssh ssh.bnl.gov (rather than ssh hpc1.csc.bnl.gov) using your SecurID, then follow the same steps as for Windows: create your SSH keypair on ssh.bnl.gov, email only its public key to itdhelp@bnl.gov with the same subject line, and once it has been deployed, type ssh hpc1.csc.bnl.gov and enter your SSH passphrase when prompted.
Compilers and MPI
The compilers and their MPI wrapper scripts (as well as all available software on HPC1) are accessed using modules.
Regarding compilation for running on the Intel E5-2670 processors, typing module avail on the login node reveals that the GNU, Intel, and Portland Group compilers are currently available, and that MVAPICH2, MPICH2, and OpenMPI are available for the GNU compiler, Intel MPI (impi) for the Intel compiler, and OpenMPI for the Portland compiler. The Intel compiler may give the best performance.
The Intel Fortran, C, and C++ compiler invocations are ifort, icc, and icpc, respectively, and the corresponding MPI wrapper scripts are mpiifort, mpiicc, and mpiicpc.
The GNU Fortran, C, and C++ compiler invocations are gfortran, gcc, and g++, respectively, and the corresponding MPI wrapper scripts are mpif77/mpif90, mpicc, and mpicxx.
The Portland Group Fortran compiler invocations are pgf77, pgf90, pgf95, and pgfortran, where the latter three are aliases for one another. The Portland C compiler invocation is pgcc, and the C++ invocation is pgcpp or pgCC. To generate C++ code conforming to the C++ Application Binary Interface, invoke pgc++. mpif77 and mpif90 are the Portland Fortran MPI wrapper scripts, mpicc is the C MPI wrapper script, and mpiCC, mpicxx, and mpic++ are the C++ MPI wrapper scripts; the latter three all invoke the same underlying Portland C++ compiler.
Depending on the flag specified, the Intel compiler can compile a program to run on either the Intel E5-2670 processors or the Intel Phi coprocessors. Documentation for the Intel compiler can be found on the login node at /opt/intel/composerxe/Documentation/en_US.
Nvidia's CUDA compiler (nvcc) is available for running on the GPU on any one of the GPU nodes (node01, node02,...,node08).
See the Documentation section further below for a list of documentation for the GNU, Intel, and GPU compilers.
Scratch Disk Space
The /lscr directory on each compute node is local scratch space and has better I/O performance than /scratch on those nodes.
Documentation
Intel Compiler Documentation
On the HPC1 login node:
module load intel
man ifort
man icc
man icpc
mpiicc -help
mpiifort -help
mpiicpc -help
Intel C++ Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_c/main_cls/index.htm
Intel Fortran Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_f/main_cls/index.htm
Using the Intel MIC Architecture (Xeon Phi Coprocessor):
See the "Key Features" section under "Programming for the Intel MIC Architecture" in the above Intel Compiler documentation.
Also see the "Compiler Reference/Intrinsics" section under "Intrinsics for Intel MIC Architecture" for Intel MIC intrinsics.
Intel Math Kernel Library (MKL) User's Guide: /opt/intel/composerxe/Documentation/en_US/mkl
"Using the Intel Math Kernel Library on Intel MIC Core
Architecture Coprocessors" is a section in the above.
Sample Intel Xeon Phi Coprocessor Offload Code Using the Explicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC
Intel Sample Fortran Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/Fortran/mic_samples/
Sample Intel Xeon Phi Coprocessor Offload Code Using the Implicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleCPP
Intel Sample C Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleC
On the Web:
Intel Xeon Phi Coprocessor:
http://software.intel.com/mic-developer
https://portal.tacc.utexas.edu/user-guides/stampede
includes MIC Programming information
https://www.tacc.utexas.edu/user-services/training/course-materials
includes Xeon Phi Training Course Materials
https://wiki.jlab.org/cc/external/wiki/index.php/Intel_Xeon_Phi_(MIC)_Cluster
Jefferson Lab Xeon Phi Cluster
http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML
Xeon Phi Best Practice Guide
https://software.intel.com/sites/default/files/managed/f5/60/intel-xeon-phi-coprocessor-quick-start-developers-guide-mpss-3.2.pdf
GPU Compiler Documentation
On the HPC1 login node:
module load cuda
man nvcc
Sample CUDA Code: /software/cuda/6.0/samples
CUDA Documentation: /software/cuda/6.0/doc
On the Web:
https://github.com/davestampf/BNLWorkshop2013/blob/master/Monday.pdf?raw=true
http://developer.nvidia.com/category/zone/cuda-zone
http://nvidia.com/object/cuda_home_new.html
MOOC: "Introduction to Parallel Programming" course at
http://www.udacity.com
MOOC: "Heterogeneous Parallel Programming" course at
http://www.coursera.org
Books (Available Through Safari Online):
"CUDA Programming/A Developer's Guide to Parallel Computing with GPUs", Shane Cook, 2013, Morgan Kaufmann
"Programming Massively Parallel Processors/A Hands-on Approach", David Kirk & Wen-Mei W. Hwu, Second Edition, 2013, Morgan Kaufmann
"GPU Computing Gems/Jade Edition", Wen-Mei W. Hwu, editor, 2012, Morgan Kaufmann. An outstanding collection of 36 papers.
Portland Group Compiler Documentation
On the HPC1 login node:
module load pgi
man pgf77
man pgf90
man pgf95
man pgfortran
man pgcc
man pgCC
man pgcpp
man pgc++
module load openmpi/1.6.5-pgi
man mpif77
man mpif90
man mpicc
man mpiCC
man mpicxx
man mpic++
Also specify the -help flag to the wrapper scripts, for example: mpif90 -help
PGI User's Guide, Fortran Reference, CUDA Fortran User's Guide, Profiler User's Guide, Workstation Release Notes:
/software/pgi/linux86-64/13.10/doc
On the Web:
PGI Documentation for Intel, AMD, and NVIDIA Processors
GNU Compiler Documentation
On the HPC1 login node:
man gfortran
man gcc
man g++
module load openmpi/1.6.5-gnu (for example)
man mpif77
man mpif90
man mpicc
man mpic++
man mpicxx
man mpirun
man mpiexec
On the Web: