General Lab Information

High Performance Computing Cluster "HPC1"


Overview

HPC1, the Code Center cluster, is reserved primarily for development work, not production work.

The login node hostname is hpc1.csc.bnl.gov.

In addition to the login node, the cluster has 16 compute nodes, though only 15 of these are accessible to users: node01, node02, ..., node10, node12, ..., node16. node11 is not accessible.

HPC1 has a system monitoring tool, Ganglia.

Intel Xeon E5-2670 Processors

Each of node01, node02, ..., node10, node12, ..., node16 has two Intel Xeon E5-2670 "Sandy Bridge" processors (8 cores per processor, approximately 2.6 GHz) and 128 GB of memory, so 16 threads can run on each node.
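
As a quick check of the per-node thread count, one can run a minimal OpenMP program such as the sketch below (the file name and compile line are illustrative; icpc and -openmp are described in the sections that follow):

  // omp_count.cpp -- minimal sketch: report how many OpenMP threads run on a node.
  // Assumed build after "module load intel":  icpc -openmp omp_count.cpp -o omp_count
  #include <cstdio>
  #include <omp.h>

  int main() {
      #pragma omp parallel
      {
          #pragma omp master   // print once, from the master thread
          std::printf("Running with %d OpenMP threads\n", omp_get_num_threads());
      }
      return 0;
  }

By default the Intel OpenMP runtime starts one thread per available hardware thread, so 16 would be the expected output on a compute node.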

GPU

Additionally, node01, node02, ..., node08 each have a CUDA-capable NVIDIA board: one Tesla K20Xm GPU with compute capability 3.5, about 6 GB of global memory, 49152 bytes of shared memory per block, and a processor clock rate of 732 MHz.

Here are some other specs for each GPU:

  • Warp size: 32
  • Registers per block: 65536
  • Max threads per block: 1024
  • Max thread dimensions: 1024 x 1024 x 64
  • Max grid size: 2147483647 x 65535 x 65535
  • Multiprocessor count: 14
  • Max threads per multiprocessor: 2048

Note that GPUs do not have vector registers: they are SIMT ("Single Instruction, Multiple Threads"), not SIMD ("Single Instruction, Multiple Data").
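
The figures above can be regenerated on any of the GPU nodes with a small program that queries the CUDA runtime; the following is a minimal sketch (file name and invocation are illustrative; nvcc and the cuda module are described further below):

  // gpu_props.cu -- minimal sketch: print a few properties of GPU 0 via the CUDA runtime.
  // Assumed build on a GPU node after "module load cuda":  nvcc gpu_props.cu -o gpu_props
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaDeviceProp prop;
      if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
          std::printf("No CUDA device found\n");
          return 1;
      }
      std::printf("Device: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
      std::printf("Global memory: %.2f GB, shared memory/block: %zu bytes\n",
                  prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0), prop.sharedMemPerBlock);
      std::printf("Clock rate: %d MHz, multiprocessors: %d, warp size: %d\n",
                  prop.clockRate / 1000, prop.multiProcessorCount, prop.warpSize);
      return 0;
  }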

Intel Xeon Phi "Knights Corner" Coprocessors

node09, node10, node12, node13, ..., node16 are the Phi host nodes. From each of these, one can use the launcher to run a program on the node's Intel Xeon Phi "Knights Corner" coprocessor, a model 5110P with 61 cores (one for the OS, 60 for computing), 4-way hardware threading, a clock speed of approximately 1.053 GHz, and 8 GB of memory.

One can also log in to a Phi coprocessor directly to run on it, by doing ssh from the HPC1 login node to one of node09-mic0, node10-mic0, node12-mic0, ..., node16-mic0. Each of these is an Intel Phi card and runs BusyBox.

Both of the above methods of running on the Intel Phi coprocessors are referred to as Native Mode: the coprocessor functions as a standalone multicore Linux SMP computer. The -mmic flag must be specified to compile for native mode; if the source code includes OpenMP directives, specify -mmic -openmp.
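
As an illustration of native-mode compilation, here is a minimal OpenMP sketch (the file and binary names are illustrative):

  // hello_mic.cpp -- minimal sketch of a native-mode (Phi-only) OpenMP program.
  // Assumed build on the login node after "module load intel":
  //     icpc -mmic -openmp hello_mic.cpp -o hello_mic
  // The resulting binary runs only on a coprocessor, e.g. after ssh to node09-mic0
  // (the MIC OpenMP runtime library must also be reachable on the card).
  #include <cstdio>
  #include <omp.h>

  int main() {
      #pragma omp parallel
      {
          #pragma omp master   // print once
          std::printf("Hello from the coprocessor: %d OpenMP threads\n",
                      omp_get_num_threads());
      }
      return 0;
  }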

Another way to run on the Intel Phi coprocessors is referred to as Offload Mode: data and instructions for source code compiled on the host (the HPC1 login node) are automatically sent to the coprocessor for execution, and the host waits for completion. (If no coprocessor is available, an E5-2670-compiled version of the application's offload regions is run on one of the Intel E5-2670 processors instead.)
Regions of the application source code to be run in offload mode are specified with compiler directives in the source code: pragmas in C/C++, directives in Fortran.
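
A minimal sketch of Offload Mode in C/C++ using the offload pragma (names and array sizes are illustrative; compile on the login node with the Intel compiler, e.g. icpc -openmp):

  // offload_scale.cpp -- minimal Offload Mode sketch: the marked region runs on a
  // coprocessor if one is available, otherwise on the host E5-2670 processors.
  #include <cstdio>

  int main() {
      const int n = 1000;
      float a[n], b[n];
      for (int i = 0; i < n; ++i) a[i] = static_cast<float>(i);

      // Copy 'a' to the coprocessor, run the block there, copy 'b' back to the host.
      #pragma offload target(mic) in(a) out(b)
      {
          #pragma omp parallel for
          for (int i = 0; i < n; ++i)
              b[i] = 2.0f * a[i];
      }

      std::printf("b[10] = %f\n", b[10]);
      return 0;
  }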

Variables used in an offload region of code that are declared outside the scope of that region are by default copied to the coprocessor before execution on it, and copied back to the host upon completion. This is called the Explicit Memory Copy Model.
One can instead use the Implicit Memory Copy Model by specifying Cilk keywords in the source code. Variables marked as shared between the host and coprocessor (with the _Cilk_shared keyword) are placed at the same virtual address on both machines, and their values are synchronized at the start and end of offload function calls (marked with _Cilk_offload). Note that this is not shared memory: there is no hardware mapping a portion of coprocessor memory to the host, and the two memory subsystems are independent. The model is simply a way of copying data between the memory subsystems, and the copying is implicit in that the data to be copied is not specified at the synchronization points; the runtime determines what data has changed between the host and coprocessor.
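
A minimal sketch of the Implicit Memory Copy Model (names are illustrative; compile with the Intel compiler, e.g. icpc):

  // shared_sum.cpp -- minimal Implicit Memory Copy Model sketch using Cilk keywords.
  #include <cstdio>

  #define N 1000

  // Placed at the same virtual address on host and coprocessor; values are
  // synchronized at the start and end of offloaded calls.
  _Cilk_shared int data[N];
  _Cilk_shared int result;

  // Compiled for both the host and the coprocessor.
  _Cilk_shared void sum_data() {
      int s = 0;
      for (int i = 0; i < N; ++i) s += data[i];
      result = s;
  }

  int main() {
      for (int i = 0; i < N; ++i) data[i] = 1;
      _Cilk_offload sum_data();   // runs on a coprocessor if one is available
      std::printf("sum = %d\n", result);
      return 0;
  }

The Intel sample code listed in the Documentation section below (shrd_sampleC, shrd_sampleCPP) gives fuller examples of this model.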

To get good performance on the Intel Phi coprocessor, one needs to learn how to achieve good vectorization. Each coprocessor core has a vector arithmetic unit capable of executing SIMD vector instructions and can run up to four hardware threads simultaneously. The vector unit can execute 16 single-precision or 8 double-precision floating point operations per cycle (32 and 16, respectively, if the Fused Multiply-Add is used).
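
As a sketch of the kind of loop the compiler can vectorize well on the coprocessor (simple, unit-stride, no loop-carried dependence; names, alignment, and the report flag are illustrative):

  // vec_add.cpp -- minimal vectorization sketch for the coprocessor's 512-bit vector unit.
  // Assumed native-mode build: icpc -mmic -vec-report2 vec_add.cpp -o vec_add
  // (the vectorization-report flag may differ between compiler versions).
  #include <cstdio>

  int main() {
      const int n = 1024;
      // 64-byte alignment matches the width of the coprocessor's vector registers.
      __attribute__((aligned(64))) float a[n], b[n], c[n];

      for (int i = 0; i < n; ++i) { a[i] = static_cast<float>(i); b[i] = 2.0f * i; }

      // Unit-stride, dependence-free loop: a good SIMD candidate; the pragma asks
      // the Intel compiler to vectorize it.
      #pragma simd
      for (int i = 0; i < n; ++i)
          c[i] = a[i] + b[i];

      std::printf("c[100] = %f\n", c[100]);
      return 0;
  }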

Logging In to the Login Node

From the BNL site: See the Windows or Unix instructions, whichever is the operating system of the computer from which you will be logging in.

From offsite: Follow the Windows or Unix instructions for creating an SSH keypair, whichever is the operating system from which you will be logging in. But in those instructions:

  • For Windows, use PuTTY to log in to the host ssh.bnl.gov (rather than hpc1.csc.bnl.gov), using your SecurID. Once on ssh.bnl.gov, create your SSH keypair there (rather than on your local computer) and email only its public key to itdhelp@bnl.gov with the subject line: Please deploy public key on hpc1.csc.bnl.gov cluster. Once you are notified that the public key has been deployed, type ssh hpc1.csc.bnl.gov (while on ssh.bnl.gov) and enter your SSH passphrase when prompted.
  • For Unix, ssh ssh.bnl.gov (rather than ssh hpc1.csc.bnl.gov), using your SecurID to log in. Once on ssh.bnl.gov, create your SSH keypair there (rather than on your local computer) and email only its public key to itdhelp@bnl.gov with the subject line: Please deploy public key on hpc1.csc.bnl.gov cluster. Once you are notified that the public key has been deployed, type ssh hpc1.csc.bnl.gov (while on ssh.bnl.gov) and enter your SSH passphrase when prompted.

Compilers and MPI

The compilers and their MPI wrapper scripts (as well as all available software on HPC1) are accessed using modules.

Regarding compilation for the Intel E5-2670 processors, typing module avail on the login node shows that the GNU, Intel, and Portland Group compilers are currently available, with MVAPICH2, MPICH2, and OpenMPI available for the GNU compilers, Intel MPI (impi) for the Intel compilers, and OpenMPI for the Portland Group compilers. Using the Intel compiler may give you the best performance.

The Intel Fortran, C, and C++ compiler invocation commands are ifort, icc, and icpc, respectively, and the corresponding MPI wrapper scripts are mpiifort, mpiicc, and mpiicpc.

The GNU Fortran, C, and C++ compiler invocations are gfortran, gcc, and g++, respectively, and the corresponding MPI wrapper scripts are mpif77 and mpif90 (Fortran), mpicc (C), and mpicxx (C++).

The Portland Group Fortran compiler invocations are pgf77, pgf90, pgf95, and pgfortran, where the latter three are aliases for each other.
The Portland Group C compiler invocation is pgcc, and the C++ invocations are pgcpp and pgCC. To generate C++ code conforming to the C++ Application Binary Interface, invoke pgc++ instead.
mpif77 and mpif90 are the Portland Group Fortran MPI wrapper scripts, mpicc is the C MPI wrapper script, and mpiCC, mpicxx, and mpic++ are the C++ MPI wrapper scripts; the latter three all invoke the same underlying Portland Group C++ compiler.
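
As a minimal MPI example (this sketch uses the Intel wrappers described above; the GNU and Portland Group wrappers are used analogously, and the run command may differ depending on how jobs are launched on the cluster):

  // mpi_hello.cpp -- minimal MPI sketch.
  // Assumed build after loading the Intel compiler and MPI modules:
  //     mpiicpc mpi_hello.cpp -o mpi_hello
  // Assumed run, e.g.:  mpirun -np 16 ./mpi_hello
  #include <cstdio>
  #include <mpi.h>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      int rank = 0, size = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      std::printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();
      return 0;
  }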

Depending upon the flag specified, the Intel compiler can be used to compile a program to run on either the Intel E5-2670 processors or the Intel Phi coprocessors. Documentation for the Intel compiler can be found on the login node at /opt/intel/composerxe/Documentation/en_US.

NVIDIA's CUDA compiler (nvcc) is available for running on the GPU of any of the GPU nodes (node01, node02, ..., node08).
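
A minimal CUDA sketch (file name, array sizes, and launch configuration are illustrative) that can be compiled with nvcc on one of the GPU nodes:

  // saxpy.cu -- minimal CUDA sketch: compute y = a*x + y on the GPU.
  // Assumed build on a GPU node after "module load cuda":
  //     nvcc -arch=sm_35 saxpy.cu -o saxpy    (sm_35 matches compute capability 3.5)
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void saxpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
  }

  int main() {
      const int n = 1 << 20;
      float* hx = new float[n];
      float* hy = new float[n];
      for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

      float *dx = 0, *dy = 0;
      cudaMalloc(&dx, n * sizeof(float));
      cudaMalloc(&dy, n * sizeof(float));
      cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

      // One thread per element, 256 threads per block.
      saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
      cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);

      std::printf("hy[0] = %f (expected 4.0)\n", hy[0]);

      cudaFree(dx); cudaFree(dy);
      delete[] hx; delete[] hy;
      return 0;
  }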

See the Documentation section below for documentation on the Intel, GPU (CUDA), Portland Group, and GNU compilers.

Scratch Disk Space

The /lscr directory on each compute node is local scratch space and has better I/O performance than /scratch on those nodes.

Documentation

Intel Compiler Documentation

On the HPC1 login node:

module load intel
man ifort
man icc
man icpc
mpiicc -help
mpiifort -help
mpiicpc -help

Intel C++ Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_c/main_cls/index.htm

Intel Fortran Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_f/main_cls/index.htm

Using the Intel MIC Architecture (Xeon Phi Coprocessor):
See "Key Features" section under "Programming for the Intel MIC Architecture" in the above Intel Compiler documentation
Also see "Compiler Reference/Intrinsics" section under "Intrinsics for Intel MIC Architecture" for Intel MIC Intrinsics

Intel Math Kernel Library (MKL) User's Guide: /opt/intel/composerxe/Documentation/en_US/mkl
"Using the Intel Math Kernel Library on Intel MIC Core Architecture Coprocessors" is a section in the above.

Sample Intel Xeon Phi Coprocessor Offload Code Using Explicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC
Intel Sample Fortran Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/Fortran/mic_samples/

Sample Intel Xeon Phi Coprocessor Offload Code Using Implicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleCPP
Intel Sample C Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleC

On the Web:

Intel Xeon Phi Coprocessor:
http://software.intel.com/mic-developer
https://portal.tacc.utexas.edu/user-guides/stampede includes MIC Programming information
https://www.tacc.utexas.edu/user-services/training/course-materials includes Xeon Phi Training Course Materials
https://wiki.jlab.org/cc/external/wiki/index.php/Intel_Xeon_Phi_(MIC)_Cluster Jefferson Lab Xeon Phi Cluster
http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML Xeon Phi Best Practice Guide
https://software.intel.com/sites/default/files/managed/f5/60/intel-xeon-phi-coprocessor-quick-start-developers-guide-mpss-3.2.pdf

GPU Compiler Documentation

On the HPC1 login node:

module load cuda
man nvcc

Sample CUDA Code: /software/cuda/6.0/samples
CUDA Documentation: /software/cuda/6.0/doc

On the Web:

https://github.com/davestampf/BNLWorkshop2013/blob/master/Monday.pdf?raw=true
http://developer.nvidia.com/category/zone/cuda-zone
http://nvidia.com/object/cuda_home_new.html
MOOC: "Introduction to Parallel Programming" course at http://www.udacity.com
MOOC: "Heterogeneous Parallel Programming" course at http://www.coursera.org

Books (Available Through Safari Online):

"CUDA Programming/A Developer's Guide to Parallel Computing with GPUs", Shane Cook, 2013, Morgan Kaufmann
"Programming Massively Parallel Processors/A Hands-on Approach", David Kirk & Wen-Mei W. Hwu, Second Edition, 2013, Morgan
Kaufmann
"GPU Computing Gems/Jade Edition", Wen-Mei W. Hwu, editor. 2012, Morgan Kaufmann. Outstanding collection of 36 papers.

Portland Group Compiler Documentation

On the HPC1 login node:

module load pgi
man pgf77
man pgf90
man pgf95
man pgfortran
man pgcc
man pgCC
man pgcpp
man pgc++
module load openmpi/1.6.5-pgi
man mpif77
man mpif90
man mpicc
man mpiCC
man mpicxx
man mpic++

Also specify the -help flag to the wrapper scripts, for example: mpif90 -help

PGI User's Guide, Fortran Reference, CUDA Fortran User's Guide, Profiler User's Guide, Workstation Release Notes:
/software/pgi/linux86-64/13.10/doc

On the Web:

PGI Documentation for Intel, AMD, and NVIDIA Processors

GNU Compiler Documentation

On the HPC1 login node:

man gfortran
man gcc
man g++
module load openmpi/1.6.5-gnu (for example)
man mpif77
man mpif90
man mpicc
man mpic++
man mpicxx
man mpirun
man mpiexec

On the Web:

GNU Compiler Documentation