High Performance Computing Cluster "HPC1"

Overview: General
Overview: Intel Xeon E5-2670 Sandy Bridge Processors
Overview: Intel Xeon PHI Coprocessors
Overview: GPU
Overview: Compilers and MPI
Logging In to the Login Node
Accessing Software on HPC1
Running Jobs: Using Batch
Running Jobs: Sample MPI Batch Jobs on Intel Xeon E5-2670 Processors
Running Jobs: Sample Interactive Jobs on Intel Xeon PHI Coprocessors
Running Jobs: Sample Batch Job on GPU
Scratch Disk Space
Compiler Documentation: Intel
Compiler Documentation: GPU
Compiler Documentation: Portland
Compiler Documentation: GNU

Overview

HPC1, the Code Center Cluster, is reserved primarily for development work rather than production work.

The login node hostname is hpc1.csc.bnl.gov.

In addition to the login node, the cluster has 16 compute nodes, 15 of which are accessible to users: node01, node02, ..., node10, node12, ..., node16. node11 is not accessible.

HPC1 provides the Ganglia system monitoring tool.

Intel Xeon E5-2670 Processors

Each of node01, node02, ..., node10, node12, ..., node16 has two Intel Xeon E5-2670 "Sandy Bridge" processors, each with 8 cores and a clock speed of approximately 2.594 GHz, and 128 GB of memory per node, so 16 threads can be run on each node.

GPU

Additionally, node01, node02, ..., node08 each have one CUDA-capable NVIDIA board, a Tesla K20Xm GPU with compute capability 3.5, about 6.04 GB of global memory, 49152 bytes of shared memory per block, and a processor clock rate of 732 MHz.

Here are some other specs for each GPU:
  • Warp size: 32
  • Registers per block: 65536
  • Max threads per block: 1024
  • Max thread dimensions: 1024 x 1024 x 64
  • Max grid size: 2147483647 x 65535 x 65535
  • Multiprocessor count: 14
  • Max threads per multiprocessor: 2048

Note that GPUs do not have vector registers: they are SIMT ("Single Instruction, Multiple Threads"), not SIMD ("Single Instruction, Multiple Data").
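
The specifications above can be checked directly on a GPU node. Here is a minimal sketch (the file name devquery.cu is illustrative) that queries the CUDA runtime for device 0 and prints a few of these properties; compile it with nvcc after loading the cuda module (see the Compilers and GPU documentation sections below):

/* devquery.cu -- illustrative sketch that prints a few properties of GPU 0 */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   /* device 0 */
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("name:                   %s\n", prop.name);
    printf("compute capability:     %d.%d\n", prop.major, prop.minor);
    printf("global memory (bytes):  %zu\n", prop.totalGlobalMem);
    printf("shared memory/block:    %zu\n", prop.sharedMemPerBlock);
    printf("clock rate (kHz):       %d\n", prop.clockRate);
    printf("warp size:              %d\n", prop.warpSize);
    printf("multiprocessor count:   %d\n", prop.multiProcessorCount);
    printf("max threads/block:      %d\n", prop.maxThreadsPerBlock);
    return 0;
}

Compiling with nvcc devquery.cu -o devquery and running the executable on one of node01 through node08 should report values matching those listed above.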

Intel Xeon PHI Knights Corner Coprocessors

node09, node10, node12, node13, ..., node16 are the PHI host nodes; from each of these, one can use the launcher to run a program on its Intel PHI "Knights Corner" coprocessor. Each coprocessor is a model 5110P with 61 cores (one for the OS and 60 for computing), 4-way threading per core, a clock speed of approximately 1.053 GHz, and 8 GB of memory.

One can also log in to a PHI coprocessor directly and run there, by ssh-ing from the HPC1 login node to one of node09-mic0, node10-mic0, node12-mic0, ..., node16-mic0. Each of these is an Intel PHI card and runs BusyBox.

Both of the above methods of running on the Intel PHI coprocessors are referred to as Native Mode: the coprocessor functions as a standalone multicore Linux SMP computer. The -mmic flag must be specified to compile for native mode; and if one's source code includes OpenMP directives, one specifies -mmic -openmp.
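
As a minimal native-mode sketch (the file name hello_mic.c is illustrative), the following OpenMP program can be compiled on the login node with the Intel compiler and the -mmic -openmp flags and then run on a coprocessor, for example after ssh-ing to node09-mic0:

/* hello_mic.c -- illustrative native-mode OpenMP sketch */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each coprocessor core supports 4 hardware threads; the number of
       threads actually used can be controlled with OMP_NUM_THREADS. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}

Compile with icc -mmic -openmp hello_mic.c -o hello_mic. The binary (and, depending on the setup, the OpenMP runtime library) must be visible to the coprocessor before it can be run there.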

Another way to run on the Intel PHI coprocessors is referred to as Offload Mode: data and instructions for source code compiled on the host (HPC1 login node) are automatically sent to the coprocessor for execution, and the host waits for completion. (If a coprocessor is not available, a version of the application's offload regions compiled for the Intel E5-2670 processors is run on one of those processors instead.)
Regions of the application that are to be run in offload mode are specified with compiler directives in the source code: pragmas in C/C++, directives in Fortran.
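
As an illustrative sketch (the file and variable names are hypothetical), a C offload region is marked with the Intel offload pragma; the data movement follows the Explicit Memory Copy Model described next:

/* offload_sum.c -- illustrative offload-mode sketch */
#include <stdio.h>

#define N 1000

int main(void)
{
    float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) {
        a[i] = (float) i;
        b[i] = 2.0f * i;
    }

    /* The marked loop runs on the coprocessor if one is available;
       a and b are copied in, and c is copied back out. */
    #pragma offload target(mic) in(a, b) out(c)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}

This is compiled on the host with the Intel compiler (for example icc offload_sum.c -o offload_sum); no -mmic flag is used for offload mode.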

Variables used in an offload region of code that are declared outside the scope of that region are by default copied to the coprocessor before execution on it, and copied back to the host upon completion. This is called the Explicit Memory Copy Model.
One can instead use the Implicit Memory Copy Model by specifying Cilk keywords in one's source code. Variables marked as shared between the host and coprocessor (with the _Cilk_shared keyword) are placed at the same virtual addresses on both machines, and their values are synchronized at the start and end of offload function calls (marked with _Cilk_offload). Note that this is not shared memory: there is no hardware mapping a portion of coprocessor memory to the host, and the two memory subsystems are independent. The model is simply a way of copying data between the memory subsystems, and the copying is implicit in that the data to be copied is not specified at the synchronization points; the runtime determines what data has changed between the host and coprocessor.
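
A minimal sketch of the Implicit Memory Copy Model (the file and function names are hypothetical) marks shared data with _Cilk_shared and runs a shared function on the coprocessor with _Cilk_offload:

/* shared_sum.c -- illustrative Implicit Memory Copy Model sketch */
#include <stdio.h>

#define N 1000

/* Data marked _Cilk_shared is placed at the same virtual address on the
   host and the coprocessor and synchronized around offload calls. */
_Cilk_shared float data[N];
_Cilk_shared float sum;

/* A function called via _Cilk_offload must itself be marked shared. */
_Cilk_shared void accumulate(void)
{
    int i;
    sum = 0.0f;
    for (i = 0; i < N; i++)
        sum += data[i];
}

int main(void)
{
    int i;
    for (i = 0; i < N; i++)
        data[i] = 1.0f;

    /* Run accumulate() on the coprocessor; at this call the runtime
       determines which shared data has changed and synchronizes it. */
    _Cilk_offload accumulate();

    printf("sum = %f\n", sum);
    return 0;
}

The Intel shrd_sampleC and shrd_sampleCPP samples listed in the Documentation section below show fuller versions of this approach.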

Achieving good vectorization is essential for good performance on the Intel PHI coprocessor. Each coprocessor core has a vector arithmetic unit capable of executing SIMD vector instructions and can run up to four hardware threads simultaneously. The vector unit can execute 16 single-precision or 8 double-precision floating point operations per cycle (32 and 16, respectively, if Fused Multiply-Add is used).
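
As a small sketch (file name illustrative), a simple unit-stride loop like the one below is a good candidate for the coprocessor's vector unit; the Intel compiler attempts to vectorize it automatically, and its vectorization reports (see the compiler man pages and documentation below for the relevant flags) show whether it succeeded:

/* saxpy_sketch.c -- illustrative vectorization candidate */
#include <stdio.h>

#define N (1 << 20)

float x[N], y[N];

int main(void)
{
    int i;
    const float a = 2.5f;

    for (i = 0; i < N; i++) {
        x[i] = (float) i;
        y[i] = 1.0f;
    }

    /* Unit-stride, independent iterations: the compiler can map this
       loop onto the coprocessor's SIMD vector instructions. */
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[1] = %f\n", y[1]);
    return 0;
}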

Logging In to the Login Node

From the BNL site: See the Windows or Unix instructions, depending on the operating system of the computer from which you will be logging in.

From offsite: Follow the Windows or Unix instructions for creating an SSH keypair, depending on the operating system of the computer from which you will be logging in, but with the following changes:

  • For Windows, use PuTTY to log in to the host ssh.bnl.gov (rather than hpc1.csc.bnl.gov), using your SecurID. Once on ssh.bnl.gov, create your SSH keypair there (rather than on your local computer) and email only its public key to itdhelp@bnl.gov with the subject line: Please deploy public key on hpc1.csc.bnl.gov cluster. Once you are notified that the public key has been deployed, type ssh hpc1.csc.bnl.gov (while on ssh.bnl.gov) and enter your SSH passphrase when prompted.
  • For Unix, ssh ssh.bnl.gov (rather than ssh hpc1.csc.bnl.gov), using your SecurID to log in. Once on ssh.bnl.gov, create your SSH keypair there (rather than on your local computer) and email only its public key to itdhelp@bnl.gov with the subject line: Please deploy public key on hpc1.csc.bnl.gov cluster. Once you are notified that the public key has been deployed, type ssh hpc1.csc.bnl.gov (while on ssh.bnl.gov) and enter your SSH passphrase when prompted.

Compilers and MPI

The compilers and their MPI wrapper scripts (as well as all available software on HPC1) are accessed using modules.

For compilation targeting the Intel E5-2670 processors, typing module avail on the login node shows that the GNU, Intel, and Portland Group compilers are currently available. MVAPICH2, MPICH2, and OpenMPI are available for the GNU compiler, Intel MPI (impi) for the Intel compiler, and OpenMPI for the Portland compiler. Using the Intel compiler may give you the best performance.

The Intel Fortran, C, and C++ compiler invocation commands are ifort, icc, and icpc, respectively, and the corresponding MPI wrapper scripts are mpiifort, mpiicc, and mpiicpc.

The GNU Fortran, C, and C++ compiler invocations are gfortran, gcc, and g++ respectively, and the corresponding MPI wrapper scripts are mpif77, mpif90, mpicc, and mpicxx.

The Portland Group Fortran compiler invocations are pgf77, pgf90, pgf95, and pgfortran, where the latter three are aliases for one another.
The Portland C compiler invocation is pgcc, and the C++ invocations are pgcpp and pgCC. To generate C++ code conforming to the C++ Application Binary Interface, invoke pgc++ instead.
mpif77 and mpif90 are the Portland Fortran MPI wrapper scripts, mpicc is the C MPI wrapper script, and mpiCC, mpicxx, and mpic++ are the C++ MPI wrapper scripts; the latter three all invoke the same underlying Portland C++ compiler.

Depending on the flags specified, the Intel compiler can be used to compile a program to run on either the Intel E5-2670 processors or the Intel PHI coprocessors. Documentation for the Intel compiler can be found on the login node at /opt/intel/composerxe/Documentation/en_US.
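
As a minimal sketch (the file name is illustrative), an MPI program for the E5-2670 nodes is compiled with the wrapper script of whichever compiler module is loaded, for example mpiicc for the Intel compiler:

/* mpi_hello.c -- minimal MPI sketch */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Compile with, for example, mpiicc mpi_hello.c -o mpi_hello (Intel) or mpicc mpi_hello.c -o mpi_hello (GNU or Portland, depending on the module loaded). Launching the executable under the batch system is covered in the Running Jobs sections.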

NVIDIA's CUDA compiler (nvcc) is available for compiling code to run on the GPU of any of the GPU nodes (node01, node02, ..., node08).

See the Documentation section further below for a list of documentation for the Intel, GPU, Portland Group, and GNU compilers.

Scratch Disk Space

The /lscr directory on each compute node is local scratch space and has better I/O performance than /scratch on those nodes.

Documentation

Intel Compiler Documentation

On the HPC1 login node:

module load intel
man ifort
man icc
man icpc
mpiicc -help
mpiifort -help
mpiicpc -help

Intel C++ Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_c/main_cls/index.htm

Intel Fortran Compiler: /opt/intel/composerxe/Documentation/en_US/compiler_f/main_cls/index.htm

Using the Intel MIC Architecture (Xeon Phi Coprocessor):
See "Key Features" section under "Programming for the Intel MIC Architecture" in the above Intel Compiler documentation
Also see "Compiler Reference/Intrinsics" section under "Intrinsics for Intel MIC Architecture" for Intel MIC Intrinsics

Intel Math Kernel Library (MKL) User's Guide: /opt/intel/composerxe/Documentation/en_US/mkl
"Using the Intel Math Kernel Library on Intel MIC Core Architecture Coprocessors" is a section in the above.

Sample Intel Xeon Phi Coprocessor Offload Code Using Explicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC
Intel Sample Fortran Xeon Phi Coprocessor Offload Code: /opt/intel/composerxe/Samples/en_US/Fortran/mic_samples/

Sample Intel Xeon Phi Coprocessor Offload Code Using Implicit Memory Copy Model:
Intel Sample C++ Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleCPP
Intel Sample C Xeon Phi Coprocessor Offload Implicit: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleC

On the Web:

Intel Xeon Phi Coprocessor:
http://software.intel.com/mic-developer
https://portal.tacc.utexas.edu/user-guides/stampede includes MIC Programming information
https://www.tacc.utexas.edu/user-services/training/course-materials includes Xeon Phi Training Course Materials
https://wiki.jlab.org/cc/external/wiki/index.php/Intel_Xeon_Phi_(MIC)_Cluster Jefferson Lab Xeon Phi Cluster
http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML Xeon Phi Best Practice Guide
https://software.intel.com/sites/default/files/managed/f5/60/intel-xeon-phi-coprocessor-quick-start-developers-guide-mpss-3.2.pdf

GPU Compiler Documentation

On the HPC1 login node:

module load cuda
man nvcc

Sample CUDA Code: /software/cuda/6.0/samples
CUDA Documentation: /software/cuda/6.0/doc

On the Web:

https://github.com/davestampf/BNLWorkshop2013/blob/master/Monday.pdf?raw=true
http://developer.nvidia.com/category/zone/cuda-zone
http://nvidia.com/object/cuda_home_new.html
MOOC: "Introduction to Parallel Programming" course at http://www.udacity.com
MOOC: "Heterogeneous Parallel Programming" course at http://www.coursera.org

Books (Available Through Safari Online):

"CUDA Programming/A Developer's Guide to Parallel Computing with GPUs", Shane Cook, 2013, Morgan Kaufmann
"Programming Massively Parallel Processors/A Hands-on Approach", David Kirk & Wen-Mei W. Hwu, Second Edition, 2013, Morgan
Kaufmann
"GPU Computing Gems/Jade Edition", Wen-Mei W. Hwu, editor. 2012, Morgan Kaufmann. Outstanding collection of 36 papers.

Portland Group Compiler Documentation

On the HPC1 login node:

module load pgi
man pgf77
man pgf90
man pgf95
man pgfortran
man pgcc
man pgCC
man pgcpp
man pgc++
module load openmpi/1.6.5-pgi
man mpif77
man mpif90
man mpicc
man mpiCC
man mpicxx
man mpic++

You can also specify the -help flag to a wrapper script, for example: mpif90 -help

PGI User's Guide, Fortran Reference, CUDA Fortran User's Guide, Profiler User's Guide, Workstation Release Notes:
/software/pgi/linux86-64/13.10/doc

On the Web:

PGI Documentation for Intel, AMD, and NVIDIA Processors

GNU Compiler Documentation

On the HPC1 login node:

man gfortran
man gcc
man g++
module load openmpi/1.6.5-gnu (for example)
man mpif77
man mpif90
man mpicc
man mpic++
man mpicxx
man mpirun
man mpiexec

On the Web:

GNU Compiler Documentation
