NY Blue/L: Compiling and Linking On the Front End, for the Compute Nodes
Porting Your Code
- Your goal is not only to port your code to run on the compute nodes, but also to test it there to make sure it produces correct results. See especially sections 6.2 and 6.3 of Unfolding the IBM eServer Blue Gene Solution (September 2005) for guidance on porting your code to Blue Gene/L. Bear in mind that the compute nodes are more limited than standard Linux with regard to features such as system calls, header files, profiling, and shell utilities.
- For debugging you might want to try the compiler flags
-g -O -qarch=440 -qtune=440 -qmaxmem=64000
Specify 440 for both the -qarch and -qtune values. Otherwise you may get -qarch=auto -qtune=auto, which optimizes for the front end (login node) rather than for the compute nodes.
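One way to test that a ported code produces correct results is to compare its output on the compute nodes against reference values from a trusted platform, allowing a small tolerance. The sketch below is only an illustration: the function name results_match, the tolerance defaults, and the sample values are all hypothetical and do not come from any Blue Gene tool.

```python
import math

def results_match(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return True if both sequences have the same length and every
    candidate value is close to the corresponding reference value."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )

# Hypothetical values: reference from a trusted platform,
# candidate from a run on the compute nodes.
reference = [1.0, 2.5, 3.141592653589793]
candidate = [1.0, 2.5000000000001, 3.141592653589793]
print(results_match(reference, candidate))
```

A suitable tolerance depends on the application; exact equality is usually too strict once compilers or optimization levels differ.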
Optimizing Your Code
- When you want to optimize your code, see the code tuning chapter of the XL Fortran for Blue Gene/L IBM Redbook or the XL C/C++ for Blue Gene/L IBM Redbook.
- -qarch=440d will try to achieve performance benefits using the double FPU, but using it does not always produce the fastest code.
Therefore, for optimizing you might try the flags
-O3 -qarch=440/440d -qtune=440
i.e. for the -qarch specification choose either 440 or 440d. The latter is better if your code can exploit the double FPU on each processor, so experiment to see whether -qarch=440d or -qarch=440 produces the faster code.
- Bear in mind that -O3 implies -qnostrict by default, which allows some reordering of floating-point computations and of possible exceptions such as division by zero.
- After trying -O3 for optimizing you might try
-O4 -qarch=440/440d -qtune=440
or
-O5 -qarch=440/440d -qtune=440
and see whether your code is faster at these higher (than -O3) optimization levels.
For some software, using -O4 or -O5 might result in slower code than using -O3!
- Also consider trying
-O3 -qarch=440/440d -qtune=440 -qipa
-qipa performs interprocedural analysis, that is, it analyzes and optimizes across procedure boundaries, and may speed up your code (though it might instead slow it down, hence the need to test it).
-qipa is implied by each of -O4 and -O5.
- Bear in mind that it is important to check answers as you increase the level of compiler optimization.
- You should specify the -qarch and -qtune values given above. Otherwise you may get -qarch=auto -qtune=auto, which optimizes for the front end (login node) rather than for the compute nodes.
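The advice to check answers at higher optimization levels follows from the fact that floating-point addition is not associative, so reordering a computation can change its result. The Python sketch below illustrates that general property only. It does not reproduce what the XL compilers actually do, and the values are chosen purely for illustration.

```python
# Floating-point addition is not associative: the grouping of the
# operations changes the rounded result.
a = 1.0e16
b = -1.0e16
c = 1.0

left_first = (a + b) + c   # a and b cancel exactly, then c is added
right_first = a + (b + c)  # c is lost when rounded into the large value b

print(left_first)   # 1.0
print(right_first)  # 0.0
```

Real codes rarely show differences this large, but small discrepancies of this kind can accumulate, which is why results should be compared after each change in optimization level.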
Some Blue Gene Differences Relative to Other Computing Environments
- Blue Gene/L has a distributed memory system and supports the MPI standard for message passing. Any application using another message passing library must be modified to call MPI routines instead, or use an intermediate layer to map its library calls to MPI.
- The Blue Gene/L compute nodes support 32-bit addressing only, so the -q64 option is not supported. (The service and front-end nodes use 64-bit PowerPC hardware, but your applications must be built for and run on the compute nodes.)
- No support for threads. OpenMP will not work. A user application runs as a single non-preemptable thread of execution on its processor. Multiple user threads running as part of the single user process are not supported. Since only a single user process per processor is supported, fork, which creates another process, is not supported either.
But then how do the two processors on a node communicate with each other? They send messages, not over the network hardware but through a virtual torus device implemented in a region of their shared memory called the scratchpad; see Chapter 2 of Blue Gene/L: Application Development for more detail. So although the two processors of a given node share memory, they don't communicate with each other in the way you may be accustomed to on shared-memory systems.
- The L1 caches of the two processors on a node are not coherent with each other. When your code uses MPI to send variables from one processor on a node to the other processor on that same node, each processor has its own copy of the variables, i.e. the memory location is not the same for both processors. Thus you needn't worry about variables taking stale, incorrect values because of the cache incoherence. This is true for both CO and VN modes.
- Asynchronous file I/O is not supported.
- Memory on Blue Gene may be more limited than you are accustomed to. If you specify CO mode, there is one MPI process per node with about 1 GB of memory for the process, whereas if you specify VN mode, there are two MPI processes per node with about 512 MB of memory per process. So you may have to reduce your application's memory usage on each compute node for it to run.
- Blue Gene has no virtual memory in the sense in which that term is often used: each node has 1 GB of memory and that is all. More precisely, Blue Gene has a fixed-size virtual address space, and memory pages can't be swapped out to disk.
- Dynamic libraries are not supported.
- Only about 30% to 40% of standard Linux system calls are supported on the compute nodes. fork, exec, and signal are not supported, nor are calls to usleep, gethostname, or getlogin. See chapter 6.1 of Unfolding the IBM eServer Blue Gene Solution for an itemization of some additional system calls not supported on the compute nodes.
See the list of supported compute node system calls in the Blue Gene/L: Application Development IBM Redbook.
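The memory limits above lend themselves to a quick back-of-the-envelope check before you choose between CO and VN mode. The sketch below uses the approximate figures from the text (about 1 GB per process in CO mode, about 512 MB in VN mode). The function name fits_in_mode and the 20% reserve for code, stack, and message buffers are hypothetical assumptions, not measured values.

```python
MB = 1024 * 1024

# Approximate memory per MPI process, taken from the text above.
MEMORY_PER_PROCESS = {"CO": 1024 * MB, "VN": 512 * MB}

def fits_in_mode(n_doubles, mode, reserve_fraction=0.2):
    """Estimate whether n_doubles 8-byte values fit in one process.

    reserve_fraction is a hypothetical allowance for everything other
    than the data itself (program code, stack, message buffers).
    """
    available = MEMORY_PER_PROCESS[mode] * (1.0 - reserve_fraction)
    return n_doubles * 8 <= available

# Example: 80 million doubles need 640,000,000 bytes per process.
for mode in ("CO", "VN"):
    print(mode, fits_in_mode(80_000_000, mode))
```

Because there is no swapping to disk, an array that does not fit cannot run at all, so an estimate of this kind is worth doing before submitting a job.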
Producing Compiler Listings
- -qlist produces a compiler listing, -qlistopt produces a compiler listing that shows all options in effect at the time of compiler invocation, and -qsource produces the source section of the listing. For each of these, a file with the suffix .lst is created containing the listing.
- -qreport produces transformation reports showing how the program was parallelized and how loops were optimized, but the flag cannot be used alone; see the IBM XL C/C++ or IBM XL Fortran documentation.
Mapping Tasks to Nodes
Other Valuable Compile/Link Information
Finally, though not all of the information may pertain to our Blue Gene/L, you can learn a lot by perusing the Blue Gene/L user support documentation of other sites, for example the Lawrence Livermore, RPI, SDSC, and Boston University Blue Gene/L Support Pages.
This site maintained by: bgwebmaster@bnl.gov