
Sample GPU Batch Job and Run Instructions

Sample GPU Batch Job
To Run Sample GPU Batch Job
Notes

Sample GPU Batch Job: gpu.pbs

#!/bin/sh
#PBS -l nodes=1:ppn=16:nvidia
#PBS -o output$PBS_JOBID.log
#PBS -N GPU_test_job
#PBS -M johndoe@bnl.gov
#PBS -j oe

#  Print time and date and hostname
date
hostname

#  $PBS_O_WORKDIR is the directory from which you submitted the job
cd $PBS_O_WORKDIR

#  Load the CUDA environment module
module load cuda

#  $program and $args are passed in on the qsub command line (see below)
echo ./${program} ${args}

time ./${program} ${args}


To Run This Sample GPU Batch Job

mkdir ~johndoe/gpu-batch-test        (use your own username and directory)
cp /software/samples/gpu-batch/* ~johndoe/gpu-batch-test
cd ~johndoe/gpu-batch-test
Modify the email address in gpu.pbs to be your own
module load cuda
make map-cuda-small
qsub -v program=map-cuda-small,args=1024 gpu.pbs
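The make target is defined by the Makefile shipped with the sample; presumably it simply compiles the source with nvcc, roughly nvcc -o map-cuda-small map-cuda-small.cu, but consult the Makefile itself for the exact flags.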

The qsub command above submits the job to the default batch queue, named batch. To submit it instead to another queue, e.g. gpudev:
qsub -q gpudev -v program=map-cuda-small,args=1024 gpu.pbs
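Equivalently, the queue can be fixed inside gpu.pbs with the standard PBS queue directive:
#PBS -q gpudev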

Your batch output file should appear in the directory from which you submitted the job, and have a filename like:
output3065.hpc1.csc.bnl.local.log
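(Here 3065.hpc1.csc.bnl.local is the PBS job ID of the run, which qsub prints when the job is submitted; the #PBS -o line in gpu.pbs builds the filename from it.)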

The first few lines of this output file should be like:
Fri Mar 14 13:38:59 EDT 2014
node08
./map-cuda-small 1024
size = 1024
0.000000 -> 0.000000
1.000000 -> 1.000000

The last few lines of this output file should be like:
1023.000000 -> 1046529.000000
copy data to device    25 micros
compute               278 micros
copy data to host      19 micros

real    0m0.401s
user    0m0.016s
sys     0m0.377s
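The three micros lines are timings printed by map-cuda-small itself; the real/user/sys lines come from the time command that wraps the program in gpu.pbs.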


Notes

As you can see in map-cuda-small.cu, this particular CUDA program takes an input argument (we used 1024 above). If it did not, the batch submission command would have been:
qsub -v program=map-cuda-small gpu.pbs

The data is copied from the host (node08 in the example above) to the device (the GPU), then the kernel (the function that runs on the GPU) runs many threads on the GPU, and finally the data is copied from the device back to the host.
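For orientation, here is a minimal sketch of that pattern. This is not the actual map-cuda-small.cu source: the command-line handling and the squaring kernel are assumptions (the kernel is inferred from the sample output above, where 1023 -> 1046529 = 1023 squared), and the timing printouts are omitted.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Map kernel: one thread per element; here, square each value
   (an assumption consistent with the sample output). */
__global__ void square(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = d[i] * d[i];
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1024;   /* the input argument, e.g. 1024 */
    size_t bytes = n * sizeof(float);
    printf("size = %d\n", n);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++)
        h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   /* copy data to device */
    square<<<(n + 255) / 256, 256>>>(d, n);            /* many threads on the GPU */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   /* copy data to host */

    for (int i = 0; i < n; i++)
        printf("%f -> %f\n", (double)i, h[i]);

    cudaFree(d);
    free(h);
    return 0;
}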

The host (node08) has two Intel Xeon processors with 8 cores apiece, 16 cores in total, so the batch job requests one host node and all 16 cores, as well as use of the GPU on that node. It could alternatively have requested this with:
#PBS -l nodes=1:ppn=16:gpus=1

node01 through node08 each have a GPU. The GPU boards have no bus over which to communicate directly, so MPI must be used for them to communicate. map-cuda-small.cu above does not use MPI and thus runs on only one GPU.
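For reference, a multi-GPU version would pair MPI with CUDA, one rank per node. The following is only a hedged skeleton of that arrangement, not part of the sample:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One GPU per node, one rank per node: each rank uses device 0.
       GPU-to-GPU traffic is staged through host memory and MPI. */
    cudaSetDevice(0);

    float buf[1024] = {0};   /* host staging buffer */
    if (rank == 0 && size > 1)
        MPI_Send(buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... cudaMemcpy buf to/from device memory around the MPI calls ... */

    MPI_Finalize();
    return 0;
}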

If our CUDA program wrote output files that we wanted placed in a separate subdirectory for each run, the batch script would use ${PBS_O_WORKDIR} a little differently: it would create a subdirectory of the directory from which we submitted the run, named after the PBS job ID of the run:

# Create the per-run work directory (named after the job ID) if needed,
# copy the program into it, and run from there
WORK=${PBS_O_WORKDIR}/${PBS_JOBID}
[ -d ${WORK} ] || mkdir -p ${WORK}
cp ${PBS_O_WORKDIR}/${program} ${WORK}
cd ${WORK}
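For the example run above, the output files would then land in a directory such as ${PBS_O_WORKDIR}/3065.hpc1.csc.bnl.local.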

