CSCI596 Assignment 1—Computational Complexity and Flop/s Performance
0.1. Asymptotic complexity analysis
Read the lecture note (https://cacs.usc.edu/education/cs596/AsymptoticAnalysis.pdf) and
“Appendix A.2—Order Analysis of Functions” of Introduction to Parallel Computing by
Grama et al. (page 581 of the PDF file, linked from the course homepage,
https://cacs.usc.edu/education/cs596.html, under the “Textbooks” heading).
0.2. Theoretical peak performance of a computer
Read a lecture note (https://cacs.usc.edu/education/cs596/PeakFlops.pdf) to learn how to
compute the theoretical peak floating-point performance of a computer.
Now, here is the actual assignment: Submit your answers to the following two questions.
1. Measuring Computational Complexity
Use the data file, MDtime.out, in the assignment 1 package. In the two-column file, the left
column is the number of atoms, N, simulated by the md.c program, whereas the right column is
the corresponding running time, T, of the program in seconds. Make a log-log plot of T vs. N.
Perform a linear fit of log T vs. log N, i.e., log T = a log N + b, where a and b are fitting parameters.
Note that the coefficient a signifies the power with which the runtime scales as a function of the
problem size N: T ∝ N^a. For details, see slide 31 of https://cacs.usc.edu/education/cs596/01MDVG.pdf. Submit (i) the plot and (ii) your fitted value of a.
For this and subsequent assignments, you need to use scientific plotting software such as Grace,
Origin, Kaleidagraph, Gnuplot, Matlab, Mathematica, or Excel. Please make sure that you are
familiar with one such program. For this assignment, you also need to use the least-squares fitting
feature of your plotting software. In case you cannot find such a feature, you can do the fit yourself
following the lecture note on “least square fit of a line” (https://cacs.usc.edu/education/cs596/LeastSquareFit.pdf); a minimal sketch is also given below.
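If your plotting software lacks a built-in fitting feature, the fit can also be computed directly. Below is a minimal C sketch (assuming MDtime.out contains two whitespace-separated columns, N and T) of the least-squares fit of log T versus log N:

/* Least-squares fit of log(T) = a*log(N) + b to the two-column file MDtime.out.
   Minimal sketch; any log base gives the same slope a. */
#include <stdio.h>
#include <math.h>

int main(void) {
  FILE *fp = fopen("MDtime.out", "r");
  double n, t, x, y, sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  int m = 0;
  if (fp == NULL) {perror("MDtime.out"); return 1;}
  while (fscanf(fp, "%lf %lf", &n, &t) == 2) {
    x = log(n);  /* log of the number of atoms, N */
    y = log(t);  /* log of the running time, T    */
    sx += x; sy += y; sxx += x*x; sxy += x*y;
    ++m;
  }
  fclose(fp);
  double a = (m*sxy - sx*sy)/(m*sxx - sx*sx);  /* slope = scaling exponent */
  double b = (sy - a*sx)/m;                    /* intercept                */
  printf("a = %f  b = %f\n", a, b);
  return 0;
}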
2. Theoretical Flop/s Performance
Suppose that your computer has only one octa-core processor (a processor equipped with 8
processing cores) operating at a clock speed of 2.3 GHz (i.e., the clock ticks 2.3×10^9 times per second),
in which each core can perform 1 multiplication and 1 addition per clock cycle using a
fused multiply-add (FMA) circuit. Assume that each multiply or add operation is performed on
vector registers, each holding 4 double-precision (i.e., 4×64 = 256 bits) operands. What is the
theoretical peak performance of your computer in terms of double-precision Gflop/s (gigaflop/s,
or 10^9 flop/s)? Submit the computed number.
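As a sketch of the general bookkeeping only (using made-up hardware parameters, not those of this question), the peak performance is the product of the core count, the clock rate, and the number of floating-point operations each core completes per cycle:

/* Peak flop/s bookkeeping with made-up hardware parameters
   (NOT the parameters of this question). */
#include <stdio.h>

int main(void) {
  double clock_ghz = 3.0;  /* hypothetical clock speed in GHz                             */
  int ncores = 4;          /* hypothetical number of cores                                */
  int flops_per_fma = 2;   /* 1 multiply + 1 add per fused multiply-add                   */
  int vec_width = 2;       /* hypothetical double-precision operands per vector register  */
  double peak_gflops = clock_ghz*ncores*flops_per_fma*vec_width;
  printf("Theoretical peak = %g Gflop/s\n", peak_gflops);  /* 48 Gflop/s for these numbers */
  return 0;
}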
(Optional) A program named lmd_sqrt_flop.c is provided in the gzipped tar archive, cs596-as01-tar.gz, on Blackboard, along with instructions to compile and run the program in the
README file.
* This is a linked-list-cell molecular dynamics program, in which the sqrt() function is
implemented using a polynomial so that the number of floating-point operations can be counted (see the
lecture note on “Arithmetic implementation of sqrt() and floating-point performance”,
https://cacs.usc.edu/education/cs596/Sqrt.pdf). Compile and run the lmd_sqrt_flop.c program on a
computer of your choice, and report the flop/s performance you get. Better yet, what percentage
of the theoretical peak flop/s performance of the computer did you achieve?
* To extract the files from the archive, type tar xvfz cs596-as01-tar.gz.
CSCI596 Assignment 2—Message Passing Interface
Goal: Implement Your Own Global Summation with Message Passing Interface
In this assignment, you will write your own global summation program (equivalent to
MPI_Allreduce) using MPI_Send and MPI_Recv. Your program should run with P = 2^l
processes (or MPI ranks), where l = 0, 1,… Each process contributes a partial value, and at the
end, all the processes will have the globally-summed value of these partial contributions.
Your program will use a communication structure called butterfly, which is structured as a
series of pairwise exchanges (see the figure below where messages are denoted by arrows). This
structure allows a global reduction among P processes to be performed in log_2 P steps.
For example, for P = 8 processes holding partial values a000, …, a111 (the subscripts are the process ranks in binary), the sum is accumulated by recursive pairwise grouping:
a000 + a001 + a010 + a011 + a100 + a101 + a110 + a111
= ((a000 + a001) + (a010 + a011)) + ((a100 + a101) + (a110 + a111)).
At each level l, a process exchanges messages with a partner whose rank differs only at the l-th bit position in the binary representation (Fig. 1).
Fig. 1: Butterfly network used in hypercube algorithms.
HYPERCUBE TEMPLATE
We can use the following template to perform a global reduction using any associative
operator OP (such as multiplication or maximum), i.e., one satisfying (a OP b) OP c = a OP (b OP c) [1,2,3].
procedure hypercube(myid, input, logP, output)
begin
  mydone := input;
  for l := 0 to logP-1 do
  begin
    partner := myid XOR 2^l;
    send mydone to partner;
    receive hisdone from partner;
    mydone := mydone OP hisdone
  end;
  output := mydone
end
USE OF BITWISE LOGICAL XOR
Note that
0 XOR 0 = 1 XOR 1 = 0;
0 XOR 1 = 1 XOR 0 = 1,
so that a XOR 1 flips the bit a, i.e.,
a XOR 1 = ā
a XOR 0 = a,
where ā is the complement of a (ā = 1|0 for a = 0|1). In particular, myid XOR 2^l reverses the
l-th bit of the rank of this process, myid:
abcdefg XOR 0000100 = abcdēfg
Note that the XOR operator is ^ (the caret symbol) in the C programming language.
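As a small standalone illustration (a toy program, not part of the assignment code), the following prints the butterfly partner of a given rank at every level:

/* Print the hypercube (butterfly) partner of rank myid at each level l.
   Toy illustration; assumes the number of processes P is a power of 2. */
#include <stdio.h>

int main(void) {
  int P = 8;     /* number of processes (power of 2) */
  int myid = 5;  /* example rank: 101 in binary      */
  for (int l = 0; (1 << l) < P; ++l) {
    int partner = myid ^ (1 << l);  /* flip the l-th bit of myid */
    printf("level %d: partner of rank %d is rank %d\n", l, myid, partner);
  }
  return 0;
}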
ASSIGNMENT
Complete the following program by implementing the function global_sum, using the
MPI_Send and MPI_Recv functions and the hypercube template shown above.
Submit the source code as well as the printouts from test runs on 4 processors and on 8
processors.
#include "mpi.h"
#include <stdio.h>

int nprocs;  /* Number of processes */
int myid;    /* My rank */

double global_sum(double partial) {
  /* Implement your own global summation here */
}

int main(int argc, char *argv[]) {
  double partial, sum, avg;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  partial = (double) myid;
  printf("Node %d has %le\n", myid, partial);
  sum = global_sum(partial);
  if (myid == 0) {
    avg = sum/nprocs;
    printf("Global average = %le\n", avg);
  }
  MPI_Finalize();
  return 0;
}
References
1. Slides 20-24 in https://cacs.usc.edu/education/cs596/MPI-VG.pdf.
2. https://en.wikipedia.org/wiki/Hypercube_(communication_pattern).
3. I. Foster, Designing and Building Parallel Programs (Addison-Wesley, 1995) Chap. 11—
Hypercube algorithms https://www.mcs.anl.gov/~itf/dbpp/text/node123.html.
CSCI596 Assignment 3—Parallel Computation of π and Scalability Analysis
The purpose of this assignment is to acquire hands-on experience in the scalability analysis of a
parallel program — one of the key skills you learn in this class. We use a simple application that
utilizes the function you have written for assignment 2, where the purpose was to:
(i) Convince ourselves that MPI_Send() and MPI_Recv() are sufficient to build any parallel
program, using global reduction as a concrete example.
(ii) Perform a unit software test of the global_sum() function used in this assignment.
Part I: Programming
Write a message passing interface (MPI) program, global_pi.c, to compute the value of π
based on the lecture note on “Parallel Computation of Pi” and using the global_sum() function
you have implemented and unit-tested in assignment 2. Please also utilize the serial program
pi.c (which computes the value of π) in the assignment 3 package.
(Assignment)
1. Submit the source code of global_pi.c.
(Note)
• Insert calls to the MPI_Wtime() function (which takes no arguments and returns the wall-clock time in
seconds as a double) to measure the running time of the program, as sketched below.
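A minimal, self-contained timing sketch (the loop below is only a placeholder for the actual quadrature computation in global_pi.c):

/* Minimal sketch of timing a code section with MPI_Wtime(). */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int myid;
  double t_start, t_end, s = 0.0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  t_start = MPI_Wtime();  /* wall-clock time in seconds */
  for (long i = 0; i < 100000000; i++) s += 1.0e-8;  /* placeholder for the timed work */
  t_end = MPI_Wtime();
  if (myid == 0)
    printf("Running time = %le s (s = %le)\n", t_end - t_start, s);
  MPI_Finalize();
  return 0;
}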
Part II: Scalability
In this assignment, we measure the scalability of global_pi.c.
(Assignment)
2. (Fixed problem-size scaling) Run your global_pi.c with a fixed number of quadrature
points, NBIN = 10^9, while varying the number of compute nodes over 1, 2, and 4 with one
processor per node (i.e., the number of processors P = 1, 2, and 4). Plot the fixed problem-size
parallel efficiency (defined at the end of this assignment) as a function of P. Submit the plot.
3. (Isogranular scaling) In this scalability test, we consider a constant number of quadrature
points, NBIN/P = 10^9, per processor for P = 1, 2, and 4. To do this, we slightly modify
global_pi.c by defining
#define NPERP 1000000000 /* Number of quadrature points per processor */
long long NBIN;
and determining the total number of quadrature points as
NBIN = (long long)NPERP*nprocs;
Run the resulting program global_pi_iso.c, and plot the isogranular parallel efficiency as
a function of P. Submit the plot.
(Note)
• Please perform all the scaling tests in a single batch job to minimize measurement
fluctuations, using the Slurm script, global_pi.sl, in the assignment 3 package.
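For reference (assuming the standard definitions used in the scalability lecture notes), the two efficiencies plotted above can be computed from the measured running time T(P) on P processors as:
• Fixed problem-size (strong-scaling) parallel efficiency: E(P) = T(1)/(P × T(P)), with the total problem size NBIN = 10^9 held fixed.
• Isogranular (weak-scaling) parallel efficiency: E(P) = T(1)/T(P), where T(1) is the running time for the per-processor problem size NBIN/P = 10^9 on one processor.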
CSCI596 Assignment 4—Parallel Molecular Dynamics
The purpose of this assignment is to gain hands-on experience in practical use of message passing
interface (MPI) in real-world applications, thereby consolidating your understanding of
asynchronous message passing and communicators. In addition, you will get familiar with the
message-passing scheme used in common spatial-decomposition applications, using the parallel
molecular dynamics (MD) program, pmd.c, as an example.
(Part I—Asynchronous Messages)
Modify pmd.c such that, for each message exchange, it first calls MPI_Irecv, then MPI_Send, and
finally MPI_Wait. The asynchronous messages make the deadlock-avoidance scheme unnecessary,
and thus there is no need to use different orders of send and receive calls for even- and odd-parity
processes. Between MPI_Irecv and MPI_Wait, place not only the MPI_Send call but also other
computations that do not depend on the received message; a minimal sketch of the pattern is given below.
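Below is a minimal, self-contained two-process illustration of the MPI_Irecv/MPI_Send/MPI_Wait pattern (a toy example, not pmd.c itself):

/* Toy two-process sketch of the asynchronous-message pattern:
   post MPI_Irecv, then MPI_Send, do independent work, then MPI_Wait. */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int myid, i;
  double sendbuf, recvbuf = 0.0, work = 0.0;
  MPI_Request request;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  int partner = myid ^ 1;  /* run with exactly 2 processes: 0 <-> 1 */
  sendbuf = 100.0 + myid;
  MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, partner, 10, MPI_COMM_WORLD, &request);
  MPI_Send(&sendbuf, 1, MPI_DOUBLE, partner, 10, MPI_COMM_WORLD);
  for (i = 0; i < 1000; i++) work += 1.0e-3;  /* computation independent of the message */
  MPI_Wait(&request, &status);  /* recvbuf is safe to use only from here on */
  printf("Rank %d received %f (independent work = %f)\n", myid, recvbuf, work);
  MPI_Finalize();
  return 0;
}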
• Submit the modified source code, with your modifications clearly marked.
• Run both the original pmd.c and the modified program on 16 cores (requesting 4 nodes with 4
cores per node in your Slurm script), and compare the execution time for InitUcell = {3,3,3},
StepLimit = 1000, and StepAvg = 1001 in pmd.in (keep all the other parameter values as
they are in the version downloaded from the course home page), and vproc = {2,2,4} (i.e., nproc = 16) in
pmd.h. Which program runs faster? Repeat the comparison three times and report the average
runtime of both programs. Submit the timing data.
(Part II—Communicators)
Following the lecture note on “In situ analysis of molecular dynamics simulation data using
communicators,” modify pmd.c such that as many processes as are used for the MD simulation
are spawned to calculate the probability density function (PDF) of the atomic velocity; a minimal
communicator-splitting sketch is given below.
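A minimal sketch of splitting MPI_COMM_WORLD into an MD group and an analysis (PDF) group with MPI_Comm_split (the variable names are illustrative; the actual split should follow the lecture note):

/* Toy sketch of splitting MPI_COMM_WORLD into an MD half and a PDF-analysis half. */
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int world_rank, world_size, color, local_rank;
  MPI_Comm workers;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  color = (world_rank < world_size/2) ? 0 : 1;  /* 0 = MD group, 1 = PDF-analysis group */
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &workers);
  MPI_Comm_rank(workers, &local_rank);
  printf("World rank %d -> group %d, local rank %d\n", world_rank, color, local_rank);
  /* The MD group would run the simulation using "workers" as its communicator,
     while the analysis group would compute the velocity PDF. */
  MPI_Comm_free(&workers);
  MPI_Finalize();
  return 0;
}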
• Submit the modified source code (name it pmd_split.c), with your modifications clearly
marked.
• Run the modified program on 16 cores (requesting 2 nodes with 8 cores per node in your Slurm
script), of which 8 cores perform the MD simulation and the other 8 cores calculate the PDF. In
pmd.h, choose vproc[3] = {2,2,2} and nproc = 8. Also, specify InitUcell = {5,5,5},
StepLimit = 30, and StepAvg = 10 in pmd.in. Submit the plot of calculated PDFs at time
steps 10, 20, and 30.
CSCI596 Assignment 5 Hybrid MPI+OpenMP Parallel Molecular Dynamics
1. Write a hybrid MPI+OpenMP parallel molecular dynamics (MD) program (name it hmd.c),
starting from the MPI parallel MD program, pmd.c, following the lecture note on “hybrid
MPI+OpenMP parallel MD”. Submit the source code of hmd.c, with your modifications
from pmd.c clearly marked.
2. (Verification) Run your hmd.c on two 4-core nodes (in total of 8 cores) with 2 MPI
processes, each with 4 OpenMP threads, using the following input parameters: InitUcell =
{24,24,12}, Density = 0.8, InitTemp = 1.0, DeltaT = 0.005, StepLimit =
100, StepAvg = 10. Use the following number of MPI processes and that of OpenMP
threads, vproc = {1,1,2}, nproc = 2, vthrd = {2,2,1}, nthrd = 4, in the header
file. Note that the global number of atoms is 4 atoms/unit cell × (24×24×12 unit cells) × 2 MPI
processes = 55,296. Submit the standard output from the run. Make sure that the total
energy is the same as that calculated by pmd.c using the same input parameters (shown
below) to at least ~5-6 digits.
al = 4.103942e+01 4.103942e+01 2.051971e+01
lc = 16 16 8
rc = 2.564964e+00 2.564964e+00 2.564964e+00
nglob = 55296
0.050000 0.877345 -5.137153 -3.821136
0.100000 0.462056 -4.513097 -3.820013
0.150000 0.510836 -4.587287 -3.821033
0.200000 0.527457 -4.611958 -3.820772
0.250000 0.518668 -4.598798 -3.820796
0.300000 0.529023 -4.614343 -3.820808
0.350000 0.532890 -4.620133 -3.820798
0.400000 0.536070 -4.624899 -3.820794
0.450000 0.539725 -4.630387 -3.820799
0.500000 0.538481 -4.628514 -3.820792
CPU & COMT = 3.836388e+00 2.632065e-02
3. (Scalability) Run your hmd.c on an 8-core node with one MPI process and the number of
threads varying from 1, 2, 4, to 8, with input parameters: InitUcell = {24,24,24},
Density = 0.8, InitTemp = 1.0, DeltaT = 0.005, StepLimit = 100, StepAvg =
101. Plot the strong-scaling parallel efficiency as a function of the number of threads and
submit the plot.
(Potential Final Project)
Optimize the performance of the hybrid MPI+OpenMP MD code. For example, we could
enclose the entire MD loop in a single OpenMP parallel region in the main function to avoid excessive fork-join overhead. We could also use a lock variable for synchronization; a toy sketch of these two ideas is given below.
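A toy sketch of these two ideas (a single parallel region enclosing a stepping loop, plus an OpenMP lock protecting a shared accumulator; this is only an illustration, not hmd.c):

/* Toy sketch: one parallel region around the whole stepping loop, with an
   OpenMP lock serializing updates of a shared accumulator (not hmd.c itself). */
#include <omp.h>
#include <stdio.h>

int main(void) {
  const int nsteps = 10;
  double total = 0.0;
  omp_lock_t lock;
  omp_init_lock(&lock);
  #pragma omp parallel
  {
    int step;
    for (step = 0; step < nsteps; step++) {
      double mywork = omp_get_thread_num() + 0.1*step;  /* placeholder per-thread work */
      omp_set_lock(&lock);    /* protect the shared variable */
      total += mywork;
      omp_unset_lock(&lock);
      #pragma omp barrier     /* all threads finish this step before the next begins */
    }
  }
  omp_destroy_lock(&lock);
  printf("total = %f\n", total);
  return 0;
}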
CSCI596 Assignment 6—Hybrid MPI+OpenMP+CUDA Programming
Part I: Pair-Distribution Computation with CUDA
In this part, you will write a CUDA program (name it pdf.cu) to compute a histogram nhist of atomic pair
distances in molecular dynamics simulation:
for all histogram bins i
  nhist[i] = 0
for all atomic pairs (i,j)
  ++nhist[⌊|r_ij|/Δr⌋]
Here, |r_ij| is the distance between the atomic pair (i, j), and Δr is the histogram bin size. The maximum atomic-pair
distance under the periodic boundary condition is half the diagonal of the simulation box,
R_max = sqrt( (al[x]/2)^2 + (al[y]/2)^2 + (al[z]/2)^2 ),
and with Nhbin bins for the histogram, the bin size is Δr = R_max/Nhbin. Here, al[α] is the simulation box size in the
α-th direction. With the minimal-image convention, however, the maximum distance for which the histogram is
meaningful is half the simulation box length, min_{α=x,y,z}(al[α]/2).
After computing the pair-distance histogram nhist, the pair distribution function (PDF) at distance r_i =
(i+1/2)Δr is defined as
g(r_i) = nhist[i] / (2π r_i^2 Δr ρ N),
where ρ is the number density of atoms and N is the total number of atoms.
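The following serial C sketch (with made-up box size, density, and pair distances; it is not pdf0.c) illustrates the binning and normalization defined above:

/* Serial sketch of the histogram binning and g(r) normalization defined above,
   using made-up inputs (this is not pdf0.c). */
#include <stdio.h>
#include <math.h>

#define NHBIN 2000

int main(void) {
  const double pi = 3.141592653589793;
  double al[3] = {20.0, 20.0, 20.0};  /* hypothetical box lengths      */
  double rho = 0.8;                   /* hypothetical number density   */
  int natoms = 100;                   /* hypothetical number of atoms  */
  double nhist[NHBIN] = {0.0};
  double rmax = 0.5*sqrt(al[0]*al[0] + al[1]*al[1] + al[2]*al[2]);  /* half the box diagonal */
  double dr = rmax/NHBIN;             /* histogram bin size */
  /* In the real program, rij would be the minimum-image distance of every atomic
     pair; here a few made-up distances show the index arithmetic. */
  double rij[3] = {1.05, 2.50, 7.77};
  int i, k;
  for (k = 0; k < 3; k++)
    nhist[(int)(rij[k]/dr)] += 1.0;
  for (i = 0; i < NHBIN; i++) {       /* normalize to g(r_i), r_i = (i+1/2)*dr */
    double ri = (i + 0.5)*dr;
    double gr = nhist[i]/(2.0*pi*ri*ri*dr*rho*natoms);
    if (nhist[i] > 0.0) printf("r = %f  g(r) = %f\n", ri, gr);
  }
  return 0;
}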
(Assignment)
1. Modify the sequential PDF computation program pdf0.c to a CUDA program, following the lecture note on
“Pair distribution computation on GPU”. Submit your code.
2. Run your program by reading the atomic configuration pos.d (both pdf0.c and pos.d are available at the
class homepage). Plot the resulting pair distribution function, using Nhbin = 2000. Submit your plot.
Part II: MPI+OpenMP+CUDA Computation of p
In this part, you will write a triple-decker MPI+OpenMP+CUDA program (name it pi3.cu) to compute the
value of π, by modifying the double-decker MPI+CUDA program, hypi_setdevice.cu, described in the
lecture note on “Hybrid MPI+OpenMP+CUDA Programming”.
Your implementation should utilize two CPU cores and two GPU devices on each compute node. This is
achieved by launching one MPI rank per node, where each rank spawns two OpenMP threads that run on different
CPU cores and use different GPU devices as shown in the left figure on the next page. You can employ spatial
decomposition in the MPI+OpenMP layer as follows (for the CUDA layer, leave the interleaved assignment of
quadrature points to CUDA threads in hypi_setdevice.cu as it is); see the right figure on the next page.
#define NUM_DEVICE 2 // # of GPU devices = # of OpenMP threads
…
// In main()
MPI_Comm_rank(MPI_COMM_WORLD,&myid); // My MPI rank
MPI_Comm_size(MPI_COMM_WORLD,&nproc); // # of MPI processes
omp_set_num_threads(NUM_DEVICE); // One OpenMP thread per GPU device
nbin = NBIN/(nproc*NUM_DEVICE); // # of bins per OpenMP thread
step = 1.0/(float)(nbin*nproc*NUM_DEVICE);
…
#pragma omp parallel private(list the variables that need private copies)
{
…
mpid = omp_get_thread_num();
offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset
cudaSetDevice(mpid%2);
…
}
Make sure to list all variables that need private copies in the private clause for the omp parallel directive.
The above OpenMP multithreading will introduce a race condition on the variable pi. This can be circumvented
by data privatization, i.e., by defining float pid[NUM_DEVICE] and using the array elements as dedicated
accumulators for the OpenMP threads (or GPU devices); see the sketch below.
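A standalone toy sketch of this data privatization (the per-thread GPU work is replaced by a placeholder value):

/* Toy sketch of one dedicated accumulator per OpenMP thread, combined after
   the parallel region (a standalone illustration, not pi3.cu). */
#include <omp.h>
#include <stdio.h>

#define NUM_DEVICE 2

int main(void) {
  float pid[NUM_DEVICE] = {0.0f};  /* one private accumulator per thread */
  float pi = 0.0f;
  int mpid, t;
  omp_set_num_threads(NUM_DEVICE);
  #pragma omp parallel private(mpid)
  {
    mpid = omp_get_thread_num();
    /* In pi3.cu this would be the partial sum returned from the GPU assigned
       to this thread; here it is just a placeholder value. */
    pid[mpid] = 1.0f + mpid;
  }
  for (t = 0; t < NUM_DEVICE; t++)  /* combine the per-thread sums; no race condition */
    pi += pid[t];
  printf("pi (toy value) = %f\n", pi);
  return 0;
}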
To report which of the two GPUs has been used for the run, insert the following lines within the OpenMP
parallel block:
cudaGetDevice(&dev_used);
printf("myid = %d; mpid = %d: device used = %d; partial pi = %f\n",
       myid,mpid,dev_used,pid[mpid]);
where int dev_used is the ID of the GPU device (0 or 1) that was used, myid is the MPI rank, mpid is the
OpenMP thread ID, and pid[mpid] is the partial sum accumulated by that OpenMP thread.
(Assignment)
1. Submit your MPI+OpenMP+CUDA code.
2. Run your code on 2 nodes, requesting 2 cores and 2 GPUs per node.
Submit your output, which should look like the following:
myid = 0; mpid = 0: device used = 0; partial pi = 0.979926
myid = 0; mpid = 1: device used = 1; partial pi = 0.874671
myid = 1; mpid = 0: device used = 0; partial pi = 0.719409
myid = 1; mpid = 1: device used = 1; partial pi = 0.567582
PI = 3.141588
CSCI596 Assignment 7—OpenMP Target Offload and DPC++ Programming
The purpose of this assignment is to gain hands-on experience in new open standards for
programming heterogeneous computers accelerated by graphics processing units (GPUs) and other
accelerators. Specifically, you will practice directive-based OpenMP target offload and the unified
data-parallel programming language DPC++.
Prerequisite
We will practice both OpenMP target and DPC++ on the Intel developer cloud (DevCloud). To
do so, create your DevCloud account by registering at https://devcloud.intel.com/oneapi.
Part I: OpenMP Target Offload Computation of π
In this part, you will write a GPU offload program (name it omp_teams_pi.c) to compute the value
of π using the omp target, teams, and distribute constructs.
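As a generic illustration of the target, teams, and distribute constructs (a toy array sum offloaded to the device, assuming an offload-capable OpenMP compiler; this is not the π program itself):

/* Toy illustration of omp target + teams + distribute with a reduction
   (a generic array sum, not omp_teams_pi.c). */
#include <stdio.h>

#define N 1000000

int main(void) {
  static float a[N];
  float sum = 0.0f;
  int i;
  for (i = 0; i < N; i++) a[i] = 1.0e-6f;
  #pragma omp target teams distribute parallel for map(to:a) map(tofrom:sum) reduction(+:sum)
  for (i = 0; i < N; i++)
    sum += a[i];  /* iterations are distributed over teams of device threads */
  printf("sum = %f\n", sum);
  return 0;
}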
(Assignment)
1. Modify the simple OpenMP target program omp_target_pi.c to its teams-distribute counterpart
omp_teams_pi.c, following the lecture note on “OpenMP Target Offload for Heterogeneous
Architectures”. Submit your code.
2. Compile and run your program on a GPU-accelerated computing node on DevCloud. Submit your
output, which should look like the following:
u49162@login-2:~$ cc -o omp_teams_pi omp_teams_pi.c -fopenmp
u49162@login-2:~$ qsub -I -l nodes=1:gpu:ppn=2
qsub: waiting for job 714173.v-qsvr-1.aidevcloud to start
qsub: job 714173.v-qsvr-1.aidevcloud ready
u49162@s001-n181:~$ ./omp_teams_pi
PI = 3.141593
Part II: DPC++ Computation of π
In this part, you will experience the compilation and running processes for a DPC++ program (pi.cpp)
to compute the value of π. While programming is not required for this part, since C++ is not a prerequisite for
this class, please use this opportunity to learn the essence of C++ and DPC++ programming by going
through the code and understanding why it works, following the lecture note on “Data Parallel C++ (DPC++)
for Heterogeneous Architectures”.
(Assignment)
1. Compile and run pi.cpp on a GPU-accelerated computing node on DevCloud. Submit your
output, which should look like the following:
u49162@login-2:~$ dpcpp -o pi pi.cpp
u49162@login-2:~$ qsub -I -l nodes=1:gpu:ppn=2
qsub: waiting for job 714154.v-qsvr-1.aidevcloud to start
qsub: job 714154.v-qsvr-1.aidevcloud ready
u49162@s001-n160:~$ ./pi
Running on: Intel(R) Gen9 HD Graphics NEO
Pi = 3.14159


