COMP4300: Laboratory 2
COMP4300/6430 2011: Laboratory 3
Synchronization, Performance Analysis, Performance Challenge
This lab will again use the NCI Xe system. You should all have
accounts on this system by now. If not you should email
comp4300@cs.anu.edu.au well or if you are in the lab you will need to
talk with the tutor.
A reminder that comprehensive documentation for the NCI Xe system available
HERE.
A tar file containing all the programs for this lab is available
HERE. Save this tar file on your local
desktop and then transfer it to the Xe system as per last week.
Part1: Synchronization
Program
heat.c sets up and solves a 2-D heat diffusion
problem. This involves a metal plate of size R
x by
R
y for which the temperature at the edge is held at some
constant value (T
edge). The aim is to determine the
temperature in the middle of the plate. This is done iteratively by
dividing the domain of the metal plate into a grid of points. The
new temperature at each grid point is calculated as the average of the
four current temperatures at the adjacent grid points. The iterations
continue until the temperatures of all grid points do not change
between iterations to within some threshold. A diagramatic
illustration of the problem is shown below:
The code requests as input
Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Tedge - the fixed temperature at the boundary
Max_iter - Maximum number of iterations permitted
Converge - the convergence criteria
Except at the edges, the grid is initially set to zero.
The code has been parallelised using MPI. It solves the problem twice,
the first time sequentially, then the second time in parallel. We time
both, and as a sanity check the results from the sequential code are compared
with the parallel results. You should only change the parallel part of
the code.
For problem sizes with Nx and Ny less than 20 and when using a
single process the resulting temperatures are printed. This is really
just for interest.
Run the code
mpirun -np 1 heat
with input
10 10
100.0
100
0.1
just to make sure it works.
NOTE: The code only solves the sequential and parallel problems once and
records the time. We are on a shared machine and it is possible that
any one timing measurement will be distorted by other users. So if
a timing looks strange run it a few times and take the shortest elapsed
time.
- With Nx = 15 and Ny = 19 and Tedge=100.0 how many iterations does
it take before grid point (7,5) is non-zero (assume grid points
begin at 0,0)? Generally for
Nx=Ny=(2n+1) how many iterations does it take for point (n,n) to
be none zero? Explain the rational behind your answer.
For our purpose we can happily restrict the number of iterations and
not worry whether or not the calculation converges (ie we don't care
about the result only the performance of the code!).
- Using Nx = Ny = 2000, and setting Max_iter=10 gather performance
information necessary to complete the following table (remember it
is probably best to run these performance tests using the batch
queue)
------------------------------------------------------------------
Number of processors
Time 1 2 4 8
------------------------------------------------------------------
Sequential - - -
Parallel
Speedup
------------------------------------------------------------------
(NB sequential time should be the same regardless of No of CPUs)
What is the parallel speedup on 2, 4 and 8 cpus. Comment on
the performance.
- The code has been parallelised by dividing the domain up into
regions. Explain how this decomposition is done.
You should have completed up to here within 30 minutes
- The exchange of data (i.e. communication) between the parallel
regions is far from optimal - can
you explain why? Implement an improved communication scheme, but
still using blocking sends and receives. Gather new performance
data to see if this is faster.
------------------------------------------------------------------
Number of processors
Time 1 2 4 8
------------------------------------------------------------------
Sequential - - -
Parallel
Speedup
------------------------------------------------------------------
You should have completed up to here within 45 minutes
- Change the communications to use non-blocking sends and receives
(MPI_Isend, MPI_Irecv, MPI_Wait). Recompute the performance
numbers of Q2.
------------------------------------------------------------------
Non Blocking Number of processors
Time 1 2 4 8
------------------------------------------------------------------
Sequential - - -
Parallel
Speedup
------------------------------------------------------------------
You should have completed part 1 within 1 hour
Part2: Performance Analysis
The first step to improving a program's performance is finding out
where most of the time is spent. To do this we profile the code.
There are two broad approaches to this
Computation Analysis
To analyze the computational breakdown we can use either hardware or
software measurements.
- Hardware performance counters
- Cray started this with something called the hardware performance monitor.
This was an easy means by which users could get MFlop rates at the end of
their calculations. The downside to this was the fact that it lead to users
quoting "machoflops" - meaning that in some cases it is possible to have very
good MFlop ratings, but the time to solution is actually longer than
that for an alternative algorithm that does less flops.
- Hardware performance counters in general have very low runtime
overhead, but require Si real estate that was used to be expensive.
However, with ever smaller etching chip space is no longer a major
issue and we are now seeing hardware performance counters on virtually
all processors, eg see
PAPI
- In this lab, however, we will not be using hardware counters.
- Software instrumentation of the program
- Add timing points around basic blocks, and statistically sample the
program counter to build a profile
- Low HW cost, higher runtime overhead
- Unix (and Linux) prof and gprof are well-known examples, other
vendors may offer their own tools, such as histx from SGI.
We can obtain a breakdown of time spent at the basic block level
by adding recompiling the code, running it and postprocessing
the output files as follows.
mpicc -fprofile-arcs -ftest-coverage -o heathpc.exe heat.c
mpirun -np 1 heathpc.exe < INPUT
gcov heat.c
This will give you a file "heat.c.gcov". If you look in here you
will see output of the form:
10: 80: iter++;
10: 81: jst = rank*chk+1;
10: 82: jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
19990: 83: for (j = jst; j < jfin; j++){
39940020: 84: for (i = 1; i < Nx-1; i++)
39920040: 85: tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
-: 86: told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
-: 87: }
where the large numbers on the left indicate the number of times that
particular line of code has been executed (# indicates the line
was not executed).
Profiling MPI Jobs
There are a variety of tools available for analysis MPI programs. Go
to the NCI
Performance Analysis
Tools web page.
- Run the ipm profiler for heat.exe with the 2000 by 2000 grid
size and inspect both the text and graphical output. What MPI
communication routine takes the most time? Does the plot of the
communication pattern conform with what you expect?
- Now run the mpiP profiler, but first recompile your code with
"-g" so that the MPI callsites are mapped to the source code. How does
this compare with ipm?
The iterative part heat.c is given below:
do {
iter++;
jst = rank*chk+1;
jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
for (j = jst; j < jfin; j++){
for (i = 1; i < Nx-1; i++)
tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
}
//Send to rank+1
if (rank+1 < size){
jst = rank*chk+chk;
MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1, 2,
MPI_COMM_WORLD);
}
if (rank-1 >= 0){
jst = (rank-1)*chk+chk;
MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE,rank-1,
2, MPI_COMM_WORLD, &status);
}
//Send to rank-1
if (rank-1 >= 0){
jst = rank*chk+1;
MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank-1, 1,
MPI_COMM_WORLD);
}
if (rank+1 < size){
jst= (rank+1)*chk+1;
MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1,
1, MPI_COMM_WORLD, &status);
}
// fix boundaries in tnew
j=0; for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
j=Ny-1; for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
i=0; for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
i=Nx-1; for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
jst = rank*chk+1;
lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
for (j = jst; j < jfin; j++){
for (i = 1; i < Nx-1; i++){
tdiff = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
}
}
for (i = 0; i < Nx*Ny; i++)told[i]=tnew[i];
MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE,
MPI_MAX, MPI_COMM_WORLD);
if (!rank)printf(" iteration %d convergence %lf\n",iter,mxdiff);
}while ( mxdiff > converge && iter < Max_iter || iter == 1);
Any parallel program can be divided into the following categories:
- Parallel work: this is work that is efficiently
divided between all participating processes. The sum of the
parallel work across all processes is roughly constant regardless
of the number of processes used.
- Overhead: this is extra work that is only present because of the
parallelisation. For example communication to send data from
one process to another.
- Replicated/sequential work: this is work that is either
replicated on all
processes or is done by one process while the other processes
wait. The sum of the replicated work across all processes
increases as the number of parallel processes increases.
- Clearly annotate the iterative part of heat.c (ie the code
shown above) to indicate whether a given line/section of code is
parallel, overhead or replicated work.
Use coverage analysis with 1, 2 and 4 processes to verify your
conclusions from above. Try both a
- small problem 400x400 grid, Tedge of 100, 10 iterations
and threshold of 0.1
- Large problem 4000x4000 grid, Tedge of 100, 10 iterations
and threshold of 0.1
When running coverage analysis for multiple mpi processes, you will
obtain counts summed over all processes. You should be looking to
see what happens to the count values as you increase the process count.
Specifically you might expect to see the
count value for the
parallel work stay (roughly) constant, but the
count
value associated with replicated work double as you double the number
of processes. As a corollary to
this the % of the total time spent executing the replicated lines
will increase.
- Cut out the relevant portions of the coverage profiles for one
process when running an mpi parallel job using 1, 2
and 4 processors and for the LARGE problem. From this data and
your answer to Q6 demonstrate that what is stated in the
above paragraph is correct. What are the performance bottlenecks
in this code?
Run the
heat_log (Trace Analyzer) executable on 4 processes for
the large problem case. Play around with Jumpshot to see what the
various options do.
- From the above trace what can you say about the
performance characteristics of the program? The sort of thing you
should comment on is load imbalance, compute/communication times,
relative time for the section before the send/recv to that before
the reduction, the communication times as a fraction of peak etc.
You should have completed parts 1 and 2 within 2 hours
Part3: Tuning Challenge
- Your task is to produce a version of heat.c that runs as
fast as possible on a single processor, and scales as
best as possible on multiple processors. Your code must be
functionally correct. Post your times on the discussion
board. I've included the executable for my version in the tar
file as goodheat.exe. Nima (the tutor) will give five dollars to
the first person in each lab class to produce a version of heat that
performs better than my version!