This lab will again use the NCI XE system. You should have an account on this system by now; if not, email comp4300@cs.anu.edu.au, or if you are in the lab, talk to the tutor.
A reminder that comprehensive documentation for the NCI XE system is available HERE.
From the student systems, you can mount your home directory on XE using the following menu option:
Places -> Connect to Server...
Server = xe.nci.org.au
Type = SSH
Folder = /home/444/<username>
This will allow you to use the text editor of your choice (and other tools installed on the student lab machines).
A zip file containing all the programs for this lab is available HERE. Save this zip file on your local desktop and then transfer it to the XE system as per last week.
For this lab, we will use gcc rather than the default compiler on XE (which is the Intel C compiler icc). You will need to replace the compiler module on XE as follows:
module unload intel-cc; module load gcc
Program heat.c sets up and solves a 2-D heat diffusion problem.
This models a metal plate of size Nx by Ny for which the temperature at the edge is held at some constant value Tedge. The aim is to determine the temperature in the middle of the plate. This is done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the current temperatures at the four adjacent grid points. Iteration continues until the maximum change in temperature at any grid point is less than some threshold. A diagrammatic illustration of the problem is shown below:
[Figure: the Nx by Ny grid of points, with the edge points held at Tedge and each interior point updated from its four neighbours]
The code requests as input:
- the grid dimensions Nx and Ny
- the edge temperature (Tedge)
- the maximum number of iterations (Max_iter)
- the convergence threshold (converge)

Except at the edges, each grid point is initially set to zero. For problem sizes with Nx and Ny less than 20 the resulting temperatures are printed; this is really just for interest.
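To make the update scheme concrete, here is a rough serial sketch of the iteration described above. The names (Nx, Ny, Tedge, converge, Max_iter) mirror those used in heat.c, but this is an illustrative outline only; it omits input handling and the MPI decomposition and is not the actual code.

#include <math.h>

/* Illustrative serial sketch of the heat.c update scheme (not the real code). */
void solve(double *told, double *tnew, int Nx, int Ny,
           double Tedge, double converge, int Max_iter)
{
    int i, j, iter = 0;
    double mxdiff;

    /* interior starts at zero, edges held at Tedge */
    for (i = 0; i < Nx*Ny; i++) told[i] = 0.0;
    for (i = 0; i < Nx; i++) { told[i] = Tedge; told[(Ny-1)*Nx+i] = Tedge; }
    for (j = 0; j < Ny; j++) { told[j*Nx] = Tedge; told[j*Nx+Nx-1] = Tedge; }

    do {
        iter++;
        mxdiff = 0.0;
        /* new value at each interior point = average of its four neighbours */
        for (j = 1; j < Ny-1; j++) {
            for (i = 1; i < Nx-1; i++) {
                tnew[j*Nx+i] = 0.25*(told[j*Nx+i+1] + told[j*Nx+i-1] +
                                     told[(j+1)*Nx+i] + told[(j-1)*Nx+i]);
                double tdiff = fabs(told[j*Nx+i] - tnew[j*Nx+i]);
                if (tdiff > mxdiff) mxdiff = tdiff;
            }
        }
        /* copy the updated interior back for the next iteration */
        for (j = 1; j < Ny-1; j++)
            for (i = 1; i < Nx-1; i++) told[j*Nx+i] = tnew[j*Nx+i];
    } while (mxdiff > converge && iter < Max_iter);
}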
The code has been parallelised using MPI.
Run the code mpirun -np 1 heat with input:
10 10
100.0
100
0.1
just to make sure it works.
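If it is working, rank 0 should print a line of the form " iteration 1 convergence ..." for each iteration; this comes from the printf near the end of the iteration loop, shown later in this sheet.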
NOTE:
We are on a shared machine and it is possible that any one timing measurement will be distorted by other users. So if a measurement looks strange, run it a few times and take the shortest elapsed time.
For our purposes we can happily restrict the number of iterations and not worry whether or not the calculation converges, i.e. we don't care about the result, only the performance of the code!
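If you need to add your own elapsed-time measurement for the tables below, one common pattern is to bracket the region of interest with MPI_Wtime. The following is a minimal sketch of that pattern only; heat.c may already do something equivalent.

#include <stdio.h>
#include <mpi.h>

/* Sketch: measuring elapsed time around a region with MPI_Wtime. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t_start = MPI_Wtime();
    /* ... the iterative solve would go here ... */
    double t_end = MPI_Wtime();

    if (!rank)
        printf("elapsed time %f s on %d processes\n", t_end - t_start, size);
    MPI_Finalize();
    return 0;
}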
| Number of processors | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) |  |  |  |  |
| Speedup | 1.0 |  |  |  |
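In the speedup row, each entry is the single-process time divided by the time on p processes, S(p) = T(1)/T(p), which is why the 1-process entry is fixed at 1.0.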
You should have completed up to here within 30 minutes
| Blocking (number of processors) | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) |  |  |  |  |
| Speedup | 1.0 |  |  |  |
You should have completed up to here within 45 minutes
| Non-blocking (number of processors) | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) |  |  |  |  |
| Speedup | 1.0 |  |  |  |
You should have completed part 1 within 1 hour
The first step in improving a program's performance is finding out where most of the time is spent. To do this we profile the code. There are two broad approaches to analysing the computational breakdown: hardware measurements and software measurements.
Hardware performance counters
Cray started this with something called the hardware performance monitor. This was an easy means by which users could get MFlop rates at the end of their calculations. The downside was that it led to users quoting "machoflops", meaning that in some cases it is possible to have very good MFlop ratings while the time to solution is actually longer than that of an alternative algorithm that performs fewer flops.
Hardware performance counters in general have very low runtime overhead, but require chip real estate that used to be expensive. However, with ever smaller feature sizes, chip space is no longer a major issue and we are now seeing hardware performance counters on virtually all processors, e.g. see: PAPI.
In this lab, however, we will not be using hardware counters.
Software instrumentation of the program
Add timing points around basic blocks, or statistically sample the program counter, to build a profile.
This has low hardware cost but higher runtime overhead.
The Unix (and Linux) prof and gprof tools are well-known examples; other vendors offer their own tools, such as histx from SGI.
We can obtain a rough estimate of the time spent at the basic-block level using a coverage tool like gcov. To do this, recompile the code with additional compiler options, run it, and post-process the output files as follows:
$ make heat_cov
mpicc -fprofile-arcs -ftest-coverage -o heat_cov heat.c
mpirun -np 1 ./heat_cov < heat.input
gcov heat.c
This will give you a file heat.c.gcov. If you look in this file you will see output of the form:
10: 91: iter++;
-: 92:
-: 93: // update local values
10: 94: jst = rank*chk+1;
10: 95: jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
49990: 96: for (j = jst; j < jfin; j++) {
249850020: 97: for (i = 1; i < Nx-1; i++) {
249800040: 98: tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
-: 99: told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
-: 100: }
-: 101: }
where the large numbers on the left indicate the number of times each line of code was executed (a line marked with ##### was never executed, and a - means the line contains no executable code).
There are a variety of tools available for analysing MPI programs. Go to the NCI Performance Analysis Tools web page.
To profile the MPI calls with the mpiP library, relink the code with:

mpicc -g -o heat heat.c -lmpiP -lm -lbfd -liberty -lunwind

(unload the ipm module and load the mpiP module first). The iterative part of heat.c is given below:
do {
    iter++;

    // update local values
    jst  = rank*chk+1;
    jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    for (j = jst; j < jfin; j++) {
        for (i = 1; i < Nx-1; i++) {
            tnew[j*Nx+i] = 0.25*(told[j*Nx+i+1] + told[j*Nx+i-1] +
                                 told[(j+1)*Nx+i] + told[(j-1)*Nx+i]);
        }
    }

    // Send to rank+1
    if (rank+1 < size) {
        jst = rank*chk+chk;
        MPI_Send(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
                 2, MPI_COMM_WORLD);
    }
    if (rank-1 >= 0) {
        jst = (rank-1)*chk+chk;
        MPI_Recv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
                 2, MPI_COMM_WORLD, &status);
    }

    // Send to rank-1
    if (rank-1 >= 0) {
        jst = rank*chk+1;
        MPI_Send(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
                 1, MPI_COMM_WORLD);
    }
    if (rank+1 < size) {
        jst = (rank+1)*chk+1;
        MPI_Recv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
                 1, MPI_COMM_WORLD, &status);
    }

    // fix boundaries in tnew
    j = 0;    for (i = 0; i < Nx; i++) tnew[j*Nx+i] = Tedge;
    j = Ny-1; for (i = 0; i < Nx; i++) tnew[j*Nx+i] = Tedge;
    i = 0;    for (j = 0; j < Ny; j++) tnew[j*Nx+i] = Tedge;
    i = Nx-1; for (j = 0; j < Ny; j++) tnew[j*Nx+i] = Tedge;

    // find the maximum local change, then copy tnew into told
    jst     = rank*chk+1;
    lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
    jfin    = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    for (j = jst; j < jfin; j++) {
        for (i = 1; i < Nx-1; i++) {
            tdiff   = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
            lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
        }
    }
    for (i = 0; i < Nx*Ny; i++) told[i] = tnew[i];

    // global maximum change across all processes
    MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);
    if (!rank) printf(" iteration %d convergence %lf\n", iter, mxdiff);
} while ( mxdiff > converge && iter < Max_iter);
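The halo exchange above uses blocking MPI_Send and MPI_Recv. For comparison with the non-blocking measurements in part 1, the same data movement could be expressed with non-blocking calls roughly as follows. This is a sketch only; the variable names follow heat.c, and your own modified version may be structured differently.

// Sketch of a non-blocking halo exchange for the same data movement
// as the MPI_Send/MPI_Recv calls above.
MPI_Request reqs[4];
int nreq = 0;

if (rank+1 < size) {                 // exchange with rank+1
    jst = (rank+1)*chk+1;
    MPI_Irecv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1, 1,
              MPI_COMM_WORLD, &reqs[nreq++]);
    jst = rank*chk+chk;
    MPI_Isend(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1, 2,
              MPI_COMM_WORLD, &reqs[nreq++]);
}
if (rank-1 >= 0) {                   // exchange with rank-1
    jst = (rank-1)*chk+chk;
    MPI_Irecv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1, 2,
              MPI_COMM_WORLD, &reqs[nreq++]);
    jst = rank*chk+1;
    MPI_Isend(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1, 1,
              MPI_COMM_WORLD, &reqs[nreq++]);
}
// wait for all posted sends and receives before the halo values are used
MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);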
The work done by any parallel program can be divided into the following categories (a simple timing model combining them is given after the list):
Parallel work: this is work that is efficiently divided between all participating processes. The sum of the parallel work across all processes is roughly constant regardless of the number of processes used.
Overhead: this is extra work that is only present because of the parallelisation, for example the communication needed to send data from one process to another.
Replicated/sequential work: this is work that is either replicated on all processes or is done by one process while the other processes wait. The sum of the replicated work across all processes increases with the number of parallel processes.
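A rough way to combine these categories into a timing model (an approximation that ignores load imbalance) is

T(p) ≈ T_parallel / p + T_replicated + T_overhead(p)

Only the first term shrinks as you add processes, so the replicated and overhead terms eventually limit the achievable speedup T(1)/T(p).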
Use coverage analysis with 1, 2 and 4 processes to verify your conclusions from above. Try both of:
When running coverage analysis for multiple MPI processes, you will obtain counts summed over all processes. Look at what happens to the count values as you increase the process count: you might expect the count for the parallel work to stay (roughly) constant, while the count associated with replicated work doubles as you double the number of processes. As a corollary, the percentage of the total time spent executing the replicated lines will increase.
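For example, in heat.c the summed count for the tnew update line (line 98 in the gcov excerpt above) should stay roughly constant as you go from 1 to 2 to 4 processes, because the rows of the grid are divided between processes; by contrast, the four boundary-fixing loops near the end of the iteration run over the full edge on every process, so their summed counts should roughly double each time the process count doubles.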
Profile the heat application for 8 processes using hpctoolkit.
You should have completed parts 1 and 2 within 2 hours
An improved version of the code is provided as goodheat. I will give five dollars to the first person in each lab class to produce a version of heat that performs better than my version!