CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science
Printer Friendly Version of this Document

UniSAFE

Parallel Systems
COMP4300: Laboratory 2

COMP4300/6430 2011: Laboratory 3

Synchronization, Performance Analysis, Performance Challenge

This lab will again use the NCI Xe system. You should all have accounts on this system by now. If not you should email comp4300@cs.anu.edu.au well or if you are in the lab you will need to talk with the tutor.

A reminder that comprehensive documentation for the NCI Xe system available HERE.

A tar file containing all the programs for this lab is available HERE. Save this tar file on your local desktop and then transfer it to the Xe system as per last week.


Part1: Synchronization


Program heat.c sets up and solves a 2-D heat diffusion problem. This involves a metal plate of size Rx by Ry for which the temperature at the edge is held at some constant value (Tedge). The aim is to determine the temperature in the middle of the plate. This is done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the four current temperatures at the adjacent grid points. The iterations continue until the temperatures of all grid points do not change between iterations to within some threshold. A diagramatic illustration of the problem is shown below:

The code requests as input

Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Tedge - the fixed temperature at the boundary
Max_iter - Maximum number of iterations permitted
Converge - the convergence criteria
Except at the edges, the grid is initially set to zero. The code has been parallelised using MPI. It solves the problem twice, the first time sequentially, then the second time in parallel. We time both, and as a sanity check the results from the sequential code are compared with the parallel results. You should only change the parallel part of the code.

For problem sizes with Nx and Ny less than 20 and when using a single process the resulting temperatures are printed. This is really just for interest.

Run the code

	mpirun -np 1 heat
with input
10 10
100.0
100
0.1
just to make sure it works.

NOTE: The code only solves the sequential and parallel problems once and records the time. We are on a shared machine and it is possible that any one timing measurement will be distorted by other users. So if a timing looks strange run it a few times and take the shortest elapsed time.

  1. With Nx = 15 and Ny = 19 and Tedge=100.0 how many iterations does it take before grid point (7,5) is non-zero (assume grid points begin at 0,0)? Generally for Nx=Ny=(2n+1) how many iterations does it take for point (n,n) to be none zero? Explain the rational behind your answer.
For our purpose we can happily restrict the number of iterations and not worry whether or not the calculation converges (ie we don't care about the result only the performance of the code!).
  1. Using Nx = Ny = 2000, and setting Max_iter=10 gather performance information necessary to complete the following table (remember it is probably best to run these performance tests using the batch queue)
         ------------------------------------------------------------------
                          Number of processors
         Time          1        2        4        8
         ------------------------------------------------------------------
         Sequential             -        -        -
         Parallel
         Speedup
         ------------------------------------------------------------------
    
    (NB sequential time should be the same regardless of No of CPUs) What is the parallel speedup on 2, 4 and 8 cpus. Comment on the performance.
  2. The code has been parallelised by dividing the domain up into regions. Explain how this decomposition is done.

    You should have completed up to here within 30 minutes

  3. The exchange of data (i.e. communication) between the parallel regions is far from optimal - can you explain why? Implement an improved communication scheme, but still using blocking sends and receives. Gather new performance data to see if this is faster.
         ------------------------------------------------------------------
                          Number of processors
         Time          1        2        4        8
         ------------------------------------------------------------------
         Sequential             -        -        -
         Parallel
         Speedup
         ------------------------------------------------------------------
    

    You should have completed up to here within 45 minutes

  4. Change the communications to use non-blocking sends and receives (MPI_Isend, MPI_Irecv, MPI_Wait). Recompute the performance numbers of Q2.
         ------------------------------------------------------------------
         Non Blocking      Number of processors
         Time          1        2        4        8
         ------------------------------------------------------------------
         Sequential             -        -        -
         Parallel
         Speedup
         ------------------------------------------------------------------
    

You should have completed part 1 within 1 hour


Part2: Performance Analysis


The first step to improving a program's performance is finding out where most of the time is spent. To do this we profile the code. There are two broad approaches to this

Computation Analysis

To analyze the computational breakdown we can use either hardware or software measurements.
  • Hardware performance counters
    • Cray started this with something called the hardware performance monitor. This was an easy means by which users could get MFlop rates at the end of their calculations. The downside to this was the fact that it lead to users quoting "machoflops" - meaning that in some cases it is possible to have very good MFlop ratings, but the time to solution is actually longer than that for an alternative algorithm that does less flops.
    • Hardware performance counters in general have very low runtime overhead, but require Si real estate that was used to be expensive. However, with ever smaller etching chip space is no longer a major issue and we are now seeing hardware performance counters on virtually all processors, eg see PAPI
    • In this lab, however, we will not be using hardware counters.
  • Software instrumentation of the program
    • Add timing points around basic blocks, and statistically sample the program counter to build a profile
    • Low HW cost, higher runtime overhead
    • Unix (and Linux) prof and gprof are well-known examples, other vendors may offer their own tools, such as histx from SGI.
We can obtain a breakdown of time spent at the basic block level by adding recompiling the code, running it and postprocessing the output files as follows.
    mpicc -fprofile-arcs -ftest-coverage -o heathpc.exe heat.c
    mpirun -np 1 heathpc.exe < INPUT
    gcov heat.c
This will give you a file "heat.c.gcov". If you look in here you will see output of the form:
       10:   80:     iter++;
       10:   81:     jst = rank*chk+1;
       10:   82:     jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    19990:   83:       for (j = jst; j < jfin; j++){
 39940020:   84:         for (i = 1; i < Nx-1; i++)
 39920040:   85:           tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
        -:   86:                              told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
        -:   87:       }
where the large numbers on the left indicate the number of times that particular line of code has been executed (# indicates the line was not executed).

Profiling MPI Jobs

There are a variety of tools available for analysis MPI programs. Go to the NCI Performance Analysis Tools web page.
  1. Run the ipm profiler for heat.exe with the 2000 by 2000 grid size and inspect both the text and graphical output. What MPI communication routine takes the most time? Does the plot of the communication pattern conform with what you expect?
  2. Now run the mpiP profiler, but first recompile your code with "-g" so that the MPI callsites are mapped to the source code. How does this compare with ipm?
The iterative part heat.c is given below:
   do {
     iter++;
     jst = rank*chk+1;
     jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
       for (j = jst; j < jfin; j++){
         for (i = 1; i < Nx-1; i++)
           tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
                              told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
       }
   //Send to rank+1
         if (rank+1 < size){
           jst = rank*chk+chk;
           MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1, 2,
	   MPI_COMM_WORLD);
         }
         if (rank-1 >= 0){
           jst = (rank-1)*chk+chk;
           MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE,rank-1, 
                                      2, MPI_COMM_WORLD, &status);
         }
   //Send to rank-1
         if (rank-1 >= 0){
           jst = rank*chk+1;
           MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank-1, 1,
	   MPI_COMM_WORLD);
         }
         if (rank+1 < size){
           jst= (rank+1)*chk+1;
           MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1,
                                      1, MPI_COMM_WORLD, &status);
				       }
     // fix boundaries in tnew
     j=0;    for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
     j=Ny-1; for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
     i=0;    for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
     i=Nx-1; for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;

     jst = rank*chk+1;
     lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
     jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
     for (j = jst; j < jfin; j++){
       for (i = 1; i < Nx-1; i++){
         tdiff = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
         lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
       }
     }
     for (i = 0; i < Nx*Ny; i++)told[i]=tnew[i];

     MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE, 
                MPI_MAX, MPI_COMM_WORLD);
     if (!rank)printf(" iteration %d convergence %lf\n",iter,mxdiff);
   }while ( mxdiff > converge && iter < Max_iter || iter == 1);
Any parallel program can be divided into the following categories:
  • Parallel work: this is work that is efficiently divided between all participating processes. The sum of the parallel work across all processes is roughly constant regardless of the number of processes used.
  • Overhead: this is extra work that is only present because of the parallelisation. For example communication to send data from one process to another.
  • Replicated/sequential work: this is work that is either replicated on all processes or is done by one process while the other processes wait. The sum of the replicated work across all processes increases as the number of parallel processes increases.
  1. Clearly annotate the iterative part of heat.c (ie the code shown above) to indicate whether a given line/section of code is parallel, overhead or replicated work.
Use coverage analysis with 1, 2 and 4 processes to verify your conclusions from above. Try both a
  • small problem 400x400 grid, Tedge of 100, 10 iterations and threshold of 0.1
  • Large problem 4000x4000 grid, Tedge of 100, 10 iterations and threshold of 0.1
When running coverage analysis for multiple mpi processes, you will obtain counts summed over all processes. You should be looking to see what happens to the count values as you increase the process count. Specifically you might expect to see the count value for the parallel work stay (roughly) constant, but the count value associated with replicated work double as you double the number of processes. As a corollary to this the % of the total time spent executing the replicated lines will increase.

  1. Cut out the relevant portions of the coverage profiles for one process when running an mpi parallel job using 1, 2 and 4 processors and for the LARGE problem. From this data and your answer to Q6 demonstrate that what is stated in the above paragraph is correct. What are the performance bottlenecks in this code?
Run the heat_log (Trace Analyzer) executable on 4 processes for the large problem case. Play around with Jumpshot to see what the various options do.
  1. From the above trace what can you say about the performance characteristics of the program? The sort of thing you should comment on is load imbalance, compute/communication times, relative time for the section before the send/recv to that before the reduction, the communication times as a fraction of peak etc.

You should have completed parts 1 and 2 within 2 hours


Part3: Tuning Challenge


  1. Your task is to produce a version of heat.c that runs as fast as possible on a single processor, and scales as best as possible on multiple processors. Your code must be functionally correct. Post your times on the discussion board. I've included the executable for my version in the tar file as goodheat.exe. Nima (the tutor) will give five dollars to the first person in each lab class to produce a version of heat that performs better than my version!