COMP4300/6430 2013 - Laboratory 3

Synchronization, Performance Analysis and Tuning

This lab will again use the NCI XE system. You should have an account on this system by now; if not, email comp4300@cs.anu.edu.au, or if you are in the lab, talk to the tutor.

A reminder that comprehensive documentation for the NCI XE system is available HERE.

From the student systems, you can mount your home directory on XE using the following menu option:

    Places -> Connect to Server...
    Server = xe.nci.org.au
    Type = SSH
    Folder = /home/444/<username>

This will allow you to use the text editor of your choice (and other tools installed on the student lab machines).

A zip file containing all the programs for this lab is available HERE. Save this zip file on your local desktop and then transfer it to the XE system as per last week.

For this lab, we will use gcc rather than the default compiler on XE (which is the Intel C compiler icc). You will need to replace the compiler module on XE as follows:

module unload intel-cc; module load gcc

Part 1: Synchronization

Program heat.c sets up and solves a 2-D heat diffusion problem.

This models a metal plate of size Rx by Ry for which the temperature at the edge is held at some constant value Tedge. The aim is to determine the temperature in the middle of the plate. This is done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the current temperatures at the four adjacent grid points. Iteration continues until the maximum change in temperature for any grid point is less than some threshold. A diagrammatic illustration of the problem is shown below:

[Figure: illustration of Jacobi iteration for the heat problem. The Rx by Ry plate is covered by a grid of Nx by Ny points, with the temperature of the edge points fixed at Tedge; the new temperature for an interior grid point is the average of the current temperatures at its four neighbouring grid points.]
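To make the update rule concrete, here is a minimal serial sketch of a single Jacobi sweep over the interior points, using the same row-major layout as heat.c (point (i,j) is stored at index j*Nx+i). It is an illustration only, not the parallel code you will be working with:

    #include <math.h>

    /* One Jacobi sweep over the interior grid points (illustrative
     * sketch only, not the parallel code in heat.c).  told holds the
     * current temperatures, tnew receives the updated ones; the return
     * value is the largest change at any grid point.                   */
    double jacobi_sweep(int Nx, int Ny, const double *told, double *tnew)
    {
        int i, j;
        double diff, maxdiff = 0.0;
        for (j = 1; j < Ny - 1; j++) {
            for (i = 1; i < Nx - 1; i++) {
                /* new temperature = average of the four current neighbours */
                tnew[j*Nx+i] = 0.25*(told[j*Nx+i+1] + told[j*Nx+i-1] +
                                     told[(j+1)*Nx+i] + told[(j-1)*Nx+i]);
                diff = fabs(tnew[j*Nx+i] - told[j*Nx+i]);
                if (diff > maxdiff) maxdiff = diff;
            }
        }
        /* the caller repeats, swapping told and tnew, until this value
         * drops below converge or Max_iter sweeps have been performed  */
        return maxdiff;
    }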

The code requests as input:

    Nx         the number of data points along the x axis
    Ny         the number of data points along the y axis
    Tedge      the fixed temperature at the boundary
    Max_iter   the maximum number of iterations permitted
    converge   the convergence criterion

Except at the edges, each grid point is initially set to zero. For problem sizes with Nx and Ny less than 20 the resulting temperatures are printed. This is really just for interest.

The code has been parallelised using MPI.

Run the code with mpirun -np 1 heat and the following input:

10 10
100.0
100
0.1

just to make sure it works.

NOTE:

We are on a shared machine and it is possible that any one timing measurement will be distorted by other users. So if a measurement looks strange, run it a few times and take the shortest elapsed time.
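If you want to add timing of your own, the usual approach in an MPI code is to bracket the region of interest with MPI_Wtime, repeat the measurement and keep the minimum, in the spirit of the note above. A minimal sketch (illustrative only, not code taken from heat.c):

    #include <mpi.h>
    #include <stdio.h>

    /* Time a region several times and keep the shortest elapsed time
     * (illustrative sketch only, not code taken from heat.c).          */
    void time_region(void (*work)(void), int repeats)
    {
        int r, rank;
        double t0, t, best = 1.0e30;

        for (r = 0; r < repeats; r++) {
            MPI_Barrier(MPI_COMM_WORLD);    /* start all ranks together  */
            t0 = MPI_Wtime();
            work();                         /* the region being measured */
            MPI_Barrier(MPI_COMM_WORLD);    /* wait for the slowest rank */
            t = MPI_Wtime() - t0;
            if (t < best) best = t;         /* keep the shortest run     */
        }
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("best elapsed time: %.6f s\n", best);
    }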

  1. With Nx = 15, Ny = 19 and Tedge = 100.0, how many iterations does it take before grid point (7,5) becomes non-zero (assume grid points are numbered from (0,0))? In general, with Nx = Ny = 2n+1, how many iterations does it take for point (n,n) to become non-zero? Explain the rationale behind your answer.

For our purposes we can happily restrict the number of iterations and not worry about whether or not the calculation converges, i.e. we don't care about the result, only the performance of the code!

  2. Using Nx = Ny = 5000 and setting Max_iter = 10, gather the performance information necessary to complete the following table. Remember that it is best to run performance tests through the batch queue to avoid interference from other users.

        Number of processors      1       2       4       8
        Time (s)
        Speedup                  1.0

    What is the parallel speedup on 2, 4 and 8 CPUs? Comment on the performance.
  3. How is the domain decomposed? (In other words, how are the grid points divided between processes?)

You should have completed up to here within 30 minutes

  4. The exchange of data (i.e. communication) between the parallel processes is far from optimal - can you explain why? Implement an improved communication scheme, but still using blocking sends and receives. Gather new performance data to see if this is faster.

        Blocking
        Number of processors      1       2       4       8
        Time (s)
        Speedup                  1.0

You should have completed up to here within 45 minutes

  5. Change the communications to use non-blocking sends and receives (MPI_Isend, MPI_Irecv, MPI_Wait); a generic sketch of this communication pattern is shown after the table. Recompute the performance numbers of Q2.

        Non-blocking
        Number of processors      1       2       4       8
        Time (s)
        Speedup                  1.0
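The following is a generic sketch of the non-blocking neighbour-exchange pattern, with hypothetical buffer names; it shows the MPI_Isend/MPI_Irecv/wait structure only and is not the heat.c solution:

    #include <mpi.h>

    /* Generic non-blocking exchange with the ranks above and below
     * (illustrative only; the buffer names are hypothetical and this is
     * not the heat.c solution).                                          */
    void exchange_halo(double *recv_lo, double *send_lo,
                       double *recv_hi, double *send_hi,
                       int n, int rank, int size)
    {
        MPI_Request req[4];
        int nreq = 0;

        if (rank > 0) {             /* a neighbour below exists */
            MPI_Irecv(recv_lo, n, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &req[nreq++]);
            MPI_Isend(send_lo, n, MPI_DOUBLE, rank-1, 1, MPI_COMM_WORLD, &req[nreq++]);
        }
        if (rank < size-1) {        /* a neighbour above exists */
            MPI_Irecv(recv_hi, n, MPI_DOUBLE, rank+1, 1, MPI_COMM_WORLD, &req[nreq++]);
            MPI_Isend(send_hi, n, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &req[nreq++]);
        }
        /* calling MPI_Wait on each request individually also works */
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    }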

You should have completed part 1 within 1 hour

Part 2: Performance Analysis

The first step to improving a program's performance is finding out where most of the time is spent. To do this we profile the code. There are two broad approaches, covered in turn below: analysing the computation performed within each process, and analysing the communication between processes.

Computation Analysis

To analyze the computational breakdown we can use either hardware or software measurements.

  1. Hardware performance counters

  2. Software instrumentation of the program
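As a simple example of software instrumentation, you can accumulate by hand the time spent in each section you care about. A minimal sketch (the section names are made up for illustration; heat.c is not instrumented this way):

    #include <mpi.h>
    #include <stdio.h>

    /* Hand instrumentation: accumulate the seconds spent in each section
     * of interest (illustrative sketch only; the section names are made
     * up and this code is not part of heat.c).                           */
    enum { SEC_COMPUTE, SEC_COMM, NSEC };
    static double sec_time[NSEC];

    static double tic(void)               { return MPI_Wtime(); }
    static void   toc(int sec, double t0) { sec_time[sec] += MPI_Wtime() - t0; }

    /* Typical use inside the iteration loop:
     *     t = tic();  ... computation ...            toc(SEC_COMPUTE, t);
     *     t = tic();  ... MPI sends and receives ... toc(SEC_COMM, t);
     * then print the totals from rank 0 after the loop:                  */
    void report_sections(int rank)
    {
        if (rank == 0)
            printf("compute %.3f s   communication %.3f s\n",
                   sec_time[SEC_COMPUTE], sec_time[SEC_COMM]);
    }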

We can obtain a rough estimate of the time spent at the basic block level using a coverage tool like gcov. To do this, recompile the code with additional compiler options, run it, and post-process the output files as follows:

$ make heat_cov
mpicc -fprofile-arcs -ftest-coverage -o heat_cov heat.c 
mpirun -np 1 ./heat_cov < heat.input
gcov heat.c

This will give you a file heat.c.gcov. If you look in here you will see output of the form:

           10:   91:        iter++;
            -:   92:
            -:   93:        // update local values
           10:   94:        jst = rank*chk+1;
           10:   95:        jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
        49990:   96:        for (j = jst; j < jfin; j++) {
    249850020:   97:            for (i = 1; i < Nx-1; i++) {
    249800040:   98:                tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
            -:   99:                              told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
            -:  100:            }
            -:  101:        }

where the large numbers on the left indicate the number of times each line of code has been executed (##### indicates a line that was never executed).
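These counts are easy to check against the input (assuming heat.input specifies Nx = Ny = 5000 and 10 iterations): the update statement on source line 98 executes once per interior grid point per iteration, i.e. 10 x 4998 x 4998 = 249,800,040 times, while the inner for header on line 97 adds one extra loop test per row, giving 10 x 4998 x 4999 = 249,850,020.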

Profiling MPI Jobs

There are a variety of tools available for analysing MPI programs. Go to the NCI Performance Analysis Tools web page.

  6. Run the ipm profiler for heat with the 5000 by 5000 grid size with 8 processes and inspect both the text and graphical output. What MPI communication routine takes the most time? Does the plot of the communication topology conform with what you expect?
  7. Now run the mpiP profiler, but first recompile your code with:
    mpicc -g -o heat heat.c -lmpiP -lm -lbfd -liberty -lunwind
    so that the MPI callsites are mapped to the source code.
    (You will need to unload module ipm and load module mpiP first.)
    How does this compare with ipm?
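Both ipm and mpiP collect their data through the MPI profiling interface: every MPI routine also exists under a PMPI_ name, so a tool can supply its own MPI_Send that records timing information and then calls PMPI_Send to do the real work. A minimal sketch of the idea (purely illustrative, not the actual ipm or mpiP source; the MPI_Send prototype must match the mpi.h on your system):

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch of PMPI interposition (illustrative only).                  */
    static double send_time  = 0.0;   /* seconds spent inside MPI_Send    */
    static long   send_calls = 0;     /* number of MPI_Send calls         */

    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm); /* the real send */
        send_time += MPI_Wtime() - t0;
        send_calls++;
        return rc;
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: %ld calls to MPI_Send, %.3f s\n",
               rank, send_calls, send_time);
        return PMPI_Finalize();
    }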

The iterative part of heat.c is given below:

   do {
    iter++;
    // update local values
    jst = rank*chk+1;
    jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    for (j = jst; j < jfin; j++) {
        for (i = 1; i < Nx-1; i++) {
            tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
                          told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
        }
    }
    // Send to rank+1
    if (rank+1 < size) {
        jst = rank*chk+chk;
        MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1, 
                            2, MPI_COMM_WORLD);
    }
    if (rank-1 >= 0) {
        jst = (rank-1)*chk+chk;
        MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE,rank-1, 
                            2, MPI_COMM_WORLD, &status);
    }
    // Send to rank-1
    if (rank-1 >= 0) {
        jst = rank*chk+1;
        MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank-1, 
                            1, MPI_COMM_WORLD);
    }
    if (rank+1 < size) {
        jst= (rank+1)*chk+1;
        MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1,
                            1, MPI_COMM_WORLD, &status);
    }
    // fix boundaries in tnew
    j=0;    for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
    j=Ny-1; for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
    i=0;    for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
    i=Nx-1; for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;

    jst = rank*chk+1;
    lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
    jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    for (j = jst; j < jfin; j++) {
        for (i = 1; i < Nx-1; i++) {
         tdiff = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
         lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
        }
    }
    for (i = 0; i < Nx*Ny; i++) told[i]=tnew[i];

    MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE, 
            MPI_MAX, MPI_COMM_WORLD);
    if (!rank) printf(" iteration %d convergence %lf\n",iter,mxdiff);
} while ( mxdiff > converge && iter < Max_iter);

The work done by any parallel program can be divided into the following categories: [P]arallel work, which is divided between the processes; [O]verhead, such as communication and synchronisation, which exists only because the program is parallel; and [R]eplicated work, which every process performs identically.

  8. Clearly annotate the iterative part of heat.c (i.e. the code shown above) to indicate whether a given line/section of code is [P]arallel, [O]verhead or [R]eplicated work.

Use coverage analysis with 1, 2 and 4 processes to verify your conclusions from above. Try both a small problem size and the LARGE problem size used above.

When running coverage analysis for multiple MPI processes, you will obtain counts summed over all processes. You should be looking at what happens to the count values as you increase the process count. Specifically, you might expect the count value for the parallel work to stay (roughly) constant, but the count value associated with replicated work to double each time you double the number of processes. As a corollary, the percentage of the total time spent executing the replicated lines will increase.

  9. Cut out the relevant portions of the coverage profiles for 1, 2 and 4 MPI processes for the LARGE problem. From these data and your answer to Q6, demonstrate that what is stated in the above paragraph is correct. What are the performance bottlenecks in this code?

Profile the heat application for 8 processes using hpctoolkit.

  10. In which loop is the majority of wall time spent? Is this what you expected? How could this cost be reduced or avoided?

You should have completed parts 1 and 2 within 2 hours

Part 3: Tuning Challenge

  1. Your task is to produce a version of heat.c that runs as fast as possible on a single processor, and scales as best as possible on multiple processors. Your code must be functionally correct. Post your times on the discussion board. I've included the executable for my version in the tar file as goodheat. I will give five dollars to the first person in each lab class to produce a version of heat that performs better than my version!