Hands-On Session Parallelization Strategies #3:
Synchronous Computations

Objective: To gain experience in the the performance of applications parallelized by the Synchronous Computation strategy.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

We again use the program heat.c, which sets up and solves a 2-D heat diffusion problem.

This models a metal plate of size Rx by Ry for which the temperature at the edge is held at some constant value Tedge. The aim is to determine the temperature in the middle of the plate. This is done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the current temperatures at the four adjacent grid points. Iteration continues until the maximum change in temperature for any grid point is less than some threshold. A diagrammatic illustration of the problem is shown below:

Illustration of Jacobi iteration for heat problem. Grid
	of R_x by R_y points with temperature fixed to T_edge for edge
	points. New temperature for a grid point is average of current
	temperature at the four neighbouring grid points.

The code requests as input:

Nx
the number of data points along the x axis
Ny
the number of data points along the y axis
Tedge
the fixed temperature at the boundary
Max_iter
Maximum number of iterations permitted
converge
the convergence criteria

Except at the edges, each grid point is initially set to zero. For problem sizes with Nx and Ny less than 20 the resulting temperatures are printed. This is really just for interest.

The code has been parallelized using MPI.

Run the code mpirun -np 1 ./heat < heat.input

just to make sure it works.

  1. With Nx = 15 and Ny = 19 and Tedge = 100.0, how many iterations does it take before grid point (7,5) is non-zero (assume grid points begin at 0,0)? In general, with Nx = Ny = (2n+1) how many iterations does it take for point (n,n) to be non-zero? Explain the rationale behind your answer.

For our purpose we can happily restrict the number of iterations and not worry whether or not the calculation converges, i.e. we don’t care about the result, only the performance of the code!

  1. Using Nx = Ny = 5000 and setting Max_iter to 10, gather performance information necessary to complete the following table. Remember that is best to run performance tests using the batch queue to avoid interference from other users.
      Number of processors
    1248
    Time (s)    
    Speedup1.0   

    What is the parallel speedup on 2, 4 and 8 CPUs? Comment on the performance.
  2. How is the domain decomposed? (in other words, how are the grid points divided between processes?). How do the total memory requirements scale with the number of processes? How would you re-write the program so that it scaled better?
  1. The exchange of data (i.e. communication) between the parallel regions is far from optimal - can you explain why? (hint: from initial inspection of the code, you might expect it to deadlock - but it doesn't). Implement an improved communication scheme, but still using blocking sends and receives. Gather new performance data to see if this is faster.
    Blocking Number of processors
    1248
    Time (s)    
    Speedup1.0   
  1. Change the communications to use non-blocking sends and receives (MPI_Isend(), MPI_Irecv(), MPI_Wait() / MPI_Waitall()). Recompute the performance numbers of Q2.
    Non-blocking Number of processors
    1248
    Time (s)    
    Speedup1.0