Hands-On Session Parallelization Strategies #3:
Synchronous Computations
Objective: To gain experience in measuring the performance of
applications parallelized using the Synchronous Computation strategy.
The code for
this exercise can be found here.
Instructions on how to log into the remote machine and how to
download the source code to your working directory
(using wget) can be found
here.
We again use the program heat.c, which sets up and
solves a 2-D heat diffusion problem.
This models a metal plate of size Nx by Ny for which the
temperature at the edge is held at some constant
value Tedge. The aim is
to determine the temperature in the middle of the plate. This is
done iteratively by dividing the domain of the metal plate into a
grid of points. The new temperature at each grid point is calculated
as the average of the current temperatures at the four adjacent grid
points. Iteration continues until the maximum change in temperature
for any grid point is less than some threshold. A diagrammatic
illustration of the problem is shown below:
The code requests as input:
- Nx - the number of data points along the x axis
- Ny - the number of data points along the y axis
- Tedge - the fixed temperature at the boundary
- Max_iter - the maximum number of iterations permitted
- converge - the convergence criterion
Except at the edges, each grid point is initially set to zero. For
problem sizes with Nx and Ny less than 20, the
resulting temperatures are printed. This is really just for
interest.
The code has been parallelized using MPI.
Run the code with mpirun -np 1 ./heat < heat.input
just to make sure it works.
- With Nx = 15, Ny = 19 and Tedge =
100.0, how many iterations does it take before grid point (7,5) is
non-zero (assume grid point numbering begins at (0,0))? In general,
with Nx = Ny = () how
many iterations does it take for point ()
to be non-zero? Explain the rationale behind your answer.
For our purposes we can happily restrict the number of iterations
and not worry about whether or not the calculation converges, i.e. we
don't care about the result, only the performance of the
code!
- Using Nx = Ny = 5000 and
setting Max_iter to 10, gather the performance information
necessary to complete the following table. Remember that it is best to
run performance tests using the batch queue to avoid interference
from other users.
| Number of processors | 1   | 2 | 4 | 8 |
|----------------------|-----|---|---|---|
| Time (s)             |     |   |   |   |
| Speedup              | 1.0 |   |   |   |
What is the parallel speedup on 2, 4 and 8 CPUs?
Comment on the performance.
- How is the domain decomposed (in other words, how are the grid
points divided between the processes)? How do the total memory requirements
scale with the number of processes? How would you rewrite the program
so that it scales better?
- The exchange of data (i.e. communication) between the parallel
processes is far from optimal - can you explain why?
(hint: from an initial inspection of the code, you might
expect it to deadlock - but it doesn't). Implement an improved
communication scheme, still using blocking sends and
receives. Gather new performance data to see if this is faster.
Blocking:

| Number of processors | 1   | 2 | 4 | 8 |
|----------------------|-----|---|---|---|
| Time (s)             |     |   |   |   |
| Speedup              | 1.0 |   |   |   |
- Change the communications to use non-blocking sends and receives
(MPI_Isend(), MPI_Irecv(), MPI_Wait()
/ MPI_Waitall()). Recompute the performance numbers
of Q2.
Non-blocking:

| Number of processors | 1   | 2 | 4 | 8 |
|----------------------|-----|---|---|---|
| Time (s)             |     |   |   |   |
| Speedup              | 1.0 |   |   |   |