COMP4300: Laboratory 2
COMP4300/6430 2011: Laboratory 2
Binary Trees and Load Balancing
This lab will again use the NCI Xe system. You should all have
accounts on this system by now. If not you should email
comp4300@cs.anu.edu.au well or if you are in the lab you will need to
talk with the tutor.
A reminder that comprehensive documentation for the NCI Xe system available
HERE.
A tar file containing all the programs for this lab is available
HERE. Save this tar file on your local
desktop and then transfer it to the Xe system as per last week.
Growing your Trees
global_sum.c generates a single random integer on each
process and then seeks to sum the values together across all processes
using a binary tree. The code works if the number of processes used is
a power of two, but fails otherwise. Inspect and run the code. Make
sure you understand how it works.
- Fix the code so that it works correctly even when the number of
processes uses is not a power of two.
- [SKIP INITIALLY BUT COMPLETE LATER IF TIME] The current program gives the final sum value only on process
0. Modify the code so that it will give the final value on any
process that you chose. Have the rank of that process passed to the
summation routine. For the purpose of this exercise make that
process the the one with rank=nproc/2.
- Modify the code so that instead of using the routine Global_sum
it calls MPI_Reduce (and gives the same answer!). That is you need to
replace the line:
total = Global_sum(x, my_rank, nproc, comm);
with an equivalent line where the value of total is computed via a
call to MPI_Reduce.
- If you used MPI_Allreduce instead of MPI_Reduce what is the
difference?
global_gather.c is similar to global_sum.c in that we have
multiple processes that each generate a single integer of random
value. However, instead of adding these values together we now wish to
create a list of all these variables on each process. This is a
"gather" operation, and has been implemented in the code using
MPI_Allgather. Inspect and run the code. Make sure you understand what
it is doing.
- Your task is to re-write global_gather.c and replace the call to
MPI_Allgather with a call to a routine that you have written that only
uses MPI point to point send/receive calls. Base your routine off the
global_sum routine in global_sum.c. But now instead of having one
process send another process a single integer in each phase, you will
need to have pairs of processes exchanging data. Note that the amount
of data exchanged will double in each phase. Thus in phase one pairs of
processes will exchange a single integer, in phase two they will
exchange two integers, in phase three it will be four etc.
Balancing your Loads
We will use the Mandelbrot Set as an example of a load balancing
problem. No doubt you will have all seen such images:
The image is generated from values associated with a set of points in
the complex plane. The points are quasi stable with values that are
computed by iterating the function Zk+1 =
Zk2 + C, where Z and C are complex numbers (Z = A + Bi), with Z initially set to
zero and C is the position of the point in the complex
plan. Iterations of the above continue until either
|Z|>2 or some arbitrary iteration limit is reached (|Z| =
sqrt(A2 + B2). The set of points are enclosed by
a circle centered at (0,0) of radius 2.
A program to evaluate the Mandelbrot set is contained in mandel.c -
but with a twist. Computing a Mandelbort pixel as implemented takes
from 1 to 256 iterations. However even when 256 iterations are
required the total time taken is still very very small. Hence the
twist - the code does some extra work that is given by the square of the
number of Mandelbrot iterations. The extra work is purely to
increase task time duration so we have a meaningful load balancing problem.
The program requires as input
Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Rx_min - the minimum x coordinate
Rx_max - the maximum x coordinate
Ry_min - the minimum y coordinate
Ry_max - the maximum y coordinate
Nexpt - Number of timing experiments to run
NOTE: To ensure the point is enclosed within the circle of radius 2
the values of Rx/Ry Min/Max should be between -2 and +2.
The Mandelbrot compuation is run Nexpt times and the minimum
time is printed out. This should minimize the need to run the
computation several times.
The code is written to solve the Mandelbrot problem twice, first
sequentially, and then in parallel. We time both, and for
a sanity check verify that the results from the sequential code are
identical to those from the parallel computation. You should only
change the parallel part of the code.
For problem sizes with Nx and Ny less than 20 the resulting pixels
are printed. This is really just for interest.
Run the code
mpirun -np 1 mandel
with input
10 10
-2 2
-2 2
4
to make sure it works. Maybe you can see that the printout looks
something like picture above, noting that the center of picture
should be black. If not then you just have to trust me that it
works!
- Complete the table for the following two input data sets
(You should run on the batch systemhe batch system).
#1 Nx=100, Ny=10, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
#2 Nx=10, Ny=100, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
---------------------------------------------------------
Number of processors
Input Time 1 2 4 8
---------------------------------------------------------
#1 Sequential - - -
#1 Parallel
#1 Speedup
#2 Sequential - - -
#2 Parallel
#2 Speedup
---------------------------------------------------------
Both data sets have the same number of data points and the same
Rx/y min/max values. Explain why the performance is so different.
(Hint it may help to run this region for a smaller number of data
points so that it prints out their values).Your answer should
include comments based on how the code has been parallelised.
- As it stands the parallel code records the time from when all the
processors start until the last one is finished. Insert code to
record the time take to perform the pixel computation on each
processsor. This is the time from the current initial timing call
until just BEFORE the MPI_Reduce call. Each process should
determine its minimum task time from the Nexpt experiments. When all
experiments are done collect the minimum task times from the
different processors onto rank=0 using MPI_Gather and store them
in the ttsk_all array (already declared). Modify the code so that
rank=0 also prints out these values when printing out the
sequential and parallel total times. Using the same input data as
above complete the following table
---------------------------------------------------------
Task Number of processors
Input Time Rank 2 4 8
---------------------------------------------------------
#1 Parallel 0
1
2 N/A
3 N/A
4 N/A N/A
5 N/A N/A
6 N/A N/A
7 N/A N/A
#2 Parallel 0
1
2 N/A
3 N/A
4 N/A N/A
5 N/A N/A
6 N/A N/A
7 N/A N/A
---------------------------------------------------------
Does this explain the previous results?
- Implement a dynamic load balancing scheme using a master slave
approach in which the slaves request tasks from the master and
(in the parallel section) the master does no other useful work
but allocate tasks to slave (i.e. you effectively lose a
process). Gather performance data for two data sets given
above on upto 16 processors.