COMP4300: Laboratory 2
COMP4300/6430 2007: Laboratory 2
There is something strange with gettimeofday on saratoga! In
some cases time appears to go backwards! Meaning that if two time
measurements are made the second one can returns a time earlier than
the first. This is actually something we observed before, but I'd
forgotten about it! Others have reported similar things also on an
AMD64 dual core, but I don't have a fix at the moment! For now, just
add a check in your code, that if the result of calling gettimeofday
is less than the previous value, throw the result away and repeat the
timing experiment. This is done in mandel.c
Load Balancing and Partitioning Strategies
In this lab we will investigate the Mandelbrot Set computations
outlined in Lectures on Monday 16th March.
A tar file containing all the programs for this lab is available
on Saratoga in /tmp/COMP4300. Obtain this and untar it as follows:
cp /tmp/COMP4300/lab2.tar .
tar -xvf lab2.tar
PROGRAM 1 mandel.c
This program evaluates the Mandelbrot Set - but
with a twist! We talked about the Mandelbrot Set in L6. We are
interested in the Mandelbrot Set as an example of a load balancing
problem. Computing a Mandelbort pixel as implemented takes from 1 to 256
iterations. However even when 256 iterations are required the total
time taken is still very very small. Hence the twist - the
code does some extra work that is given by the square of the
number of Mandelbrot iterations. The extra work means nothing -
it is purely to increase task time duration.
The program requires as input
Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Rx_min - the minimum x coordinate
Rx_max - the maximum x coordinate
Ry_min - the minimum y coordinate
Ry_max - the maximum y coordinate
Nexpt - Number of timing experiments to run
NOTE: Rx/Ry Min/Max should be between -2 and +2 as mentioned in L6
The Mandelbrot compuation is run Nexpt times and the minimum
time is printed out. This should minimize the need to run the
computation several times.
The code is written to solve the Mandelbrot problem twice, first
sequentially, and then in parallel. We time both, and for
a sanity check verify that the results from the sequential code are
identical to those from the parallel computation. You should only
change the parallel part of the code.
For problem sizes with Nx and Ny less than 20 the resulting pixels
are printed. This is really just for interest.
Run the code
mpirun -np 1 mandel
with input
10 10
-2 2
-2 2
4
to make sure it works. Maybe you can see that the printout looks
something like picture in L6, noting that the center of picture in
L6 should be black. If not then you just have to trust me that it
works!
- Complete the table for the following two input data sets
(remember you will need to run on the batch system)
#1 Nx=100, Ny=10, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
#2 Nx=10, Ny=100, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
---------------------------------------------------------
Number of processors
Input Time 1 2 4 8
---------------------------------------------------------
#1 Sequential - - -
#1 Parallel
#1 Speedup
#2 Sequential - - -
#2 Parallel
#2 Speedup
---------------------------------------------------------
Both data sets have the same number of data points and the same
Rx/y min/max values. Explain why the performance is so different.
(Hint it may help to run this region for a smaller number of data
points so that it prints out their values).Your answer should
include comments based on how the code has been parallelised.
- As it stands the parallel code records the time from when all the
processors start until the last one is finished. Insert code to
record the time take to perform the pixel computation on each
processsor. This is the time from the current initial timing call
until just BEFORE the MPI_Reduce call. Each process should
determine its minimum task time from the Nexpt experiments. When all
experiments are done collect the minimum task times from the
different processors onto rank=0 using MPI_Gather and store them
in the ttsk_all array (already declared). Modify the code so that
rank=0 also prints out these values when printing out the
sequential and parallel total times. Using the same input data as
in Q1 complete the following table
---------------------------------------------------------
Task Number of processors
Input Time Rank 2 4 8
---------------------------------------------------------
#1 Parallel 0
1
2 N/A
3 N/A
4 N/A N/A
5 N/A N/A
6 N/A N/A
7 N/A N/A
#2 Parallel 0
1
2 N/A
3 N/A
4 N/A N/A
5 N/A N/A
6 N/A N/A
7 N/A N/A
---------------------------------------------------------
Does this explain the results from Q1?
- Implement a dynamic load balancing scheme using a master slave
approach in which the slaves request tasks from the master and
(in the parallel section) the master does no other useful work
but allocate tasks to slave (i.e. you effectively loose a
process). Gather performance data for two data sets given
above on upto 8 processors.
PROGRAM 2 mysort.c
Program mysort.c reads in two
integer values N and MxInt. Each process
from
0-size then generates N integers between the range of
0-MxInt using a pseudo random number generator. These individual lists
are then sorted. Your task is to fully sort the lists
of integers over all the processes, so process 0 has the an ordered list
for which the largest value is less than (or equal to) the smallest value
in the list on process 1, which also has an ordered list, where the
maximum is less than the smallest value on process 2, etc.
- (Some revision for you) The code contains three
different sorting algorithms, sortA, sortB, and sortC.
- Name the three different sorting algorithms (look at the code).
- What is the theoretical cost of each (on a sequential machine)?
- Which one is the fastest? (use this one in what follows)
- (Tricky) Implement the parallel
bucket sort that is illustrated on slide L7-15.
- Divide the numerical range (MxInt) by the number of processes
- Sort the list on each process into local buckets that are
defined according to which process they will be sent to.
- Send data from a given bucket to the relevant process.
- After the sort is complete it is likely that there will
be different numbers of elements on different processes.
Print the number of elements on each process. (Managing memory
and not knowing how many items will get sent to any given
process makes this task tricky).
- Analysis the cost of your implementation (both computation
and communication cost).
(You should note that the random number generator is seeded
using the current microseconds obtained from gettimeofday. So
your results are not reproducible between runs. If you want to
make it reproducible while debugging your code, change this so
that the seed value is fixed.)
- Obtain performance data for your code - both as a function of
the number of processes, and number of integers generated per
process. Comment on how well you code runs!
- (Hard - meaning this requires significant
understanding and reading beyond what has been directly presented
to you in lectures!) As above generate a buffer of N random
integers on each parallel process. Then sort these integers using
your favourite sorting algorithm. You now have NPROC lists of
N sorted integers. Your task now is to merge these sorted lists
so that every process still has a list of N sorted integers, such
that the maximum integer in the list on process 0 is less or equal
to the minimum integer on process 1, and the maximum on process 1
is less than the minimum on process 2 etc. Your memory requirement
on any given process must scale as X*N, where X is a low number
(say 2-3) that does not depend on NPROC. In essence you need to do
a number of merge sorts. Here are some hints