CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science (CECS)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP4300
COMP4300: Laboratory 2

COMP4300/6430 2007: Laboratory 2


There is something strange with gettimeofday on saratoga! In some cases time appears to go backwards! Meaning that if two time measurements are made the second one can returns a time earlier than the first. This is actually something we observed before, but I'd forgotten about it! Others have reported similar things also on an AMD64 dual core, but I don't have a fix at the moment! For now, just add a check in your code, that if the result of calling gettimeofday is less than the previous value, throw the result away and repeat the timing experiment. This is done in mandel.c

Load Balancing and Partitioning Strategies

In this lab we will investigate the Mandelbrot Set computations outlined in Lectures on Monday 16th March.

A tar file containing all the programs for this lab is available on Saratoga in /tmp/COMP4300. Obtain this and untar it as follows:

  cp /tmp/COMP4300/lab2.tar .
  tar -xvf lab2.tar

PROGRAM 1 mandel.c

This program evaluates the Mandelbrot Set - but with a twist! We talked about the Mandelbrot Set in L6. We are interested in the Mandelbrot Set as an example of a load balancing problem. Computing a Mandelbort pixel as implemented takes from 1 to 256 iterations. However even when 256 iterations are required the total time taken is still very very small. Hence the twist - the code does some extra work that is given by the square of the number of Mandelbrot iterations. The extra work means nothing - it is purely to increase task time duration.

The program requires as input

Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Rx_min - the minimum x coordinate
Rx_max - the maximum x coordinate
Ry_min - the minimum y coordinate
Ry_max - the maximum y coordinate
Nexpt - Number of timing experiments to run
NOTE: Rx/Ry Min/Max should be between -2 and +2 as mentioned in L6 The Mandelbrot compuation is run Nexpt times and the minimum time is printed out. This should minimize the need to run the computation several times.

The code is written to solve the Mandelbrot problem twice, first sequentially, and then in parallel. We time both, and for a sanity check verify that the results from the sequential code are identical to those from the parallel computation. You should only change the parallel part of the code.

For problem sizes with Nx and Ny less than 20 the resulting pixels are printed. This is really just for interest.

Run the code

	mpirun -np 1 mandel
with input
10 10
-2 2
-2 2
4
to make sure it works. Maybe you can see that the printout looks something like picture in L6, noting that the center of picture in L6 should be black. If not then you just have to trust me that it works!
  1. Complete the table for the following two input data sets (remember you will need to run on the batch system)
         #1 Nx=100, Ny=10, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
         #2 Nx=10, Ny=100, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
         ---------------------------------------------------------
                                  Number of processors
         Input  Time          1        2        4        8          
         ---------------------------------------------------------
         #1     Sequential             -        -        -     
         #1     Parallel
         #1     Speedup
         #2     Sequential             -        -        -     
         #2     Parallel
         #2     Speedup
         ---------------------------------------------------------
    
    Both data sets have the same number of data points and the same Rx/y min/max values. Explain why the performance is so different. (Hint it may help to run this region for a smaller number of data points so that it prints out their values).Your answer should include comments based on how the code has been parallelised.
  2. As it stands the parallel code records the time from when all the processors start until the last one is finished. Insert code to record the time take to perform the pixel computation on each processsor. This is the time from the current initial timing call until just BEFORE the MPI_Reduce call. Each process should determine its minimum task time from the Nexpt experiments. When all experiments are done collect the minimum task times from the different processors onto rank=0 using MPI_Gather and store them in the ttsk_all array (already declared). Modify the code so that rank=0 also prints out these values when printing out the sequential and parallel total times. Using the same input data as in Q1 complete the following table
         ---------------------------------------------------------
                         Task     Number of processors
         Input  Time     Rank        2            4            8
         ---------------------------------------------------------
         #1     Parallel   0         
                           1
                           2        N/A
                           3        N/A
                           4        N/A          N/A
                           5        N/A          N/A
                           6        N/A          N/A
                           7        N/A          N/A
         #2     Parallel   0
                           1
                           2        N/A
                           3        N/A
                           4        N/A          N/A
                           5        N/A          N/A
                           6        N/A          N/A
                           7        N/A          N/A
         ---------------------------------------------------------
    
    Does this explain the results from Q1?
  3. Implement a dynamic load balancing scheme using a master slave approach in which the slaves request tasks from the master and (in the parallel section) the master does no other useful work but allocate tasks to slave (i.e. you effectively loose a process). Gather performance data for two data sets given above on upto 8 processors.

PROGRAM 2 mysort.c

Program mysort.c reads in two integer values N and MxInt. Each process from 0-size then generates N integers between the range of 0-MxInt using a pseudo random number generator. These individual lists are then sorted. Your task is to fully sort the lists of integers over all the processes, so process 0 has the an ordered list for which the largest value is less than (or equal to) the smallest value in the list on process 1, which also has an ordered list, where the maximum is less than the smallest value on process 2, etc.
  1. (Some revision for you) The code contains three different sorting algorithms, sortA, sortB, and sortC.
    • Name the three different sorting algorithms (look at the code).
    • What is the theoretical cost of each (on a sequential machine)?
    • Which one is the fastest? (use this one in what follows)
  2. (Tricky) Implement the parallel bucket sort that is illustrated on slide L7-15.
    • Divide the numerical range (MxInt) by the number of processes
    • Sort the list on each process into local buckets that are defined according to which process they will be sent to.
    • Send data from a given bucket to the relevant process.
    • After the sort is complete it is likely that there will be different numbers of elements on different processes. Print the number of elements on each process. (Managing memory and not knowing how many items will get sent to any given process makes this task tricky).
    • Analysis the cost of your implementation (both computation and communication cost).
    (You should note that the random number generator is seeded using the current microseconds obtained from gettimeofday. So your results are not reproducible between runs. If you want to make it reproducible while debugging your code, change this so that the seed value is fixed.)
  3. Obtain performance data for your code - both as a function of the number of processes, and number of integers generated per process. Comment on how well you code runs!
  4. (Hard - meaning this requires significant understanding and reading beyond what has been directly presented to you in lectures!) As above generate a buffer of N random integers on each parallel process. Then sort these integers using your favourite sorting algorithm. You now have NPROC lists of N sorted integers. Your task now is to merge these sorted lists so that every process still has a list of N sorted integers, such that the maximum integer in the list on process 0 is less or equal to the minimum integer on process 1, and the maximum on process 1 is less than the minimum on process 2 etc. Your memory requirement on any given process must scale as X*N, where X is a low number (say 2-3) that does not depend on NPROC. In essence you need to do a number of merge sorts. Here are some hints