COMP4300/6430 2013

Some Notes on Lab 2 - Binary Trees and Load Balancing

Growing your Trees

  1. Add an if test around the Recv in Global_sum(), so the process only receives if partner < nproc, i.e.:

    if (my_rank < partner) {
        if (partner < nproc) {
            MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
            sum += temp;
        }
        …
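
    For context, here is a minimal sketch of the whole guarded tree sum (variable and function names are illustrative; your Global_sum() may differ in detail):

        #include <mpi.h>

        int Global_sum(int x, int my_rank, int nproc, MPI_Comm comm) {
            int sum = x, temp, partner, bitmask = 1;
            while (bitmask < nproc) {
                partner = my_rank ^ bitmask;
                if (my_rank < partner) {
                    if (partner < nproc) {   /* guard: partner may not exist */
                        MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm,
                                 MPI_STATUS_IGNORE);
                        sum += temp;
                    }
                } else {
                    MPI_Send(&sum, 1, MPI_INT, partner, 0, comm);
                    break;                   /* after sending, this rank is done */
                }
                bitmask <<= 1;
            }
            return sum;                      /* only rank 0 holds the full sum */
        }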

  2. This is an exercise in message passing and managing process IDs. Just remap the nodes, e.g. for nodes 0-3 reducing to root node 1, map node 1 to 0, 2 to 1, 3 to 2, and 0 to 3.
    Calculation of tree partners is done using the mapped ranks, not the ‘real’ ranks in the communicator.
    (see included global_sum.c)
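
    A minimal sketch of the remapping, assuming the chosen root rank is held in a variable root (illustrative; global_sum.c may differ in detail):

        /* Map real ranks so the chosen root becomes mapped rank 0, and back. */
        int to_mapped(int rank, int root, int nproc) { return (rank - root + nproc) % nproc; }
        int to_real(int mapped, int root, int nproc) { return (mapped + root) % nproc; }

        /* In the tree loop, compute partners on mapped ranks, then translate
           back to real ranks for the actual MPI_Send/MPI_Recv, e.g.:
           partner = to_real(to_mapped(my_rank, root, nproc) ^ bitmask, root, nproc); */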

  3. MPI_Reduce(&x, &total, 1, MPI_INT, MPI_SUM, CHOSEN, comm); where CHOSEN is the rank of the root process chosen for the sum.

  4. With MPI_Reduce, only the root process receives the total. With MPI_Allreduce, all processes receive the total.
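
    For reference, the corresponding collective call takes the same arguments as the MPI_Reduce above, minus the root:

        MPI_Allreduce(&x, &total, 1, MPI_INT, MPI_SUM, comm);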

  5. (see global_gather.c) - not the most efficient solution, but it gets the job done when the number of processes is a power of 2.

Balancing your Loads

  1. You should get results something like:
        ---------------------------------------------------------
                                   Number of processors
        Input   Measure       1         2         4         8
        ---------------------------------------------------------
        #1      Sequential    1.7351    -         -         -
        #1      Parallel      1.7339    1.2062    0.9128    0.9117
        #1      Speedup       1.0x      1.4x      1.9x      1.9x
        #2      Sequential    1.3093    -         -         -
        #2      Parallel      1.3092    0.6713    0.3559    0.1986
        #2      Speedup       1.0x      1.9x      3.7x      6.6x
        ---------------------------------------------------------
    
    The second scheme shows much better scalability. The domain decomposition is over Ny, so in the first case there are just 10 parallel tasks, whereas in the second case there are 100. Note that the division of tasks is round robin (r % size == rank), sketched below. Print out the tasks and you will see that the last row has a lot of long tasks, so the process that gets this row will have much more work to do.
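
    A minimal sketch of that round-robin decomposition (compute_row() is a hypothetical stand-in for the per-row Mandelbrot work; see mandel.c for the real loop):

        /* Each process takes every size-th row, starting at its own rank. */
        for (int r = 0; r < Ny; r++) {
            if (r % size == rank) {
                compute_row(r);   /* hypothetical per-row work function */
            }
        }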
  2. You should get results something like:
        ---------------------------------------------------------
                        Task          Number of processors
        Input  Time     Rank       2            4            8
        ---------------------------------------------------------
        #1     Parallel   0        0.5279       0.4705       0.4705
                          1        1.2063       0.9128       0.9117
                          2        -            0.0574       0.0000
                          3        -            0.2928       0.0000
                          4        -            -            0.0002
                          5        -            -            0.0011
                          6        -            -            0.0573
                          7        -            -            0.2927
        #2     Parallel   0        0.6386       0.3050       0.1125
                          1        0.6711       0.3151       0.1190
                          2        -            0.3331       0.1248
                          3        -            0.3559       0.1342
                          4        -            -            0.0937
                          5        -            -            0.0940
                          6        -            -            0.1004
                          7        -            -            0.1062
        ---------------------------------------------------------
    
    These results clearly show the difference in load imbalance between the two schemes.
    See mandel.c
  3. You have to have one MPI process (say rank 0) devoted to handing out the next task. It keeps a counter, initially zero, and sits (essentially) in a receive call in an infinite loop. When a request comes in, it responds with the current value of the counter, then increments the counter and goes back into the receive call (note the receive must accept messages from any process). Slaves sit in their own infinite loops requesting tasks from the master. The master also controls termination: when the counter exceeds the maximum dimension, it starts responding to requests with -1 to indicate there are no more tasks, and on receipt of -1 a slave terminates. The master needs to keep track of the number of slaves, so that it terminates once it has handed out number_of_slaves values of -1.
    See mandel2.c
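
    A minimal sketch of this master/slave scheme, assuming tasks are integer row indices 0..Ny-1 and compute_row() is a hypothetical stand-in for the real work (mandel2.c may differ in detail):

        #include <mpi.h>

        void compute_row(int row);   /* hypothetical per-row work function */

        /* Master (rank 0): hand out row indices on request; -1 means "no more tasks". */
        void master(int Ny, int nslaves, MPI_Comm comm) {
            int counter = 0, finished = 0, request;
            MPI_Status status;
            while (finished < nslaves) {          /* one -1 per slave */
                MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &status);
                if (counter < Ny) {
                    MPI_Send(&counter, 1, MPI_INT, status.MPI_SOURCE, 0, comm);
                    counter++;
                } else {
                    int stop = -1;
                    MPI_Send(&stop, 1, MPI_INT, status.MPI_SOURCE, 0, comm);
                    finished++;
                }
            }
        }

        /* Slave: request tasks until told to stop. */
        void slave(MPI_Comm comm) {
            int task, request = 0;
            for (;;) {
                MPI_Send(&request, 1, MPI_INT, 0, 0, comm);   /* ask for a task */
                MPI_Recv(&task, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
                if (task == -1) break;                        /* no more tasks */
                compute_row(task);
            }
        }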