Add an if test around the Recv in Global_sum(), so that a process only
receives if its partner actually exists (partner < nproc), i.e.:

    if (my_rank < partner) {
        if (partner < nproc) {
            MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
            sum += temp;
        }
        …
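A minimal sketch of the full function with that guard in place, assuming the
usual bitmask/partner tree scheme (the names here are illustrative and the
handout's actual code may differ in detail):

    #include <mpi.h>

    /* Tree-structured global sum that tolerates a non-power-of-2 nproc:
     * a receiver simply skips the Recv when its partner does not exist. */
    int Global_sum(int my_int, int my_rank, int nproc, MPI_Comm comm)
    {
        int sum = my_int, temp, partner, done = 0;
        unsigned bitmask = 1;

        while (!done && bitmask < (unsigned) nproc) {
            partner = my_rank ^ bitmask;          /* differs in this one bit */
            if (my_rank < partner) {
                if (partner < nproc) {            /* only receive if partner exists */
                    MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm,
                             MPI_STATUS_IGNORE);
                    sum += temp;
                }
                bitmask <<= 1;
            } else {
                /* Pass the partial sum up the tree and drop out. */
                MPI_Send(&sum, 1, MPI_INT, partner, 0, comm);
                done = 1;
            }
        }
        return sum;                               /* only rank 0 holds the total */
    }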
This is an exercise in message passing and managing process IDs. Just remap
the ranks, e.g. for processes 0-3 reducing to root process 1, map rank 1 to 0,
2 to 1, 3 to 2, and 0 to 3. Calculation of the tree partners is then done
using the mapped ranks, not the 'real' ranks in the communicator.
(see included global_sum.c)
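The remapping amounts to a rotation of the ranks. A minimal sketch, using the
same bitmask/partner loop as in the sketch above; the helper names are
illustrative and not taken from global_sum.c:

    /* Map real communicator ranks to 'virtual' ranks in which the chosen
     * root becomes 0 (root = 1 with 4 processes maps 1->0, 2->1, 3->2,
     * 0->3), and back again. */
    static int to_virtual(int rank, int root, int nproc) {
        return (rank - root + nproc) % nproc;
    }

    static int to_real(int virt_rank, int root, int nproc) {
        return (virt_rank + root) % nproc;
    }

    /* Tree partners are computed from the virtual ranks, e.g.
     *     virt_partner = to_virtual(my_rank, root, nproc) ^ bitmask;
     * but the real rank is what gets passed to the MPI calls, e.g.
     *     MPI_Send(&sum, 1, MPI_INT,
     *              to_real(virt_partner, root, nproc), 0, comm);
     */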
MPI_Reduce(&x, &total, 1, MPI_INT, MPI_SUM, CHOSEN, comm); where CHOSEN is the root process chosen for the sum.
With MPI_Reduce, only the root process receives the total. With MPI_Allreduce, all processes receive the total.
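A minimal, self-contained sketch contrasting the two calls (CHOSEN is fixed to
0 here and the per-process value is made up for the example):

    #include <mpi.h>
    #include <stdio.h>

    #define CHOSEN 0                  /* whichever root the exercise picks */

    int main(int argc, char *argv[])
    {
        int my_rank, x, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        x = my_rank + 1;              /* some per-process value */

        /* Only rank CHOSEN receives the sum. */
        MPI_Reduce(&x, &total, 1, MPI_INT, MPI_SUM, CHOSEN, MPI_COMM_WORLD);
        if (my_rank == CHOSEN)
            printf("Reduce:    total = %d\n", total);

        /* Every rank receives the sum; note there is no root argument. */
        MPI_Allreduce(&x, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("Allreduce: rank %d has total = %d\n", my_rank, total);

        MPI_Finalize();
        return 0;
    }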
(see global_gather.c) - not the most efficient solution, but it gets the job
done when the number of processes is a power of 2.
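global_gather.c itself is not reproduced here; the following is only a sketch
of the gather-then-sum idea its name suggests: every process collects all
nproc values with a butterfly exchange and sums them locally. This moves more
data than a true reduction (hence not the most efficient) and, in this simple
form, only handles power-of-2 process counts. The function name is made up:

    #include <mpi.h>
    #include <stdlib.h>

    /* Every process ends up with the sum of all my_int values. */
    int Global_sum_all(int my_int, int my_rank, int nproc, MPI_Comm comm)
    {
        int *vals = malloc(nproc * sizeof(int));
        int count = 1;             /* how many values this rank holds so far */
        int sum = 0;
        unsigned bitmask;

        vals[0] = my_int;
        for (bitmask = 1; bitmask < (unsigned) nproc; bitmask <<= 1) {
            int partner = my_rank ^ bitmask;   /* exists because nproc = 2^k */
            /* Swap the blocks gathered so far; both sides double their count. */
            MPI_Sendrecv(vals, count, MPI_INT, partner, 0,
                         vals + count, count, MPI_INT, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            count *= 2;
        }
        for (int i = 0; i < nproc; i++)
            sum += vals[i];
        free(vals);
        return sum;                /* every rank now has the full sum */
    }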
---------------------------------------------------------
                            Number of processors
Input   Time               1        2        4        8
---------------------------------------------------------
#1      Sequential    1.7351      -        -        -
#1      Parallel      1.7339   1.2062   0.9128   0.9117
#1      Speedup         1.0x     1.4x     1.9x     1.9x
#2      Sequential    1.3093      -        -        -
#2      Parallel      1.3092   0.6713   0.3559   0.1986
#2      Speedup         1.0x     1.9x     3.7x     6.6x
---------------------------------------------------------
The second scheme shows much better scalability. The domain decomposition is
over Ny, so in the first case there are just 10 parallel tasks, while in the
second case there are 100. Note that the division of tasks is round-robin
(r % size == rank). Print out the tasks and you will see that the last row
contains a lot of long tasks, so the process that gets that row has much more
work to do.
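A minimal sketch of that round-robin assignment (Ny, rank, size and do_row()
are placeholder names, not the exercise's actual code):

    #include <stdio.h>

    /* Stand-in for the real per-row work. */
    void do_row(int r) { (void) r; }

    /* Round-robin decomposition: row r is handled by the process with
     * r % size == rank.  Printing r makes the distribution visible and
     * shows which rank ends up with the expensive last row. */
    void process_rows(int Ny, int rank, int size)
    {
        for (int r = 0; r < Ny; r++) {
            if (r % size == rank) {
                printf("rank %d takes row %d\n", rank, r);
                do_row(r);
            }
        }
    }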
---------------------------------------------------------
                       Task    Number of processors
Input   Time           Rank      2        4        8
---------------------------------------------------------
#1      Parallel       0       0.5279   0.4705   0.4705
                       1       1.2063   0.9128   0.9117
                       2         -     0.0574   0.0000
                       3         -     0.2928   0.0000
                       4         -        -     0.0002
                       5         -        -     0.0011
                       6         -        -     0.0573
                       7         -        -     0.2927
#2      Parallel       0       0.6386   0.3050   0.1125
                       1       0.6711   0.3151   0.1190
                       2         -     0.3331   0.1248
                       3         -     0.3559   0.1342
                       4         -        -     0.0937
                       5         -        -     0.0940
                       6         -        -     0.1004
                       7         -        -     0.1062
---------------------------------------------------------
These results clearly show the difference in load imbalance between the two
schemes.