Hands-On Session Parallelization Strategies #2:
Bucket Sort

Objective: To better understand data partitioning strategies and further knowledge of communication patterns using the example of the bucket sort.
The code for this exercise can be found here. Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

The parallel bucket sort begins with an array distributed across processes according to its index space, which is the usual case with the data decompositional parallelization strategy. However, the final array is distributed according to the values of the array, in the sense that each process contains the array elements in contiguous non-overlapping ranges (called `buckets'). This can be achieved by a series of standard communication patterns.

The program bucketSort.c performs a parallel sort on n integers across p processes, when invoked by mpirun -np p ./bucketSort n. Currently, each process generates its local portion of the array, with elements in the range 0..n/2 and a local size n_loc. It splits these into p local buckets, stored in the array AbL[0..n_loc]. The sizes and offsets in AbL of these buckets are stored bLsize[0..p-1] and bLoffs[0..p-1] respectively. Process r must collect the rth local buckets from all processes; to do so, it must first collect the lengths of these buckets, which it stores in the array bCsizeR[0..p-1]. The sum of these is stored in nbC and finally it must collect the rth local buckets from all processes, and stores this in the array AbC[0..nbC-1]. This array is then locally sorted, completing the bucket sort.

The program also checks the sorted array for correctness (does its global length and checksum match the original, are there any elements out-of-order within or between the collected buckets?).

  1. Compile and run the program interactively for small n and p. You will find that the function checkResult() reports errors! This is because the communications to collect values in bCsizeR and AbC are omitted: it is your task to rectify this! This can be done in the following stages:
  2. Test the code interactively. To get a feeling for how `balanced' the bucket sort is, it might be helpful to sort its output by process rank, e.g. mpirun -np 16 bucketSort 32 | sort -n. With largish n, do you see any speedup for 1≤p≤8 when run interactively?
  3. Run the batch file, possibly adjusting the values of n beforehand (suggest you comment out the printf("%d: bucket ...", ...) statement first). What kind of speedups do you observe? Is an all-to-all communication on an n lg(n) algorithm a show-stopper? And is comparing p=1 with p>1 valid for speedup in this case anyway? (and why?).
  4. Finally, in bucketSort.c, look at the implementation of the code to count the number elements out-of-order between the collected buckets (at the bottom of the function checkResult()). It can miss errors when the size of the collected bucket in a process is 0. To see this, uncomment the printf("%d: bucket...", ...) statement and run with n≤p, e.g. mpirun -np 16 bucketSort 8 | sort -n.
    Improve the code so that the checks occur only between processes having non-empty collected buckets.
    Hint: each process needs to know the collected bucket sizes in all other processes.