Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP2310
COMP2310 Assignment 2

COMP2310 Assignment 2: Distributed Merge Sort

This assignment is worth 20% of the total course mark.

Deadline: 23:57 Friday 19 October 2007

Late Penalty:
<1 hour -0.5 marks/20
<6 hour -1.0 marks/20
<12 hour -1.5 marks/20
<24 hours -2.0 marks/20
<48 hours -5.0 marks/20
<72 hours -10.0 marks/20
>72 hours -20.0 marks/20, i.e. forget it!


Submission Details

You are required to submit the files mergepipe.c, mergesocket.c, mergethread.c, and write-up.txt for the assignment name Ass2. The following command can be used to perform this:
    submit comp2310 Ass2 mergepipe.c mergesocket.c mergethread.c write-up.txt

Some notes:

  • Your code will be tested on partch and iwaki, so make sure it works on these machines! BUT, as these are shared resources, test and debug your code first on a normal workstation!
  • Your programs should be written with good programming style (and so should not produce any warnings when compiled with the -Wall option!). I.e. it should be well commented, so as to enable the reader (this includes the tutor/marker!) to understand how it works. Identifiers should be meaningful and (unless their role is trivial) commented where declared, any constants should be named rather than hard-coded, and defensive programming should be used. The latter includes checks being made for system calls that can fail due to resource limitations; once detected, a message should be generated with a call to the system error function perror() and the process should exit with a status of 2. It also includes using assert() to perform other kinds of checks, and the freeing of any malloced data before (normal) exit (this permits the mcheck library to detect any buffer overwrite).
  • Your programs should be robust for array sizes up to (say) 10 million and 16 processes / threads.
  • You are required to submit a file called write-up.txt. This must be a plain ASCII text file.
  • All submittable files must have a preamble making a declaration about the work you have submitted. The provided template files come with such a preamble. You must add your name and student number where indicated. If you need to modify the declaration, e.g. you did in fact receive substantial assistance from another student, you should modify the text to say so.

Introduction

Many algorithms requiring substantial computational resources have a tree-like structure (i.e. use a divide-and -conquer strategy). These can normally adapted to a distributed computation, and especially the case if the tree can be kept balanced (i.e. use p=2k processors by assigning p/2 processes to the left sub-tree and p/2 to the right sub-tree). Merge sort is an example of this.

In this assignment you will implement various forms of a distributed merge sort using various IPC and shared memory features of Unix and the C programming language.

The figure below depicts the structure of a distributed merge sort with p = 8 processes. The tree indicates the process structure, beginning with process 0 at the top-level (d = 0). However, instead of spawning 2 child processes, it only spawns 1 for the right sub-tree, and then it processes the left sub-tree itself. This continues recursively for k = log2(p) levels, at which point p processes have been created. These processes sort their respective segment of the array. Then, the tree structure is traversed upwards, with the respective process at each node then merging the array segment from its right child into its own segment from the next level down. In this step the right child may have to explicitly communicate its array segment; then that process is free to terminate. As the tree traversal return to level d = 0, process 0 performs the final merge and the array is sorted.

It is assumed that p is a power of 2, with p>=1 and n>=0. The numbering of the process id's shown in the figure is only one of several possibilities; however, it can be seen that the binary representation of process id i can be used to determine what each process should do at each step. For example, at level d (0 <= d < k), if the dth bit (where the 0th bit is the least significant bit) of i is one, process i is currently the right child in the sub-tree (this can be encoded in C as ((i & (1 << d)) != 0), for those that missed COMP2300, or have otherwise forgotten:). Also, by reversing the order of the bits in i, the offset of the array segment at level d = k can be determined.

The directory /dept/dcs/comp2310/public/ass2 contains template files for the files that you will submit. It also contains a library of provided functions in mergelib.h and mergelib.c. It also contains a Makefile which should be used to compile the programs (note the use of the -Wall and -lmcheck) on both Linux and Solaris systems. Copy these files into your local directory area before you begin work on your programs.

Merge Sort using Forks and Pipes (worth 40% of the assignment mark)

In mergepipe.c, implement a distributed merge sort using forks to spawn the processes and pipes to communicate the (sorted) array segments. The program has the synopsis:
    mergepipe n [p [s]]
where n is the size of an integer array to be sorted, and p (default value = 1) is the number of processes (the program should check that p is a power of 2 and n is a multiple of p). s is a random number seed (default value = 0). A negative value of n signifies the printing option is specified; |n| is subsequently used for the array size.

The original (parent) process processes the command line parameters (any errors should result in an appropriate error message being given, and the process exiting with a status of 1), allocates via malloc() an array A of size n (and any other buffers it needs), and invokes the provided external function genArray(A, n, s) to initialize the array A. If the printing option was specified, it should call the provided function printArray(A, n). Then startSortTimer() should be called.

The distributed merge sort is then performed, which involves the (direct or indirect) forking of a further p-1 processes. The array segments should be transferred between processes using Unix pipes. Note that the semantics of fork() can be utilized to reduce the amount of explicit data transfer. The provided external function mergeSort(int A[], int tmp[], int n) can be used to sort a segment of A[0:n-1] (using tmp[0:n-1] as temporary storage), and merge(int A[], int L[], int nL, int R[], int nR) can be used to merge L[0:nL-1] and R[0:nR-1] into A[0:nL+nR-1]. The standard C function bcopy() can be used to copy array segments, if needed.

For full marks, no process should contain more than k copies of pipe descriptor pairs, and processes should close any pipe as soon as no longer needed. Child processes should exit when their participation in the sort is finished, and their parent should wait on them to do this.

When the sort is completed, the original (parent) process calls stopSortTimer(), printArray(A, n) if the printing option was specified, and then checkArray(A, n, s), where A[0:n-1] should now hold the sorted array.

Debugging: the use of printf("%d:...\n", i, ...) statements, where i is the process or thread id, at strategic points is recommended. printArray() can be used to print out array segments. Noting that your program(s) should not produce any output directly, you should remove these and any other print statements before submitting. Running your program under strace -f may also be helpful for debugging (called truss -f on iwaki - the output on Solaris is more readable, so it may be preferable to use that).

Merge Sort using Sockets (worth 15% of the assignment mark)

In mergesocket.c, create a variant of the above program which uses sockets instead of pipes to transfer data, but is otherwise identical.

Warning: clean up the common code in mergepipe.c before you copy it!

Shared Memory Merge Sort using Pthreads (worth 30% of the assignment mark)

In mergethread.c, create a variant of the above program which creates Posix threads instead of processes, and uses shared memory to communicate data.

Hint: it is recommended that you pass the logical thread id (as in the above diagram) as the parameter of the function that the threads execute. The use of shared memory may call for a substantially different program structure! And remember, the easiest ways to the threads to access data they may need to share (e.g. the array length and the array itself) is through global variables.

Write-up for this Assignment (worth 15% of the assignment mark)

Try compiling and running your programs with large n on both partch and iwaki. Both of these machines are 4 CPU symmetric multiprocessors (can you recall what `symmetric' refers to?), so parallelism of p=4 is possible. Note that on iwaki, you must use gmake - the version of make we use on Linux is GNU Make; it is called gmake on iwaki as make is reserved for the Solaris version).

In the file write-up.txt, answer the following questions.

  1. You should have noticed mergepipe.c (on iwaki) and mergesocket.c develop problems for large n (up to ten million), at least for a naive use of pipes or sockets. Determine what values of n these are in each case. Given an explanation of what you think is the cause of this problem, and how you have fixed the problem.

  2. Time the programs on both machines with p=4 and n=8000000 (or thereabouts). Compare these times with those for p=1 (all 3 programs should give much the same time here, so you need only try one of them). Cut-and-paste your timings into write-up.txt (include the prompts (which should include the hostname) and the commands used).

    Which version is fastest? Explain whether this what you expect, given your knowledge of the underlying mechanisms used. The theoretical maximum speedup for the p=4 timings over the p=1 timing is 4. Are you observing close to this speedup? Explain why this might be the case (Hint:try doubling (or more) n; does the speedup improve significantly?)

  3. One possible idea for improving the performance of mergethread.c is that if the parent does not have to wait for the child to exit (call pthread_join()), the computation can be sped up. But the parent still has to wait for the child to finish the merge! Build into mergethread.c the following variant code paths:
    • wait_on_join: use pthread_join() to tell the parent that the child has finished (you have probably implemented this already). This is activated when the command line parameter s = 0.
    • wait_on_mutexlock: use (an array of) Pthread mutexes to tell the parent that the child has finished (i.e. the child unlocks a lock when finished, which the parent must acquire before it proceeds with the merge). This is activated when s > 0.
    • wait_on_memlock: like the above but using a normal program array (i.e. a value of 1 signifies the child is yet to finish; a value of 0 signifies it has finished and the parent may proceed). This is activated when s < 0.
      Hint: the array (pointer) should be a global variable declared as something like static volatile char *p_memlocks;. In main(), use a call to malloc() to initialize the pointer. You should give careful thought as to when the array elements should be initialized (and similarly for wait_on_mutexlock).
    Aside: In the above, we are now using the random number seed s for a dual (and unrelated) purpose. It should be noted this is not good programming practice, but in this case, we are doing so in order to concentrate on concurrency aspects (difficult enough!) without further complicating code structure.

  4. Briefly explain why the simpler wait_on_memlock scheme can be used safely in this situation. Also explain what the C keyword volatile means and why it is needed (try running the code without it!).

  5. Time the three variants of mergethreads as per part 2, and record your results similarly. Which version is fastest, and did the idea work?

  6. Time the three variants of mergethreads as per part 2, except now use p=8, i.e. there are now twice as many threads as CPUs, on either iwaki or partch. Compare your results to the above, and note and try to explain any significant differences in the relative speeds.