COMP2310/COMP6310:
Concurrent and Distributed Systems
Semester 2 2012

COMP2310 Assignment 2: Concurrent and Distributed QuickSort

Deadline: 17:00 on Friday 26 October


(Please report any errors, omissions and ambiguities to the course lecturer)
(Amendments to this document after its release will be marked in blue)

This assignment is worth 18% of your total course mark. It will be marked out of 36.

Submission

This assignment must be submitted electronically. This can be done using the following commands (according to your enrollment):

Extensions

See the course Assessment page.
How LatePenalty from 36 Marks
less than 1 hour-2
between 1 and 6 hours-4
between 6 and 24 hours (1 day)-8
between 24 and 48 hours (2 days)-16
between 48 and 72 hours (3 days)-32
more than 72 hours (4 days)-64 (forget it!)

Objectives

Background

Many algorithms requiring substantial computational resources have a tree-like structure (i.e. use a divide-and -conquer strategy). These can normally adapted to a distributed computation, and especially the case if the tree can be kept balanced (i.e. use p=2k processors by assigning p/2 processes to the left sub-tree and p/2 to the right sub-tree). quicksort is an example of this.

In this assignment you will implement various forms of a distributed quicksort using various IPC and shared memory features of Unix and the C programming language.

The figure below depicts the structure of a distributed quicksort with p = 8 processes. The tree indicates the process structure, beginning with process 0 at the top-level (d = 0). The process partitions the array into (hopefully balanced) left and right segments. This means that a call to the partition() function has performed the appropriate swapping of array elements, so that all elements in the left partition are ≤ all elements in the right partition, in which case the sorting can proceed on both the left and right partitions ndependently. Then, instead of spawning 2 child processes, it only spawns 1 for the right sub-tree, and then it processes the left sub-tree itself. This continues recursively for k = log2(p) levels, at which point p processes have been created. These processes sort their respective segment of the array. Then, the tree structure is traversed upwards, with the respective process at each node then copying the array segment from its right child into its own segment. In this step the right child may have to explicitly communicate its array segment; then that process is free to terminate. As the tree traversal returns to level d = 0, process 0 will have gathered the whole sorted array. The following diagram illustrates the process with N=16 and p=8.


Each process must initially be given the left land right r indices for its initial array segment. Let's say it is at level d. As it descends a level, the partition step splits the interval [l,r] into [l,m] and [m+1,r] (note that the split in general will not be exactly even, as shown above). Thus at level d+1, the left index remains constant for that process, but the right changes. The right indices can be recorded in an array structure, r[0..k-1]. When it later re-ascends to level d, it can determine the starting index of the segment for the child on the right using r[d+1]+1.

Setup

The file ass2.zip (modified 16/10) contains template files for the files that you will submit. It also contains a library of provided functions in quicklib.h and quicklib.c. It also contains a Makefile which should be used to compile the programs (note the use of the -Wall and -lmcheck). Copy these files into your local directory area before you begin work on your programs. You can compile the system with the simple command make. The synopsis of the test program program is: where -p selects the sort using pipes (the default), -s selects the sort using sockets and -t selects the sort using (Posix) threads. n is the array length. If -d is selected, printing of the array is turned on - for debugging purposes. p is the number of processes or threads to be used in the sort (default value is 1); you may assume that p is a power of 2. s is a random number seed (default value 0) used for generating array values. The use of the -v v option will be explained below; it is only needed when -t is specified.

Requirements

  • Your code will be tested on a 4-core Linux machine like those in the CS student labs, so make sure it works on these machines!
  • Your programs should be written with good programming style (and so should not produce any warnings when compiled with the -Wall option!). I.e. it should be well commented, so as to enable the reader (this includes the tutor/marker!) to understand how it works. Identifiers should be meaningful and (unless their role is trivial) commented where declared, any constants should be named rather than hard-coded, and defensive programming should be used. The latter includes checks being made for system calls that can fail due to resource limitations; once detected, a message should be generated with a call to the system error function perror() and the process should exit with a status of 2. It also includes using assert() to perform other kinds of checks, and the freeing of any malloced data before (normal) exit (this permits the mcheck library to detect any buffer overwrites). All file descriptors should be closed before the process exits.
  • You must not change the interfaces to the sort functions in the template distquicklib.c; while you are free to modify if needed the given quicksort.c for debugging purposes, your sort functions must work from the standard version of quicksort.c.
  • Your program must compile on the CS student system with the standard (un-modified) makefile that is provided.
  • Your sort functions should be robust for array sizes up to (say) 10 million and 32 processes / threads.
  • You are required to submit a file called report.txt. This must be a plain ASCII text file.
  • Both C code and report must be formatted to a standard width of 80 characters.
  • All submitted files must have a preamble making a declaration about the work you have submitted. The provided template files come with such a preamble. You must add your name and student number where indicated. If you need to modify the declaration, e.g. you did in fact receive substantial assistance from another person who is not course staff, you should modify the text to say so. Code obtained from the internet will receive no marks even if acknowledged.

    Quick Sort using Forks and Pipes (worth 40% of the assignment mark)

    In the function quickPipe() in distquicklib.c, implement a distributed quicksort using forks to spawn (child) processes and pipes to communicate the (sorted) array segments. The distributed quicksort is then performed, which involves the (direct or indirect) forking of a further p-1 processes. The array segments should be transferred between processes using Unix pipes. Note that the semantics of fork() can be utilized to reduce the amount of explicit data transfer. The provided external function m = partition(int *A, int n) can be for the partition step, and quickSort(int A[], int n) can be used to sort a segment of the array (where A points to the start of the segment and n is its length).

    For full marks, no process should contain more than k copies of pipe descriptor pairs, and processes should close any pipe as soon as no longer needed. Child processes should exit when their participation in the sort is finished, and their parent should wait on them to do this.

    Debugging: the use of printf("%d:...\n", i, ...) statements, where i is the process or thread id, at strategic points, is recommended. printArray() can be used to print out array segments. Remember, you may need to fflush(stdout) if the process or thread is soon to exit. Noting that your code should not produce any output directly; you should remove these and any other print statements before submitting. Running the program under strace -f may also be helpful for debugging.

    Quick Sort using Sockets (worth 15% of the assignment mark)

    In the function quickSocket(), create a variant of the above function which uses sockets instead of pipes to transfer data, but is otherwise identical.

    Shared Memory Quick Sort using Posix Threads (worth 30% of the assignment mark)

    In quickThread(), create a variant of the above function which creates Posix threads instead of processes, and uses shared memory to communicate data. You must use the thread structure as indicated in the diagram above for your solution to be considered valid (although you may use a different scheme to number the threads if you wish). Note that the full solution requires various implementations of waiting on child threads (see below).

    Hints: it is recommended that you include the logical thread id (as in the above diagram) in the parameters of the function that the threads execute. It will be easier if you declare this function with the exact type that pthread_create() requires (returns void * with a single void * parameter). You will probably need to pass more than 1 integer value to the function; in which case, copy them into a small array, and use the pointer to the array as the parameter. The use of shared memory may call for a substantially different algorithm structure! And remember, the easiest ways to the threads to access data they may need to share (e.g. the array length and the pointer to array itself) is through global variables (which should be declared static).

    Issue: by default, quicksort is compiled with -lmcheck (linking in the memory checking library). This may detect an error that is possibly not caused by your code (`memory clobbered before allocated block' towards the end of the sort). If you think that this is the case, modify the Makefile accordingly (it contains the instructions) use quicksort-nomc, the version of the test program compiled without -l mcheck, instead.

    Write-up for this Assignment (worth 15% of the assignment mark)

    Try compiling and running your programs with large n on the CS student system. The CS student workstation are 4 core machines, so a parallelism of p=4 is possible (partch has however only 2, and is unsuitable for the experiments).

    In the file report.txt, answer the following questions.

    1. You should have noticed quickSocket() develop problems for large n, at least for a naive use of sockets. Determine the message size associated with this problem and record the value of n which bis (just) big enough to expose this. Given an explanation of what you think is the cause of this problem, fix it in your code, and describe the fix in the report.

    2. Time each sort with p=4 and n=8000000 (or thereabouts). Compare these times with those for p=1. For p=4 take say 5 trials and include the best one. Cut-and-paste your timings into report.txt (include the prompts and the commands used).

      Which version is fastest (for p=4)? Explain whether this what you expect, given your knowledge of the underlying mechanisms used. The theoretical maximum speedup for the p=4 timings over the p=1 timing is 4. Are you observing close to this speedup? Explain why this might be the case (Hint:try doubling (or more) n; does the speedup improve significantly?)

    3. One possible idea for improving the performance of quickThread() is that if the parent does not have to wait for the child to exit (call pthread_join()), the computation can be sped up. But the parent still has to wait for the child to finish the sort! Build into quickThread() the following variant code paths:
      • wait_on_join: use pthread_join() to tell the parent that the child has finished (you have probably implemented this already). This is the default; it can is explicitly activated with the command line option -v 0 .
      • wait_on_mutexlock: use (an array of) Pthread mutexes to tell the parent that the child has finished (i.e. the child unlocks a lock when finished, which the parent must acquire before it exits). This is activated when the option -v 1 is specified.
      • wait_on_memlock: like the above but using a normal array (i.e. a value of 1 signifies the child is yet to finish; a value of 0 signifies it has finished and the parent may finish). This is activated when when the option -v 2 is specified.
        Hint: the array (pointer) should be a global variable declared as something like static volatile char *p_memlocks;. In quickThread(), use a call to malloc() to initialize the pointer. You should give careful thought as to when the array elements should be initialized (and similarly for wait_on_mutexlock).

    4. Briefly explain why the simpler wait_on_memlock scheme can be used safely in this situation. Also explain what the C keyword volatile means and why it is needed (try running the code without it!).

    5. Time the three variants of quickThreads as per part 2, (except use a smaller n=8000) and record your results similarly. Which version is fastest, and did the idea work?

    6. Time the three variants of quickThreads() as per part 5, except now use p=32, i.e. there are now many more threads than CPUs. Compare your results to the above, and note and try to explain any significant differences in the relative speeds.