CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science
Printer Friendly Version of this Document

UniSAFE

COMP2300 - Assignment 3

COMP2300

Semester 1, 2009: Assignment 3

Deadline: noon Friday 29 May


Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities to the course lecturer)

This assignment is worth 8% of your total course mark. It will be marked out of 32 as indicated below.


The estimated time we expect you to spend on this assignment is around 8 hours in total (1 hour per 4 assignment marks).

Submission

This can be done using the submit comp2300 ass3 ... command (or the submit comp6300 ass2 ... command, if you have a COMP6300 enrollment).

The file names which you should use (and the only ones that will be accepted by the submit command) are:

    memhier.txt
    shuffle2.c
The submitted files must include, at the beginning, a completed preamble. The proforma for this can be found in the file:

      /dept/dcs/comp2300/public/standard_preamble.txt

You have to copy the preamble from this file into the top of each file you intend to submit, and then complete your details. By submitting a file containing a completed preamble, you are making a declaration about the work you have submitted (see the text of the declaration in the preamble files. If you need to modify the text, e.g. you did in fact receive substantial assistance from another student, you should modify the text to say so).

Any file submitted without a completed preamble will be rejected without being marked.

Note you can submit any file multiple times, including after the deadline has expired. Note that if you submit any file after the deadline your entire submission will be considered as late and will attract a late penalty.

Extensions

See http://cs.anu.edu.au/student/comp2300/assessment.php.

Late Penalties

Late penalties are as follows
How LatePenalty from 32 Marks
<=1 hour-1
<=6 hour-2
<=24 hours-4
<=48 hours-8
<=72 hours-16
> 72 hours-32 (forget it!)

Late penalties apply even if you submit just one file after the deadline.

Plagiarism

See http://cs.anu.edu.au/student/comp2300/assessment.php.

Experiments on the Memory Hierarchy

In this assignment, your report containing answers to the questions below go into a file called memhier.txt (make sure you include, at the beginning, a completed preamble based on the standard_preamble.txt file).

Note that the text (especially Bryant) may be useful for some of the more questions from Section 1.

The machine wallaman.anu.edu.au is an 1.2 GHz UltraSPARC T2 with approximately 8 GB of physical memory and a 4 MB L2 cache (16-way set associative, 64 byte lines). It can be assumed that it has a 48-bit virtual address space with 8 KB pages (this is its normal mode of operation). Note that the T2 is a 64-bit machine.

It has a 8 KB level-1 cache (4-way set associative, 16 byte line size). This is for data only - instructions have a separate cache. It has a 128-entry TLB (for data).

The shuffle program developed in this course is a potentially memory-intensive program, which while generating relatively simple access patterns, which can challenge the memory hierarchy of the computer running it. Hence, with some extension, it is suitable for the purpose of performing memory hierarchy experiments.

Section 1 [10 marks]

Answer the following questions, showing your working. Express your answers for large numbers using powers to 2 and/or KB/MB/GB where appropriate, rather than using long sequences of decimal digits. Show your working.
  1. How many pages will fit into the main memory? [1 mark]
  2. How many pages exist in the virtual address space? If a page table was required to have all possible pages, how large would it be? (assume each entry is 16 bytes). [2 marks]
  3. Would it be practical to implement such a page table assuming a single process is running at any one time? Would this still be the case for large-scale multiprocessing (say 100 processes in memory at any time)? In either case, suggest how page table memory overheads could be reduced. [2 marks; answer in no more than 3 sentences]
  4. In terms of pages, what is the largest working set of a user program that will fit into the level-1 cache, the level-2 cache and main memory (assume that only half the physical memory is available to a user program). What, in terms of pages and bytes, is the largest working set that will fit into the TLB? [2 marks]
  5. For an array of integers (4 bytes) of length N, what values of N do the above working set sizes correspond to? [1 mark]
  6. Assuming that no data was in the cache initially, calculate the best possible cache miss rate for the level-1 cache for an algorithm that access each item of data exactly once, and in units of ints. The cache hit rate is defined as the percentage of data accesses where the data is in cache. Hint: consider the effect of the cache line size for the case when all data accesses occur with consecutively increasing addresses. [2 marks]

Section 2 [22 marks + bonuses!]

Copy the files in /dept/dcs/comp2300/public/ass3 into a suitably named sub-directory in your own account area. The program shuffletst.c represents the final evolution of our shuffle program (note the extra error checking and defensive programming code). It extends the shuffle2.c program of Assignment 1 in the following ways:
  • As much of the shuffle code as possible is is encapsulated in the library shuffle.c. There are a number variants of algorithms used; these can be selected using the -v v on the command line of shuffletst. Also two in-place algorithms (courtesy of our two Lords of the In-place Shuffle) are included; these can be selected by specifying the -i option. This file #includes a second file shuffle2.c, which contains functions where you can write an optimized algorithms.
  • shuffletst executes the specified shuffle a number of times (-r nTimings option), and times each shuffle. It calculates the number of computer clock cycles consumed per element of the shuffled array, based on the minimum timing (you should consider why it is necessary to take a number of timings, and why the minimum should be regarded as the most reliable. Hint: one reason is related to multiprocessing effects, the other to the memory hierarchy). On most modern computers, the minimum possible number of cycles per element is two (one cycle to load the element, one to store it). On the T2, 4 cycles are required per load / store, due to the fact that a single thread of execution (the case for a normal program) gets access to the load/store unit of the CPU only on every second cycle. It is not necessarily easy to get close to these lower limits though!
  • It can also perform shuffles for a range of values of N, i.e. for N = n, n+dN, n+2*dN, ... n+(nN-1)*dN. Here nN and dN are parameters which may appear optionally after n k. n, k and dN may also be entered in exponential form x^y (e.g. 2^20, no spaces on either side of the ^), noting that power-of-two values can create pathological situations in the memory hierarchy.
  • The shuffle is properly checked, and the program will abort if there are any errors. The -p option can still be used for debugging; note that the value for nTimings defaults to 1 in this case.
  • The program can also print statistics of various hardware events of possible interest over the cache simulation, e.g. L1 cache misses (-e option). This is only available when compiled and run on hosts which support such a functionality (i.e. wallaman, partch). Note both this and using accurate timing functions are system-dependent; the use of C macros passed down from the Makefile, in conjunction with #ifdef directives, makes the code portable.
While this might appear complicated, the commands required to drive the program will mostly be given to you. The program will print out what options have been selected, and you should check this to ensure the results you are getting correspond to what was intended.

Browse the file shuffletst.c in order to understand its overall operation and how to use it. Study the file shuffle.c and inspect the various shuffle algorithms presented; which looks like it will be the fastest?

ssh to wallaman, cd to the directory containing your copy of shuffletst.c and compile it using the command gmake. Execute the command ./shuffletst to get it to print out a detailed usage message.

Answer the following questions. You may find adding the -e option to the commands below helpful; it prints out statistics of hardware events such as cache misses (but it should be noted that these might not be accurate in all situations).

Note that the timings of the shuffles may be rather unstable, due to other process activity on wallaman. You should repeat any command 2 or 3 times (more if needed) in quick succession to see if the timings vary; if so, use the command reporting the fastest timings.

  1. Run the program via the command shuffletst -v v 1000 1 and shuffletst -v v 1000 1000, for v = 2, 3. Cut-and-paste each command line including your prompt (which will verify you ran it from wallaman!) and the output in memhier.txt. By inspecting the corresponding loops in shuffle.c, explain the similarities and differences in the results. [2 marks]

  2. Now run the program via the command shuffletst -v v 1000 1 and shuffletst -v v 1024 2, for v = 0, 2. Cut-and-paste each command and its output as before. Inspect the corresponding code in shuffle.c, and its generated assembly code in shuffle.s (hint: search for shuffleSrcCtg, then locate code for the innermost loop). How many multiply and divide instructions are executed per iteration? Assuming this to be the cause of the essential difference in speed between the two versions, calculate the cost of a a multiply-divide instruction pair in cycles for n = 1000 and n = 1024. [3 marks]

  3. Explain why such differences in operand values can affect instruction speed. Hint: the T2 is a post-RISC processor; it does not necessarily have purely RISC properties. [1 mark - COMP6300 only]

  4. Run the program via the command shuffletst -i -v v -e 1000 2 for v = 0, 1. Cut-and-paste each command/output as before. This invokes the in-place algorithms. Briefly explain why the swap-based algorithm algorithm is more efficient. Hint: compare the total number of instructions executed in each case. By comparing these with your previous results, what is the main factor affecting the performance of these algorithms? [2 marks]

  5. It is clear that the computation time of the non-in place algorithms is proportional to n (ignoring memory hierarchy effects). Is this the case for the in-place algorithms? If not, what function of n would the computation time be? Justify your answer. Hint:: you may derive your answer though analysis of the algorithms, or you could try modifying the code to count the number of iterations of the loops for various n (choose powers of 2) with k = 2. [2 marks - COMP6300 only. 2 bonus marks for COMP2300]

  6. In shuffle2.c, implement the ipShuffleOptP2() function to use faster arithmetic for the case where n (and hence k) is a power of 2. That is, replace integer multiplications, divisions and modulus (*, /, %) with bit-shifting operations (<<, >>, &), for this case. The function should also work for non-power-of-two cases, but need not be optimized for these. You may your code on either of the two given in-place algorithms. Be liberal with your use of parentheses, as the bitwise operators have low precedence.

    Test that your modifications have preserved the correctness of the shuffle (use commands such as shuffletst -i -v 2 -p 16 16 2 for debugging).

    Cut-and-paste the command/output shuffletst -i -v 2 1024 2 and either shuffletst -i -v 0 1024 2 or shuffletst -i -v 1 1024 2 (depending on which variant you based your code on) and state the improvement of performance. [7 marks - COMP2300 ; 4 marks - COMP6300]

  7. Execute the command shuffletst -v v k^2 k for v = 2, 3 and k = 64, 65. Cut-and-paste the command/output as before. What level of the memory hierarchy is responsible for the drop in performance in one case, and give a conjecture for why it only occurs for k = 64. State the reason why using n = k^2 is suitable for comparing results between the two versions. [3 marks]

    Suggest a reason for why it occurs for only one of the two versions. [1 bonus mark]

  8. Execute the command shuffletst -r 8 -v v 2^x 1024 for v = 2, 3 and x = 17, 18, 19, 20. State the trend in the results, when combined those of the previous question (where k = 64). Noting that these algorithms use two arrays of length n, and your answers for part 1 Q5, explain what parts of the memory hierarchy are impacting performance (and what parts are not). [3 marks]

  9. In shuffle.h, the type sxint is defined as the type used for shuffle indexing computations. Why might it be safer for this type to be unsigned? If we wanted to use much larger n and k than we have used above, how should we change this defniition, and what else would need to be changed? [2 marks]

  10. (for bonus marks) Devise an experiment of your own that reveals something interesting about the memory hierarchy, and try to explain the results. It can possibly be on another host, e.g. partch. To get details of the memory hierarchy, install x86info on an x86-based machine. Note that you have to use make instead of gmake on Linux hosts. [up to 4 bonus marks]

  11. (for bonus marks) Optimize either the in-place or ordinary shuffle so that it is significantly faster at least for some parameters than the respective algorithms so far. shuffle2.c has stubs where you can write this code (ipShuffleLotQS() or shuffleLotQS()). Include in your report the performance gains and on what range of parameters it is designed for. Note that gcc inline assembler is allowed.

    Note that these functions are invoked with -i -v 3 or -v 4 (no -i) options. [up to 4 bonus marks]


Marking scheme

You are expected to give clear and concise answers, but must also include sufficient detail to explain how you obtained your answer. It is required that you cut-and-pasted the results in your report, but doing this without an explanation will gain little or no marks.


Last modified: 13/05/2009, 18:58

Copyright | Disclaimer | Privacy | Contact ANU