COMP2300

Semester 1, 2007: Assignment 3

Deadline: 12 noon on Wednesday 30 May 2007


(second draft; subject hopefully only to small fixes)
(24/05: minor fixes to commands in Part 2 Q2, Q3 and Q8)
(28/05: clarification to Part 1 Q6)

Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities to the course lecturer)


This assignment is worth 8% of your total course mark. It will be marked out of 32 as indicated below.


The estimated time we expect you to spend on this assignment is around 8 hours in total (1 hour per 4 assignment marks).

Submission

This assignment must be submitted electronically. This can be done using the submit comp2300 ass3 memhier.txt transalgs_a.c command.

The files memhier.txt and transalgs_a.c submitted must include, at the beginning, a completed preamble. The proforma for this can be found in the file:

      /dept/dcs/comp2300/public/standard_preamble.txt

You have to copy the preamble from this file into the top of each file you intend to submit, and then complete your details. By submitting a file containing a completed preamble, you are making a declaration about the work you have submitted (see the text of the declaration in the preamble files. If you need to modify the text, e.g. you did in fact receive substantial assistance from another student, you should modify the text to say so).

Any file submitted without a completed preamble will be rejected without being marked.

Extensions

See http://cs.anu.edu.au/student/comp2300/assessment.html.

Late Penalties

Late penalties are as follows
How LatePenalty from 32 Marks
<=1 hour-1
<=6 hour-2
<=24 hours-4
<=48 hours-8
<=72 hours-16
B hours-32 (forget it!)

Late penalties apply even if you submit just one file after the deadline.

Plagiarism

See http://cs.anu.edu.au/student/comp2300/assessment.html.

Experiments on the Memory Hierarchy

In this assignment, your report containing answers to the questions below go into a file called memhier.txt (make sure you include, at the beginning, a completed preamble based on the standard_preamble.txt file). In Section 2, will also be asked to write a small amount of C code in a file called transalgs_a.c.

Note that the text (especially Bryant) may be useful for some of the more questions from Section 1.

The machine karajan.anu.edu.au is a 2.13 GHz Intel (R) Core2 processor (brand new!) with approximately 2 GB of physical memory and a 2 MB L2 cache (you can verify this by ssh karajan followed by the command x86info, courtesy of a project by Dave Jones. The command cat /proc/cpuinfo /proc/meminfo will give some of this information). It can be assumed that it has a 32-bit virtual address space with 4 KB pages (this is its normal mode of operation).

It has a 32 KB level-1 cache (8-way set associative), with a block (line) size of 64 bytes. This is for data only - instructions have a separate cache. The L2 cache has the same line size and associativity. It has a 256-entry TLB (for data).

Matrix transpose is a simple computation which nonetheless challenges the memory hierarchy, and so is suitable for the purpose of experiments on it. In fact, this part is a kind of a follow-on of a research paper based on experiments on the UltraSPARC I and III.

Section 1 [10 marks]

Answer the following questions, showing your working. Express your answers for large numbers using powers to 2 and/or KB/MB/GB where appropriate, rather than using long sequences of decimal digits. Show your working.
  1. How many pages will fit into the main memory? [1 mark]
  2. How many pages exist in the virtual address space? If a page table was required to have all possible pages, how large would it be? (assume each entry is 16 bytes). [2 marks]
  3. Would it be practical to implement such a page table? If so, how could it be improved? If not, how could it be made more practical? [2 marks; answer in no more than 3 sentences]
  4. In terms of pages, what is the largest working set of a user program that will fit into the level-1 cache, the level-2 cache and main memory (assume that only half the physical memory is available to a user program). What, in terms of pages and bytes, is the largest working set that will fit into the TLB? [2 marks]
  5. For an n by n matrix of integers (4 bytes), what values ofn do the above working set sizes correspond to? [1 mark]
  6. Assuming that no data was in the cache initially, calculate the best (lowest) possible cache miss rate for the level-1 cache for an algorithm accessing a set of data in units of ints, based on the consideration of block size. Assume that the algorithm accesses each unit of the set exactly once. [2 marks]

Section 2 [22 marks + bonuses!]

Copy the files in /dept/dcs/comp2300/public/ass3 into a suitably named sub-directory in your own account area. The program tranpose3.c transposes an n by n matrix A.It extends the transpose2.c program of Assignment 2 in the following ways: Browse the file transpose3.c in order to understand its overall operation and how to use it.

ssh to karajan, cd to the directory containing your copy of transpose3.c and compile it using the command make. Execute the command ./transpose3 to get it to print out a usage message.

Answer the following questions. You may find adding the -e option to the commands below helpful; it prints out counts of hardware events such as cache misses.

  1. Study the code in main() in transpose3.c in the for (r...) loop (assume getHWevents = flushCache = 0). In terms of data, what would you identify as making up the working set for this loop? [1 mark]

  2. Run the program via the commands ./transpose3 -a 0 32 136 8 and ./transpose3 -a 1 32 136 8. (the -a 1 option selects a naive matrix transposition algorithm; the -a 0 option selects an algorithm which reverses elements in rows. This is not a transposition at all but can be used as a baseline as it contains the same number of swaps).

    Cut-and-paste the command line including your prompt (which will verify you ran it from karajan!) and the output in memhier.txt.

    Explain why the speed increases (for at least a while) after n=32. At what point does the performance begin to decrease? What part of the memory hierarchy is likely to start to thrash at this point? Is this exactly where you would expect from Sect 1 Q5? What could explain the anomaly at n=128 for the transpose? [4 marks]

  3. Run the program again via the command ./transpose3 -a 3 32 136 8. This performs the tranpose via 2 by 2 blocks, as per Assignment 2. Cut-and-paste the prompt / command / output as above.

    Inspect transalgs.s and locate the innermost loop for the baseline, trans1x1 and trans2x2 functions. Count the number of memory references in each body (these are signified by (...)'s in the instruction operands, not including those of the leal instruction). Write this information in your report, including the address label for the head of the innermost loop.

    How well does the relative execution speed (take the fastest over the range of n) correlate with the number of memory references (per matrix element)? Is this factor alone sufficient to explain the relative performance? [3 marks]

  4. You may have noticed that the innermost loop of the trans2x2 function contains at least one imull (integer multiplication) instruction. A general optimization technique is to avoid these instructions, as these tend to be slower than other integer arithmetic instructions (e.g. see Lecture M5, p32). In the case of an expression i*n+j inside a loop, where i increases by 1 each iteration, this can be achieved for example by replacing i*n with a local variable in which is incremented by n whenever i is incremented by 1 (this kind of technique is known as strength reduction and forms an important class of program optimizations).

    Implement in transalgs_a.c versions of the 1x1 and 2x2 transpose algorithms that avoid explicit integer multiply using this technique (they can be accessed by the -a 2 and -a 4 options). After testing your code, inspect transalgs_a.s to verify no imull instructions appear in the innermost loops.

    Cut-and-paste the prompt/command/output of ./transpose3 -a 2 32 136 8 and ./transpose3 -a 4 32 136 8 into your report. State whether the results suggest this optimization was worthwhile. If the results are actually worse, suggest a reason for why this might be. [6 marks]

  5. The algorithm blockedTrans() in transalgs.c is a transpose algorithm based on swapping b by b blocks. For the case of b=4, it calls R_Block4Trans() (this routine was designed for the UltraSPARC III, although for floating point rather than integer data). Inspect the code for this. On an UltraSPARC III, this tends to be significantly faster than trans2x2(). Cut-and-paste the prompt/command/output of ./transpose3 -a 5 32 136 8 (which causes R_Block4Trans() to be used) into your report.

    Explain why the relative performance of these two algorithms is different on the x86 and UltraSPARC architectures. [2 marks]

  6. Looking more closely of the impact of the level-1 cache on performance, cut-and-paste the prompt/command/output of ./transpose3 -a 3 48 336 16 into your report. What pattern do you see emerging? Disregarding the anomolous (slow) results over the range, to what extent would you say the actual capacity of the level-1 cache is having an impact of performance. What could be a possible explanation for the pattern of anomolous (slow) results? Hint: does the baseline algorithm (-a 0) exhibit such a pattern? [3 marks]

  7. Cut-and-paste the prompt/command/output of ./transpose3 -a 3 128 2048 128 and ./transpose3 -a 5 -b 8 128 2048 128 into your report. Looking first at the overall trends of the -a 3 results, what level(s) of the memory hierarchy could be taking effect? What pattern (of anomalies) do you see emerge for larger n? Why is this pattern softened for the -a 5 results, and why are these results generally faster now for large n? [3 marks]

  8. (for bonus marks) Cut-and-paste the prompt/command/output of ./transpose3 -a 6 32 136 8 and ./transpose3 -a 6 128 2048 128 into your report. This uses a divide-and-conquer strategy and is an example of a class of algorithms called cache-oblivious algorithms: so called because they are claimed to minimize misses at all levels of the memory hierarchy without needing to knowing about the capacity at any level.

    Does this algorithm appears to live up to its name? Briefly explain. Also explain what you think is the main factor in its relative performance, say to the -a 3 case. [up to 2 bonus marks]

  9. (for bonus marks) Devise an experiment of your own that reveals something interesting about the memory hierarchy, and try to explain the results. It can possibly be on another host, e.g. partch which comes from an older line of the x86 architecture (and with an aggressively high clock rate to try to make up for this!), or possibly on iwaki an UltraSPARC III Cu.

    To get details of the memory hierarchy, type x86info on partch or fpuversion on iwaki. Note that you need to do make clean; make to re-compile your code on partch (and gmake clean; gmake on iwaki - the version of make we use on Linux is GNU Make; it is called gmake on iwaki as make is reserved for the Solaris version). [up to 4 bonus marks]

  10. (for bonus marks - it is not yet known whether this can be done!)

    In the function transWildcard(), implement a version of a transpose algorithm that aims to be significantly faster than the best of the algorithms we have seen so far (-a 3 for small n, -a 5 -b 8 for large n). You may use different gcc compiler options in versions of existing codes as well - if so state what options you require clearly in your report. Test your code using the -a 7 option. Include in your report relevant performance results and an explanation of them. [up to 4 bonus marks]


Marking scheme

You are expected to give clear and concise answers, but must also include sufficient detail to explain how you obtained your answer.



Last modified: Thu May 24 10:29:36 EST 2007