Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
COMP2300 Introduction to Computer Systems COMP2300 - Assignment 3

COMP2300

Semester 1, 2008: Assignment 3

Deadline: 10:00 am on Tuesday 3 June Friday 30 May 2008


Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities to the course lecturer)

This assignment is worth 8% of your total course mark. It will be marked out of 32 as indicated below.


The estimated time we expect you to spend on this assignment is around 8 hours in total (1 hour per 4 assignment marks).

Submission

This can be done using the submit comp2300 ass3 ... command (or the submit comp6300 ass2 ... command, if you have a COMP6300 enrollment).

The file names which you should use (and the only ones that will be accepted by the submit command) are:

    memhier.txt
    cache2.c (optional, for bonus marks)
The submitted files must include, at the beginning, a completed preamble. The proforma for this can be found in the file:

      /dept/dcs/comp2300/public/standard_preamble.txt

You have to copy the preamble from this file into the top of each file you intend to submit, and then complete your details. By submitting a file containing a completed preamble, you are making a declaration about the work you have submitted (see the text of the declaration in the preamble files. If you need to modify the text, e.g. you did in fact receive substantial assistance from another student, you should modify the text to say so).

Any file submitted without a completed preamble will be rejected without being marked.

Note you can submit any file multiple times, including after the deadline has expired. Note that if you submit any file after the deadline your entire submission will be considered as late and will attract a late penalty.

Extensions

See http://cs.anu.edu.au/student/comp2300/assessment.php.

Late Penalties

Late penalties are as follows
How LatePenalty from 32 Marks
<=1 hour-1
<=6 hour-2
<=24 hours-4
<=48 hours-8
<=72 hours-16
> 72 hours-32 (forget it!)

Late penalties apply even if you submit just one file after the deadline.

Plagiarism

See http://cs.anu.edu.au/student/comp2300/assessment.php.

Experiments on the Memory Hierarchy

In this assignment, your report containing answers to the questions below go into a file called memhier.txt (make sure you include, at the beginning, a completed preamble based on the standard_preamble.txt file).

Note that the text (especially Bryant) may be useful for some of the more questions from Section 1.

The machine partch.anu.edu.au is a 3.06 GHz Intel Pentium 4 processor with approximately 4 GB of physical memory and a 0.5 MB L2 cache (you can verify this by ssh partch followed by the command x86info, courtesy of a project by Dave Jones. The command cat /proc/cpuinfo /proc/meminfo will give some of this information). It can be assumed that it has a 32-bit virtual address space with 4 KB pages (this is its normal mode of operation).

It has a 8 KB level-1 cache (4-way set associative), with a block (line) size of 64 bytes. This is for data only - instructions have a separate cache. The L2 cache has the same line size and 8-way associativity. It has a 64-entry TLB (for data).

The cache simulator program developed in this course is a potentially memory-intensive program, which while generating relatively simple access patterns, which can challenge the memory hierarchy of the computer running it. Hence, suitably extended, it is suitable for the purpose of experiments on it.

Section 1 [10 marks]

Answer the following questions, showing your working. Express your answers for large numbers using powers to 2 and/or KB/MB/GB where appropriate, rather than using long sequences of decimal digits. Show your working.
  1. How many pages will fit into the main memory? [1 mark]
  2. How many pages exist in the virtual address space? If a page table was required to have all possible pages, how large would it be? (assume each entry is 16 bytes). [2 marks]
  3. Would it be practical to implement such a page table? If so, how could it be improved? If not, how could it be made more practical? [2 marks; answer in no more than 3 sentences]
  4. In terms of pages, what is the largest working set of a user program that will fit into the level-1 cache, the level-2 cache and main memory (assume that only half the physical memory is available to a user program). What, in terms of pages and bytes, is the largest working set that will fit into the TLB? [2 marks]
  5. For an array of integers (4 bytes) of length C, what values of C do the above working set sizes correspond to? [1 mark]
  6. Assuming that no data was in the cache initially, calculate the best possible cache miss rate for the level-1 cache for an algorithm accessing data in units of ints, based on the consideration of block size. [2 marks]

Section 2 [22 marks + bonuses!]

Copy the files in /dept/dcs/comp2300/public/ass3 into a suitably named sub-directory in your own account area. The program cachesim3.c extends the cachesim2.c program of Assignment 1 in the following ways:
  • An LRU replacement policy can be selected (-L option). A block (line) size of B can be specified (-b lgB option); however, for the purposes of most experiments, only the case B=1 will be of interest. For set-associative caches, the elements of each set, instead of occupying consecutive entries in the simulated cache, can be spread across the simulated cache array in intervals of C/(K*B) (with the -t option). If you consider an associative cache as a C/K times K array, the -t option causes the entries to be stored in a transposed ordering.
  • As much of the cache functionality as possible is is encapsulated in the library cache.c. This file #includes a second file cache2.c, which contains the crucial lookup function which you are to optimize! The cache size need not be a power of 2; however the associativity K and block size B must be.
  • cachesim3 executes the cache simulation a number of times (-r nTimings option), and times each simulation. It calculates the number of computer clock cycles consumed per simulated cache access, based on the minimum timing (you should consider why it is necessary to take a number of timings, and why the minimum should be regarded as the most reliable. Hint: one reason is related to multiprocessing effects, the other to the memory hierarchy).
  • It can perform the simulations for a range of values of C, i.e. for C = C0, C0+dC*K*B, C0+2*dC*K*B, ... C0+(nC-1)*dC*K*B, where C0 = (1<<lgC). Here nC and dC are parameters which may appear optionally after lgC NdivC R. Note that the above formula ensures that C is a multiple of K*B. Noting that the computation's working set size is C, having non-power-of-two values of C is useful for memory hierarchy experiments, as power-of-two values can create pathological situations in the memory hierarchy.
  • As the simulated cache (hit rate) statistics are not the emphasis of this assignment, these are suppressed, unless the -s option is specified. However, it is still important that the simulated cache is still functioning correctly! The -p option (which also forces -s) can still be used for debugging.
  • The program can also print statistics of various hardware events of possible interest over the cache simulation, e.g. L1 cache misses (-e option). This is only available when compiled and run on hosts which support such a functionality (i.e. partch, iwaki). Note both this and using accurate timing functions are system-dependent; the use of C macros passed down from the Makefile, in conjunction with #ifdef directives, makes the code portable.
While this might appear complicated, the commands required to drive the program will mostly be given to you. The program will print out what options have been selected, and you should check this to ensure the results you are getting correspond to what was intended.

Browse the file cachesim3.c in order to understand its overall operation and how to use it.

ssh to partch, cd to the directory containing your copy of cachesim3.c and compile it using the command make. Execute the command ./cachesim3 to get it to print out a detailed usage message.

Answer the following questions. You may find adding the -e option to the commands below helpful; it prints out statistics of hardware events such as cache misses (but it should be noted that these might not be accurate in all situations).

Note that the timings of the simulations may be rather unstable, due to other process activity on partch (thank heaven no-one should need to be running Peanut on it anymore!). You should repeat any command 4 or 5 times (more if needed) in quick succession to see if the timings vary; if so, use the command reporting the fastest timings.

  1. Study the code in main() in cachesim3.c in the for (i...) {lookup(i);} loop, and the code for lookup() in cache2.c. For a direct-mapped cache configuration (K==1) where there is a hit, how many accesses to the simulated cache array cache[] will occur? What other memory accesses are involved? How significant will the accesses of cache[] be over the others? [2 marks]

  2. Run the program via the command cachesim3 -L 8 1 10 4 16. Cut-and-paste the command line including your prompt (which will verify you ran it from partch!) and the output in memhier.txt. Assuming that at least 30% of these cycles are consumed in memory accesses, and it should take at most 2 cycles to make a single access (assuming the data is resident in the L1 cache), estimate a lower bound on the number of memory accesses occurring per lookup. Does this match your answer to the previous question? [1 mark]

  3. Run the program again via the commands cachesim3 -L 8 1 1 4 16 and cachesim3 8 1 1 4 16. Cut-and-paste the output as described above. Identify the differences in the execution of lookup() in these situations from that of the previous question (hint: repeat all of the commands with the -s option). Use this information to produce a rough estimate of the overhead of the rand() function. [3 marks]

  4. Now optimize the code for lookup() by replacing integer multiplications, divisions and modulus (*, /, %) with bit-shifting operations (<<, >>, &), wherever and whenever this is possible. Note that you may always assume that K and B are powers of 2. The POWER_OF_2_CACHE macro may be used to decide when this optimization can be applied on operations involving ConKB. Be liberal with your use of parentheses, as the bitwise operators have low precedence.

    Test that your modifications have preserved the correctness of the simulated cache (use commands such as cachesim3 -p -L 4 1 2, cachesim3 -p -L -k 2 4 1 2, cachesim3 -p -L -b 2 4 1 2 for debugging). To test your optimizations with POWER_OF_2_CACHE, enable the macro using make OPT_P2=1. [6 marks - COMP2300 ; 4 marks - COMP6300]

  5. Now repeat the command cachesim3 -L 8 1 10 4 16 for both when the program was compiled with make OPT_P2=1 and with simply make. You will notice that the second line of output indicates how the program was compiled. Cut-and-paste the output as described above. Comparing your results with Q2, evaluate how effective your optimizations were. You will probably find the results for C≥272 rather sunrising; try to explain them. Hint: optimizing compilers will treat code such as if (0) x=e1; else x=e2; as simply x=e2;. [3 marks]

  6. Recompile your program with make and only use this version of your program from now on. Note that if you did not get the code working correctly, you can (and should) use the original code in cache2.c from now on.

    Execute the command cachesim3 -L -k 6 8 1 10 2 8. Cut-and-paste the output as described above. Why does the number of cycles increase from the result in Q2? Why would an associativity of K=2^6 be better to use for memory hierarchy experiments? [1 mark]

  7. Execute the command cachesim3 -L -k 6 lgC 1 10 2 8, for lgC = 9, 10, 11, 12. Cut-and-paste the output as described above. Comparing with results with that of the previous question, what effect (if any) does the L1 cache have on the performance of the computation? [1 mark]

    Bonus question: what feature of modern computer architecture might be responsible for this behavior? (note: we have not mentioned this in lectures, but there was a recent DCS seminar on the topic). [up to 1 bonus mark]

  8. (COMP6300 only) Consider the computation cachesim3 -r 1 -L -k 4 6 1 1. Describe the pattern of the order of accesses of elements of the cache[] array over the commutation (after the array has been initialized; ignore store accesses). Could the cachesim2 program (COMP6300 version) be used to accurately predict the performance of this computation on a direct-mapped cache of the same size? If so, what parameters would you give it? [2 marks - COMP6300 only]

  9. Execute the command cachesim3 -L -k 6 lgC 1 4 2 8, for lgC = 15, 16, 17, 18, 19. Cut-and-paste the output as described above. Comparing your results with section 1 Q4, what levels of the memory hierarchy are responsible for each decrease in performance? Briefly explain. [2 marks]

  10. Now repeat the experiments above with the -t option included. Cut-and-paste the output as described above. Compare carefully the results with that of the previous question, and write down the main trends in your report. Explain the results (as best you can! this is getting hard!). [4 marks]

  11. (for bonus marks) Devise an experiment of your own that reveals something interesting about the memory hierarchy, and try to explain the results. It can possibly be on another host, e.g. iwaki, an UltraSPARC III Cu, or a newer x86-based processor than partch.

    To get details of the memory hierarchy, type x86info on an x86-based machine or fpuversion on iwaki. Note that you have to use gmake instead of make on iwaki (the version of make we use on Linux is GNU Make; it is called gmake on iwaki as make is reserved for the Solaris version). [up to 4 bonus marks]

  12. (for bonus marks - it is not yet known whether this can be done!)

    Optimize the lookup() function in cache2.c further, so that it is significantly faster than for your results reported in Q5. It should also be significantly faster than the sample solution optimized as for Q5, which is in /dept/dcs/comp2300/public/ass3/cachesim3. Include in your report relevant performance results and an explanation of them. [up to 4 bonus marks]


Marking scheme

You are expected to give clear and concise answers, but must also include sufficient detail to explain how you obtained your answer. It is required that you cut-and-pasted the results in your report, but doing this without an explanation will gain little or no marks.


Last modified: 27/05/2008, 16:47

Copyright | Disclaimer | Privacy | Contact ANU