COMP2300 - Assignment 3
COMP2300
Semester 1, 2008: Assignment 3
Deadline: 10:00 am on Tuesday 3 June Friday 30 May 2008
Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities
to the course lecturer)
This assignment is worth 8% of your total course mark. It will be
marked out of 32 as indicated below.
The estimated time we expect you to spend on this assignment is around
8 hours in total (1 hour per 4 assignment marks).
Submission
This can be done using the
submit comp2300 ass3 ... command (or the
submit comp6300 ass2 ... command, if you have a COMP6300 enrollment).
The file names which you should use (and the only ones that will be accepted by the submit command) are:
memhier.txt
cache2.c (optional, for bonus marks)
The submitted files must include, at the beginning, a completed
preamble. The proforma for this can be found in the file:
/dept/dcs/comp2300/public/standard_preamble.txt
You have to copy the preamble from this file into the top of each file
you intend to submit, and then complete your details. By submitting a
file containing a completed preamble, you are making a declaration about
the work you have submitted (see the text of the declaration in the
preamble files. If you need to modify the text, e.g. you did in fact
receive substantial assistance from another student, you should modify
the text to say so).
Any file submitted without a completed preamble will be rejected
without being marked.
Note you can submit any file multiple times, including after the
deadline has expired. Note that if you submit any file after the
deadline your entire submission will be considered as late and will
attract a late penalty.
Extensions
See
http://cs.anu.edu.au/student/comp2300/assessment.php.
Late Penalties
Late penalties are as follows
| How Late | Penalty from 32 Marks |
|---|
| <=1 hour | -1 |
| <=6 hour | -2 |
| <=24 hours | -4 |
| <=48 hours | -8 |
| <=72 hours | -16 |
| > 72 hours | -32 (forget it!) |
Late penalties apply even if you submit just one file after the
deadline.
Plagiarism
See
http://cs.anu.edu.au/student/comp2300/assessment.php.
Experiments on the Memory Hierarchy
In this assignment, your report containing answers to the questions
below go into a file called memhier.txt (make sure you include,
at the beginning, a completed preamble based on the
standard_preamble.txt file).
Note that the text (especially Bryant) may be useful for some of the
more questions from Section 1.
The machine partch.anu.edu.au is a 3.06 GHz Intel Pentium 4
processor with approximately 4 GB of physical memory and a
0.5 MB L2 cache (you can verify this by ssh partch followed by
the command x86info, courtesy of a project by Dave
Jones. The command cat /proc/cpuinfo /proc/meminfo will
give some of this information). It can be assumed that it has a 32-bit
virtual address space with 4 KB pages (this is its normal mode of
operation).
It has a 8 KB level-1 cache (4-way set associative),
with a block (line) size of 64 bytes. This is for
data only - instructions have a separate cache. The L2 cache
has the same line size and 8-way associativity.
It has a 64-entry TLB (for data).
The cache simulator program developed in this course is a potentially
memory-intensive program, which while generating relatively simple
access patterns, which can challenge the memory hierarchy of the
computer running it. Hence, suitably extended, it is suitable for the
purpose of experiments on it.
Section 1 [10 marks]
Answer the following questions, showing your working.
Express your answers for large numbers using powers to 2 and/or
KB/MB/GB where appropriate, rather than using long sequences of decimal
digits. Show your working.
- How many pages will fit into the main memory?
[1 mark]
- How many pages exist in the virtual address space?
If a page table was required to have all possible pages,
how large would it be? (assume each entry is 16 bytes).
[2 marks]
- Would it be practical to implement such a page table?
If so, how could it be improved? If not, how could
it be made more practical?
[2 marks; answer in no more than 3 sentences]
- In terms of pages,
what is the largest working set of a user program
that will fit into the level-1 cache, the level-2 cache
and main memory (assume that only half the physical memory is
available to a user program).
What, in terms of pages and bytes, is the largest working set that will
fit into the TLB?
[2 marks]
- For an array of integers (4 bytes) of length C, what values
of C do the above working set sizes correspond to?
[1 mark]
-
Assuming that no data was in the cache initially, calculate the best
possible cache miss rate for the level-1 cache for an algorithm accessing
data in units of ints, based on the consideration of block size.
[2 marks]
Section 2 [22 marks + bonuses!]
Copy the files in
/dept/dcs/comp2300/public/ass3 into a
suitably named sub-directory in your own account area. The program
cachesim3.c extends the
cachesim2.c
program of Assignment 1 in the following ways:
- An LRU replacement policy can be selected (-L option). A
block (line) size of B can be specified (-b lgB
option); however, for the purposes of most experiments, only the case
B=1 will be of interest. For set-associative caches, the
elements of each set, instead of occupying consecutive entries in the
simulated cache, can be spread across the simulated cache array in
intervals of C/(K*B) (with the -t option). If you
consider an associative cache as a C/K times K array,
the -t option causes the entries to be stored in a transposed
ordering.
- As much of the cache functionality as possible is is encapsulated
in the library cache.c. This file #includes a second
file cache2.c, which contains the crucial lookup function which
you are to optimize! The cache size need not be a power of 2; however
the associativity K and block size B must be.
- cachesim3 executes the cache simulation a number of times
(-r nTimings option), and times each simulation. It calculates
the number of computer clock cycles consumed per simulated cache access,
based on the minimum timing (you should consider why it is necessary to
take a number of timings, and why the minimum should be regarded as the
most reliable. Hint: one reason is related to multiprocessing
effects, the other to the memory hierarchy).
- It can perform the simulations for a range of values of C,
i.e. for C = C0, C0+dC*K*B, C0+2*dC*K*B, ... C0+(nC-1)*dC*K*B,
where C0 = (1<<lgC). Here nC and dC
are parameters which may appear optionally after lgC NdivC R.
Note that the above formula ensures that C is a multiple of
K*B. Noting that the computation's working set size is
C, having non-power-of-two values of C is useful for
memory hierarchy experiments, as power-of-two values can create
pathological situations in the memory hierarchy.
- As the simulated cache (hit rate) statistics are not the emphasis
of this assignment, these are suppressed, unless the -s
option is specified. However, it is still important that the simulated
cache is still functioning correctly! The -p option (which
also forces -s) can still be used for debugging.
- The program can also print statistics of various hardware events of
possible interest over the cache simulation, e.g. L1 cache misses
(-e option). This is only available when compiled and run on
hosts which support such a functionality (i.e. partch,
iwaki). Note both this and using accurate timing functions are
system-dependent; the use of C macros passed down from the
Makefile, in conjunction with #ifdef directives, makes
the code portable.
While this might appear complicated, the commands required to drive
the program will mostly be given to you. The program will print out
what options have been selected, and you should check this to ensure
the results you are getting correspond to what was intended.
Browse the file cachesim3.c in order to understand its overall
operation and how to use it.
ssh to partch, cd to the directory
containing your copy of cachesim3.c and compile it using the
command make. Execute the command ./cachesim3 to
get it to print out a detailed usage message.
Answer the following questions. You may find adding the -e
option to the commands below helpful; it prints out statistics of hardware
events such as cache misses (but it should be noted that these might
not be accurate in all situations).
Note that the timings of the simulations may be rather unstable, due
to other process activity on partch (thank heaven no-one should
need to be running Peanut on it anymore!). You should repeat any
command 4 or 5 times (more if needed) in quick succession to see if the
timings vary; if so, use the command reporting the fastest timings.
- Study the code in main() in cachesim3.c in the
for (i...) {lookup(i);} loop, and the code for
lookup() in cache2.c. For a direct-mapped cache
configuration (K==1) where there is a hit, how many
accesses to the simulated cache array cache[] will occur? What
other memory accesses are involved? How significant will the accesses of
cache[] be over the others?
[2 marks]
- Run the program via the command cachesim3 -L 8 1 10 4 16.
Cut-and-paste the command line including your prompt (which will verify
you ran it from partch!) and the output in
memhier.txt. Assuming that at least 30% of these cycles are
consumed in memory accesses, and it should take at most 2 cycles to make
a single access (assuming the data is resident in the L1 cache),
estimate a lower bound on the number of memory accesses occurring per
lookup. Does this match your answer to the previous question?
[1 mark]
- Run the program again via the commands cachesim3 -L 8 1 1 4
16 and cachesim3 8 1 1 4 16. Cut-and-paste the output as
described above. Identify the differences in the execution of
lookup() in these situations from that of the previous question
(hint: repeat all of the commands with the -s
option). Use this information to produce a rough estimate of the
overhead of the rand() function. [3 marks]
- Now optimize the code for lookup() by replacing integer
multiplications, divisions and modulus (*, /,
%) with bit-shifting operations (<<,
>>, &), wherever and whenever this is
possible. Note that you may always assume that K and B
are powers of 2. The POWER_OF_2_CACHE macro may be used to
decide when this optimization can be applied on operations involving
ConKB. Be liberal with your use of parentheses, as the bitwise
operators have low precedence.
Test that your modifications have preserved the
correctness of the simulated cache (use commands such as cachesim3 -p
-L 4 1 2, cachesim3 -p -L -k 2 4 1 2, cachesim3 -p -L
-b 2 4 1 2 for debugging). To test your optimizations with
POWER_OF_2_CACHE, enable the macro using make
OPT_P2=1.
[6 marks - COMP2300 ; 4 marks - COMP6300]
- Now repeat the command cachesim3 -L 8 1 10 4 16 for both
when the program was compiled with make OPT_P2=1 and with simply
make. You will notice that the second line of output indicates
how the program was compiled. Cut-and-paste the output as described
above. Comparing your results with Q2, evaluate how effective your
optimizations were. You will probably find the results for
C≥272 rather sunrising; try to explain them. Hint:
optimizing compilers will treat code such as if (0) x=e1; else x=e2;
as simply x=e2;.
[3 marks]
-
Recompile your program with make and only use this version
of your program from now on. Note that if you did not get the code working
correctly, you can (and should) use the original code in cache2.c
from now on.
Execute the command cachesim3 -L -k 6 8 1 10 2 8. Cut-and-paste
the output as described above. Why does the number of cycles increase
from the result in Q2? Why would an associativity of K=2^6 be
better to use for memory hierarchy experiments?
[1 mark]
-
Execute the command cachesim3 -L -k 6 lgC 1 10 2 8,
for lgC = 9, 10, 11, 12. Cut-and-paste the output as described
above. Comparing with results with that of the previous question, what
effect (if any) does the L1 cache have on the performance of the
computation?
[1 mark]
Bonus question: what feature of modern computer architecture
might be responsible for this behavior? (note: we have not mentioned
this in lectures, but there was a recent DCS seminar on the topic).
[up to 1 bonus mark]
- (COMP6300 only)
Consider the computation cachesim3 -r 1 -L -k 4 6 1 1. Describe
the pattern of the order of accesses of elements of the cache[]
array over the commutation (after the array has been initialized;
ignore store accesses). Could the cachesim2 program (COMP6300
version) be used to accurately predict the performance of this
computation on a direct-mapped cache of the same size? If so, what
parameters would you give it?
[2 marks - COMP6300 only]
-
Execute the command cachesim3 -L -k 6 lgC 1 4 2 8, for
lgC = 15, 16, 17, 18, 19. Cut-and-paste the output as
described above. Comparing your results with section 1 Q4, what levels of
the memory hierarchy are responsible for each decrease in performance?
Briefly explain.
[2 marks]
-
Now repeat the experiments above with the -t option included.
Cut-and-paste the output as described above. Compare carefully the
results with that of the previous question, and write down the main
trends in your report. Explain the results (as best you can! this is
getting hard!).
[4 marks]
- (for bonus marks)
Devise an experiment of your own that reveals something interesting
about the memory hierarchy, and try to explain the results.
It can possibly be on another host, e.g. iwaki, an UltraSPARC
III Cu, or a newer x86-based processor than partch.
To get details of the memory hierarchy, type x86info on an
x86-based machine or fpuversion on iwaki. Note that
you have to use gmake instead of make on
iwaki (the version of make we use on Linux is GNU
Make; it is called gmake on iwaki as make is
reserved for the Solaris version). [up to 4 bonus marks]
- (for bonus marks - it is not yet known whether this can be done!)
Optimize the lookup() function in cache2.c further, so
that it is significantly faster than for your results reported in Q5.
It should also be significantly faster than the sample solution
optimized as for Q5, which is in
/dept/dcs/comp2300/public/ass3/cachesim3. Include in your
report relevant performance results and an explanation of them.
[up to 4 bonus marks]
Marking scheme
You are expected to give clear and concise answers, but must also
include sufficient detail to explain how you obtained your answer.
It is required that you cut-and-pasted the results in your report,
but doing this without an explanation will gain little or no marks.
Last modified: 27/05/2008, 16:47