COMP2300
Semester 1, 2007: Assignment 3
Deadline: 12 noon on Wednesday 30 May 2007
(second draft; subject hopefully only to small fixes)
(24/05: minor fixes to commands in Part 2 Q2, Q3 and Q8)
(28/05: clarification to Part 1 Q6)
Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities
to the course lecturer)
This assignment is worth 8% of your total course mark. It will be
marked out of 32 as indicated below.
The estimated time we expect you to spend on this assignment is around
8 hours in total (1 hour per 4 assignment marks).
Submission
This assignment must be submitted electronically. This can be done
using the submit comp2300 ass3 memhier.txt transalgs_a.c command.
The files memhier.txt and transalgs_a.c
submitted must include, at the beginning, a completed
preamble. The proforma for this can be found in the file:
/dept/dcs/comp2300/public/standard_preamble.txt
You have to copy the preamble from this file into the top of each file
you intend to submit, and then complete your details. By submitting a
file containing a completed preamble, you are making a declaration about
the work you have submitted (see the text of the declaration in the
preamble files. If you need to modify the text, e.g. you did in fact
receive substantial assistance from another student, you should modify
the text to say so).
Any file submitted without a completed preamble will be rejected
without being marked.
Extensions
See
http://cs.anu.edu.au/student/comp2300/assessment.html.
Late Penalties
Late penalties are as follows
| How Late | Penalty from 32 Marks |
|---|
| <=1 hour | -1 |
| <=6 hour | -2 |
| <=24 hours | -4 |
| <=48 hours | -8 |
| <=72 hours | -16 |
| B hours | -32 (forget it!) |
Late penalties apply even if you submit just one file after the
deadline.
Plagiarism
See
http://cs.anu.edu.au/student/comp2300/assessment.html.
Experiments on the Memory Hierarchy
In this assignment, your report containing answers to the questions
below go into a file called memhier.txt (make sure you include,
at the beginning, a completed preamble based on the
standard_preamble.txt file). In Section 2, will also be asked to
write a small amount of C code in a file called transalgs_a.c.
Note that the text (especially Bryant) may be useful for some of the
more questions from Section 1.
The machine karajan.anu.edu.au is a 2.13 GHz Intel (R) Core2
processor (brand new!) with approximately 2 GB of physical memory and a
2 MB L2 cache (you can verify this by ssh karajan followed by
the command x86info, courtesy of a project by Dave
Jones. The command cat /proc/cpuinfo /proc/meminfo will
give some of this information). It can be assumed that it has a 32-bit
virtual address space with 4 KB pages (this is its normal mode of
operation).
It has a 32 KB level-1 cache (8-way set associative),
with a block (line) size of 64 bytes. This is for
data only - instructions have a separate cache. The L2 cache
has the same line size and associativity.
It has a 256-entry TLB (for data).
Matrix transpose is a simple computation which nonetheless challenges
the memory hierarchy, and so is suitable for the purpose of experiments
on it. In fact, this part is a kind of a follow-on of a research paper based on experiments on the
UltraSPARC I and III.
Section 1 [10 marks]
Answer the following questions, showing your working.
Express your answers for large numbers using powers to 2 and/or
KB/MB/GB where appropriate, rather than using long sequences of decimal
digits. Show your working.
- How many pages will fit into the main memory?
[1 mark]
- How many pages exist in the virtual address space?
If a page table was required to have all possible pages,
how large would it be? (assume each entry is 16 bytes).
[2 marks]
- Would it be practical to implement such a page table?
If so, how could it be improved? If not, how could
it be made more practical?
[2 marks; answer in no more than 3 sentences]
- In terms of pages,
what is the largest working set of a user program
that will fit into the level-1 cache, the level-2 cache
and main memory (assume that only half the physical memory is
available to a user program).
What, in terms of pages and bytes, is the largest working set that will
fit into the TLB?
[2 marks]
- For an n by n matrix of integers (4 bytes), what
values ofn do the above working set sizes correspond to?
[1 mark]
-
Assuming that no data was in the cache initially, calculate the best
(lowest) possible cache miss rate for the
level-1 cache for an algorithm accessing a set
of data in units of ints, based on the consideration of
block size. Assume that the algorithm accesses
each unit of the set exactly once.
[2 marks]
Section 2 [22 marks + bonuses!]
Copy the files in /dept/dcs/comp2300/public/ass3 into a
suitably named sub-directory in your own account area. The program
tranpose3.c transposes an n by
n matrix A.It extends the transpose2.c
program of Assignment 2 in the following ways:
- there are a larger number of transpose algorithms available
(-a option).
These are separated from the main program in files transalgs.c
and transalgs_a.c. A Makefile is provided to compile
the program system consistently (note use of header files).
For clarity, the program will print the name of the algorithm selected.
- the program executes the selected transpose algorithm a number of
times (-r option), and times each operation. It calculates the
number of computer clock cycles consumed per matrix element, based on the
minimum timing (you should consider why it is necessary to take a number
of timings, and why the minimum should be regarded as the most reliable.
Hint: one reason is related to multiprocessing effects, the other to
the memory hierarchy).
- for convenience, it performs the computations for a range of
values of n, i.e. for n = ni + dn, ni+2*dn, ..., nf.
- it will also print counts of various hardware events of possible
interest over the transpose operation, e.g. L1 cache misses (-e
option). This is only available on hosts which support such a
functionality (e.g. karajan, iwaki). Note both this
and using accurate timing functions are system-dependent; the use of C
macros passed down from the Makefile, in conjunction with
#ifdef directives, makes the code portable.
- it can ensure the matrix is initially flushed from the top-level
cache before the first transpose operation (-c option -
we say the caches are `cold' with respect to holding the data).
Browse the file transpose3.c in order to understand its overall
operation and how to use it.
ssh to karajan, cd to the directory
containing your copy of transpose3.c and compile it using the
command make. Execute the command ./transpose3 to
get it to print out a usage message.
Answer the following questions. You may find adding the -e
option to the commands below helpful; it prints out counts of hardware
events such as cache misses.
- Study the code in main() in transpose3.c in the
for (r...) loop (assume getHWevents = flushCache = 0).
In terms of data, what would you identify as making up the working
set for this loop? [1 mark]
- Run the program via the commands ./transpose3 -a 0 32 136 8
and ./transpose3 -a 1 32 136 8.
(the -a 1 option selects a naive matrix transposition
algorithm; the -a 0 option selects an algorithm which
reverses elements in rows. This is not a transposition at all
but can be used as a baseline as it contains the same number of
swaps).
Cut-and-paste the command line including your prompt (which
will verify you ran it from karajan!) and the output in
memhier.txt.
Explain why the speed increases (for at least a while) after
n=32. At what point does the performance begin to decrease?
What part of the memory hierarchy is likely to start to thrash at this
point? Is this exactly where you would expect from Sect 1 Q5? What
could explain the anomaly at n=128 for the transpose? [4
marks]
- Run the program again via the command ./transpose3 -a 3
32 136 8. This performs the tranpose via 2
by 2 blocks, as per Assignment 2. Cut-and-paste the prompt / command /
output as above.
Inspect transalgs.s and locate the innermost loop for the
baseline, trans1x1 and trans2x2 functions.
Count the number of memory references in each body (these are signified
by (...)'s in the instruction operands, not including those
of the leal instruction). Write this information in your report,
including the address label for the head of the innermost loop.
How well does the relative execution speed (take the fastest over the range of
n) correlate with the number of memory references (per matrix
element)? Is this factor alone sufficient to explain the relative
performance? [3 marks]
-
You may have noticed that the innermost loop of the trans2x2
function contains at least one imull
(integer multiplication) instruction. A general optimization technique
is to avoid these instructions, as these tend to be slower than other
integer arithmetic instructions (e.g. see Lecture M5, p32). In the case
of an expression i*n+j inside a loop, where i increases
by 1 each iteration, this can be achieved for example by replacing
i*n with a local variable in which is incremented by
n whenever i is incremented by 1 (this kind of
technique is known as strength reduction and forms an important
class of program optimizations).
Implement in transalgs_a.c versions of the 1x1 and 2x2
transpose algorithms that avoid explicit integer multiply using this technique
(they can be accessed by the -a 2 and -a 4 options).
After testing your code, inspect transalgs_a.s to verify
no imull instructions appear in the innermost loops.
Cut-and-paste the prompt/command/output of ./transpose3 -a 2 32 136
8 and ./transpose3 -a 4 32 136 8 into your report. State
whether the results suggest this optimization was worthwhile. If the
results are actually worse, suggest a reason for why this might be.
[6 marks]
-
The algorithm blockedTrans() in transalgs.c is a
transpose algorithm based on swapping b by b blocks.
For the case of b=4, it calls R_Block4Trans()
(this routine was designed for the UltraSPARC III, although for floating
point rather than integer data).
Inspect the code for this. On an UltraSPARC III,
this tends to be significantly faster than trans2x2().
Cut-and-paste the prompt/command/output of ./transpose3 -a 5 32 136 8
(which causes R_Block4Trans() to be used) into your report.
Explain why the relative performance of these two algorithms
is different on the x86 and UltraSPARC architectures.
[2 marks]
-
Looking more closely of the impact of the level-1 cache on
performance, cut-and-paste the prompt/command/output of
./transpose3 -a 3 48 336 16 into your report. What pattern do
you see emerging? Disregarding the anomolous (slow) results over the
range, to what extent would you say the actual capacity of the level-1
cache is having an impact of performance. What could be a possible
explanation for the pattern of anomolous (slow) results?
Hint:
does the baseline algorithm (-a 0) exhibit such a pattern?
[3 marks]
-
Cut-and-paste the prompt/command/output of ./transpose3 -a 3
128 2048 128 and ./transpose3 -a 5 -b 8 128 2048 128
into your report.
Looking first at the overall trends of the -a 3 results, what
level(s) of the memory hierarchy could be taking effect?
What pattern (of anomalies) do you see emerge for larger n?
Why is this pattern softened for the -a 5 results,
and why are these results generally faster now for large n?
[3 marks]
- (for bonus marks)
Cut-and-paste the prompt/command/output of ./transpose3 -a 6 32
136 8 and
./transpose3 -a 6 128 2048 128 into your
report. This uses a divide-and-conquer strategy and is an example of a
class of algorithms called cache-oblivious algorithms: so called
because they are claimed to minimize misses at all levels of the
memory hierarchy without needing to knowing about the capacity at any level.
Does this algorithm appears to live up to its name? Briefly explain. Also
explain what you think is the main factor in its relative performance,
say to the -a 3 case. [up to 2 bonus marks]
- (for bonus marks)
Devise an experiment of your own that reveals something interesting
about the memory hierarchy, and try to explain the results.
It can possibly be on another host, e.g. partch which
comes from an older line of the x86 architecture (and with an
aggressively high clock rate to try to make up for this!), or
possibly on iwaki an UltraSPARC III Cu.
To get details of the memory hierarchy, type x86info on
partch or fpuversion on iwaki.
Note that you need to do make clean; make to re-compile
your code on partch (and gmake clean; gmake on
iwaki - the version of make we use on Linux is GNU Make;
it is called gmake on iwaki as make is reserved
for the Solaris version).
[up to 4 bonus marks]
- (for bonus marks - it is not yet known whether this can be done!)
In the function transWildcard(), implement a version of a
transpose algorithm that aims to be significantly faster than the best
of the algorithms we have seen so far (-a 3 for small n,
-a 5 -b 8 for large n). You may use different
gcc compiler options in versions of existing codes as well - if so
state what options you require clearly in your report. Test your code
using the -a 7 option. Include in your report relevant
performance results and an explanation of them.
[up to 4 bonus marks]
Marking scheme
You are expected to give clear and concise
answers, but must also include sufficient detail to explain how you
obtained your answer.
Last modified:
Thu May 24 10:29:36 EST 2007