COMP2300 - Assignment 3
COMP2300
Semester 1, 2009: Assignment 3
Deadline: noon Friday 29 May
Please read the complete assignment specifications carefully.
(Please report any errors, mistakes and ambiguities
to the course lecturer)
This assignment is worth 8% of your total course mark. It will be
marked out of 32 as indicated below.
The estimated time we expect you to spend on this assignment is around
8 hours in total (1 hour per 4 assignment marks).
Submission
This can be done using the
submit comp2300 ass3 ... command (or the
submit comp6300 ass2 ... command, if you have a COMP6300 enrollment).
The file names which you should use (and the only ones that will be accepted by the submit command) are:
The submitted files must include, at the beginning, a completed
preamble. The proforma for this can be found in the file:
/dept/dcs/comp2300/public/standard_preamble.txt
You have to copy the preamble from this file into the top of each file
you intend to submit, and then complete your details. By submitting a
file containing a completed preamble, you are making a declaration about
the work you have submitted (see the text of the declaration in the
preamble files. If you need to modify the text, e.g. you did in fact
receive substantial assistance from another student, you should modify
the text to say so).
Any file submitted without a completed preamble will be rejected
without being marked.
Note you can submit any file multiple times, including after the
deadline has expired. Note that if you submit any file after the
deadline your entire submission will be considered as late and will
attract a late penalty.
Extensions
See
http://cs.anu.edu.au/student/comp2300/assessment.php.
Late Penalties
Late penalties are as follows
| How Late | Penalty from 32 Marks |
|---|
| <=1 hour | -1 |
| <=6 hour | -2 |
| <=24 hours | -4 |
| <=48 hours | -8 |
| <=72 hours | -16 |
| > 72 hours | -32 (forget it!) |
Late penalties apply even if you submit just one file after the
deadline.
Plagiarism
See
http://cs.anu.edu.au/student/comp2300/assessment.php.
Experiments on the Memory Hierarchy
In this assignment, your report containing answers to the questions
below go into a file called memhier.txt (make sure you include,
at the beginning, a completed preamble based on the
standard_preamble.txt file).
Note that the text (especially Bryant) may be useful for some of the
more questions from Section 1.
The machine wallaman.anu.edu.au is an 1.2 GHz UltraSPARC T2
with approximately 8 GB of physical memory and a
4 MB L2 cache (16-way set associative, 64 byte lines).
It can be assumed that it has a 48-bit virtual address space with 8 KB
pages (this is its normal mode of operation).
Note that the T2 is a 64-bit machine.
It has a 8 KB level-1 cache
(4-way set associative, 16 byte line size). This is for
data only - instructions have a separate cache.
It has a 128-entry TLB (for data).
The shuffle program developed in this course is a potentially
memory-intensive program, which while generating relatively simple
access patterns, which can challenge the memory hierarchy of the
computer running it. Hence, with some extension, it is suitable for the
purpose of performing memory hierarchy experiments.
Section 1 [10 marks]
Answer the following questions, showing your working. Express your
answers for large numbers using powers to 2 and/or KB/MB/GB where
appropriate, rather than using long sequences of decimal digits. Show
your working.
- How many pages will fit into the main memory?
[1 mark]
- How many pages exist in the virtual address space?
If a page table was required to have all possible pages,
how large would it be? (assume each entry is 16 bytes).
[2 marks]
- Would it be practical to implement such a page table assuming a single
process is running at any one time?
Would this still be the case for large-scale multiprocessing
(say 100 processes in memory at any time)?
In either case, suggest how page table memory
overheads could be reduced.
[2 marks; answer in no more than 3 sentences]
- In terms of pages,
what is the largest working set of a user program
that will fit into the level-1 cache, the level-2 cache
and main memory (assume that only half the physical memory is
available to a user program).
What, in terms of pages and bytes, is the largest working set that will
fit into the TLB?
[2 marks]
- For an array of integers (4 bytes) of length N, what values
of N do the above working set sizes correspond to?
[1 mark]
-
Assuming that no data was in the cache initially, calculate the best
possible cache miss rate for the level-1 cache for an algorithm that
access each item of data exactly once, and in units of ints.
The cache hit rate is defined as the percentage of data accesses where
the data is in cache. Hint: consider the effect of the cache
line size for the case when all data accesses occur with
consecutively increasing addresses.
[2 marks]
Section 2 [22 marks + bonuses!]
Copy the files in
/dept/dcs/comp2300/public/ass3 into a
suitably named sub-directory in your own account area. The program
shuffletst.c represents the final evolution of our shuffle
program (note the extra error checking and defensive programming code).
It extends the
shuffle2.c program of Assignment 1 in the
following ways:
- As much of the shuffle code as possible is is encapsulated
in the library shuffle.c. There are a number variants
of algorithms used; these can be selected using the -v v
on the command line of shuffletst. Also two in-place algorithms
(courtesy of our two Lords of the In-place Shuffle) are included;
these can be selected by specifying the -i option.
This file #includes a second
file shuffle2.c, which contains functions where
you can write an optimized algorithms.
- shuffletst executes the specified shuffle a number of times
(-r nTimings option), and times each shuffle. It calculates
the number of computer clock cycles consumed per element of the shuffled
array, based on the minimum timing (you should consider why it is
necessary to take a number of timings, and why the minimum should be
regarded as the most reliable. Hint: one reason is related to
multiprocessing effects, the other to the memory hierarchy). On most
modern computers, the minimum possible number of cycles per element is
two (one cycle to load the element, one to store it). On the T2,
4 cycles are required per load / store, due to the fact that a single
thread of execution (the case for a normal program) gets access
to the load/store unit of the CPU only on every second cycle.
It is not necessarily easy to get close to these lower limits though!
- It can also perform shuffles for a range of values of N,
i.e. for N = n, n+dN, n+2*dN, ... n+(nN-1)*dN. Here nN
and dN are parameters which may appear optionally after n
k. n, k and dN may also be entered in
exponential form x^y (e.g. 2^20, no spaces on either
side of the ^), noting that power-of-two values can create
pathological situations in the memory hierarchy.
- The shuffle is properly checked, and the program will abort if
there are any errors. The -p option can still be used for
debugging; note that the value for nTimings defaults
to 1 in this case.
- The program can also print statistics of various hardware events of
possible interest over the cache simulation, e.g. L1 cache misses
(-e option). This is only available when compiled and run on
hosts which support such a functionality (i.e. wallaman,
partch). Note both this and using accurate timing functions are
system-dependent; the use of C macros passed down from the
Makefile, in conjunction with #ifdef directives, makes
the code portable.
While this might appear complicated, the commands required to drive
the program will mostly be given to you. The program will print out
what options have been selected, and you should check this to ensure
the results you are getting correspond to what was intended.
Browse the file shuffletst.c in order to understand its overall
operation and how to use it. Study the file shuffle.c
and inspect the various shuffle algorithms presented; which looks
like it will be the fastest?
ssh to wallaman, cd to the directory
containing your copy of shuffletst.c and compile it using the
command gmake. Execute the command ./shuffletst to
get it to print out a detailed usage message.
Answer the following questions. You may find adding the -e
option to the commands below helpful; it prints out statistics of hardware
events such as cache misses (but it should be noted that these might
not be accurate in all situations).
Note that the timings of the shuffles may be rather unstable, due
to other process activity on wallaman. You should repeat any
command 2 or 3 times (more if needed) in quick succession to see if the
timings vary; if so, use the command reporting the fastest timings.
- Run the program via the command shuffletst -v v 1000
1 and shuffletst -v v 1000 1000, for v = 2,
3. Cut-and-paste each command line including your prompt (which
will verify you ran it from wallaman!) and the output in
memhier.txt. By inspecting the corresponding loops in
shuffle.c, explain the similarities and differences in the
results.
[2 marks]
- Now run the program via the command shuffletst -v v
1000 1 and shuffletst -v v 1024 2, for v =
0, 2. Cut-and-paste each command and its output as before. Inspect
the corresponding code in shuffle.c, and its generated
assembly code in shuffle.s (hint: search for
shuffleSrcCtg, then locate code for the innermost loop). How
many multiply and divide instructions are executed per iteration?
Assuming this to be the cause of the essential difference in speed
between the two versions, calculate the cost of a a multiply-divide
instruction pair in cycles for n = 1000 and n = 1024.
[3 marks]
- Explain why such differences in operand values can affect
instruction speed. Hint: the T2 is a post-RISC processor; it
does not necessarily have purely RISC properties.
[1 mark - COMP6300 only]
- Run the program via the command shuffletst -i -v v -e
1000 2 for v = 0, 1. Cut-and-paste each command/output as
before. This invokes the in-place algorithms. Briefly explain
why the swap-based algorithm algorithm is more efficient. Hint:
compare the total number of instructions executed in each case. By
comparing these with your previous results, what is the main factor
affecting the performance of these algorithms?
[2 marks]
- It is clear that the computation time of the non-in place
algorithms is proportional to n (ignoring memory hierarchy
effects). Is this the case for the in-place algorithms? If not, what
function of n would the computation time be? Justify your
answer. Hint:: you may derive your answer though analysis of
the algorithms, or you could try modifying the code to count the number
of iterations of the loops for various n (choose powers of 2)
with k = 2.
[2 marks - COMP6300 only. 2 bonus marks for COMP2300]
- In shuffle2.c, implement the ipShuffleOptP2()
function to use faster arithmetic for the case where n (and
hence k) is a power of 2. That is, replace integer
multiplications, divisions and modulus (*, /,
%) with bit-shifting operations (<<,
>>, &), for this case. The function should
also work for non-power-of-two cases, but need not be optimized for
these. You may your code on either of the two given in-place algorithms.
Be liberal with your use of parentheses, as the bitwise operators have
low precedence.
Test that your modifications have preserved the correctness of the
shuffle (use commands such as shuffletst -i -v 2 -p 16 16 2
for debugging).
Cut-and-paste the command/output shuffletst -i -v 2 1024 2 and either
shuffletst -i -v 0 1024 2 or shuffletst -i -v 1 1024
2 (depending on which variant you based your code on) and state the
improvement of performance.
[7 marks - COMP2300 ; 4 marks - COMP6300]
-
Execute the command shuffletst -v v k^2
k for v = 2, 3 and k = 64, 65.
Cut-and-paste the command/output as before. What level of the memory
hierarchy is responsible for the drop in performance in one case, and
give a conjecture for why it only occurs for k = 64.
State the reason why using n = k^2 is suitable for comparing
results between the two versions.
[3 marks]
Suggest a reason for why it occurs for only one of the two versions.
[1 bonus mark]
-
Execute the command shuffletst -r 8 -v v 2^x 1024
for v = 2, 3 and x = 17, 18, 19, 20. State the
trend in the results, when combined those of the previous question
(where k = 64).
Noting that these algorithms use two arrays of length
n, and your answers for part 1 Q5, explain what parts of the
memory hierarchy are impacting performance (and what parts are not).
[3 marks]
- In shuffle.h, the type sxint is defined as the
type used for shuffle indexing computations. Why might it be safer for
this type to be unsigned? If we wanted to use much larger n
and k than we have used above, how should we change this
defniition, and what else would need to be changed?
[2 marks]
- (for bonus marks)
Devise an experiment of your own that reveals something interesting
about the memory hierarchy, and try to explain the results.
It can possibly be on another host, e.g. partch.
To get details of the memory hierarchy, install x86info on an
x86-based machine. Note that you have to use make instead of
gmake on Linux hosts.
[up to 4 bonus marks]
- (for bonus marks)
Optimize either the in-place or ordinary shuffle so that it is
significantly faster at least for some parameters than the respective
algorithms so far. shuffle2.c has stubs where you can write
this code (ipShuffleLotQS() or shuffleLotQS()).
Include in your report the performance gains and on what range of
parameters it is designed for. Note that gcc inline assembler is allowed.
Note that these functions are invoked with -i -v 3
or -v 4 (no -i) options.
[up to 4 bonus marks]
Marking scheme
You are expected to give clear and concise answers, but must also
include sufficient detail to explain how you obtained your answer.
It is required that you cut-and-pasted the results in your report,
but doing this without an explanation will gain little or no marks.
Last modified: 13/05/2009, 18:58