COMP2300 - TuteLab 09
COMP2300 / COMP6300
Tutorial Laboratory 09 - Caches, SPARC Assembly and
Linking and Loading
Semester 1, 2009
Week 11 (18 - 22 May)
As well as the Preparation Exercises for this class, you are
expected to have revised lectures M3--M5 and O2 (of course it would be a
very good idea to bring your notes with you to the session!).
It will also be useful to have read this document beforehand.
Also for this session, there is a submitable
laboratory exercise which is due by 09 am Tuesday 26 May (week 12), which
will contribute up to 1% of your assessment (in the Tute/Lab mark).
Objectives
This lab has the following aims.
- To deepen your understanding of caches.
- To give you a brief introduction to the assembly language of
a real machine, in this case SPARC assembly.
- To understand load objects and symbol tables.
Preparation Exercises
the following questions on a separate sheet of paper, with your
name and student number clearly written. Please ensure your writing is
legible. Hand in to your tutor at the beginning of your tutorial /
laboratory session.
- Define the terms direct-mapped cache and
k-way set-associate cache. What are typical values for k?
- What is a cache line, and what lengths are these typically?
- What is a branch delay slot, and why were they introduced
to RISC processors?
- Write the SPARC assembly language instruction that
sets register %o2 to the sum of registers
%i1 and %i2.
- Suppose you wished to write an assembly language program to call
the C function int f(int x, int y).
Which registers would you use to pass the parameters,
and which would hold the return value?
Tutorial Questions on Caches
- Consider a program which repeatedly accesses all elements in an
int array of of length N in a cyclic fashion
(from first to last element), executing on a
computer with a 64 KB data cache. Note that this is a cache version
of Belady's anomaly (Lecture M2).
For example, the computation could be:
while (1) {
int s = 0, i;
for (i = 0; i < N ; i++)
s += a[i];
}
Suppose the first iteration
has already been completed.
Assume that the cache line size
is 4 bytes, and the cache is fully associative
(note: this is not a realistic cache!),
and the replacement policy is least recently used
Calculate the
cache hit rate for
(a) N = 213,
(b) N = 214,
(c) N = 215 and
(d) N = 216.
Hint: calculate the relative size of a[]
to the cache first.
Explain whether a random replacement
policy performs better or worse in each case.
Recall that the cache hit rate is calculated as the percentage
of (in this case int) load/store operations that occur when
the data is in the cache.
- Repeat the above for a direct-mapped cache (note that the replacement policy
plays no role in this case).
- Suppose we now had a line size of 16 bytes.
How would this affect the rate (consider the situation
in Q2 where you had a 0%, 50% and 100% hit rate)?
If so, does this necessarily mean the increase in line size would
affect the computer's execution rate?
- An assumption in the above is that there is no significant
amounts of data access other than to the array. Discuss the extent
you expect this to hold on both SPARC and Intel x86 architectures.
Hint: imagine writing the above loop in PeANUt; what other
memory accesses than the array itself would you have?
- Reconsider Question 1 for the random replacement policy.
For parts (c) and (d) , consider the following argument.
For N=2^15, we expect half of the array will be resident in cache
with a random replacement policy.
Thus half the memory accesses will hit, resulting in a 50% hit rate.
For N=2^15, we expect 1/4 of the array will be resident; thus
we expect a 25% hit rate.
Evidently, this argument was so persuasive that 2 PhD
students and 100 undergraduates did not challenge it when it was
proposed in 2007!
Is this argument correct? What tools* do you have in order to test it? If
it is correct, develop reasons to support the argument. If not correct,
what are the reasons why this is the case.
*The cache simulator developed in COMP2300 2008 is such a tool. For a
cache consisting of 214 4-byte lines that is fully
associative (-k 14), the cache hit rate for the first 10
repetitions, when the array size is 1, 2, and 4 times the cache size,
the following commands can be used:
cachesim2 -k 14 14 1 10
cachesim2 -k 14 14 2 10
cachesim2 -k 14 14 4 10
No answers will be published on this question. Its up to you and your
class to resolve it!
Laboratory Exercises
If you have read this document beforehand, you should be able to
complete the exercise to the end of the
section on
branching in one hour of your session. If need be, the rest of the
exercise should be completed for homework.
SPARC Assembly Language
As discussed in lectures the SPARC architecture is a load/store
architecture. All arithmetic and logical operations are carried out
between operands located in registers. Load and store instructions are
provided to load and store register contents from memory. The machine
has 32 registers available to the programmer at any one time. These
registers are logically divided into four sets:
- global registers (%g0-%g7 in SPARC assembly) are
for global register data, ie data that has meaning for the entire
program.
- local registers (%l0-%l7) are for local function
variables.
- out registers (%o0-%o7) are used as temporaries
and in passing arguments to functions
- in registers (%i0-%i7) are used as temporaries
and in receiving arguments from calling functions
Two of the out registers
%o6 and
%o7 have special use
and should not be used (
%o6 is the stack pointer or
%sp, while
%o7 is the called function's return
address). Also the first global register (
%g0) is special in
that it always returns zero when read and discards anything written to
it.
While the default for C programs files is to append a .c to the end
of the file name, for assembly programs it is common to append a
.s. Also, since compiling a C program is a two step process that first
involves translation of the C program to assembly and then from
assembly to machine code and linkage, we can use the C
compiler both to produce assembly code from an existing C program,
and to produce machine code and link our assembly programs.
gcc -S program.c # to produce assembly listings from C code
gcc program.S -o executable # to compile and link assembly code
- Log into the T2 SPARC via ssh wallaman.
-
Copy the files /dept/dcs/comp2300/public/lab9/* into
your lab9 folder.
|
We will start by looking at a very simple example, example1.S.
Open your file in your favorite text editor.
The program relies on the C pre-processor (cpp) to expand
macros, in an equivalent manner to concise macros in PeANUt (except we
can use it for register names as well as numbers). The suffix
convention .S instructs gcc to run cpp over
the text, before invoking the assembler (as).
-
Execute the command gcc -E example1.S. This applies cpp
only, and displays the assembler code with the macros expanded to the
standard output. This will make clear what the preprocessor has done
(note that cpp has also expanded the macros within the
assembler comments! This is because it does not recognize '!' as a
comment. For the same reason,
we cannot have an assembler comment on the
same line as a #define!
|
Things to note about the assembly code:
- The linker will look for for an address (pointing to a block of code)
labelled
main. This is declared to be .global in
an analogous manner to the
global definition in PeANUt and for the same reason.
- Thus the first user program instruction to be executed is save
%sp,-96,%sp. This command provides space, on the stack, to save
ALL general-purpose registers (if required).
- Most SPARC instructions take three operands: two registers and
a literal constant, or three registers:
op regrs1, reg_or_imm, regrd
Where the content of the first or source register
regrs1
is combined
with the literal or the contents of the second source register
regrs2
to produce a result that is stored in the destination register
regrd .
The number of bits available in the instruction word for storing
a literal constant allows for values from -4096 to +4095.
- mov moves either a register or literal from the source
to the destination register.
mov regrs1, regrd
- sub is obviously subtract, while add is
addition.
- The original SPARC Instruction Set Architecture (ISA) had no integer
multiplication or division instructions. Instead these operations
are accomplished by a series of bit shifts and additions. This
is, however, all available in system provided functions we
access via the
call .mul and call .div statements (we will come
back to this point at the very end of this lab!).
Note that in C, the .mul function would have the header:
int .mul(int x0, int x1) /* returns x0*x1 */;
and similarly for .div
- Notice how the operands for the multiplication and division are
placed in registers %o0 and %o1,
and the result is retrieved from %o0.
- Notice also that after each function call we have a branch delay
slot that has been filled via a nop or "No Operation"
instruction. This consumes one execution cycle
(in one of the integer pipelines) but does nothing.
-
Finally the last three instructions return us to the operating
system. The trap instruction ta calls the operating
system with the service request encoded into register
%g1 (not so different to how we terminated a PeANUt program).
A problem, that you may now have noticed, is that we have no means of
knowing if the program is correct since it produces no output! Just like
in PeANUt, input/output is an involved topic since it involves
translation of the number to an ASCII representation etc etc. At this
stage we could use the debugger (gdb) to run the code and
examine the contents of the various registers as the calculation
proceeds. This in itself would be a valuable exercise, but I will let
you pursue that in your leisure time. Instead we, will do basic integer
I/O through a user defined function that in C corresponds to:
void myprt(int y){
printf( "Integer value is %d\n", y);
}
We use this function in the program
example2.S.
Inspect the file. You should note the following:
- The call to myprt from the main program and how
the value of the
register to be printed is passed to the function.
- How (and where) the character string associated with printf() is
stored.
- How the base address for this character string, defined by
.str, is loaded into register %o0 via the
sethi (set high 22 bits of an immediate constant) and
subsequent or instructions.
- Why can the address of the print string not be set using a single
instruction? (it requires both the sethi and
or instructions outlined above).
-
Compile and run the
executable and ensure that it prints out the correct result (-8).
|
Branch Delay Slots
As mentioned in lectures, a branch delay slot is a characteristic of
RISC machines. In our program the
call instruction is an
example of an operation that has a branch delay slot and these are
currently filled with
nop instructions.
- Comment out the nop instructions
(remember to use a `!',
not a semicolon; a semicolon will have no effect in SPARC AL)
from immediately after the call
.mul AND the call .div lines. Recompile the code and
run it. Verify that the results are now wrong!
Restore the nop instructions before proceeding.
|
Branching
Like in PeANUt branch instructions test for condition codes in order
to determine if a branch condition exists. Branch instructions take
the following form:
bicc label
where
bicc stands for one of the branches testing
the integer condition codes. A few of the possible branch conditions
are given in the following table:
| Assembly
|
Action
|
| ba
| branch always
|
| bn
| branch never (why does this exist?)
|
| bl
| branch on less than zero
|
| ble
| branch on less or equal to zero
|
| be
| branch on equal to zero
|
| bne
| branch on not equal to zero
|
| bge
| branch on greater or equal to zero
|
| bg
| branch on greater than zero
|
To set the condition code we can use the following compare instruction
cmp regrs1, reg_or_imm
(this is just a shorthand for the more general
instruction subcc regrs1,reg_or_imm,%g0 , which
subtracts the two operands and sets the CPU's condition codes according
the result of the subtraction. As the destination register for the
result is is %g0, the result is effectively thrown away, but
the condition codes are set accordingly. So in the above table, comparing
the result of the subtraction to zero is equivalent to comparing the
two operands of the cmp with each other)
For example we might construct a conditional branch in the following manner:
sub x, 1, x ! x=x-1 /* assume x is aliased to a register, i.e. %l0 */
cmp x, 0 ! compare x with 0
bgt loop ! branch to loop if x > 0
nop ! note we again have a branch delay slot
The C program example3.c computes the value of the simple
polynomial for integer values of x from 0 to 10. Inspect this file.
- Compile and run example3.c making sure that the
results are what you expect them to be!
- Copy example2.S into example3.S, and modify
the code of example3.S to reproduce the results from
example3.c.
- Now modify myprt in example3.S
to also print x using the equivalent of
the following C statement:
printf("Value of y=%d, for x=%d\n", y, x);
- Once working, do a final check
for correctness with the command (must be from wallaman!):
previewAutoMark lab9 example3.S
When satisfied, submit it using the submit command:
submit comp2300 lab9 example3.S
submit comp6300 lab9 example3.S
This is due by 09 am Tuesday May 26 (the deadline is strict)
and will contribute up to 1% of your Tute/Lab mark.
|
Generating Assembly from C Code
- Generate assembly for this C code using
gcc -S -mcpu=v7 example3.c
|
A few differences from what we have done will be evident:
- Lots of header info for the main() function (that we
did not use in the .S programs).
- Slightly different stack pointer offsets in the save
instruction (it requires more space).
- Use of %g0 to initialize x to zero.
- All the nop instructions have been removed.
In fact, even though we have not asked for optimization, this particular
version of
gcc (4.0.4) performs many optimizations. This can
actually be a problem in some situations (e.g. in Assignment 3, it performs
an optimization which obscures memory hierarchy effects, and there seems
to be no way of stopping it from doing so).
|
Now generate assembly language for the UltraSPARC (and T2) instruction set
(which confirms to version 9 of the SPARC architecture specification).
gcc -O -S example3.c
If you now
inspect example3.S, you will notice that the calls
to .mul and .div have been removed. The UltraSPARC
does actually have multiply and divide instructions - but the
original SPARC architecture did not! So these instructions are
only used if you tell the compiler that you never want to run the
executable on older machines (which by these days will be hard indeed
to find anyway!).
|
Finally you should note that, given an executable image, it is possible
to disassemble this into assembly code. The command is:
/usr/ccs/bin/dis
Do
man dis on
wallaman.
- Compile the C version of example3.c
gcc -O example3.c -o example3
Run the executable and make sure it works. Then disassemble it
using the following command
/usr/ccs/bin/dis example3 > example3.dis
Inspect the resulting file using an editor.
Compare the code for _start with the pseudo-code
given in the O2 lecture.
You should be able to
locate the main program and function myprt and
relate the resulting assembly to what is given via
gcc -O -S example3.c
|
Delay slot exercise (only when you have completed the parts below)
If we know that the instruction immediately after call
instructions is always executed BEFORE jumping to whatever
address corresponds to the called function, we can restructure
the code to try and move suitable instructions from before the
call instruction into this branch delay slot. Obviously
these moved instructions cannot be instructions that determine
whether or not the branch is taken!
- Restructure the code in such
that there are NO nop instructions between the initial
save
instruction, and the mov %o0, y instruction, while ensuring
the the code also produces the correct result!
|
Filling delay slots and removing nop instructions is important as it
both reduces the size of the executable, and saves operations
(therefore speeds up the code). As we saw above,
this is normally done by the compiler, providing you use a compiler
flag that allows the compiler to use a moderate level of optimization.
Symbol Tables
Inspect the version of the
swap.c program in your
lab9
subdirectory. The file has been modified from the version discussed in
lectures to include a function that counts the number of times the
function
swap() is called.
For each symbol that is defined and referenced in swap.c,
indicate if it will have a symbol table entry in the .symtab
section in module swap.o. If so, indicate the symbol type
(function, object (=data), or undefined (=`notype', used for externally
defined symbols)), the symbol binding (local or global (external
symbols must also be global)) and the section (.text, .data or
.bss, or none) it occupies in that module.
| symbol |
has .symtab entry? |
symbol type |
symbol binding |
section |
| buf | | | | |
| bufp0 | | | | |
| bufp1 | | | | |
| swap | | | | |
| temp | | | | |
| incr | | | | |
| count | | | | |
Compile the program using gcc -c swap.c. Use the utility
program readelf -S -s swap.o (on Linux, note that
`Ndx' signifies section index) or /usr/ccs/bin/elfdump
swap.o (on wallaman) to verify your answers. (you may
notice that count is in a different section on
wallaman; remove the explicit initialization and repeat the
above).
Your colleague sees the above code and suggests that you should
replace the line
void incr()
with
static void incr()
why is this a good idea? (hint: use readelf/elfdump to compare the symbol
tables obtained for swap.o with and without this modification).
A Challenge!
Consider the following program, which consists of two object modules:
/* file foo.c */
void p2(void);
int main() {
p2();
return 0;
}
/* file bar.c */
#include <stdio.h>
char main;
void p2() {
printf("0x%x\n", main);
}
When this program is compiled and executed on an x86 Linux system, it
prints the string "0xffffff8d\n".
and terminates normally, even though
p2 never initializes the variable main. Explain exactly
why this value is printed out? Interestingly the same code fails to link
on wallaman, even though we are still compiling with gcc
- why is this different?
Extension: x86 Assembly Language.
If you have the time and curiosity, repeat the exercises of the section of generating assembly from C on x86
Linux (use objdump -D instead of dis).
A Further Challenge!
example1.S terminated using a ta 0 instruction.
example2.S terminates
using a ret followed by a restore instruction.
This was because it was discovered that, without that, the output
of example2 would disappear if it was re-directed to a file!
(which would cause its automated marking to fail!). Why is this ???
Last modified: 19/05/2009, 12:14
|
|
 |