CECS Home | ANU Home | Search ANU
The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science
Printer Friendly Version of this Document

UniSAFE

COMP2300 - TuteLab 09

COMP2300 / COMP6300

Tutorial Laboratory 09 - Caches, SPARC Assembly and
Linking and Loading

Semester 1, 2009         Week 11 (18 - 22 May)


As well as the Preparation Exercises for this class, you are expected to have revised lectures M3--M5 and O2 (of course it would be a very good idea to bring your notes with you to the session!). It will also be useful to have read this document beforehand.

Also for this session, there is a submitable laboratory exercise which is due by 09 am Tuesday 26 May (week 12), which will contribute up to 1% of your assessment (in the Tute/Lab mark).

Objectives

This lab has the following aims.
  • To deepen your understanding of caches.
  • To give you a brief introduction to the assembly language of a real machine, in this case SPARC assembly.
  • To understand load objects and symbol tables.

Preparation Exercises

the following questions on a separate sheet of paper, with your name and student number clearly written. Please ensure your writing is legible. Hand in to your tutor at the beginning of your tutorial / laboratory session.
  1. Define the terms direct-mapped cache and k-way set-associate cache. What are typical values for k?

  2. What is a cache line, and what lengths are these typically?

  3. What is a branch delay slot, and why were they introduced to RISC processors?

  4. Write the SPARC assembly language instruction that sets register %o2 to the sum of registers %i1 and %i2.

  5. Suppose you wished to write an assembly language program to call the C function int f(int x, int y). Which registers would you use to pass the parameters, and which would hold the return value?

Tutorial Questions on Caches

  1. Consider a program which repeatedly accesses all elements in an int array of of length N in a cyclic fashion (from first to last element), executing on a computer with a 64 KB data cache. Note that this is a cache version of Belady's anomaly (Lecture M2). For example, the computation could be:
    	while (1) {
    	  int s = 0, i;
    	  for (i = 0; i < N ; i++)
    	    s += a[i];
            }
    
    Suppose the first iteration has already been completed. Assume that the cache line size is 4 bytes, and the cache is fully associative (note: this is not a realistic cache!), and the replacement policy is least recently used Calculate the cache hit rate for (a) N = 213, (b) N = 214, (c) N = 215 and (d) N = 216. Hint: calculate the relative size of a[] to the cache first.

    Explain whether a random replacement policy performs better or worse in each case.

    Recall that the cache hit rate is calculated as the percentage of (in this case int) load/store operations that occur when the data is in the cache.

  2. Repeat the above for a direct-mapped cache (note that the replacement policy plays no role in this case).

  3. Suppose we now had a line size of 16 bytes. How would this affect the rate (consider the situation in Q2 where you had a 0%, 50% and 100% hit rate)? If so, does this necessarily mean the increase in line size would affect the computer's execution rate?

  4. An assumption in the above is that there is no significant amounts of data access other than to the array. Discuss the extent you expect this to hold on both SPARC and Intel x86 architectures. Hint: imagine writing the above loop in PeANUt; what other memory accesses than the array itself would you have?

  5. Reconsider Question 1 for the random replacement policy. For parts (c) and (d) , consider the following argument.
      For N=2^15, we expect half of the array will be resident in cache with a random replacement policy. Thus half the memory accesses will hit, resulting in a 50% hit rate.

      For N=2^15, we expect 1/4 of the array will be resident; thus we expect a 25% hit rate.

    Evidently, this argument was so persuasive that 2 PhD students and 100 undergraduates did not challenge it when it was proposed in 2007!

    Is this argument correct? What tools* do you have in order to test it? If it is correct, develop reasons to support the argument. If not correct, what are the reasons why this is the case.

    *The cache simulator developed in COMP2300 2008 is such a tool. For a cache consisting of 214 4-byte lines that is fully associative (-k 14), the cache hit rate for the first 10 repetitions, when the array size is 1, 2, and 4 times the cache size, the following commands can be used:

      cachesim2 -k 14 14 1 10
      cachesim2 -k 14 14 2 10
      cachesim2 -k 14 14 4 10

    No answers will be published on this question. Its up to you and your class to resolve it!

Laboratory Exercises

If you have read this document beforehand, you should be able to complete the exercise to the end of the section on branching in one hour of your session. If need be, the rest of the exercise should be completed for homework.

SPARC Assembly Language

As discussed in lectures the SPARC architecture is a load/store architecture. All arithmetic and logical operations are carried out between operands located in registers. Load and store instructions are provided to load and store register contents from memory. The machine has 32 registers available to the programmer at any one time. These registers are logically divided into four sets:
  • global registers (%g0-%g7 in SPARC assembly) are for global register data, ie data that has meaning for the entire program.
  • local registers (%l0-%l7) are for local function variables.
  • out registers (%o0-%o7) are used as temporaries and in passing arguments to functions
  • in registers (%i0-%i7) are used as temporaries and in receiving arguments from calling functions
Two of the out registers %o6 and %o7 have special use and should not be used (%o6 is the stack pointer or %sp, while %o7 is the called function's return address). Also the first global register (%g0) is special in that it always returns zero when read and discards anything written to it.

While the default for C programs files is to append a .c to the end of the file name, for assembly programs it is common to append a .s. Also, since compiling a C program is a two step process that first involves translation of the C program to assembly and then from assembly to machine code and linkage, we can use the C compiler both to produce assembly code from an existing C program, and to produce machine code and link our assembly programs.

  gcc -S program.c              # to produce assembly listings from C code
  gcc program.S -o executable   # to compile and link assembly code
  • Log into the T2 SPARC via ssh wallaman.
  • Copy the files /dept/dcs/comp2300/public/lab9/* into your lab9 folder.

We will start by looking at a very simple example, example1.S. Open your file in your favorite text editor.

The program relies on the C pre-processor (cpp) to expand macros, in an equivalent manner to concise macros in PeANUt (except we can use it for register names as well as numbers). The suffix convention .S instructs gcc to run cpp over the text, before invoking the assembler (as).
  • Execute the command gcc -E example1.S. This applies cpp only, and displays the assembler code with the macros expanded to the standard output. This will make clear what the preprocessor has done (note that cpp has also expanded the macros within the assembler comments! This is because it does not recognize '!' as a comment. For the same reason, we cannot have an assembler comment on the same line as a #define!

Things to note about the assembly code:

  • The linker will look for for an address (pointing to a block of code) labelled main. This is declared to be .global in an analogous manner to the global definition in PeANUt and for the same reason.
  • Thus the first user program instruction to be executed is save %sp,-96,%sp. This command provides space, on the stack, to save ALL general-purpose registers (if required).
  • Most SPARC instructions take three operands: two registers and a literal constant, or three registers:
    op regrs1, reg_or_imm, regrd
    Where the content of the first or source register regrs1 is combined with the literal or the contents of the second source register regrs2 to produce a result that is stored in the destination register regrd . The number of bits available in the instruction word for storing a literal constant allows for values from -4096 to +4095.
  • mov moves either a register or literal from the source to the destination register.
    mov regrs1, regrd
  • sub is obviously subtract, while add is addition.
  • The original SPARC Instruction Set Architecture (ISA) had no integer multiplication or division instructions. Instead these operations are accomplished by a series of bit shifts and additions. This is, however, all available in system provided functions we access via the call .mul and call .div statements (we will come back to this point at the very end of this lab!). Note that in C, the .mul function would have the header:
      int .mul(int x0, int x1) /* returns x0*x1 */;
    and similarly for .div
  • Notice how the operands for the multiplication and division are placed in registers %o0 and %o1, and the result is retrieved from %o0.
  • Notice also that after each function call we have a branch delay slot that has been filled via a nop or "No Operation" instruction. This consumes one execution cycle (in one of the integer pipelines) but does nothing.
  • Finally the last three instructions return us to the operating system. The trap instruction ta calls the operating system with the service request encoded into register %g1 (not so different to how we terminated a PeANUt program).
  • Compile and run the program on wallaman as follows:
      gcc example1.S -o example1
      ./example1
    

A problem, that you may now have noticed, is that we have no means of knowing if the program is correct since it produces no output! Just like in PeANUt, input/output is an involved topic since it involves translation of the number to an ASCII representation etc etc. At this stage we could use the debugger (gdb) to run the code and examine the contents of the various registers as the calculation proceeds. This in itself would be a valuable exercise, but I will let you pursue that in your leisure time. Instead we, will do basic integer I/O through a user defined function that in C corresponds to:

   void myprt(int y){
     printf( "Integer value is %d\n", y);
   }
We use this function in the program example2.S. Inspect the file. You should note the following:
  • The call to myprt from the main program and how the value of the register to be printed is passed to the function.
  • How (and where) the character string associated with printf() is stored.
  • How the base address for this character string, defined by .str, is loaded into register %o0 via the sethi (set high 22 bits of an immediate constant) and subsequent or instructions.
  • Why can the address of the print string not be set using a single instruction? (it requires both the sethi and or instructions outlined above).
  • Compile and run the executable and ensure that it prints out the correct result (-8).

Branch Delay Slots

As mentioned in lectures, a branch delay slot is a characteristic of RISC machines. In our program the call instruction is an example of an operation that has a branch delay slot and these are currently filled with nop instructions.

  • Comment out the nop instructions (remember to use a `!', not a semicolon; a semicolon will have no effect in SPARC AL) from immediately after the call .mul AND the call .div lines. Recompile the code and run it. Verify that the results are now wrong! Restore the nop instructions before proceeding.

Branching

Like in PeANUt branch instructions test for condition codes in order to determine if a branch condition exists. Branch instructions take the following form:
   bicc   label
where bicc stands for one of the branches testing the integer condition codes. A few of the possible branch conditions are given in the following table:

Assembly Action
ba branch always
bn branch never (why does this exist?)
bl branch on less than zero
ble branch on less or equal to zero
be branch on equal to zero
bne branch on not equal to zero
bge branch on greater or equal to zero
bg branch on greater than zero

To set the condition code we can use the following compare instruction

cmp regrs1, reg_or_imm

(this is just a shorthand for the more general instruction subcc regrs1,reg_or_imm,%g0 , which subtracts the two operands and sets the CPU's condition codes according the result of the subtraction. As the destination register for the result is is %g0, the result is effectively thrown away, but the condition codes are set accordingly. So in the above table, comparing the result of the subtraction to zero is equivalent to comparing the two operands of the cmp with each other)

For example we might construct a conditional branch in the following manner:

  sub x, 1, x       ! x=x-1 /* assume x is aliased to a register, i.e. %l0 */
  cmp x, 0          ! compare x with 0 
  bgt loop          ! branch to loop if x > 0
   nop              ! note we again have a branch delay slot

The C program example3.c computes the value of the simple polynomial for integer values of x from 0 to 10. Inspect this file.
  • Compile and run example3.c making sure that the results are what you expect them to be!
  • Copy example2.S into example3.S, and modify the code of example3.S to reproduce the results from example3.c.
  • Now modify myprt in example3.S to also print x using the equivalent of the following C statement:
        
            printf("Value of y=%d, for x=%d\n", y, x);
    
  • Once working, do a final check for correctness with the command (must be from wallaman!):
      previewAutoMark lab9 example3.S
    When satisfied, submit it using the submit command:

        submit comp2300 lab9 example3.S
        submit comp6300 lab9 example3.S

    This is due by 09 am Tuesday May 26 (the deadline is strict) and will contribute up to 1% of your Tute/Lab mark.

Generating Assembly from C Code

  • Generate assembly for this C code using
    gcc -S -mcpu=v7 example3.c

A few differences from what we have done will be evident:

  • Lots of header info for the main() function (that we did not use in the .S programs).
  • Slightly different stack pointer offsets in the save instruction (it requires more space).
  • Use of %g0 to initialize x to zero.
  • All the nop instructions have been removed.
In fact, even though we have not asked for optimization, this particular version of gcc (4.0.4) performs many optimizations. This can actually be a problem in some situations (e.g. in Assignment 3, it performs an optimization which obscures memory hierarchy effects, and there seems to be no way of stopping it from doing so).
  • Now generate assembly language for the UltraSPARC (and T2) instruction set (which confirms to version 9 of the SPARC architecture specification).
    gcc -O -S example3.c
    If you now inspect example3.S, you will notice that the calls to .mul and .div have been removed. The UltraSPARC does actually have multiply and divide instructions - but the original SPARC architecture did not! So these instructions are only used if you tell the compiler that you never want to run the executable on older machines (which by these days will be hard indeed to find anyway!).
  • Finally you should note that, given an executable image, it is possible to disassemble this into assembly code. The command is:
    /usr/ccs/bin/dis
    Do man dis on wallaman.

    • Compile the C version of example3.c
      gcc -O example3.c -o example3
      Run the executable and make sure it works. Then disassemble it using the following command
      /usr/ccs/bin/dis example3 > example3.dis
      Inspect the resulting file using an editor. Compare the code for _start with the pseudo-code given in the O2 lecture. You should be able to locate the main program and function myprt and relate the resulting assembly to what is given via
      gcc -O -S example3.c

    Delay slot exercise (only when you have completed the parts below)

    If we know that the instruction immediately after call instructions is always executed BEFORE jumping to whatever address corresponds to the called function, we can restructure the code to try and move suitable instructions from before the call instruction into this branch delay slot. Obviously these moved instructions cannot be instructions that determine whether or not the branch is taken!

    • Restructure the code in such that there are NO nop instructions between the initial save instruction, and the mov %o0, y instruction, while ensuring the the code also produces the correct result!

    Filling delay slots and removing nop instructions is important as it both reduces the size of the executable, and saves operations (therefore speeds up the code). As we saw above, this is normally done by the compiler, providing you use a compiler flag that allows the compiler to use a moderate level of optimization.

    Symbol Tables

    Inspect the version of the swap.c program in your lab9 subdirectory. The file has been modified from the version discussed in lectures to include a function that counts the number of times the function swap() is called.

  • For each symbol that is defined and referenced in swap.c, indicate if it will have a symbol table entry in the .symtab section in module swap.o. If so, indicate the symbol type (function, object (=data), or undefined (=`notype', used for externally defined symbols)), the symbol binding (local or global (external symbols must also be global)) and the section (.text, .data or .bss, or none) it occupies in that module.

    symbol has .symtab entry? symbol type symbol binding section
    buf     
    bufp0    
    bufp1    
    swap     
    temp     
    incr     
    count    

  • Compile the program using gcc -c swap.c. Use the utility program readelf -S -s swap.o (on Linux, note that `Ndx' signifies section index) or /usr/ccs/bin/elfdump swap.o (on wallaman) to verify your answers. (you may notice that count is in a different section on wallaman; remove the explicit initialization and repeat the above).
  • Your colleague sees the above code and suggests that you should replace the line
        void incr()
        
    with
        static void incr()
        
    why is this a good idea? (hint: use readelf/elfdump to compare the symbol tables obtained for swap.o with and without this modification).

    A Challenge!

    Consider the following program, which consists of two object modules:
        /* file foo.c */
        void p2(void);
        int main() { 
          p2();
          return 0;
        }
        
        /* file bar.c */
        #include <stdio.h>
        char main;
        void p2() {
          printf("0x%x\n", main);
        }
    
    When this program is compiled and executed on an x86 Linux system, it prints the string "0xffffff8d\n". and terminates normally, even though p2 never initializes the variable main. Explain exactly why this value is printed out? Interestingly the same code fails to link on wallaman, even though we are still compiling with gcc - why is this different?


    Extension: x86 Assembly Language.

    If you have the time and curiosity, repeat the exercises of the section of generating assembly from C on x86 Linux (use objdump -D instead of dis).

    A Further Challenge!

    example1.S terminated using a ta 0 instruction. example2.S terminates using a ret followed by a restore instruction. This was because it was discovered that, without that, the output of example2 would disappear if it was re-directed to a file! (which would cause its automated marking to fail!). Why is this ???

    Last modified: 19/05/2009, 12:14

  • Copyright | Disclaimer | Privacy | Contact ANU