Name: Student ID: Lab Group: Type in your answer to the following questions WITHIN this document. ----------------------------------------------------------------------- Q1. From the assembly cut out all those lines that form the main body of the loop. This is all assembly instructions that are repeated as the L900000107 loop is repeated. (Remember SPARC assembly has a branch delay slot) .L900000107: /* 0x002c 4 */ ldd [%o2],%f4 /* 0x0030 */ add %o3,2,%o3 /* 0x0034 */ add %o2,16,%o2 /* 0x0038 */ cmp %o3,%o5 /* 0x003c */ add %o1,16,%o1 /* 0x0040 */ ldd [%o1-16],%f10 /* 0x0044 */ faddd %f0,%f4,%f6 /* 0x0048 */ std %f6,[%o1-24] /* 0x004c */ ldd [%o2-8],%f8 /* 0x0050 */ faddd %f10,%f8,%f12 /* 0x0054 */ ldd [%o1-8],%f0 /* 0x0058 */ ble,pt %icc,.L900000107 /* 0x005c */ std %f12,[%o1-16] ----------------------------------------------------------------------- Q2. The compiler has automatically unrolled the loop. How many times has it been unrolled. Justify your answer. Twice. Each loop iteration a[k] = a[k] + b[k]; has 1 floating point add. The above assembly has 2 faddd operations. Also it have 4 loads (ldd) not 2 and 2 stores (std) not 1. Hence unrolled twice. ----------------------------------------------------------------------- Q3. Using the grouping rules for the SPARC architecture, and assuming that data is loaded/stored to level 1 cache (2 cycle latency), how many cycles will it take for 1 iteration of this unrolled loop to complete. Provide your answer in an analogous form to slide 8 in the loop optimisation lecture notes. It is important that I can follow your estimate, since it is likely to be wrong given what you know about SPARC, but that is okay and I will explain more in the answers. .L900000107: /* 0x0040 4 */ add %g5,2,%g5 /* 0x0044 */ add %o1,16,%o1 /* 0x0048 */ ldd [%o5],%f4 (cant do any more int instructions, ie cant do cmp) /* 0x004c */ cmp %g5,%o2 /* 0x0050 */ add %o5,16,%o5 (cant execute faddd as waiting for %f4 to arrive /* 0x0054 */ faddd %f0,%f4,%f6 /* 0x0058 */ ldd [%o1-16],%f10 /* 0x005c */ std %f6,[%o1-24] /* 0x0060 */ ldd [%o5-8],%f8 x wait for ldd to arrive /* 0x0064 */ faddd %f10,%f8,%f12 /* 0x0068 */ ldd [%o1-8],%f0 /* 0x006c */ ble,pt %icc,.L900000107 /* 0x0070 */ std %f12,[%o1-16] From the above we would actually expect 8 cycles for two loop iterations. So 4 cycles per iteration. In fact the compiler appears to estimate 5: do cc -fast -xarch=v8plusa -xprefetch=no -Wc,-Qms_pipe+info -S vadd.c and looking at Achieved II : 5 cycles Not sure why they estimate one extra cycle. Maybe they estimate longer latency for loads, or there something to do with the stores. ----------------------------------------------------------------------- Q4. Look at the code. Write down an expression in terms of a[i], a[i+n], a[i-n] etc to indicated what the program is doing. (Convince yourself that the program is working correctly, i.e. for small n work out results on paper by hand and make sure you agree with the program). a[i] = a[i] + a[i-1]; There is a dependency! ----------------------------------------------------------------------- Q5. Run runadd2 with input of 5, 10, 15, 20. You should find the "final a" output is correct for input values of 5 and 10, but wrong for input values of 15 and 20. (Note this for real life applications - works fine for small case, but error for large!) However the results for runadd1 are always correct. Give a detailed explanation as to what is happening and why runadd2 gives errors (but only in some cases). Code runadd2 is produced by compiling the code with the flag -xrestrict. This flag says that when I call a function and pass it two addresses, any operation that takes place within the function that uses those addresses (or offsets thereof) do not involve the same memory locations. This is called aliasing, ie having two names for the same memory location. In the following double array[500]; vadd(10, &array[0], &array[50]); // Case 1 vadd(51, &array[0], &array[50]); // Case 2 vadd(10, &array[0], &array[0]); // Case 3 void vadd(int n, double a[], double b[]){ int k; for (k = 0; k < n; k++){ a[k] = a[k] + b[k]; } } Case 1 does not involve aliasing. However increase the value of n from 10 to 51 and now we have an aliasing with a[51] equivalent to b[0]. Similiarly case 3 is clearly aliased as both a and b are the same. C assumes that all function calls could possibly involve aliasing. (Fortran defines that they dont - wrong Fortran if the user does otherwise). Adding the compiler flag tells the C compiler that you don not have aliasing in your code. Without aliasing there is a substantial performance advantage. It permits the compiler to unroll loops and overlap operations. Thus we can do something like 1 ld a[k] 2 ld b[k] 3 ld a[k+1] 4 ld b[k+1] fadd a[k]+b[k] //b[k] now arrived so do add 5 ld a[k+2] 6 ld b[k+2] fadd a[k+1]+b[k+1] //b[k] now arrived so do add 7 st a[k] However if a[k] and b[k+1] are the same locations we clearly have a problem - we are loading b[k+1] (step 4) before it has been updated with the result from the previous loop iteration (st a[k] in step 7). So we get the wrong result. Why do we only see wrong results for some values of N. Because the compiler realises that loop unrolling is expensive. Thus it only uses code with unrolled loops if the loop index is large. Otherwise it uses a non-unrolled loop. This limit appears to be around N=15, and indeed if we look at the assembly we find:- /* 0x0018 */ cmp %o1,15 ## %o1 is the value of N. This should be a leason to you that you can add a compiler flag have it look as if your code is working still - for a small problem case - but then find that it gives errors for a larger test case. ----------------------------------------------------------------------- Q6. Write the code for mydot_u2. Verify that it works correctly. If you have any problems ask your tutor. void mydot_u2(double x[], int n, double a[], int stda, double b[], int stdb, int offset){ int i; for (i=0; i