Name: Student ID: Lab Group: Type in your answer to the following questions WITHIN this document. ----------------------------------------------------------------------- Q1. From the assembly cut out all those lines that form the main body of the loop. This is all assembly instructions that are repeated as the .L3 loop is repeated. movsd (%rsi,%rax), %xmm8 // load a[k] xmm8 movsd 8(%rsi,%rax), %xmm7 // load a[k+1] xmm7 movsd 16(%rsi,%rax), %xmm6 // load a[k+2] xmm6 addsd (%rdx,%rax), %xmm8 // add b[k] xmm8 movsd 24(%rsi,%rax), %xmm5 // load a[k+3] addsd 8(%rdx,%rax), %xmm7 // add b[k+1] xmm7 movsd 32(%rsi,%rax), %xmm4 // load a[k+4] addsd 16(%rdx,%rax), %xmm6 // add b[k+2] xmm6 movsd 40(%rsi,%rax), %xmm3 // load a[k+5] addsd 24(%rdx,%rax), %xmm5 // add b[k+3] xmm5 movsd 48(%rsi,%rax), %xmm2 // load a[k+6] addsd 32(%rdx,%rax), %xmm4 // add b[k+4] xmm4 movsd 56(%rsi,%rax), %xmm1 // load a[k+7] addsd 40(%rdx,%rax), %xmm3 // add b[k+5] xmm3 addsd 48(%rdx,%rax), %xmm2 // add b[k+6] xmm2 movsd %xmm8, (%rsi,%rax) // st xmm8 a[k] addsd 56(%rdx,%rax), %xmm1 // add b[k+7] xmm1 movsd %xmm7, 8(%rsi,%rax) // st xmm7 a[k+1] movsd %xmm6, 16(%rsi,%rax) // st xmm6 a[k+2] movsd %xmm5, 24(%rsi,%rax) // st xmm5 a[k+3] movsd %xmm4, 32(%rsi,%rax) // st xmm4 a[k+4] movsd %xmm3, 40(%rsi,%rax) // st xmm3 a[k+5] movsd %xmm2, 48(%rsi,%rax) // st xmm2 a[k+6] movsd %xmm1, 56(%rsi,%rax) // st xmm1 a[k+7] addq $64, %rax // update loop counter +8*8B cmpq %rcx, %rax // compare with boundary jne .L3 // jump if i < N You can also see the loop preconditioning here movsd 8(%rsi), %xmm9 // N%8 = 7 movl $16, %eax addsd 8(%rdx), %xmm9 movsd %xmm9, 8(%rsi) .L34: movsd (%rsi,%rax), %xmm10 // N%8 = 6 addsd (%rdx,%rax), %xmm10 movsd %xmm10, (%rsi,%rax) addq $8, %rax .L33: movsd (%rsi,%rax), %xmm11 // N%8 = 5 addsd (%rdx,%rax), %xmm11 movsd %xmm11, (%rsi,%rax) addq $8, %rax .L32: movsd (%rsi,%rax), %xmm12 // N%8 = 4 addsd (%rdx,%rax), %xmm12 movsd %xmm12, (%rsi,%rax) addq $8, %rax .L31: movsd (%rsi,%rax), %xmm13 // N%8 = 3 addsd (%rdx,%rax), %xmm13 movsd %xmm13, (%rsi,%rax) addq $8, %rax .L30: movsd (%rsi,%rax), %xmm14 // N%8 = 2 addsd (%rdx,%rax), %xmm14 movsd %xmm14, (%rsi,%rax) addq $8, %rax .L29: movsd (%rsi,%rax), %xmm15 // N%8 = 1 addsd (%rdx,%rax), %xmm15 movsd %xmm15, (%rsi,%rax) addq $8, %rax cmpq %rcx, %rax je .L37 ----------------------------------------------------------------------- Q2. The compiler has automatically unrolled the loop. How many times has it been unrolled. Justify your answer. Eight times. Each loop iteration a[k] = a[k] + b[k]; has 1 floating point add. The first operand is loaded to xmmXX register using movsd instruction while the second operand is directly loaded by addsd instrution. Addsd instruction is a CISC instruction composed of ld and add. ----------------------------------------------------------------------- Q3. Look at the code. Write down an expression in terms of a[i], a[i+n], a[i-n] etc to indicated what the program is doing. (Convince yourself that the program is working correctly, i.e. for small n work out results on paper by hand and make sure you agree with the program). a[i] = a[i] + a[i-1]; There is a dependency! ----------------------------------------------------------------------- Q4. Run runadd2 with input of 5, 10, 15, 20. You should find the "final a" output is correct for input values of 5 and 10, but wrong for input values of 15 and 20. (Note this for real life applications - works fine for small case, but error for large!) However the results for runadd1 are always correct. Give a detailed explanation as to what is happening and why runadd2 gives errors (but only in some cases). Code runadd2 is produced by compiling the code without the flag --fno-strict-aliasing. By default, the flag -fstrict-aliasing is enabled in -O2 and -O3. The default flag says the compiler, that we have no problems with aliasing. Thus aliasing is allowed. By adding --fno-strict-alliasing we want the complier to do some check an prevent any aliasing we do not have problems with this. So using the default flag, compiler assumes we know what we are doing and does not check for aliasing. It strictly assumes that a and b do not overlap. Which is not true!! In the following double array[500]; vadd(10, &array[0], &array[50]); // Case 1 vadd(51, &array[0], &array[50]); // Case 2 vadd(10, &array[0], &array[0]); // Case 3 void vadd(int n, double a[], double b[]){ int k; for (k = 0; k < n; k++){ a[k] = a[k] + b[k]; } } Case 1 does not involve aliasing. However increase the value of n from 10 to 51 and now we have an aliasing with a[51] equivalent to b[0]. Similarly case 3 is clearly aliased as both a and b are the same. C assumes that all function calls could possibly involve aliasing. (Fortran defines that they don't - wrong Fortran if the user does otherwise). Adding the compiler flag tells the C compiler and __restrict__ that you don not have aliasing in your code. Without aliasing there is a substantial performance advantage. It permits the compiler to unroll loops and overlap operations. Thus we can do something like 1 ld a[k] 2 ld b[k] 3 ld a[k+1] 4 ld b[k+1] fadd a[k]+b[k] //b[k] now arrived so do add 5 ld a[k+2] 6 ld b[k+2] fadd a[k+1]+b[k+1] //b[k] now arrived so do add 7 st a[k] However if a[k] and b[k+1] are the same locations we clearly have a problem - we are loading b[k+1] (step 4) before it has been updated with the result from the previous loop iteration (st a[k] in step 7). So we get the wrong result. Why do we only see wrong results for some values of N. Because the compiler realises that loop unrolling is expensive. Thus it only uses code with unrolled loops if the loop index is large. Otherwise it uses a non-unrolled loop. This limit appears to be around N=9, and indeed if we look at the assembly we find:- This should be a reason to you that you can add a compiler flag have it look as if your code is working still - for a small problem case - but then find that it gives errors for a larger test case. ----------------------------------------------------------------------- Q5.You should clearly see the performance difference. Write down the Average MFLOPS of both version and explain how this is possible. ./runadd1_perf 1000 1000000 Total FP instruction 999002048 Avegrage MFLOPS 295.021 Process time 3.386s Wall time 3.460s ./runadd2_perf 1000 1000000 Start computation... Total FP instruction 1000004736 Avegrage MFLOPS 1225.564 Process time 0.816s Wall time 0.839s The version amusing no aliasing is 4x faster!! ----------------------------------------------------------------------- Q6. Write the code for mydot_u2. Verify that it works correctly. If you have any problems ask your tutor. void mydot_u2(double x[], int n, double a[], int stda, double b[], int stdb, int offset){ int i; for (i=0; i