Name: A. Very-Good-Student
Student ID: u9507815
Lab Group: Afterhours

Type in your answer to the following questions WITHIN this document.
-----------------------------------------------------------------------
Q1: Transfer lab2.tar.gz to the XE cluster.

I downloaded the tar file to my desktop, then clicked Places -> Connect
to Server..., filled in the login details and home directory, opened the
folder and transferred the package.
-----------------------------------------------------------------------
Q2: What machine (use hostname) are you running on?

xe
-----------------------------------------------------------------------
Q3: Is the PAPI module loaded? If it is not, load the PAPI module.

Yes. I used 'module load papi' and checked it by typing 'module list'.
-----------------------------------------------------------------------
Q4: What version of PAPI did you load into XE memory (papi_version)?

4.1.1.0
-----------------------------------------------------------------------
Q5: What is the frequency of XE's CPUs (papi_avail)?

CPU Megahertz       : 2799.700928
CPU Clock Megahertz : 2799

The frequency is 2.8 GHz.
-----------------------------------------------------------------------
Q6: How many events can you run simultaneously and multiplexed
(papi_avail)?

Number Hardware Counters : 5   : I can run up to 5 counters at a time.
Max Multiplex Counters   : 512 : Using multiplexing, I can run up to
                                 512 counters.
-----------------------------------------------------------------------
Q7: How many events are available on XE (papi_avail)?

Of 107 possible events, 51 are available, of which 12 are derived.
I can use the 51 events present on the Intel Core 2. The remaining
counters are unavailable because they do not make sense for this CPU
(for example, L3-related events cannot be counted since there is no
L3 cache).
-----------------------------------------------------------------------
Q8: What is the name of the event that counts the number of CPU clock
cycles (papi_avail)?

PAPI_TOT_CYC
-----------------------------------------------------------------------
Q9 and Q10: Write down the sizes of the L1 and L2 caches
(papi_mem_info).

Using papi_mem_info, Cache Information:

L1 Data Cache:
  Total size:      32 KB
  Line size:       64 B
  Number of lines: 512
  Associativity:   8

L1 Instruction Cache:
  Total size:      32 KB
  Line size:       64 B
  Number of lines: 512
  Associativity:   8

L2 Unified Cache:
  Total size:      6144 KB
  Line size:       64 B
  Number of lines: 98304
  Associativity:   24
-----------------------------------------------------------------------
Q11: The overhead of PAPI can be measured with papi_cost. What is the
overhead of i) starting PAPI, ii) reading a PAPI counter and
iii) resetting a PAPI counter, in terms of clock cycles and seconds?

I took the mean cycle counts from the papi_cost output and converted
them to time as cycles / CPU_FREQ (2.8 GHz):

i)   starting PAPI              16175 cycles   5.78 us
ii)  reading a PAPI counter      8135 cycles   2.91 us
iii) resetting a PAPI counter    2491 cycles   0.89 us
-----------------------------------------------------------------------
Q12: Is the overhead of PAPI comparable with gettimeofday, which we used
in the last lab?

The overhead of gettimeofday is less than a microsecond, while reading a
PAPI counter costs about 3 us. Both are of the same order of magnitude,
so their overheads are comparable.
-----------------------------------------------------------------------
-------------------- Simplified PAPI C interface ----------------------
-----------------------------------------------------------------------
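The lab's simplified interface itself is not reproduced in this
document, so purely as an illustration, here is a minimal sketch of how
the MFLOPS figures for Loop 1 (defined in Q14 below) could be obtained
with PAPI's high-level PAPI_flops() call. The file name, array
initialisation and error handling are my own assumptions, not the lab
code.

    /* loop1_flops.c - sketch only, not the lab harness.
     * The first PAPI_flops() call starts the counters; the second
     * reports the floating-point operations and MFLOPS since then. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    #define NSIZE 100            /* first row of the Q13 table */
    #define NRPT  10000000

    int main(void)
    {
        double *a = malloc(NSIZE * sizeof(double));
        double *b = malloc(NSIZE * sizeof(double));
        double sum = 0.0;
        float rtime, ptime, mflops;
        long long flpops;

        for (int k = 0; k < NSIZE; k++) { a[k] = 1.0; b[k] = 2.0; }

        if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
            return 1;

        for (int r = 0; r < NRPT; r++)
            for (int k = 0; k < NSIZE; k++)      /* Loop 1 */
                sum = sum + (a[k] * b[k]);

        PAPI_flops(&rtime, &ptime, &flpops, &mflops);
        printf("sum=%g  flpops=%lld  MFLOPS=%.1f\n", sum, flpops, mflops);
        return 0;
    }

Compiled with something like 'gcc -O3 loop1_flops.c -lpapi' after
'module load papi'; printing sum stops the compiler from removing the
loop entirely.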
Q13: Complete the following table.

------------------------------------------------------------------
                                         Loop 1        Loop 2
   nsize        nrpt    Data size        MFLOPS        MFLOPS
------------------------------------------------------------------
     100    10000000       1.6 KB          1793          2643
    1000     1000000        16 KB          1858          2782
   10000      100000       160 KB          1864          2785
  100000       10000       1.6 MB          1853          2747
 1000000        1000        16 MB           438           640
10000000         100       160 MB           436           644
------------------------------------------------------------------
-----------------------------------------------------------------------
Q14: Why does loop 2 reach better performance?

Loop 1: sum = sum + (a[k]*b[k]);
Loop 2: sum = sum + a[k] * (a[k]+b[k]);

Assume sum sits in a register.

Loop 1: we have to load 2 operands from memory to perform 2 FLOPS
        (1x add and 1x mul).
Loop 2: we have to load 2 operands from memory to perform 3 FLOPS
        (2x add and 1x mul).

Both loops are limited by memory accesses, but loop 2 does more
floating-point work per load, so it reports a higher MFLOP rate. In
loop 1 we could perform more operations, but we have to wait for data.
-----------------------------------------------------------------------
Q15: In the above table you should find that with increasing nsize the
performance initially improves, but then gets worse. Explain why this
is.

For very small arrays a and b, the overhead of the outer loop has an
impact on the performance.
For small arrays (a few KB up to 6 MB), the entire arrays sit in cache,
which has a lower latency than main memory.
For big arrays, we are forced to work with arrays stored in main
memory. This increases the cache miss rate and reduces performance.

GOAL: keep the working set in CACHE.
-----------------------------------------------------------------------
Q16: Complete the following table.

----------------------------------------
                          Loop 3
Threshold value           MFLOPS
----------------------------------------
          0.5               1256
          0.66              1337
          0.75              1478
          0.9               1952
          0.95              2235
          0.99              2510
          0.999             2668
----------------------------------------
-----------------------------------------------------------------------
Q17: How would you explain these results?

The performance is influenced by branch misprediction. When the
threshold value is 0.5, the branch is utterly unpredictable, because
50% of the cases take the true branch and 50% the false branch. A
misprediction means the processor has already started executing one
side of the branch; once it knows that guess was wrong, it has to
cancel the instructions executed so far and start on the correct path.
When the threshold is increased, the if statement becomes more and more
predictable. To reach peak performance, the misprediction rate should
be kept below 1%.
-----------------------------------------------------------------------
-------------------- High level PAPI C interface ----------------------
-----------------------------------------------------------------------
Q18 and Q19: For one thread, the Intel Xeon processor can initiate two
floating point operations (one addition and one multiplication) each
clock cycle. Given this, how many cycles might you expect the above
loops to take?

Assuming no waiting for data and no dependency stalls:

Loop 1: total clock cycles = (nsize * nrpt * 2) / 2
Loop 2: total clock cycles = (nsize * nrpt * 3) / 2
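As a sketch of how the instruction and cycle counts in the next table
could be gathered (assuming the PAPI_start_counters() /
PAPI_stop_counters() calls from the PAPI 4.x high-level API; the lab's
actual harness may differ):

    /* instr_cycles.c - sketch only: count total instructions and cycles
     * around Loop 1 and report the instructions-per-cycle ratio. */
    #include <stdio.h>
    #include <papi.h>

    #define NSIZE 1000
    #define NRPT  1000000

    static double a[NSIZE], b[NSIZE];

    int main(void)
    {
        int events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };
        long long counts[2];
        double sum = 0.0;

        for (int k = 0; k < NSIZE; k++) { a[k] = 1.0; b[k] = 2.0; }

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        for (int r = 0; r < NRPT; r++)
            for (int k = 0; k < NSIZE; k++)      /* Loop 1 */
                sum = sum + (a[k] * b[k]);

        PAPI_stop_counters(counts, 2);
        printf("sum=%g  instr=%lld  cycles=%lld  instr/cycle=%.2f\n",
               sum, counts[0], counts[1], (double)counts[0] / counts[1]);
        return 0;
    }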
------------------------------------------------------------------------
Q20: Complete the following table.

------------------------------------------------------------------------
                                          Loop 1           Loop 2
   nsize        nrpt    Data size     instr / cycles   instr / cycles
------------------------------------------------------------------------
     100    10000000       1.6 KB        205/320          305/323
    1000     1000000        16 KB        200/302          300/302
   10000      100000       160 KB        200/301          300/302
  100000       10000       1.6 MB        200/302          300/305
 1000000        1000        16 MB        200/1300         300/1326
10000000         100       160 MB        200/1305         300/1324
------------------------------------------------------------------------

Loop 1: about 1 instruction per 1.5 cycles when the data sits in cache.
We cannot execute 2 per cycle because the multiply and add are dependent
in this code, and the multiplication takes more time, so the adder has
to wait for the multiplier.
Loop 2: as long as we are not working from main memory, we can execute
about 1 instruction per cycle.
------------------------------------------------------------------------
Q21: Open the Makefile and change the compiler optimization level from
-O3 to -O1, rebuild the code by typing make clean and make, and measure
the first and the last row again. Is there any difference? Do not forget
to put back -O3 when finished.

There is a significant difference for small arrays.
Loop 1: using -O1, the peak performance is only 392 MFLOPS (it was ~1800).
Loop 2: using -O1, the peak performance is only 913 MFLOPS (it was ~2800).
------------------------------------------------------------------------
Q22: Which performance counters would you use in order to determine the
level 1 data cache and level 2 miss rates?

PAPI_L1_DCM
PAPI_L1_DCA
PAPI_L2_TCM  - L2 is unified (data + instructions in the same cache)
PAPI_L2_TCA
------------------------------------------------------------------------
Q22-25: Complete the following table. Work only with the first loop.

----------------------------------------------------------+------------------------------
                                 L1 data cache             |      L2 (unified) cache
   nsize      nrpt  Data size  misses accesses  miss rate  | misses  accesses  miss rate
----------------------------------------------------------+------------------------------
     100  10000000     1.6 KB     282    2.09G  0.0000001  |    428      1925    0.222
    1000   1000000      16 KB   16.7K    2.09G  0.000008   |    524     18.5K    0.0283
   10000    100000     160 KB    250M    2.00G  0.125      |    252      694M    0.00000
  100000     10000     1.6 MB    250M    2.00G  0.125      |   3990      693M    0.00006
 1000000      1000      16 MB    250M    2.00G  0.125      |   250M      489M    0.51
10000000       100     160 MB    250M    2.00G  0.125      |   250M      489M    0.51
-------------------------------------------------------------------------------------------

a) For very small arrays, the L1 miss rate is negligible.
b) With increasing size, L1 becomes too small to accommodate the
   vectors, so the miss rate goes up. Why does the L1 miss rate stay at
   12.5%? Recall that on a cache miss we fetch a complete cache line
   (64 B = 8 elements). Reading the first element causes a cache miss,
   but the next 7 end up as cache hits: 1 miss per 8 accesses = 12.5%.
c) For small arrays we barely use L2: the number of accesses is tiny.
   When L1 is too small, the number of L2 accesses goes up, but the miss
   rate remains low (we are working in L2). When the size grows beyond
   6 MB, we have to go to main memory and the L2 miss rate grows
   rapidly.
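A sketch (illustration only, not the lab harness) of how the four
counters from Q22 could be read in one run to produce the miss-rate
columns above; with the five hardware counters noted in Q6 all four
events may fit together, otherwise two runs or multiplexing would be
needed:

    /* miss_rates.c - sketch: L1 data-cache and L2 miss rates, Loop 1. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(int argc, char **argv)
    {
        long nsize = (argc > 1) ? atol(argv[1]) : 10000;  /* table row */
        long nrpt  = (argc > 2) ? atol(argv[2]) : 100000;
        int events[4] = { PAPI_L1_DCM, PAPI_L1_DCA,
                          PAPI_L2_TCM, PAPI_L2_TCA };
        long long c[4];
        double *a = malloc(nsize * sizeof(double));
        double *b = malloc(nsize * sizeof(double));
        double sum = 0.0;

        for (long k = 0; k < nsize; k++) { a[k] = 1.0; b[k] = 2.0; }

        if (PAPI_start_counters(events, 4) != PAPI_OK)
            return 1;
        for (long r = 0; r < nrpt; r++)
            for (long k = 0; k < nsize; k++)     /* Loop 1 */
                sum = sum + (a[k] * b[k]);
        PAPI_stop_counters(c, 4);

        printf("sum=%g\n", sum);
        printf("L1D miss rate = %.7f\n", (double)c[0] / c[1]);
        printf("L2  miss rate = %.7f\n", (double)c[2] / c[3]);
        return 0;
    }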
-----------------------------------------------------------------------
------------------------- COMP6464 Only : Stream ----------------------
-----------------------------------------------------------------------
Q26: Run the code on XE and partch and complete the following table.

MACHINE 1 - xe (gcc -O3)
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           3369.0880     0.0099     0.0095     0.0102
Scale:          3291.8286     0.0099     0.0097     0.0105
Add:            3505.4166     0.0144     0.0137     0.0153
Triad:          3432.7370     0.0144     0.0140     0.0152

MACHINE 2 - partch (gcc -O2)
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           2855.8179     0.0124     0.0112     0.0214
Scale:          2862.7619     0.0117     0.0112     0.0148
Add:            3098.1901     0.0156     0.0155     0.0161
Triad:          3042.9799     0.0158     0.0158     0.0160

XE has better memory bandwidth overall than partch. XE's memory is
built from DDR2-800 modules with a peak bandwidth of 6.4 GB/s.
-----------------------------------------------------------------------
Q27: The Copy operation corresponds to:

    for (i = 0; i < LENGTH; i++) a[i] = b[i];

Write similar expressions for Scale, Add and Triad.

Scale:
    for (i = 0; i < LENGTH; i++) a[i] = scalar*b[i];
Add:
    for (i = 0; i < LENGTH; i++) a[i] = b[i]+c[i];
Triad:
    for (i = 0; i < LENGTH; i++) a[i] = b[i]+scalar*c[i];
-----------------------------------------------------------------------
Q28: Given your knowledge of computer hardware, would you expect Stream
to report any difference in the MB/s rate for each of the above
operations (Copy, Scale, Add, Triad)? Provide justification for your
answer.

The results reported are MB/s obtained while undertaking the operations
given in Q27. All these operations are limited by memory traffic. The
Core 2 and Opteron can only do EITHER one load or one store per cycle,
so the MB/s rate would be expected to be the same for all loops that
are limited by memory traffic, regardless of whether they are load or
store operations. The differences between Copy, Scale, Add and Triad
may also be due to the number of FLOPS carried out in each. The
following table summarises how MB/s is counted in Stream:

------------------------------------------------------------------
name     kernel                     bytes/iter    FLOPS/iter
------------------------------------------------------------------
COPY:    a(i) = b(i)                    16             0
SCALE:   a(i) = q*b(i)                  16             1
SUM:     a(i) = b(i) + c(i)             24             1
TRIAD:   a(i) = b(i) + q*c(i)           24             2
------------------------------------------------------------------
source: http://www.cs.virginia.edu/stream/ref.html#counting

NOTE: some machines (particularly vector processors) could run 1 load
and 1 store per cycle (Fujitsu VPP300), or even 2 loads and 1 store per
cycle (Cray YMP). If this were the case, you could see different MB/s
rates for the different operations depending on the balance between
loads and stores.
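As a consistency check on the rates above (assuming the STREAM default
array length of N = 2,000,000 doubles, which this run may or may not
have used): STREAM charges Copy 16 bytes per iteration and divides by
the minimum time, so for the xe Copy row

    rate = 16 B * 2,000,000 / 0.0095 s ~= 3.37e9 B/s ~= 3369 MB/s,

which matches the reported figure (STREAM counts 1 MB as 10^6 bytes).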