Name: A. Very-Good-Student Student ID: u9507815 Lab Group: Afterhours Type in your answer to the following questions WITHIN this document. ----------------------------------------------------------------------- Q1: What machine (use hostname) are your running on? I will use wallaman - becaue you told me to. ----------------------------------------------------------------------- Q2: What is its clock rate (use fpversion)? Kernel says CPU's clock rate is 1165.4 MHz. ----------------------------------------------------------------------- Q3: For BOTH the L1 and L2 caches what are their i) sizes, ii) cache line sizes, iii) associativity. -xcache=8/16/4:4096/64/16 L1 8KB Cache/ 16B Cache Line / 4 way set associative L2 4MB Cache/ 64B Cache Line / 16 way set associative ----------------------------------------------------------------------- Q4: The wrapper functions add overhead to the event counts. Devise a simple program which will roughly determine the extent of the overhead of hwevts_begin and hwevts_end. How does this define the way you will use these counters? A simple test involving a loop which calls Loop: hwevts_begin() hwevts_end() print ctr1, ctr2 repeatedly would give you an idea of the overheads of using performance counters. If we are measuring the number of instructions and the number of cycles, the counters should return zero at every iteration because there is nothing done between hwevts_begin() and hwevts_end(). Instead, however, the number of instructions reported (for my test code) is: Iter 1: Instr = 1487, Cycles = 4504 Iter 2: Instr = 83, Cycles = 3734 Iter 3: Instr = 83, Cycles = 3514 Iter 4: Instr = 84, Cycles = 3735 Iter 5: Instr = 83, Cycles = 3625 Iter 6: Instr = 83, Cycles = 3624 Thus, this suggests that we should discard the first measurement entirely, as it has very high overheads probably due to various initializations. We must also be aware of the ~83 Instr and ~3600 cycle overhead, and only measure blocks of code which has instruction counts and cycle counts that are much larger than these overhead values. ----------------------------------------------------------------------- Q5: For one thread, the UltraSPARC T2 processor can initiate one floating point operation each clock cycle. Given this how many cycles might you expect the above loop to take? This is a difficult question to answer without having a look at the assembly code that is generated by the compiler. However, by researching how the T2's pipeline works, we are able to come up with an educated guess of sorts. sum = sum + a[k]*b[k]; Assuming that "sum" is in register r0 throughout the calculation and a[k],b[k] is in L1 Cache. In the best possible case we should only require around 1 cycle per operation, however, due to dependencies between instructions the pipeline is stalled often. We know that there is a 3 cycle store for loads from L1, and 6 cycles for dependant F.P instructions (eg. fadd and fmul), so the pipeline may look like this (this depends on the compiler). Cycle Instr 1 load a[k] -> r1 2 load b[k] -> r2 3 (Stall) - wait for r2 4 (Stall) 5 fmul r1, r2 -> r3 6 (Stall) - wait for r3 7 (Stall) 8 (Stall) 9 (Stall) 10 (Stall) 11 fadd r3, r0 -> r0 Thus 11 cycles will be required provided the above assumptions are true. For instance, if the data is not in L1 cache, then the penalty can be much greater ie. 26 cycles form L2, 100+ cycles from Memory. There are also cycle costs from carrying out the loop which we have not accounted for, such as the 5 cycle branch misprediction penalty. If loop overhead is included, then we may require up to 14 cycles, ie. add i, 1 -> i (increment i) cmp i, nsize (compare ) ble (branch) This is the best case scenario for loop overhead costs, which assumes that there is no branch misprediction. Therefore, approximately 14 cycles ----------------------------------------------------------------------- Q6: If the line sum = sum + a[k]*b[k]; were replaced by the line sum = sum + a[k]*(a[k]+b[k]); how many cycles would now be required to complete 1 iteration? Cycle Instr 1 load a[k] -> r1 2 load b[k] -> r2 3 (Stall) - wait for r2 4 (Stall) 5 fadd r1, r2 -> r3 6 (Stall) - wait for r3 7 (Stall) 8 (Stall) 9 (Stall) 10 (Stall) 11 fmul r1, r3 -> r4 12 (Stall) - wait for r4 13 (Stall) 14 (Stall) 15 (Stall) 16 (Stall) 17 fadd r4, r0 -> r0 Approximately 17 cycles [+ 3 cycle for loop overhead] based on the assumptions made in the previous answer. ----------------------------------------------------------------------- Q7: ------------------------------------------------------ nsize nrpt Cycles Cycle/iter Inst/cycle ------------------------------------------------------ 100 1000000 3943695427 39.437 0.189 << Affected by overheads 100 1000000 4013297330 40.133 0.185 100 1000000 3943827166 39.438 0.189 200 100000 496404518 24.820 0.281 200 100000 492951027 24.648 0.283 200 100000 504751131 25.238 0.276 400 100000 834809890 20.870 0.323 400 100000 867805332 21.695 0.310 400 100000 878312843 21.958 0.307 800 100000 2594343668 32.429 0.204 << No longer in L1 800 100000 2595687172 32.446 0.204 800 100000 2595446203 32.443 0.204 1000 10000 313267617 31.327 0.210 1000 10000 313570779 31.357 0.210 1000 10000 313427427 31.343 0.210 10000 1000 274707779 27.471 0.237 10000 1000 274664051 27.466 0.237 10000 1000 274673065 27.467 0.237 100000 100 271073767 27.107 0.240 100000 100 270989741 27.099 0.240 100000 100 271103848 27.110 0.240 1000000 10 608934748 60.893 0.107 << No longer in L2 1000000 10 609097859 60.910 0.107 1000000 10 608941342 60.894 0.107 ------------------------------------------------------ ----------------------------------------------------------------------- Q8: In the above table you should find that with increasing nsize the performance initially improves, but then gets worse. Explain why this is. Increases nsize increase the length of the inner loop, which gives better pipelining and better performance (and less counter overheads). However as nsize increases vectors a and b are no longer able to fit in level 1 cache, and we have to obtain data from level 2 cache. At this point performance degrades, and this is especially true when we exceed L2 cache. ----------------------------------------------------------------------- Q9: Earlier you were asked to predict the best performance obtainable for this loop on the UltraSPARC T2. What percentage of this theoretical peak is i) your fastest result ii) your slowest result? (You should find there is a big difference between the best and worse). Assuming the best possible is 14 cycles (based on our assumptions in Q5) Best is around 21 cycles per iteration = 67% of peak (14/21) Worst is around 60 cycles per iteration = 23% of peak (14/60) ----------------------------------------------------------------------- Q9. By setting the relevant performance counters we can gather information on the level 1 data cache and level 2 or "ecache" hit and miss rate. State which performance counters you would use? The event sets in hwevts.c for CPC is static char *CPC_EVENTS[NUM_EVENTS][2] = { // libcpc hardware event names for T2 {"Instr_cnt", "DTLB_miss"}, {"Instr_ld", "Instr_st"}, {"DC_miss", "L2_dmiss_ld"}, {"CPU_ld_to_PCX", "CPU_st_to_PCX"}, }; Performance is dominated by cache reads (to the data cache) so we will be interested in the counters "DC_miss" (L1 Data Miss), "L2_dmiss_ld" (L2 Data Miss). To find the L1 Miss %, we also need the number of L1 cache accesses. We assume that this number is equivalent to the number of loads and stores, thus we also need to use the counters "Instr_ld" and "Instr_st". To find the L2 Miss %, we assume that the number of L2 accesses is the number of L1 cache misses as the L2 is the next level of storage up in the memory heirarchy, and is thus accessed once we miss L1 cache. ----------------------------------------------------------------------- Q10. Complete the following table: ------------------------------------------------------ Level 1 D-Cache Reads nsize nrpt Ref Miss (%) ------------------------------------------------------ 100 1000000 220000292 235702 (0.107%) 100 1000000 220000292 206917 (0.094%) 100 1000000 220000292 120038 (0.055%) 200 100000 41400248 555495 (1.323%) 200 100000 41400248 433284 (1.032%) 200 100000 41400248 464796 (1.107%) 400 100000 82000292 9634130 (11.749%) 400 100000 82000292 8417385 (10.265%) 400 100000 82000292 9263667 (11.297%) 800 100000 162000292 80900935 (49.939%) 800 100000 162000292 80901134 (49.939%) 800 100000 162000292 80900868 (49.939%) 1000 10000 20200292 10090059 (49.950%) 1000 10000 20200292 10090112 (49.950%) 1000 10000 20200292 10090064 (49.950%) 10000 1000 20020292 10009066 (49.995%) 10000 1000 20020292 10009081 (49.995%) 10000 1000 20020292 10009066 (49.995%) 100000 100 20002292 10000954 (49.999%) 100000 100 20002292 10000952 (49.999%) 100000 100 20002292 10000930 (49.999%) 1000000 10 20000492 10000113 (49.999%) 1000000 10 20000492 10000110 (49.999%) 1000000 10 20000492 10000115 (49.999%) ------------------------------------------------------ ------------------------------------------------------ Level 2 D-Cache Reads nsize nrpt Ref Miss (%) ------------------------------------------------------ 100 1000000 100 1000000 100 1000000 200 100000 200 100000 200 100000 400 100000 400 100000 400 100000 800 100000 800 100000 800 100000 1000 10000 1000 10000 1000 10000 10000 1000 10009066 4 (0.000%) 10000 1000 10009081 0 (0.000%) 10000 1000 10009108 2 (0.000%) 100000 100 10000930 121 (0.001%) 100000 100 10000952 9 (0.000%) 100000 100 10000954 506 (0.005%) 1000000 10 10000113 2500060 (25.000%) 1000000 10 10000110 2500060 (25.000%) 1000000 10 10000112 2500060 (25.000%) ------------------------------------------------------ We see an increase in L1 cache misses after nsize=400, when the arrays exceeded L1 cache. We can do a calculation by hand to verify this (nsize x 2 arrays x 8 byte array element = size of arrays in bytes). The percentage of misses evens out at 50% as nsize increases. This is because two array elements (8 bytes x 2 = 16 bytes) can sit on an L1 cache line, thus there will be one miss out of every two load if data access is contiguous. L2 cache misses were at zero until nsize = 10000, and were negligible until nsize=1000000 where the combined sizes of the arrays exceeded the capacity of the L2 cache. ----------------------------------------------------------------------- Q11. Run the code several times. What is the minimum number of cycles required? Minimum Cycles = 379701 (approx) ----------------------------------------------------------------------- Q12. The code contains the line #pragma align 16 (ll) what does this line do? (Guess!) Ensures that the first element in the array of ll structures (ll[0]) starts at a 16 byte boundary. The aim is to ensure that it begins on a cache line boundary. Note that element ll[1] need not be aligned on a 32 byte boundary. ----------------------------------------------------------------------- Q13. Add code to printout the address of each of the following l[k].number ll[k].x1 ll[k].x2 ll[k].previous for k=0,1,2 and 1023. What are these addresses? Code segment i=0; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1023; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1024; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); Results i= 0 21320 21328 21330 21338 21340 21348 21350 21358 i= 1 21360 21368 21370 21378 21380 21388 21390 21398 i=1023 312e0 312e8 312f0 312f8 31300 31308 31310 31318 i=1024 31320 31328 31330 31338 31340 31348 31350 31358 (NOTE this is all printed out in HEX!) We see that each element of the structure array (ll) is 64bytes removed from the previous one. So each member occupies exactly 2 cache lines. Note also that the member "number" that is of type "int" actually occupies 8 bytes - not four. The compiler has added padding for performance reasons. (Note the pointer to vector also occupies 8 bytes - this is an address - which since the code is compiled in 32 bit mode should also be 32 bits long.) Note that structure element number 1024 is exactly 64kbytes further along in memory compared to element 0. ----------------------------------------------------------------------- Q14. Change the definition of vector to struct vector { double x0, x1, double y0, y1; double z0, z1; int number; struct vector *previous; } ll[NUM]; what is the minimum cycle time now? Cycle time of approximately 224833 - this is almost a 2 fold reduction in the number of cycles. The key part of the code is sum += fabs(ll[i].x0 - ll[i].x1); and now elements x0 and x1 are located on the same cache line rather than different cache lines - so we 1/2 the number of cache line fetches. ----------------------------------------------------------------------- Q15. Complete the following table: --------------------------------- Stride Exp1 Exp2 Exp3 --------------------------------- 1 248338 248338 248447 2 232631 232521 232521 4 225382 225382 225382 16 223075 223075 223076 32 222197 222197 222197 64 365862 365862 365862 128 394199 394308 394199 256 450325 450325 450325 512 450435 450435 450435 --------------------------------- ----------------------------------------------------------------------- Q16. You should find that the performance decreases markedly for the last 3 cases. Explain why this is the case? The L1 cache is 8Kbytes 4 way set associate. Each struct ll is of size 32bytes (4 doubles). Thus in an array of ll elements ll[0] and ll[n*256] will be separated by 8kbytes, and as a consequence map to the same cache line. We get poor performace because the cache line is constantly being trashed. Smaller strides than 256 are also poor the 4 way associativity means that addresses separated by cache_size/associativity actually map to the same location. ----------------------------------------------------------------------- Q17. Perform additional measurements with stides close to 256, eg stride=253,254,255,257... etc. When exactly does the performance change markedly from good to bad or vice versa? --------------------------------- Stride Exp1 Exp2 Exp3 --------------------------------- 253 268328 268437 268437 254 269426 269426 269426 255 270524 270524 270524 256 450325 450325 450325 257 272611 272502 272611 258 273600 273600 273600 --------------------------------- So poor performance occurs for 256 only, due to cache line thrashing ----------------------------------------------------------------------- Q18. You might have expected the above transition only to occur for stride=256 or multiples thereof - why? Why does it occur for 64 and 128? Essentially for the same reasons as outlined in Q16, ie. associativity. ----------------------------------------------------------------------- Q19. Recode this loop to remove this problem. (HINT you want to persuade the compiler to move all data from a cache line into registers before that cache line is evicted!) register double s11, s12, s13, s14, s21, s22, s23, s24; register double s31, s32, s33, s34, s41, s42, s43, s44; for ( i = 0; i < NUM; i++) { s11=ll[i+1*stride].x1; s12=ll[i+1*stride].x2; s13=ll[i+1*stride].x3; s14=ll[i+1*stride].x4; s21=ll[i+2*stride].x1; s22=ll[i+2*stride].x2; s23=ll[i+2*stride].x3; s24=ll[i+2*stride].x4; s31=ll[i+3*stride].x1; s32=ll[i+3*stride].x2; s33=ll[i+3*stride].x3; s34=ll[i+3*stride].x4; s41=ll[i+4*stride].x1; s42=ll[i+4*stride].x2; s43=ll[i+4*stride].x3; s44=ll[i+4*stride].x4; ll[i].x1 = ll[i].x1 + s11+s21+s31+s41; ll[i].x2 = ll[i].x2 + s12+s22+s32+s42; ll[i].x3 = ll[i].x3 + s13+s23+s33+s43; ll[i].x4 = ll[i].x4 + s14+s24+s34+s44; } and define all the s variables as register double. We now copy entire cache lines to register variables before moving on to the next (conflicting) cache line. ----------------------------------------------------------------------- Q20. Run the code on wallaman and partch and complete the following table. MACHINE 1 - wallaman (cc) ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Copy: 198.1155 0.1620 0.1615 0.1620 Scale: 222.4759 0.1439 0.1438 0.1439 Add: 251.4602 0.1909 0.1909 0.1911 Triad: 240.3618 0.1998 0.1997 0.2005 MACHINE 2 - partch (gcc -O2) ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Copy: 2844.9220 0.0113 0.0112 0.0113 Scale: 2692.2698 0.0119 0.0119 0.0121 Add: 3032.7578 0.0159 0.0158 0.0164 Triad: 3015.2432 0.0159 0.0159 0.0161 Partch has better memory bandwidth overall than wallaman. The bottleneck for wallaman is probably not its memory subsystem, but its relatively underpowered processors. ----------------------------------------------------------------------- Q21. The Copy operation corresponds to: for (i = 0; i < LENGTH; i++) a[i] = b[i]; write similar expressions for Scale, Add and Triad scale for (i = 0; i < LENGTH; i++) a[i] = scalar*b[i]; add for (i = 0; i < LENGTH; i++) a[i] = b[i]+c[i]; triad for (i = 0; i < LENGTH; i++) a[i] = b[i]+scalar*c[i]; ----------------------------------------------------------------------- Q22. Given you knowledge of computer hardware would you expect Stream to report any difference in the MB/s rate for each of the above operations (Copy, Scale, Add, Triad)? Provide justification for you answer. The results reported are MB/s obtained while undertaking the above operations (given in Q21). All these operations are limited by memory traffic. The Ultrasparc and Opteron can only do EITHER one load or one store per cycle. So the MB/s rate would be expected to be the same for all loops that are memory traffic limited -regardless of whether this is load or store operations. The differences between copy, scale, add, triad may also be due to the number of FLOPS carried out in each. The following table summarises how MB/s is counted in stream ------------------------------------------------------------------ name kernel bytes/iter FLOPS/iter ------------------------------------------------------------------ COPY: a(i) = b(i) 16 0 SCALE: a(i) = q*b(i) 16 1 SUM: a(i) = b(i) + c(i) 24 1 TRIAD: a(i) = b(i) + q*c(i) 24 2 ------------------------------------------------------------------ source: http://www.cs.virginia.edu/stream/ref.html#counting NOTE some machines (particularly vector processors) could run 1 load and 1 store per cycle (Fujitsu VPP300) - or even 2 loads and 1 store (Cray YMP) per cycle. If this were true you could see different MB/sec rates for the different operations depending on the balance between loads and stores. ----------------------------------------------------------------------- Q23. Now recompile the code (on wallaman) using the -fast option. Record new data (it should be better than before). MACHINE 1 - Wallaman ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Copy: 1222.0270 0.0262 0.0262 0.0262 Scale: 1024.8858 0.0312 0.0312 0.0313 Add: 845.9203 0.0568 0.0567 0.0568 Triad: 830.9547 0.0578 0.0578 0.0578 Note significant improvement - just by a compiler option! ----------------------------------------------------------------------- Q24. Based on the MB/s results obtained using the -fast compiler flag compute MFLOP rates for the Add and Triad results assuming we are working with 8byte data quantities. Show how you compute these rates. How do these results compare with the theoretical peak? Taking wallaman results with -fast Add: 845.9203 0.0568 0.0567 0.0568 Triad: 830.9547 0.0578 0.0578 0.0578 Add requires 2 loads and 1 store, thus 3 * 8 bytes = 24 bytes per 1 result or single FLOP. Hence performance is (845.9203/24)*1 = 35.25 Mflops Triad also requires 24 bytes but for 2 FLOPs. Hence performance is (830.9547/24)*2 = 69.25 Mflops. This is a long way from peak - but remember we are streaming data from main memory - not from cache. ----------------------------------------------------------------------- Q25. The stream performance of Wallaman is rather poor. How would you modify the stream benchmark in order to give a more reasonable assessment of Wallaman's memory bandwidth? We only used a single thread in the above streams experiments. However, wallaman has 8 physical processors on a single chip, which support 64 hardware threads. The T2 is designed to work best in a multi-threaded environment, for example where multiple threads are accessing memory simultaneously, thus using a single thread only exploits a tiny fraction of its capabilities. If we compile stream for multiple threads (using OpenMP) > cc -xopenmp -fast -o stream stream.c And set the number of threads to 8 > export OMP_NUM_THREADS=8 The streams bandwidth results is totally different from before: ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Copy: 6386.0040 0.0051 0.0050 0.0052 Scale: 5502.9819 0.0063 0.0058 0.0095 Add: 4812.9714 0.0100 0.0100 0.0100 Triad: 4438.2212 0.0109 0.0108 0.0110 ------------------------------------------------------------- ----------------------------------------------------------------------- Q26. From the web site work out the following * The Pentium IV system that gives the fastest single processor performance. Compare this with your results on partch. Intel_875_P4_2.80 1Proc 2391.0 2376.0 2742.0 2715.0 * What stream preformance might you expect for the APAC national facility SGI system (16 Processors per rack) SGI_Altix_3700 16Proc 26364.0 27058.0 31518.0 31818.0 Note: This system has been replaced by a cluster made up of Sun Blade X6275 Server Modules. * The machine that gives the best single processor results NEC_SX-5-16A 1 42545.0 42546.0 47780.0 -----------------------------------------------------------------------