Name: A. Very-Good-Student Student ID: u9507815 Lab Group: Afterhours Type in your answer to the following questions WITHIN this document. ----------------------------------------------------------------------- Q1: What machine (use hostname) are your running on? I will use iwaki - becaue you told me to. ----------------------------------------------------------------------- Q2: What is its clock rate (use fpversion)? Kernel says CPU's clock rate is 1015.0 MHz. But this is not quite right - it should be 1050MHz. ----------------------------------------------------------------------- Q3: For BOTH the L1 and L2 caches what are their i) sizes, ii) cache line sizes, iii) associativity. -xcache=64/32/4:8192/512/2 L1 64KB /32byte cache line/4 way set associative L1 8MB /512byte cache line/2 way set associative ----------------------------------------------------------------------- Q4: Given the superscalar nature of the UltraSPARC II and the grouping rules discussed in lectures. Theoretically for the above code it should be possible to complete 1 iteration every two clock cycles. Explain why this is the case. You should know that this is not an easy question for me to answer yet you need to tell me a bit more information in the lectures. But I'm a good student so I will tell you the answer anyway. The performance is limited by the ability to load the data items into registers. Each iteration requires a[k] and b[k] from memory while sum is stored in a register. Thus this requires two loads - takes two cycles. (If you want more details see Lecture 11!) ----------------------------------------------------------------------- Q5. If the line sum = sum + a[k]*b[k]; were replaced by the line sum = sum + a[k]*(a[k]+b[k]); how many cycles would now be required to complete 1 iteration? Explain your answer. It would still take 2 cycles - we still need to load two items per cycle (a[k] and b[k]) which will take two cycles, we also need to do two floating point additions that will take two cycles, but just one multiplication that will take just one cycle. ----------------------------------------------------------------------- Q6. Complete the following table. ------------------------------------------------------ nsize nrpt Cycles Instr Cycle/iter Inst/cycle ------------------------------------------------------ 10 1000000 185210329 159000900 18.521 0.858488 10 1000000 185004389 159000935 18.500 0.859444 10 1000000 185197235 159000412 18.519 0.858546 100 100000 37247056 57700028 3.7247 1.54912 100 100000 37004833 57700052 3.7004 1.55926 100 100000 36982409 57700033 3.6982 1.56020 1000 10000 25447984 50770003 2.5448 1.99505 1000 10000 25267133 50770006 2.5267 2.00933 1000 10000 25327356 50770004 2.5327 2.00455 10000 1000 68569752 50077000 6.8569 0.73030 << no longer L1 10000 1000 68625650 50077005 6.8625 0.72971 10000 1000 68614502 50077004 6.8614 0.72983 100000 100 86793469 50007782 8.6793 0.57617 100000 100 86741601 50007787 8.6741 0.57651 100000 100 86746775 50007774 8.6746 0.57648 1000000 10 459882985 50017181 45.9883 0.10876 << no longer L2 1000000 10 481541803 50017488 48.1542 0.10386 1000000 10 442053840 50017283 44.2054 0.11314 ------------------------------------------------------ ----------------------------------------------------------------------- Q7. In the above table you should find that with increasing nsize the performance initially improves, but then gets worse. Explain why this is. Increases nsize increase the length of the inner loop, which gives better pipelining and better performance. However as nsize increases vectors a and b are no longer able to fit in level 1 cache, and we have to obtain data from level 2 cache. At this point performance degrades, and this is especially true when we exceed L2 cache. ----------------------------------------------------------------------- Q8. Earlier we stated that the best performance obtainable for this loop was 2 cycles per iteration of the loop. What percentage of this theoretical peak is i) your fastest result ii) your slowest result? (You should find there is a big difference between the best and worse). Best is about 2.52 cycles per iteration = 2/2.52*100 = 79% of peak Worse case is about 44.54 cycles per iteration = 4.5% of peak ----------------------------------------------------------------------- Q9. By setting the relevant performance counters we can gather information on the level 1 data cache and level 2 or "ecache" hit and miss rate. State which performance counters you would use? Performance is dominated by cache reads - so for L1 we only need worry about rd and - as stated - only the data cache Hence use pic0=DC_rd,pic1=DC_rd_miss subtracting the results to get cache hits, then express as % of total refs For Level 2 ecache we chose pic0=EC_ref,pic1=EC_misses again subtracting the results to get cache hits, then express as % of total refs ----------------------------------------------------------------------- Q10. Complete the following table: ------------------------------------------------------ Level 1 D-Cache Reads nsize nrpt ref misses hits ------------------------------------------------------ 10 1000000 36000204 192 36000012 10 1000000 36000429 253 36000176 10 1000000 36000658 560 36000098 100 100000 21600006 58 21599948 100 100000 21600063 119 21599944 100 100000 21600019 195 21599824 1000 10000 20160019 2197 20157822 1000 10000 20160004 1139 20158865 1000 10000 20160005 1070 20158935 10000 1000 20016376 2378922 17637454 10000 1000 20016120 2366926 17649194 10000 1000 20016166 2376517 17639649 100000 100 20002301 3425997 16576304 100000 100 20002475 3401246 16601229 100000 100 20002182 3369780 16632402 1000000 10 20026161 4507328 15518833 1000000 10 20025026 4489650 15535376 1000000 10 20025780 4596030 15429750 ------------------------------------------------------ ------------------------------------------------------ Level 2 D-Cache Access nsize nrpt ref misses hits ------------------------------------------------------ 10 1000000 37272556 28 37272528 10 1000000 36806764 50 36806714 10 1000000 36416682 49 36416633 100 100000 21937046 538 21936508 100 100000 21948125 765 21947360 100 100000 21937666 363 21937303 1000 10000 20201945 538 20201407 1000 10000 20222481 251 20222230 1000 10000 20210452 0 20210452 10000 1000 22047493 3821 22043672 10000 1000 22048377 358 22048019 10000 1000 22048381 24 22048357 100000 100 22238285 71048 22167237 100000 100 22235492 31127 22204365 100000 100 22247413 48344 22199069 1000000 10 22513639 2068318 20445321 1000000 10 22510869 2034030 20476839 1000000 10 22512263 2055160 20457103 ------------------------------------------------------ Note dcache ref basically constant. For small nsize=10,100,1000 we see relatively few dcache misses. At nsize=10000 dcache misses increase markedly, but ecache misses are low. Finally for nsize=10000,1000000 we see marked increase in dcache misses. ----------------------------------------------------------------------- Q11. Run the code several times. What is the minimum number of cycles required? Minimum Cycles = 327547 (approx) ----------------------------------------------------------------------- Q12. The code contains the line #pragma align 32 (ll) what does this line do? (Guess!) Ensures that the first element in the array of ll structures (ll[0]) starts at a 32 byte boundary. The aim is to ensure that it begins on a cache line boundary. Note that element ll[1] need not be aligned on a 32 byte boundary. ----------------------------------------------------------------------- Q13. Add code to printout the address of each of the following ll[k].number ll[k].x1 ll[k].x2 ll[k].previous for k=0,1,2 and 1023. What are these addresses? Code segment i=0; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1023; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); i=1024; printf("i=%4d %p %p %p %p %p %p %p %p\n",i, &ll[i].number,&ll[i].x0,&ll[i].y0,&ll[i].z0, &ll[i].x1,&ll[i].y1,&ll[i].z1,&ll[i].previous); Results i= 0 21320 21328 21330 21338 21340 21348 21350 21358 i= 1 21360 21368 21370 21378 21380 21388 21390 21398 i=1023 312e0 312e8 312f0 312f8 31300 31308 31310 31318 i=1024 31320 31328 31330 31338 31340 31348 31350 31358 (NOTE this is all printed out in HEX!) We see that each element of the structure array (ll) is 64bytes removed from the previous one. So each member occupies exactly 2 cache lines. Note also that the member "number" that is of type "int" actually occupies 8 bytes - not four. The compiler has added padding for performance reasons. (Note the pointer to vector also occupies 8 bytes - this is an address - which since the code is compiled in 32 bit mode should also be 32 bits long.) Note that structure element number 1024 is exactly 64kbytes further along in memory compared to element 0. ----------------------------------------------------------------------- Q14. Change the definition of vector to struct vector { int number; double x0, x1, double y0, y1; double z0, z1; struct vector *previous; } ll[NUM]; what is the minimum cycle time now? Cycle time of approximately 163961 - this is a reduction of about 2. The key part of the code is sum += fabs(ll[i].x0 - ll[i].x1); and now elements x0 and x1 are located on the same cache line rather than different cache lines - so we 1/2 the number of cache line fetches. ----------------------------------------------------------------------- Q15. Change the definition of vector to struct vector { int number; struct vector *previous; double x0, x1, double y0, y1; double z0, z1; } ll[NUM]; what is the minimum cycle time now? Minimum cycle time of about 193992. More than above because now each element of the structure is 8 bytes short of the cache line size. Previously everytime we accessed x0 the alignment ensured that x1 was also on the same cache line. Now x0 and x1 were periodically be on different cache lines. So number of cache line misses increases and performacne goes down. ----------------------------------------------------------------------- Q16. Complete the following table: --------------------------------- Stride Exp1 Exp2 Exp3 --------------------------------- 1 765390 764882 764855 2 765978 765990 765513 3 766038 766239 766106 1024 1952114 1955451 1955396 2048 2144166 2134521 2142082 4096 2177314 2171634 2178835 --------------------------------- ----------------------------------------------------------------------- Q17. You should find that the performance decreases markedly for the last 3 cases. Explain why this is the case? The L1 cache is 64Kbytes 4 way set associate. Each struct ll is of size 32bytes (4 doubles). Thus in an array of ll elements ll[0] and ll[n*2048] will be separated by 64kbytes, and as a consequence map to the same cache line. We get poor performace because the cache line is constantly being trashed. Smaller strides than 2048 are also poor the 4 way associativity means that addresses separated by cache_size/associativity actually map to the same location. ----------------------------------------------------------------------- Q18. Perform additional measurements with stides close to 1024, eg stride=1023,1022,1021... etc. When exactly does the performance change markedly from good to bad or vice versa? --------------------------------- Stride Exp1 Exp2 Exp3 --------------------------------- 2046 800198 799503 799633 2047 1213737 1212612 1212596 2048 2127922 2129580 2139048 2049 770309 769303 769837 2050 800405 800366 801117 2051 770102 769996 769740 --------------------------------- So poor performance occurs for 2047 as well as 2048! ----------------------------------------------------------------------- Q19. You might have expected the above transition only to occur for stride=1024 or multiples thereof - why? Essentially for the same reasons as outlined in Q17. ----------------------------------------------------------------------- Q20. Can you suggest why a few values around multiples of 1024 also exhibit poor performance? The conflict occurs if we want to load multiple data items from the same cache line, but find that between doing one load and issuing the next load from that cache line it has been evicted from the cache. The ordering of the loads is determined by the compiler based on available registers. The compiler unrolls this loop and this gives rise to some conflicts for nearby strides. ----------------------------------------------------------------------- Q21. Recode this loop to remove this problem for ( i = 0; i < NUM; i++) { s11=ll[i+1*stride].x1; s12=ll[i+1*stride].x2; s13=ll[i+1*stride].x3; s14=ll[i+1*stride].x4; s21=ll[i+2*stride].x1; s22=ll[i+2*stride].x2; s23=ll[i+2*stride].x3; s24=ll[i+2*stride].x4; s31=ll[i+3*stride].x1; s32=ll[i+3*stride].x2; s33=ll[i+3*stride].x3; s34=ll[i+3*stride].x4; s41=ll[i+4*stride].x1; s42=ll[i+4*stride].x2; s43=ll[i+4*stride].x3; s44=ll[i+4*stride].x4; ll[i].x1 = ll[i].x1 + s11+s21+s31+s41; ll[i].x2 = ll[i].x2 + s12+s22+s32+s42; ll[i].x3 = ll[i].x3 + s13+s23+s33+s43; ll[i].x4 = ll[i].x4 + s14+s24+s34+s44; } and define all the s variables as register double. We now copy entire cache lines to register variables before moving on to the next (conflicting) cache line. ----------------------------------------------------------------------- Q22. Read the header to the stream_d.c and try compiling the code accordingly. You will find that the code fails to link, with a message like Undefinedfirst referenced symbol in file mysecond stream_d.o Look in second_wall.c and fix the code so that it does link. What did you do? (Hint the problem is a common one when using mixtures of Fortran and C code!). Remove the _ (underscore from the mysecond function implementation) ----------------------------------------------------------------------- Q21. Run the code on iwaki and partch and complete the following table. MACHINE 1 - iwaki ------------------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Copy: 345.2333 0.0961 0.0927 0.1038 Scale: 337.3712 0.0978 0.0949 0.1020 Add: 345.2813 0.1429 0.1390 0.1488 Triad: 346.0659 0.1415 0.1387 0.1428 MACHINE 2 - Partch (gcc -O3) ------------------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Copy: 1709.6675 0.0773 0.0187 0.2379 Scale: 1689.8979 0.1025 0.0189 0.2271 Add: 2032.1765 0.1070 0.0236 0.2357 Triad: 2023.1743 0.0767 0.0237 0.2318 ----------------------------------------------------------------------- Q22. The Copy operation corresponds to: for (i = 0; i < LENGTH; i++) a[i] = b[i]; write similar expressions for Scale, Add and Triad scale for (i = 0; i < LENGTH; i++) a[i] = scalar*b[i]; add for (i = 0; i < LENGTH; i++) a[i] = b[i]+c[i]; triad for (i = 0; i < LENGTH; i++) a[i] = b[i]+scalar*c[i]; ----------------------------------------------------------------------- Q23. Given you knowledge of the UltraSPARC and Pentium IV system would you expect Stream to report any difference in the MB/s rate for each of the above operations (Copy, Scale, Add, Triad)? Provide justification for you answer. The results reported are MB/s obtained while undertaking the above operations (given in Q22). All these operations are limited by memory traffic. The ultrasparc and Pentium IV can only do EITHER one load or one store per cycle. So the MB/s rate would be expected to be the same for all loops that are memory traffic limited -regardless of whether this is load or store operations. NOTE some machines (particularly vector processors) could run 1 load and 1 store per cycle (Fujitsu VPP300) - or even 2 loads and 1 store (Cray YMP) per cycle. If this were true you could see different MB/sec rates for the different operations depending on the balance between loads and stores. ----------------------------------------------------------------------- Q24. Now recompile the code using the -fast option. Record new data (it should be better than before). MACHINE 1 - Iwaki ------------------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Copy: 923.7029 0.0368 0.0346 0.0385 Scale: 953.4608 0.0353 0.0336 0.0385 Add: 921.4116 0.0543 0.0521 0.0588 Triad: 996.9081 0.0501 0.0481 0.0512 Note significant improvement - just by a compiler option! ----------------------------------------------------------------------- Q25. Based on the MB/s results obtained using the -fast compiler flag compute MFLOP rates for the Add and Triad results assuming we are working with 8byte data quantities. Show how you compute these rates. How do these results compare with the theoretical peak? Taking Iwaki results with -fast Add: 249.5632 0.3013 0.1923 0.4476 Triad: 290.9602 0.2590 0.1650 0.3785 Add: 921.4116 0.0543 0.0521 0.0588 Triad: 996.9081 0.0501 0.0481 0.0512 Add requires 2 loads and 1 store or 3*8bytes per 1 result or single FLOP. Hence performance is (921.4116/24)*1 = 38.39Mflops Triad also requires 24 bytes but for 2 FLOPs. Hence performance is (996.9081/24)*2 = 83.08Mflops. This is a long way from peak - but remember we are streaming data from main memory - not from cache. ----------------------------------------------------------------------- Q26. From the web site find the following Stream results * A single CPU on the Compaq_AlphaServer_ES45-68/1000 (This is the same as the nodes on the APAC System at the ANU Supercomputer Facilty) * A single Pentium IV system (fastest possible clock) * The machine that gives the best single processor results Perhaps the following Compaq_AlphaServer_ES45-68/1000 1 1946.0 1940.7 1978.1 1978.1 Intel_875_P4_2.80 1 2391.0 2376.0 2742.0 2715.0 NEC_SX-5-16A 1 42545.0 42546.0 47780.0 47779.0 Note partch is a 2.6GHz Pentium IV system. (See /proc/cpuinfo on partch) -----------------------------------------------------------------------