Name:
Student ID:
Lab Group:

Type in your answer to the following questions WITHIN this document.

-----------------------------------------------------------------------
-----------------------------------------------------------------------
-------------------------OpenMP part-----------------------------------
-----------------------------------------------------------------------
-----------------------------------------------------------------------
Q1. Run the code by typing fibonacci 1 30. What value is printed out? Make sure you understand what the code is doing.

The output is:
  fib(30) = 832040
  Time to calculate fib 0.428s

The main function creates a group of threads (line 72). Then only a single thread enters the function fib(n). Inside the function, two OpenMP tasks are generated. Before we can finish the function, we need to know the results of those tasks, which is why we call the task synchronization (omp taskwait). All tasks are executed in parallel.

-----------------------------------------------------------------------
Q2. Now run the code with different number of threads and compare the execution time. Comment on the observed behaviour.

  fibonacci 1 30    0.432s
  fibonacci 2 30    1.943s
  fibonacci 4 30    4.311s
  fibonacci 8 30    8.213s

The more threads I used, the slower the execution I observed. This looks suspicious. The problem is that I generated a huge number of tasks while the amount of work per task is tiny (one addition). Thus, the overhead of the tasks is enormous.

-----------------------------------------------------------------------
Q3. Verify if the code is really being executed by multiple threads by adding an appropriate function call (printf) into the fib function.

We can put this call just above the return statement (line 32):

  printf("Task [%d,%d], thread %d \n", n-1, n-2, omp_get_thread_num());

The output:

  ./fibonacci 4 6
  Task [1,0], thread 1
  Task [1,0], thread 1
  Task [2,1], thread 1
  Task [3,2], thread 1
  Task [1,0], thread 3
  Task [1,0], thread 1
  Task [2,1], thread 1
  Task [3,2], thread 3
  Task [1,0], thread 2
  Task [2,1], thread 2
  Task [4,3], thread 2
  Task [5,4], thread 0
  fib(6) = 8
  Time to calculate fib 0.000s

We can clearly see which thread executes each task.

-----------------------------------------------------------------------
Q4. Implement vector addition using OpenMP tasks. Verify that you get consistent results for varying number of threads.

  for (size_t rpt = 0; rpt < nrpt; rpt++) {
    #pragma omp parallel
    {
      #pragma omp single
      for (size_t i = 0; i < size; i++){
        #pragma omp task
        {
          c[i] = a[i] + b[i];
        }
      }
    }
  }

We remove the "parallel for" construct and replace it with "omp parallel" and "omp single". As tasks intended for parallel execution must be created inside a parallel region, we have to use the parallel construct. But we also need the "single" construct to prevent the loop from being executed multiple times, once by every thread. Finally, we create a task for every element addition inside the loop; a small consistency check is sketched below.
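To verify that the results stay consistent for varying numbers of threads, a standalone check along these lines can be used (a sketch only; the array size and names are illustrative and not part of the lab skeleton; compile with -fopenmp):

  // Sketch of a consistency check: run the task-based addition with 1, 2 and
  // 4 threads and compare the result against a serial reference.
  #include <omp.h>
  #include <cstdio>

  int main(){
    const size_t size = 16 * 1024;
    float *a = new float[size], *b = new float[size];
    float *c = new float[size], *ref = new float[size];
    for (size_t i = 0; i < size; i++){
      a[i] = (float) i;  b[i] = 2.0f * (float) i;  ref[i] = a[i] + b[i];
    }

    const int thread_counts[] = {1, 2, 4};
    for (int t = 0; t < 3; t++){
      omp_set_num_threads(thread_counts[t]);

      #pragma omp parallel
      {
        #pragma omp single
        for (size_t i = 0; i < size; i++){
          #pragma omp task
          { c[i] = a[i] + b[i]; }
        }
      }

      size_t mismatches = 0;
      for (size_t i = 0; i < size; i++) if (c[i] != ref[i]) mismatches++;
      printf("%d thread(s): %zu mismatches\n", thread_counts[t], mismatches);
    }

    delete[] a; delete[] b; delete[] c; delete[] ref;
    return 0;
  }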
-----------------------------------------------------------------------
Q5-6. Open the file vector_add.h and uncomment line 11 (#define PERFORMANCE). Fill the following table. You should be able to explain the results.

Execution time of vector add
---------------------+---------------+---------------+
                     |   loop time   |   task time   |
---------------------+---------------+---------------+
vector_add 1 2 2048  |   0.007436s   |   0.001587s   |   16KB of data // sits in L1
---------------------+---------------+---------------+
vector_add 2 2 2048  |   0.003855s   |   5.356785s   |   16KB of data // sits in L1
---------------------+---------------+---------------+
vector_add 4 2 2048  |   0.005132s   |   9.633160s   |   16KB of data // sits in L1
---------------------+---------------+---------------+
vector_add 1 256 256 |   0.042742s   |   1.048903s   |   2MB of data  // sits in L2
---------------------+---------------+---------------+
vector_add 2 256 256 |   0.022494s   |  57.730580s   |   2MB of data  // sits in L2
---------------------+---------------+---------------+
vector_add 4 256 256 |   0.011673s   | 159.220981s   |   2MB of data  // sits in L2
---------------------+---------------+---------------+
vector_add 1 1024 64 |   0.235761s   |   1.084346s   |   8MB of data  // sits in RAM
---------------------+---------------+---------------+
vector_add 2 1024 64 |   0.063808s   |  64.970448s   |   8MB of data  // sits in RAM
---------------------+---------------+---------------+
vector_add 4 1024 64 |   0.060139s   | 152.871791s   |   8MB of data  // sits in RAM
---------------------+---------------+---------------+

The problem visible in this table is that the more threads work on the problem, the lower the performance we observe. The overhead of the tasks is enormous. But be aware that you created 64M tasks in the last case!

-----------------------------------------------------------------------
Q7. From the table, you can see a slowdown when working with tasks. The problem is that we generate a huge number of tiny tasks. Open the file vector_add.h and comment out line 11 (#define PERFORMANCE) again - we want to modify the implementation. Now, rework your code to create a much lower number of much bigger tasks (each task adds a sub-vector, not a single element). Fill the table again.

  const int Unroll_factor = 1024;

  for (size_t rpt = 0; rpt < nrpt; rpt++) {
    #pragma omp parallel
    {
      #pragma omp for
      for (size_t i = 0; i < size/Unroll_factor; i++){
        #pragma omp task untied
        {
          for (size_t x = 0; x < Unroll_factor; x++){
            c[i*Unroll_factor + x] = a[i*Unroll_factor + x] + b[i*Unroll_factor + x];
          }
        }
      }
    }
  }

We put more work inside each task and generated far fewer tasks. By declaring the tasks "untied", we allow them to be moved between threads, which gives better load balancing.
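As an aside (not part of the lab skeleton): since OpenMP 4.5 the same "few big tasks" idea can be expressed with the taskloop construct and its grainsize clause, letting the runtime do the chunking. A minimal sketch, assuming the same a, b, c arrays as above:

  for (size_t rpt = 0; rpt < nrpt; rpt++) {
    #pragma omp parallel
    {
      #pragma omp single
      {
        // taskloop splits the iteration space into tasks of at least
        // `grainsize` iterations each, replacing the hand-written chunking.
        #pragma omp taskloop grainsize(1024)
        for (size_t i = 0; i < size; i++){
          c[i] = a[i] + b[i];
        }
      }
    }
  }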
By the "united", we allowed the tasks to be moved between threads -> better load balancing Execution time of vector add - new resuls ---------------------+---------------+---------------+ | loop time | task time | ---------------------+---------------+---------------+ vector_add 1 2 2048 | 0.007436s | 0.008518s | 16KB of data //sits in L1 ---------------------+---------------+---------------+ vector_add 2 2 2048 | 0.003855s | 0.012215s | 16KB of data //sits in L1 ---------------------+---------------+---------------+ vector_add 4 2 2048 | 0.005132s | 0.020712s | 16KB of data //sits in L1 ---------------------+---------------+---------------+ vector_add 1 256 256 | 0.042742s | 0.046644s | 2MB of data //sits in L2 ---------------------+---------------+---------------+ vector_add 2 256 256 | 0.022494s | 0.085133s | 2MB of data //sits in L2 ---------------------+---------------+---------------+ vector_add 4 256 256 | 0.000748s | 0.148151s | 2MB of data //sits in L2 ---------------------+---------------+---------------+ vector_add 1 1024 64 | 0.235761s | 0.243611s | 8MB of data // sits in RAM ---------------------+---------------+---------------+ vector_add 2 1024 64 | 0.063808s | 0.088745s | 8MB of data // sits in RAM ---------------------+---------------+---------------+ vector_add 4 1024 64 | 0.013006s | 0.060139s | 8MB of data // sits in RAM ---------------------+---------------+---------------+ Now, we can see both versions are comparable, however the loop version is still a little bit faster ----------------------------------------------------------------------- Q8. Implement a parallel search algorithm using OpenMP tasks and verify it on arrays with different sizes. To implement this parallel search, we are going to need a wrapper function that creates parallel threads long VectorSearchParallelWrapper(float * array, const float key, const long imin, const long imax){ long Position = KEY_NOT_FOUND; #pragma omp parallel { #pragma omp single { Position = VectorSearchParallel(array, key,imin,imax); } } return Position; } //---------------------------------------------------------------------------- long VectorSearchParallel(float * array, const float key, const long imin, const long imax){ // test if array is empty if (imax < imin) { // set is empty, so return value showing not found return KEY_NOT_FOUND; } else { // calculate midpoint to cut set in half long imid = (imin + imax) / 2; // the element has been found if (array[imid] == key) { printf ("Key found by thread, [%d] : %d\n", imid, omp_get_thread_num()); return imid; } long left_result = KEY_NOT_FOUND; long right_result = KEY_NOT_FOUND; // try in both sides (the array is not sorted) #pragma omp task shared (left_result) untied { left_result = VectorSearchParallel(array, key, imin, imid-1); } #pragma omp task shared(right_result) untied { right_result = VectorSearchParallel(array, key, imid+1, imax); } #pragma omp taskwait // compare results from both sides if (left_result > KEY_NOT_FOUND) return (left_result); if (right_result > KEY_NOT_FOUND) return (right_result); return KEY_NOT_FOUND; } }// end of VectorSearchParallel ----------------------------------------------------------------------- ----------------------------------------------------------------------- ---------------------------SSE part------------------------------------ ----------------------------------------------------------------------- ----------------------------------------------------------------------- 
-----------------------------------------------------------------------
-----------------------------------------------------------------------
---------------------------SSE part------------------------------------
-----------------------------------------------------------------------
-----------------------------------------------------------------------
-----------------------------------------------------------------------
Q9. Make the vector addition project by typing make vector_add. You should now see a very long report from the compiler. It provides you with details about the autovectorization process used in the Standard and Unrolled implementations. Of course, it wasn't able to vectorize the code written in SSE. Go through the output and write down what loops have been vectorized.

The compiler vectorized the loops at lines 22 and 51:

  vector_add.cpp:22: note: LOOP VECTORIZED
  vector_add.cpp:51: note: LOOP VECTORIZED

From the output we can also deduce that the compiler created multiple versions of the loops in order to check for possible aliasing, and that it created an epilogue loop to treat the remaining iterations which cannot be done via SSE (if the number of elements is not divisible by 4).

-----------------------------------------------------------------------
Q10. You can run the code by typing vector_add vector_size number_of_repetitions. The number of elements taken as a parameter is multiplied by 1024 in order to work with large arrays. We have to execute multiple runs of the vector add to get a measurable time. Run this program with the following parameters and fill the table.

Execution time and MFLOPS of vector add
--------------------+--------+--------------+---------------+
                    |        |  exec. time  |    MFLOPS     |
--------------------+--------+--------------+---------------+
vector_add 2 8192   | Add    |    27.690    |    151.475    | // Vector MFLOPS
data in L1          | Unroll |    24.313    |    690.683    | // Scalar MFLOPS
                    | SSE    |    24.203    |    173.297    | // Vector MFLOPS
--------------------+--------+--------------+---------------+
vector_add 256 1024 | Add    |   172.226    |    389.672    | // Vector MFLOPS
data in L2          | Unroll |   258.590    |   1038.095    | // Scalar MFLOPS
                    | SSE    |   139.270    |    481.892    | // Vector MFLOPS
--------------------+--------+--------------+---------------+
vector_add 2048 64  | Add    |   466.960    |     71.879    | // Vector MFLOPS
data in RAM         | Unroll |   501.896    |    267.436    | // Scalar MFLOPS
                    | SSE    |   473.451    |     70.907    | // Vector MFLOPS
--------------------+--------+--------------+---------------+

-----------------------------------------------------------------------
Q11. You can see that the performance on small arrays is much higher. Can you explain why?

The SSE (SIMD) style of computation is extremely data hungry. If we cannot feed the scalar units with data, we will never be able to feed the SSE units. The code therefore has to be compute bound (not memory bound). Thus, we can benefit from SIMD only if the data set fits in the L1/L2/L3 cache. The peak here is about 2 GFLOPS (all data in L2, SSE implementation). When the cache is not big enough, the performance is limited by memory no matter what kind of execution you choose (around 280 MFLOPS).

-----------------------------------------------------------------------
Q12. You should also see something suspicious about the performance in FLOPS reported by PAPI. For shorter execution times, we get a much lower number of FLOPS. How are FLOPS calculated in this case (SSE)?

The reason is that PAPI does not distinguish between scalar FLOPS and vector FLOPS. Thus, it treats a single SSE instruction as 1 FLOP even though it works with 4 operands. To get a correct result, we have to multiply the reported FLOPS by 4 in the Add and SSE versions. Here you need to be careful, because different versions of PAPI and different CPUs show different behaviour!
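To make the correction concrete, take the L1 row of the table above: the Add version reported 151.475 MFLOPS, so the corrected value is 151.475 * 4 = 605.9 MFLOPS; the SSE version reported 173.297 MFLOPS, so 173.297 * 4 = 693.188 MFLOPS; the Unroll version is scalar, so its 690.683 MFLOPS stays unchanged.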
Thus, the corrected results:

--------------------+--------+--------------+---------------+
                    |        |  exec. time  |    MFLOPS     |
--------------------+--------+--------------+---------------+
vector_add 2 8192   | Add    |    27.690    |    605.900    | // autovectorization
data in L1          | Unroll |    24.313    |    690.683    | // here better pipelining can outperform SSE
                    | SSE    |    24.203    |    693.188    | // our SSE implementation is the fastest
--------------------+--------+--------------+---------------+
vector_add 256 1024 | Add    |   172.226    |   1558.688    |
data in L2          | Unroll |   258.590    |   1038.095    |
                    | SSE    |   139.270    |   1927.568    | // SSE code is much better than the autovectorized one
--------------------+--------+--------------+---------------+
vector_add 2048 64  | Add    |   466.960    |    287.516    | // no matter what you use, you're limited by memory
data in RAM         | Unroll |   501.896    |    267.436    |
                    | SSE    |   473.451    |    283.628    |
--------------------+--------+--------------+---------------+

-----------------------------------------------------------------------
Q13. Make the vector dot product project by typing make vector_dot. How many loops was the compiler able to vectorize?

None! It couldn't deal with the data dependency over the sum (the reduction).

-----------------------------------------------------------------------
Q14. Open the file vector_dot.cpp and take a look at it. When you reach VectorDotSSE you will see that something is missing there. To complete the code, you have to put TWO SSE instructions here. Verify the code by typing vector_dot 10 1

  void VectorDotSSE(const float * __restrict__ a, const float * __restrict__ b,
                    float * __restrict__ result, const size_t size, const size_t nrpt){
    // repeat many times to spend a reasonable amount of time here
    for (size_t rpt = 0; rpt < nrpt; rpt++) {
      float sum[4] __attribute__ ((aligned (16))) = {0.0f, 0.0f, 0.0f, 0.0f};
      __m128 sse_sum = _mm_set1_ps(0.0f);

      // SSE version of the vector dot product
      for (size_t i = 0; i < (size/4) * 4; i+=4){
        __m128 sse_a   = _mm_load_ps(&a[i]);
        __m128 sse_b   = _mm_load_ps(&b[i]);

        __m128 sse_tmp = _mm_mul_ps(sse_a, sse_b);     // multiplication (missing instruction 1)
        sse_sum        = _mm_add_ps(sse_tmp, sse_sum); // addition       (missing instruction 2)
      }
      _mm_store_ps(sum, sse_sum);

      // we have to treat these elements separately (non-SSE) if the size of the array is not divisible by 4
      for (size_t i = (size/4) * 4; i < size; i++){
        sum[0] += a[i] * b[i];
      }

      *result = sum[0] + sum[1] + sum[2] + sum[3];
    }// for rpt
  }// end of VectorDotSSE
  //-------
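A side note on the final horizontal reduction: the code above stores sse_sum to memory and adds the four lanes in scalar code. If SSE3 is available, the same reduction can be kept in registers; a minimal sketch, assuming #include <pmmintrin.h> and compiling with -msse3:

  // Horizontal sum of the four partial sums held in sse_sum (SSE3).
  __m128 t = _mm_hadd_ps(sse_sum, sse_sum);   // {s0+s1, s2+s3, s0+s1, s2+s3}
  t        = _mm_hadd_ps(t, t);               // all four lanes now hold s0+s1+s2+s3
  float total = _mm_cvtss_f32(t);             // extract the lowest lane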
-----------------------------------------------------------------------
Q15. When the code is running correctly, we want to measure its performance. Open the header file vector_dot.h, uncomment line 11 (#define PERFORMANCE) and rebuild the project. Fill in the following table:

Execution time and MFLOPS of vector dot product
--------------------+--------+--------------+---------------+
                    |        |  exec. time  |    MFLOPS     |
--------------------+--------+--------------+---------------+
vector_dot 2 8192   | Add    |    18.066    |   1859.495    | // not vectorized by the compiler
data in L1          | Unroll |    12.094    |   2780.569    | // unrolled for better pipelining
                    | SSE    |     4.593    |   7355.548    | // corrected SSE FLOPS
--------------------+--------+--------------+---------------+
vector_dot 256 1024 | Add    |   288.766    |   1859.209    |
data in L2          | Unroll |   221.479    |   2424.072    |
                    | SSE    |   102.299    |   5248.328    | // corrected SSE FLOPS
--------------------+--------+--------------+---------------+
vector_dot 2048 64  | Add    |   317.106    |    846.557    |
data in RAM         | Unroll |   313.858    |    855.320    |
                    | SSE    |   357.758    |    772.472    |
--------------------+--------+--------------+---------------+

The performance is much higher than for vector add because we perform more operations on the same data. Look at the massive speedup of the SSE versions!

-----------------------------------------------------------------------
Q16. Based on the two codes provided in the file SAXPY.cpp, create an SSE implementation. You can make the SAXPY project by typing make vector_saxpy

  void SAXPYSSE(const float * __restrict__ x, const float a, float * __restrict__ y,
                const size_t size, const size_t nrpt){
    // repeat many times to spend a reasonable amount of time here
    for (size_t rpt = 0; rpt < nrpt; rpt++) {
      __m128 sse_a = _mm_set1_ps(a);

      // SSE version of SAXPY
      for (size_t i = 0; i < (size/4) * 4; i+=4){
        __m128 sse_x   = _mm_load_ps(&x[i]);
        __m128 sse_y   = _mm_load_ps(&y[i]);
        __m128 xmm0    = _mm_mul_ps(sse_x, sse_a);
        __m128 sse_res = _mm_add_ps(xmm0, sse_y);
        _mm_store_ps(&y[i], sse_res);
      }

      // we have to treat these elements separately if the size of the array is not divisible by 4
      for (size_t i = (size/4) * 4; i < size; i++){
        y[i] += a * x[i];
      }
    }// for nrpt
  } // end of SAXPYSSE
  //------------------------------------------------------------------------------

-----------------------------------------------------------------------
Q17-19. When you are sure the code produces the right answer, measure the performance of the code. Do not forget to uncomment line 11 (#define PERFORMANCE) in saxpy.h. Use the same parameters as above.

Execution time and MFLOPS of SAXPY
---------------+--------------+--------------+---------------+
               |              |  exec. time  |    MFLOPS     |
---------------+--------------+--------------+---------------+
SAXPY 2 8192   | Add          |    35.264    |    951.528    | // vectorized (corrected FLOPS)
data in L1     | Unroll       |    30.170    |   1112.305    | // not vectorized
               | SSE          |    30.139    |   1113.328    | // SSE corrected
               | SSE Unroll 4x|    30.017    |   1117.848    |
               | SSE U+Reorder|     8.658    |   3875.576    |
---------------+--------------+--------------+---------------+
SAXPY 256 1024 | Add          |   135.598    |   3959.396    |
data in L2     | Unroll       |   242.209    |   2216.597    |
               | SSE          |   112.959    |   4752.936    |
               | SSE Unroll   |   111.275    |   4824.876    |
               | SSE U+Reorder|   107.373    |   5000.204    |
---------------+--------------+--------------+---------------+
SAXPY 2048 64  | Add          |   327.255    |    820.408    |
data in RAM    | Unroll       |   330.688    |    811.773    |
               | SSE          |   318.614    |    842.664    |
               | SSE Unroll   |   323.991    |    829.164    |
               | SSE U+Reorder|   317.649    |    845.328    |
---------------+--------------+--------------+---------------+

You can see that unrolling combined with instruction reordering can have a significant impact on performance.
// SAXPY - unrolled SSE version
void SAXPYSSE(const float * __restrict__ x, const float a, float * __restrict__ y,
              const size_t size, const size_t nrpt){
  const size_t UNROLL_FACTOR = 16;

  // repeat many times to spend a reasonable amount of time here
  for (size_t rpt = 0; rpt < nrpt; rpt++) {
    __m128 sse_a = _mm_set1_ps(a);

    // unrolled SSE version of SAXPY (4 vectors = 16 elements per iteration)
    for (size_t i = 0; i < (size/UNROLL_FACTOR) * UNROLL_FACTOR; i+=UNROLL_FACTOR){
      __m128 sse_x  = _mm_load_ps(&x[i]);
      __m128 sse_y  = _mm_load_ps(&y[i]);
      __m128 sse_x1 = _mm_load_ps(&x[i+4]);
      __m128 sse_y1 = _mm_load_ps(&y[i+4]);
      __m128 sse_x2 = _mm_load_ps(&x[i+8]);
      __m128 sse_y2 = _mm_load_ps(&y[i+8]);
      __m128 sse_x3 = _mm_load_ps(&x[i+12]);
      __m128 sse_y3 = _mm_load_ps(&y[i+12]);

      __m128 xmm0     = _mm_mul_ps(sse_x, sse_a);
      __m128 sse_res  = _mm_add_ps(xmm0, sse_y);
      _mm_store_ps(&y[i], sse_res);

      __m128 xmm1     = _mm_mul_ps(sse_x1, sse_a);
      __m128 sse_res1 = _mm_add_ps(xmm1, sse_y1);
      _mm_store_ps(&y[i+4], sse_res1);

      __m128 xmm2     = _mm_mul_ps(sse_x2, sse_a);
      __m128 sse_res2 = _mm_add_ps(xmm2, sse_y2);
      _mm_store_ps(&y[i+8], sse_res2);

      __m128 xmm3     = _mm_mul_ps(sse_x3, sse_a);
      __m128 sse_res3 = _mm_add_ps(xmm3, sse_y3);
      _mm_store_ps(&y[i+12], sse_res3);
    }

    // we have to treat these elements separately if the size of the array is not divisible by UNROLL_FACTOR
    for (size_t i = (size/UNROLL_FACTOR) * UNROLL_FACTOR; i < size; i++){
      y[i] += a * x[i];
    }
  }// for nrpt
} // end of SAXPYSSE
//------------------------------------------------------------------------------
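For comparison with the "SSE U+Reorder" rows in the Q17-19 table: one possible reordering (a sketch only, not necessarily the exact variant that was measured) groups the independent operations of the four unrolled blocks by type, so the loads, multiply-adds and stores are issued back to back and the out-of-order core can overlap memory latency with arithmetic. The loop body would then look roughly like this:

  // Sketch of a reordered inner-loop body (same work as the unrolled loop above).
  __m128 x0 = _mm_load_ps(&x[i]);     __m128 y0 = _mm_load_ps(&y[i]);
  __m128 x1 = _mm_load_ps(&x[i+4]);   __m128 y1 = _mm_load_ps(&y[i+4]);
  __m128 x2 = _mm_load_ps(&x[i+8]);   __m128 y2 = _mm_load_ps(&y[i+8]);
  __m128 x3 = _mm_load_ps(&x[i+12]);  __m128 y3 = _mm_load_ps(&y[i+12]);

  y0 = _mm_add_ps(_mm_mul_ps(x0, sse_a), y0);
  y1 = _mm_add_ps(_mm_mul_ps(x1, sse_a), y1);
  y2 = _mm_add_ps(_mm_mul_ps(x2, sse_a), y2);
  y3 = _mm_add_ps(_mm_mul_ps(x3, sse_a), y3);

  _mm_store_ps(&y[i],    y0);
  _mm_store_ps(&y[i+4],  y1);
  _mm_store_ps(&y[i+8],  y2);
  _mm_store_ps(&y[i+12], y3);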