Name:			Student ID:			Lab Group:

Type in your answers to the following questions WITHIN this document.

-----------------------------------------------------------------------
Q1. Run the code with one thread by typing ompexample1. What values are printed out for i and j?

Initial value of i 1 and j 2
Parallel value of i 0 and j 2
Final value of i 1 and j 2

-----------------------------------------------------------------------
Q2. Now run the code with 4 threads by first setting the OMP_NUM_THREADS variable

setenv OMP_NUM_THREADS 4

What is printed out now? Explain why these values are printed.

Initial value of i 1 and j 2
Parallel value of i 0 and j 2
Parallel value of i 0 and j 2
Parallel value of i 0 and j 2
Parallel value of i 0 and j 2
Final value of i 1 and j 2

We now have 4 threads, so we see 4 "Parallel value" lines. Variable i in the parallel region is declared as a private variable, so it bears no relation to the variable i in the master thread. Each thread-local copy of i is not initialized and can take any value. It is often zero because the copy sits in a freshly allocated memory location that has not been used before; had that location previously been used to store some other variable, the printed value could be non-zero.

-----------------------------------------------------------------------
Q3. Explain the effect of changing private(i) to firstprivate(i). What happens if you change it to lastprivate(i)?

firstprivate(i) initializes each thread's private copy of i to the value that the variable of the same name has in the master thread on entry to the parallel region. Note that i is still a private variable: if we change its value inside the parallel region, that change is not propagated back to the master thread. Using firstprivate(i) we get

Initial value of i 1 and j 2
Parallel value of i 1 and j 2
Parallel value of i 1 and j 2
Parallel value of i 1 and j 2
Parallel value of i 1 and j 2
Final value of i 1 and j 2

lastprivate(i) gives a compile error:

gcc -O3 -fopenmp -o ompexample1 ompexample1.c -lm
ompexample1.c: In function 'main':
ompexample1.c:9: error: 'lastprivate' is not valid for '#pragma omp parallel'
make: *** [ompexample1] Error 1

lastprivate propagates the value of a private variable back to the master thread. HOWEVER, it is only valid on an #pragma omp for loop. This is because we need some unique way to determine which of the various private copies of the variable is propagated back to the master; in a loop this is well defined as the value held by the thread that executed the last iteration of that loop.

-----------------------------------------------------------------------
Q6. Program ompexample4.c computes the sum of all integers from 1 to N, where N is given as command line input. This task is performed using the following loop:

sum=0;
i=0;
while (i < nele){
  i++;
  sum+=i;
}

Parallelise this summation by using OpenMP to manually divide the loop operations amongst the available OpenMP threads. Your parallel code must continue to use a while construct. Include comments in your code to indicate the granularity of your parallel tasks. Submit your code as ompexample4.c

Something like:

omp_set_num_threads(np);
sum=1;                        /* accounts for i=1; the threads start at i=2 */
#pragma omp parallel reduction(+:sum) private(i,iam,nthread) shared(nele)
{
  nthread=omp_get_num_threads();
  iam=omp_get_thread_num();
  /* Cyclic distribution: thread iam handles i = iam+2, iam+2+nthread, ...
     so the granularity of each parallel task is a single addition.        */
  i=iam+2;
  while (i <= nele) {
    sum+=i;
    i+=nthread;
  }
}

You need to be careful in the above to ensure you still get the same answer in special cases, e.g. when nthreads > nele, or when nele < 1 (sum is pre-set to 1).
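For concreteness, a minimal self-contained sketch of the same approach is given below. The command-line handling and the variable types are assumptions, since the original ompexample4.c is not reproduced here, and the thread count is taken from OMP_NUM_THREADS rather than from omp_set_num_threads(np).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
    long nele = (argc > 1) ? atol(argv[1]) : 100;   /* N from the command line */
    long sum  = (nele >= 1) ? 1 : 0;                /* pre-count i = 1; threads start at 2 */

    #pragma omp parallel reduction(+:sum)
    {
        int  nthread = omp_get_num_threads();
        int  iam     = omp_get_thread_num();
        /* Granularity: one addition per task, dealt out cyclically -
           thread iam takes i = iam+2, iam+2+nthread, iam+2+2*nthread, ... */
        long i = iam + 2;
        while (i <= nele) {
            sum += i;
            i   += nthread;
        }
    }

    printf("Sum of 1..%ld is %ld\n", nele, sum);
    return 0;
}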
Other solutions are also acceptable, as long as they do not use a for loop.

-----------------------------------------------------------------------
Q7. Program ompexample5.c computes the value of Pi by numerically integrating the function 1/(1+x^2) between the values of 0 and 1 using the trapezoid method. (The value of this integral is Pi/4.) The code divides the domain [0,1] into N trapezoids, where the area of each trapezoid is given by the average length of its two parallel sides times the width of the trapezoid. The lengths of the two parallel sides are given by the values of the function 1/(1+x^2) at the lower and upper value of x for that sub-domain.

Parallelise this code using the omp for directive. Submit your code as ompexample5.c

Something like:

omp_set_num_threads(np);
xl=0.0;
xw=1.0/nele;
area=0.0;
#pragma omp parallel for default(none) shared(nele,xw) reduction(+:area) \
        private(xl,xh,fxl,fxh)
for (i=0; i < nele; i++){
  xl=i*xw;                     /* lower edge of trapezoid i                */
  xh=xl+xw;                    /* upper edge                               */
  fxl=1.0/(1.0+xl*xl);
  fxh=1.0/(1.0+xh*xh);
  area+=0.5*(fxl+fxh)*xw;      /* average of the two sides times the width */
}
printf("Elements %i area %14.12f \n",nele,area*4.0);

Note that xl is now computed from the loop index, so each iteration is independent of the previous one.

-----------------------------------------------------------------------
Q8. If the number_of_segments used in the integration is 100, what is the relative error in the value of Pi? Is this error due to rounding or truncation error?

We will use the value of Pi calculated with 100000 segments as the true value. It is not exact, but it is close enough for the purpose of calculating the relative error.

Elements    100 area 3.141575986923
Elements 100000 area 3.141592653573
diff 0.000016666650

So relative error = 0.000016666650/3.141592653573 = 0.00000530515946459343, i.e. about 0.00053% as a percentage.

It is a truncation error: if we increase the number of segments we get a better result. Printing out more digits, the errors are as follows:

Elements     Pi                    Dominant Error
10           3.139925988907159127  Truncation
100          3.141575986923130781  Truncation
1000         3.141592486923127758  Truncation
10000        3.141592651923118318  Truncation
100000       3.141592653573132221  Truncation
1000000      3.141592653589717621  Truncation
10000000     3.141592653590432604  Both
100000000    3.141592653589361017  Rounding (result fluctuates randomly)
1000000000   3.141592653589377004  Rounding

-----------------------------------------------------------------------
Q9. Often it is useful to have a shared counter that can be used to implement load balancing. Program ompexample6.c is an attempt to implement such a thing. However it does not work correctly (try it for various values of maxtasks). Explain why the current version does not work correctly.

For example, on rattle:

ompexample6 4 100
sum 10005 tsum 4950 ERROR 5055
Something went wrong!

It is wrong because the counter is a shared variable and there is no locking around the accesses to it. Thus two (or more) threads can be inside nxtval at the same time:

int nxtval(){
  //increment counter
  icount++;
  return icount;
}

Say icount = 17:

thread 1 - icount++       (icount = 18)
thread 2 - icount++       (icount = 19)
thread 1 - return icount  (returns 19)
thread 2 - return icount  (returns 19)

Both threads receive the same counter value, and the value 18 is never handed out - giving errors!

-----------------------------------------------------------------------
Q10. Fix the above code so it does work correctly. Submit your working version of ompexample6.c.

One possible solution is:

int nxtval(){
  //increment counter
  int local;
  #pragma omp critical
  {
    icount++;
    local=icount;
  }
  return local;
}

NOTE - you cannot use the atomic directive here, because the update (icount++) AND the read of the new value must BOTH complete before any other thread accesses icount. The atomic directive protects only a single statement, whereas critical protects the whole block.
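A minimal self-contained sketch of the protected counter in use is given below. The task-dispatch loop and the expected total are assumptions, since the real ompexample6.c may count its tasks differently; the point is only that the critical section makes every task number come out exactly once.

#include <stdio.h>
#include <omp.h>

static int icount = 0;          /* shared task counter */

/* Hand out a unique task number: the increment and the read sit inside one
   critical section, so no other thread can interleave between the two steps. */
int nxtval(void) {
    int local;
    #pragma omp critical
    {
        icount++;
        local = icount;
    }
    return local;
}

int main(void) {
    const int maxtasks = 100;
    long tsum = 0;              /* sum of the task numbers actually processed */

    #pragma omp parallel reduction(+:tsum)
    {
        int task;
        while ((task = nxtval()) <= maxtasks)
            tsum += task;       /* "process" the task: here just add its number */
    }

    /* With the critical section, every task 1..maxtasks is handed out exactly
       once, so tsum should always equal maxtasks*(maxtasks+1)/2.              */
    printf("tsum %ld expected %d\n", tsum, maxtasks * (maxtasks + 1) / 2);
    return 0;
}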
-----------------------------------------------------------------------
COMP6464 ONLY
-----------------------------------------------------------------------
Q11. Run syncbench and fill in the following table (set the number of threads with setenv OMP_NUM_THREADS):

Overhead (times in microseconds; cycles at 2.8GHz)
-------------------+-------+-------+-------+-------+-------+
                   |  1T   |  2T   |  4T   |  8T   |  16T  |
-------------------+-------+-------+-------+-------+-------+
PARALLEL FOR time  | 0.261 | 1.019 | 2.027 | 2.869 | 71.91 |
PARALLEL FOR cycles|  730  |  2853 |  5675 |  8033 | 201348|
-------------------+-------+-------+-------+-------+-------+
Barrier time       | 0.163 | 0.282 | 0.372 | 0.839 | 46.35 |
Barrier cycles     |  456  |  789  |  1041 |  2349 | 129780|
-------------------+-------+-------+-------+-------+-------+
Critical time      | 0.021 | 0.264 | 0.266 | 0.725 | 0.673 |
Critical cycles    |   58  |  739  |  744  |  2030 |  1884 |
-------------------+-------+-------+-------+-------+-------+
Atomic time        | 0.018 | 0.068 | 0.132 | 0.224 | 0.218 |
Atomic cycles      |   50  |  190  |  369  |  627  |  610  |
-------------------+-------+-------+-------+-------+-------+
Reduction time     | 0.258 | 1.454 | 1.757 | 2.539 | 71.85 |
Reduction cycles   |  722  |  4071 |  4919 |  7109 | 201180|
-------------------+-------+-------+-------+-------+-------+

Q12. Run schedbench and fill in the following table (set the number of threads with setenv OMP_NUM_THREADS):

Schedule (2.8GHz)
-------------------+-------+-------+-------+-------+-------+
                   |  1T   |  2T   |  4T   |  8T   |  16T  |
-------------------+-------+-------+-------+-------+-------+
STATIC 1 time      | 0.038 | 0.207 | 0.239 | 0.859 | 51.01 |
-------------------+-------+-------+-------+-------+-------+
STATIC 8 time      | 0.225 | 0.391 | 0.513 | 1.074 | 52.13 |
-------------------+-------+-------+-------+-------+-------+
STATIC 128 time    | 0.261 | 0.430 | 0.517 | 2.310 | 50.13 |
-------------------+-------+-------+-------+-------+-------+
DYNAMIC 1 time     | 1.267 | 14.85 | 34.47 | 27.26 | 87.41 |
-------------------+-------+-------+-------+-------+-------+
DYNAMIC 8 time     | 0.278 | 2.512 | 5.286 | 4.990 | 36.69 |
-------------------+-------+-------+-------+-------+-------+
DYNAMIC 128 time   | 0.130 | 1.259 | 2.128 | 2.562 | 36.44 |
-------------------+-------+-------+-------+-------+-------+
GUIDED 1 time      | 0.168 | 1.974 | 4.158 | 8.151 | 44.19 |
-------------------+-------+-------+-------+-------+-------+
GUIDED 8 time      | 0.171 | 1.507 | 2.797 | 5.480 | 37.42 |
-------------------+-------+-------+-------+-------+-------+
GUIDED 128 time    | 0.172 | 1.430 |2.6(32)|9.8(16)| ----  |
-------------------+-------+-------+-------+-------+-------+

In the following loop, a, b and c are all of type double:

for (i=0; i < N; i++){
  a[i]=a[i]+b[i]*c;
}

There are no dependencies and the loop can in principle be run in parallel. Under what circumstances (values of N and numbers of threads) would you advise me to parallelise the above loop on the XE system?

The above loop is a daxpy-style operation: at best, one iteration completes in about 3 cycles on the XE hardware. From our measurements, the parallel for overhead is roughly 2853-8033 cycles depending on the number of threads (these are the 2T-8T figures in the table above). If you want that overhead to be only 10% of the execution time, you need about 28,000-80,000 clock cycles of work in the loop, which at 3 cycles per iteration corresponds to roughly N = 9,500-27,000. So the moral of the story is that you need a very long loop before parallelising it pays off. What might help somewhat is if the loop is running out of second-level cache and therefore not reaching the peak rate of 1 iteration every 3 cycles; each iteration then costs more cycles, so the break-even N comes down a little, but even then N is still likely to be large.
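For reference, the parallel version being discussed would look roughly like the sketch below. The function wrapper, the array setup and the choice of N are illustrative assumptions; the point is that the single parallel for directive carries the few-thousand-cycle start-up cost measured in Q11.

#include <stdlib.h>
#include <omp.h>

/* daxpy-style update: only worth the parallel-for start-up overhead when
   N is large enough that the loop body dwarfs those few thousand cycles. */
void daxpy_update(double *a, const double *b, double c, long N) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] + b[i] * c;
}

int main(void) {
    long N = 1000000;                       /* illustrative size only */
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    daxpy_update(a, b, 3.0, N);
    free(a);
    free(b);
    return 0;
}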
-----------------------------------------------------------------------
Q12. If I use the #pragma omp parallel for directive what sort of scheduling would you advise me to use?

From the schedbench measurements, static scheduling always appears to be best, using its default block distribution, meaning the loop indices are divided into NPROC distinct contiguous ranges, one per thread. The only exception is if you use static scheduling with very large chunk sizes.
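Purely as an illustration of the two static variants named above, a minimal sketch follows; the array size, loop body and chunk size are arbitrary assumptions.

#include <stdio.h>

#define N 4096

int main(void) {
    static double a[N];

    /* Default static schedule: the block distribution - the iteration range
       is split into one contiguous chunk per thread, assigned once.         */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* Static with an explicit chunk size: chunks are dealt out round-robin.
       If the chunk is very large relative to N/threads, some threads get
       little or no work, which is the case the measurements say to avoid.   */
    #pragma omp parallel for schedule(static, 1024)
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

===============================================================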