Some notes on lab4
==================

The code in the tutorial page is different to the code provided:

Provided code:

1. The initial values of i and j are 1 and 2. Inside the parallel region i
   starts off with an arbitrary value, because lastprivate(i) leaves i
   undefined to begin with. The final value of i and j is the initial value
   plus 99. The loop then iterates four times with the one thread using these
   values. The final value of i is the arbitrary value + 396 because of the
   lastprivate; j is 398 because it was shared.

2. With this the for loop is unrolled/distributed across the four threads.
   Without control over the communication (#pragma omp barrier), j eventually
   has 99 added to it four times. Likewise i is uninitialised on each thread,
   is incremented according to the distribution of the for loop, and
   eventually has 99 added to it four times. The final value is set to the
   value of i and j on the fourth thread.

Tutorial code:

1. You should see something like:

      Initial value of i 1 and j 2
      Parallel value of i 0 and j 2
      Final value of i 1 and j 2

   This is because i is now undefined inside the region as a result of the
   private(i) clause.

2. You may see something like:

      Initial value of i 1 and j 2
      Parallel value of i -12689056 and j 2
      Parallel value of i 0 and j 2
      Parallel value of i 0 and j 2
      Parallel value of i 0 and j 2
      Final value of i 1 and j 2

   We now have 4 threads - so we see 4 "Parallel value" lines. Variable i in
   the parallel region is defined as a private variable. It bears no relation
   to the value of variable i in the master thread. The thread-local variable
   i is not initialized and can take any value. It is often zero because in
   the code we are running it has been assigned a new memory location that
   has not previously been used - had this location previously been used to
   store some variable it could be non-zero, and we see this once in the
   output above.

3. firstprivate(i) initializes i to the current value of the variable of the
   same name in the master thread. But note that it is still a private
   variable: if we change its value in the parallel region, that change is
   not propagated back to the master thread. Using firstprivate(i) we get

      Initial value of i 1 and j 2
      Parallel value of i 1 and j 2
      Parallel value of i 1 and j 2
      Parallel value of i 1 and j 2
      Parallel value of i 1 and j 2
      Final value of i 1 and j 2

   lastprivate will give you a compiler error:

      cc -fast -xarch=v8plusa -xopenmp -xrestrict -o ompexample1 ompexample1.c -lm
      "ompexample1.c", line 9: lastprivate is not a valid clause for pragma parallel

   lastprivate propagates the value of a private variable back to the master
   thread. HOWEVER, it is only valid for a #pragma omp for loop. This is
   because we need some unique way to determine which of the various private
   copies of the variable we propagate back to the master. In a loop this is
   defined uniquely as the value held by the thread that executed the last
   iteration of that loop. (A minimal sketch of these clauses is given after
   point 5 below.)

4. Overflow.

5. By default you will get 20 threads: by default the NCI's OS allows more
   threads than processors. To get the intended result of the tut question,
   set OMP_DYNAMIC=true. This did not seem to work for most people in the
   tut, but I assure you it is meant to. Setting it to false should allow the
   full number of threads requested to run.
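The lab's ompexample1.c is not reproduced in these notes, so purely as an
illustration of the clauses discussed in points 1-3 above, here is a minimal
sketch (the variable names i and j, the thread count of 4 and the printed
messages are my assumptions based on the output shown, not the lab's code):

      /* Minimal sketch only - not the actual ompexample1.c from the lab.
       * Compile with e.g. gcc -fopenmp or cc -xopenmp.                    */
      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          int i = 1, j = 2;

          printf("Initial value of i %d and j %d\n", i, j);

          omp_set_num_threads(4);

          /* firstprivate(i): every thread gets its own copy of i,
           * initialized to the master's value (1).  j is shared, so all
           * threads see the same j.                                       */
          #pragma omp parallel firstprivate(i) shared(j)
          {
              printf("Parallel value of i %d and j %d\n", i, j);
              /* any change to i here would affect only this thread's copy
               * and would NOT be propagated back to the master's i        */
          }

          printf("Final value of i %d and j %d\n", i, j);
          return 0;
      }

Swapping firstprivate(i) for private(i) gives each thread an uninitialized
copy instead, which reproduces the arbitrary "Parallel value" lines seen in
points 1 and 2; and it is only once you move to a #pragma omp for loop that
lastprivate(i) becomes legal, as explained in point 3.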
6. Something like:

      omp_set_num_threads(np);
      sum=1;
      #pragma omp parallel reduction(+:sum) private(i,iam,nthread) shared(nele)
      {
        nthread=omp_get_num_threads();
        iam=omp_get_thread_num();
        i=iam+2;
        while (i <= nele) {
          sum+=i;
          i+=nthread;
        }
      }

   You need to be careful in the above to ensure you get the same answer for
   special cases, e.g. when nthread > nele. Other solutions are also
   acceptable - but not ones using a for loop.

7. Something like:

      omp_set_num_threads(np);
      xl=0.0;
      xw=1.0/nele;
      area=0.0;
      #pragma omp parallel for default(none) shared(nele,xw) reduction(+:area) \
              private(xl,xh,fxl,fxh)
      for (i=0; i < nele; i++){
        xl=i*xw;
        xh=xl+xw;
        fxl=1.0/(1.0+xl*xl);
        fxh=1.0/(1.0+xh*xh);
        area+=0.5*(fxl+fxh)*xw;
      }
      printf("Elements %i area %14.12f \n",nele,area*4.0);

   Note we have changed xl to be defined by the loop index.

8. It is wrong because the counter is a shared variable and there is no
   locking of access. Thus two (or more) threads can enter nxtval at the
   same time:

      int nxtval(){
        /* increment counter */
        icount++;
        return icount;
      }

   Say icount = 17:

      thread 1 - icount++       (icount=18)
      thread 2 - icount++       (icount=19)
      thread 1 - return icount  (returns 19)
      thread 2 - return icount  (returns 19)

   Both threads get the same counter value - giving errors!

9. One possible solution is:

      int nxtval(){
        /* increment counter */
        int local;
        #pragma omp critical
        {
          icount++;
          local=icount;
        }
        return local;
      }

   NOTE - you cannot use the atomic directive here, because the update
   (icount++) AND the read (return icount) of the variable icount must BOTH
   occur before any other thread accesses this variable. Atomic applies only
   to a single statement.

10-13. This is straightforward porting of MPI to OpenMP and is left as an
   exercise!

14. The above loop is a daxpy operation. It takes 3 cycles per iteration on
   the Sun hardware. The HPC 3500 system used by Bull has a clock speed of
   400MHz. It appears from Bull's paper that parallel do on the HPC 3500 has
   an overhead of 7000-10000 cycles depending on the number of CPUs used. In
   9000 cycles we can do 9000/3 = 3000 loop iterations if running at peak
   performance. If you wanted the overhead to be 10% of the loop's execution
   time you would need to do 3000*10 iterations, or N=30,000. So, moral of
   the story - you would need a very long loop. What might help somewhat is
   if the loop is running out of second-level cache and not reaching the
   peak performance of 1 iteration in 3 cycles. But even then N is still
   likely to be large.

15. Figure 7 shows performance for different scheduling. Static scheduling
   always appears to be best with a block distribution - meaning the loop
   indices are divided into NPROC distinct ranges. The only exception is if
   you use static scheduling with very large chunks.

16. The above loop is a dot product. It differs from the daxpy loop in that
   we now need to do a reduction operation to get the final value of C. If
   we compare parallel+reduction with parallel+do in figure 1 we see the
   performance is about 20% worse on small numbers of CPUs but over 50%
   worse on 8 CPUs. Couple this with the fact that the peak performance of
   the above is 1 loop iteration per 2 cycles, and amortizing the parallel
   overhead is going to be much worse. Thus we would need even larger loop
   lengths.
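Points 14 and 16 refer to loops in the lab handout that are not reproduced
here. Purely as a sketch of the two loop forms being discussed (the array
names x and y, the scalars a and c, and the length N are my assumptions, not
the lab's):

      #include <stdio.h>

      #define N 30000   /* of the order estimated in point 14 */

      double x[N], y[N];

      int main(void)
      {
          double a = 2.0, c = 0.0;
          int i;

          for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

          /* daxpy (point 14): each y[i] is updated independently,
           * so no reduction is needed                                  */
          #pragma omp parallel for default(none) shared(a,x,y) private(i)
          for (i = 0; i < N; i++)
              y[i] += a*x[i];

          /* dot product (point 16): c is accumulated across iterations,
           * so a reduction is required                                 */
          #pragma omp parallel for default(none) shared(x,y) private(i) reduction(+:c)
          for (i = 0; i < N; i++)
              c += x[i]*y[i];

          printf("c = %f\n", c);
          return 0;
      }

The daxpy loop parallelizes with a plain parallel for, whereas the dot
product also needs reduction(+:c); figure 1 shows the reduction carries a
noticeably larger overhead, which is why the dot product needs even longer
loops to amortize it.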