ANU College of Engineering and Computer Science
Research School of Computer Science

High Performance Scientific Computing

COMP3320/6464 2012: Laboratory 4

Shared Memory Parallel Programming with OpenMP

The aim of this lab is to use OpenMP to provide a basic introduction to shared memory parallel programming. Initially you will explore the basic features of OpenMP.

YOU MUST LOGON TO XE.NCI.ORG.AU TO DO THIS LAB

Start by creating a directory on XE and copying all the relevant programs from lab4.tar.gz into that directory.
Set up your environment by typing:

  • module load gcc/4.4.4

Lab Assessment

You are not required to submit labs, but all lab material is examinable! You are therefore advised to record your answers to each question.

If you copied the files as above you will have a template answer file called:

COMP3320_LAB4.txt
Where relevant you are encouraged to write answers in this file. At the end of the lab you should have created the following files.
  • COMP3320_LAB4.txt: text answers to questions below
  • ompexample4.c: Manually parallelized do/while loop
  • ompexample5.c: Parallel code for evaluation of Pi
  • ompexample6.c: Working version of shared counter program


OpenMP Background


The OpenMP Application Program Interface (API) is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications. The OpenMP standard supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It has been jointly defined by a group of major computer hardware and software vendors. For more information see the OpenMP web pages.

OpenMP consists of a series of program directives and a small number of function/subroutine calls. The function/subroutine calls are associated with the execution runtime environment, memory locking, and timing. The directives are primarily responsible for the parallelisation of the code. For C/C++ code the directives take the form of pragmas:

#pragma omp
A program written using OpenMP directives begins execution as a single process, or "master thread". The master thread executes sequentially until it encounters the first parallel construct. When this happens a team of threads is created and the original thread assumes the role of master of that team. Upon completion of the parallel construct the threads synchronize (unless specified otherwise) and only the master continues execution. Any number of parallel constructs may be specified in the program, and as a result the program may "fork" and "join" many times.

The number of threads that are spawned may be

  • explicitly given as part of the pragma
  • set using one of the OpenMP function calls
  • predefined by an environment variable or a system setup default
We note that the number of threads may exceed the number of physical CPUs on the machine; it is up to the operating system to schedule the threads as best it can. Whether this is permitted on a given system is determined by the environment variable OMP_DYNAMIC. If this is true then the OpenMP environment will not permit you to have more threads than physical CPUs on the machine. To override this and obtain more threads than CPUs you may need to explicitly issue the command
setenv OMP_DYNAMIC false
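As a rough sketch (not one of the lab files; the thread counts of 4, 8 and 16 used here are arbitrary), the three approaches look like this:
      #include <stdio.h>
      #include <omp.h>

      int main(void){
         /* option 2: set the thread count with a runtime call */
         omp_set_num_threads(8);

         /* option 1: request a count explicitly on the pragma itself;
            this overrides the runtime-call setting for this region     */
         #pragma omp parallel num_threads(4)
         {
            printf("Hello from thread %i of %i \n",
                   omp_get_thread_num(), omp_get_num_threads());
         }
         return 0;
      }
      /* option 3: rely on the environment, e.g. (csh syntax as on XE):
            setenv OMP_NUM_THREADS 16                                    */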


OpenMP Directives


parallel Regions

A Parallel Region is a structured block of code that is to be executed in parallel by a number of threads. Each thread executes the structured block independently. NOTE it is illegal for your code to branch out of a parallel region. The basic structure is as follows
      #pragma omp parallel [clause]
      {
             /*structured block*/
      }
The clause can take a variety of forms, for example:
  • private (list) specifies that each thread has its own uninitialized local copy of the variables listed
  • shared (list) specifies single variables that are visible to all threads. If you specify a variable as shared, you are stating that all threads can safely share a single copy of the variable.
  • default(clause) Where for C/C++ the clause is either shared or none. When you specify the default clause, you declare the default for all variables in the code block to be shared or to have no default (none). Note - Fortran also permits a default of private. This is not available in C/C++ since many of the standard libraries use global variables and scoping these as local would give errors.
As an example consider:
      #include <stdio.h>
      #include <omp.h>
      
      int main(void){
        int i=1,j=2;

        printf(" Initial value of i %i and j %i \n",i,j);

        #pragma omp parallel default(shared) private(i)
        {  
           printf(" Parallel value of i %i and j %i \n",i,j);
           i=i+1;
        }

        printf(" Final value of i %i and j %i \n",i,j);
        return 0;
      }
The above code is contained in file ompexample1.c. Compile it by typing
       make ompexample1
  1. Run the code with one thread by typing ompexample1. What value is printed out for i and j and why?
  2. Now run the code with 20 threads by first setting the OMP_NUM_THREADS variable
        setenv OMP_NUM_THREADS 20
        
    What is printed out now? (Make sure you have 20 threads). Now set OMP_DYNAMIC to true and run it again. Explain why these values are printed. Set OMP_DYNAMIC back to false.
  3. Explain the effect of changing private(i) to firstprivate(i). What happens if you change it to lastprivate(i)?

The reduction Clause

A reduction clause can be added to the parallel directive. This specifies that the thread-local values of certain variables are combined, using the specified operation (or intrinsic function), at the end of the parallel region. For example consider program ompexample2.c

      #include <stdio.h>
      #include <omp.h>
      
      int main(void){
         int tnumber;
         int i=10,j=10,k=10;

         printf("Before parallel region: i=%i,j=%i,k=%i\n",i,j,k);

         #pragma omp parallel default(none) private(tnumber) reduction(+:i) \
          reduction(*:j) reduction(&:k)
         {
            tnumber=omp_get_thread_num()+1;
            i = tnumber;
            j = tnumber;
            k = tnumber;
            printf("Thread %i: i=%i,j=%i,k=%i\n",tnumber,i,j,k);
         }

         printf("After parallel region: i=%i,j=%i,k=%i\n",i,j,k);
         return 0;
      }
The above program demonstrates a number of reduction operations and also shows the use of the omp_get_thread_num function to uniquely identify each thread.
  1. Set the number of threads to 4, run program ompexample2.c and make sure you understand what is happening
Some other useful OpenMP routines are
  • omp_set_num_threads(np): this allows you to set the number of parallel threads from within the program
  • omp_get_num_threads(): this determines the number of threads currently executing (note this is 1 unless you are in a parallel region)
  • omp_get_max_threads(): gives the maximum number of threads that will be used
These three functions are used in the following program (see ompexample3.c):
      #include <stdio.h>
      #include <stdlib.h>
      #include <omp.h>
      
      int main(int argc, char* argv[]){
         int np, iam, nthread, mxthread;
      
      
         if (argc != 2) {
            printf(" %s Number_of_threads \n",argv[0]);
            return -1;
         }
         else {
            np = atoi(argv[1]);
            if (np < 1){
              printf("Error: Number_of_threads (%i) < 1 \n",np);
              return -1;
            }
         }
      
         omp_set_num_threads(np);
      
         nthread=omp_get_num_threads();
         mxthread=omp_get_max_threads();
         printf("Before Parallel: nthread=%i mxthread %i\n",nthread,mxthread);
         #pragma omp parallel default(none) private(nthread,iam)
         {
            nthread=omp_get_num_threads();
            iam=omp_get_thread_num();
            printf("In Parallel: nthread=%i iam=%i \n",nthread,iam);
         }
         nthread=omp_get_num_threads();
         printf("After Parallel: nthread=%i \n",nthread);
      
         return 0;
      }
  1. Run program ompexample3.c and make sure you understand what is happening
  2. Program ompexample4.c computes the sum of all integers from 1 to N, where N is given as command line input. This task is performed using the following loop:
             sum=0;
             i=0;
             while (i < nele){ 
                 i++;
                 sum+=i;
              } 
        
    Parallelise this summation by using OpenMP to manually divide the loop operations amongst the available OpenMP threads (this means you are not to convert it to a for loop and use #pragma omp for). Your parallel code must continue to use a while/do construct. Include comments in your code to indicate the granularity of your parallel tasks. Call your final code ompexample4.c

The for Directive

In the above you parallelised a loop by manually assigning different loop indices to different threads. With for loops OpenMP provides the for directive to do this for you. This directive is placed immediately before a for loop and automatically partitions the loop iterations across the available threads.
  
      #pragma omp for [clause [[,] clause] ...]
         for (...)
An important optional clause is the schedule(type[,chunk]) clause. This can be used to specify exactly how the loop iterations are divided amongst the different threads. Two distribution schemes are listed below (a short sketch follows the list):
  • (static,chunk-size): iterations are divided into pieces of a size specified by chunk and these chunks are then assigned to threads in a round-robin fashion.
  • (dynamic,chunk-size): iterations are divided into pieces of a size specified by chunk. As each thread finishes a chunk, it dynamically obtains the next available chunk of loop indices.
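As an illustrative sketch (not one of the lab files; the array name, its size and the chunk size of 8 are arbitrary choices), a for directive with a dynamic schedule might look like this:
      #include <omp.h>

      #define N 1000

      int main(void){
         double x[N];
         int i;

         /* iterations are handed out in chunks of 8; each thread grabs
            the next available chunk as soon as it finishes its current one */
         #pragma omp parallel for shared(x) private(i) schedule(dynamic,8)
         for (i=0; i < N; i++){
            x[i] = 2.0*i;
         }
         return 0;
      }
A dynamic schedule is most useful when the iterations take uneven amounts of time; for a uniform loop like this one a static schedule would have lower overhead.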
  1. Program ompexample5.c computes the value of Pi by numerically integrating the function 1/(1+x^2) between the values of 0 and 1 using the trapezoid method. (The value of this integral is Pi/4). The code divides the domain [0-1] into N trapezoids where the area of each trapezoid is given by the average length of the two parallel sides times the width of the trapezoid. The lengths of the two parallel sides are given by the values of the function 1/(1+x^2) for the lower and upper values of x for that domain. Parallelise this code using the omp for directive. Call your final code ompexample5.c
  2. If the number_of_segments used in the integration is 100, what is the relative error in the value of Pi? Is this error due to rounding or truncation error?

The barrier Directive

In any parallel program there will be certain points where you wish to synchronize all your threads. This is achieved by using the barrier directive. All threads must wait at the barrier before any of them can continue.
      #pragma omp barrier
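As a minimal sketch (not one of the lab files; the array val, the limit MAXT and the squaring are just placeholder work), each thread writes its own slot of a shared array, and the barrier guarantees every slot has been written before any thread reads a neighbouring slot:
      #include <stdio.h>
      #include <omp.h>

      #define MAXT 64   /* assumes no more than MAXT threads */

      int main(void){
         int val[MAXT];

         #pragma omp parallel default(none) shared(val)
         {
            int iam = omp_get_thread_num();   /* private: declared inside the region */
            int nt  = omp_get_num_threads();
            int next;

            val[iam] = iam*iam;               /* each thread fills its own slot */

            #pragma omp barrier               /* wait until every slot is written */

            next = (iam+1) % nt;              /* now safe to read another thread's slot */
            printf("Thread %i sees val[%i]=%i \n",iam,next,val[next]);
         }
         return 0;
      }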

The single Directive

There are certain pieces of code that you may want to run on only one thread, even though multiple threads are executing. For example, you often want output to be printed only once, from a single thread. This can be achieved using the single directive
     #pragma omp single [clause]
     {
         /*structured block*/
     }
By default all other threads will wait at the end of the structured block until the thread that is executing that block has completed. You can avoid this by augmenting the single directive with a nowait clause.
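A small sketch (not one of the lab files): only one thread prints the header line, and because no nowait clause is given the other threads wait at the end of the single block before printing their own line:
      #include <stdio.h>
      #include <omp.h>

      int main(void){
         #pragma omp parallel default(shared)
         {
            /* exactly one thread (whichever arrives first) executes this block */
            #pragma omp single
            {
               printf("Header printed once, by thread %i \n",omp_get_thread_num());
            }
            /* implicit barrier at the end of single: all threads resume here */
            printf("Thread %i continues \n",omp_get_thread_num());
         }
         return 0;
      }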

The critical Directive

In some instances interactions between threads may lead to wrong (or run-to-run variable) results. This can arise because two threads are manipulating the same data objects at the same time and the result depends on which thread happened to start first. The critical directive ensures that a block of code is executed by only one thread at a time. Effectively this serializes portions of a parallel region.
     #pragma omp critical [(name)]
     {
           /*structured block*/
     }
A thread will wait at the beginning of the critical section until no other thread in the team is executing that (named) section.
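As an illustrative sketch (not one of the lab files; the test data and the section name maxupdate are arbitrary), a named critical section serialises updates of a shared running maximum so that no update is lost:
      #include <stdio.h>
      #include <omp.h>

      #define N 1000

      int main(void){
         double a[N], amax;
         int i;

         for (i=0; i < N; i++) a[i] = (i*7) % 97;   /* arbitrary test data */
         amax = a[0];

         #pragma omp parallel for default(none) shared(a,amax) private(i)
         for (i=1; i < N; i++){
            /* only one thread at a time may compare and update the shared maximum */
            #pragma omp critical (maxupdate)
            {
               if (a[i] > amax) amax = a[i];
            }
         }
         printf("Maximum value = %f \n",amax);
         return 0;
      }
Because every iteration enters the critical section, this version is slow; it is shown only to illustrate the directive.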

The atomic Directive

In a somewhat similar vein to critical, the atomic directive ensures that a given memory location is never updated by two threads at precisely the same time. (Note - reading shared variables is not a problem - it is only storing to shared variables that is). The atomic directive sets a lock to ensure unique access to the shared variable being updated:
      #pragma omp atomic
The directive refers to the line of code immediately following it. Be aware that there is an overhead associated with the setting and unsetting of locks - so use this directive and/or critical sections only when necessary. For example we could use the atomic directive to parallelise an inner product:
      #pragma omp parallel for shared(a,b,sum) private(i,tmp)
          for (i=0; i < n; i++){
             tmp = a[i]*b[i];
             #pragma omp atomic
             sum = sum + tmp;
          }
but the performance would be very poor!
  1. Often it is useful to have a shared counter that can be used to implement load balancing. Program ompexample6.c is an attempt to implement such a thing. However it does not work correctly (try it for various values of maxtasks). Explain why the current version does not work correctly.
  2. Fix the above code so it does work correctly. Call your final code ompexample6.c.

COMP6464 Only


As mentioned in lectures all OpenMP constructs incur some overhead. As an application programmer it is important to have some feeling for the size of these overheads. (Also so you can beat up different vendors so that they produce better OpenMP implementations.) In a paper presented to the European Workshop on OpenMP (EWOMP) in 1999, Mark Bull (from Edinburgh Parallel Computing Centre - EPCC) presented a series of benchmarks for Measuring Synchronisation and Scheduling Overheads in OpenMP. The results are now somewhat old and were obtained with early versions of OpenMP-enabled compilers. Thus if we repeated the benchmarks today I would expect improved results, but not orders of magnitude different. (Note - Mark Bull has since published an updated paper that augments the benchmark suite for OpenMP 2.0 and gives more recent results - but it is not necessary for you to read this paper). The most recent results, on OpenMP 3.0, are presented in a paper by Mark's student here (you can take a quick look at it).

Read Mark Bull's first paper and then answer the following questions. Note that !$OMP DO is the Fortran equivalent to C directive #pragma omp for, and PARALLEL DO means the !$OMP PARALLEL and !$OMP DO directives are combined on a single line. Otherwise the Fortran/C relations should be obvious.

  1. Download the OpenMP micro benchmarks and unpack them on XE. The contents of the package have been modified to correspond to the XE system.
    Run syncbench and fill the following table (set number of threads by setenv OMP_NUM_THREADS):
                 Overhead of OpenMP constructs
    -------------------+-------+-------+-------+-------+-------+ 
                       |   1T  |  2T   |  4T   |  8T   |  16T  |
    -------------------+-------+-------+-------+-------+-------+ 
    PARALLEL FOR time  |       |       |       |       |       |
    PARALLEL FOR cycle |       |       |       |       |       |      
    -------------------+-------+-------+-------+-------+-------+ 
    Barrier   time     |       |       |       |       |       |
    Barrier   cycles   |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    Critical  time     |       |       |       |       |       |
    Critical  cycles   |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    Atomic    time     |       |       |       |       |       |
    Atomic    cycles   |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    Reduction time     |       |       |       |       |       |
    Reduction cycles   |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+
    Compare these results with the paper.
      
  2. Run schedbench and fill the following table (set number of threads by setenv OMP_NUM_THREADS):
                Overhead of OpenMP for loop scheduling
    -------------------+-------+-------+-------+-------+-------+ 
                       |   1T  |  2T   |  4T   |  8T   |  16T  |
    -------------------+-------+-------+-------+-------+-------+ 
    STATIC     1 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    STATIC     8 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    STATIC   128 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    DYNAMIC    1 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    DYNAMIC    8 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    DYNAMIC  128 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    GUIDED     1 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    GUIDED     8 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    GUIDED   128 time  |       |       |       |       |       |
    -------------------+-------+-------+-------+-------+-------+ 
    Compare the results with the paper.
      
  3. In the following loop, a, b and c are all of type double:

    
        for (i=0; i < N; i++){
           a[i]=a[i]+b[i]*c;
        }
        
    There are no dependencies and the loop can in principle be run in parallel.
  4. Under what circumstances (values of N and numbers of threads) would you advise me to parallelise the above loop on the XE system?
  5. If you use the #pragma omp parallel for directive, what sort of scheduling would you advise me to use?