OpenMP is a standard for shared memory parallelism supported by many C/C++ compilers. In this lab you will explore the basic features of OpenMP for parallelising loops.
This lab will again use the NCI Vayu system. A reminder that comprehensive documentation for the NCI Vayu system is available HERE. From the student systems, you can mount your home directory on Vayu using the following menu option:
Places -> Connect to Server...
Server = vayu.nci.org.au
Type = SSH
Folder = /home/444/<username>
A tarball containing all the programs for this lab is available HERE. Save this tarball to your local desktop and then transfer it to the Vayu system as per last lab.
There are three parts to this lab.
The OpenMP Application Program Interface (API) is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications. The OpenMP standard supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It has been jointly defined by a group of major computer hardware and software vendors. For more information see the OpenMP Tutorial by Lawrence Livermore National Laboratory, or the OpenMP web pages.
OpenMP defines a set of program directives and a small number of function/subroutine calls. The function/subroutine calls are associated with the execution runtime environment, memory locking, and timing. The directives are primarily responsible for the parallelisation of the code. For C/C++ code, the directives take the form of pragmas:
#pragma omp
A program written using OpenMP directives begins execution as a single process, or "master thread". The master thread executes sequentially until it encounters the first parallel construct. When this happens a team of threads is created, and the original thread assumes the role of master of that team. Upon completion of the parallel construct the threads synchronise (unless specified otherwise) and only the master continues execution. Any number of parallel constructs may be specified in the program, and as a result the program may "fork" and "join" many times.
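As a minimal illustration of this fork/join behaviour (a sketch only, not one of the lab's example programs), the following prints a message from the master thread, then from every thread in the team, then from the master again after the join:

#include <stdio.h>
#include <omp.h>

int main(void) {
  printf("Master thread before the parallel region\n");

  #pragma omp parallel
  {
    /* Executed by every thread in the team */
    printf("Hello from thread %i of %i\n",
           omp_get_thread_num(), omp_get_num_threads());
  }  /* threads synchronise and join here */

  printf("Master thread after the parallel region\n");
  return 0;
}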
The number of threads to spawn may be:
set via the OMP_NUM_THREADS environment variable before the program starts, or
set at runtime from within the program by calling omp_set_num_threads().
Note that the number of threads may exceed the number of physical CPUs on the machine (known as oversubscription); in this case the operating system must schedule the threads as best it can. Whether this is permitted on a given system is determined by the environment variable OMP_DYNAMIC. If OMP_DYNAMIC is true, the OpenMP environment will not permit you to have more threads than physical CPUs on the machine. To override this and obtain more threads than CPUs you need to issue the command export OMP_DYNAMIC=false.
parallel Regions
A parallel region is a structured block of code that is to be executed in parallel by a number of threads. Each thread executes the structured block independently.
Note: it is illegal for code to branch out of a parallel region.
The basic structure is as follows:
#pragma omp parallel [clause]
{
/* structured block */
}
The value of [clause] determines data scoping, as follows:
private (list): specifies that each thread has its own uninitialised local copy of the variables listed.
shared (list): specifies single variables that are visible to all threads. If you specify a variable as shared, you are stating that all threads can safely share a single copy of the variable.
default(shared | none): specifies the default scope for all variables. When you specify the default clause, you declare the default for all variables in the code block to be shared or to have no default (none). If you specify a default of none, you must explicitly scope each variable referred to within the parallel region as either private or shared.
Note: Fortran also permits a default of private. This is not available in C/C++ since many of the standard libraries use global variables, and scoping these as private (local) would give errors.
As an example, consider the program ompexample1.cc:
#include <stdio.h>
#include <omp.h>

int main(void) {
  int i = 1, j = 2, k;
  printf(" Initial value of i %i and j %i \n", i, j);
  #pragma omp parallel for default(shared) private(i)
  for (k = 0; k < 4; k++) {
    printf(" Initial value in parallel of i %i and j %i \n", i, j);
    i = i + 99;
    j = j + 99;
    printf(" Final value in parallel of i %i and j %i \n", i, j);
  }
  printf(" Final value of i %i and j %i \n", i, j);
  return 0;
}
Compile it using the command make ompexample1
Run the code with one thread using the command ./ompexample1
What value is printed out for i and j and why?
Now run the code with 4 threads by first setting the OMP_NUM_THREADS environment variable: OMP_NUM_THREADS=4 ./ompexample1
What is printed out now? Explain why these values are printed.
Explain the effect of changing private(i) to firstprivate(i). What happens if you change it to lastprivate(i)?
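The following sketch (not part of the lab tarball) may help you think about this question. It applies firstprivate to a trivial loop; the comments note how lastprivate would differ:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i = 1;
  int k;

  /* firstprivate: each thread's private copy of i starts at 1, the value
     set outside the loop. lastprivate(i) would instead leave i uninitialised
     inside the loop but copy the value from the final (k == 3) iteration
     back to the outer i when the loop finishes. */
  #pragma omp parallel for firstprivate(i)
  for (k = 0; k < 4; k++) {
    i = i + 99;   /* accumulates within each thread's private copy */
    printf("iteration %i: i = %i\n", k, i);
  }

  /* With firstprivate, the outer i is unchanged and still prints 1. */
  printf("after the loop: i = %i\n", i);
  return 0;
}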
reduction Clause
A reduction clause can be added to the parallel directive. This specifies that the final values of certain variables are combined using the specified operation (or intrinsic function) at the end of the parallel region. For example, consider the example program ompexample2.cc:
#include <stdio.h>
#include <omp.h>

int main(void) {
  int t_id;
  int i = 10, j = 10, k = 10;
  printf("Before parallel region: i=%2i, j=%2i, k=%2i\n", i, j, k);
  #pragma omp parallel default(none) private(t_id) reduction(+:i) \
                       reduction(*:j) reduction(&:k)
  {
    t_id = omp_get_thread_num() + 1;
    i = t_id;
    j = t_id;
    k = t_id;
    printf("Thread %i: i=%i, j=%i, k=%i\n", t_id, i, j, k);
  }
  printf("After parallel region: i=%2i, j=%2i, k=%2i\n", i, j, k);
  return 0;
}
The above program demonstrates a number of reduction operations and also shows the use of the omp_get_thread_num function to uniquely define each thread.
Compile and run this program with several threads, for example:
export OMP_NUM_THREADS=14
./ompexample2
Some other useful OpenMP routines are:
omp_set_num_threads(np): allows you to set the number of parallel threads from within the program.
omp_get_num_threads(): determines the number of threads currently executing (note this is 1 unless you are in a parallel region).
omp_get_max_threads(): gives the maximum number of threads that will be used.
The above three functions are used in the example program ompexample3.cc:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
  int np, t_id, num_threads, max_threads;
  if (argc != 2) {
    printf(" %s Number_of_threads \n", argv[0]);
    return -1;
  } else {
    np = atoi(argv[1]);
    if (np < 1) {
      printf("Error: Number_of_threads (%i) < 1 \n", np);
      return -1;
    }
  }
  omp_set_num_threads(np);
  num_threads = omp_get_num_threads();
  max_threads = omp_get_max_threads();
  printf("Before Parallel: num_threads=%i max_threads %i\n",
         num_threads, max_threads);
  #pragma omp parallel default(none) private(num_threads, t_id)
  {
    num_threads = omp_get_num_threads();
    t_id = omp_get_thread_num();
    printf("In Parallel: num_threads=%i t_id=%i \n", num_threads, t_id);
  }
  num_threads = omp_get_num_threads();
  printf("After Parallel: num_threads=%i \n", num_threads);
  return 0;
}
Compile ompexample3.cc and run the program requesting 20 threads.
How many are you actually allocated and why?
What do you have to do to actually get 20 threads?
Make that change and verify that you are correct.
Program ompexample4.cc computes the sum of all integers from 1 to num_elem, where num_elem is given as command line input. This task is performed using the following loop:
int sum = 0;
int i = 0;
do {
  i++;
  sum += i;
} while (i < num_elem);
Parallelise this code using OpenMP. You may not simply rewrite it as a for loop and use #pragma omp for to divide up the loop operations amongst the available OpenMP threads; instead you must partition the work manually, and your parallel code must continue to use a while statement.
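One possible way to structure such a manual decomposition is sketched below. This is only an outline, with num_elem hard-coded rather than read from the command line as in ompexample4.cc: each thread is given a contiguous block of the range and the partial sums are combined with a reduction clause.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int num_elem = 1000;   /* hard-coded here; ompexample4.cc reads it from the command line */
  int sum = 0;

  #pragma omp parallel default(none) shared(num_elem) reduction(+:sum)
  {
    int nthreads = omp_get_num_threads();
    int t_id     = omp_get_thread_num();
    int chunk    = (num_elem + nthreads - 1) / nthreads;   /* ceiling division */
    int lo       = t_id * chunk;                           /* this thread sums lo+1 .. hi */
    int hi       = (lo + chunk < num_elem) ? lo + chunk : num_elem;

    int i = lo;
    while (i < hi) {     /* still a while loop, as the exercise requires */
      i++;
      sum += i;          /* accumulates into this thread's private copy of sum */
    }
  }

  printf("sum = %i (expected %i)\n", sum, num_elem * (num_elem + 1) / 2);
  return 0;
}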
for Directive
In the above you parallelised a loop by manually assigning different loop indices to different threads. With for loops OpenMP provides the for directive to do this for you. This directive is placed immediately before a for loop and automatically partitions the loop iterations across the available threads.
#pragma omp for [clause[,clause ...]]
for ()
An important optional clause is the schedule(type[,chunk]) clause. This can be used to define specifically how the loop iterations are divided amongst the different threads. Two distribution schemes are:
(static, chunk-size): iterations are divided into pieces of a size specified by chunk and these chunks are then assigned to threads in a round-robin fashion.
(dynamic, chunk-size): iterations are divided into pieces of a size specified by chunk. As each thread finishes a chunk, it dynamically obtains the next available chunk of loop indices.
ompexample5.cc computes the value of Pi by numerically integrating the function 1/(1+x^2) between the values of 0 and 1 using the trapezoid method. (The value of this integral is Pi/4.) The code divides the domain [0,1] into N trapezoids, where the area of each trapezoid is given by the average length of the two parallel sides times the width of the trapezoid. The lengths of the two parallel sides are given by the values of the function 1/(1+x^2) at the lower and upper values of x for that trapezoid. Parallelise this code using the omp for directive.
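Before attempting this, the following sketch shows the schedule clause combined with a reduction. It is not the Pi code itself, just an illustrative loop whose iteration cost is deliberately uneven so that dynamic scheduling can help:

#include <stdio.h>
#include <omp.h>

int main(void) {
  const int n = 1000;
  double total = 0.0;

  /* Chunks of 10 iterations are handed out to whichever thread is idle. */
  #pragma omp parallel for schedule(dynamic,10) reduction(+:total)
  for (int i = 0; i < n; i++) {
    double work = 0.0;
    for (int j = 0; j < i; j++)     /* iteration cost grows with i */
      work += 1.0 / (1.0 + j);
    total += work;
  }

  printf("total = %f\n", total);
  return 0;
}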
barrier Directive
In any parallel program there will be certain points where you wish to synchronise all your threads. This is achieved by using the barrier directive:
#pragma omp barrier
All threads must wait at the barrier before any of them can continue.
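A minimal sketch of its use (not one of the lab examples): every thread completes phase 1 before any thread starts phase 2.

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    int t_id = omp_get_thread_num();
    printf("Thread %i: phase 1\n", t_id);

    /* No thread starts phase 2 until every thread has finished phase 1 */
    #pragma omp barrier

    printf("Thread %i: phase 2\n", t_id);
  }
  return 0;
}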
single Directive
Certain pieces of code you may want to run on only one thread, even though multiple threads are executing. For example, you often want output to be printed once, from one thread. This can be achieved using the single directive:
#pragma omp single [clause]
{
/*structured block*/
}
By default all other threads will wait at the end of the structured block until the thread that is executing that block has completed. You can avoid this by augmenting the single directive with a nowait clause.
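A brief sketch combining single with nowait (again, not one of the lab examples): one thread prints a header while the others carry on without waiting for it.

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    /* Only one thread (whichever arrives first) prints the header;
       with nowait the other threads do not wait for it to finish. */
    #pragma omp single nowait
    {
      printf("Header printed by thread %i only\n", omp_get_thread_num());
    }

    printf("Thread %i continues\n", omp_get_thread_num());
  }
  return 0;
}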
critical Directive
In some instances interactions between threads may lead to wrong (or run-to-run variable) results. This can arise because two threads are manipulating the same data objects at the same time and the result depends on which thread happened to start first. The critical directive ensures that a block of code is only executed by one thread at a time. Effectively this serialises portions of a parallel region.
#pragma omp critical [(name)]
{
/*structured block*/
}
A thread will wait at the beginning of the critical section until no other thread in the team is executing that (named) section.
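A sketch of named critical sections (an illustration only; the counters hits and misses are made up for this example). Because the sections have different names, a thread updating one counter does not block a thread updating the other:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int hits = 0, misses = 0;

  #pragma omp parallel shared(hits, misses)
  {
    int t_id = omp_get_thread_num();

    /* Two independent named sections: updates to 'hits' and 'misses'
       are serialised separately. */
    if (t_id % 2 == 0) {
      #pragma omp critical (update_hits)
      hits++;
    } else {
      #pragma omp critical (update_misses)
      misses++;
    }
  }

  printf("hits=%i misses=%i\n", hits, misses);
  return 0;
}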
atomic Directive
In a somewhat similar vein to critical, the atomic directive ensures that a given memory location is never updated by two threads at precisely the same time. (Note: reading shared variables is not a problem, only storing to shared variables.) The atomic directive sets locks to ensure unique access to a given shared variable:
#pragma omp atomic
The directive applies to the single statement immediately following it. Be aware that there is an overhead associated with the setting and unsetting of locks, so use this directive and/or critical sections only when necessary. For example, we could use the atomic directive to parallelise an inner product:
#pragma omp parallel for shared(a,b,sum) private(i,tmp)
for (i = 0; i < n; i++) {
  tmp = a[i]*b[i];
  #pragma omp atomic
  sum = sum + tmp;
}
but the performance would be very poor!
ompexample6.cc is an attempt to implement such a scheme. However, it does not work correctly (try it for various values of maxtasks). Explain why the current version does not work correctly. (You may have to run it several times; for example, using csh you can run the code 100 times and search for errors as follows.)
repeat 100 ./ompexample6 4 109 | grep Something
Modify ompexample6.cc so that it works correctly.
In lab 2 you were provided with a code to evaluate the Mandelbrot set. The same code is provided in this lab as mpi_mandel.cc, with a sequential version given in omp_mandel.cc.
Parallelise omp_mandel.cc using OpenMP.
Recall that this code suffers from load imbalance.
Include in your OpenMP parallelisation a dynamic load balancing scheme.
In lab 3 you were provided with a code to solve the heat distribution problem on a rectangular grid. Similar code is provided in this lab as mpi_heat.cc and a sequential version given in omp_heat.cc.
Parallelise omp_heat.cc using OpenMP.
All OpenMP constructs incur some overhead. As an application programmer it is important to have some feeling for the size of these overheads so that you can use the constructs appropriately in your codes. (Also, so you can beat up different vendors so that they produce better OpenMP implementations.)
In a paper presented to the European Workshop on OpenMP (EWOMP) in 1999, Mark Bull (from the Edinburgh Parallel Computing Centre, EPCC) presented a series of benchmarks for Measuring Synchronisation and Scheduling Overheads in OpenMP. Although the results are now somewhat old and were obtained with early versions of OpenMP-enabled compilers, the benchmarks are still appropriate for more modern architectures and compilers. (Note: Mark Bull has since published an updated paper that augments the benchmark suite for OpenMP 2.0 and gives more recent results, but it is not necessary for you to read this paper.)
Read Mark Bull’s first paper and then answer the following questions. Note that !$OMP DO is the Fortran equivalent to C++ directive #pragma omp for, and PARALLEL DO means the !$OMP PARALLEL and !$OMP DO directives are combined on a single line. Otherwise the Fortran/C++ relations should be obvious.
In the following loop, a, b and c are all of type double. There are no dependencies and the loop can in principle be run in parallel.
for (i = 0; i < N; i++) {
  a[i] = a[i] + b[i] * c;
}
If I parallelise this loop using a #pragma omp parallel for directive, what sort of scheduling would you advise me to use? (Again, answer for the Sun HPC 3500 system and justify your answer.)
Now consider the following loop, which computes an inner product:
c = 0.0;
for (i = 0; i < N; i++) {
  c = c + a[i]*b[i];
}