ANU College of Engineering and Computer Science (CECS)
Department of Computer Science
COMP4300/6430 2009: Laboratory 4
Shared Memory Parallel Programming with OpenMP
The aim of this lab is to use OpenMP to provide a basic introduction to shared memory parallel programming. Initially you will explore the basic features of OpenMP.
ssh c43XXX@fremont
using the same password as you were originally given on saratoga. Ask me if you have forgotten it. Again, this is a resource within the Computer Systems research group, so it is run and administered by the group (not the Technical Support Group). Please treat this machine with respect. Be aware: THERE ARE NO BACKUPS ON FREMONT. It is your responsibility to periodically move your files back to the student system. Fremont also does not share disk with saratoga, so you will need to transfer any files from there manually (although I don't think that will be necessary).
A tar file for this lab is available on Fremont in /tmp/COMP4300. Copy this and untar it as follows:
cp /tmp/COMP4300/lab4.tar .
tar -xvf lab4.tar
There are three parts to this lab
PART1: OpenMP Background
The OpenMP Application Program Interface (API) is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications. The OpenMP standard supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It has been jointly defined by a group of major computer hardware and software vendors. For more information see the OpenMP web pages.
OpenMP consists of a series of program directives and a small number of function/subroutine calls. The function/subroutine calls are associated with the execution runtime environment, memory locking, and timing. The directives are primarily responsible for the parallelisation of the code. For C/C++ code the directives take the form of pragmas:
The number of threads that are spawned may be
OpenMP Directives
parallel Regions
A Parallel Region is a structured block of code that is to be executed in parallel by a number of threads. Each thread executes the structured block independently. NOTE: it is illegal for your code to branch out of a parallel region. The basic structure is as follows:
#pragma omp parallel [clause]
{
/*structured block*/
}
Clause can be a variety of things for example:
#include <stdio.h>
#include <omp.h>
int main(void){
  int i=1,j=2;
  printf(" Initial value of i %i and j %i \n",i,j);
  #pragma omp parallel default(shared) private(i)
  {
    printf(" Initial value in parallel of i %i and j %i \n",i,j);
    i=i+99;
    j=j+99;
    printf(" Final value in parallel of i %i and j %i \n",i,j);
  }
  printf(" Final value of i %i and j %i \n",i,j);
  return 0;
}
The above code is contained in file ompexample1.c. Compile it
by typing
make ompexample1
The reduction Clause
A reduction clause can be added to the parallel directive. This specifies that the final values of certain variables are combined using the specified operation (or intrinsic function) at the end of the parallel region. For example, consider program ompexample2.c:
#include <stdio.h>
#include <omp.h>
int main(void){
  int tnumber;
  int i=10,j=10,k=10;
  printf("Before parallel region: i=%i,j=%i,k=%i\n",i,j,k);
  #pragma omp parallel default(none) private(tnumber) reduction(+:i) \
    reduction(*:j) reduction(&:k)
  {
    tnumber=omp_get_thread_num()+1;
    i = tnumber;
    j = tnumber;
    k = tnumber;
    printf("Thread %i: i=%i,j=%i,k=%i\n",tnumber,i,j,k);
  }
  printf("After parallel region: i=%i,j=%i,k=%i\n",i,j,k);
  return 0;
}
The above program demonstrates a number of reduction operations and also
shows the use of the omp_get_thread_num function to uniquely define
each thread.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char* argv[]){
  int np, iam, nthread, mxthread;
  if (argc != 2) {
    printf(" %s Number_of_threads \n",argv[0]);
    return -1;
  }
  else {
    np = atoi(argv[1]);
    if (np < 1){
      printf("Error: Number_of_threads (%i) < 1 \n",np);
      return -1;
    }
  }
  omp_set_num_threads(np);
  nthread=omp_get_num_threads();
  mxthread=omp_get_max_threads();
  printf("Before Parallel: nthread=%i mxthread %i\n",nthread,mxthread);
  #pragma omp parallel default(none) private(nthread,iam)
  {
    nthread=omp_get_num_threads();
    iam=omp_get_thread_num();
    printf("In Parallel: nthread=%i iam=%i \n",nthread,iam);
  }
  nthread=omp_get_num_threads();
  printf("After Parallel: nthread=%i \n",nthread);
  return 0;
}
The for Directive
In the above you parallelised a loop by manually assigning different loop indices to different threads. With for loops OpenMP provides the for directive to do this for you. This directive is placed immediately before a for loop and automatically partitions the loop iterations across the available threads.
#pragma omp for [clause[[,] clause] ...]
for ()
An important optional clause is the schedule(type[,chunk])
clause. This can be used to define specifically how the iterations are
divided amongst the different threads. Two distribution schemes are
The barrier Directive
In any parallel program there will be certain points where you wish to synchronize all your threads. This is achieved by using the barrier directive. All threads must wait at the barrier before any of them can continue.
#pragma omp barrier
The single Directive
Certain pieces of code you may only want to run on one thread, even though multiple threads are executing. For example, you often only want output to be printed once from one thread. This can be achieved using the single directive:
#pragma omp single [clause]
{
/*structured block*/
}
By default all other threads will wait at the end of the structured
block until the thread that is executing that block has completed. You
can avoid this by augmenting the single directive with a
nowait clause.
The critical Directive
In some instances interactions between threads may lead to wrong (or run-to-run variable) results. This can arise because two threads are manipulating the same data objects at the same time and the result depends on which thread happened to start first. The critical directive ensures that a block of code is only executed by one thread at a time. Effectively this serializes portions of a parallel region.
#pragma omp critical [(name)]
{
/*structured block*/
}
A thread will wait at the beginning of the critical section until no
other thread in the team is executing that (named) section.
The atomic Directive
In a somewhat similar vein to critical, the atomic directive ensures that a given memory location is never updated by two threads at precisely the same time. (Note: reading shared variables is not the problem; it is storing to shared variables that must be protected.) The atomic directive in effect sets locks to ensure unique access to a given shared variable:
#pragma omp atomic
The directive refers to the statement immediately following
it. Be aware that there is an overhead associated with the setting and
unsetting of locks, so use this directive and/or critical
sections only when necessary. For example, we
could use the atomic directive to parallelise an inner product:
#pragma omp parallel for shared(a,b,sum) private(i,tmp)
for (i=0; i < n; i++){
  tmp = a[i]*b[i];
  #pragma omp atomic
  sum = sum + tmp;
}
but the performance would be very poor!
PART2: Application Parallelisation
In lab 2 you were provided with a code to evaluate the Mandelbrot set. The same code is provided in this lab as mpi_mandel.c, with a sequential version given in omp_mandel.c. Also included is the file batch_job, which demonstrates how to run both the OpenMP and MPI versions of the mandel code using the queuing system. (It is the same Sun Grid Engine queuing system as you used on Saratoga.)
PART3: OVERHEADS
As mentioned in lectures, all OpenMP constructs incur some overhead. As an application programmer it is important to have some feeling for the size of these overheads. (Also so you can beat up different vendors so that they produce better OpenMP implementations.) In a paper presented to the European Workshop on OpenMP (EWOMP) in 1999, Mark Bull (from Edinburgh Parallel Computing Centre, EPCC) presented a series of benchmarks for Measuring Synchronisation and Scheduling Overheads in OpenMP. Although the results are now somewhat old and were obtained with early versions of OpenMP-enabled compilers, the Sun processors used are actually identical to those in Karajan. Thus if we repeated the benchmarks today I would expect improved results, but not orders of magnitude different. (Note: Mark Bull has since published an updated paper that augments the benchmark suite for OpenMP 2.0 and gives more recent results, but it is not necessary for you to read this paper.)
Read Mark Bull's first paper and then
answer the following questions. Note that !$OMP DO is the
Fortran equivalent of the C directive #pragma omp for, and
PARALLEL DO means the !$OMP PARALLEL and !$OMP
DO directives are combined on a single line. Otherwise
the Fortran/C relations should be obvious.
Please direct all enquiries to: Alistair.Rendell@anu.edu.au
Page authorised by: Head of Department, DCS
The Australian National University, CRICOS Provider Number 00120C