ANU College of Engineering and Computer Science
School of Computer Science
COMP8320 Laboratory 3 - week 3, 2011
Advanced Topics in OpenMP

This laboratory session will give you hands-on experience with two advanced features of the OpenMP API: nested parallelism and tasks.
Basic Setup

Copy the files for the session into your home directory area:
Nested Parallelism

Background briefing material for this portion of the lab is in the form of slides here.

Basic Nesting

Compile the sample program example1-Basic.c:
Run the program as-is. What does it do? Look at the source code: do you understand how the program nests parallel regions? Make a copy of this program and play with it. Adjust the number of threads in some of the regions, then recompile and re-run the program. How does its behavior change? Now try adding the function call
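Nested parallelism is disabled by default in OpenMP and must be enabled, typically either with the runtime call omp_set_nested(1) or by setting OMP_NESTED=true in the environment. As a standalone illustration, here is a minimal sketch of nested parallel regions; it is written for this topic and is not taken from example1-Basic.c:

    /* Minimal sketch of nested parallelism (not the lab's example1-Basic.c):
       each of the 2 outer threads forks its own inner team of 3. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_nested(1);                  /* nesting is off by default */
        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(3)
            {
                /* 2 x 3 = 6 lines printed in some interleaved order */
                printf("outer %d, inner %d\n", outer, omp_get_thread_num());
            }
        }
        return 0;
    }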
Add an additional level of parallelism to a copy of example1-Basic.c, and compile and run it. Next, examine the source of example2-FlexLevels.c. Compile it using the command
Run the executable. We will now examine the effects of the standard OpenMP 3.0 environment variables OMP_MAX_ACTIVE_LEVELS and OMP_THREAD_LIMIT (the notes refer to the equivalent Sun-specific variables SUNW_MP_MAX_NESTED_LEVELS and SUNW_MP_MAX_POOL_THREADS). Before doing this, retrieve and record their default settings:

Now, play around with the environment variables OMP_MAX_ACTIVE_LEVELS (e.g. export OMP_MAX_ACTIVE_LEVELS=2) and OMP_THREAD_LIMIT, restricting them to small values that allow fewer levels of parallelism, or fewer threads, than the example program wants by default. What is the relationship between these variables and the number of levels on which more than one thread runs? Restore the defaults by unsetting both variables:
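You can also confirm these limits from inside a program: OpenMP 3.0 provides runtime counterparts to both environment variables. A minimal sketch of ours (not one of the lab files):

    /* Query the nesting and thread-pool limits at run time. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("max active levels: %d\n", omp_get_max_active_levels());
        printf("thread limit:      %d\n", omp_get_thread_limit());
        return 0;
    }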
Now, copy example2-FlexLevels.c into a new file, say
Finally, there are new routines in OpenMP 3.0 that allow further queries of a thread's place in the nested parallel environment. You can see examples of these calls in example4-OpenMP3.0-LibCalls.c. Compile and run this program (after export OMP_NUM_THREADS=4).
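For reference, the sketch below (ours, not example4-OpenMP3.0-LibCalls.c) exercises the main OpenMP 3.0 nesting queries: omp_get_level(), omp_get_active_level(), omp_get_team_size(), and omp_get_ancestor_thread_num():

    /* Each of the 4 innermost threads reports its place in the nest. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_nested(1);
        #pragma omp parallel num_threads(2)
        #pragma omp parallel num_threads(2)
        {
            #pragma omp critical          /* keep the output lines intact */
            printf("level %d (active %d): thread %d of %d, level-1 ancestor %d\n",
                   omp_get_level(), omp_get_active_level(),
                   omp_get_thread_num(), omp_get_team_size(omp_get_level()),
                   omp_get_ancestor_thread_num(1));
        }
        return 0;
    }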
Examine the matrix-matrix multiplication program example5-MatMatMul.c. You will note that it contains only one level of parallelism, but has quite a lot of work to do. You will also notice that it determines the number of OpenMP threads it uses from the environment variable OMP_NUM_THREADS.
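The program's exact structure is best read from its source, but a single-level parallelization of the classic triple loop generally looks like the following sketch (ours; the array and bound names are assumptions, not necessarily the lab file's):

    /* Single-level parallel matrix-matrix multiply: c = a * b. */
    #include <stdio.h>
    #include <omp.h>

    #define L 500
    #define M 500
    #define N 500

    static double a[L][M], b[M][N], c[L][N];

    int main(void)
    {
        for (int i = 0; i < L; i++)             /* trivial test data */
            for (int k = 0; k < M; k++)
                a[i][k] = 1.0;
        for (int k = 0; k < M; k++)
            for (int j = 0; j < N; j++)
                b[k][j] = 2.0;

        /* one level of parallelism; team size comes from OMP_NUM_THREADS */
        #pragma omp parallel for
        for (int i = 0; i < L; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < M; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }

        printf("c[0][0] = %.1f (expect %.1f)\n", c[0][0], 2.0 * M);
        return 0;
    }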
Compile example5-MatMatMul.c by executing the command

Study its scaling behavior by timing it on 1, 2, 4, 8, 16, 24, and 32 cores. This will require setting OMP_NUM_THREADS accordingly. If you are particularly ambitious, perform multiple time trials to get some feel for the uncertainties in your timing results. As computer scientists, you should know how to automate this process! ;-) What sort of scaling behavior do you see? Plot the speedup relative to the single-thread time and, if you have multiple time trials, include vertical error bars in your results.
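One way to automate the trials is to wrap the computation in omp_get_wtime() calls inside the program and report a mean and sample standard deviation over repetitions. The sketch below is ours, with do_work() as a hypothetical stand-in for the multiply:

    /* Time NTRIALS repetitions and report mean +/- std. deviation. */
    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define NTRIALS 5

    static void do_work(void)           /* hypothetical stand-in workload */
    {
        double s = 0.0;
        #pragma omp parallel for reduction(+: s)
        for (int i = 1; i <= 50000000; i++)
            s += 1.0 / i;
        if (s < 0.0) printf("unreachable\n");  /* defeat dead-code elimination */
    }

    int main(void)
    {
        double t[NTRIALS], mean = 0.0, var = 0.0;

        for (int i = 0; i < NTRIALS; i++) {
            double t0 = omp_get_wtime();
            do_work();
            t[i] = omp_get_wtime() - t0;
            mean += t[i];
        }
        mean /= NTRIALS;
        for (int i = 0; i < NTRIALS; i++)
            var += (t[i] - mean) * (t[i] - mean);

        printf("%d trials: %.4f +/- %.4f s\n",
               NTRIALS, mean, sqrt(var / (NTRIALS - 1)));
        return 0;
    }

A natural choice for the error bars is then plus or minus one standard deviation.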
Armed with these results as a baseline, investigate the efficacy of adding a second level of parallelism in the main loop. You should be able to figure out how to do this on your own. But if you are genuinely stymied, check out the program example6-MatMatMul-Nested.c (warning: it shows the idea for nested parallelism, but it has a deliberate bug in the variable scoping).
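For contrast, a correctly scoped two-level version of the loop nest from the earlier multiply sketch could look like the following (again ours, not the corrected example6). Declaring the inner index and accumulator inside the region makes them private; leaving them shared across threads is the classic form of the bug:

    /* Two-level nesting over i and j; enable nesting once at startup. */
    omp_set_nested(1);

    #pragma omp parallel for
    for (int i = 0; i < L; i++) {
        /* each outer thread forks its own inner team over j */
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {
            double sum = 0.0;               /* private by declaration */
            for (int k = 0; k < M; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }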
Run your nested code on thread counts that are perfect squares and compare with the performance results from your single-level code. If you have automated multiple timing trials for the single-level example, do the same here. What do you see? How do the results differ? Are the differences statistically significant, that is, greater than the span of the error bars?
Try using OpenMP's facilities for setting the number of threads within levels, and repeat these timing experiments. Also, try increasing the matrix dimensions (the variables l, m, and n in main()) and see how large you need to make them to reduce the difference between single- and dual-level parallelism to a statistically insignificant level (i.e., the error bars overlap). Can you even do this?
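Two of the facilities in question, sketched here by us: the num_threads clause fixes the team size for a single region, while omp_set_num_threads() changes the default for subsequent regions started by the calling thread. For example, to split 16 threads as 4 x 4:

    #pragma omp parallel for num_threads(4)      /* outer team of 4 */
    for (int i = 0; i < L; i++) {
        #pragma omp parallel for num_threads(4)  /* inner team of 4 per outer thread */
        for (int j = 0; j < N; j++) {
            /* ... same multiply body as in the earlier sketches ... */
        }
    }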
Finally, can this algorithm be re-implemented using OpenMP 3.0 tasks?
Turning to the session's other topic, tasks: remove the firstprivate(p,q) clauses in par_quick_sort. Recompile and run the program again. What does this tell you about the clause you removed?
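The source of par_quick_sort is not reproduced in this handout, so for reference here is a hypothetical task-based quicksort written for illustration (the names partition, swap_ints, and the driver are ours; the lab's code will differ in detail). Each task gets its own snapshot of the subarray bounds at task-creation time via firstprivate:

    /* Hypothetical sketch of a task-parallel quicksort. */
    #include <stdio.h>
    #include <omp.h>

    static void swap_ints(int *a, int i, int j)
    {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    static int partition(int *a, int p, int q)
    {
        int pivot = a[q], i = p - 1;
        for (int j = p; j < q; j++)
            if (a[j] <= pivot)
                swap_ints(a, ++i, j);
        swap_ints(a, i + 1, q);
        return i + 1;
    }

    static void par_quick_sort(int *a, int p, int q)
    {
        if (p >= q) return;
        int r = partition(a, p, q);
        #pragma omp task shared(a) firstprivate(p, r)
        par_quick_sort(a, p, r - 1);    /* sort left half in a child task */
        #pragma omp task shared(a) firstprivate(q, r)
        par_quick_sort(a, r + 1, q);    /* sort right half in a child task */
        #pragma omp taskwait            /* wait for both halves */
    }

    int main(void)
    {
        int a[] = { 9, 4, 7, 1, 8, 2, 6, 3, 5, 0 };
        int n = (int)(sizeof a / sizeof a[0]);

        /* tasks must be created inside a parallel region; single
           ensures exactly one thread starts the recursion */
        #pragma omp parallel
        #pragma omp single
        par_quick_sort(a, 0, n - 1);

        for (int i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }

When answering the question above, bear in mind that task constructs have different data-sharing defaults from parallel constructs: in OpenMP 3.0, variables that are not shared in the enclosing context default to firstprivate on a task, not shared.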
Please direct all enquiries to:
Page authorised by: Head of Department, DCS
The Australian National University — CRICOS Provider Number 00120C