ANU College of Engineering and Computer Science
School of Computer Science
COMP8320 Laboratory 01 - week 1, 2011
Introduction to the T2 Multicore Processor

This session will provide an introduction to multicore programming on the UltraSPARC T2 using OpenMP and automatic parallelization techniques. The primary objective is to give you a feeling for the performance of a multicore computer and for the more straightforward programming paradigms available for it. In this session, you will encounter issues, some quite deep, relating to multicore computer programming and performance. These notes will ask you questions on these as they come up. It's good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

You will first need to customize your command line environment for this course. To do this, simply add to your ~/.bashrc file the line:
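A typical setup line (this one is hypothetical; use the exact line given by your demonstrator) extends your PATH to pick up the course tools:

    export PATH=/dept/dcs/comp8320/bin:$PATH    # hypothetical setup line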
Warning! The T2's /dept/dcs is a `copy' of /dept/dcs on the student system and may be out-of-sync, so copy files from /dept/dcs/comp8320 on the student system. Copy the files for the session into your home account area:
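Assuming the session's files live under a lab01 subdirectory (the directory name here is an assumption), something like:

    cp -r /dept/dcs/comp8320/lab01 ~    # copy the lab files into your home directory
    cd ~/lab01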
Automatic Parallelization: integer arithmetic-intensive programs

The program sum.c initializes an array of integers (its size determined by its command line argument) and computes its sum. It uses 64-bit integers (long long int) in order to avoid overflow. The time it takes to do this is also measured. Open it with a text editor (e.g. using the command emacs sum.c &) and inspect the code. Compile it using the command:
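For orientation, a minimal sketch of the heart of sum.c (variable names are illustrative; the real file also times the loops):

    /* sketch only: the real sum.c reads n from argv and times the loops */
    long long int *a = malloc(n * sizeof *a);
    long long int sum = 0;
    for (i = 0; i < n; i++)    /* initialization loop */
        a[i] = i;
    for (i = 0; i < n; i++)    /* sum loop */
        sum += a[i];
    printf("total = %lld\n", sum);

With the Sun Studio cc used on the T2, a plausible invocation (an assumption: -xautopar requests automatic parallelization, and -xloopinfo reports which loops were parallelized) is:

    cc -fast -xautopar -xloopinfo -o sum sum.c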
You will see that it did not parallelize the sum loop. This is because updates to sum create a dependence between loop iterations. Now try compiling it again, instructing the compiler to use special techniques (reductions) to parallelize such loops:
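A plausible command, assuming Sun Studio's -xreduction flag (which allows -xautopar to parallelize reduction loops):

    cc -fast -xautopar -xreduction -xloopinfo -o sum sum.c

Then run it, controlling the thread count via OMP_NUM_THREADS as in the OpenMP runs later (Sun's autopar runtime also honours the PARALLEL variable):

    export OMP_NUM_THREADS=2; ./sum 1000000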
Repeat the above for 4 threads. Note that you can use the arrow keys to edit and re-execute previously typed commands. Finally, comment out the line that prints the total in sum.c, and re-compile. Why has the compiler decided the loops are not `profitable'? Hints: try running the program; what do you observe? Or try re-compiling without -fast. This illustrates a potential pitfall of using an aggressively optimizing compiler. Restore the print statement and re-compile before continuing.

Automatic Parallelization: memory-intensive programs

Inspect the program dcopy.c. This program is similar to sum.c except that it copies the array into a second array (which is now double precision floating point) instead of summing it. Compile it with the command:
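As before, a plausible compile command (same assumed flags as for sum.c):

    cc -fast -xautopar -xloopinfo -o dcopy dcopy.c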
Note that while the T2 nominally has 64 CPUs, only 56 have been reserved for the exported virtualized T2 (wallaman) on the student system (execute the command /usr/sbin/psrinfo to verify this). In fact only 7 of the 56 are real CPUs (cores): through a technique called hardware threading, each core provides 8 sets of registers that share the core's integer, floating point and load/store execution units. A hardware thread is thus extremely cheap; the question is how close it comes to emulating the expensive alternative (56 real CPUs). Run the sum program for up to 32 threads; you can use:
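For example, a shell loop sweeping the thread count (the array size of 10000000 is an arbitrary choice):

    for t in 1 2 4 8 16 32; do
        export OMP_NUM_THREADS=$t
        ./sum 10000000
    done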
Automatic Parallelization: memory- and floating point-intensive programs

Inspect the program dadd.c. This program is similar to dcopy.c except that it adds a multiple of one array to a second array, storing the result in a third. Repeat the dcopy.c exercise for this program. How do the speedups compare now?

OpenMP: parallel loops and reductions

While automatic parallelization works well for many simple programs, there are situations where the programmer needs to specify the parallelism more explicitly. One of the simplest paradigms, with an established user community, is OpenMP. OpenMP uses directives: annotations to a normal C or Fortran program that instruct the compiler how to parallelize the code. This enables a program to be parallelized incrementally, which is often a great advantage. Often (but not always!) these directives affect only the speed of the computation, not its result.

Copy dsum.c (a double precision version of sum.c) into a new file dsum_omp.c (cp dsum.c dsum_omp.c). Just above the second for loop (the sum loop), add the OpenMP directive:
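The directive is a parallel for with a sum reduction:

    #pragma omp parallel for reduction(+:sum)

To compile it as an OpenMP program with Sun Studio cc, the -xopenmp flag is needed (a plausible invocation):

    cc -fast -xopenmp -o dsum_omp dsum_omp.c

Run it as before, setting OMP_NUM_THREADS to choose the number of threads.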
OpenMP: the issue of atomicity

Copy dsum_omp.c into a new file dsum_omp_atomic.c. In dsum_omp_atomic.c, remove the reduction(+:sum) part of the directive. Compile and re-run with 1, 2, 4 and 8 threads.
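That is, the directive becomes simply:

    #pragma omp parallel for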
    export OMP_NUM_THREADS=1; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=2; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=4; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=8; ./dsum_omp_atomic 1000000

You should find that with more than one thread the total is generally wrong, and may vary from run to run: the unprotected concurrent updates to sum form a data race. We will now look at an alternative way in OpenMP of correcting this problem. We can protect the update of the variable sum by adding the line:
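The line is OpenMP's atomic directive, placed immediately above the statement that updates sum (a sketch, assuming the update is written sum += a[i]):

    #pragma omp atomic
    sum += a[i];    /* the update is now performed atomically */

Note that this makes every individual update atomic, serializing the threads at this point, so it is generally much slower than the reduction clause.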
Concluding Remarks

In this session, we have looked at relatively simple techniques to harness the power of multicore computing. In doing so, we have also encountered some non-trivial concepts and seen some pitfalls related to parallel programming. As a review, consider the following questions:
The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in, for example, Java. The programming is more complex, too complex to cover in a one-hour session, but the issues of data hazards, speedups, and shared and private data apply equally.
Last modified: 31/08/2011, 16:07