T2 Multicore Computer Workshop
16 Oct 2008
Welcome to the Workshop!
This workshop will provide an introduction to multicore programming on the UltraSPARC T2 using OpenMP and automatic parallelization techniques. Familiarity with the C programming language and the Unix environment is assumed. We have a bit less than one hour, so we cannot go into great depth, but I hope you will get a feel for the performance of a multicore computer and for the more straightforward programming paradigms available for it.
In this workshop, you will encounter issues, some quite deep, relating to multicore programming and performance. These notes will ask you questions about them as they come up. It's good to think about them but, as time is limited, quickly check your understanding with the demonstrator and then move on.
Logging in and Setting Up
As the Labs are on a different subnet to the T2, you will need to log in using a guest account, and then ssh to your own account on the T2 (mavericks) from there. You will all be sharing the same guest account, a269557. Once in the lab, log in with the username a269557 and the password supplied by the demonstrator. Once your desktop comes up, open a Konsole application (use the top-left button on the 2 by 4 grid of shortcut icons, on the lower-left of your screen).
Your demonstrator will give you your account name and password for mavericks. Type the command:
- ssh -Y <your account name>@mavericks
- xterm &
- Text editors available on mavericks include xemacs, emacs, vi and vim.
- cp /data0/home/workshop/* .   (copies the workshop files into your home directory)
Automatic Parallelization: integer arithmetic-intensive programs
The program sum.c initializes an array of integers (its size determined by its command line argument) and computes its sum. It uses 64-bit integers (long long int) in order to avoid overflow. The time taken to do this is also measured. Open it with a text editor (e.g. using the command xemacs sum.c &) and inspect the code. Compile it using the command:-
cc -fast -xautopar -xloopinfo -o sum sum.c
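For reference, the heart of sum.c plausibly looks something like the sketch below (this is an assumption: the real workshop file also contains the timing code and its details may differ). The point to notice is that every iteration of the second loop updates the same variable sum.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long n, i;
    long long int *a, sum = 0;

    n = atol(argv[1]);                        /* array size from the command line */
    a = malloc(n * sizeof(long long int));

    for (i = 0; i < n; i++)                   /* initialization loop */
        a[i] = i;

    for (i = 0; i < n; i++)                   /* summation loop: a reduction on sum */
        sum += a[i];

    printf("sum = %lld\n", sum);              /* prints the total */
    free(a);
    return 0;
}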
You will see from the -xloopinfo output that it did not parallelize the sum loop. This is because it treats sum as an ordinary variable, and every iteration updates it, creating a dependence between iterations. Now try compiling it again, instructing it to use special techniques (reductions) to parallelize such loops, then run it, first with the default number of threads and then explicitly with two:
- cc -fast -xautopar -xloopinfo -xreduction -o sum sum.c
- ./sum 1000000
- setenv OMP_NUM_THREADS 2; ./sum 1000000
Repeat the above for 4 threads.
Finally, comment out the line that prints the total in sum.c and re-compile. Why has the compiler decided the loops are not `profitable'? Hints: try running the program; what do you observe? Or try re-compiling without -fast. This illustrates a potential pitfall when you are using an aggressively optimizing compiler. Restore the print statement and re-compile before continuing.
Automatic Parallelization: memory-intensive programs
Inspect the program dcopy.c. This program is similar to sum.c except that it copies the array into a second array (which is now double precision floating point) instead of summing it. Compile it with the command:-
cc -fast -xautopar -xloopinfo -o dcopy dcopy.c
- setenv OMP_NUM_THREADS 1; ./dcopy 1000000
Repeat for 2, 4, 8, 16 and 32 threads, noting how the time scales.
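For reference, the copy loop in dcopy.c plausibly looks something like the sketch below (again an assumption; the real file also contains timing code). There is no reduction here, so -xautopar parallelizes the loop directly, but each iteration does little more than a load and a store, so beyond a certain number of threads the speed is limited by memory bandwidth rather than by the number of CPUs.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long n, i;
    long long int *a;
    double *b;

    n = atol(argv[1]);
    a = malloc(n * sizeof(long long int));
    b = malloc(n * sizeof(double));           /* destination: double precision floating point */

    for (i = 0; i < n; i++)                   /* initialization loop */
        a[i] = i;

    for (i = 0; i < n; i++)                   /* copy loop: memory-intensive */
        b[i] = (double) a[i];

    printf("b[n-1] = %g\n", b[n - 1]);        /* use the result so the loop is not optimized away */
    free(a);
    free(b);
    return 0;
}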
Note that while the machine mavericks nominally has 64 CPUs, 32 have been reserved for the exported virtual T2 (wallaman) which the undergraduate students are using for their concurrency course. Hence only 32 are currently available to mavericks users (execute the command /usr/sbin/psrinfo to verify this). Furthermore, only 4 of those 32 are real CPUs (cores); using a technique called hardware threading, 8 sets of registers share each real CPU (and its integer, floating point and load/store execution units). Hardware threading is thus extremely cheap; the question is, how close does it come to emulating the expensive alternative (32 real CPUs)?
Run the sum program for up to 32 threads. How does its scalability compare to dcopy?
OpenMP: parallel loops and reductions
While automatic parallelization works well for many simple programs, there are situations where the programmer will need to specify the parallelism more explicitly. One of the simplest paradigms, with an established user community, is OpenMP. OpenMP uses directives, annotations to a normal C or Fortran program, which instruct the compiler how to parallelize the code. This enables a program to be parallelized incrementally, which is often a great advantage. Often (but not always!), these directives only affect the speed of the computation, and not its result. Copy sum.c into a new file sum_omp.c (cp sum.c sum_omp.c). Just above the second for loop, add the OpenMP directive below (a sketch of the resulting file appears after the commands):
- #pragma omp parallel for reduction(+:sum)
- cc -fast -xopenmp -o sum_omp sum_omp.c
- setenv OMP_NUM_THREADS 1; ./sum_omp 1000000
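Your modified sum_omp.c should now look roughly like the sketch below (assuming the shape of sum.c sketched earlier). The reduction(+:sum) clause gives each thread its own private copy of sum and adds the copies together when the loop finishes.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long n, i;
    long long int *a, sum = 0;

    n = atol(argv[1]);
    a = malloc(n * sizeof(long long int));

    for (i = 0; i < n; i++)
        a[i] = i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)                   /* iterations are shared among the threads */
        sum += a[i];

    printf("sum = %lld\n", sum);
    free(a);
    return 0;
}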
OpenMP: the issue of atomicity
Copy sum_omp.c into a new file sum_omp_atomic.c. In sum_omp_atomic.c, remove the reduction(+:sum) part of the directive. Compile and re-run with 1, 2 and 4 threads:
- cc -fast -xopenmp -o sum_omp_atomic sum_omp_atomic.c
- setenv OMP_NUM_THREADS 1; ./sum_omp_atomic 1000000
- setenv OMP_NUM_THREADS 2; ./sum_omp_atomic 1000000
With more than one thread you will almost certainly see incorrect (and varying) results: the threads race to update the shared variable sum, and updates get lost. We will now look at an alternative way in OpenMP of correcting this problem. We can protect the update of the variable sum by adding the line:
- #pragma omp atomic
Insert it just above the statement inside the loop that updates sum (the sketch after the commands below shows the placement), then re-compile and re-run with 1, 2 and 4 threads; the results should be correct again, but note the cost in speed. To see how the compiler implements the atomic update, you can also compile to assembly code and inspect the generated sum_omp_atomic.s:
- cc -fast -xopenmp -S sum_omp_atomic.c
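The modified summation loop in sum_omp_atomic.c should then look roughly like this excerpt (from the same hypothetical program sketched above). Each update of sum is now indivisible, so the answer is correct again, but the threads must take turns for every single addition, so expect this version to be much slower than the reduction.

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        sum += a[i];                          /* the atomic directive protects this update */
    }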
OpenMP: how reductions are implemented
We will now look at how reductions are implemented in OpenMP. Not only is this important in itself, the exercise will also uncover more features of OpenMP and more issues in parallel programming. The file sum_omp_psum.c is set up to implement reductions using a technique called partial sums. Instead of the single thread of execution of a normal program, when an OpenMP program executes, $OMP_NUM_THREADS threads get created. These are then activated whenever a parallel loop is executed. In this case, each thread is given a segment of the array to sum. Then, in a non-parallel loop, these partial sums are added together to get the total.
The program uses an array psums to do this. Two issues arise: how does the program determine the size of the array, and how do the threads index it? The former can be done by a call to the OpenMP intrinsic function omp_get_max_threads(). The latter can be done by calling the intrinsic omp_get_thread_num(), which returns a unique id for each thread. However, this can only be done in a (parallel) part of the program when all the threads are active!
This brings us to the concept of parallel regions. So far, a parallel region has been a simple loop, but we want each thread to get its thread id outside the loop. In C programs, a region can be defined in a code block ({ ... } ). You will see such a code block around the call omp_get_thread_num() and the subsequent for loop. Just above this block, insert the directive:
- #pragma omp parallel private(thr_id)
The private(thr_id) clause gives each thread its own copy of the variable thr_id, so the threads do not overwrite each other's ids.
So far so good, but we have not actually instructed the compiler to parallelize the loop! To do so, insert the directive:
- #pragma omp for
Now compile and run the program; a sketch of the completed file follows the commands below.
- cc -fast -xopenmp -o sum_omp_psum sum_omp_psum.c
- setenv OMP_NUM_THREADS 1; ./sum_omp_psum 1000000
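Putting it all together, the completed sum_omp_psum.c should look roughly like the sketch below (the real workshop file will differ in details such as the timing code). Each thread accumulates into its own element of psums inside the parallel region; the short serial loop at the end combines the partial sums.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    long n, i;
    int t, thr_id, nthreads;
    long long int *a, *psums, sum = 0;

    n = atol(argv[1]);
    a = malloc(n * sizeof(long long int));

    nthreads = omp_get_max_threads();         /* upper bound on the number of threads */
    psums = calloc(nthreads, sizeof(long long int));

    for (i = 0; i < n; i++)
        a[i] = i;

    #pragma omp parallel private(thr_id)
    {                                         /* parallel region: all threads are active here */
        thr_id = omp_get_thread_num();        /* unique id, 0 .. nthreads-1 */
        #pragma omp for
        for (i = 0; i < n; i++)               /* iterations divided among the threads */
            psums[thr_id] += a[i];
    }

    for (t = 0; t < nthreads; t++)            /* serial loop: add the partial sums together */
        sum += psums[t];

    printf("sum = %lld\n", sum);
    free(a);
    free(psums);
    return 0;
}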
A final note: on SMP systems (with CPUs on separate chips and cache coherency hardware between them), performance will be severely degraded unless we pad out the psums array so that only one element is used per cache line (typically 8 words). The phenomenon is called false cache line sharing. However, as the T2 is a multicore processor with all its CPUs on a single chip, this makes little difference here.
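One common way of doing the padding, should you ever need it on such a system, is sketched below (this scheme is not part of the workshop files): psums becomes a two-dimensional array with one 8-word row per thread, and each thread uses only the first element of its own row, so no two threads ever write to the same cache line. Only the lines shown change relative to the sketch above.

#define PAD 8                                 /* 8 long long ints = 64 bytes, typically one cache line */

    long long int (*psums)[PAD];              /* one padded row per thread */
    psums = calloc(nthreads, sizeof(long long int [PAD]));

    #pragma omp parallel private(thr_id)
    {
        thr_id = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < n; i++)
            psums[thr_id][0] += a[i];         /* each thread touches only its own cache line */
    }

    for (t = 0; t < nthreads; t++)
        sum += psums[t][0];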
Concluding Remarks
In this workshop, we have looked at relatively simple techniques for harnessing the power of multicore computing. In doing so, we have also encountered some non-trivial concepts and seen some pitfalls of parallel programming. The examples have been oriented towards parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in, for example, Java. The programming is more complex, too complex to cover in a one-hour workshop, but the issues of data hazards, speedups, and shared and private data apply equally.
