The Australian National University
ANU College of Engineering and Computer Science
School of Computer Science

COMP8320 Laboratory 01 - week 1, 2011

Introduction to the T2 Multicore Processor


This session will provide an introduction to multicore programming on the UltraSPARC T2 using OpenMP and automatic parallelization techniques. The primary objective is to give you a feeling for the performance of a multicore computer and for the more straightforward programming paradigms available for it.

In this session, you will encounter issues, some quite deep, relating to multicore computer programming and performance. These notes will ask you questions on these as they come up. It's good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

You will first need to customize your command line environment for this course. To do this, simply add to your ~/.bashrc file the line:
    source /dept/dcs/comp8320/login/Bashrc
Make sure the line is properly terminated (press the `Enter' key at the end of the line -- otherwise it won't work!). To ensure that this file always gets sourced when you log in, add to your ~/.profile file the line:
    source ~/.bashrc

Warning! The T2's /dept/dcs is a `copy' of /dept/dcs on the student system. It may be out of sync! Copy files from /dept/dcs/comp8320 on the student system.

Copy the files for the session into your home account area:

    cp -r /dept/dcs/comp8320/public/lab01/ .
To access the T2, simply type the command:
    ssh -Y wallaman
and cd to your lab01 sub-directory. The following editors are available on wallaman:
    xemacs, vi
Note that wallaman imports your home directory, so you can edit files on the CSIT labs workstations.

Automatic Parallelization: integer arithmetic-intensive programs

The program sum.c initializes an array of integers, whose size is given by its command-line argument, and computes their sum. It uses 64-bit integers (long long int) in order to avoid overflow, and it measures the time taken to do this. Open it with a text editor (e.g. using the command emacs sum.c &) and inspect the code. Compile it using the command:
    cc -fast -xautopar -xloopinfo -o sum sum.c
This compiles the program under full optimization (-fast) and attempts to parallelize any loops (-xautopar) provided that doing so is both safe and worthwhile. It also reports on the parallelization of each loop (-xloopinfo).
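
For reference, here is a minimal sketch of the loop structure sum.c presumably contains (a sketch only; the actual file may differ in details, and the timing code is omitted):

    /* Sketch only -- names and details are assumptions, not the actual sum.c. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        long n;
        long long int *array;
        long long int sum = 0;
        long i;

        if (argc < 2) return 1;             /* expects the array size as an argument */
        n = atol(argv[1]);
        array = malloc(n * sizeof(long long int));

        for (i = 0; i < n; i++)             /* initialization loop */
            array[i] = i;

        for (i = 0; i < n; i++)             /* summation loop: every iteration updates sum */
            sum += array[i];

        printf("sum = %lld\n", sum);        /* the real program also reports the time taken */
        free(array);
        return 0;
    }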

You will see that it did not parallelize the sum loop. This is because it treats sum as an ordinary variable that every iteration updates, which would be unsafe to do in parallel. Now try compiling it again, instructing the compiler to use a special technique (a reduction: each thread accumulates a private partial sum, and the partial sums are combined at the end of the loop) to parallelize such loops:

    cc -fast -xautopar -xloopinfo -xreduction -o sum sum.c
Run the program:
    ./sum 1000000
and note the time taken. This will be the serial execution time. To run this program using two threads, execute the command:
    export OMP_NUM_THREADS=2; ./sum 1000000
and note the time. The Solaris operating system will try to schedule the two threads on different CPUs (assuming they are available), and if so we now have a parallel program running! How close did this come to the ideal reduction in time (being halved)?

Repeat the above for 4 threads. Note that you can use the arrow keys to edit and re-execute previously typed commands.

Finally, comment out the line to print out the total in sum.c, and re-compile. Why has the compiler decided the loops are not `profitable'? Hints: try running the program; what do you observe? Or try re-compiling without -fast. This illustrates a potential pitfall when you are using an aggressively optimizing compiler. Restore the print statement and re-compile before continuing.
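
One plausible explanation, sketched below: with the print removed, the value of sum is never used, so an aggressive optimizer is free to discard the loops entirely, leaving nothing worth parallelizing. (This is a general property of optimizing compilers, not a claim about this specific compiler version.)

    /* Sketch: what the optimizer may see once the print is commented out. */
    for (i = 0; i < n; i++)            /* array feeds only the loop below ...      */
        array[i] = i;
    for (i = 0; i < n; i++)
        sum += array[i];               /* ... and sum is never read afterwards,    */
    /* printf("sum = %lld\n", sum); */ /* so both loops may be eliminated as dead  */
                                       /* code before parallelization is attempted */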

Automatic Parallelization: memory-intensive programs

Inspect the program dcopy.c. This program is similar to sum.c except that it copies the array into a second array (now double precision floating point) instead of summing it. Compile it with the command:
    cc -fast -xautopar -xloopinfo -o dcopy dcopy.c
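
The core of dcopy.c is presumably a copy loop of roughly the following form (a sketch; the variable names are assumptions and the timing code is omitted):

    /* Sketch only: copy the source array into a double precision array. */
    long i;
    for (i = 0; i < n; i++)
        b[i] = a[i];    /* b is double precision; a is the array being copied */
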
Run the program with a single thread (serially):
    export OMP_NUM_THREADS=1; ./dcopy 1000000
Repeat the above for 2, 4, 8, 16 and 32 threads and observe the decrease in time (after 4 threads, run the program several times and look for the best time). Note that you can use the command:
    ./threadrun 32 ./dcopy 1000000
to do this for you. How close were these to the ideal speedup (decrease in time is proportional to the number of threads)?

Note that while the T2 nominally has 64 CPUs, only 56 have been reserved for the virtualized T2 (wallaman) exported to the student system (execute the command /usr/sbin/psrinfo to verify this). In fact only 7 of the 56 are real CPUs: using a technique called hardware threading, 8 sets of registers share each real CPU (and its integer, floating point and load/store execution units). Hardware threading is thus extremely cheap to provide; the question is how close it comes to emulating the expensive alternative (56 real CPUs).

Run the sum program for up to 32 threads; you can use:

    ./threadrun 32 ./sum 1000000
How does its scalability compare to that of dcopy? Now compile the double precision version of the sum program, dsum.c:
    cc -fast -xautopar -xloopinfo -xreduction -o dsum dsum.c
Run the dsum program for up to 32 threads. How does its scalability compare to that of sum?

Automatic Parallelization: memory- and floating point intensive program

Inspect the program dadd.c. This program is similar to dcopy.c except that it adds a multiple of one array to a second array, storing the result into a third. Repeat the dcopy exercise for this program. How do the speedups compare now?
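
The loop in dadd.c is presumably of the following daxpy-like form (a sketch; the names alpha, a, b and c are assumptions):

    /* Sketch only: add a multiple of array a to array b, storing into c. */
    long i;
    for (i = 0; i < n; i++)
        c[i] = alpha * a[i] + b[i];    /* one multiply and one add per element */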

OpenMP: parallel loops and reductions

While automatic parallelization works well for many simple programs, there are situations where the programmer needs to specify the parallelism more explicitly. One of the simplest paradigms, with an established user community, is OpenMP. OpenMP uses directives, annotations to a normal C or Fortran program that instruct the compiler how to parallelize the code. This enables a program to be parallelized incrementally, which is often a great advantage. Often (but not always!) these directives affect only the speed of the computation, not its result.

Copy dsum.c (the double precision version of sum.c) into a new file dsum_omp.c (cp dsum.c dsum_omp.c). Just above the second for loop, add the OpenMP directive:

    #pragma omp parallel for reduction(+:sum)
which instructs the compiler to parallelize the loop, applying the reduction technique on the variable sum. Also, just above the first loop, add the line:
    #pragma omp parallel for
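
After these two edits, the loops in dsum_omp.c should look roughly as follows (a sketch; the variable names follow the earlier sum.c sketch):

    #pragma omp parallel for
    for (i = 0; i < n; i++)                        /* initialization loop */
        array[i] = i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)                        /* summation loop      */
        sum += array[i];

The reduction(+:sum) clause is the explicit, programmer-specified counterpart of what the -xreduction flag allowed the compiler to do automatically.
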
Now compile the program with OpenMP enabled:
    cc -fast -xopenmp -o dsum_omp dsum_omp.c
Run the program in single threaded mode:
    export OMP_NUM_THREADS=1; ./dsum_omp 1000000
and repeat for 2, 4 and 8 threads. Compare the performance with the auto-parallelized programs. Which is better? Is this surprising?

OpenMP: the issue of atomicity

Copy dsum_omp.c into a new file dsum_omp_atomic.c. In dsum_omp_atomic.c, remove the reduction(+:sum) part of the directive. Compile and re-run with 1, 2, 4 and 8 threads:
    cc -fast -xopenmp -o dsum_omp_atomic dsum_omp_atomic.c
    export OMP_NUM_THREADS=1; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=2; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=4; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=8; ./dsum_omp_atomic 1000000
In particular, run with 8 threads several times and observe the variation in the output. How does the reported sum compare with the correct value (given by the single-threaded execution)? This phenomenon is called a data hazard or race hazard, and is a common pitfall in parallel programming.
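
To see why the result can be wrong, note that sum += array[i] is not a single indivisible operation; each thread effectively executes a read-modify-write sequence like the following (a sketch, using the assumed names from earlier):

    double tmp;
    tmp = sum;            /* 1. read the shared variable              */
    tmp += array[i];      /* 2. add this thread's element             */
    sum = tmp;            /* 3. write back -- if another thread wrote */
                          /*    to sum between steps 1 and 3, that    */
                          /*    thread's update is lost               */

With many threads, such lost updates accumulate and the reported sum differs from the correct value.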

We will now look at an alternative way in OpenMP of correcting this problem. We can protect the update of the variable sum by adding the line:

    #pragma omp atomic
just above it (inside the loop). This forces the instructions implementing the statement sum += array[i]; to be executed as if they were a single indivisible instruction. Re-compile and re-run the program for up to 8 threads. You will observe that the correct result is reported, but what about the time?
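
In context, the protected loop would look roughly like this (a sketch, with the same assumed names as before):

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        sum += array[i];
    }

Each update is now performed atomically, so no updates are lost, but the updates to sum are serialized, which is likely why the time suffers.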

Concluding Remarks

In this session, we have looked at relatively simple techniques to harness the power of multicore computing. In doing so, we have also encountered some non-trivial concepts and seen some pitfalls related to parallel programming. As a review, consider the following questions:
  • How effectively does the T2 scale using separate cores (scaling from 1 to 4 threads)?
  • How effectively does the T2 scale using hardware threading (scaling from 8 to 32 threads)?
  • Is the above different for integer versus floating point? For memory-intensive versus less memory-intensive programs?
  • How effective is automatic parallelization for simple loops?
  • What causes a race hazard? Within a parallelized loop, is forcing an atomic update likely to be useful?

The examples have been oriented towards parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in, for example, Java. The programming is more complex, too complex to cover in a one-hour session, but the issues of data hazards, speedups, and shared and private data apply equally.

Last modified: 31/08/2011, 16:07
