COMP3320/6464 2012: Laboratory 5
Advanced OpenMP and Streaming SIMD instructions (SSE2)
The aim of this lab is to introduce the concept of OpenMP tasks
and provide a basic introduction to the Streaming SIMD instruction programming technique.
YOU MUST LOGON TO XE.NCI.ORG.AU TO DO THIS LAB
Start by creating a directory on XE and copying all the relevant programs
from lab5.tar.gz into that directory.
Set up your environment by typing:
- module load gcc/4.4.4
- module load papi
Lab Assessment
You are not required to submit labs, but all lab material is
examinable! You are therefore advised to record your answers to each
question.
Content of the lab5.tar.gz archive:
- OpenMP folder
- fibonacci - calculate Fibonacci number
- vector_add - vector addition using tasks
- vector_search - parallel vector search using tasks
- SSE2 folder
- vector_add - SSE implementation of vector addition
- vector_dot - SSE implementation of vector dot product
- vector_saxpy - SSE implementation of the saxpy loop
OpenMP Tasks
The concept of tasks is the biggest addition in OpenMP 3.0.
With this new feature, a wider range of applications can now be parallelised, notably recursive algorithms (linked-list and tree traversals, quicksort, binary search, etc.).
A task has:
- Code to execute (structured block, function, etc)
- A data environment (it owns its data)
- An assigned thread that executes the code and uses the data
A thread that encounters a task construct packages a new instance of the task (code and data) and puts
it into a queue. Tasks are removed from the queue and executed by threads at some later time. Under some conditions a task can migrate between threads (the untied clause) or be executed immediately.
Open this presentation and learn more details about OpenMP tasks. If anything is unclear, ask the tutor.
Anatomy of the OpenMP task construct
#pragma omp task [clause[[,]clause] ...]
structured-block
where
clause can be one of
if (expression) // if the expression is false, the task is executed immediately
untied // the task is not tied to one thread; any thread may execute it
shared (list)
private (list)
firstprivate (list)
default( shared | none )
Take a look at this simple example:
#pragma omp parallel
{
  #pragma omp single private(p)
  {
    p = listhead;
    while (p) {
      #pragma omp task
      process(p);
      p = next(p);
    }
  }
}
All the threads enter the parallel region (you can specify how many threads should
execute this part of the code). Then only a single thread enters the structured block and
reads the head of the linked list. This thread traverses the list and generates a task
for every element of the list. Note that p is firstprivate inside each task by default, so every task works on its own copy of the pointer. Be aware that those tasks are executed in parallel at some point in the future. All tasks are guaranteed to have finished when the threads leave the parallel construct (an implicit barrier).
Fibonacci numbers
The first example of OpenMP tasks calculates Fibonacci numbers. The code is contained in the file
fibonacci.cpp. Compile it
by typing
make fibonacci
The program accepts two parameters: the first denotes the number of threads
and the second the index of the Fibonacci number we want to calculate.
- Run the code by typing fibonacci 1 30. What
value is printed out? Make sure you understand what the code is doing (a sketch of the task-based recursion is shown below).
- Now run the code with different numbers of threads and
compare the execution times. Comment on the observed behaviour.
- Verify that the code is really being executed by multiple threads
by adding an appropriate function call (printf) to the fib function.
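For reference, a minimal sketch of a task-based Fibonacci follows. It is an illustration only: the actual fibonacci.cpp may differ in details (for example, it may add a sequential cut-off for small n or time the computation).

#include <cstdio>
#include <cstdlib>
#include <omp.h>

long fib(int n)
{
    if (n < 2) return n;          // base cases: fib(0) = 0, fib(1) = 1
    long x, y;
    #pragma omp task shared(x)    // compute fib(n-1) in a child task
    x = fib(n - 1);
    #pragma omp task shared(y)    // compute fib(n-2) in a child task
    y = fib(n - 2);
    #pragma omp taskwait          // wait for both child tasks
    return x + y;
}

int main(int argc, char **argv)
{
    omp_set_num_threads(atoi(argv[1]));
    int n = atoi(argv[2]);
    long result;
    #pragma omp parallel          // create the team that executes the tasks
    #pragma omp single            // one thread starts the recursion
    result = fib(n);
    printf("fib(%d) = %ld\n", n, result);
    return 0;
}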
Vector addition using OpenMP tasks
The next example presents vector addition. You can find the source code in the file vector_add.cpp. There are three functions in this file:
VectorAdd - Standard OpenMP implementation of vector add
VectorAddTask - Your implementation using OpenMP tasks
ExecuteVectorAdds - Function that runs both versions and shows the results
You can run this example by typing
vector_add num_of_threads vector_size num_of_repetitions. When you've finished the implementation, run the code with the parameters:
vector_add 1 10 1
or
vector_add 4 10 1
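As a hint, a naive starting point creates one task per element; a minimal sketch follows. The function signature and variable names here are assumptions, and the real code in vector_add.cpp will differ.

// Naive sketch: one tiny task per element (assumed signature and names).
void VectorAddTask(const float *a, const float *b, float *c, int size)
{
    #pragma omp parallel
    #pragma omp single                    // one thread generates the tasks
    for (int i = 0; i < size; i++)
    {
        #pragma omp task firstprivate(i)  // each task adds a single element
        c[i] = a[i] + b[i];
    }                                     // implicit barrier: all tasks done
}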
- Implement vector addition using OpenMP tasks (the sketch above is one possible starting point). Verify that you get consistent results for varying numbers of threads.
- Open the file vector_add.h and uncomment line 11
(#define PERFORMANCE).
- Fill in the following table. You should be able to explain the results.
Execution time of vector add
---------------------+---------------+---------------+
| loop time | task time |
---------------------+---------------+---------------+
vector_add 1 2 2048 | | |
---------------------+---------------+---------------+
vector_add 2 2 2048 | | |
---------------------+---------------+---------------+
vector_add 4 2 2048 | | |
---------------------+---------------+---------------+
vector_add 1 256 256 | | |
---------------------+---------------+---------------+
vector_add 2 256 256 | | |
---------------------+---------------+---------------+
vector_add 4 256 256 | | |
---------------------+---------------+---------------+
vector_add 1 1024 64 | | |
---------------------+---------------+---------------+
vector_add 2 1024 64 | | |
---------------------+---------------+---------------+
vector_add 4 1024 64 | | |
---------------------+---------------+---------------+
- From the table, you can see a slowdown when working with tasks.
The problem is that we generate a huge number of tiny tasks. Open the file
vector_add.h and comment out line 11 (#define PERFORMANCE) again - we want to modify the implementation.
Now rework your code to create a much smaller number of much bigger tasks
(each task adds a sub-vector, not a single element); one possible decomposition is sketched after the table below. Fill in the table again.
Execution time of vector add
---------------------+---------------+---------------+
| loop time | task time |
---------------------+---------------+---------------+
vector_add 1 2 2048 | | |
---------------------+---------------+---------------+
vector_add 2 2 2048 | | |
---------------------+---------------+---------------+
vector_add 4 2 2048 | | |
---------------------+---------------+---------------+
vector_add 1 256 256 | | |
---------------------+---------------+---------------+
vector_add 2 256 256 | | |
---------------------+---------------+---------------+
vector_add 4 256 256 | | |
---------------------+---------------+---------------+
vector_add 1 1024 64 | | |
---------------------+---------------+---------------+
vector_add 2 1024 64 | | |
---------------------+---------------+---------------+
vector_add 4 1024 64 | | |
---------------------+---------------+---------------+
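A sketch of one possible chunked decomposition follows. The chunk size and the names are assumptions; experiment with the granularity yourself.

// Chunked sketch: one task per sub-vector of `chunk` elements.
void VectorAddTask(const float *a, const float *b, float *c, int size)
{
    const int chunk = 1024;               // assumed granularity: tune this
    #pragma omp parallel
    #pragma omp single                    // one thread generates the tasks
    for (int i = 0; i < size; i += chunk)
    {
        #pragma omp task firstprivate(i)  // each task adds one sub-vector
        {
            int end = (i + chunk < size) ? i + chunk : size;
            for (int j = i; j < end; j++)
                c[j] = a[j] + b[j];
        }
    }
}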
COMP6464 Only
The OpenMP task paradigm is typically used to parallelise recursive algorithms. Your task in
this example is to implement a parallel version of a binary search in an unsorted array.
In each step, the algorithm compares the input key value with the key value of the middle
element of the array. If the keys match, then a matching element has been found and its index (position)
is returned. Otherwise, because the array is unsorted, the algorithm recursively looks for the key in both the left and the right halves
of the array.
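The sequential recursion can be sketched as follows (an illustration only; the actual VectorSearch may differ in details such as the not-found return value). In the parallel version, the two recursive calls are the natural candidates for OpenMP tasks.

// Recursive search of an unsorted array: check the middle element,
// then search both halves. Returns the index of the key, or -1.
int VectorSearch(const int *data, int lo, int hi, int key)
{
    if (lo > hi) return -1;              // empty sub-array: not found
    int mid = (lo + hi) / 2;
    if (data[mid] == key) return mid;    // match: return its position
    int left = VectorSearch(data, lo, mid - 1, key);   // left half
    if (left != -1) return left;
    return VectorSearch(data, mid + 1, hi, key);       // right half
}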
The algorithm is implemented in the file vector_search.cpp. There are three functions in
this file:
VectorSearch - Standard sequential recursive implementation of this search algorithm
VectorSearchParallel - Your implementation using OpenMP tasks
ExecuteVectorSearches - Function that runs both versions and shows the results
You can run this example by typing
vector_search num_of_threads vector_size key.
While implementing, run the code with the following parameters to verify it:
vector_search 1 10 9
or
vector_search 4 10 9
- Implement a parallel search algorithm using
OpenMP tasks and verify it on arrays of different sizes.
Streaming SIMD instructions
SSE2 intrinsics
Intel and AMD processors enable the development of optimized applications through extensions to their instruction sets. Many compute-intensive applications can significantly improve performance by using Single Instruction, Multiple Data (SIMD) instructions to process data elements in parallel.
The most direct way to use these instructions is to inline the assembly language instructions into your source code. However, this process can be time-consuming and tedious. In addition, your compiler may not support inline assembly language programming. The GNU and Intel C++ Compiler enable easy implementation of these instructions through the use of API extension sets built into the compiler. These extension sets are referred to as intrinsic functions or intrinsics.
An online SSE Intrinsics manual is here. It is intended for use with Microsoft Visual Studio, but the intrinsic names and formats match those used by the gcc and Intel compilers. An exhaustive Intel manual can be found here.
The SSE2 datatypes
The emmintrin.h header file contains the declarations for the SSE2 intrinsics.
This header file defines all the instructions and data types in SSE2.
SSE2 intrinsics can be used with the following datatypes:
- __m128 - packed FP datatype containing 4 floats
- __m128d - packed FP datatype containing 2 doubles
- __m128i - packed integer datatype containing 4 ints, 8 shorts, or 16 bytes
The naming and usage syntax for SSE2 instructions
Most intrinsic names use the following notational convention:
_mm_<intrin_op>_<suffix>
- <intrin_op> Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.
- <suffix> Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:
- s single-precision floating point
- d double-precision floating point
- i128 signed 128-bit integer
- i64 signed 64-bit integer
- u64 unsigned 64-bit integer
- i32 signed 32-bit integer
- u32 unsigned 32-bit integer
- i16 signed 16-bit integer
- u16 unsigned 16-bit integer
- i8 signed 8-bit integer
- u8 unsigned 8-bit integer
The most frequently used SSE2 intrinsics
- Load operations:
__m128 _mm_load1_ps(float * p) // Load one SP FP value into all four words
__m128 _mm_load_ps (float * p) // Load four SP FP values, the address must be 16-byte-aligned
- Set operations:
__m128 _mm_set1_ps(float w) // Set the four SP FP values to w
__m128 _mm_set_ps(float z, float y, // Set the four SP FP values to the four inputs.
float x, float w )
- Store operations:
void _mm_store1_ps(float * p, __m128 a) // Stores the lower SP FP value across four words.
void _mm_store_ps(float *p, __m128 a) // Stores the four SP FP values. The address must be 16-byte-aligned
- Arithmetic operations:
__m128 _mm_add_ss(__m128 a, __m128 b) // Add the lower single-precision, floating-point (SP FP) values of a and b;
// the upper 3 SP FP values are passed through from a.
__m128 _mm_add_ps(__m128 a, __m128 b) // add two registers element by element
etc.
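As a small illustration of how these intrinsics fit together (this snippet is not part of the lab codes), here is a function that adds four floats at once, assuming 16-byte-aligned pointers:

#include <emmintrin.h>

// c[0..3] = a[0..3] + b[0..3]; a, b and c must be 16-byte-aligned.
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_load_ps(a);      // load four SP FP values from a
    __m128 vb = _mm_load_ps(b);      // load four SP FP values from b
    __m128 vc = _mm_add_ps(va, vb);  // element-wise addition
    _mm_store_ps(c, vc);             // store the four results to c
}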
SSE Vector add example
Our first SSE example will focus on vector addition implemented in SSE.
You can find the implementation of three versions of vector addition in the file
vector_add.cpp. There are four functions in this file:
VectorAdd - Standard implementation of vector add
VectorAddUnrolled - Unrolled version of vector add
VectorAddSSE - SSE implementation of vector add
ExecuteVectorAdds - Function that runs all versions and shows the results
- Make the vector addition project by typing make vector_add.
You should now see a very long report from the compiler. It provides you with details about
the autovectorization applied to the Standard and Unrolled implementations. Of course, the compiler was not
able to vectorize the code already written with SSE intrinsics. Go through the output and write down which loops have
been vectorized.
- You can run the code by typing vector_add vector_size
number_of_repetitions. The number of elements given as a parameter is
multiplied by 1024 in order to work with large arrays, and we execute multiple runs
of the vector add to get a measurable time. Run this program with the following parameters
and fill in the table.
Execution time and MFLOPS of vector add
--------------------+--------+--------------+---------------+
| | exec. time | MFLOPS |
--------------------+--------+--------------+---------------+
vector_add 2 8192 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
vector_add 256 1024 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
vector_add 2048 64 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
- You can see that the performance on small arrays is much higher.
Can you explain why?
- You should also see something suspicious in the FLOPS
reported by PAPI: despite the shorter execution time, a much lower number of FLOPS is reported.
How are FLOPs counted in this case (SSE)?
SSE vector dot product
The next example deals with an SSE implementation of the vector dot product. You can find the implementation of three versions of the vector dot product in the file vector_dot.cpp. There are four functions in this file:
VectorDot - Standard implementation of vector dot product
VectorDotUnrolled - Unrolled version of vector dot product
VectorDotSSE - SSE implementation of vector dot product
ExecuteVectorDots - Function that runs all versions and shows the results
- Make the vector dot product project by typing make vector_dot.
How many loops was the compiler able to vectorize?
- Open the file vector_dot.cpp and take a look at it. When you reach VectorDotSSE you will see that something is missing. To complete the code, you have to insert TWO SSE instructions. Verify the code by typing vector_dot 10 1
- When the code is running correctly, we want to measure its performance. Open the header file vector_dot.h, uncomment line 11 (#define PERFORMANCE), and rebuild the project. Fill in the following table:
Execution time and MFLOPS of vector dot product
--------------------+--------+--------------+---------------+
| | exec. time | MFLOPS |
--------------------+--------+--------------+---------------+
vector_dot 2 8192 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
vector_dot 256 1024 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
vector_dot 2048 64 | Add | | |
| Unroll | | |
| SSE | | |
--------------------+--------+--------------+---------------+
The performance should be much higher here. Explain why.
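For reference, one common way to accumulate a dot product with SSE is sketched below. This is an illustration under assumed names and alignment, not the lab's code, but the missing instructions in VectorDotSSE are likely of this kind.

#include <emmintrin.h>

// dot = sum of a[i]*b[i]; size is assumed to be a multiple of 4 and
// a, b to be 16-byte-aligned.
float DotSSE(const float *a, const float *b, int size)
{
    __m128 sum = _mm_set1_ps(0.0f);                 // four partial sums
    for (int i = 0; i < size; i += 4)
    {
        __m128 va = _mm_load_ps(&a[i]);             // four elements of a
        __m128 vb = _mm_load_ps(&b[i]);             // four elements of b
        sum = _mm_add_ps(sum, _mm_mul_ps(va, vb));  // multiply-accumulate
    }
    float partial[4] __attribute__((aligned(16)));  // gcc alignment attribute
    _mm_store_ps(partial, sum);                     // write out the partials
    return partial[0] + partial[1] + partial[2] + partial[3];
}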
COMP6464 Only
SSE implementation of SAXPY loop
The last example focuses on a SAXPY loop implementation (the same as DAXPY but with single
precision floating-point numbers). You can find the implementation of three versions of
SAXPY in the file saxpy.cpp. There are four functions in this file:
SAXPY - Standard implementation of the SAXPY loop
SAXPYUnrolled - Unrolled version of the SAXPY loop
SAXPYSSE - Your SSE implementation of the SAXPY loop
ExecuteSAXPYs - Function that runs all versions and shows the results
- Based on the two versions provided in the file saxpy.cpp, create an SSE implementation. You can make the SAXPY project by typing make vector_saxpy
- When you are sure the code produces the right answer, measure the
performance of the code. Do not forget to uncomment line 11 (#define PERFORMANCE)
in saxpy.h. Use the same parameters as above.
- The performance can be further improved by unrolling the SSE loop!
Unroll the innermost loop in SAXPYSSE. Can you see an improvement, and for which array sizes?
- Finally, to improve pipelining, you can reorder some instructions,
grouping the loads, the multiplies, and the adds together to overlap the data dependencies (a sketch of the basic loop you would transform follows below). Do so and measure the performance again.
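For reference, the core of an SSE SAXPY (y = a*x + y) might look like the sketch below; the names and the alignment assumption are ours, and saxpy.cpp may differ. This is the loop you would then unroll and reorder.

#include <emmintrin.h>

// y[i] = alpha * x[i] + y[i]; size is assumed to be a multiple of 4 and
// x, y to be 16-byte-aligned.
void SaxpySSE(float alpha, const float *x, float *y, int size)
{
    __m128 va = _mm_set1_ps(alpha);               // broadcast the scalar
    for (int i = 0; i < size; i += 4)
    {
        __m128 vx = _mm_load_ps(&x[i]);           // four elements of x
        __m128 vy = _mm_load_ps(&y[i]);           // four elements of y
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  // alpha*x + y
        _mm_store_ps(&y[i], vy);                  // store back to y
    }
}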