This lab uses the NCI Vayu system. You should have an account on this system by now; if not, email comp4300@cs.anu.edu.au or, if you are in the lab, talk to your tutor.
A reminder that comprehensive documentation for the NCI Vayu system is available HERE.
From the student systems, you can mount your home directory on Vayu using the following menu option:
Places -> Connect to Server...
Server = vayu.nci.org.au
Type = SSH
Folder = /home/444/<username>
This will allow you to use the text editor of your choice (and other tools installed on the student lab machines).
A tarball containing all the programs for this lab is available HERE. Save this tarball on your local desktop and then transfer it to the Vayu system.
For this lab, we will use the Intel C++ compiler icpc version 13.2.146 and Intel Threading Building Blocks (TBB). A module file has been prepared for this lab that loads the required software. On Vayu, load the module as follows:
```
module use /short/c07/modules
module load tbb
```
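The programs can then be compiled against TBB. A possible compile line is sketched below; the exact flags are an assumption (the tarball may already provide a Makefile, and the C++ standard switch needed for lambda support can vary with compiler version):

```
icpc -std=c++0x -O2 -o matmul matmul.cc -ltbb
```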
## parallel_for

The file matmul.cc contains a naïve matrix multiplication using a triply-nested loop as follows:
```cpp
for (size_t i = 0; i < M; i++) {
    for (size_t j = 0; j < N; j++) {
        double sum = 0.0;
        for (size_t l = 0; l < K; l++) {
            sum += A(i,l) * B(l,j);
        }
        C(i,j) = sum;
    }
}
```
Using tbb::parallel_for, parallelise the outer loop with a one-dimensional decomposition using tbb::blocked_range.
To use the TBB classes and functions, you first need to include the TBB header file, just as you would include a header in C. C++ adds the complication of namespaces, which group names into hierarchical scopes. To use a class or function from the TBB namespace, you need to introduce it into a scope in your code, for example:
```cpp
#include <tbb/tbb.h>

using tbb::parallel_for;
```
If a name is not introduced in this way, it must be fully qualified e.g. tbb::parallel_for.
Express the loop body using a lambda expression, for example:
```cpp
[&](const blocked_range<size_t>& r) {
    for (size_t i = r.begin(); i < r.end(); ++i) {
        // loop body
    }
}
```
The square brackets are the variable capture clause: in this case, [&] denotes default capture by reference of any variables from the enclosing scope. The parentheses hold the lambda's parameters; in this case, it accepts a single parameter, a reference to a blocked_range<size_t>. The curly braces enclose the lambda body, i.e. the code to run.
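Putting these pieces together, the parallelised outer loop might look like the following sketch; it assumes the A, B, C accessors and the dimensions M, N, K are as used in matmul.cc:

```cpp
#include <tbb/tbb.h>

using tbb::parallel_for;
using tbb::blocked_range;

// Replace the triply-nested loop in matmul.cc with a parallel outer loop:
// each task computes a contiguous block of rows of C.
parallel_for(blocked_range<size_t>(0, M),
    [&](const blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i < r.end(); ++i) {
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t l = 0; l < K; l++) {
                    sum += A(i,l) * B(l,j);
                }
                C(i,j) = sum;
            }
        }
    });
```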
By default, TBB will create a number of worker threads based on the available hardware parallelism on the system. For example, on the 8-core nodes of Vayu, TBB will typically use 8 worker threads.
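The thread count can be overridden by constructing a tbb::task_scheduler_init object before any parallel work starts. A minimal sketch, where nthreads is a hypothetical variable (e.g. read from the command line):

```cpp
#include <tbb/task_scheduler_init.h>

// Limit TBB to nthreads worker threads for the lifetime of this object;
// the default constructor lets TBB choose a thread count automatically.
tbb::task_scheduler_init init(nthreads);
```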
Use the task_scheduler_init class to specify the number of worker threads to use. What speedup do you observe running on multiple cores on a single node of Vayu? Complete the following table:
| matmul   | 1 core | 2 cores | 4 cores | 8 cores |
|----------|--------|---------|---------|---------|
| Time (s) |        |         |         |         |
| Speedup  | 1.0    |         |         |         |
Can you explain these results?
Now decompose the two outer loops using blocked_range2d.
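A possible shape for the 2D decomposition, again assuming the names from matmul.cc; blocked_range2d splits the iteration space over both the row (i) and column (j) dimensions:

```cpp
using tbb::parallel_for;
using tbb::blocked_range2d;

// 2D decomposition: TBB recursively splits the (i, j) iteration space
// into rectangular blocks of rows and columns.
parallel_for(blocked_range2d<size_t>(0, M, 0, N),
    [&](const blocked_range2d<size_t>& r) {
        for (size_t i = r.rows().begin(); i < r.rows().end(); ++i) {
            for (size_t j = r.cols().begin(); j < r.cols().end(); ++j) {
                double sum = 0.0;
                for (size_t l = 0; l < K; l++) {
                    sum += A(i,l) * B(l,j);
                }
                C(i,j) = sum;
            }
        }
    });
```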
Is the speedup different when using 2D decomposition, and if so, why?
## parallel_reduce

The Euclidean distance between two vectors $a$ and $b$ is defined as $d(a, b) = \sqrt{\sum_{i} (a_i - b_i)^2}$. In other words, it is the square root of the sum of element-wise squared differences.
This is a simple example of the map-reduce pattern, in that a map (also called an elemental function) is applied to pairs of input elements, and the results of the map are combined through a reduction operation.
The file distance.cc calculates the Euclidean distance between two vectors.
```cpp
double dist_squared = 0.0;
for (size_t i = 0; i < kSize; i++) {
    dist_squared += (b[i] - a[i]) * (b[i] - a[i]);
}
distance = sqrt(dist_squared);
```
Parallelise the code using tbb::parallel_reduce.
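A minimal sketch using the functional (lambda) form of tbb::parallel_reduce, with the same a, b, kSize and distance names as in the snippet above:

```cpp
#include <cmath>
#include <tbb/tbb.h>

using tbb::parallel_reduce;
using tbb::blocked_range;

// Each task accumulates a partial sum of squared differences over its
// sub-range; partial sums are then combined pairwise by the final lambda.
double dist_squared = parallel_reduce(
    blocked_range<size_t>(0, kSize),
    0.0,                                       // identity value for the sum
    [&](const blocked_range<size_t>& r, double local) -> double {
        for (size_t i = r.begin(); i < r.end(); ++i) {
            local += (b[i] - a[i]) * (b[i] - a[i]);
        }
        return local;
    },
    [](double x, double y) { return x + y; }); // combine partial sums
distance = sqrt(dist_squared);
```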
What parallel speedup do you observe?
Complete the following table as for the previous example:
| distance | 1 core | 2 cores | 4 cores | 8 cores |
|----------|--------|---------|---------|---------|
| Time (s) |        |         |         |         |
| Speedup  | 1.0    |         |         |         |
## Bucket sort

The file bucket_sort.cc implements a simple sequential bucket sort of a vector of doubles. The basic steps of the bucket sort algorithm are as follows:

1. Scatter the elements into buckets according to their value range.
2. Sort each bucket individually (e.g. using std::sort).
3. Gather the sorted buckets into their correct positions in the output vector.

Use tbb::parallel_for to perform the sorting of each bucket in parallel (see the sketch after the table below). What parallel speedup and efficiency do you observe? Complete the following table.
| bucket_sort | 1 core | 2 cores | 4 cores | 8 cores |
|-------------|--------|---------|---------|---------|
| Time (s)    |        |         |         |         |
| Speedup     | 1.0    |         |         |         |
| Efficiency  | 1.0    |         |         |         |
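As referenced above, the per-bucket sorting step might be parallelised along the following lines. This is only a sketch: the container named buckets is an assumed representation (a std::vector<std::vector<double> >) and may differ from what bucket_sort.cc actually uses.

```cpp
#include <algorithm>
#include <vector>
#include <tbb/tbb.h>

using tbb::parallel_for;
using tbb::blocked_range;

// Buckets are disjoint, so each one can be sorted independently
// with no synchronisation between tasks.
parallel_for(blocked_range<size_t>(0, buckets.size()),
    [&](const blocked_range<size_t>& r) {
        for (size_t b = r.begin(); b < r.end(); ++b) {
            std::sort(buckets[b].begin(), buckets[b].end());
        }
    });
```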
Now also parallelise the gathering of elements to their correct positions in the sorted vector.
Hint: tbb::parallel_scan can be used to calculate a prefix sum over bucket start positions.
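As a sketch of how the hint might be applied, the classic Body form of tbb::parallel_scan can compute an exclusive prefix sum of per-bucket element counts, giving the start position of each bucket in the output vector. The names counts and starts below are hypothetical; adapt them to the data structures used in bucket_sort.cc.

```cpp
#include <vector>
#include <tbb/tbb.h>

// Body for tbb::parallel_scan: exclusive prefix sum of bucket sizes.
// After the scan, starts[b] holds the index in the sorted vector at
// which bucket b begins.
class BucketStarts {
    const std::vector<size_t>& counts;  // number of elements in each bucket
    std::vector<size_t>& starts;        // output: start index of each bucket
    size_t sum;                         // running sum carried between sub-ranges
public:
    BucketStarts(const std::vector<size_t>& c, std::vector<size_t>& s)
        : counts(c), starts(s), sum(0) {}
    BucketStarts(BucketStarts& other, tbb::split)
        : counts(other.counts), starts(other.starts), sum(0) {}

    template <typename Tag>
    void operator()(const tbb::blocked_range<size_t>& r, Tag) {
        size_t tmp = sum;
        for (size_t b = r.begin(); b < r.end(); ++b) {
            if (Tag::is_final_scan()) starts[b] = tmp;  // exclusive prefix sum
            tmp += counts[b];
        }
        sum = tmp;
    }
    void reverse_join(BucketStarts& left) { sum = left.sum + sum; }
    void assign(BucketStarts& other) { sum = other.sum; }
};

// Usage:
//   std::vector<size_t> starts(counts.size());
//   BucketStarts body(counts, starts);
//   tbb::parallel_scan(tbb::blocked_range<size_t>(0, counts.size()), body);
```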
Does this improve speedup, and why?
Can the scattering of elements to buckets be done in parallel? If you have time, implement parallel scattering.