### parallel_for

The parallel loop with TBB looks like:
```cpp
#include <tbb/tbb.h>

using tbb::blocked_range;
using tbb::blocked_range2d;
using tbb::parallel_for;

...

// Parallelise over rows of C: each task computes the block of rows in r.
parallel_for(
    blocked_range<size_t>(0, M),
    [&](const blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i < r.end(); ++i) {
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t l = 0; l < K; l++) {
                    sum += A(i,l) * B(l,j);
                }
                C(i,j) = sum;
            }
        }
    }
);
```
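The A(i,l), B(l,j) and C(i,j) accessors are not shown above; they are assumed to index dense row-major storage. A minimal wrapper of that shape (a hypothetical helper, not the code actually measured) might look like:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper matching the A(i,l) / B(l,j) / C(i,j) accessors:
// a dense row-major double matrix indexed with operator().
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

private:
    std::size_t cols_;
    std::vector<double> data_;
};
```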
The observed performance is as follows:
| matmul | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| Time (s) | 7.23 | 3.64 | 1.83 | 0.943 |
| Speedup | 1.0 | 1.99 | 3.95 | 7.67 |
TBB shows very good parallel speedup for this problem. However, it should be remembered that the basic (single-threaded) performance of this code is very poor compared to the performance achieved by an optimised implementation of DGEMM (see assignment 1).
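For a sense of what calling such an optimised routine involves, here is a minimal sketch of computing the same product through a CBLAS interface; the row-major layout, the raw-pointer arguments and the function name matmul_blas are assumptions for illustration, not part of the code measured above.

```cpp
#include <cblas.h>

// Sketch: C = A * B using an optimised BLAS DGEMM.
// A is M x K, B is K x N, C is M x N, all dense row-major double arrays.
void matmul_blas(const double* A, const double* B, double* C,
                 int M, int N, int K) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0, A, K,    // alpha, A, leading dimension of A
                B, N,         // B, leading dimension of B
                0.0, C, N);   // beta, C, leading dimension of C
}
```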
The two-dimensional decomposition is very similar:
```cpp
parallel_for(
    blocked_range2d<size_t>(0, M, 0, N),
    [&](const blocked_range2d<size_t>& r) {
        for (size_t i = r.rows().begin(); i < r.rows().end(); ++i) {
            for (size_t j = r.cols().begin(); j < r.cols().end(); ++j) {
                ...
            }
        }
    }
);
```
The speedup with two-dimensional decomposition is not significantly better:
| matmul 2D | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| Time (s) | 7.17 | 3.59 | 1.83 | 0.935 |
| Speedup | 1.0 | 2.0 | 3.92 | 7.67 |
### parallel_reduce

The Euclidean distance calculation using parallel_reduce is as follows:
```cpp
double dist_squared = parallel_reduce(
    blocked_range<size_t>(size_t(0), kSize),
    0.0,                                        // identity value for the sum
    [&a, &b](const blocked_range<size_t>& r, double dist_squared) -> double {
        // Accumulate squared differences over this subrange.
        for (size_t i = r.begin(); i < r.end(); ++i) {
            dist_squared += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return dist_squared;
    },
    [](double x, double y) -> double {
        // Join the partial sums from different subranges.
        return x + y;
    }
);
distance = sqrt(dist_squared);
```
The observed performance is as follows:
| distance | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| Time (s) | 0.012 | 0.0084 | 0.0070 | 0.0068 |
| Speedup | 1.0 | 1.42 | 1.73 | 1.78 |
Parallel speedup is nowhere near as good for this code. Only a small number of arithmetic operations are performed in each iteration of the loop, so the task granularity may be too small for the scheduling overhead to be amortised efficiently.
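One option would be to coarsen the tasks by passing an explicit grainsize to blocked_range, so that TBB does not split the range below that size. The sketch below shows the idea; the value 10000 is an assumption that would need tuning, and if the loop is limited by memory bandwidth rather than scheduling overhead, a larger grainsize will not help much.

```cpp
// Sketch: the same reduction with a minimum chunk size (grainsize) of
// 10000 iterations, so each task does a non-trivial amount of work.
double dist_squared = parallel_reduce(
    blocked_range<size_t>(size_t(0), kSize, 10000),
    0.0,
    [&a, &b](const blocked_range<size_t>& r, double partial) -> double {
        for (size_t i = r.begin(); i < r.end(); ++i) {
            partial += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return partial;
    },
    [](double x, double y) -> double { return x + y; }
);
```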
Parallelising the sorting of each bucket is straightforward:
```cpp
// One task per bucket: buckets are disjoint, so each can be sorted
// independently with std::sort.
tbb::parallel_for(
    bin_index_type(0),
    kMaxBins,
    [&](bin_index_type bin) {
        std::sort(buckets[bin].begin(), buckets[bin].end());
    }
);
```
The observed performance is as follows:
| bucket sort | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|
| Time (s) | 11.1 | 6.58 | 4.49 | 3.53 |
| Speedup | 1.0 | 1.69 | 2.48 | 3.15 |
| Efficiency | 1.0 | 0.85 | 0.62 | 0.39 |
Increasing parallelism gives speedup up to 8 threads, although efficiency drops off quickly. A significant portion of the code (scattering elements to buckets and gathering them back into the sorted vector) has not been parallelised, so the reduction in efficiency is a consequence of Amdahl's Law.
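As a rough check, Amdahl's Law predicts a speedup of 1 / (s + (1 - s)/p) on p cores when a fraction s of the work remains serial. The measured figures are roughly consistent with s ≈ 0.2, which predicts speedups of about 1.7, 2.5 and 3.3 on 2, 4 and 8 cores, close to the 1.69, 2.48 and 3.15 in the table.

The gather step could in principle be parallelised as well. Below is a minimal sketch, not part of the measured code, which assumes the buckets hold double values and reuses the buckets, kMaxBins and bin_index_type names from above: a serial prefix sum over the bucket sizes gives each bucket's output offset, after which the buckets can be copied concurrently.

```cpp
#include <algorithm>
#include <vector>

// Sketch only: parallel gather of the sorted buckets into one output vector.
std::vector<size_t> offsets(kMaxBins + 1, 0);
for (bin_index_type bin = 0; bin < kMaxBins; ++bin) {
    offsets[bin + 1] = offsets[bin] + buckets[bin].size();   // prefix sum of sizes
}

std::vector<double> sorted(offsets[kMaxBins]);
tbb::parallel_for(
    bin_index_type(0),
    kMaxBins,
    [&](bin_index_type bin) {
        // Each bucket goes to its own disjoint region of the output,
        // so the copies can proceed without synchronisation.
        std::copy(buckets[bin].begin(), buckets[bin].end(),
                  sorted.begin() + offsets[bin]);
    }
);
```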