COMP4300/6430 2013 - Some Notes on Lab 4 - Shared Memory Parallel Programming with TBB

Section 1: Matrix Multiplication Using parallel_for

The parallel loop with TBB looks like:

#include <tbb/tbb.h>
using tbb::blocked_range;
using tbb::blocked_range2d;
using tbb::parallel_for;
...
parallel_for(
        blocked_range<size_t>(0, M),
        [&](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i < r.end(); ++i) {
                for (size_t j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (size_t l = 0;  l < K; l++) {
                        sum += A(i,l) * B(l,j);
                    }
                    C(i,j) = sum;
                }
            }
        }
);
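
For reference, here is a minimal self-contained version of this example. The matrix sizes, the flat row-major storage, and the A/B/C accessors are assumptions standing in for whatever the lab framework provides:

#include <cstdio>
#include <vector>
#include <tbb/tbb.h>
using tbb::blocked_range;
using tbb::parallel_for;

int main() {
    const size_t M = 1024, N = 1024, K = 1024;   // hypothetical sizes
    std::vector<double> a(M * K, 1.0), b(K * N, 1.0), c(M * N, 0.0);

    // Row-major accessors standing in for the A(i,j)-style notation above.
    auto A = [&](size_t i, size_t j) -> double& { return a[i * K + j]; };
    auto B = [&](size_t i, size_t j) -> double& { return b[i * N + j]; };
    auto C = [&](size_t i, size_t j) -> double& { return c[i * N + j]; };

    tbb::tick_count t0 = tbb::tick_count::now();
    parallel_for(
        blocked_range<size_t>(0, M),
        [&](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i < r.end(); ++i) {
                for (size_t j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (size_t l = 0; l < K; l++) {
                        sum += A(i, l) * B(l, j);
                    }
                    C(i, j) = sum;
                }
            }
        }
    );
    tbb::tick_count t1 = tbb::tick_count::now();

    // Each element of C should equal K when A and B are all ones.
    printf("time = %.3f s, C(0,0) = %.1f\n", (t1 - t0).seconds(), C(0, 0));
    return 0;
}

This compiles with something like g++ -std=c++11 -O2 matmul.cpp -ltbb.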

The observed performance is as follows:

matmul        Number of cores
              1       2       4       8
Time (s)      7.23    3.64    1.83    0.943
Speedup       1.0     1.99    3.95    7.67

TBB shows very good parallel speedup for this problem. However, it should be remembered that the basic (single-threaded) performance of this code is very poor compared to that achieved by an optimised implementation of DGEMM - see assignment 1.

The two-dimensional decomposition is very similar:

parallel_for(
        blocked_range2d<size_t>(0, M, 0, N),
        [&](const blocked_range2d<size_t>& r) {
            for (size_t i = r.rows().begin(); i < r.rows().end(); ++i) {
                for (size_t j = r.cols().begin(); j < r.cols().end(); ++j) {
                    double sum = 0.0;
                    for (size_t l = 0; l < K; l++) {
                        sum += A(i,l) * B(l,j);
                    }
                    C(i,j) = sum;
                }
            }
        }
);
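
If you want to experiment with the block shape, blocked_range2d accepts explicit grain sizes, and a simple_partitioner makes TBB honour them rather than letting the default auto_partitioner choose chunk sizes. A sketch (the 64x64 grain sizes are illustrative, not tuned values from the lab):

parallel_for(
        blocked_range2d<size_t>(0, M, 64, 0, N, 64),
        [&](const blocked_range2d<size_t>& r) {
            for (size_t i = r.rows().begin(); i < r.rows().end(); ++i) {
                for (size_t j = r.cols().begin(); j < r.cols().end(); ++j) {
                    double sum = 0.0;
                    for (size_t l = 0; l < K; l++) {
                        sum += A(i,l) * B(l,j);
                    }
                    C(i,j) = sum;
                }
            }
        },
        tbb::simple_partitioner()   // split recursively until blocks are at most 64x64
);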

The speedup with two-dimensional decomposition is not significantly better:

matmul 2D     Number of cores
              1       2       4       8
Time (s)      7.17    3.59    1.83    0.935
Speedup       1.0     2.0     3.92    7.67

Section 2: Euclidean Distance Using parallel_reduce

The Euclidean distance calculation using parallel_reduce is as follows (this snippet also needs using tbb::parallel_reduce; and #include <cmath> for sqrt):

double dist_squared = parallel_reduce(
        blocked_range<size_t>(size_t(0), kSize),
        0.0,
        [&a, &b](const blocked_range<size_t>& r, double dist_squared)->double {
            for (size_t i = r.begin(); i < r.end(); ++i) {
                dist_squared += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return dist_squared;
        },
        [](double x, double y)->double {
            return x + y;
        }
);
distance = sqrt(dist_squared);

The observed performance is as follows:

distance      Number of cores
              1       2       4       8
Time (s)      0.012   0.0084  0.0070  0.0068
Speedup       1.0     1.42    1.73    1.78

Parallel speedup is nowhere near as good for this code: only a very small number of operations is performed within each iteration of the loop, so the task granularity may be too small to amortise the overhead of scheduling the tasks.
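
One way to test the granularity hypothesis is to pass an explicit grain size to blocked_range and pair it with a simple_partitioner, so that each task covers enough iterations to amortise the scheduling overhead. A sketch (the grain size of 100000 is a starting point to experiment with, not a measured optimum):

double dist_squared = parallel_reduce(
        blocked_range<size_t>(size_t(0), kSize, 100000),  // explicit grain size (illustrative)
        0.0,
        [&a, &b](const blocked_range<size_t>& r, double partial)->double {
            for (size_t i = r.begin(); i < r.end(); ++i) {
                partial += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return partial;
        },
        [](double x, double y)->double {
            return x + y;
        },
        tbb::simple_partitioner()  // split down to (roughly) the grain size, no smaller
);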

Section 3: Parallel Bucket Sort

Parallelising the sorting of each bucket is straightforward using the integer-range overload of tbb::parallel_for:

tbb::parallel_for(
    bin_index_type(0), 
    kMaxBins,
    [&](bin_index_type bin) {
        std::sort(buckets[bin].begin(), buckets[bin].end());
    }
);
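
For context, here is a minimal sketch of the surrounding bucket sort. The bucket layout, bin_index_type, and the uniform [0, 1) keys are assumptions about the lab code, not its actual definitions:

#include <algorithm>
#include <cstdlib>
#include <vector>
#include <tbb/tbb.h>

typedef size_t bin_index_type;

int main() {
    const size_t kSize = 10 * 1000 * 1000;    // hypothetical input size
    const bin_index_type kMaxBins = 1024;     // hypothetical bucket count

    // Hypothetical input: keys uniformly distributed in [0, 1).
    std::vector<double> data(kSize);
    for (size_t i = 0; i < kSize; ++i)
        data[i] = double(rand()) / RAND_MAX;

    // Scatter phase (serial, as in the lab code): place each key in its bucket.
    std::vector<std::vector<double>> buckets(kMaxBins);
    for (size_t i = 0; i < kSize; ++i) {
        bin_index_type bin = bin_index_type(data[i] * kMaxBins);
        if (bin >= kMaxBins) bin = kMaxBins - 1;  // guard against a key of exactly 1.0
        buckets[bin].push_back(data[i]);
    }

    // Sort each bucket in parallel: buckets are independent, so no locking is needed.
    tbb::parallel_for(
        bin_index_type(0),
        kMaxBins,
        [&](bin_index_type bin) {
            std::sort(buckets[bin].begin(), buckets[bin].end());
        }
    );

    // Gather phase (serial): concatenate the sorted buckets.
    std::vector<double> sorted;
    sorted.reserve(kSize);
    for (bin_index_type bin = 0; bin < kMaxBins; ++bin)
        sorted.insert(sorted.end(), buckets[bin].begin(), buckets[bin].end());
    return 0;
}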

The observed performance is as follows:

bucket sort   Number of cores
              1       2       4       8
Time (s)      11.1    6.58    4.49    3.53
Speedup       1.0     1.69    2.48    3.15
Efficiency    1.0     0.85    0.62    0.39

Increasing parallelism shows speedup up to 8 threads, although efficiency drops off quickly. A significant portion of the code (scattering elements to buckets and gathering them back into the sorted vector) has not been parallelised, so the reduction in efficiency is a consequence of Amdahl's Law.
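
As a rough check, Amdahl's Law with parallelised fraction p predicts speedup(n) = 1 / ((1 - p) + p/n). Fitting the measured 8-core speedup, 1 / ((1 - p) + p/8) = 3.15, gives p ≈ 0.78, i.e. roughly 22% of the runtime is still serial. Under this idealised model the speedup can never exceed 1 / (1 - p) ≈ 4.5, however many cores are used.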