Parallelising the code with TBB is fairly straightforward: simply convert the main loop into a parallel reduction. For example, for a one-dimensional decomposition:

```cpp
max_diff = parallel_reduce(
    blocked_range<size_t>(1, n_y-1),
    0.0,
    [&](const blocked_range<size_t>& r, double max) -> double {
        for (size_t j = r.begin(); j < r.end(); ++j) {
            for (size_t i = 1; i < n_x-1; i++) {
                size_t idx = j*n_x + i;
                t_new[idx] = 0.25*(t_old[idx+1] + t_old[idx-1] +
                                   t_old[idx+n_x] + t_old[idx-n_x]);
                double tdiff = fabs(t_old[idx] - t_new[idx]);
                max = std::max(max, tdiff);
            }
        }
        return max;
    },
    [](double x, double y) -> double {
        return std::max(x, y);
    }
);
```
However, the performance of the resulting code is surprisingly poor: the execution time for one thread increases by around 100%. The problem is not TBB overhead: it is that restructuring the loop in this way confuses the compiler such that it can no longer vectorise the loop. To observe this, request loop vectorisation reports from the Intel compiler:
```
$ icpc ... -vec-report=3 heat_tbb.cc ...
...
heat_tbb.cc(95): (col. 50) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
heat_tbb.cc(96): (col. 21) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
...
```
To regain the expected single-thread performance, it is necessary to inform the compiler that your code follows strict ISO/ANSI aliasing rules:
```
$ icpc ... -vec-report=3 -ansi-alias heat_tbb.cc ...
```
For a full example code using TBB see heat_tbb.cc.
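The -ansi-alias flag asserts the no-aliasing property globally. An alternative is to make the same promise locally by restrict-qualifying the pointers; the sketch below is illustrative only (the function name and signature are not taken from heat_tbb.cc), using the `__restrict__` extension supported by GCC, Clang, and the Intel compiler:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative sketch: __restrict__ promises the compiler that t_new and
// t_old do not alias, restoring the guarantee that -ansi-alias provides
// globally and allowing the inner stencil loop to be vectorised.
double update_row(double* __restrict__ t_new,
                  const double* __restrict__ t_old,
                  std::size_t n_x, std::size_t j) {
    double max_diff = 0.0;
    for (std::size_t i = 1; i < n_x - 1; ++i) {
        std::size_t idx = j * n_x + i;
        t_new[idx] = 0.25 * (t_old[idx + 1] + t_old[idx - 1] +
                             t_old[idx + n_x] + t_old[idx - n_x]);
        max_diff = std::max(max_diff, std::fabs(t_old[idx] - t_new[idx]));
    }
    return max_diff;
}
```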
There are a variety of ways to parallelise the code using Pthreads. Probably the best is to have each thread perform all iterations, operating on its own portion of the grid; the threads must then synchronise at each iteration. Each thread calculates a local maximum difference, which can either be combined by a single thread or accumulated into the global max_diff in a critical section protected by a mutex, as in the example below.
```cpp
do {
    iter++;
    if (pthread_barrier_wait(barr) == PTHREAD_BARRIER_SERIAL_THREAD) {
        max_diff = 0.0;
    }
    pthread_barrier_wait(barr);
    // update local values and calculate max diff
    ...
    // update global max_diff
    pthread_mutex_lock(max_diff_lock);
    max_diff = std::max(max_diff, local_diff);
    pthread_mutex_unlock(max_diff_lock);
    pthread_barrier_wait(barr);
    ...
} while (max_diff > converge && iter < max_iter);
```
Note the use of barriers: the first two ensure that the global max_diff is zeroed by a single thread before any thread updates it for this iteration, and the final one ensures that max_diff has been updated by all threads before it is used to test for convergence.
For a full example code using Pthreads see heat_threads.cc.
There are two quad-core Xeon X5570 processors on each node of Vayu. TBB recognises this and so by default will create 8 worker threads (see tbb::task_scheduler_init::default_num_threads()).
Table 1 shows the execution time for heat_tbb.cc and heat_threads.cc as well as speedup and efficiency compared to the fastest sequential code, which is heat.cc.
Sequential time (heat.cc): 0.87 s

| No. Threads | heat_tbb time (s) | Speedup | Efficiency | heat_threads time (s) | Speedup | Efficiency |
|---|---|---|---|---|---|---|
| 1 | 0.87 | 1.00 | 1.00 | 0.93 | 0.94 | 0.94 |
| 2 | 0.46 | 1.88 | 0.94 | 0.70 | 1.24 | 0.62 |
| 4 | 0.34 | 2.56 | 0.64 | 0.43 | 2.02 | 0.50 |
| 8 | 0.35 | 2.48 | 0.31 | 0.40 | 2.17 | 0.27 |
Table 2 shows the corresponding times for a larger problem size.
Sequential time (heat.cc): 3.48 s

| No. Threads | heat_tbb time (s) | Speedup | Efficiency | heat_threads time (s) | Speedup | Efficiency |
|---|---|---|---|---|---|---|
| 1 | 3.50 | 0.99 | 0.99 | 3.70 | 0.94 | 0.94 |
| 2 | 1.87 | 1.86 | 0.93 | 1.91 | 1.82 | 0.91 |
| 4 | 1.46 | 2.38 | 0.60 | 1.72 | 2.02 | 0.51 |
| 8 | 1.39 | 2.50 | 0.31 | 1.76 | 1.98 | 0.25 |
Speedup and efficiency are good for both codes with two threads, but drop off rapidly for higher numbers of threads. Speedup with 16 threads is less than speedup with 8 threads for all problems tested (not shown).
The slowdown is not due to locking overhead. This can be confirmed for heat_threads.cc by removing all locks and barriers (and the convergence check): the resulting code is incorrect, but it still shows no performance improvement at 4-8 threads.
Profiling with hpctoolkit shows that for heat_threads.cc, as the number of threads increases, the number of Level 2 cache misses increases markedly.
| No. Threads | L2 load misses | L2 store misses |
|---|---|---|
| 1 | 1.0e+07 | 2.0e+06 |
| 2 | 2.6e+07 | 4.0e+06 |
| 4 | 4.2e+07 | 2.8e+07 |
| 8 | 9.8e+07 | 6.4e+07 |
hpctoolkit also indicates that the misses occur entirely within the t_new update loop (see figure 1).

The increase in cache misses indicates increased interference between larger numbers of threads - this may contribute to the lack of speedup.
Apart from performance, we are also interested in the productivity implications of using TBB vs. Pthreads. The TBB version of the code is much simpler to implement, requiring changes to less than 20 lines of code, which is less than 1/4 of the LOC required using Pthreads. In addition, the TBB changes are entirely ‘inline’ in the existing code - they do not require new functions to be written. The only difficulty encountered using TBB was the failure of compiler vectorisation - this reminds us that when using high-level libraries, it is sometimes still necessary to examine low-level implementation details.
Both codes have some locking overhead - in the Pthreads version this is explicit through the use of a mutex, whereas in TBB it is implicit in the parallel reduction. The main parallel overhead is cache interference (see above) - it may be possible to reduce this through blocking for improved cache usage and/or padding to reduce false sharing on t_new. There are also overheads due to thread creation (Pthreads) and task creation and scheduling (TBB). These can be expected to grow with larger numbers of threads, although if the problem size were increased accordingly the impact may not be significant.
For a full example code that tests false sharing between threads see heat_false.cc.
Table 4 shows timings for heat_false, which demonstrate that false sharing significantly reduces performance. Increasing the stride used on the max_diff_local array from 1 to 8 reduces the execution time by over 50%.
| Stride | Time (s) |
|---|---|
| 1 | 5.12 |
| 2 | 4.43 |
| 4 | 3.39 |
| 8 | 2.43 |
False sharing may also be an issue for updates to t_new: each thread updates disjoint elements of this array, but if two threads are updating elements that reside on the same cache line, false sharing will occur.
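One common remedy, with the same effect as increasing the stride on max_diff_local, is to pad each thread's accumulator out to a full cache line so that no two threads write to the same line. A sketch of this idea, using std::thread rather than Pthreads for brevity (the 64-byte line size matches the Xeon X5570; the function and type names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Padding each per-thread accumulator to 64 bytes places it on its own
// cache line, avoiding false sharing between threads - equivalent to a
// stride of 8 doubles in the max_diff_local array.
struct alignas(64) PaddedMax {
    double value = 0.0;
};

double parallel_max(const std::vector<double>& data, int n_threads) {
    std::vector<PaddedMax> local(n_threads);   // one padded slot per thread
    std::vector<std::thread> threads;
    std::size_t chunk = data.size() / n_threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = (t == n_threads - 1) ? data.size() : begin + chunk;
            for (std::size_t i = begin; i < end; ++i)
                local[t].value = std::max(local[t].value, std::fabs(data[i]));
        });
    }
    for (auto& th : threads) th.join();
    double result = 0.0;                       // final serial reduction
    for (const auto& l : local) result = std::max(result, l.value);
    return result;
}
```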
Note: the false sharing of t_new should not be confused with the true sharing which occurs when one thread updates an element of t_new which is then read (as t_old) by another thread in the next iteration.