COMP4300/6430 2013 - Assignment 2 sample solution

Section 1: Threading Building Blocks

Parallelising the code with TBB is fairly straightforward: simply convert the main loop into a parallel reduction. For example, for a one-dimensional decomposition:

// assumes <tbb/parallel_reduce.h> and <tbb/blocked_range.h> are included and
// that parallel_reduce and blocked_range (namespace tbb) are visible
max_diff = parallel_reduce(
    blocked_range<size_t>(1, n_y-1),        // interior rows, split among tasks
    0.0,                                    // identity value for the max reduction
    [&](const blocked_range<size_t>& r, double max)->double {
        // update this sub-range of rows and track the largest change
        for (size_t j = r.begin(); j < r.end(); ++j) {
            for (size_t i = 1; i < n_x-1; i++) {
                size_t idx = j*n_x+i;
                t_new[idx]=0.25*(t_old[idx+1] + t_old[idx-1] +
                                 t_old[idx+n_x] + t_old[idx-n_x]);
                double tdiff = fabs(t_old[idx] - t_new[idx]);
                max = std::max(max, tdiff);
            }
        }
        return max;
    },
    [](double x, double y)->double {        // combine partial maxima
        return std::max(x, y);
    }
);

However, the performance of the resulting code is surprisingly poor: the execution time for one thread increases by around 100%. The problem is not TBB overhead: it is that restructuring the loop in this way confuses the compiler such that it can no longer vectorise the loop. To observe this, request loop vectorisation reports from the Intel compiler:

$ icpc ... -vec-report=3 heat_tbb.cc ...
...
heat_tbb.cc(95): (col. 50) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
heat_tbb.cc(96): (col. 21) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
...

To regain the expected single-thread performance, it is necessary to inform the compiler that the code follows strict ISO/ANSI aliasing rules, via the -ansi-alias flag:

$ icpc ... -vec-report=3 -ansi-alias heat_tbb.cc ...

For a full example code using TBB see heat_tbb.cc.
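An alternative source-level fix, not used in heat_tbb.cc and shown here only as a sketch, is to take restrict-qualified local copies of the array pointers inside the lambda; this expresses the same no-aliasing guarantee directly in the code (assuming the arrays are plain double* buffers and the compiler accepts the __restrict keyword):

[&](const blocked_range<size_t>& r, double max)->double {
    // local copies promise the compiler that t_new and t_old never alias
    double* __restrict tn = t_new;
    const double* __restrict to = t_old;
    for (size_t j = r.begin(); j < r.end(); ++j) {
        for (size_t i = 1; i < n_x-1; i++) {
            size_t idx = j*n_x+i;
            tn[idx] = 0.25*(to[idx+1] + to[idx-1] + to[idx+n_x] + to[idx-n_x]);
            max = std::max(max, fabs(to[idx] - tn[idx]));
        }
    }
    return max;
}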

Section 2: Pthreads

There are a variety of ways to parallelise the code using Pthreads. Probably the best is to have each thread perform all iterations, operating on its own portion of the grid; the threads must then synchronise at each iteration. Each thread calculates a local maximum difference; these local maxima can be combined by a single thread, or each thread can fold its value into the global max_diff inside a critical section protected by a mutex, as in the example below.

do {
    iter++;

    // exactly one thread (the "serial" thread) resets the global maximum
    if (pthread_barrier_wait(barr) == PTHREAD_BARRIER_SERIAL_THREAD) {
        max_diff = 0.0;
    }
    // wait until the reset is visible to every thread
    pthread_barrier_wait(barr);

    // update local values and calculate max diff
    ...

    // update global max_diff
    pthread_mutex_lock(max_diff_lock);
    max_diff = std::max(max_diff, local_diff);
    pthread_mutex_unlock(max_diff_lock);

    // ensure all threads have contributed to max_diff before the convergence test
    pthread_barrier_wait(barr);
    ...
} while (max_diff > converge && iter < max_iter);

Note the use of barriers: the first pair ensures that the global max_diff is zeroed by exactly one thread before any thread updates it for this iteration, and the final barrier ensures that all threads have updated max_diff before it is used to test for convergence.

For a full example code using Pthreads see heat_threads.cc.
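The surrounding setup (thread creation, barrier and mutex initialisation) might look roughly like the sketch below; the worker signature, the thread_arg structure and the row partitioning are illustrative assumptions rather than a description of heat_threads.cc:

// sketch only: requires <pthread.h> and <vector>
pthread_barrier_t barrier;
pthread_mutex_t   max_diff_mutex = PTHREAD_MUTEX_INITIALIZER;

struct thread_arg {
    size_t j_start, j_end;            // rows this thread updates
    pthread_barrier_t *barr;
    pthread_mutex_t   *max_diff_lock;
};

void *worker(void *arg);              // runs the do/while loop shown above

// in main():
pthread_barrier_init(&barrier, NULL, n_threads);
std::vector<pthread_t>  threads(n_threads);
std::vector<thread_arg> args(n_threads);
size_t rows = (n_y - 2) / n_threads;  // interior rows per thread
for (int t = 0; t < n_threads; t++) {
    args[t].j_start = 1 + t*rows;
    args[t].j_end   = (t == n_threads-1) ? n_y-1 : args[t].j_start + rows;
    args[t].barr          = &barrier;
    args[t].max_diff_lock = &max_diff_mutex;
    pthread_create(&threads[t], NULL, worker, &args[t]);
}
for (int t = 0; t < n_threads; t++)
    pthread_join(threads[t], NULL);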

Section 3: Measurement and Analysis

There are two quad-core Xeon X5570 processors on each node of Vayu, giving 8 cores per node. TBB detects this, so by default it creates 8 worker threads (see tbb::task_scheduler_init::default_num_threads()).
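For the scaling runs the number of worker threads can be limited explicitly; a minimal sketch using the task_scheduler_init interface (taking the thread count from the command line is an assumption for illustration):

#include <tbb/task_scheduler_init.h>
#include <cstdlib>

int main(int argc, char **argv)
{
    // default_num_threads() reports 8 on a dual quad-core Vayu node
    int n_threads = (argc > 1) ? std::atoi(argv[1])
                               : tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(n_threads);   // limit the TBB worker pool
    // ... set up the grid and run the parallel_reduce iteration as in Section 1 ...
    return 0;
}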

Table 1 shows the execution time for heat_tbb.cc and heat_threads.cc as well as speedup and efficiency compared to the fastest sequential code, which is heat.cc.
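(Throughout, speedup is S(p) = T_seq / T_p and efficiency is E(p) = S(p) / p, where T_seq is the heat.cc time and T_p the time on p threads; for example, heat_tbb on 2 threads gives S = 0.87 / 0.46 ≈ 1.88 and E = 1.88 / 2 = 0.94.)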

Table 1: strong scaling for heat stencil: 4000^2
Sequential time (s) (heat.cc): 0.87
No. Threads | heat_tbb time (s) | Speedup | Efficiency | heat_threads time (s) | Speedup | Efficiency
1           | 0.87              | 1.00    | 1.00       | 0.93                  | 0.94    | 0.94
2           | 0.46              | 1.88    | 0.94       | 0.70                  | 1.24    | 0.62
4           | 0.34              | 2.56    | 0.64       | 0.43                  | 2.02    | 0.50
8           | 0.35              | 2.48    | 0.31       | 0.40                  | 2.17    | 0.27

Table 2 shows the corresponding results for a problem size of 8000^2.

Table 2: strong scaling for heat stencil: 8000^2
Sequential time (s) (heat.cc): 3.48
No. Threads | heat_tbb time (s) | Speedup | Efficiency | heat_threads time (s) | Speedup | Efficiency
1           | 3.50              | 0.99    | 0.99       | 3.70                  | 0.94    | 0.94
2           | 1.87              | 1.86    | 0.93       | 1.91                  | 1.82    | 0.91
4           | 1.46              | 2.38    | 0.60       | 1.72                  | 2.02    | 0.51
8           | 1.39              | 2.50    | 0.31       | 1.76                  | 1.98    | 0.25

Speedup and efficiency are good for both codes with two threads, but drop off rapidly for higher numbers of threads. Speedup with 16 threads is less than speedup with 8 threads for all problems tested (not shown).

The slowdown is not due to locking overhead. This can be confirmed for heat_threads.cc by removing all locks and barriers (and the convergence check): the resulting code is incorrect, but it still shows no performance improvement at 4 to 8 threads.

Profiling with hpctoolkit shows that for heat_threads.cc, as the number of threads increases, the number of Level 2 cache misses increases markedly.
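One way to collect such counts with hpctoolkit is sketched below; the PAPI event names are assumptions that depend on the local PAPI installation, and the measurement directory name follows hpcrun's default convention:

$ hpcrun -e PAPI_L2_LDM -e PAPI_L2_STM ./heat_threads ...
$ hpcstruct heat_threads
$ hpcprof -S heat_threads.hpcstruct hpctoolkit-heat_threads-measurements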

Table 3: L2 cache misses for heat stencil: 8000^2
No. Threads | L2 load misses | L2 store misses
1           | 1.0e+07        | 2.0e+06
2           | 2.6e+07        | 4.0e+06
4           | 4.2e+07        | 2.8e+07
8           | 9.8e+07        | 6.4e+07

hpctoolkit also indicates that the misses occur entirely within the t_new update loop (see Figure 1).

Figure 1: Screenshot of hpcviewer output for the heat stencil with 8 threads, showing the L2 cache load misses within the update loop.

The increase in cache misses indicates growing interference between threads as their number increases, which may contribute to the lack of speedup.

Apart from performance, we are also interested in the productivity implications of using TBB vs. Pthreads. The TBB version of the code is much simpler to implement, requiring changes to less than 20 lines of code, which is less than 1/4 of the LOC required using Pthreads. In addition, the TBB changes are entirely ‘inline’ in the existing code - they do not require new functions to be written. The only difficulty encountered using TBB was the failure of compiler vectorisation - this reminds us that when using high-level libraries, it is sometimes still necessary to examine low-level implementation details.

Both codes have some locking overhead: in the Pthreads version it is explicit in the use of a mutex, whereas in TBB it is implicit in the parallel reduction. The main parallel overhead is cache interference (see above); it may be possible to reduce this through blocking for improved cache usage and/or padding to reduce false sharing on t_new. There are also overheads due to thread creation (Pthreads) and task creation and scheduling (TBB). These can be expected to grow with larger numbers of threads, although if the problem size were increased accordingly they may not have a significant impact.
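As an illustration of the blocking idea, the TBB reduction could be written over a two-dimensional range so that the scheduler hands each task a roughly cache-sized tile of the grid. This is a sketch of the approach only, not the code measured above:

// assumes <tbb/parallel_reduce.h>, <tbb/blocked_range2d.h>, <algorithm>, <cmath>
max_diff = parallel_reduce(
    blocked_range2d<size_t>(1, n_y-1, 1, n_x-1),   // 2D tiles over the interior
    0.0,
    [&](const blocked_range2d<size_t>& r, double max)->double {
        for (size_t j = r.rows().begin(); j < r.rows().end(); ++j) {
            for (size_t i = r.cols().begin(); i < r.cols().end(); ++i) {
                size_t idx = j*n_x+i;
                t_new[idx] = 0.25*(t_old[idx+1] + t_old[idx-1] +
                                   t_old[idx+n_x] + t_old[idx-n_x]);
                max = std::max(max, fabs(t_old[idx] - t_new[idx]));
            }
        }
        return max;
    },
    [](double x, double y)->double {
        return std::max(x, y);
    }
);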

Section 4 (COMP6430 students only): False Sharing

For a full example code that tests false sharing between threads see heat_false.cc.

Table 4 shows timings for heat_false, which demonstrate that false sharing significantly reduces performance. Increasing the stride used on the max_diff_local array from 1 to 8 reduces the execution time by over 50%.

Table 4: Timings for heat_false with different strides: 8000^2
Stride | Time (s)
1      | 5.12
2      | 4.43
4      | 3.39
8      | 2.43
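A minimal sketch of the stride idea (variable names follow the text; heat_false.cc may differ in detail):

// requires <vector> and <algorithm>; one slot per thread, 'stride' doubles apart.
// With stride == 1 several slots share a 64-byte cache line; with stride == 8
// each slot occupies its own cache line, so the updates no longer falsely share.
const size_t stride = 8;
std::vector<double> max_diff_local(n_threads * stride, 0.0);

// inside thread t's update loop:
max_diff_local[t * stride] = std::max(max_diff_local[t * stride], tdiff);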

False sharing may also be an issue for updates to t_new: each thread updates disjoint elements of this array, but if two threads update elements that reside on the same cache line, false sharing will occur. For example, with a one-dimensional (row) decomposition, the last interior element of one thread's final row and the first interior element of the next thread's first row lie close together in memory (separated only by the boundary elements) and may fall on the same cache line.

Note: the false sharing of t_new should not be confused with the true sharing which occurs when one thread updates an element of t_new which is then read (as t_old) by another thread in the next iteration.