COMP4300/6430 2013 - Assignment 2

Stencil Computation Using Threads and Tasks

Deadline: 13:00 on Wednesday 22 May

This assignment is worth 20% of your total course mark. It will be marked out of 40 for COMP4300 students, and out of 50 for COMP6430 students.

Please report any errors, omissions and ambiguities to the course lecturer.
(Amendments to this document after its release in non-draft will be marked in red)

Submission

This assignment must be submitted electronically. This can be done using the following commands (according to your enrolment):

submit comp4300 ass2 ass2.zip

submit comp6430 ass2 ass2.zip

Objectives

In completing this assignment, you will gain:

Description

The file ass2.zip contains a file heat.cc, which is a sequential implementation of a stencil computation in C++ to solve the 2-dimensional heat diffusion problem by Jacobi iteration. It also contains a Makefile which should be used to compile the programs, and an example input file which you can use to execute the program non-interactively e.g. ./heat < heat.input. Copy these files into a local directory before you begin work on your programs. You can compile the sequential system with the simple command make heat.

On the NCI system, we will use the Intel C++ Compiler version 13.2.146 and Threading Building Blocks. A module file has been prepared for this assignment to load the required software. On Vayu, load the module as follows:

module use /short/c07/modules
module load tbb

The provided heat.cc sets up and solves a 2-D heat diffusion problem, identical to that described in lab 3.

This models a metal plate of size Rx by Ry for which the temperature at the edge is held at some constant value Tedge. The aim is to determine the temperature in the middle of the plate. This is done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the current temperatures at the four adjacent grid points. Iteration continues until the maximum change in temperature for any grid point is less than some threshold. A diagrammatic illustration of the problem is shown below:

Illustration of Jacobi iteration for heat problem. Grid of R_x by R_y points with temperature fixed to T_edge for edge points. New temperature for a grid point is average of current temperature at the four neighbouring grid points.

Section 1: TBB

Parallelise the code using Threading Building Blocks. The file containing your parallel code should be called heat_tbb.cc.

You may use either a one- or two-dimensional domain decomposition; the same decomposition should be used in section 2.

Section 2: Pthreads

Parallelise the code using Pthreads. The file containing your parallel code should be called heat_threads.cc. This code should use the same domain decomposition as used in your TBB code from section 1.

A reduction operation is required to calculate max_diff, the maximum difference for a grid point between iteration.
Implement this reduction using a local variable for the contribution from each thread, which is subsequently reduced to a single overall maximum.

Section 3: Measurement and Analysis

Measure and compare the performance of both codes using different numbers of threads on a single node of Vayu. Consider various problem sizes e.g. 40002, 60002 80002. What parallel speedup and efficiency do you observe? Comment on the speedup of your codes versus ‘ideal’ speedup.

Reflect on the advantages and disadvantages of implementing this code using TBB vs. Pthreads. What are the overheads in each code and what implications might this have for scaling to larger numbers of threads?

Section 4 (COMP6430 students only): False Sharing

Using your Pthreads code as a base, investigate the issue of false cache line sharing. To do this, replace the use of a local variable to store the max_diff on each thread with an element in a globally shared array, and then perform a reduction over these results. The file containing your test code should be called heat_false.cc.

Does spacing the elements in memory alter the performance for multiple threads? Is false sharing an issue elsewhere in the computation? Provide timing measurements and/or profile data to support your answers.

Additional Requirements

Your code must be written using C++, with calls to the TBB and Pthreads libraries, standard C++ libraries and system libraries.

You are required to submit a single file ass2.zip, containing:

Your code should be written with good programming style (and so should not produce any warnings when compiled with the -Wall option). It should be well commented, so as to enable the reader to understand how it works. Identifiers should be meaningful and (unless their role is trivial) commented where declared, any constants should be named rather than hard-coded, and defensive programming should be used. The latter includes checks being made for function calls that can return an error status; once detected, a message should be generated with a call to the system error function perror() and the process should exit with a non-zero status. It also includes using assert() to perform other kinds of checks, and the deleting (freeing) of any memory allocated using new or malloc before (normal) exit (this permits mcheck or valgrind to perform memory consistency checks).

Hints

If in doubt about which code style standard to follow, consider the Google C++ Style Guide. It is distributed with cpplint, which is a handy script that checks C++ files for compliance with the standard.

The NCI User Guide has many helpful suggestions on debugging sequential and parallel codes. A few minutes with gdb or Totalview (TotalView Tutorial) could save you hours reading the output of printf statements!

When running jobs on the Vayu system, please use the normal queue rather than the express queue, as the express queue is charged at 3x the standard usage units. Please be mindful that you share the allocation with all other students in this course!

Marking Scheme

SectionMarks
1: TBB 15
2: Pthreads 15
3: Analysis 10
Total (COMP4300) 40
4: False Sharing+10
Total (COMP6430)50

In awarding marks, considerations will include:

Extensions

See the course Administration page.

Late penalties are as follows:

How LatePenalty from 40 Marks
less than 1 hour-1
between 1 and 6 hours-2
between 6 and 24 hours (1 day)-5
between 24 and 48 hours (2 days)-10
between 48 and 72 hours (3 days)-20
more than 72 hours (4 days)-40 (forget it!)

Note: late penalties apply even if you submit just one file after the deadline.