Hands-On Session PG-2: 2D Stencil in Chapel

Objective: To develop a simple Chapel program to exploit parallelism across multiple nodes of an HPC cluster.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

In this exercise, we will develop a Chapel version of the heat program that we used in previous exercises.

Setup your Chapel environment on Raijin just as for the previous session.

Exercise 1: Single-Locale Parallelism

The file heat.chpl contains a simple sequential heat program. Compile it and confirm that it runs correctly on a single locale, using the following commands:
  bash-4.1$ make
  bash-4.1$ ./heat -nl 1 < heat.input
The variable mxDiff is computed using an array reduction; make sure you understand how this works.
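As a rough sketch of what such a reduction looks like (the array names follow those in heat.chpl, but the declarations here are illustrative, not the actual program):

```chapel
config const n = 8;
const D = {1..n, 1..n};
var told, tnew: [D] real;

// ... tnew is computed from told each iteration ...

// A whole-array reduction: subtraction and abs() are promoted
// element-wise over the arrays, and "max reduce" collapses the
// result to the largest absolute change between iterations.
const mxDiff = max reduce abs(tnew - told);
```

Because the reduction is expressed over whole arrays, Chapel is free to evaluate it in parallel.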

Change the program so that the loop which computes individual points in tnew is performed in parallel. Does this speed up the computation?
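A minimal sketch of the parallelized update, assuming a standard four-point averaging stencil over an interior domain D (the exact stencil and array bounds in heat.chpl may differ):

```chapel
config const n = 8;
const DExt = {0..n+1, 0..n+1};   // domain including boundary points
const D    = {1..n, 1..n};       // interior points to update
var told, tnew: [DExt] real;

// Each (i, j) update reads only told and writes a distinct element
// of tnew, so the iterations are independent and a forall lets the
// Chapel runtime spread them across tasks:
forall (i, j) in D do
  tnew[i, j] = 0.25 * (told[i-1, j] + told[i+1, j] +
                       told[i, j-1] + told[i, j+1]);
```

Changing `for` to `forall` is the only modification needed; the loop body itself is unchanged.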

Try setting different values -- e.g. [1,2,4,8] -- for the environment variable CHPL_RT_NUM_THREADS_PER_LOCALE to control the number of Chapel tasks used to execute forall loops. Do you get different results if you change the number of Chapel tasks used?
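One way to run the comparison, assuming you are in the directory containing the compiled heat binary and heat.input:

```
for t in 1 2 4 8; do
  echo "== CHPL_RT_NUM_THREADS_PER_LOCALE=$t =="
  CHPL_RT_NUM_THREADS_PER_LOCALE=$t ./heat -nl 1 < heat.input
done
```

Setting the variable on the command line like this scopes it to a single run, so the four executions do not interfere with each other.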

Exercise 2: Distributed Parallelism

Change the program to partition the arrays told and tnew across all locales using a block distribution.
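A sketch of what a block-distributed declaration might look like; the domain bounds are illustrative, and the `dmapped` syntax shown here is that of older Chapel releases, so check your installation's documentation:

```chapel
use BlockDist;

config const n = 8;
// Distribute the (boundary-extended) index set across all locales in
// contiguous blocks; each array element is then stored on the locale
// that owns its indices, and forall loops over the domain run on the
// owning locales.
const DExt = {0..n+1, 0..n+1}
             dmapped Block(boundingBox = {0..n+1, 0..n+1});
var told, tnew: [DExt] real;
```

Only the domain declaration changes; the loops written against the domain work unmodified, with remote elements fetched automatically.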

Run the program on [1,2,4,8] locales using the batch_heat script, and observe the results.

Note the following line from the batch script:

  MPIRUN_CMD="mpirun -np %N --bind-to socket --map-by ppr:1:socket %C"  
This modifies the MPI command used to launch the Chapel program. What effect does it have, and what happens if you remove this line? You may wish to consult the documentation for mpirun, particularly the options for mapping processes.

Exercise 3: Improving Communication Performance

The communication performance of the program using a block distribution is far from optimal. In every iteration of the main loop, each locale needs to access multiple elements of told that are stored on other locales. In the current version of the program, each element is accessed individually, resulting in many more messages than the original MPI version of the program.

Improve your program by using a stencil distribution. Run the improved program on [1,2,4,8] locales using the batch_heat script, and observe the results.
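A sketch of the stencil-distributed version, again with illustrative bounds and older-release `dmapped` syntax, so adapt it to your Chapel version:

```chapel
use StencilDist;

config const n = 8;
// Like Block, but each locale additionally caches a one-element-wide
// "fluff" halo of its neighbours' boundary elements, specified per
// dimension by the fluff argument:
const DExt = {0..n+1, 0..n+1}
             dmapped Stencil(boundingBox = {0..n+1, 0..n+1},
                             fluff = (1, 1));
var told, tnew: [DExt] real;

// Inside the main loop, refresh the cached halo once per iteration:
told.updateFluff();
// ... stencil update of tnew from told; neighbour reads now hit the
// local halo copies instead of generating per-element communication.
```

The single `updateFluff()` call exchanges each halo in one bulk message per neighbour, which is the same communication pattern as the original MPI version of the program.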