Hands-On Session PG-1: Locales and Domain Maps in Chapel

Objective: To understand different approaches to data distribution in Chapel.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

In this exercise, we will work with Chapel programs that distribute data between locales running on different nodes of the Raijin system.

Setup

Load the Chapel programming environment with the following command:
  bash-4.1$ . /short/c37/modules/chapel-1.16.0/util/setchplenv.bash
Chapel requires a newer version of Open MPI than the one we have been using, because it needs support for the MPI_THREAD_MULTIPLE threading mode. You can confirm that your Chapel environment is correctly configured for Raijin as follows:
  bash-4.1$ chpl --version
  chpl Version 1.16.0
  bash-4.1$ env | grep CHPL
  CHPL_COMM=gasnet
  CHPL_COMM_SUBSTRATE=ibv
  CHPL_HOST_PLATFORM=linux64
  CHPL_TARGET_ARCH=sandybridge
  CHPL_HOME=/short/c37/modules/chapel-1.16.0
  CHPL_LAUNCHER=gasnetrun_ibv
  bash-4.1$ mpirun -version
  mpirun (Open MPI) 3.0.0

  Report bugs to http://www.open-mpi.org/community/help/

Exercise 1: Using Domain Maps to Distribute Data

The file daxpy.chpl contains a simple Chapel program which allocates two arrays X and Y, and then performs the following update:

Y = α × X + Y

Both arrays cover the same index set, but they have different distributions (i.e. the indices are mapped differently to the set of locales).

Compile and run the program on four locales:

  make
  ./daxpy -nl 4 --N 10000
You should see that X is divided into equal chunks for each locale using a block distribution, whereas Y is stored entirely on Locales[0].
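
The ownership pattern described above can be reproduced in a small stand-alone program. What follows is only a sketch and is not the contents of daxpy.chpl: the names Space, BlockD, N and alpha are assumptions made for illustration, and the `dmapped Block(boundingBox=...)` spelling is the one used by Chapel releases of the 1.16 era (newer releases spell distributions differently).

```chapel
use BlockDist;

config const N = 8;          // hypothetical problem size for this sketch
config const alpha = 2.0;    // hypothetical scaling factor

const Space  = {1..N};                                   // the shared index set
const BlockD = Space dmapped Block(boundingBox=Space);   // contiguous chunk per locale

var X: [BlockD] real = 1.0;  // block-distributed across all locales
var Y: [Space]  real = 3.0;  // not distributed: stored on the locale that declares it

// Report which locale owns each element of each array.
for i in Space do
  writeln("i = ", i, ": X on locale ", X[i].locale.id,
          ", Y on locale ", Y[i].locale.id);

// Whole-array daxpy update. Wherever an element of X and the matching
// element of Y live on different locales, one of them has to be moved
// across the network before the two can be combined.
Y = alpha * X + Y;
writeln(Y);
```

Assuming the sketch is saved as owners.chpl (a hypothetical file name), it could be built with `chpl -o owners owners.chpl` and run with `./owners -nl 4`, using the same `-nl` flag as the daxpy command above.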

Change the program so that both arrays use the same domain map. You should see a significant improvement in performance. Can you explain why?

Now change the program so that Y uses a cyclic distribution across all locales. How does this affect the performance? What happens if both arrays use the same cyclic distribution?
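
To see how a cyclic distribution differs from a block distribution, it can help to print the owner of every index under each one. The sketch below does this for two small integer arrays. It does not modify daxpy.chpl, the names Space, BlockD, CyclicD, B and C are assumptions made for illustration, and the `dmapped` syntax again follows Chapel releases of the 1.16 era.

```chapel
use BlockDist, CyclicDist;

config const N = 8;   // hypothetical problem size for this sketch

const Space   = {1..N};
const BlockD  = Space dmapped Block(boundingBox=Space);    // contiguous chunks
const CyclicD = Space dmapped Cyclic(startIdx=Space.low);  // indices dealt out round-robin

var B: [BlockD]  int;
var C: [CyclicD] int;

// A forall over a distributed array runs each iteration on the locale
// that owns the element, so here.id records the owner.
forall b in B do b = here.id;
forall c in C do c = here.id;

writeln("Block  owners: ", B);
writeln("Cyclic owners: ", C);
```

On four locales with N = 8, the block line should group the indices in pairs (0 0 1 1 2 2 3 3), while the cyclic line should repeat the locale ids in order (0 1 2 3 0 1 2 3). Keeping this mapping in mind may help when reasoning about which elements of X and Y end up on the same locale.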

Exercise 2: Measuring Communication Performance

The file pingPong.chpl contains a Chapel version of the 'ping-pong' benchmark, which measures communication bandwidth between locales.

Review the code; you will see that, unlike the MPI version of ping-pong, it contains no explicit messages. Where does communication take place, and why?
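
As a general illustration of implicit communication in Chapel, separate from pingPong.chpl, the sketch below moves execution to another locale and then touches an array that was declared on locale 0. The names A and n are assumptions made for this example, and the comments describe where the runtime would be expected to transfer data on a multi-locale run.

```chapel
config const n = 4;        // hypothetical array length for this sketch

var A: [1..n] real = 1.0;  // declared on locale 0, so stored there

// Move execution to the last locale. No send or receive appears in the
// source, yet A still lives on locale 0.
on Locales[numLocales-1] {
  var total = 0.0;
  for a in A do      // reading A from another locale fetches its elements remotely
    total += a;
  A[1] = total;      // writing A from another locale stores the value back remotely
  writeln("locale ", here.id, " computed total = ", total);
}

writeln("A[1] is now ", A[1], " on locale ", A[1].locale.id);
```

When run on a single locale, the `on` statement targets locale 0 itself and every access is local.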

Run the program on 1, 2 and 4 locales using the batch_pingPong script. How does the bandwidth measured for Chapel compare with the bandwidth you measured for MPI?