## Hands-On Session PG-1: Locales and Domain Maps in Chapel

Objective: To understand different approaches to data distribution in Chapel.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

In this exercise, we will create some Chapel programs which distribute data between locales running on different nodes of the Raijin system.

## Setup

Load the Chapel programming environment with the following command:
    bash-4.1$ . /short/c37/modules/chapel-1.16.0/util/setchplenv.bash

Note that Chapel requires a newer version of OpenMPI than we have been using (it needs support for the MPI_THREAD_MULTIPLE threading mode). You can confirm that your Chapel environment is correctly configured for Raijin as follows:

    bash-4.1$ chpl --version
    chpl Version 1.16.0
    bash-4.1$ env | grep CHPL
    CHPL_COMM=gasnet
    CHPL_COMM_SUBSTRATE=ibv
    CHPL_HOST_PLATFORM=linux64
    CHPL_TARGET_ARCH=sandybridge
    CHPL_HOME=/short/c37/modules/chapel-1.16.0
    CHPL_LAUNCHER=gasnetrun_ibv
    bash-4.1$ mpirun -version
    mpirun (Open MPI) 3.0.0

    Report bugs to http://www.open-mpi.org/community/help/


## Exercise 1: Using Domain Maps to Distribute Data

The file daxpy.chpl contains a simple Chapel program which allocates two arrays X and Y, and then performs the following update:

Y = α × X + Y

Both arrays share the same domain, but they have different distributions (i.e. the points in the domain are mapped differently to the set of locales).
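The program might look something like the following sketch (variable names and details are illustrative; consult the actual daxpy.chpl for the real code):

```chapel
// Illustrative sketch only -- the real daxpy.chpl may differ in detail.
use BlockDist;

config const N = 10000;
config const alpha = 2.0;

const Space = {1..N};

// X's indices are block-mapped across all locales...
const BlockSpace = Space dmapped Block(boundingBox=Space);
var X: [BlockSpace] real;

// ...while Y uses the default (non-distributed) domain map,
// so all of Y is stored on Locales[0].
var Y: [Space] real;

X = 1.0;
Y = 2.0;

// Each update of Y(i) on Locales[0] must fetch X(i) from whichever
// locale owns it, generating remote reads.
Y = alpha * X + Y;
```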

Compile and run the program on four locales:

    make
    ./daxpy -nl 4 --N 10000
You should see that X is divided into equal chunks for each locale using a block distribution, whereas Y is stored entirely on Locales[0].

Change the program so that both arrays use the same domain map. You should see a significant improvement in performance. Can you explain why?
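One way to give both arrays the same domain map is to declare them over a single block-distributed domain (a sketch, with illustrative names):

```chapel
use BlockDist;

config const N = 10000;
const Space = {1..N};
const BlockSpace = Space dmapped Block(boundingBox=Space);

// Declaring both arrays over the same distributed domain means X(i)
// and Y(i) are always stored on the same locale, so the element-wise
// update Y = alpha * X + Y requires no remote accesses.
var X, Y: [BlockSpace] real;
```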

Now change the program so that Y uses a cyclic distribution across all locales. How does this affect the performance? What about if both arrays use the same cyclic distribution?
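A cyclic distribution can be declared as in this sketch (again, names are illustrative):

```chapel
use CyclicDist;

config const N = 10000;

// Indices are dealt out round-robin across the locales:
// index i is stored on Locales[(i - 1) % numLocales].
const CyclicSpace = {1..N} dmapped Cyclic(startIdx=1);
var Y: [CyclicSpace] real;
```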

## Exercise 2: Measuring Communication Performance

The file pingPong.chpl contains a Chapel version of the 'ping-pong' benchmark, which measures communication bandwidth between locales.

Review the code; you will see that there are no explicit messages like there were in the MPI version of ping-pong. Where does communication take place, and why?
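As an illustration of where implicit communication can arise in Chapel (a sketch, not the actual pingPong.chpl code):

```chapel
config const nElems = 1024;

var A: [1..nElems] real;        // allocated on Locales[0]

on Locales[numLocales-1] {      // execute this block on the last locale
  var B: [1..nElems] real;      // allocated locally on that locale
  B = A;   // implicit GET: A's elements are fetched from Locales[0]
  A = B;   // implicit PUT: B's elements are written back to Locales[0]
}
```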

Run the program on 1, 2 and 4 locales using the batch_pingPong script. How does the bandwidth measured for Chapel compare with the bandwidth you measured for MPI?