Hands-On Session Distributed HPC Systems #3:
OpenMPI Implementation

Objective: To understand how Open MPI is implemented, including its internal structure and the sources of its overheads (the structure itself being one of them).
The code for this exercise can be found here. Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

Open MPI Overheads

For this session we will use the latest version of Open MPI, as this corresponds to the source code we will be reading (the default version on Raijin is usually 1.6.3, which is now quite old). Untar the tar file and type the command make. Inspect the program bucketSort.c. This is the completed parallel bucket sort with two enhancements for studying MPI overheads: the program performs and times two consecutive sorts, and it also measures whole-program execution time (after MPI_Init()); a sketch of this timing structure is given below. Briefly inspect both batch files and submit them.
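
As an illustration only (this is not the code of bucketSort.c, whose names and details may differ), the timing structure described above looks roughly like the following sketch:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical skeleton: time two consecutive sorts, plus whole-program
       time measured from just after MPI_Init(), as bucketSort.c is described
       as doing.  The sort itself is elided. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t_prog = MPI_Wtime();          /* whole-program timer */

        for (int run = 0; run < 2; run++) {   /* two consecutive sorts */
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            /* ... the parallel bucket sort itself would run here ... */
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0)
                printf("sort %d took %.6f s\n", run, MPI_Wtime() - t0);
        }

        if (rank == 0)
            printf("whole program (after MPI_Init) took %.6f s\n",
                   MPI_Wtime() - t_prog);
        MPI_Finalize();
        return 0;
    }

Comparing the two sort times with each other, and the whole-program time with the mpirun wall times recorded by the batch files, should indicate where the start-up overheads lie.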

  1. The batch file batch_bucketSort also records the wall time for each mpirun call (0.01 s resolution), plus the dates between them. The difference between consecutive dates should correspond to the mpirun wall times. When the job has completed, inspect the output. At the end of the session, time permitting, repeat this with the default Open MPI (1.6.3), to see whether Open MPI implementations have improved with respect to overheads. You will need to switch back to the default Open MPI, both on the command line and in the batch file, and do make clean; make to get 1.6.3 versions of the executables.

  2. We can restrict the transports Open MPI uses for messaging by restricting which BTL (Byte Transfer Layer) components get loaded at runtime. Execute the command to see which are available. What do you think vader is for? Various aspects of Open MPI's Modular Component Architecture can be controlled at run-time with the -mca option. The following command restricts the BTLs to those explicitly named. Try this and then try reducing the number of BTLs. Strangely enough, you may find that openib is the only BTL that can run alone (the front-ends aren't even supposed to have an InfiniBand adapter!). (A programmatic way of setting the same MCA parameter is sketched after this list.)

    Inspect the batch file batch_bucketSortBTL and the output of its batch job (hopefully there by now). You should see the effect of restricting Open MPI to just the self BTL (evidently needed on the Raijin nodes to establish initial connections, rather than for actual messaging), and of the other restrictions. The effect will differ within a node and between 4 nodes. Why might restricting the available BTLs make a difference?

    Note: we suspect that the tcp BTL has been configured to use IB between nodes, using TCP-over-IB. Within a node, tcp uses kernel-space shared memory to transfer data.
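
Besides the -mca command-line option, Open MPI also reads MCA parameters from OMPI_MCA_* environment variables (e.g. OMPI_MCA_btl). The sketch below is illustrative only: it sets the btl parameter from inside the program before MPI_Init(), and the BTL list "self,tcp" is just an assumed example, not a recommendation for Raijin.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustration: the same "btl" MCA parameter that -mca btl ... sets on the
       mpirun command line can also be supplied via the environment.  Whether
       setting it this late is honoured can depend on the launch environment,
       so treat this as a sketch of the parameter, not a supported interface. */
    int main(int argc, char **argv)
    {
        setenv("OMPI_MCA_btl", "self,tcp", 1);   /* assumed example BTL list */

        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("running with the btl MCA parameter restricted\n");
        MPI_Finalize();
        return 0;
    }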

Open MPI Code Structure

The directory raijin:/short/c37/ompi contains a clone of https://github.com/open-mpi/ompi.git. For this part of the exercise, cd to this directory to find out more about how Open MPI is structured and organized. Note, however, that you can also browse the code at the URL above.

  1. Locate the file where MPI_Allgather() is implemented. Hint: do grep -R MPI_Allgather * | less.

    Note the definitions of the symbols MPI_Allgather and PMPI_Allgather. What does this function do itself, and which function call performs the actual all-gather? (A sketch of how the PMPI_ symbols are typically used appears after this list.)

  2. Go to the sub-directory ompi/mca. You will notice sub-directories for the Point-to-point Messaging Layer (pml) and the BTL Management Layer (bml).

    Locate where the tuned versions of the collectives are. Open the file which decides which of the various all-gather algorithms is to be employed. What algorithms are available? Locate the function which makes the decision, and find the name of the function performing the ring algorithm. (Almost) finally, locate and open the file which implements this function, and locate the loop which does the ring send and receive. What is the name of the function performing the actual communication? Locate where this function is implemented; inspect its code and that of the function of the same name with the suffix actual appended to it (just below). Finally, we should have got down to where Open MPI portably performs an internal point-to-point message. How does it make those calls; that is, which MCA component is used? (A textbook sketch of the ring all-gather appears after this list.)

  3. From the sub-directory ompi/mca, locate the directory where shared-memory (sm) collectives are performed. Locate and open the file corresponding to all-gather. You will find that no shared-memory-specific collective has been implemented. However, open the file corresponding to broadcast. You will see that the message is broken into 'fragments' (chunks of optimal size), which the root copies into a shared-memory area and the others copy out (if they have children (?), they copy between segments in the shared-memory area). Locate the header file where these copy macros are defined; you will notice that, for the copy-between macro, it is clear what mechanism transfers the data. (A sketch of the fragmentation pattern appears after this list.)

  4. From the sub-directory opal/mca, you will notice sub-directories corresponding to the various BTL (Byte Transfer Layer) components, and the RCache and MPool sub-components. Go to the BTL sub-directory: you will notice directories corresponding to the OpenIB, TCP, and Shared Memory BTLs (among others). Go to the OpenIB directory. You will see files corresponding to setting up connections between endpoints, indicating that this is a non-trivial part of the BTL's implementation (and overhead!). Open the file implementing the RDMA Put operation. You will see code for checking endpoints, for queueing fragments (the size of these is generally set to InfiniBand's (optimal) Maximum Transfer Unit (MTU)), and for calling a yet lower-level internal function and checking the result. You will also see evidence of extra overheads if MPI is in threaded mode (locks, etc.). The internal function's implementation is just below. Here, at last, you will see code for manipulating IB message queues; locate the call that initiates the actual transfer of the fragment. (A sketch of what such a call looks like at the verbs level appears after this list.)
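
Relating to item 1: the PMPI_ symbols exist so that tools can intercept MPI calls through the MPI profiling interface. The following is a hedged sketch of a user-level profiling wrapper (it is not Open MPI's own source code); it works because MPI_Allgather is effectively an alias for PMPI_Allgather, so a tool can redefine the former and still reach the real implementation through the latter:

    #include <mpi.h>
    #include <stdio.h>

    /* User-supplied wrapper: intercept MPI_Allgather, time it, and forward to
       PMPI_Allgather, the name-shifted entry point.  (MPI-3 signature.) */
    int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcount, MPI_Datatype recvtype,
                      MPI_Comm comm)
    {
        double t0 = PMPI_Wtime();
        int rc = PMPI_Allgather(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype, comm);
        double t1 = PMPI_Wtime();

        int rank;
        PMPI_Comm_rank(comm, &rank);
        if (rank == 0)
            printf("MPI_Allgather took %.6f s\n", t1 - t0);
        return rc;
    }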
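
Relating to item 2: the ring algorithm you will find among the tuned all-gathers is, in essence, the textbook pattern below. This is a sketch only, assuming a contiguous datatype and using MPI_Sendrecv; the real code goes through Open MPI's internal point-to-point machinery instead.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Textbook ring all-gather: over size-1 steps, each rank forwards the block
       it most recently received to its right neighbour while receiving the next
       block from its left neighbour. */
    static void ring_allgather(const void *sendbuf, int count, MPI_Datatype dtype,
                               void *recvbuf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Aint lb, extent;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        MPI_Type_get_extent(dtype, &lb, &extent);

        char *rbuf = (char *)recvbuf;
        memcpy(rbuf + (size_t)rank * count * extent, sendbuf,
               (size_t)count * extent);                  /* own block into place */

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        for (int step = 0; step < size - 1; step++) {
            int sblk = (rank - step + size) % size;      /* block to forward  */
            int rblk = (rank - step - 1 + size) % size;  /* block to receive  */
            MPI_Sendrecv(rbuf + (size_t)sblk * count * extent, count, dtype, right, 0,
                         rbuf + (size_t)rblk * count * extent, count, dtype, left, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mine = 100 * rank;
        int *all = malloc(size * sizeof(int));
        ring_allgather(&mine, 1, MPI_INT, all, MPI_COMM_WORLD);
        if (rank == 0)
            for (int i = 0; i < size; i++)
                printf("block %d = %d\n", i, all[i]);
        free(all);
        MPI_Finalize();
        return 0;
    }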
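
Relating to item 3: the key idea in the shared-memory broadcast is pipelining a large message through a bounded shared segment in fragments. The sketch below is illustrative only: it folds the root's copy-in and another rank's copy-out into one loop in a single process, with an ordinary buffer standing in for the shared segment and no synchronisation, just to show the fragmentation and memcpy pattern.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define FRAG_SIZE 8192   /* assumed fragment size; the real value is tunable */

    /* Push a message through a small, reusable staging buffer one fragment at a
       time, the way the sm broadcast pushes fragments through its segment. */
    static void fragmented_copy(char *dst, const char *src, size_t len, char *staging)
    {
        for (size_t off = 0; off < len; off += FRAG_SIZE) {
            size_t n = (len - off < FRAG_SIZE) ? (len - off) : FRAG_SIZE;
            memcpy(staging, src + off, n);   /* root's copy-in to the segment   */
            memcpy(dst + off, staging, n);   /* another rank's copy-out of it   */
        }
    }

    int main(void)
    {
        size_t len = 100000;
        char *src = malloc(len), *dst = malloc(len), *staging = malloc(FRAG_SIZE);
        memset(src, 'x', len);
        fragmented_copy(dst, src, len, staging);
        printf("copies match: %s\n", memcmp(src, dst, len) == 0 ? "yes" : "no");
        free(src); free(dst); free(staging);
        return 0;
    }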
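
Relating to item 4: at the bottom of the openib BTL, "initiating the transfer" means posting a work request to an InfiniBand queue pair. The following is a generic libibverbs sketch of an RDMA put, not the openib BTL's own code; the queue pair, memory registration, remote address and rkey are assumed to have been produced by the connection set-up you saw earlier.

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Post one fragment as an RDMA write (a "put") on an already-connected
       queue pair.  All handles come from the connection set-up phase, which is
       itself a significant part of the BTL's overhead. */
    static int post_rdma_put(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *local_buf, uint32_t len,
                             uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge;
        memset(&sge, 0, sizeof sge);
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = mr->lkey;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof wr);
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* RDMA put */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* This is the call that hands the fragment to the HCA; the transfer then
           proceeds without further CPU involvement until a completion is polled. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }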