module unload openmpi; module load openmpi/3.0.0
make
Inspect the program bucketSort.c. This is the completed parallel bucket sort, with two enhancements for studying MPI overheads: the program performs and times two consecutive sorts, and it also measures whole-program execution time (from just after MPI_Init()). Briefly inspect both batch files and submit them.
module unload openmpi; module load openmpi
make clean; make
to get executables built against the default OpenMPI (version 1.6.3).
ompi_info | grep btl
Inspect the batch file batch_bucketSortBTL and the output of its batch job (hopefully there by now). You should see the effect of restricting OpenMPI to the self BTL (evidently needed on the Raijin nodes to establish initial connections, though not used for actual messaging) together with each of the others in turn. The effect will be different within a single node and across 4 nodes. Why might restricting the available BTLs make a difference?
Note: we suspect that the tcp BTL has been configured to use IB between nodes, using TCP-over-IB. Within a node, tcp uses kernel-space shared memory to transfer data.
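As a sketch of what such a batch file might contain (the program arguments and filenames here are assumptions; the --mca btl flag itself is standard OpenMPI):

```
mpirun --mca btl self,sm     ./bucketSort ...   # shared memory: within a node only
mpirun --mca btl self,tcp    ./bucketSort ...   # tcp (possibly TCP-over-IB between nodes)
mpirun --mca btl self,openib ./bucketSort ...   # native InfiniBand verbs
```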
The directory raijin:/short/c37/ompi contains a clone of https://github.com/open-mpi/ompi.git. For this part of the exercise, cd to this directory to find out more about how OpenMPI is structured and organized. Note, however, that you can also browse the code at the above URL.
grep -R MPI_Allgather * | less
Note the definitions of the symbols MPI_Allgather and PMPI_Allgather. What does this function do, and which function call performs the actual all-gather?
Locate where the tuned versions of the collectives are. Open the file which decides which of the various all-gather algorithms is to be employed. What algorithms are available? Locate the function which makes the decision, and find the name of the function performing the ring algorithm.

(Almost) finally, locate and open the file which implements this function, and find the loop which does the ring send and receive. What is the name of the function performing the actual communication? Locate where this function is implemented; inspect its code and that of the function of the same name with the suffix actual appended (just below). At this point we have got down to where OpenMPI portably performs an internal point-to-point message. How does it make those calls, i.e. which MCA component is used?