Skip Navigation | ANU Home | Search ANU | Search FEIT | Feedback
The Australian National University
Faculty of Engineering and Information Technology (FEIT)
Department of Computer Science
Printer Friendly Version of this Document
High Performance Scientific Computing COMP3320

COMP3320/6464 2008: Laboratory 5
Brief Introduction to MPI

This lab will give you a brief introduction to message passing with MPI. We will use fremont to complete this lab, even though this actually a shared memory system we can still use MPI.

You need first to modify your search PATH so that you find the relevant MPI commands. To do this edit add the following line to your .cshrc

    setenv PATH ${PATH}:/opt/SUNWhpc/HPC7.1/bin
and also issue type the above at the command line (or log out and log in again). If you have done this right you should see the following if you type "which mpirun":
   fremont> which mpirun
   /opt/SUNWhpc/HPC7.1/bin/mpirun
Ask me if you have problems.

Now copy the Sun MPI examples into your directory, look at the README file and compile the code

   fremont> cp -r /opt/SUNWhpc/HPC7.1/examples/ .
   fremont> more README
   fremont> make
This will build two programs written using a variety of different languages. We will look at the C codes.
  • hello_c: this simply creates a number of MPI processes and has each printout its ID.
  • ring_c: this creates a ring of processes, passes an integer token that has an initial value of 10 around a ring of processes, decrementing the value of the token each time it completes one circuit of the ring, and terminating when the token reaches a value of 0.
Look at the code to see that you understand what is happening

To run these tests we use mpirun. This is responsible for spawning the MPI processes. For us using a single machine it is fairly easy. We are really only worried about how many MPI processes to create. This is controlled by the -n flag. Full details of all the mpirun options are obtained by typing mpirun -h. Test the hello_c and ring_c executable by typing

mpirun -n 4 hello_c
mpirun -n 8 ring_c
Some other web pages that you might find useful include When you are happy with basic use of MPI on fremont go on to the following.

MPI PingPong Code


The aim of the program is to measure the time it takes to send a message of varying length between two MPI processes. The code is shown below.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define LEN 512  
int main( argc, argv )
int argc;
char **argv;
{
  int        rank, size, i, len;
  MPI_Status status;
  int  *sbuf,*rbuf;
  double t1,t2,time;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  MPI_Comm_size( MPI_COMM_WORLD, &size );

  printf("Number of processes = %3d My id = %3d\n",size,rank);
  len=1;
  while (len <= LEN ){
    sbuf = malloc(len*sizeof(int));
    rbuf = malloc(len*sizeof(int));
    MPI_Barrier(MPI_COMM_WORLD);
    for (i=0; i < 100; i++){

      t1 = MPI_Wtime();
      if (rank == 0){
        MPI_Send( sbuf, len, MPI_INTEGER, rank+1, 0, MPI_COMM_WORLD );
        MPI_Recv( rbuf, len, MPI_INTEGER, rank+1, 1, MPI_COMM_WORLD, &status );
      }
      else if (rank == 1){
        MPI_Send( sbuf, len, MPI_INTEGER, rank-1, 1, MPI_COMM_WORLD);
        MPI_Recv( rbuf, len, MPI_INTEGER, rank-1, 0, MPI_COMM_WORLD, &status );
      } 
      t2 = MPI_Wtime()-t1;
      if (i == 0) time = t2;
      time = (t2 < time) ? t2 : time;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    free(sbuf);
    free(rbuf);
    if (!rank)printf("Bytes %10d minimum time (sec) %12.9f\n",
                      len*sizeof(int),time);
    len*=2;
  }

  MPI_Finalize( );
  return 0;
}
The while loop is used to define the length of the message we are going to send - we start at length 1 and double it each time to a maximum of 1MW. We then have a for loop that repeats our timing experiment 100 times. We will record the minimum time - like we did in previous timing experiments. We do this using the MPI timing routine MPI_Wtime. The message is sent from process with rank 0 to process with rank 1 and then back again. Thus we have two pairs of send and receive operations. If we run the code with more than 2 processes then everyone else will do nothing. The job ends with an MPI_Finalize which synchronizes the processes and flushes the buffers.

  • Compile and run mpipingpong with 2 processes. Now change the size of LEN to 1024, rebuild and re-run the code. You should find that the code never finishes. What is the problem. Fix it.
  • Using the above fixed code, determine the latency and bandwidth for communications between two MPI processes.

Computing Pi


In lab4 you encountered the following code to compute Pi:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]){
   int np;
   int nele,i;
   double xl,fxl,xh,fxh,xw,area;

  if (argc != 3) {
    printf(" %s Number_of_processes Number_of_segments\n",argv[0]);
    return -1;
  }
    else {
      np = atoi(argv[1]);
      nele = atoi(argv[2]);
      if (np < 1 || nele < 1){
        printf("Error: Number_of_processes (%i) < 1 \n",np);
        printf("Error: Number_of_segments  (%i) < 1 \n",nele);
        return -1;
      }
  }

  xl=0.0;
  xw=1.0/nele;
  area=0.0;
  for (i=0; i < nele; i++){
     xh=xl+xw;
     fxl=1.0/(1.0+xl*xl);
     fxh=1.0/(1.0+xh*xh);
     area+=0.5*(fxl+fxh)*xw;
     xl=xh;
   }

   printf("Elements %i area %14.12f \n",nele,area*4.0);

   return 0;
}
  • Back then you parallelised the code using OpenMP, now parallelise it using MPI. Make sure the results agree! (You might find the MPI calls MPI_Bcast and MPI_Reduce to be useful.)
  • Compare the performance of your OpenMP code with your MPI code for a variety of different numbers of processes/threads and input parameters. Do you expect to see large differences? If so why, if not why not?

Global Operations


The following piece of code uses a binary tree to perform a basic broadcast.
  • Change the code to perform a global sum of the vector mbuf over all processes, returning the result to all processes. You should also use a binary tree algorithm. This operation should equate to what is performed with MPI_Allreduce. But in your case only worry about integers. Makesure your code works!
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void bcast(int *mbuf, int len, int iam, int size);

int main( argc, argv )
int argc;
char **argv;
{
  int        rank, size, i, len;
  MPI_Status status;
  int     *mbuf;
  double t1,t2,time;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  MPI_Comm_size( MPI_COMM_WORLD, &size );

  len=1;
  mbuf = malloc(len*sizeof(int));
  if (mbuf == NULL){
    printf("Malloc failed - aborting \n");
    abort();
  }
  if (rank == 0) *mbuf=99;

  bcast(mbuf, len, rank,size);

  printf(" %d final mbuf %d \n",rank, *mbuf);

  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize( );
  return 0;
}

void bcast(int *mbuf, int len, int iam, int size)
{
  int nlevel,i,N;
  MPI_Status status;

  nlevel=0;
  i=1;
  while (i < size){
    i*=2;
    nlevel++;
  }

  N=2;
  for (i=1; i <= nlevel; i++){
    if (iam < N){
      if (iam < N/2 && iam+N/2 < size ){
        MPI_Send(mbuf, len, MPI_INTEGER, iam+N/2, 99, MPI_COMM_WORLD );
     } else if (iam >= N/2 && iam < size ){
        MPI_Recv(mbuf, len, MPI_INTEGER, iam-N/2, 99, MPI_COMM_WORLD, &status );      }
    }
    N*=2;
  }

}

Not compulsory: Heat Distribution


This problem has some similiarities to the cloth simulation in assignment 2, in that we have a grid, and the update of data at grid points depends on surounding grid points.

We have a metal plate of size Rx by Ry for which the temperature at the edge is held at some constant value (Tedge). The aim is to determine the temperature in the middle of the plate. This is can be done iteratively by dividing the domain of the metal plate into a grid of points. The new temperature at each grid point is calculated as the average of the four current temperatures at the adjacent grid points. The iterations continue until the temperatures of all grid points does not change between iterations to within some threshold. A diagramatic illustration of the problem is shown below:

Code implementing this is given below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

#define MAX_PRINT 20

int main( argc, argv )
int  argc;
char **argv;
  {
   int Nx, Ny, Max_iter;
   double Tedge, converge;
   MPI_Status status;
   int rank, size;
   int i, j, iter, ierr;
   double *tnew, *told, *ttmp, tdiff, mxdiff;
   double t1,tseq;
   int idata[3];
   double data[2];

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &rank );
   MPI_Comm_size( MPI_COMM_WORLD, &size );
   MPI_Barrier(MPI_COMM_WORLD);

 /*  INPUT PROCESSING */
   if (!rank){
      printf("\n COMP3320/6464 Hot Plate Program\n");
      printf(" Number of processes %d\n",size);
      printf(" Input number of X and Y data points \n");
      scanf("%d%d",&Nx,&Ny);
      printf(" Input temperatures along edges \n");
      scanf("%lf",&Tedge);
      printf(" Input maximum iterations \n");
      scanf("%d",&Max_iter);
      printf(" Input convergence \n");
      scanf("%lf",&converge);
      for (i=0;i<72;i++)printf("-");
      printf("\n");
      printf(" Nx       %8d Ny     %8d \n",Nx, Ny);
      printf(" Tedge    %8.2lf\n", Tedge);
      printf(" Max_iter %8d\n", Max_iter);
      printf(" Converge %8.2lf\n",converge);
      for (i=0;i<72;i++)printf("-");
      printf("\n");
      idata[0]=Nx;idata[1]=Ny;idata[2]=Max_iter;
      MPI_Bcast(idata, 3, MPI_INT, 0, MPI_COMM_WORLD);
      data[0]=Tedge;data[1]=converge;
      MPI_Bcast(data, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   } else {
      MPI_Bcast(idata, 3, MPI_INT, 0, MPI_COMM_WORLD);
      Nx=idata[0];Ny=idata[1];Max_iter=idata[2];
      MPI_Bcast(data, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      Tedge=data[0];converge=data[1];
   }
   MPI_Barrier(MPI_COMM_WORLD);

 /***********************************************************************/
 /*                    MAIN  COMPUTE PART                               */
 /***********************************************************************/
   // Allocate data array

   told   = (double*) malloc( Nx*Ny*sizeof(double) );
   tnew   = (double*) malloc( Nx*Ny*sizeof(double) );

   /* Set Boundary Conditions */
   for (i = 0; i < Nx*Ny; i++){told[i]=0.0;tnew[i]=0.0;}
   j=0;    for (i = 0; i < Nx; i++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
   j=Ny-1; for (i = 0; i < Nx; i++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
   i=0;    for (j = 0; j < Ny; j++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
   i=Nx-1; for (j = 0; j < Ny; j++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
   iter = 0;

   if (!rank)printf("\n Begin Iterations\n");
   MPI_Barrier(MPI_COMM_WORLD);
   t1 = MPI_Wtime();
   do {
     iter++;
     /* update the grid points*/
     for (j = 1; j < Ny-1; j++){
       for (i = 1; i < Nx-1; i++)
         tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
                            told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
     }

     /*find maximum change in any grid point*/
     mxdiff = fabs( (double) (told[0] - tnew[0]));
     for (i = 0; i < Nx*Ny; i++){
       tdiff = fabs( (double) (told[i] - tnew[i]));
       mxdiff = (mxdiff < tdiff) ? tdiff : mxdiff;
     }
     if (!rank)printf(" iteration %d convergence %lf\n",iter,mxdiff);
     /*swap pointers and iterate again*/
     ttmp=told;
     told=tnew;
     tnew=ttmp;
   }while ( mxdiff > converge && iter < Max_iter || iter == 1);
   MPI_Barrier(MPI_COMM_WORLD);
   tseq = MPI_Wtime() - t1;

   if (!rank){
     printf("\nTotal time sequential %8.4lf\n",tseq);
   }

 /***********************************************************************/
 /*                    END MAIN COMPUTE PART                            */
 /***********************************************************************/

  /* Print out grid points if not too many */
   MPI_Barrier(MPI_COMM_WORLD);
   if ( Nx < MAX_PRINT && Ny < MAX_PRINT) {
     printf(" final temperatures on process\n");
     for (j = 0; j < Ny; j++){
       printf("\n Row %d \n",j);
       for (i = 0; i < Nx; i++){
         printf("%12.5f", tnew[j*Nx+i]);
         if (i%6 == 5)printf("\n");
       }
     }
   }

   printf("\n");
   MPI_Finalize( );
   return 0;
}
The code requests as input
Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Tedge - the fixed temperature at the boundary
Max_iter - Maximum number of iterations permitted
Converge - the convergence criteria
Except at the edges, the grid is initially set to zero. For problem sizes with Nx and Ny less than 20 the resulting temperatures are printed. This is really just for interest.

Although the code has been adapted to run using MPI, the problem is only solved by process zero. (This is typically the first step in converting a sequential code to an MPI parallel code).

  • With Nx = 20 and Ny = 40 and Tedge=100.0 how many iterations does it take before grid point (7,5) is non-zero (assume grid point numbering begins at 0,0)? Generally for Nx=Ny=(2n+1) how many iterations does it take for point (n,n) to be none zero? Explain the rational behind your answer.
  • Now divide the work involved in the iterative loop between the MPI processes.