COMP3320/6464 2008: Laboratory 5
Brief Introduction to MPI
This lab will give you a brief introduction to message passing with MPI.
We will use
fremont to complete this lab, even though this
actually a shared memory system we can still use MPI.
You need first to modify your search PATH so that you find the relevant
MPI commands. To do this edit add the following line to your .cshrc
setenv PATH ${PATH}:/opt/SUNWhpc/HPC7.1/bin
and also issue type the above at the command line (or log out and log in
again). If you have done this right you should see the following if
you type "which mpirun":
fremont> which mpirun
/opt/SUNWhpc/HPC7.1/bin/mpirun
Ask me if you have problems.
Now copy the Sun MPI examples into your directory, look at the README file
and compile the code
fremont> cp -r /opt/SUNWhpc/HPC7.1/examples/ .
fremont> more README
fremont> make
This will build two programs written using a variety of different languages.
We will look at the C codes.
- hello_c: this simply creates a number of MPI processes and has each printout its ID.
- ring_c: this creates a ring of processes, passes an integer token
that has an initial value of 10 around a ring of processes, decrementing the
value of the token each time it completes one circuit of the ring, and
terminating when the token reaches a value of 0.
Look at the code to see that you understand what is happening
To run these tests we use mpirun. This is responsible for
spawning the MPI processes. For us using a single machine it is fairly
easy. We are really only worried about how many MPI processes to
create. This is controlled by the -n flag. Full details of
all the mpirun options are obtained by typing mpirun
-h. Test the hello_c and ring_c executable by
typing
mpirun -n 4 hello_c
mpirun -n 8 ring_c
Some other web pages that you might find useful include
When you are happy with basic use of MPI on fremont go on to the following.
MPI PingPong Code
The aim of the program is to measure the time it takes to send a
message of varying length between two MPI processes. The code is shown
below.
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define LEN 512
int main( argc, argv )
int argc;
char **argv;
{
int rank, size, i, len;
MPI_Status status;
int *sbuf,*rbuf;
double t1,t2,time;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf("Number of processes = %3d My id = %3d\n",size,rank);
len=1;
while (len <= LEN ){
sbuf = malloc(len*sizeof(int));
rbuf = malloc(len*sizeof(int));
MPI_Barrier(MPI_COMM_WORLD);
for (i=0; i < 100; i++){
t1 = MPI_Wtime();
if (rank == 0){
MPI_Send( sbuf, len, MPI_INTEGER, rank+1, 0, MPI_COMM_WORLD );
MPI_Recv( rbuf, len, MPI_INTEGER, rank+1, 1, MPI_COMM_WORLD, &status );
}
else if (rank == 1){
MPI_Send( sbuf, len, MPI_INTEGER, rank-1, 1, MPI_COMM_WORLD);
MPI_Recv( rbuf, len, MPI_INTEGER, rank-1, 0, MPI_COMM_WORLD, &status );
}
t2 = MPI_Wtime()-t1;
if (i == 0) time = t2;
time = (t2 < time) ? t2 : time;
}
MPI_Barrier(MPI_COMM_WORLD);
free(sbuf);
free(rbuf);
if (!rank)printf("Bytes %10d minimum time (sec) %12.9f\n",
len*sizeof(int),time);
len*=2;
}
MPI_Finalize( );
return 0;
}
The
while loop is used to define the length of the message we
are going to send - we start at length 1 and double it each time to a
maximum of 1MW. We then have a
for loop that repeats our
timing experiment 100 times. We will record the minimum time - like we
did in previous timing experiments. We do this using the MPI timing routine
MPI_Wtime. The message is sent from process with rank 0 to
process with rank 1 and then back again. Thus we have two pairs of
send and receive operations. If we run the code with more than 2
processes then everyone else will do nothing. The job ends with an
MPI_Finalize which synchronizes the processes and flushes the
buffers.
- Compile and run mpipingpong with 2 processes. Now change
the size of LEN to 1024, rebuild and re-run the code. You should find
that the code never finishes. What is the problem. Fix it.
- Using the above fixed code, determine the latency and bandwidth
for communications between two MPI processes.
Computing Pi
In lab4 you encountered the following code to compute Pi:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
int np;
int nele,i;
double xl,fxl,xh,fxh,xw,area;
if (argc != 3) {
printf(" %s Number_of_processes Number_of_segments\n",argv[0]);
return -1;
}
else {
np = atoi(argv[1]);
nele = atoi(argv[2]);
if (np < 1 || nele < 1){
printf("Error: Number_of_processes (%i) < 1 \n",np);
printf("Error: Number_of_segments (%i) < 1 \n",nele);
return -1;
}
}
xl=0.0;
xw=1.0/nele;
area=0.0;
for (i=0; i < nele; i++){
xh=xl+xw;
fxl=1.0/(1.0+xl*xl);
fxh=1.0/(1.0+xh*xh);
area+=0.5*(fxl+fxh)*xw;
xl=xh;
}
printf("Elements %i area %14.12f \n",nele,area*4.0);
return 0;
}
- Back then you parallelised the code using OpenMP, now parallelise it
using MPI. Make sure the results agree! (You might find the MPI calls
MPI_Bcast and MPI_Reduce to be useful.)
- Compare the performance of your OpenMP code with your MPI code for
a variety of different numbers of processes/threads and input
parameters. Do you expect to see large differences? If so why, if not
why not?
Global Operations
The following piece of code uses a binary tree to perform a basic broadcast.
- Change the code to perform a global sum of the vector mbuf over
all processes, returning the result to all processes. You should also
use a binary tree algorithm. This operation should equate to what is
performed with MPI_Allreduce. But in your case only worry about
integers. Makesure your code works!
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
void bcast(int *mbuf, int len, int iam, int size);
int main( argc, argv )
int argc;
char **argv;
{
int rank, size, i, len;
MPI_Status status;
int *mbuf;
double t1,t2,time;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
len=1;
mbuf = malloc(len*sizeof(int));
if (mbuf == NULL){
printf("Malloc failed - aborting \n");
abort();
}
if (rank == 0) *mbuf=99;
bcast(mbuf, len, rank,size);
printf(" %d final mbuf %d \n",rank, *mbuf);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize( );
return 0;
}
void bcast(int *mbuf, int len, int iam, int size)
{
int nlevel,i,N;
MPI_Status status;
nlevel=0;
i=1;
while (i < size){
i*=2;
nlevel++;
}
N=2;
for (i=1; i <= nlevel; i++){
if (iam < N){
if (iam < N/2 && iam+N/2 < size ){
MPI_Send(mbuf, len, MPI_INTEGER, iam+N/2, 99, MPI_COMM_WORLD );
} else if (iam >= N/2 && iam < size ){
MPI_Recv(mbuf, len, MPI_INTEGER, iam-N/2, 99, MPI_COMM_WORLD, &status ); }
}
N*=2;
}
}
Not compulsory: Heat Distribution
This problem has some similiarities to the cloth simulation
in assignment 2, in that we have a grid, and the update of
data at grid points depends on surounding grid points.
We have a metal plate of size Rx by
Ry for which the temperature at the edge is held at some
constant value (Tedge). The aim is to determine the
temperature in the middle of the plate. This is can be done iteratively by
dividing the domain of the metal plate into a grid of points. The
new temperature at each grid point is calculated as the average of the
four current temperatures at the adjacent grid points. The iterations
continue until the temperatures of all grid points does not change
between iterations to within some threshold. A diagramatic
illustration of the problem is shown below:
Code implementing this is given below.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"
#define MAX_PRINT 20
int main( argc, argv )
int argc;
char **argv;
{
int Nx, Ny, Max_iter;
double Tedge, converge;
MPI_Status status;
int rank, size;
int i, j, iter, ierr;
double *tnew, *told, *ttmp, tdiff, mxdiff;
double t1,tseq;
int idata[3];
double data[2];
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
MPI_Barrier(MPI_COMM_WORLD);
/* INPUT PROCESSING */
if (!rank){
printf("\n COMP3320/6464 Hot Plate Program\n");
printf(" Number of processes %d\n",size);
printf(" Input number of X and Y data points \n");
scanf("%d%d",&Nx,&Ny);
printf(" Input temperatures along edges \n");
scanf("%lf",&Tedge);
printf(" Input maximum iterations \n");
scanf("%d",&Max_iter);
printf(" Input convergence \n");
scanf("%lf",&converge);
for (i=0;i<72;i++)printf("-");
printf("\n");
printf(" Nx %8d Ny %8d \n",Nx, Ny);
printf(" Tedge %8.2lf\n", Tedge);
printf(" Max_iter %8d\n", Max_iter);
printf(" Converge %8.2lf\n",converge);
for (i=0;i<72;i++)printf("-");
printf("\n");
idata[0]=Nx;idata[1]=Ny;idata[2]=Max_iter;
MPI_Bcast(idata, 3, MPI_INT, 0, MPI_COMM_WORLD);
data[0]=Tedge;data[1]=converge;
MPI_Bcast(data, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
} else {
MPI_Bcast(idata, 3, MPI_INT, 0, MPI_COMM_WORLD);
Nx=idata[0];Ny=idata[1];Max_iter=idata[2];
MPI_Bcast(data, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
Tedge=data[0];converge=data[1];
}
MPI_Barrier(MPI_COMM_WORLD);
/***********************************************************************/
/* MAIN COMPUTE PART */
/***********************************************************************/
// Allocate data array
told = (double*) malloc( Nx*Ny*sizeof(double) );
tnew = (double*) malloc( Nx*Ny*sizeof(double) );
/* Set Boundary Conditions */
for (i = 0; i < Nx*Ny; i++){told[i]=0.0;tnew[i]=0.0;}
j=0; for (i = 0; i < Nx; i++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
j=Ny-1; for (i = 0; i < Nx; i++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
i=0; for (j = 0; j < Ny; j++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
i=Nx-1; for (j = 0; j < Ny; j++){told[j*Nx+i]=Tedge;tnew[j*Nx+i]=Tedge;}
iter = 0;
if (!rank)printf("\n Begin Iterations\n");
MPI_Barrier(MPI_COMM_WORLD);
t1 = MPI_Wtime();
do {
iter++;
/* update the grid points*/
for (j = 1; j < Ny-1; j++){
for (i = 1; i < Nx-1; i++)
tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
}
/*find maximum change in any grid point*/
mxdiff = fabs( (double) (told[0] - tnew[0]));
for (i = 0; i < Nx*Ny; i++){
tdiff = fabs( (double) (told[i] - tnew[i]));
mxdiff = (mxdiff < tdiff) ? tdiff : mxdiff;
}
if (!rank)printf(" iteration %d convergence %lf\n",iter,mxdiff);
/*swap pointers and iterate again*/
ttmp=told;
told=tnew;
tnew=ttmp;
}while ( mxdiff > converge && iter < Max_iter || iter == 1);
MPI_Barrier(MPI_COMM_WORLD);
tseq = MPI_Wtime() - t1;
if (!rank){
printf("\nTotal time sequential %8.4lf\n",tseq);
}
/***********************************************************************/
/* END MAIN COMPUTE PART */
/***********************************************************************/
/* Print out grid points if not too many */
MPI_Barrier(MPI_COMM_WORLD);
if ( Nx < MAX_PRINT && Ny < MAX_PRINT) {
printf(" final temperatures on process\n");
for (j = 0; j < Ny; j++){
printf("\n Row %d \n",j);
for (i = 0; i < Nx; i++){
printf("%12.5f", tnew[j*Nx+i]);
if (i%6 == 5)printf("\n");
}
}
}
printf("\n");
MPI_Finalize( );
return 0;
}
The code requests as input
Nx - the number of data points along the x axis
Ny - the number of data points along the y axis
Tedge - the fixed temperature at the boundary
Max_iter - Maximum number of iterations permitted
Converge - the convergence criteria
Except at the edges, the grid is initially set to zero. For problem
sizes with Nx and Ny less than 20 the resulting temperatures
are printed. This is really just for interest.
Although the code has been adapted to run using MPI, the problem is only
solved by process zero. (This is typically the first step in converting
a sequential code to an MPI parallel code).
- With Nx = 20 and Ny = 40 and Tedge=100.0 how many iterations does
it take before grid point (7,5) is non-zero (assume grid point numbering
begins at 0,0)? Generally for
Nx=Ny=(2n+1) how many iterations does it take for point (n,n) to
be none zero? Explain the rational behind your answer.
- Now divide the work involved in the iterative loop between the
MPI processes.