With Nx = 15 and Ny = 19, it takes 5 iterations for grid point (7,5) to become non-zero. In general it takes n iterations for point (n,n) to become non-zero. This is because on every iteration, a grid point takes on the average value of its immediate neighbours, so that non-zero temperature ‘travels’ inwards from the edges of the grid at one grid point per iteration.
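As a quick check of this, the following standalone sketch (with an assumed edge temperature of 100.0 and a zero-initialised interior, which the text above does not specify) runs the plain Jacobi average on the same 15 × 19 grid and reports the first iteration at which point (7,5) becomes non-zero; it prints 5.

```c
#include <stdio.h>
#include <string.h>

#define NX 15
#define NY 19
#define TEDGE 100.0   /* assumed edge temperature */

int main(void)
{
    double told[NY][NX] = {{0.0}}, tnew[NY][NX];
    int i, j, iter;

    /* fixed edge temperature, zero interior */
    for (i = 0; i < NX; i++) { told[0][i] = TEDGE; told[NY-1][i] = TEDGE; }
    for (j = 0; j < NY; j++) { told[j][0] = TEDGE; told[j][NX-1] = TEDGE; }
    memcpy(tnew, told, sizeof told);

    for (iter = 1; iter <= 10; iter++) {
        /* one Jacobi sweep: each interior point becomes the average of its neighbours */
        for (j = 1; j < NY-1; j++)
            for (i = 1; i < NX-1; i++)
                tnew[j][i] = 0.25*(told[j][i+1] + told[j][i-1] +
                                   told[j+1][i] + told[j-1][i]);
        memcpy(told, tnew, sizeof told);

        if (told[5][7] != 0.0) {                 /* grid point (7,5) */
            printf("(7,5) first non-zero at iteration %d\n", iter);
            break;
        }
    }
    return 0;
}
```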
| Time (s) | 1 processor | 2 processors | 4 processors | 8 processors |
|---|---|---|---|---|
| Sequential | 4.86 | n/a | n/a | n/a |
| Parallel | 4.86 | 5.59 | 6.37 | 5.36 |
| Speedup | 1.0 | 0.87 | 0.76 | 0.91 |
The parallel speedup (sequential time divided by parallel time) is less than 1, i.e. running on several processes is actually slower than running sequentially. This is not good!
The domain is decomposed along the y axis only. Every process is responsible for updating at most chk columns of the grid, where chk is the chunk size used in the code. However, the region of the grid held by each process is larger, at chk + 2 columns: the columns on either edge are ghost regions, copies of the data held by the neighbouring processes. They are updated each iteration with data received from those neighbours.
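To make the index arithmetic concrete, here is an illustrative helper (not taken from the exercise code; the formula for chk is an assumption, though the names chk, jst and jfin match the listing later in this section):

```c
#include <stdio.h>

/* Illustrative only: how the 1D decomposition along y works out for one rank. */
void show_decomposition(int rank, int size, int Ny)
{
    int chk  = (Ny - 2 + size - 1) / size;            /* assumed: ceil((Ny-2)/size) columns per rank */
    int jst  = rank*chk + 1;                          /* first column this rank updates      */
    int jfin = (jst + chk > Ny-1) ? Ny-1 : jst + chk; /* one past the last column it updates */

    printf("rank %d updates columns [%d,%d); its ghost columns are %d and %d\n",
           rank, jst, jfin, jst - 1, jfin);
}
```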
Because the sends and receives are symmetric for all processes, every process will first send to the next process and then receive from the previous process. This means that, with blocking communications, all processes must wait until the next process has received, before their receive can progress. Therefore communications proceed in a ‘domino’ fashion, completing first at process (size-1) and cascading down to process 0.
To avoid this serialization of communications, make the ordering asymmetric, so that odd and even processes send and receive in the opposite order.
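One possible sketch of such an ordering for the exchange with rank+1 (illustrative only; the function and buffer names are not from the exercise code): even ranks send first and odd ranks receive first, so every blocking send is matched immediately and no chain of dependent sends can form.

```c
#include <mpi.h>

/* Illustrative only: one ordering that breaks the send chain for the
 * "exchange with rank+1" phase.  sendbuf/recvbuf and the tag are
 * placeholders, not the exercise's actual arguments. */
void exchange_up(double *sendbuf, double *recvbuf, int count,
                 int rank, int size, MPI_Comm comm)
{
    MPI_Status status;

    if (rank % 2 == 0) {         /* even ranks: send first, then receive */
        if (rank+1 < size) MPI_Send(sendbuf, count, MPI_DOUBLE, rank+1, 2, comm);
        if (rank-1 >= 0)   MPI_Recv(recvbuf, count, MPI_DOUBLE, rank-1, 2, comm, &status);
    } else {                     /* odd ranks: receive first, then send */
        if (rank-1 >= 0)   MPI_Recv(recvbuf, count, MPI_DOUBLE, rank-1, 2, comm, &status);
        if (rank+1 < size) MPI_Send(sendbuf, count, MPI_DOUBLE, rank+1, 2, comm);
    }
}
```

The combined call MPI_Sendrecv achieves the same effect without having to order the calls by hand.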
| Time (s) | 1 processor | 2 processors | 4 processors | 8 processors |
|---|---|---|---|---|
| Sequential | 4.86 | n/a | n/a | n/a |
| Parallel | 4.86 | 5.63 | 6.37 | 5.35 |
| Speedup | 1.0 | 0.86 | 0.87 | 0.91 |
Change the calls to MPI_Send and MPI_Recv to MPI_Isend and MPI_Irecv, and wait for all started communications at the end, e.g.
```c
if (rank-1 >= 0) {
  /* post send of this rank's first interior column to rank-1 */
  jst = rank*chk+1;
  MPI_Isend(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
            1, MPI_COMM_WORLD, &send_prev);
}
if (rank+1 < size) {
  /* post receive into the ghost column that borders rank+1 */
  jst = (rank+1)*chk+1;
  MPI_Irecv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
            1, MPI_COMM_WORLD, &recv_next);
}
...
if (rank+1 < size) {
  MPI_Wait(&send_next, &status);
  MPI_Wait(&recv_next, &status);
}
if (rank-1 >= 0) {
  MPI_Wait(&send_prev, &status);
  MPI_Wait(&recv_prev, &status);
}
```
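For reference, here is a self-contained sketch of a complete non-blocking halo exchange, with the request variables declared and a single MPI_Waitall; the buffer arguments are placeholders rather than the exercise's actual array expressions.

```c
#include <mpi.h>

/* Illustrative sketch of a complete non-blocking halo exchange.
 * own_lo / own_hi are the first / last columns this rank updates;
 * halo_lo / halo_hi are the ghost columns bordering rank-1 / rank+1. */
void halo_exchange(double *own_lo, double *own_hi,
                   double *halo_lo, double *halo_hi,
                   int Nx, int rank, int size, MPI_Comm comm)
{
    MPI_Request req[4];
    int nreq = 0;

    if (rank - 1 >= 0) {
        MPI_Isend(own_lo,  Nx, MPI_DOUBLE, rank-1, 1, comm, &req[nreq++]);
        MPI_Irecv(halo_lo, Nx, MPI_DOUBLE, rank-1, 2, comm, &req[nreq++]);
    }
    if (rank + 1 < size) {
        MPI_Isend(own_hi,  Nx, MPI_DOUBLE, rank+1, 2, comm, &req[nreq++]);
        MPI_Irecv(halo_hi, Nx, MPI_DOUBLE, rank+1, 1, comm, &req[nreq++]);
    }

    /* ...work that does not touch the halo columns could overlap here... */

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}
```

Posting the receives before doing any work that does not touch the halos gives the MPI library a chance to overlap communication with computation.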
Performance is as follows:

| Time (s) | 1 processor | 2 processors | 4 processors | 8 processors |
|---|---|---|---|---|
| Sequential | 4.86 | n/a | n/a | n/a |
| Parallel | 4.86 | 5.64 | 6.37 | 5.45 |
| Speedup | 1.0 | 0.85 | 0.87 | 0.89 |
Part 2: Performance Analysis
MPI_Allreduce and MPI_Send take most of the time.
The communication topology clearly shows point-to-point communications between neighbouring processes, as expected.
mpiP gives more detail, including the call hierarchy and the time spent in the different calls to each MPI function.
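mpiP also honours MPI_Pcontrol, which can be used to restrict the measurement to the iteration loop; a minimal sketch of this usage (the surrounding program structure is only indicated in comments):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Pcontrol(0);          /* mpiP: stop profiling during set-up        */

    /* ... set up grid, decomposition, initial values ... */

    MPI_Pcontrol(1);          /* mpiP: profile only the main loop          */
    /* ... do { update; halo exchange; MPI_Allreduce; } while (...); ...   */
    MPI_Pcontrol(0);

    MPI_Finalize();
    return 0;
}
```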
```
- do {
R   iter++;
-   // update local values
O   jst = rank*chk+1;
O   jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
P   for (j = jst; j < jfin; j++) {
P     for (i = 1; i < Nx-1; i++) {
P       tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
-                          told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
-     }
-   }
-
-   // Send to rank+1
O   if (rank+1 < size) {
O     jst = rank*chk+chk;
O     MPI_Send(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
-              2, MPI_COMM_WORLD);
-   }
O   if (rank-1 >= 0) {
O     jst = (rank-1)*chk+chk;
O     MPI_Recv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
-              2, MPI_COMM_WORLD, &status);
-   }
-
-   // Send to rank-1
O   if (rank-1 >= 0) {
O     jst = rank*chk+1;
O     MPI_Send(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
-              1, MPI_COMM_WORLD);
-   }
O   if (rank+1 < size) {
O     jst = (rank+1)*chk+1;
O     MPI_Recv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
-              1, MPI_COMM_WORLD, &status);
-   }
-
-   // fix boundaries in tnew
R   j = 0;    for (i = 0; i < Nx; i++) tnew[j*Nx+i] = Tedge;
R   j = Ny-1; for (i = 0; i < Nx; i++) tnew[j*Nx+i] = Tedge;
R   i = 0;    for (j = 0; j < Ny; j++) tnew[j*Nx+i] = Tedge;
R   i = Nx-1; for (j = 0; j < Ny; j++) tnew[j*Nx+i] = Tedge;
-
O   jst = rank*chk+1;
R   lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
O   jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
P   for (j = jst; j < jfin; j++) {
P     for (i = 1; i < Nx-1; i++) {
P       tdiff = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
P       lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
-     }
-   }
R   for (i = 0; i < Nx*Ny; i++) told[i] = tnew[i];
-
P   MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE,
-                 MPI_MAX, MPI_COMM_WORLD);
R   if (!rank) printf(" iteration %d convergence %lf\n", iter, mxdiff);
R } while ( mxdiff > converge && iter < Max_iter);
```
Each iteration also copies the whole of tnew back into told. To avoid this cost, swap the old and new pointers on every iteration.
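A minimal sketch of the swap, assuming told and tnew are double * pointing at two separately allocated buffers (if they are fixed-size arrays they must first be replaced by pointers):

```c
/* At the end of each iteration, instead of
 *   for (i = 0; i < Nx*Ny; i++) told[i] = tnew[i];
 * exchange the pointers (told and tnew assumed to be double*): */
double *tmp = told;
told = tnew;
tnew = tmp;
```

This is safe because the interior of tnew is completely overwritten and its boundary values re-imposed on every iteration.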