COMP4300/6430 2013

Some Notes on Lab 3 - Synchronization, Performance Analysis and Tuning

Part 1: Synchronization

  1. With Nx = 15 and Ny = 19, it takes 5 iterations for grid point (7,5) to become non-zero. In general it takes n iterations for point (n,n) to become non-zero. This is because on every iteration, a grid point takes on the average value of its immediate neighbours, so that non-zero temperature ‘travels’ inwards from the edges of the grid at one grid point per iteration.
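     A minimal sequential sketch of this behaviour (not part of the lab code: Nx and Ny are taken from the question, Tedge = 100.0 is an arbitrary choice here, and the interior is assumed to start at zero with the edges held at Tedge, as in the lab):

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define Nx 15
      #define Ny 19
      #define Tedge 100.0

      int main(void) {
          double *told = calloc(Nx*Ny, sizeof(double));   /* interior starts at 0 */
          double *tnew = calloc(Nx*Ny, sizeof(double));
          int i, j, iter = 0;

          /* edges are fixed at Tedge */
          for (i = 0; i < Nx; i++) { told[i] = Tedge; told[(Ny-1)*Nx+i] = Tedge; }
          for (j = 0; j < Ny; j++) { told[j*Nx] = Tedge; told[j*Nx+Nx-1] = Tedge; }
          memcpy(tnew, told, Nx*Ny*sizeof(double));

          /* grid point (7,5): i = 7, j = 5, five grid points from the nearest edge */
          while (told[5*Nx+7] == 0.0) {
              iter++;
              for (j = 1; j < Ny-1; j++)
                  for (i = 1; i < Nx-1; i++)
                      tnew[j*Nx+i] = 0.25*(told[j*Nx+i+1] + told[j*Nx+i-1] +
                                           told[(j+1)*Nx+i] + told[(j-1)*Nx+i]);
              memcpy(told, tnew, Nx*Ny*sizeof(double));
          }
          printf("point (7,5) first becomes non-zero after %d iterations\n", iter);
          free(told); free(tnew);
          return 0;
      }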

  2. Measured execution times and speedups:

                        Number of processors
                        1       2       4       8
     Sequential (s)     4.86    n/a     n/a     n/a
     Parallel (s)       4.86    5.59    6.37    5.36
     Speedup            1.0     0.87    0.76    0.91

    Parallel speedup (the sequential time divided by the parallel time) is below 1 at every processor count, i.e. adding more processes actually increases the execution time. This is not good!

  3. The domain is decomposed along the y axis only. Every process is responsible for updating at most Ny/size columns of the grid. However, the region of the grid that each process works with is larger, at Ny/size+2 columns: the columns on either edge are ghost regions, i.e. copies of the data held by the neighbouring processes. They are refreshed every iteration with data received from those neighbours.
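     In terms of the variables used in the lab code (chk being the number of columns per process, roughly Ny/size), the layout at rank r can be sketched as follows; this is an illustration, not the actual declarations:

      /* Column ownership for rank r:
       *   j = r*chk                  ghost copy of the last column owned by rank r-1
       *                              (received every iteration, tag 2)
       *   j = r*chk+1 .. r*chk+chk   columns updated by rank r itself
       *   j = (r+1)*chk+1            ghost copy of the first column owned by rank r+1
       *                              (received every iteration, tag 1)
       * Ranks 0 and size-1 have the fixed physical boundary (j = 0 or j = Ny-1)
       * on one side instead of a ghost column.
       */
      jst  = rank*chk+1;                           /* first column this rank updates   */
      jfin = (jst+chk > Ny-1) ? Ny-1 : jst+chk;    /* update loop runs j = jst..jfin-1 */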

  4.  Because the sends and receives are issued in the same order on every process, each process first sends to the next process (rank+1) and then receives from the previous one (rank-1). With blocking communications, a process's send cannot complete until the next process posts its matching receive, which it does only after its own send has completed. Communications therefore proceed in a 'domino' fashion, completing first at process (size-1) and cascading down to process 0.
    To avoid this serialization of the communications, make the ordering asymmetric, so that odd and even processes send and receive in the opposite order.
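    A sketch of one such ordering, reusing the buffers and message tags from the lab code (an illustration of the idea, not necessarily the exact code used for the timings below):

      if (rank % 2 == 0) {
          /* even ranks: send first, then receive, in both exchanges */
          if (rank+1 < size) {
              MPI_Send(&tnew[(rank*chk+chk)*Nx], Nx, MPI_DOUBLE, rank+1,
                       2, MPI_COMM_WORLD);
              MPI_Recv(&tnew[((rank+1)*chk+1)*Nx], Nx, MPI_DOUBLE, rank+1,
                       1, MPI_COMM_WORLD, &status);
          }
          if (rank-1 >= 0) {
              MPI_Send(&tnew[(rank*chk+1)*Nx], Nx, MPI_DOUBLE, rank-1,
                       1, MPI_COMM_WORLD);
              MPI_Recv(&tnew[((rank-1)*chk+chk)*Nx], Nx, MPI_DOUBLE, rank-1,
                       2, MPI_COMM_WORLD, &status);
          }
      } else {
          /* odd ranks: same exchanges in the matching order, but receive
             before sending, so each blocking send on an even rank meets
             an already-posted receive                                    */
          if (rank-1 >= 0) {
              MPI_Recv(&tnew[((rank-1)*chk+chk)*Nx], Nx, MPI_DOUBLE, rank-1,
                       2, MPI_COMM_WORLD, &status);
              MPI_Send(&tnew[(rank*chk+1)*Nx], Nx, MPI_DOUBLE, rank-1,
                       1, MPI_COMM_WORLD);
          }
          if (rank+1 < size) {
              MPI_Recv(&tnew[((rank+1)*chk+1)*Nx], Nx, MPI_DOUBLE, rank+1,
                       1, MPI_COMM_WORLD, &status);
              MPI_Send(&tnew[(rank*chk+chk)*Nx], Nx, MPI_DOUBLE, rank+1,
                       2, MPI_COMM_WORLD);
          }
      }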

    Timings with the reordered communications:

                        Number of processors
                        1       2       4       8
     Sequential (s)     4.86    n/a     n/a     n/a
     Parallel (s)       4.86    5.63    6.37    5.35
     Speedup            1.0     0.86    0.76    0.91
    With this change, the code is not noticeably faster. Interesting…

  5.  Change the calls to MPI_Send and MPI_Recv to MPI_Isend and MPI_Irecv, and wait for all started communications at the end, e.g.

     /* send_prev, recv_prev, send_next and recv_next are of type MPI_Request */
     if (rank-1 >= 0) {
         /* start sending our first interior column to rank-1 (tag 1) */
         jst = rank*chk+1;
         MPI_Isend(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank-1,
                             1, MPI_COMM_WORLD, &send_prev);
     }
     if (rank+1 < size) {
         /* start receiving the ghost column from rank+1 (tag 1) */
         jst = (rank+1)*chk+1;
         MPI_Irecv(&tnew[jst*Nx], Nx, MPI_DOUBLE, rank+1,
                             1, MPI_COMM_WORLD, &recv_next);
     }
    ...
     /* wait for every started communication before the ghost data is used */
     if (rank+1 < size) {
         MPI_Wait(&send_next, &status);
         MPI_Wait(&recv_next, &status);
     }
     if (rank-1 >= 0) {
         MPI_Wait(&send_prev, &status);
         MPI_Wait(&recv_prev, &status);
     }
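     An equivalent, slightly more compact way to finish all of the communications (a sketch only, not necessarily what was timed) is to keep the four requests in an array, initialise them to MPI_REQUEST_NULL so that edge ranks can skip some of them, and complete everything with a single MPI_Waitall:

      MPI_Request reqs[4] = {MPI_REQUEST_NULL, MPI_REQUEST_NULL,
                             MPI_REQUEST_NULL, MPI_REQUEST_NULL};
      MPI_Status  stats[4];

      if (rank+1 < size) {
          MPI_Isend(&tnew[(rank*chk+chk)*Nx], Nx, MPI_DOUBLE, rank+1,
                              2, MPI_COMM_WORLD, &reqs[0]);
          MPI_Irecv(&tnew[((rank+1)*chk+1)*Nx], Nx, MPI_DOUBLE, rank+1,
                              1, MPI_COMM_WORLD, &reqs[1]);
      }
      if (rank-1 >= 0) {
          MPI_Isend(&tnew[(rank*chk+1)*Nx], Nx, MPI_DOUBLE, rank-1,
                              1, MPI_COMM_WORLD, &reqs[2]);
          MPI_Irecv(&tnew[((rank-1)*chk+chk)*Nx], Nx, MPI_DOUBLE, rank-1,
                              2, MPI_COMM_WORLD, &reqs[3]);
      }

      /* null requests complete immediately, so one call covers every case */
      MPI_Waitall(4, reqs, stats);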
    
    Performance is as follows:
                        Number of processors
                        1       2       4       8
     Sequential (s)     4.86    n/a     n/a     n/a
     Parallel (s)       4.86    5.64    6.37    5.45
     Speedup            1.0     0.86    0.76    0.89
    Non-blocking communication does not improve performance either.
    Time to profile!

Part 2: Performance Analysis

  1. The routines MPI_Allreduce and MPI_Send take most of the time. The communication topology clearly shows point-to-point communications between neighbouring processes, as expected.
  2. The output of mpiP gives more detail, including the call hierarchy and the time spent in the different calls to each MPI function.
  3. -       do {
    R        iter++;
    -        // update local values
    O        jst = rank*chk+1;
    O        jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    P        for (j = jst; j < jfin; j++) {
    P            for (i = 1; i < Nx-1; i++) {
    P                tnew[j*Nx+i]=0.25*(told[j*Nx+i+1]+told[j*Nx+i-1]+
    -                              told[(j+1)*Nx+i]+told[(j-1)*Nx+i]);
    -            }
    -        }
    -
    -        // Send to rank+1
    O        if (rank+1 < size) {
    O            jst = rank*chk+chk;
    O            MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1, 
    -                                2, MPI_COMM_WORLD);
    -        }
    O        if (rank-1 >= 0) {
    O            jst = (rank-1)*chk+chk;
    O            MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE,rank-1, 
    -                                2, MPI_COMM_WORLD, &status);
    -        }
    -
    -        // Send to rank-1
    O        if (rank-1 >= 0) {
    O            jst = rank*chk+1;
    O            MPI_Send(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank-1, 
    -                                1, MPI_COMM_WORLD);
    -        }
    O        if (rank+1 < size) {
    O            jst= (rank+1)*chk+1;
    O            MPI_Recv(&tnew[jst*Nx],Nx, MPI_DOUBLE, rank+1,
    -                                1, MPI_COMM_WORLD, &status);
    -        }
    -
    -        // fix boundaries in tnew
    R        j=0;    for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
    R        j=Ny-1; for (i = 0; i < Nx; i++)tnew[j*Nx+i]=Tedge;
    R        i=0;    for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
    R        i=Nx-1; for (j = 0; j < Ny; j++)tnew[j*Nx+i]=Tedge;
    -
    O        jst = rank*chk+1;
    R        lmxdiff = fabs( (double) (told[jst*Nx+1] - tnew[jst*Nx+1]));
    O        jfin = jst+chk > Ny-1 ? Ny-1 : jst+chk;
    P        for (j = jst; j < jfin; j++) {
    P            for (i = 1; i < Nx-1; i++) {
    P             tdiff = fabs( (double) (told[j*Nx+i] - tnew[j*Nx+i]));
    P             lmxdiff = (lmxdiff < tdiff) ? tdiff : lmxdiff;
    -            }
    -        }
    R        for (i = 0; i < Nx*Ny; i++) told[i]=tnew[i];
    -
    P        MPI_Allreduce(&lmxdiff, &mxdiff, 1, MPI_DOUBLE, 
    -                MPI_MAX, MPI_COMM_WORLD);
    R        if (!rank) printf(" iteration %d convergence %lf\n",iter,mxdiff);
    R    } while ( mxdiff > converge && iter < Max_iter);
    
    
  4. Most of the time is spent in copying from tnew to told. To avoid this cost, swap the old and new pointers on every iteration.
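     A sketch of the change, using the pointer names from the lab code (this assumes told and tnew are plain pointers to separately allocated arrays):

      /* replace:  for (i = 0; i < Nx*Ny; i++) told[i] = tnew[i];
         with a pointer swap at the end of each iteration:         */
      double *tmp = told;
      told = tnew;
      tnew = tmp;

     This works because every buffer location a rank actually reads, i.e. its own columns, the two ghost columns and the fixed boundary, is recomputed, re-received or reset to Tedge on every iteration.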