Only two-dimensional transforms have been tested here; higher-dimensional FFTs are expected to show similar results. FFTs on a cluster are usually done in the following way: (1) each processor performs the 1d transforms along the rows it stores locally, (2) the distributed matrix is transposed, and (3) each processor performs the 1d transforms along the (now local) columns.
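To make the structure concrete, here is a minimal sketch of this general scheme in C/MPI. It is not one of the three benchmarked programs: the row-slab layout, the hypothetical helper fft1d_rows(), and the use of a single MPI_Alltoall for the transpose are assumptions made purely for illustration.

    #include <mpi.h>

    /* Hypothetical helper (assumption, not from the benchmarked programs):
     * 1d FFTs, e.g. via fftw, over nrows contiguous rows of length L.
     * Complex data are stored as interleaved (re,im) pairs of doubles. */
    void fft1d_rows(double *rows, int nrows, int L);

    /* Sketch of the usual parallel 2d FFT with a row-slab decomposition:
     * each of np processors owns L/np rows of length L. */
    void fft2d_sketch(double *slab, double *work, int L, int np, MPI_Comm comm)
    {
        int nloc  = L / np;               /* rows owned by this processor    */
        int block = 2 * nloc * nloc;      /* doubles per (L/np)x(L/np) block */

        /* (1) 1d transforms along the locally stored rows */
        fft1d_rows(slab, nloc, L);

        /* (2) matrix transpose across processors.  For simplicity assume the
         * slab has already been packed so that the block destined for rank k
         * is contiguous at offset k*block; the packing/unpacking and the
         * local transpose of each block are omitted here. */
        MPI_Alltoall(slab, block, MPI_DOUBLE, work, block, MPI_DOUBLE, comm);

        /* (3) 1d transforms along the columns, which after the transpose
         * are again locally stored rows */
        fft1d_rows(work, nloc, L);
    }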
A second algorithm uses the same scheme, but instead of calling MPI_Alltoall in the matrix transpose routine, it first uses MPI_Type_vector to define a derived data type describing the block of size (Lx/np)*(Ly/np) that must be sent to each of the other processors (Lx*Ly is the system size, np the number of processors). The transpose routine then exchanges these blocks with non-blocking MPI_Irecv and MPI_Isend calls (np(np-1) of each are needed in total). The 1d row and column transforms are done with the fftw routine. The results for this algorithm are given under the column "fft2d-i".
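A hedged sketch of how such a transpose might look; the variable names, the receive layout, and the local-transpose details are assumptions made here, not taken from the downloadable source. One strided datatype describes an (Lx/np) x (Ly/np) block inside the local row slab, and every processor posts non-blocking receives and sends for the np-1 off-diagonal blocks:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch of an "fft2d-i"-style transpose: complex data stored as
     * interleaved (re,im) doubles, row-slab layout, Lx*Ly system on
     * np processors, nloc = Lx/np rows per processor. */
    void transpose_isend(double *slab, double *work,
                         int Lx, int Ly, int np, MPI_Comm comm)
    {
        int rank, k, nreq = 0;
        int nloc = Lx / np;               /* local rows                     */
        int bl   = 2 * (Ly / np);         /* doubles per row of one block   */
        MPI_Datatype block;               /* one (Lx/np) x (Ly/np) block    */
        MPI_Request *req = malloc(2 * (np - 1) * sizeof(MPI_Request));

        MPI_Comm_rank(comm, &rank);

        /* The block destined for processor k consists of nloc rows of Ly/np
         * complex numbers each, separated by the full row length Ly
         * (stride given in doubles). */
        MPI_Type_vector(nloc, bl, 2 * Ly, MPI_DOUBLE, &block);
        MPI_Type_commit(&block);

        for (k = 0; k < np; k++) {
            if (k == rank) continue;      /* the diagonal block stays local */
            MPI_Irecv(work + (size_t)k * bl, 1, block, k, 0, comm, &req[nreq++]);
            MPI_Isend(slab + (size_t)k * bl, 1, block, k, 0, comm, &req[nreq++]);
        }
        /* ... handle the local diagonal block while messages are in flight,
         * then wait for all np-1 sends and receives ... */
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        MPI_Type_free(&block);
        free(req);
        /* each received block still has to be transposed locally (omitted) */
    }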
Both algorithms described above separate the CPU-intensive and the communication-intensive parts completely: parts (1) and (3) involve no communication at all, whereas part (2), the matrix transpose, involves communication only. This can be a distinct disadvantage if the computer architecture and the MPI distribution allow computation and message passing to proceed concurrently on a single CPU. With this in mind a third algorithm, "fft2d-p" in the tables below, was implemented: instead of performing the transpose as one separate step, the data are sent in small pieces (of size L/np, see the discussion below) with non-blocking calls as soon as they become available, so that communication overlaps with the remaining 1d transforms.
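A possible realization of such an overlapping transpose, again only a sketch under the same layout assumptions as above; the real "fft2d-p" program may pipeline the data differently. Each row is transformed and its off-diagonal pieces are handed to MPI immediately with non-blocking sends, so the network works while the CPU continues with the next rows:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical 1d transform of one row of L interleaved (re,im) doubles. */
    void fft1d_row(double *row, int L);

    /* Sketch of an overlapped "compute + transpose" step: after each local
     * row is transformed, its np-1 off-diagonal pieces (L/np complex numbers
     * each) are sent at once with MPI_Isend, so communication proceeds while
     * the remaining rows are still being transformed. */
    void fft_rows_overlapped(double *slab, double *work,
                             int L, int np, MPI_Comm comm)
    {
        int rank, r, k, nreq = 0;
        int nloc  = L / np;               /* local rows                  */
        int piece = 2 * (L / np);         /* doubles per row piece       */
        MPI_Request *req = malloc(2 * nloc * (np - 1) * sizeof(MPI_Request));

        MPI_Comm_rank(comm, &rank);

        /* pre-post receives for every piece that will arrive from the others */
        for (k = 0; k < np; k++) {
            if (k == rank) continue;
            for (r = 0; r < nloc; r++)
                MPI_Irecv(work + ((size_t)k * nloc + r) * piece, piece,
                          MPI_DOUBLE, k, r, comm, &req[nreq++]);
        }

        for (r = 0; r < nloc; r++) {
            double *row = slab + (size_t)r * 2 * L;
            fft1d_row(row, L);            /* compute ...                 */
            for (k = 0; k < np; k++) {    /* ... and ship the result out */
                if (k == rank) continue;
                MPI_Isend(row + (size_t)k * piece, piece, MPI_DOUBLE,
                          k, r, comm, &req[nreq++]);
            }
        }

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        free(req);
        /* received pieces still have to be rearranged into transposed
         * order before the column transforms (omitted) */
    }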
The source code of all three test programs can be downloaded from here.
system size LxL = 1600x1600, nfft=6
===================================
np=1: 42.65

MPI distribution   np  fftw           fft2d-i        fft2d-p
mpich-1.2.1         2  43.10 (29.19)  43.14 (27.48)  43.14 (26.37)
mpipro-1.5b7-3t     2  36.51 (32.22)  32.40 (28.84)  32.12 (28.81)
lam-6.3.2           2  42.70 (34.90)  35.17 (27.88)  29.01 (24.73)
mpich-1.2.1         4  31.09          31.69          31.90
mpipro-1.5b7-3t     4  27.15          23.93          23.13
lam-6.3.2           4  27.74          25.13          19.64
mpich-1.2.1         8  24.49          23.76          25.21
mpipro-1.5b7-3t     8  17.02          17.86          15.40
lam-6.3.2           8  16.21          16.76          13.34

system size LxL = 800x800, nfft=24
==================================
np=1: 33.11

MPI distribution   np  fftw           fft2d-i        fft2d-p
mpich-1.2.1         2  41.18 (25.52)  40.04 (24.59)  30.53 (23.47)
mpipro-1.5b7-3t     2  35.14 (29.26)  29.86 (25.41)  33.25 (27.25)
lam-6.3.2           2  40.68 (30.89)  33.12 (24.61)  30.33 (21.72)
mpich-1.2.1         4  31.69          34.17          26.07
mpipro-1.5b7-3t     4  26.08          24.70          25.42
lam-6.3.2           4  25.17          23.24          21.10
mpich-1.2.1         8  23.96          24.72          26.17
mpipro-1.5b7-3t     8  16.38          17.04          19.69
lam-6.3.2           8  15.60          16.07          13.83

system size LxL = 400x400, nfft=96
==================================
np=1: 27.63

MPI distribution   np  fftw           fft2d-i        fft2d-p
mpich-1.2.1         2  38.38 (22.05)  37.79 (23.26)  36.14 (22.44)
mpipro-1.5b7-3t     2  32.03 (25.28)  28.29 (22.77)  38.50 (28.44)
lam-6.3.2           2  36.76 (28.02)  35.28 (22.33)  35.25 (19.26)
mpich-1.2.1         4  29.60          32.18          33.28
mpipro-1.5b7-3t     4  21.36          24.04          31.78
lam-6.3.2           4  17.83          22.59          21.47
mpich-1.2.1         8  14.35          14.92          38.96
mpipro-1.5b7-3t     8  13.95          15.15          26.00
lam-6.3.2           8  18.06          14.29          16.23
The algorithm "fft2d-p", however, uses messages of size L/np, which are small in the sense that their transfer time is dominated by latency. The lam MPI implementation appears to have a much lower latency than the other two distributions, which is why the best results for "fft2d-p" were obtained with lam-6.3.2. For the larger system sizes this combination (fft2d-p + lam) gave the fastest FFTs, and it is probably also the best choice for higher-dimensional FFTs.
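To see why latency matters here, a rough estimate (assuming, purely for the sake of the estimate, that one message carries L/np double-precision complex numbers, i.e. 16*L/np bytes): for L = 400 and np = 8 that is only 800 bytes, which needs about 800 B / 12.5 MB/s ≈ 64 microseconds of raw transfer time on 100BaseT. Per-message latencies of MPI over TCP are typically of the same order of magnitude or larger, so the time per message is set largely by the latency of the MPI implementation rather than by the wire speed.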
It should be emphasized that one cannot draw conclusions from these results about which MPI distribution is "best" in general. The results are valid only for this particular problem and depend sensitively on the algorithm; in fact, they indicate that for sending large messages the ranking may well be different.
The picture is also likely to change for networks faster than 100BaseT. Which algorithm is best in that case is unclear, and the whole study would have to be repeated.