First, the algorithm that generates decompositions automatically (when nproc_x and nproc_y are both -1) appears far from ideal from what I can tell. That algorithm, subroutine MPASPECT in external/RSL_LITE/module_dm.F, determines nproc_x and nproc_y when the user doesn't set them. It currently requires that nproc_x divide ntasks (the total number of MPI tasks) evenly, but places no such requirement on nproc_y, and then picks the combination that minimizes abs(nproc_y - nproc_x). As a result, the product nproc_x * nproc_y often differs from ntasks, which leaves processors idle.

Does anyone know whether nproc_x is actually required to divide ntasks evenly? And how was the objective min(abs(nproc_y - nproc_x)) chosen?
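To see which task counts even admit an exact, well-balanced split, I've been using a tiny helper along these lines (my own script, not WRF code):

```python
# Hypothetical helper (not part of WRF): list every exact
# nproc_x x nproc_y factorization of a task count, so that
# every MPI task owns a patch and none sit idle.
def exact_decompositions(ntasks):
    return [(nx, ntasks // nx)
            for nx in range(1, ntasks + 1)
            if ntasks % nx == 0]

# e.g. 575 = 5^2 * 23 admits 23 x 25 and 25 x 23, but not 23 x 23,
# which covers only 529 tasks and idles the remaining 46.
print(exact_decompositions(575))
```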

Below is a small subset of my timing results for my case on a Cray XE6 that illustrate several of the issues.


total   default  nproc_x  nproc_y  nproc_x *  domain  timing
tasks   decomp?                    nproc_y    size    (sec/day)
572     no       11       52       572        25x4    164
574     yes      14       41       574        20x5    167
575     yes      23       23       529        12x9    197
575     no       25       23       575        11x9    175

First, comparing the 575-task default decomposition (23 x 23) with a more reasonable one (25 x 23) shows a performance difference of about 10% (197 vs. 175 sec/day). The default sets nproc_x x nproc_y to 23 x 23, which covers only 529 tasks; I believe this leaves 46 tasks idle. A user-defined 25 x 23 decomposition (using all 575 tasks) keeps every task busy and is about 10% faster from what I can tell.

Second, I was surprised that the skinnier per-task domains, 25x4 (572 tasks) and 20x5 (574 tasks), performed better than 11x9 (575 tasks). My expectation was that lower-aspect-ratio domains would perform better, since they exchange less halo data per grid point. Does anyone have a better understanding of why the higher-aspect-ratio blocks perform better here?
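That expectation comes from halo-exchange volume: for an a x b patch with a one-cell halo, each exchange moves roughly 2*(a+b) cells per variable, so the skinny patches actually carry more communication per point. A quick back-of-envelope check on the per-task domain sizes from the table above (my own numbers; one-cell halo assumed, corner cells ignored):

```python
# Per-tile halo size for the per-task domain sizes in the table above.
# Assumes a one-cell halo on each side and ignores corner cells.
for a, b in [(25, 4), (20, 5), (12, 9), (11, 9)]:
    area = a * b
    halo = 2 * (a + b)   # cells exchanged per variable per step
    print(f"{a}x{b}: area={area}, halo~{halo}, halo/area~{halo/area:.2f}")
```

By this crude measure the 25x4 tiles should communicate the most per point, yet they were the fastest, which is what puzzles me.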

I have lots of other data, but I believe the MPASPECT subroutine has plenty of room for improvement, and I would like to know if others agree. Does the current implementation contain a bug? For instance, should nproc_y also be required to divide ntasks evenly? That would significantly improve the results, although it would make some task counts unrunnable. Or would a better objective function help? I am planning to change my implementation so it optimizes a weighted sum of the nproc_x/nproc_y aspect ratio and the number of tasks actually used relative to the target task count.
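As a sketch of the weighted objective I have in mind (the weights, names, and scoring are illustrative, not a finished patch): score every (nproc_x, nproc_y) pair with nproc_x * nproc_y <= ntasks by a blend of aspect-ratio badness and idle-task fraction, and keep the best.

```python
import math

# Illustrative sketch, not the actual MPASPECT replacement.
# Penalize both departure from a square decomposition and idle tasks.
def choose_decomp(ntasks, w_aspect=1.0, w_waste=4.0):
    best, best_score = None, math.inf
    for nx in range(1, ntasks + 1):
        ny = ntasks // nx                 # largest ny with nx*ny <= ntasks
        if ny == 0:
            break
        aspect = max(nx, ny) / min(nx, ny) - 1.0   # 0 for a square split
        waste = (ntasks - nx * ny) / ntasks        # fraction of idle tasks
        score = w_aspect * aspect + w_waste * waste
        if score < best_score:
            best, best_score = (nx, ny), score
    return best

# For 575 tasks this picks the exact near-square split 23 x 25
# instead of idling 46 tasks at 23 x 23.
print(choose_decomp(575))
```

The weights would need tuning against real timings; the point is only that the objective should see both the aspect ratio and the wasted tasks, which min(abs(nproc_y - nproc_x)) does not.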

I am trying hard to understand WRF performance and to tune toward a reasonably optimal task count and decomposition. I haven't been able to find much information in the User Guide or on the Forum about this. If anyone has pointers to other documentation on WRF parallelization performance, that would be very helpful.

thanks,

tony.....