The GRACOS (GRAvitational COSmology) code, a parallel implementation of the particle-particle/particle-mesh (P3M) algorithm for distributed-memory clusters, uses a hybrid method for both computation and domain decomposition. Long-range forces are computed using a Fourier transform gravity solver on a regular mesh; the mesh is distributed across parallel processes using a static one-dimensional slab domain decomposition. Short-range forces are computed by direct summation of close pairs; particles are distributed using a dynamic domain decomposition based on a space-filling Hilbert curve. A nearly optimal method was devised to dynamically repartition the particle distribution so as to maintain load balance even for extremely inhomogeneous mass distributions. Tests using $800^3$ simulations on a 40-processor Beowulf cluster showed good load balance and scalability up to 80 processes. Scalability is limited by communication and by extreme clustering; these limits may be removed by extending the algorithm to include adaptive mesh refinement.
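To illustrate the idea behind a Hilbert-curve domain decomposition, the following is a minimal sketch in two dimensions. It is not taken from GRACOS, which works in three dimensions and uses its own repartitioning method. The function names `hilbert_index` and `partition_by_work`, the grid size, and the per-particle work weights are all assumptions made for this example. Cells are ordered along the curve, and the ordered list is cut into contiguous segments of roughly equal work, so each process owns a compact region of space.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of two) to its
    position along a 2D Hilbert curve. Standard bit-manipulation form."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_by_work(keys, weights, nproc):
    """Assign each particle to a process (hypothetical helper).

    Particles are sorted by Hilbert key; each one goes to the process
    whose equal-work interval contains the midpoint of that particle's
    share of the cumulative work. Owners are therefore contiguous
    along the curve.
    """
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("total work must be positive")
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    owner = [0] * len(keys)
    running = 0.0
    for i in order:
        mid = running + 0.5 * weights[i]
        owner[i] = min(nproc - 1, int(mid * nproc / total))
        running += weights[i]
    return owner


# Usage: one particle per cell on an 8x8 grid, uniform work, 4 processes.
n = 8
cells = [(x, y) for x in range(n) for y in range(n)]
keys = [hilbert_index(n, x, y) for (x, y) in cells]
owner = partition_by_work(keys, [1.0] * len(cells), nproc=4)
```

With non-uniform weights (for example, a larger weight for particles in dense clusters, where short-range pair summation is expensive), the same cut rule gives heavily loaded regions a shorter stretch of the curve. Re-running the partition as the weights evolve is one simple way to picture dynamic repartitioning.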
Please cite using the BibTeX entry at http://www.gracos.org/bibtex.bib