One of my primary goals at Argonne National Lab was to effectively
scale GPU-enabled MPI code on the lab's supercomputers. I worked
primarily on two machines: Polaris, a petascale NVIDIA machine; and
Aurora, an in-development exascale machine with Intel hardware.
Our code uses Python APIs to call CUDA and SYCL kernels: a
hardware-agnostic MPI driver dispatches to a different kernel
implementation for each backend. Maintaining this modularity was
often challenging, as the Python HPC libraries are still very much
in development, and the limited availability of CPU simulators for
every hardware backend demanded careful attention to our CI testing.
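To make that dispatch pattern concrete, here is a minimal sketch of
one way a driver could stay hardware agnostic; the module names
(cupy for CUDA, dpnp for SYCL) and the axpy example are assumptions
for illustration, not our actual code.

```python
# Minimal sketch of a backend-dispatch layer (assumed module names:
# cupy for CUDA, dpnp for SYCL; the real libraries may differ).
import importlib

_BACKENDS = {"cuda": "cupy", "sycl": "dpnp"}


def get_array_module(backend: str):
    """Return the array library for the requested GPU backend."""
    if backend not in _BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return importlib.import_module(_BACKENDS[backend])


def axpy(xp, a, x, y):
    """A driver-level operation written only against the shared
    array API, so CUDA and SYCL kernels can be swapped without
    touching the MPI driver."""
    return xp.add(a * x, y)


# Example: the driver chooses the backend once, then stays agnostic.
xp = get_array_module("cuda")              # or "sycl" on Aurora
x = xp.arange(1024, dtype=xp.float64)
y = xp.ones_like(x)
z = axpy(xp, 2.0, x, y)
```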
To achieve good scaling, I minimized data transfer between the GPU
and CPU and manually divided the work assigned to each MPI rank
among several GPUs. Using Python allowed for relatively quick
development and easy integration with the scientific modeling work
of my theoretical physicist coworkers, but it carried severe
overheads beyond roughly 10,000 MPI ranks; assigning multiple GPUs
to each rank was key to accessing the full power of the lab's
supercomputers.
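The sketch below illustrates those two techniques together, under
the assumption of an mpi4py + CuPy stack and a toy workload; it is
not our production driver. Each rank splits its local work across
every GPU it can see, keeps the arithmetic device-resident, and only
a scalar crosses back to the host for the MPI reduction.

```python
# Sketch of the two scaling techniques above, assuming an
# mpi4py + CuPy stack (the production code may use other libraries).
from mpi4py import MPI
import numpy as np
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical per-rank workload; real code would use the rank's
# own domain partition.
local_work = np.arange(1_000_000, dtype=np.float64) + rank

n_gpus = cp.cuda.runtime.getDeviceCount()  # GPUs visible to this rank
partial = 0.0
for dev_id, chunk in enumerate(np.array_split(local_work, n_gpus)):
    with cp.cuda.Device(dev_id):
        x = cp.asarray(chunk)              # one host-to-device copy
        y = cp.sin(x) ** 2                 # heavy work stays on the GPU
        partial += float(y.sum())          # only a scalar comes back

# A single small host-side reduction instead of moving device arrays.
total = comm.allreduce(partial, op=MPI.SUM)
if rank == 0:
    print(f"global sum across {comm.Get_size()} ranks: {total:.6e}")
```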
The figure below illustrates strong-scaling performance on 2,100
nodes of Aurora, corresponding to 4,200 MPI ranks and 12,600 GPUs.
This work was done on a pre-production supercomputer with early
versions of the Aurora software development kit. Given the
strong-scaling performance our code demonstrates, the team expects
to receive an early science allocation of compute time in the near
future.