Benchmarking the Dirac Operator and a Goodbye

Unfortunately, the Summer of HPC 2022 has come to an end. It was an absolute blast to work on High Performance Quantum Fields for the past two months!

As mentioned in the previous blog post, the aim of this project was to benchmark the Dirac operator from Lattice Quantum Chromodynamics (Lattice QCD). We showed that the Staggered Dirac operator gives each colour component of the output source term $\chi_i(n)$ as

$$\chi_i(n) = \frac{1}{2}\sum_{\mu=1}^{4}\eta_\mu(n)\left[U_\mu(n)_{ij}\,\psi_j(n+\hat{\mu}) - U_\mu^\dagger(n-\hat{\mu})_{ij}\,\psi_j(n-\hat{\mu})\right],$$

where $U_\mu(n)\in SU(3)$ are the gauge links, $\eta_\mu(n)$ are the staggered phases, and the colour index $j$ is summed over.

Therefore, for each point n on the lattice, we perform 8 matrix-vector multiplications between an element of SU(3) and a complex vector with 3 elements: since the lattice is a 4-dimensional hypercube, each site has one forward and one backward hop in each of the four directions. This forms the basis of the benchmark kernel.
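To make this concrete, here is a minimal sketch of such a kernel written with Kokkos (discussed further in the next section). The view names, the flattened site indexing, the neighbour table, and the phase array are all illustrative assumptions rather than the project's actual code:

```cpp
#include <Kokkos_Core.hpp>

using cplx = Kokkos::complex<double>;

// Assumed (hypothetical) data layout:
//   U  (site, mu, row, col) : SU(3) gauge links
//   psi(site, colour)       : input 3-component complex vectors
//   chi(site, colour)       : output 3-component complex vectors
//   nbr(site, dir)          : neighbour indices, dirs 0-3 forward, 4-7 backward
//   eta(site, mu)           : staggered phases (+1 or -1)
void staggered_dirac(Kokkos::View<cplx****> U,
                     Kokkos::View<cplx**>   psi,
                     Kokkos::View<cplx**>   chi,
                     Kokkos::View<int**>    nbr,
                     Kokkos::View<double**> eta)
{
  Kokkos::parallel_for("staggered_dirac", chi.extent(0),
    KOKKOS_LAMBDA(const int n) {
      for (int i = 0; i < 3; ++i) chi(n, i) = 0.0;
      for (int mu = 0; mu < 4; ++mu) {
        const int fwd = nbr(n, mu);      // site n + mu_hat
        const int bwd = nbr(n, mu + 4);  // site n - mu_hat
        for (int i = 0; i < 3; ++i) {
          cplx acc = 0.0;
          for (int j = 0; j < 3; ++j) {
            acc += U(n, mu, i, j) * psi(fwd, j);                 // forward hop
            acc -= Kokkos::conj(U(bwd, mu, j, i)) * psi(bwd, j); // backward hop, U^dagger
          }
          chi(n, i) += 0.5 * eta(n, mu) * acc;
        }
      }
    });
}
```

Each site performs two matrix-vector products per direction, giving the eight SU(3) multiplications counted above.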

Implementation in Kokkos

The single-node benchmark follows the STREAM convention, which sums the total bytes read by the Dirac kernel and the total bytes written out. This was straightforward to implement across both the CPU and GPU using Kokkos, thanks to its API, which was discussed in my first post. The results of the benchmark are shown in the figure below:

Memory bandwidth of the Staggered Dirac kernel for multiple execution spaces.

It can be clearly seen, especially for the A100 GPU, that memory bandwidth utilisation increases as the lattice volume increases. This indicates that the Staggered Dirac kernel scales well with volume. Notably, the CPU shows a drop in memory bandwidth utilisation, which indicates that the memory channels have become saturated. One way to alleviate this is to set OMP_PROC_BIND=spread and OMP_PLACES=threads for OpenMP version 4.0 or greater.
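For reference, under the STREAM convention the reported figure is just the total bytes moved divided by the kernel runtime. Here is a minimal sketch of that calculation; the byte counts correspond to the hypothetical kernel sketched earlier, and the real benchmark may count slightly differently:

```cpp
#include <Kokkos_Core.hpp>

// STREAM-convention bandwidth estimate (a sketch; byte counts are assumptions).
// Per site, the sketched kernel reads 8 SU(3) links (9 complex each) and
// 8 neighbour vectors (3 complex each), and writes 1 output vector (3 complex),
// with complex<double> taking 16 bytes. Index and phase reads are ignored here.
template <class Kernel>
double measure_bandwidth_gbs(Kernel&& dirac, long nsites, int nrepeat)
{
  const double bytes_per_site = (8.0 * 9 + 8.0 * 3 + 3.0) * 16.0;

  Kokkos::fence();           // make sure nothing else is still running
  Kokkos::Timer timer;
  for (int r = 0; r < nrepeat; ++r) dirac();
  Kokkos::fence();           // wait for all kernel launches to complete
  const double seconds = timer.seconds();

  return bytes_per_site * nsites * nrepeat / seconds / 1.0e9;  // GB/s
}

// Usage: measure_bandwidth_gbs([&] { staggered_dirac(U, psi, chi, nbr, eta); },
//                              nsites, 100);
```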

Adding distributed-memory communication with MPI was a much simpler process than anticipated. As Kokkos is a shared-memory library, it complements MPI quite nicely. Internode communication between the CPUs and GPUs (along with intranode communication for the GPUs) was done using the halo-exchange method. However, since Kokkos::subviews are not guaranteed to be contiguous, Kokkos::deep_copy was used to copy into and out of contiguous halo buffers, without having to implement custom MPI_Datatypes.
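As an illustration, one direction of such a halo exchange might look like the sketch below. The field layout, ghost-layer convention, and neighbour ranks are assumptions for this example, and GPU-aware MPI is assumed when the views live in device memory:

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>

using cplx  = Kokkos::complex<double>;
using Field = Kokkos::View<cplx****>;  // (x, y, z, colour); ghost layers at x = 0 and x = nx-1

// Exchange the halo in the +x direction (a sketch, not the project's code).
// The boundary subview is generally strided, so it is deep_copy'd into a
// contiguous buffer instead of being described by a custom MPI_Datatype.
void exchange_x_halo(Field psi, int up_rank, int down_rank, MPI_Comm comm)
{
  const int nx = psi.extent(0), ny = psi.extent(1),
            nz = psi.extent(2), nc = psi.extent(3);

  Kokkos::View<cplx***> send("send_buf", ny, nz, nc);
  Kokkos::View<cplx***> recv("recv_buf", ny, nz, nc);

  // Pack: deep_copy the last interior x-slice into the contiguous send buffer.
  auto face = Kokkos::subview(psi, nx - 2, Kokkos::ALL, Kokkos::ALL, Kokkos::ALL);
  Kokkos::deep_copy(send, face);  // blocks until the copy is done

  const int count = 2 * ny * nz * nc;  // one complex<double> = 2 MPI_DOUBLEs
  MPI_Sendrecv(send.data(), count, MPI_DOUBLE, up_rank,   0,
               recv.data(), count, MPI_DOUBLE, down_rank, 0,
               comm, MPI_STATUS_IGNORE);

  // Unpack: deep_copy the received slice into the x = 0 ghost layer.
  auto ghost = Kokkos::subview(psi, 0, Kokkos::ALL, Kokkos::ALL, Kokkos::ALL);
  Kokkos::deep_copy(ghost, recv);
}
```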

A Final Recommendation

Please don't forget to check out the video presentation below, which I made with my colleague Apostolos Giannousas about the project. I also highly recommend reading Apostolos' work on extracting the mass of the pion from Lattice simulations.
