# Ducks and the GPU performance? – Part 1 of 2

Is it possible to account for all the particles in the system without actually iterating through each of them every time? Read this post to get an idea about my summer project and the groundwork for the answer to the title question!

## Problem/Motivation

Particle interactions are a common subject of simulations in the fields like Molecular Dynamics and Astrophysics. Imagine computing the forces on each planet in our solar system caused by all the other planets. For each of the planets, one would need to add up the contributions to the total force of all the others, one at a time. Assuming all the data is available, the task is computationally trivial. However, in typical simulations, where the particles are counted in trillions, computing anything with this approach would take a lifetime even on the most sophisticated computing architectures.

## Solution

A different approach is used in one of the most important algorithms of the 20th century – the Fast Multipole Method (FMM). It allows faster computation of particle interactions by grouping them together. Each group is represented by a single expression called the multipole expansion. Multipole expansion is an approximation of the impact the considered group has on its environment. When computing a force at a certain point, instead of accounting for each particle in the group, only this single expression is considered. That is how the complexity is reduced!

### How to Implement Such a Thing?

FMM includes a series of steps, but for brevity, I will focus only on one for now. In real-world simulations, domains of interest are the three dimensional ones. Imagine a cubic box which I will only refer to as a box. This box is divided into a lot of small boxes where each is represented by a single multipole expansion. Consider a point in which we want to compute the potential. Consider the box in which the point is located. It is required to compute the impact of all other boxes onto our box. The impact computation is done by a so called Multipole-to-Local (M2L) operator which is exactly the segment to be improved in my summer project.

### Recipe for Speed

Current implementation developed at FZ Jülich rotates the multipole twice followed by the translation into the box of interest. It can be shown that instead of the two rotations we can rotate it in a different manner six times to reduce complexity. Each of these six rotations are individually cheaper than the currently implemented two and when everything is put together, I expect to see some speed gained.

## Ducks?

It requires some effort to grasp the maths behind creation, rotation and the shift of multipoles which makes it too broad for the scope of the blog. However, implementation-wise, it comes down to building data structures to represent these concepts and that is exactly where the fun starts! Above you can see how the fancy multipole in the first picture can be represented in the computer. Coefficients of the multipole expansion are stored in each blue box. The first box on the left represents a monopole coefficient, the two boxes stacked on top of each other hold the dipole coefficients etc.

To win exclusive bragging rights, write your idea on how the ducks are linked to the story in the comments section below! Otherwise, you can find the answer in the second part of this post. For the outcome of the new way of rotation, stay tuned for further posts!

thank you for the clear structure of your post as well as the visual aids, i am exited for the following post on ducks!

very informative and well-written blog post! I am also curious how the ducks incorporate with such a complex topic. How come that you use a cube as an example of implementation which then follows by a point and a box?

Good point, thank you for the comment. I used the terms cubes and boxes interchangeably without any clarification. It should say cubic boxes. I just edited that part.

To answer your question – this stems from the work of much more experienced people and if I understood it correctly, the usage of cubes (or cubic boxes) is purely for simplicity. Each of them contains a certain number of little objects that are represented by a single expression (by a single expression,you can just imagine a single object that represents the whole box). This is a mathematical expression which is actually calculated by considering a sphere and not a cubic box. I do not dare to say it would be impossible, but using the spheres would definitely make the work significantly more convoluted without any reasonable practical benefits.

If I got your comment correctly, you are also wondering why we are considering a point inside the box. In simulations, the goal is to obtain a certain physical quantity in some or all parts of the system which is described e.g. imagine a table set for dinner. If you lean on it with your hand and you are interested in the force you applied on the table, you would want to find out the force at the place you put your hand and would not really care how the plates or the cutlery act on the table. If you had some spare time and a strong desire to make a computer simulation out of this scenario, your point of interest would be where you placed your hand. Analogously, when FMM (the algorithm I am describing here) is used in simulations, there are always specific points of interest and the user only cares about a physical quantity at a certain point and could not care less about the boxes I used to make my life easier.

I hope that makes it a bit more clear for you, let me know if there’s a piece missing.

thank you for your detailed response and the concrete example with the dining table. 🙂

Interesting article! If I grasped that correctly, the idea of FMM is to group all the bodies and determine the force they have on the others. Do you group them into units that act all with equal force? Also, how fast can can the grouping process be done?

Yes, that sums it up well. The objects in the system are described by their position e.g. x,y,z coordinates and a certain physical quantity e.g. mass. We group the objects into boxes only based on their positions. Therefore, the number of objects each box contain and their masses will vary so the answer is generally no. I am saying generally beacuse theoretically there could be a case all the groups act on a point with identical force but it’s highly improbable in a realistic scenario.

To answer the second question I need to expand a bit. Since the time it takes to execute an algorithm depends on multitude of factors, a measure used is a complexity of an algorithm. It is labeled with a “big O” notation e.g. an algorithm of complexity O(n), where n is the number of objects in your system, will be faster than that of complexity O(n^2) if you run them the same way on the same machine. The complexity of the grouping process is smaller than that of the M2L operator and is therefore not in my focus this summer. Hence, I am sorry to say I cannot give you an exact time it takes to group up a certain amount of objects.

Very well written blog! I am curious to see just how much the GPU performance can be improved by using different six step proccess as oposed to the conventional two step one.

Thanks for reading! Stay tuned for the further posts to find out!

Very interesting and informative post! I’m really interested in the M2L operator. Could you expand a little bit on that? What is the aim of improving it and is it only used for one box at a time?

Thank you for reading! The goal is to improve the overall performance of the Fast Multipole Method. Therefore, the current focus is on the part that works the slowest – M2L operator. The amount of boxes it is used for at once is arbitrary. It could be used on one box at a time but that would be quite slow. The current implementation stacks up the triangles that represent the boxes and executes the M2L in parallel on the whole stack of triangles. This is a much faster approach but there is still room left in the way the triangles are stacked up. One of the goals of the project is to stack them up in the optimal way for the GPU. I am currently making some graphics that make this way of stacking easy to grasp and will hopefully be publishing it in the next blog entry.

Thank you for this introduction to FMM! I have already done a few toy molecular dynamics simulations myself and wondered how one would deal with a large system of particles. Now I have a better idea of this approach