Marco Borelli

First Impressions

Accelerating Molecular Dynamics with GPUs

My name is Marco Borelli, and I’m writing from Jülich, Germany. For my Summer of HPC project I was assigned to the supercomputing facility of Forschungszentrum Jülich, a beautiful research center located in North Rhine-Westphalia. My project consists of implementing the Fast Multipole Method on (GP)GPUs.

The Fast Multipole Method (FMM) is an algorithm for solving the Coulomb problem: calculating the mutual electrostatic forces between a number of point charges in space. This problem needs to be solved when simulating, for example, the interaction of biological molecules (molecular dynamics), but the same mathematical formulation also describes the gravitational interaction of planets and stars.

The simplest algorithm for this problem directly calculates the influence of each particle on every other one, then adds up these contributions to obtain the total force acting on each particle. This approach has a time complexity of O(N²), which means that the computation time grows as the square of the number of particles. This puts a hard limit on the number of particles that can be included in a simulation, so a faster algorithm would be very useful. It turns out that there are a number of faster methods; the ones we are concentrating on for this project are the multipole-based ones. The basic principle of these methods is that the further away a particle is from the observation position, the more readily it can be grouped together with its nearby particles into a pseudo-particle, whose net charge is the sum of the constituent charges. Correctly distilling this principle into a numerical algorithm, without introducing uncontrolled errors, is definitely not easy, but it leads to an impressive speedup: in a typical configuration, the complexity drops to O(N).
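For concreteness, the direct O(N²) summation described above can be sketched in plain C++. The `Particle` struct and function name here are illustrative, not taken from the actual FMM code, and units are chosen so that the Coulomb constant is 1:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z, q; };  // position and charge

// Direct pairwise Coulomb forces: O(N^2) in the number of particles.
// Returns the force components (fx, fy, fz) acting on each particle.
std::vector<std::array<double, 3>> directForces(const std::vector<Particle>& p) {
    std::vector<std::array<double, 3>> f(p.size(), {0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j) {  // every other particle
            if (i == j) continue;
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            const double s = p[i].q * p[j].q * inv_r3;  // Coulomb's law, k = 1
            f[i][0] += s * dx;
            f[i][1] += s * dy;
            f[i][2] += s * dz;
        }
    }
    return f;
}
```

The doubly nested loop is exactly where the N² cost comes from: every particle visits every other particle, which is what the multipole-based methods avoid.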

In addition to its low complexity, the Fast Multipole Method (FMM) is highly parallelizable. This is very important if one wants to exploit the supercomputers provided by a research center like Forschungszentrum Jülich, because these machines are obviously made up of a great number of cores. But parallelizability is also important for exploiting one of the latest trends in HPC, general-purpose GPU computing. In fact, here at the research center there are a number of machines available for the development of scientific applications on GPUs, including a cluster of about 200 nodes equipped with two Nvidia Tesla cards each. So it’s no surprise that the FMM was already implemented in CUDA, the language created by Nvidia for programming its GPUs.

So you may ask: “what are you doing there?” Well, CUDA is unfortunately limited to Nvidia cards; but Nvidia is not the only vendor of GPGPU solutions — AMD also provides GPGPUs. So the challenge is: can we make this application portable? In other words, can we write it in such a way that it runs on different GPU systems, without changing the code, and still with high performance? This is the question we are trying to answer, and I’m trying to help as much as I can.

Up to now we have investigated a couple of different solutions. If you know something about the field, you are probably already thinking of OpenCL, the standard for GPU computing developed by the Khronos Group, and this is indeed the first approach we explored. Unfortunately OpenCL C, the language used to write the kernels which run on the GPU, is derived from C; this means that the nice features which characterize C++, such as templates, function overloading and classes, are not available. This might not be a problem in general, but in our case it was, because the existing code for the FMM already made extensive (and optimized) use of classes and templates (CUDA is actually much closer to C++, so it allows this), and rewriting the kernel parts in plain C would essentially mean going backwards in time. A possible workaround was an extension to OpenCL C, proposed and implemented by AMD, which allows the use of classes and other C++ features in kernel code. Unfortunately, Nvidia and other vendors have not adopted this extension, so using it would mean losing portability, defeating the main purpose of the project.[1]

So we changed track a bit (but not that much) and switched to C++ AMP (Accelerated Massive Parallelism), an open specification by Microsoft for the acceleration of applications via GPUs and other devices. This might seem like throwing away all the work done up to now, but the C++ AMP compiler we are using actually compiles to OpenCL C, so knowing it already is definitely useful. To summarize, what I’m doing right now is, on one hand, trying to learn and understand C++ AMP, and on the other hand trying to port (and understand!) the existing C++/CUDA implementation of the FMM to that new language. Of course, the eye is always on performance, so the end goal is to see how much we can achieve with this implementation compared to CUDA. (Yes, it’s a bit like a race. ;)

On a more personal note, I must say that I’m really enjoying my work and my stay here, and I’m very grateful for this opportunity. I’ve already learned a lot of useful things; being a physicist, I don’t have the deep background in computer science that many of my fellow SoHPC-ers (not to mention my colleagues here) have, so this is a great chance for me to consolidate and expand my skills. The accommodation I was provided is also quite nice: I live at a guesthouse specifically made for visitors of the Forschungszentrum, and the cool thing is that I have a small apartment all to myself. Jülich is not a tourist city, but it’s well maintained and has plenty of green areas, which is really nice. I was also given a bike to get to work, so I can enjoy nature every day… but if it rains, I can easily take a bus and/or a Rurtalbahn train.

I hope you have enjoyed reading this post. See you in a month!



This photo was taken from the Seecasino, the canteen where we go for lunch every day. Nice place, isn’t it?


This is the view from the top of my guesthouse. That one in the middle is the Jülich citadel.

[1]    In case you’re wondering, it’s not (yet) possible to compile OpenCL code with the AMD platform environment and bring the compiled binary to other platforms, because there isn’t a widely adopted binary standard. An effort is under way with SPIR, the Standard Portable Intermediate Representation (again by the Khronos group), but it has yet to see a wide implementation by vendors.

The SoHPC Experience

Accelerating molecular dynamics with GPUs – part 2

Hello, my name is Marco Borelli, and this is my second report from Jülich, Germany.

As you may recall from my first article, I was assigned to work at Forschungszentrum Jülich on a project which involved porting the Fast Multipole Method, an algorithm for the fast summation of interactions between point charges, to GPUs and other accelerators. After some investigation of the different technologies (languages and APIs) available for the task, and after some tests with OpenCL, we settled on C++ AMP, a specification by Microsoft which defines a set of libraries and language extensions that allow accelerator code to be written directly in C++. In this report I will expand a bit more on the advantages of this technology, and briefly present the results we obtained.


Why use C++ AMP

The main advantage of C++ AMP over its competitors is the tight integration with the original language, C++. To run a piece of code on an accelerator, you simply call the library function parallel_for_each() with three arguments: an accelerator_view (optional), an extent representing your computational domain, and a functor object or lambda function representing the code to be executed. The functor or lambda should take one argument, an index, which will represent the “position” of a particular GPU thread within the whole computational domain.

As you can see, all these constructs belong natively to the C++11 language, and almost all of the specification is implemented via libraries; the only additions to the language itself are the restrict keyword and the tile_static type qualifier. The former instructs the compiler that a function is to be checked for compatibility with, and compiled for, execution on an accelerator, and the latter is similar to the __local keyword in OpenCL, which specifies that a variable should be shared among a block of GPU threads.

This is what a simple parallel vector addition looks like in C++ AMP:[1]

void ParallelVectorAdd(int a[], int b[], int sum[], int size) {

    using namespace concurrency;

    // Create C++ AMP objects.
    array_view<const int, 1> av_a(size, a); // Views over input data
    array_view<const int, 1> av_b(size, b); // ''
    array_view<int, 1> av_sum(size, sum);   // View over output data
    extent<1> ext(size);                    // Define the computational domain

    parallel_for_each(ext,                  // Run the lambda over the domain
        [=](index<1> idx) restrict(amp) {   // The lambda starts here
            av_sum[idx] = av_a[idx] + av_b[idx];
        });
}

As you can see, this code is completely integrated into the language. Also, thanks to the use of a lambda function, which automatically captures the array_view objects, all necessary data buffers are transparently copied to the accelerator at the beginning of the parallel code, and back again after completion if necessary. In OpenCL, you would have to write a kernel in OpenCL C and set up an infrastructure in host code to call that kernel and to copy the necessary data to and from the accelerator. In OpenACC you may not need to heavily modify the original code, but you would still have to add preprocessor directives, effectively writing code in another language, which we could call the “OpenACC preprocessor language”. The language integration that characterizes C++ AMP is doubly beneficial: the user doesn’t have to learn a new language, and it makes it possible to extend a basic C++ compiler to correctly parse C++ AMP. Indeed, the compiler we have been using is a modified Clang compiler by MulticoreWare, Inc.[2] It should be noted that these advantages already characterize CUDA, which is another reason to pursue them: we already had a functioning implementation of the FMM in CUDA C/C++.


Results of the project

By using C++ AMP we were able to easily port the near-field part of the Fast Multipole Method so that it runs on an AMD S10000 accelerator. Performance was very good, and the resulting code was very similar to the original CPU-only and CUDA versions: only a few classes were modified, and the overall structure of the project, as well as the computational kernel, was left unaltered.

Due to time constraints we couldn’t implement multi-GPU execution in the actual FMM code, but preliminary tests were made on a simpler test code, where we were able to achieve concurrent multi-GPU execution by using std::thread and targeting every accelerator from a different host thread; it should therefore just be a matter of implementing the same technique in the “real” code.
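The scheme from those preliminary tests — one host thread per accelerator, each driving its own slice of the work — can be sketched in plain C++11. The function and parameter names below are illustrative, and `run_on_device` stands in for the real C++ AMP dispatch, which would create an accelerator_view for its device and call parallel_for_each on it:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split n_items of work among n_devices accelerators, one host thread per
// device. In the real code, run_on_device would pick the accelerator_view
// for device `dev` and launch parallel_for_each over the range [begin, end).
void multiDeviceRun(std::size_t n_items, std::size_t n_devices,
                    const std::function<void(std::size_t dev,
                                             std::size_t begin,
                                             std::size_t end)>& run_on_device) {
    std::vector<std::thread> threads;
    const std::size_t chunk = (n_items + n_devices - 1) / n_devices;
    for (std::size_t dev = 0; dev < n_devices; ++dev) {
        const std::size_t begin = dev * chunk;
        const std::size_t end = std::min(n_items, begin + chunk);
        // Each device gets its own host thread, so blocking kernel launches
        // and data transfers on different devices overlap in time.
        threads.emplace_back(run_on_device, dev, begin, end);
    }
    for (auto& t : threads) t.join();  // Wait for all devices to finish
}
```

Because every thread works on a disjoint range, no synchronization beyond the final join is needed on the host side.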


Personal notes

As far as the personal side is concerned, I have to say that this was an incredible experience for me. I had the opportunity to work in a stimulating and intellectually rich environment, to meet intelligent and funny people, and to see how actual scientific research is done in a renowned research center. I was also given the chance to put my skills to work and to be appreciated for it (despite my novice status), and most importantly, to learn a lot. I am very satisfied with these two months, and there’s no doubt I would apply again.


[1]    Adapted from:


Posted in Project reports 2014
