If you want to know the story from the beginning, I recommend checking out my previous entry if you haven’t already; it covers the project I’m working on this summer and the opportunities you can get if you decide to apply for the programme.

This chapter introduces an emerging approach to managing data that has the potential to break the restrictions set by traditional hardware architectures, and shows how it is used and how it performs in the real world.

What is Non-Volatile Memory?

As you are probably aware, modern computers use different types of storage to accomplish different kinds of tasks. Broadly, every processing unit has two needs:

  • Speed and
  • Persistence

But as of now, each storage device specializes in only one of the two:

  • Registers – the fastest, but there are only a few of them in each processing element and they are volatile
  • Caches – next in terms of speed, but also lacking in capacity and persistence
  • DRAM – last in the “speed race” of non-persistent devices, but with the greatest capacity of them all
  • SSDs and HDDs – the devices that allow data to be saved independently of electrical power, but at speeds not comparable to the volatile storage types above
Memory Hierarchy Pyramid
Source: Intel

So far, the storage landscape has been changing slowly, following a trend of incrementally improving hardware capabilities without any fundamentally new design. That no longer seems to be the case: a new storage device promises speed comparable to current DRAM together with the ability to persist data previously found only in high-capacity devices such as SSDs and HDDs.

This type of storage is called Non-Volatile Memory – nothing special in the name, but with the potential to change the current ways of managing and saving data, which is especially important in high performance computing, where I/O operations are usually an application’s bottleneck. Next, we’ll dive deeper into the concrete hardware I’m using on my project and take a first look at the device in action.

Storage Performance

After finishing the training week, I started working on this summer’s project – rewriting the existing Charm++ fault-tolerance system, which uses online checkpointing, to take advantage of the newest non-volatile memory technology.

Courtesy of EPCC, my teammate Roberto and I have the chance to use their new cluster, NextGenIO, whose nodes are equipped with Intel’s Optane persistent memory modules.

Intel Optane Persistent Memory
Source: Intel

The memory provides two distinct modes in which it can be used:

  • Memory Mode – the module is used as high-capacity volatile memory and the DRAM becomes a last-level cache; good for applications that use huge chunks of memory, but the module is still slower than DRAM, which can hurt the performance of small-memory programs
  • App Direct Mode – the module is used as a persistent device and is mounted into the regular file system, accessible through standard file interfaces within the application; good for redirecting temporary program state from slow disk storage to much faster persistent memory, and for speeding up applications that depend heavily on I/O operations

In our case, we use the memory in App Direct Mode because we need to persist the data every time we take a checkpoint. The first idea we tried was to change the location where checkpoints are saved to the persistent memory now mounted in our file system. Without too much trouble, we immediately achieved the expected speedup.

Checkpointing times with persistent memory used via fsdax compared to regular disk

After accessing the memory using the standard file I/O interface (also called fsdax – file system direct access), we enhanced the previous solution by using the Persistent Memory Development Kit (PMDK) to access the device directly, skipping the overhead of OS file handling (i.e. devdax).
The pmem library gives the programmer an API similar to the existing C interfaces for handling memory-mapped files, but with potential performance gains from accessing the non-volatile memory directly – gains we were able to confirm in our tests.
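To give a flavour of what this looks like in code, here is a minimal, hypothetical sketch using PMDK’s libpmem: the mount point and the buffer contents are made up, while pmem_map_file, pmem_memcpy_persist, pmem_msync and pmem_unmap are the library’s standard C interface.

```c
#include <stdio.h>
#include <string.h>
#include <libpmem.h>   /* compile with: cc checkpoint.c -lpmem */

int main(void)
{
    const char *path  = "/mnt/pmem/checkpoint.dat"; /* hypothetical fsdax mount point */
    const char *state = "application checkpoint data";
    size_t mapped_len;
    int is_pmem;

    /* Create (or open) a file on the DAX file system and map it into our address space */
    char *pmemaddr = pmem_map_file(path, 4096, PMEM_FILE_CREATE, 0666,
                                   &mapped_len, &is_pmem);
    if (pmemaddr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    if (is_pmem) {
        /* Copy and flush to persistence in one call, bypassing the kernel page cache */
        pmem_memcpy_persist(pmemaddr, state, strlen(state) + 1);
    } else {
        /* Fallback for a mapping that is not real persistent memory */
        memcpy(pmemaddr, state, strlen(state) + 1);
        pmem_msync(pmemaddr, strlen(state) + 1);
    }

    pmem_unmap(pmemaddr, mapped_len);
    return 0;
}
```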

Checkpointing times with persistent memory used via the pmem library compared to regular disk

The best way to look at the results is to compare the speed-ups side by side. It is worth noting that these results are experimental and not yet confirmed by larger-scale tests, but we don’t expect the relative gains we have obtained so far to change much.

Checkpointing speed-ups of both approaches compared to regular disk

The next goal is to make the whole system fully transactional, which will certainly affect the overall execution of the application; but with the help of this new memory, it may be feasible to get acceptable performance with full fault-tolerance capability! Stay tuned to find out!

As announced in the last post, here you can find out how sorting the ducks is connected to GPU performance.

Rotation

When using the Fast Multipole Method, the goal is to obtain a certain physical quantity, e.g. a force, at a particular point in a particular box (the target box) of the considered system. To do so, all the little boxes in the system need to be rotated and shifted to the target box. Here, only the rotation is considered. The implementation can be described with the following steps:

  1. Build a pyramid that contains the rotation coefficients, which are based on some fancy maths.
  2. Build a triangle containing the multipole to be rotated.
  3. Build another triangle to store the rotated multipole.
  4. Multiply the pyramid entries with the corresponding entries of the multipole to be rotated, as shown in the image below (a code sketch of this step follows the figure).
Data structures used for the rotation. To obtain the red box of the rotated multipole (on the left), the red boxes in the pyramid and in the input triangle on the right need to be multiplied.
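Implementation-wise, step 4 boils down to a few nested loops. The sketch below is only my own illustration of the idea, not the FZ Jülich code: the flattened layout of the pyramid and the triangles is invented for this example, and only coefficients with non-negative m are shown (real codes also handle negative orders and a specific rotation-coefficient convention).

```c
#include <complex.h>

/* Index of coefficient (l, m), 0 <= m <= l, in a triangle stored row by row */
static int tri(int l, int m) { return l * (l + 1) / 2 + m; }

/* Offset of the (l+1) x (l+1) slab of rotation coefficients for order l
   inside a flattened "pyramid" array (1^2 + 2^2 + ... + l^2 entries before it) */
static int slab(int l) { return l * (l + 1) * (2 * l + 1) / 6; }

/* Step 4 of the recipe: every rotated coefficient of order l is a weighted
   sum of the input coefficients of the same order, the weights being the
   rotation coefficients stored in the pyramid.                             */
void rotate_multipole(int p, const double complex *pyramid,
                      const double complex *M, double complex *M_rot)
{
    for (int l = 0; l <= p; ++l)                 /* each floor of the pyramid    */
        for (int mp = 0; mp <= l; ++mp) {        /* boxes of the output triangle */
            double complex sum = 0.0;
            for (int m = 0; m <= l; ++m)         /* boxes of the input triangle  */
                sum += pyramid[slab(l) + mp * (l + 1) + m] * M[tri(l, m)];
            M_rot[tri(l, mp)] = sum;
        }
}
```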

After spending some time playing with these and similar blocks, and sorting out in their minds how it all works, students have shown unusual behavior patterns while carrying out basic everyday routines. An instance of this behavior occurred after a long day of coding, during one student’s attempt (let’s call him Student X) to brush his teeth before sleeping. Namely, all the lights were off and Student X spent three full minutes switching the bathroom light on and off, contemplating the possibility that the light bulb had already burned out, only to realize that the bathroom door was closed. A basic behavioral analysis was enough to diagnose that these building blocks remain trapped inside the student’s mind even after working hours. At a late hour of the night, these blocks transform into little rubber ducks and other shapes, and start rotating around and multiplying, making the student act silly. Student X happens to be myself.

Therefore, the goal is to sort the ducks out to improve the GPU performance. Whether or not that is accomplished will be visible in the next blog post, where you can expect some performance graphs!

Let me know in the comment section below if a certain part should be further clarified and thank you for reading!

Welcome again, reader. I am Jerónimo Sánchez, one of the students working with EPCC this summer. If you want to know more about the projects I will be working on during Summer 2020, don’t forget to read my first post here.

To sum up that post, I am working on a parser tool for IOR files that allows performance reports to be created quickly from them. Before this tool, that was done manually, so it is a relief for the people working with IOR who want to measure the performance of their systems.

Right now, the parser tool is mostly done; I am just finishing the Wiki about the project and adding some more documentation for maintenance, which you can find in the following GitHub repository. Here is a sneak peek at how the parsing tool works:

IOR parser usage message.

As you can see, the tool is not very complex in terms of input parameters, but everything twists when the output report is an Excel file.

Here is a short example of how the parser tool works:

Which outputs:

A rather long log file.
And an Excel file similar to this one.

If you want to learn about Excel formatting, I recommend reading this series of posts from Microsoft. Also, for a practical example, read the Wiki entry in the GitHub project once it is done.

Furthermore, the Excel template can be customized and loaded at runtime, although the template creation process is a bit tedious.

Also, during this week, I have started working on a problem related to OMPI and ARM. This problem is called OMPI pinning.

The aforementioned problem is that the OMPI (Open MPI) library expects Intel-style numbering when requesting the ID of a hardware thread, but ARM reports this data in a different format, mangling the threads the moment the user requests a specific one; as a result, the programmer/user cannot, for example, use only one thread per core.

At the moment, I am still researching this problem, hoping to gain some insight into it. The final goal is to create a tool in which users specify the threads they want to use and are given a console command; when that command is used, OMPI will still mangle the threads, but the resulting placement will be the desired one.

During the following weeks, an update post about the OMPI pinning problem will be uploaded. This is all for now. Thanks for reading!

Is it possible to account for all the particles in the system without actually iterating through each of them every time? Read this post to get an idea about my summer project and the groundwork for the answer to the title question!

Problem/Motivation

Particle interactions are a common subject of simulations in fields like Molecular Dynamics and Astrophysics. Imagine computing the forces on each planet in our solar system caused by all the other planets. For each of the planets, one would need to add up the contributions of all the others to the total force, one at a time. Assuming all the data is available, the task is computationally trivial. However, in typical simulations, where particles are counted in trillions, computing anything with this approach would take a lifetime even on the most sophisticated computing architectures.
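To make the brute-force approach concrete, here is a minimal sketch of that direct summation (gravitational forces, every pair visited; names are chosen purely for illustration):

```c
#include <math.h>

typedef struct { double x, y, z, mass; } Body;

/* Direct O(N^2) summation: for every body, add up the gravitational force
   exerted by every other body, one pair at a time.                        */
void compute_forces(const Body *b, double (*force)[3], int n, double G)
{
    for (int i = 0; i < n; ++i) {
        force[i][0] = force[i][1] = force[i][2] = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double s  = G * b[i].mass * b[j].mass / (sqrt(r2) * r2);
            force[i][0] += s * dx;   /* every extra particle adds work for  */
            force[i][1] += s * dy;   /* every other one -- exactly what the */
            force[i][2] += s * dz;   /* FMM avoids by grouping particles    */
        }
    }
}
```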

Solution

A different approach is used in one of the most important algorithms of the 20th century – the Fast Multipole Method (FMM). It allows faster computation of particle interactions by grouping particles together. Each group is represented by a single expression called the multipole expansion. The multipole expansion is an approximation of the impact the considered group has on its environment. When computing a force at a certain point, instead of accounting for each particle in the group, only this single expression is considered. That is how the complexity is reduced!
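For the mathematically inclined, and up to normalization conventions, this “single expression” is the familiar truncated spherical-harmonic expansion of the group’s potential, with p the expansion order and q_i the charges (or masses) in the group:

```latex
\Phi(\mathbf{r}) \;\approx\; \sum_{l=0}^{p} \sum_{m=-l}^{l}
    \frac{M_l^m}{r^{\,l+1}}\, Y_l^m(\theta,\varphi),
\qquad
M_l^m \;=\; \sum_{i} q_i\, r_i^{\,l}\, Y_l^{m\,*}(\theta_i,\varphi_i).
```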

3D visualization of a multipole in front of a Coulomb potential surface from a real-world simulation.

How to Implement Such a Thing?

FMM includes a series of steps, but for brevity I will focus on only one for now. In real-world simulations, the domains of interest are three-dimensional. Imagine a cubic box, which I will simply refer to as a box. This box is divided into a lot of small boxes, each represented by a single multipole expansion. Consider a point at which we want to compute the potential, and the box in which that point is located. We need to compute the impact of all the other boxes on our box. This impact computation is done by the so-called Multipole-to-Local (M2L) operator, which is exactly the part to be improved in my summer project.

Recipe for Speed

The current implementation, developed at FZ Jülich, rotates the multipole twice, followed by the translation into the box of interest. It can be shown that, instead of the two rotations, we can rotate in a different manner six times to reduce the complexity. Each of these six rotations is individually cheaper than the two currently implemented, and when everything is put together, I expect to see some speed gained.

Multipole expansion representation up to the third order. Each block holds a complex number

Ducks?

It requires some effort to grasp the maths behind the creation, rotation and shifting of multipoles, which makes it too broad for the scope of this blog. However, implementation-wise, it comes down to building data structures to represent these concepts, and that is exactly where the fun starts! Above you can see how the fancy multipole from the first picture can be represented in the computer. The coefficients of the multipole expansion are stored in the blue boxes: the first box on the left holds the monopole coefficient, the two boxes stacked on top of each other hold the dipole coefficients, and so on.
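As a rough sketch (names and layout of my own choosing, not the actual project code), such a triangle of coefficients can be stored as a flat array indexed by the order l and the sub-index m:

```c
#include <complex.h>
#include <stdlib.h>

/* A multipole expansion up to order p stored as a flattened triangle:
   order l contributes l+1 coefficients (m = 0 .. l), so the total size is
   (p+1)(p+2)/2 -- the first box is the monopole, the next two the dipole,
   and so on, just like the blue boxes in the picture.                    */
typedef struct {
    int p;                   /* highest order kept in the expansion */
    double complex *coeff;   /* (p+1)(p+2)/2 complex numbers        */
} Multipole;

static int tri_index(int l, int m) { return l * (l + 1) / 2 + m; }

Multipole multipole_alloc(int p)
{
    Multipole mp = { p, calloc((size_t)(p + 1) * (p + 2) / 2,
                               sizeof(double complex)) };
    return mp;
}

/* Access coefficient (l, m), with 0 <= m <= l <= p */
double complex multipole_get(const Multipole *mp, int l, int m)
{
    return mp->coeff[tri_index(l, m)];
}
```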

To win exclusive bragging rights, write your idea of how the ducks are linked to the story in the comments section below! Otherwise, you can find the answer in the second part of this post. For the outcome of the new way of rotating, stay tuned for further posts!

Background

Hey everyone! I’m Cathal and I’m super excited to take part in the Summer of High Performance Computing (SoHPC) 2020. I have recently graduated from NUI Galway in Ireland, completing a BSc in Computer Science & Information Technology. In my spare time I like to exercise by running, climbing a mountain or anything that works up a sweat. Additionally, I have been playing the banjo since the innocent age of 8. Irish traditional music is one of the things that makes Ireland very special! Here’s a video to give you a taste of the music.

Myself and friends playing a few tunes in McGann’s pub situated in Doolin, Co. Clare.

Training Week

I intend to pursue an MSc in HPC with Data Science at the University of Edinburgh this coming September. I guess applying to SoHPC was intended as a good introduction to what will come in the masters. The remote training week was definitely the educational experience I needed, introducing me to HPC topics such as supercomputing, parallelization, OpenMP, MPI, and my favorite topic, CUDA programming for GPUs. Hats off to all the organizers and mentors who made the training week happen so flawlessly.

This blog would never be complete without a ‘hello world’ example!
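In that spirit, here is a minimal MPI-flavoured ‘hello world’ in C, the kind of thing we played with during the training week (any of the other flavours, OpenMP or CUDA, would do just as well):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */

    printf("Hello world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```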

My SoHPC Project

I will be taking part in project #2018, called ‘Time Series Monitoring for HPC Job Queues’, hosted by SURFsara in Amsterdam. The goal of the project is to create a monitoring system which captures real-time information about the number of jobs running on HPC clusters, while integrating the work with DevOps and Continuous Integration practices and tools. Job queue information will be collected, processed and stored as time series, then summarized and made available to users and administrators at SURFsara in the form of graphs. I hope my next post will be able to describe the tools, technologies and inner workings of the project in more depth.

Thank you for reading, Cathal.

The less technical explanation of my project… meme format.

Today I’ll explain how hybrid programming can cook the biggest pizza in the world. Our job for today is to cover the surface of Europe with pizza! The task is: “Make a pizza based on the geospatial coordinates where the pizza is being prepared, so that you use only local ingredients and toppings typical of the region, yummy!” So if the pizza is being cooked in Spain it will have something like jamón serrano, or in Austria it could be Wiener Schnitzel.

To achieve this we will need several teams, and keeping them organized will be your job.

You will need to split the area of continental Europe into regions, so each team will take a piece resulting from the problem’s domain decomposition.

My idea is to organize the teams the way I’m used to doing with MPI plus OpenMP – this is actually a way of doing hybrid programming, but wait, I’ll tell you more! Every team will have a head chef, or MPI process, and we will refer to each of them by a number, or rank. At the beginning I will fix how many chefs I want and where on the map I want to pin their regions. Every chef gets to choose his helpers, but I’ll fix their number (hey, I’m still paying them and it’s expensive!) and he has a limit himself. We call them “threads” and they will split the workload of the region between them, let’s say into counties.

Each MPI process… hmm, sorry… chef… looks around and discovers that he can communicate in the directions North, South, East and West, e.g. process 4 can communicate with chefs 3, 5, 7 and 1.

What a virtual topology may look like…

So they understand that there is some kind of ordering and they realize they are in a topology. They can send and receive any mail or boxes of ingredients to and from every other chef, but the postal service is faster between neighbors, so to keep this advantage we will restrict communication to neighbors only.
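In MPI terms, this chef’s-eye view of the map is a Cartesian virtual topology. A minimal sketch (the grid dimensions are chosen automatically here, purely for illustration) could look like this:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {0, 0}, periods[2] = {0, 0}, size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);             /* let MPI split Europe into a grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    int north, south, west, east;
    MPI_Cart_shift(cart, 0, 1, &north, &south); /* who is above/below my region */
    MPI_Cart_shift(cart, 1, 1, &west, &east);   /* who is left/right of it      */

    printf("Chef %d talks to N=%d S=%d W=%d E=%d\n",
           rank, north, south, west, east);

    MPI_Finalize();
    return 0;
}
```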

To solve this kind of problem I could use a stencil code. It’s an algorithm where the chef and his helpers perform a sequence of sweeps through a grid, the region given to each chef. They divide it into 1 m² squares, which can be referred to as cells. In each sweep, the team updates all elements of the grid according to a function or rule of the neighboring elements in a fixed pattern (called the stencil). In stencil codes you can modify your cells in place, or keep a copy of the grid that you swap in at the end of the sweep. Boundary cells sometimes have to be adjusted during the task as well. The teams stop when a certain error coefficient becomes small enough; this coefficient measures how far they are from my idea of pizza.

The teams start by setting the grid to an initial value: margherita pizza, or zeros in computer-memory terms.

The rule for placing a topping will be a function of the GPS coordinates (x, y) where the worker is located at that moment and of the neighboring cells, and there will be some rules, for our purposes, about toppings that cannot be close to each other, so that the food will be delicious. For example, you can’t put onions close to more onions, or pineapple close to ham (I’m joking, do whatever you want for this last one!).

The stencil for our workers will look like this:

How stencils work.

In order to put a topping in a cell (x, y) we need to know the other cells highlighted in the image and act accordingly. As the teams move along the grid they will reach the boundaries between regions.

Now the teams have to deal with interconnected problems, since the topping rule affects their work on the boundaries. Neighboring teams have to communicate and discover what their closest coworkers are doing behind the border walls.

Since the doughs from the different regions are not actually touching, each team has to add a stripe of dough duplicating the part of the pizza belonging to the other team that they cannot reach, so that they can keep track of which ingredients the neighboring team has been using. We will call these stripes “halo data”. This is a waste of dough (or of memory) because it is just a duplication of what has already been done. N.B. A Dirichlet condition means that the cells beyond the boundary keep the same constant value, like a stuffed crust that contains the same ingredient all along the perimeter of the pizza.

To update the halo data, each MPI process will send to and receive from another MPI process. The starting position and destination of the data/ingredients is exactly as in the image for a 2D problem.

Now they will know what to put, or not to put, close to those ingredients based on the stencil. This communication is time-consuming: every time there is halo communication, everything stops until the chef knows what the other workers have placed (the postal service has its own latency), and then placing the ingredients on the halo also takes some time.
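For the curious, a halo update in one direction typically looks like the sketch below: a 2D grid stored row by row with one ghost row on each side, and variable names that are purely illustrative.

```c
#include <mpi.h>

/* Exchange one ghost row with the north and south neighbours.
   grid has (rows+2) x cols entries: row 0 and row rows+1 are the halo. */
void exchange_halo_rows(double *grid, int rows, int cols,
                        int north, int south, MPI_Comm cart)
{
    /* send my first real row up, receive the halo row coming from below */
    MPI_Sendrecv(&grid[1 * cols],          cols, MPI_DOUBLE, north, 0,
                 &grid[(rows + 1) * cols], cols, MPI_DOUBLE, south, 0,
                 cart, MPI_STATUS_IGNORE);

    /* send my last real row down, receive the halo row coming from above */
    MPI_Sendrecv(&grid[rows * cols],       cols, MPI_DOUBLE, south, 1,
                 &grid[0],                 cols, MPI_DOUBLE, north, 1,
                 cart, MPI_STATUS_IGNORE);
}
```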

BONUS Overlap: Can we do faster?

We could overlap communication and computation. The OpenMP threads, the helpers, can not only divide the work to make everything faster, they can also do “tasking”. It means that one of them spends some time at the beginning of an iteration defining tasks for himself and his colleagues: actions to accomplish, with some resources, in some area of the region, whenever they can – in our case the boundary communication and “computation”. In the next iteration some of them perform the stencil on the inner cells while others do the boundary tasks and then go back to work on the stencil as soon as possible. Eventually, when they reach the boundaries of the internal region, the communication will already be done.
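One common way to express this overlap is sketched below: one helper creates a task for the boundary communication while the rest of the team sweeps the inner cells. The function names are placeholders for the routines described in the text, not code from a real application.

```c
/* Placeholder declarations for the routines described in the text */
void exchange_and_update_halo(double *grid);
void update_row(double *grid, int row);

void one_iteration(double *grid, int rows)
{
    #pragma omp parallel
    #pragma omp single            /* one helper sets up the work...           */
    {
        #pragma omp task          /* ...the boundary communication is a task  */
        exchange_and_update_halo(grid);

        #pragma omp taskloop      /* ...the inner sweep is split into tasks   */
        for (int row = 2; row < rows; ++row)   /* rows not touching the halo  */
            update_row(grid, row);

        #pragma omp taskwait      /* everything done: the boundary rows can   */
        /* now be updated with fresh halo data (omitted here)                 */
    }
}
```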

For today this is already a lot of stuff… See you next time!!!

For news on the project and my personal achievements you will find everything here!

————–The author: Kevin Mato———–

Linkedin: https://www.linkedin.com/in/kevin-mato-657444173/
Github: https://github.com/KevinMTO

I have just noticed a little problem in my workplace: to be productive I need both fresh air and a calm, silent place. I have two options: I can keep my door open and let the fresh air in (the air conditioning is in a nearby room), or I can close it to limit the noise from outside. Both options have downsides: if I close the door, the temperature in my room will rise, making it uncomfortable; if I leave it open, I tend to be distracted by the noise outside. It’s a lose-lose situation.

I know what you may be thinking: “Why should this interest us?”. My answer to that is: “Well, I don’t know, but I find it interesting that this kind of situation is very similar to many that happen in Computer Science”. I’m talking about the fact that the ideal memory should be fast (like RAM) and keep data even after power-off (like the disk), but we can obtain only one of the two properties, not both.

The problem is made even worse by the fact that data transfer to disk is usually a critical spot in HPC applications: a single data transfer can take more than 100 times longer than one to RAM. But applications must use the disk, since it’s the only way to make the data persistent. There seems to be no escape from the problem. But there is a solution.

Persistent memory (also called non-volatile memory or NVRAM) does what its name says: it is memory (similar in performance to RAM) which can hold data even after power-off. It combines the merits of the two technologies and drops most of the downsides. It’s also configurable to achieve the best performance depending on use: if you don’t need persistence, you can opt for Memory Mode, in which the RAM is used as a last-level cache for the slower NVRAM, achieving a transparent increase of available memory. If you care more about persistence, you can switch to App Direct Mode, where it is possible to obtain full control of the features of persistent memory or simply turn it into a filesystem (fsdax). Both approaches come with performance gains: App Direct Mode is faster than working with the disk and Memory Mode is better than swapping to a dedicated partition. We measured these gains during our tests on the NEXTGenIO cluster and we will show you more concrete numbers in future posts, after further refining and optimization.
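To illustrate why the fsdax route needs so few code changes, here is a minimal sketch in C; the mount point is made up, and a real application would add proper error handling and, ideally, PMDK’s user-space flushing instead of msync:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Minimal sketch: once the persistent memory is exposed as an fsdax file
   system (the mount point below is hypothetical), ordinary file APIs just
   work -- which is why redirecting checkpoints there is so simple.        */
int save_state(const char *data, size_t len)
{
    int fd = open("/mnt/pmem0/state.bin", O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    memcpy(map, data, len);      /* loads/stores go straight to the module */
    msync(map, len, MS_SYNC);    /* make sure the data is durable          */

    munmap(map, len);
    close(fd);
    return 0;
}
```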

One of the many nodes of the NEXTGenIO cluster: dual-CPU Intel® Xeon® SP nodes with up to 56 cores, each with 192 GB of conventional DRAM and 3 TB of NVRAM.

Persistent memory, while a very promising technology, isn’t widely used as of today: development is still in progress to reduce costs and increase performance. As of now, only a few applications can fully exploit the functionality it comes with; while that number is expected to grow, it can be a constraint on development. Putting persistent memory support inside frameworks can speed up the adoption process: this is the philosophy behind our project. So far, we’ve managed to change Charm++ checkpointing to allow persistent memory to be used as a filesystem, and in the coming weeks we will dig deeper into the code to achieve even more performance.

Those are the basics of persistent memory. If you are interested in this topic and want to know more about how to fully exploit its performance, take a look at this online course held at EPCC a few months ago. And if you have a solution to my door problem, please tell me, because I still haven’t found one!

Hello everyone, again!

I am very sorry for not writing to you for three weeks; these have been three weeks in which we haven’t stopped adding changes and analyzing data in order to show you what we are achieving. In these weeks we have encountered great challenges that had to be resolved without knowing whether we would manage it; now I can assure you that we are on the right track. I’ll take no more of your time and start telling you.

Our project started slowly, with many difficulties and without a clear path to follow – you can find all the details in my previous post. In the same post I also told you that I finally decided to work on graphene. The study of this material is growing due to its unusual properties, which range from extreme mechanical strength and lightness, through unique electronic properties, to several anomalous features in quantum effects.

Carbon nanotubes structure

Experimental research on graphene includes simulations in which the properties of strongly interacting matter are studied, and these can build on the experience gathered in Lattice QCD. Such simulations require the repeated computation of solutions of extremely sparse linear systems.

My team, at the Jülich Supercomputing Centre (JSC), has developed a Hybrid Monte Carlo (HMC) method to calculate graphene’s electronic properties using lattice-discretized quantum field theory, a model widely used to make particle physics predictions using HPC. So the objective of the project is to understand how the algorithm works, the parts that surround it, and which parts of the code are most demanding for the processor.

Specifically, in these three weeks, we have been working on understanding how to parallelize the computation of solutions of extremely sparse linear systems, because it accumulates a large amount of computation and therefore becomes a bottleneck (more technical detail about the structure of the algorithm can be found in the original paper, Accelerating Hybrid Monte Carlo simulations of the Hubbard model on the hexagonal lattice).

Now we only have to get down to work and turn all the theory we have accumulated into practical improvements to the algorithm. We have started with the HPC and OpenACC part: testing the algorithm and analyzing its performance. Now we get to everything interesting and the main focus of SoHPC, so don’t stop me now, let’s go.

About me

Hey, my name is Ömer. I am a computer engineering graduate student at Boğaziçi University. My background is in a different field, industrial engineering. During my undergraduate years, I became interested in computer science. Therefore, I decided to pursue a master’s degree in computer engineering.

Application

For the last year, I have been learning a lot about the field. I have also been looking for opportunities to expand my knowledge, see new places and meet new people. One day, I got an email from my professor about the PRACE Summer of HPC 2020 Programme. When I saw the flyer of the programme, it looked like the perfect event for me. It had everything I was looking for.

Training week

I would have preferred to be in Vienna, but this is still a nice group photo.

I was glad when I heard SoHPC was going to be online instead of being cancelled. I consider myself lucky because the field of computing is more immune to viruses than most other fields. The people I met in the training week were really nice and enthusiastic. The training week went smoothly and I had a great time. I had crash courses on MPI, OpenMP, CUDA, the Linux command line and even yoga. I would like to thank everyone who contributed to the training week. I am very excited about Summer of HPC 2020. I hope everyone will have a great summer.

I am assigned to project 2022 – Novel HPC Parallel Programming Models for Computing (both on CPU and GPU). I will be writing about the project in a future post. To end this post, I wish I were posting photos from Vienna and Luxembourg, but there is not really much I can do about that. Instead, I will be posting beautiful views from İzmir.

Sunset in İzmir

DPD simulation of spontaneous vesicle formation of amphiphilic vesicle: formed vesicle (left) and cross-section with encapsulated water (right)
image credits : www.scd.stfc.ac.uk

Hello world! In this post I’m going to introduce myself. I’m Davide Crisante, I’m from Italy and I study Engineering and Computer Science at the University of Bologna.

I had my first encounter with HPC thanks to a university course; from that course I started to really like HPC, and while looking for a way to dive deep into that world I found the Summer of HPC programme! I really like thinking about approaches to solving every kind of problem, so the Summer of HPC will be a really pleasant way to spend my summer!

I’ve been assigned to the project Scaling the Dissipative Particle Dynamic (DPD) code, DL_MESO, on large multi-GPGPU architectures. My colleague Nursima ÇELİK and I will work on the DL_MESO software, a solver for mesoscale simulations of the behaviour of complex fluids.

image credits : www.sciencedirect.com/

The purpose of this project is to port and benchmark some functions of the Dissipative Particle Dynamics section of DL_MESO from FORTRAN to CUDA/C.

I really would have liked to write this post while enjoying the cool morning mist in Edinburgh. But here I am, at my home in Izmir, Turkey, suffering from a heat stroke.

I am a computer engineering student here at my hometown’s university, IZTECH. I’m trying my hand at every subject of CS that I can. I played with web development and machine learning, and then I learnt about the opportunity to actually spend my summer working in a whole new area, in a whole new country. Needless to say, I was overjoyed. A work opportunity that actually requires you to travel, to learn a new technology, to go beyond your boundaries in every way possible.

Then a pandemic occurred and my expectations were flipped on their head. Yeah, great. I thought I was going to miss the most exciting summer I was ever going to have. And I would have, if it weren’t for SoHPC going online. Surely, it can’t compare to changing continents… but it really is the most amazing opportunity. I have access to a cluster miles away from here, and I can actually make changes on it. It really blows my mind that all I need is my computer and a connection.

So, how hard has it been? I have the greatest mentors and colleagues over at EPCC and I’m working with Fulhame (project 2006, if you would like to check it out). But it’s really hard to feel like you’re doing something worthwhile for your project when you can’t see the results or interact with the people involved. I’m testing a new set of instructions and… they work. But it doesn’t feel solid to me.

Enough of the morbid whining though. I really am grateful for such an experience, creating beautiful memories and unforgettable bonds with new people. I will soon be writing a whole new post about Fulhame (and clusters in general), and maybe about the differences between Intel-based and ARM-based HPC instruction sets.

Stay healthy folks!

This is me…

Hi there, this is a brief introduction about myself and what I’m working on recently.

About Me

My name is Cem. My education has zigzagged from engineering to astrophysics, but I finally found myself in the world of scientific computation. Nowadays I’m finishing my Master’s degree in Computational Science and Engineering at Istanbul Technical University.

I enjoy thinking about mathematical modeling of natural and social phenomena. Being able to describe something we see in nature in terms of mathematics, and to write it down in a form that computers can understand and solve, always fascinates me. I think the best part of computational studies is that you can focus on any topic you are interested in, from the behavior of people to the universe. The problem we are concerned with can be as small as the Covid-19 virus or as big as a supermassive black hole! So there is no chance of getting bored as a computational scientist!

Besides this, I enjoy outdoor activities and amateur astronomy. Whenever I have free time, I throw myself into the mountains and forests, where I can see a clear sky above and hear the beautiful sounds of nature.

This is me on Montserrat, Catalonia

SoHPC Experience

When I was invited to PRACE’s SoHPC programme, it was such a happy moment for me, because I was sure that I would learn a lot during this programme. I joined the team of the High Performance Machine Learning project (#2001) with a three-week delay. Even though I started the programme late, I have already learnt new HPC concepts such as GASPI.

These days I’m working on implementing a popular machine learning algorithm, called Gradient Boosting, in the C language from scratch. I have used Gradient Boosting before, via Python’s machine learning module Scikit-learn. With a ready-to-use library in hand, it was pretty easy to apply, no matter how complex the algorithm. Writing a complex machine learning algorithm by myself is a completely different challenge.

As the most natural way to start, we are currently focusing on writing a serial code. In the next step, we will parallelize the algorithm from the different perspectives of GASPI and MPI, so that we obtain two codes with different parallelization strategies and can compare their performance.
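To give an idea of what “from scratch” means, here is a toy serial sketch of the core loop: gradient boosting for squared loss with decision stumps on a single feature. It is only an illustration of the algorithm, not the project code.

```c
#include <stdlib.h>

/* A decision stump on one feature: predicts left_value if x < threshold,
   right_value otherwise. */
typedef struct { double threshold, left_value, right_value; } Stump;

/* Fit a stump to the residuals by trying every sample value as a threshold
   and keeping the split with the smallest squared error. */
static Stump fit_stump(const double *x, const double *r, int n)
{
    Stump best = { x[0], 0.0, 0.0 };
    double best_err = 1e300;

    for (int s = 0; s < n; ++s) {
        double thr = x[s], suml = 0, sumr = 0;
        int nl = 0, nr = 0;
        for (int i = 0; i < n; ++i)
            if (x[i] < thr) { suml += r[i]; nl++; } else { sumr += r[i]; nr++; }
        if (nl == 0 || nr == 0) continue;
        double ml = suml / nl, mr = sumr / nr, err = 0;
        for (int i = 0; i < n; ++i) {
            double p = (x[i] < thr) ? ml : mr;
            err += (r[i] - p) * (r[i] - p);
        }
        if (err < best_err) {
            best_err = err;
            best.threshold = thr; best.left_value = ml; best.right_value = mr;
        }
    }
    return best;
}

/* Gradient boosting for squared loss: start from the mean, then repeatedly
   fit a stump to the current residuals (the negative gradient) and add a
   damped version of it to the model's predictions. */
void fit_and_predict(const double *x, const double *y, int n,
                     int n_rounds, double learning_rate, double *prediction)
{
    double mean = 0;
    for (int i = 0; i < n; ++i) mean += y[i];
    mean /= n;
    for (int i = 0; i < n; ++i) prediction[i] = mean;

    double *residual = malloc(n * sizeof *residual);
    for (int round = 0; round < n_rounds; ++round) {
        for (int i = 0; i < n; ++i) residual[i] = y[i] - prediction[i];
        Stump s = fit_stump(x, residual, n);
        for (int i = 0; i < n; ++i)
            prediction[i] += learning_rate *
                             (x[i] < s.threshold ? s.left_value : s.right_value);
    }
    free(residual);
}
```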

I’m sure that this summer we will all learn a lot. I will keep you posted with updates.

Hello again! After talking about myself and the training week in my last blog post, I said that in my next articles I would explain the project. I keep my word and share the details of the project with you in this post. Fasten your seatbelts!

What is Branch and Bound?

Imagine that you are the owner of a machine shop. In addition, you have $40,000 for buying new machines and 200 square feet of available floor space in your shop, and you want to make more money.

Each machine takes up floor space, has a price, and brings in a profit:

  • Press – 15 square feet of space, $8,000 price, $100 profit
  • Lathe – 30 square feet of space, $4,000 price, $150 profit

If the machines have features like these, of course you want to maximize the profit. First you have to find a solution, and then change the number of presses and lathes to check whether it is the best one. Branch and Bound works like this too! In order to get the best result, the Branch and Bound method separates the calculated values into branches by repeatedly changing the variables and checks the results by comparing them.

First it generates a root node with an upper and a lower bound. Our aim is to make the upper and lower bounds as close as possible. Then two new child nodes are generated. For example, we can set the number of lathes to 4 for one, and 5 for the other.

These child nodes are calculated and compared. Our lower bound is 928, so the right child does not need to branch, because its upper bound is 929. So the next step will be branching the left child node. This branching continues until the upper bounds of all leaf nodes are close to the lower bound. The branching strategies may vary depending on the problem.
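To make the machine-shop example concrete, here is a tiny sketch that simply enumerates every feasible purchase and keeps the best one. Branch and Bound aims to reach the same answer while exploring far fewer of these combinations, by pruning whole branches using the bounds.

```c
#include <stdio.h>

int main(void)
{
    /* Press: 15 sq ft, $8,000, $100 profit; Lathe: 30 sq ft, $4,000, $150 profit */
    int best_profit = 0, best_p = 0, best_l = 0;

    for (int p = 0; p * 8000 <= 40000; ++p)                 /* presses we can afford  */
        for (int l = 0; p * 8000 + l * 4000 <= 40000; ++l) {/* budget constraint      */
            if (p * 15 + l * 30 > 200) continue;            /* floor space constraint */
            int profit = p * 100 + l * 150;
            if (profit > best_profit) {
                best_profit = profit; best_p = p; best_l = l;
            }
        }

    printf("Best: %d presses, %d lathes, profit $%d\n", best_p, best_l, best_profit);
    return 0;
}
```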

What is the Max-Cut Problem?

Unfortunately, the problem we attack with Branch and Bound is not which machine to buy, but the max-cut problem. This problem is based on graphs. Graphs consist of edges and vertices.

A graph with 5 vertices where all edge weights are 1

Now consider that you have to cut this graph into 2 groups and maximize the total weight cut. For example, vertices 1 and 5 form one group and 2, 3 and 4 the other. The sum of the weights of the edges between the 2 groups will be our result.

In our example we found 3. Is this the max-cut value? I leave it up to you to check its correctness. You can apply the branch and bound method to find it, of course 🙂
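If you want to check your own attempt, a small helper like the sketch below evaluates the weight of any cut, given a weighted adjacency matrix and a 0/1 group label per vertex. The matrix here is a placeholder: fill in the weights of the graph in the picture yourself.

```c
#include <stdio.h>

#define N 5

/* Weight of the cut: sum of weights of edges whose endpoints lie in
   different groups (each edge counted once).                        */
int cut_weight(const int w[N][N], const int group[N])
{
    int total = 0;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j)
            if (group[i] != group[j])
                total += w[i][j];
    return total;
}

int main(void)
{
    /* Placeholder adjacency matrix: put the weights of your own graph here
       (w[i][j] = 0 means there is no edge between vertices i+1 and j+1).  */
    int w[N][N] = {0};
    int group[N] = {1, 0, 0, 0, 1};   /* vertices 1 and 5 in one group */

    printf("cut weight = %d\n", cut_weight(w, group));
    return 0;
}
```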

Why parallelization?

In the example above there was a graph with 5 vertices and we were able to reach the solution by ourselves. This is actually very small next to the graphs we are dealing with. Therefore, the branch and bound methods applied to them do not finish in a short time. We want to shorten this computation time through parallelization. As they say, many hands make light work!

Now we are investigating parallelization approaches and trying to implement them in our code. I hope I can write more about these approaches in my next blog post. In the meantime, I leave the link for those who are curious about the construction of the stones above. See you! https://www.youtube.com/watch?time_continue=8&v=yUQC6PYn0uw&feature=emb_logo

Since my last post, and over the last 2-3 weeks, I have been mainly familiarising myself with the GALILEO Cluster at CINECA, running basic models and simulations, and getting my hands dirty with sample models from the PLUTO Code for Astrophysical Modelling.

The PLUTO Code is a fluid dynamics code capable of reproducing magnetohydrodynamic (MHD) and supersonic fluids, such as the plasma which exists in stars and Supernova Remnants (SNr).

Once I was comfortable running sample simulations on GALILEO, I began modelling my own simulations of Supernova Explosions (SNe), and their resulting SNr. Firstly, I began running simple, 3D, spherically symmetric explosions, and monitoring the evolution of the blast over the course of ~1900 years.

See the photos below, which show the evolution of the blastwave, with clear forward and reverse shockwaves present. Forward and reverse shockwaves are created when the supernova blastwave interacts with the surrounding Interstellar Medium (ISM). The forward shock continues to expand into the ISM, while the reverse shock travels back into the freely expanding SNr.

Snapshots of a spherically symmetric expanding Supernova Explosion simulation. The above photo shows the density profile of the blast after 100, 450, 1000, and 1900 years.

Once I was happy with the shape and evolution of the simple SNe described above, it was time to introduce some realistic physical conditions, because in reality no SNe looks that perfect. One of my supervisor’s suggestions was to introduce a torus around the expanding blastwave (imagine the star was surrounded by a ring, similar to Saturn’s). This torus-like feature makes for a much more realistic SNe simulation, in which the expanding blastwave interacts with a region of high density somewhere out in the ISM.

In the photos below, you can clearly see the interaction of the blastwave with the surrounding torus, and the effect this additional condition has on the evolution of the SNe relative to the previous photos.

Snapshots of a spherically symmetric expanding Supernova Explosion simulation, interacting with a surrounding ‘torus of matter’. The above photo shows the density profile of the blast after 350, 950, 1500, and 1900 years.

In the coming weeks, I plan on introducing more realistic physical conditions, such as random clumps of matter scattered around the SNr, among other things. It’s also worth noting that the images in this post are 2D slices of a 3D model.

If you have any questions about the project, please feel free to comment and I’ll be more than happy to answer.

Cathal

Hello all! In this first post I will introduce myself and share my point of view and interests regarding the Summer of HPC by PRACE.

Who am I?

My name is Víctor González Tabernero and I am a 23-year-old student from Spain. I was born and raised in a small city called Salamanca (Spain). During my elementary education, I became fascinated by all kinds of science, and this encouraged me to make the decision to study Physics and Mathematics. I studied both degrees at the University of Oviedo (Spain) from 2015 to this year, 2020, when I finished them.

My interests in computation and more

During my studies I had many computational subjects, which I became passionate about. As I learnt more about computational physics, numerical analysis and computation, I realized that these were the subjects which interested me the most.

Thus, during the last two years of my studies I made the decision to fully commit to a career in these fields. In order to do this, I taught myself programming languages and computational skills. To further pursue my goal, I wrote both of my final degree dissertations on computational simulations and particle physics.

Not everything is studying and coding, though. My main passion is listening to music: I spend the whole day discovering new music or listening to my own playlists. I also try to travel whenever I have the chance; I like discovering new cultures, getting lost in foreign cities, and meeting new people. My hobbies also include cooking, gaming, and hanging out with my friends.

Why SoHPC and what I will be doing here

A fellow student and friend of mine told me about this programme, in which he took part two years ago. At this point in the post, I believe that I’ve made my passion for computation related to physics and mathematics clear. That is precisely why the Summer of HPC by PRACE seemed like a unique opportunity to learn more about computation and HPC. When I first read the projects, they all seemed fascinating in their own way.

I was selected for project 2019, called Implementing task based parallelism for plasma kinetic code. The project was proposed by the LECAD laboratory at the University of Ljubljana, Slovenia, and I will be working in a team with two other students. We will be working with a code that simulates the behavior of plasma kinetics, which has exciting physics and mathematical tools behind it. Our aim is to parallelize the existing code to increase its speed. I am really thrilled about this project, as I can find in it many of the tools and features I already know from my degrees, and because I’ll learn and understand lots of new concepts and techniques about high performance computing which, without doubt, will be useful in my future.

Due to the COVID-19 pandemic, we are participating remotely. So here is a photo of me participating online; even though I could not travel to a new place, I am cheerful about being able to take part in SoHPC.

And this is all for this introductory post. Thank you for your time and stay tuned for new posts containing updates.

As promised, here is an update on my Summer of HPC adventure. For the past two weeks, I have been working on project 2017 – Benchmarking and performance analysis of HPC applications on modern architectures using automating frameworks – “at” SURFsara.

From day one, I’ve been able to log in and run jobs on Cartesius, the Dutch supercomputer, and I must admit that the experience is even better than expected. Having access to such processing power from my bedroom makes me feel like a child who has just received the toy he has been waiting years for.

At first, I got familiar with the system and with ReFrame, a framework for automating regression tests on HPC systems. The usage of this kind of tool is becoming increasingly important and necessary due to the variety of competing hardware architectures produced by the evolution of HPC. Thanks to these frameworks, HPC maintainers are able to detect in a simple, reliable and automatic way whether different HPC software or scientific programs behave as expected on a range of hardware solutions. Moreover, it is also possible to configure several toolchains and builds to check how the choice of one or the other affects the final execution. Thus, HPC specialists can not only detect possible failures in the systems but also benchmark the performance of the programs and find potential bottlenecks.

Once I had learnt the basics of ReFrame, I was assigned an exciting task: the creation, from scratch, of regression tests for a scientific software package widely used on HPC systems: NAMD. According to the official page, NAMD is “a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems”. Moreover, this software is involved in massive simulations of the SARS-CoV-2 coronavirus envelope (more info here).

Example simulation performed using NAMD. Media taken from Wikimedia Commons.

Therefore, during these first weeks, I have been designing different regression tests for that application, using public benchmarks to measure the performance of Cartesius. Moreover, I was able to test different architectures and even computing accelerators like NVIDIA GPUs. Once I had run all the tests, I obtained interesting insights from the results. With this information, I wrote a series of guidelines for future Cartesius users, so that they can run their NAMD simulations in the best possible way.

As we have already met the objectives proposed for the NAMD tests, this week I have started a similar study for Alya, which is part of the CompBioMed Centre of Excellence. I am really excited about this task because I will also have to perform detailed profiling to detect possible bottlenecks in the Cartesius system!

In the next post, I will update you on my progress with this new challenge. Stay tuned so you don’t miss anything.

A view from Bogazici University in the autumn, © Beyza Bah

Hi all, this is an introduction to myself and to the Summer of High Performance Computing (SoHPC) from where I stand. If this is your first time hearing about HPC or SoHPC, you should definitely check out About HPC & Prace and About Prace Summer of HPC.

My name is Nursima. I study computer engineering at Boğaziçi University in Turkey. Since covid-19 broke out, I have been continuing my studies in İzmir, which is on the west coast of Turkey. It is a warm and comfortable city, with ancient ruins from as early as the Hellenistic period on every corner. Although my original intention was to do the programme in the UK, I enjoy working here in İzmir too, amidst the scents of salt and seaweed coming through my window.

Among my favorite topics are mathematics, computer science, languages and the social sciences. Actually, I enjoy learning and thinking about most abstract topics, and my favorite moments are moments of enlightenment about how the universe/people/machines work – or don’t work. However, just knowing does not mean much when you don’t do anything with it. There are so many problems waiting to be solved. And that’s where my application to SoHPC comes into the scene.

Well, I heard about SoHPC from a professor at my university and thought it would be a good idea to apply. The organization seemed well planned, the projects were solid and you’d have a mentor throughout the programme. Now I’m glad that it turned out to be a great one. After an online training week where we were introduced to MPI, OpenMP and CUDA with lectures and hands-on labs, we started to make progress on our projects under the supervision of our mentors.

My project is centered around a particle simulation code, namely DL_MESO. Currently, the code can be run on up to 4096 GPUs. Our aim is to improve the code so that it can be run on a larger number of GPUs, thereby allowing systems of larger size to be simulated.

I’ll be writing more about the project in the following days. Thanks for reading and stay tuned!

Hello everybody! I hope you are all well.
As you can see in the featured image, I am really excited about receiving the programme’s t-shirt!
So, in my last blog post I introduced myself and asked you to answer a question in the comments. I am pleased that you took up the challenge and answered. Now, in today’s post I would like to share some information about the project I am working on and my progress so far.

Project’s motivation

Nowadays, the use of Python is becoming more and more popular, mainly because it’s really user-friendly and saves quite a lot of development and debugging time. However, when it comes to performance, there are lower-level programming languages, meaning they are closer to the computer than to the human, that produce programs with shorter execution times. Simulations, visualisations and other calculation-heavy programs that run on supercomputers reveal Python’s poor performance. So the question is: can Python be optimised to run fast and benefit from HPC architectures?
Before answering that, we have to understand some basic features of the Python program that we will study.

The program

In this example, the cavity is a square box with an inlet on one side and an outlet on another.

The program performs a Computational Fluid Dynamics (CFD) simulation of fluid flow in a cavity.

The fluid dynamics problem is a continuous system that can be described by partial differential equations, but in order for a computer to run simulations, the calculations need to be put onto a grid (discretisation). In this way, the solution can be approached by the finite difference method, which means that the value of each point in the grid is updated using the values of the neighboring points (a small code sketch follows the figure below).

The blue point of the grid is updated using the cyan top, bottom, left and right points.
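In C, this update is essentially the nested loop sketched below – a minimal illustration of the four-point update, not the exact code used in the project:

```c
/* One iteration of the four-point update on an (m+2) x (n+2) grid with a
   one-cell boundary: each interior point becomes the average of its top,
   bottom, left and right neighbours.                                     */
void jacobi_step(int m, int n,
                 double psi[m + 2][n + 2], double psinew[m + 2][n + 2])
{
    for (int i = 1; i <= m; ++i)
        for (int j = 1; j <= n; ++j)
            psinew[i][j] = 0.25 * (psi[i - 1][j] + psi[i + 1][j] +
                                   psi[i][j - 1] + psi[i][j + 1]);
}
```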

The program can be parameterized by specifying the variables below:

  • Scale Factor – affects the dimensions of the box cavity and consequently the size of the array(s) in which the grid is stored.
  • Number of Iterations – affects the number of steps in the algorithm; the larger it is, the more accurate the result will be.
  • Reynolds number (Re) – defines the viscosity, which affects the presence of vortices (whirlpools) in the flow.

The simulation result is visualized by arrows and colors drawn in an image representing the grid. The arrows demonstrate the direction of the fluid at each point, while the different colors indicate the fluid’s speed, with blue being low speed and red being high speed.

Output images after simulation with Reynolds number = 0 (left) and Reynolds number = 2 (right). For a non zero Reynolds number, arrows reveal the existence of whirlpools.

Optimisation possibilities

There are several techniques that can be applied to speed up Python codes. The goal of this project is to investigate optimisations for Python programs that run not only on CPUs but also on GPUs.

Progress in the first 3 weeks

Some of my early tasks were to study the algorithm, understand the existing Python and C codes, get access to HPC systems and submit my first jobs to the supercomputers.

Currently, I am working on optimisations of the Python code. The metric that interests us is the iteration time, which is obtained by dividing the total time for N iterations by N. I have been trying several Python modules that accelerate the calculations and found that the numexpr module is the best for our case. The Python baseline (unoptimised) code uses the Numpy module to achieve fast array calculations. However, only the numexpr version of the Python code can compete against C, since the Numexpr module creates fewer temporary arrays and uses multithreading internally.

Graph representing Iteration Time over the Number of Threads for the optimised numexpr version, compared to the baseline versions. Python numexpr code performs better than the serial C code when more than 2 threads are used.

Next goals and conclusion

My next goal is to produce a performance graph for the MPI versions, which I will describe next time, and then move on to developing an equivalent Python program for GPUs.
Before concluding, my question for you this time is the question in the title:

“ To Py or not to Py? ”
And by that I mean: what is your experience with Python? Have you ever used it? If yes, is it your most preferred language? Have you ever worried about its performance?
Please feel free to write your thoughts on that down in the comments section.

That’s all for this blog, I really hope you found some interesting points in it and I will be happy to see you in one of my future posts.

Hi, my name is Aarushi Jain. I am a computer science and engineering student from Indore, India, in the pre-final year of my degree. I am working on project no. 2024, “Marching Tetrahedrons on GPU”, under SoHPC 2020. I made a tangential entry into this programme: I was selected for an internship at VSC through another route, but it did not materialise due to the pandemic. I was very disheartened, but by the grace of god I got this opportunity to work under SoHPC 2020. It was a blessing in disguise for me.

This project will be completed under the supervision of Project Mentor Siegfried Hoefinger.

I was intending to join the summer training in Vienna, but due to the pandemic we were told that the programme had been shifted to online mode. I was happy and sad at the same time: I was getting to learn new skills from experts from the comfort of home, but on the other hand I would miss the chance to visit new places and meet people from diverse cultures.

My first exposure to HPC began during my training at VECC (Variable Energy Cyclotron Centre) Kolkata. There I learned the fundamentals of parallel processing using pthreads and worked on a project for the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany.

As a kid, I visited the International Centre for Theoretical Physics (ICTP) in Trieste, Italy with my parents for a month. I would like to share a video from ICTP where I got an award for being a good listener. Watch the video below and try to recognise me :-).

If you are still here, I would like to share my hobbies, which are photography and swimming.

A picture I took today at my home for this blog (Nikon D5600)

Thank you for reading, and I will keep you updated about my project here.

I don’t remember exactly what I typed into the Google search bar that day in the middle of February, but one thing’s for sure: I came across the PRACE website and I’m glad I did. Not long after, I discovered that all the topics I’ve been putting my efforts into converge on a single discipline, and this discipline is called HPC. Ah! Acronyms! Here we go again. So I immediately applied for Summer of HPC, the internship I had been looking for without knowing it.

Now let me introduce myself. My name is Federico Sossai, I’m an Italian Computer Engineering student at the University of Padua, currently in the last year of my Master’s degree. My main interests concern the software side of Computer Science, such as algorithm design, analysis, optimization and, of course, high performance computing. I’m attracted to any form of computation in general but, as you have probably guessed, parallel computing is one of my favourites.

This is me trying to keep a straight face. Not an easy task sometimes!

Personalities are anything but simple to describe, so let me give you a hint: I love C programming. Just a few words that enclose so many aspects of my identity. I like to spend hours thinking, paying attention to little details, not taking anything for granted and doing all the heavy lifting behind the scenes.

It’s not a matter of interest, it’s a matter of passion.

I live in a small town in northeastern Italy of about 5,000 residents, which might not offer the same services as a metropolitan city (well, of course it doesn’t!), but we all know that every con has its pro: I can reach the sea, the mountains or Venice in about an hour! I wasn’t expecting to spend this much time here, but due to the lockdown my home has become my workplace, so let’s see what I will be working on besides playing the electric guitar…

When it comes to big computing infrastructures, it is not possible to keep everything in a shared-memory fashion, and that’s when distributed memory comes into play. This is what my project, named Improving performance with hybrid programming, is all about: learning how to efficiently mix MPI and OpenMP to get the best performance out of the VSC3 supercomputer (at the Vienna Scientific Cluster), exploiting both shared- and distributed-memory architectures in one piece of software.

Now let’s hunt some GFlops!
