As the internship has come to an end and my time in Paris is over, I find it as good a time as any to reflect on my experience. I first came across the Summer of HPC program as I was looking for something to keep me occupied over the summer. Being halfway through my Master’s in Computational Science and Engineering, I was looking forward to getting some hands-on experience in the field. SoHPC seemed like the perfect opportunity – even though I was a little disappointed that it would be organized online. Nevertheless, some projects switched to on-site mode, and my favorite project just so happened to be in Paris, a city that will always have a bit of extra significance to me as it was the start of my international journey back in 2015.
This internship has given me the opportunity to dive into and apply my HPC knowledge to a completely new field. Unlike most other internships, we as participants have had good insight into each other’s projects, which has really sharpened my intuition for HPC and its endless applications. By being on-site, I got to know many of the amazing people at CentraleSupélec. Besides providing many laughs during lunch and coffee breaks, I really enjoyed getting some insight into their work and hearing their perspectives on a career in research.
My project focused on seismic wave propagation, more precisely, improving the stability condition of the existing numerical wave propagation solver SEM3D, and thereby improving performance as fewer computations are required to obtain the results. I’ve summarized my project and its significance in a short and (hopefully) comprehensible presentation that you can watch below. I found it really motivating to work on a project with such a clear and important use case with the support and guidance from my two amazing supervisors, Filippo Gatti and Régis Cottereau. My inner software nerd was also thrilled to be introduced to the not-so-new language Fortran 90 and to work directly with the large, highly optimized HPC code that is SEM3D. If future editions of SoHPC offer projects on SEM3D, I really urge students to apply – I could not have been luckier with my project and supervisors!
As if the internship in itself wasn’t enough, I got to spend my entire summer in Paris, for which I’m incredibly grateful. I’ve had a summer packed with good food and wine, and spent most of my spare time taking long Paris walks, soaking in the liveliness of the city and enjoying the ambiance. Moving abroad is an incredibly empowering adventure that I wish everyone gets to experience at some point in their life. Navigating through a new country and culture without your normal safety net is both exhilarating and scary, especially when you’re on your own. It pushes you outside your comfort zone and opens new doors to form unexpected friendships and pick up hobbies you otherwise might not try, leaving room for personal growth in every aspect of your life!
With these (slightly cheesy) words my Summer of HPC has come to an end. If anyone has further questions about my project/experience, or simply wants to stay in touch, you’re very welcome to shoot me a message on LinkedIn. And if you’re a student interested in getting into HPC: apply to next year’s edition of Summer of HPC – you won’t regret it!
In my previous blog post, I gave an overview of the file types used to handle the large genomic data for analysis (which we use as the inputs for the toolkit JLOH), the resulting output files, and how these output files would look when visualized. As promised, I will now go into how JLOH does what it does, why it is efficient at what it does, and what I have been doing to test this efficiency and scalability.
JLOH, the algorithm
Armed with the input VCF, FASTA, and BAM files, JLOH first separates the homozygous regions of the hybrid genome from the heterozygous ones and writes them to two separate VCF files. The variants in the input VCF files have a ‘format’ field that shows whether they are heterozygous or homozygous. JLOH then calculates the heterozygous and homozygous SNP densities of both parental genomes and clusters the regions of high homozygosity and heterozygosity together. The remaining regions with low SNP density are where some information has been lost through LOH, and these are the ones we are interested in. They are labeled REF and ALT blocks and are compared to the heterozygous regions, trimming any overlaps.
At this point, JLOH can now analyze these REF and ALT blocks for their read coverage. Read coverage is the number of unique reads that align to a given nucleotide position in the sequence. The blocks whose read coverage is above a specified threshold are discarded, along with any uncovered regions. Using these read coverage profiles, JLOH then determines each block’s zygosity, whether ‘Hemi’ or ‘Homo’: homozygous regions display a uniform read coverage, while hemizygous regions show a reduced coverage compared to their neighboring regions.
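As a toy illustration of this filtering step, the sketch below keeps only blocks whose coverage falls in a plausible range. Note that the record layout, field names, and threshold here are all made up for the example, not JLOH's actual data structures.

```python
# Hypothetical sketch of coverage-based block filtering (illustrative only):
# discard blocks whose coverage exceeds a threshold, along with uncovered ones.

def filter_blocks(blocks, max_coverage=2.0):
    """Keep blocks with 0 < normalized coverage <= max_coverage."""
    kept = []
    for block in blocks:
        if 0 < block["coverage"] <= max_coverage:
            kept.append(block)
    return kept

blocks = [
    {"id": "REF_1", "coverage": 1.1},  # uniform coverage -> likely 'Homo'
    {"id": "ALT_2", "coverage": 0.5},  # reduced coverage -> likely 'Hemi'
    {"id": "REF_3", "coverage": 0.0},  # uncovered -> discarded
    {"id": "ALT_4", "coverage": 3.7},  # above threshold -> discarded
]
kept = filter_blocks(blocks)  # keeps REF_1 and ALT_2
```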
JLOH then outputs a table of all the LOH blocks found, their zygosity (homo or hemi), and their allele type (ALT or REF), among the other stats established along the way. With this information, I bet this visualization of the output files that I shared in the previous article now makes sense.
JLOH output viewed in a genomics viewer (source: JLOH GitHub)
Given the scale of information that we are dealing with, this analysis should take a long time to complete. However, JLOH carries out most of its operations in parallel, cutting down on the overall run time.
Parallel computing
If the computing problem you are trying to solve is too big to be handled by a single computer, or would take too long on one, one way to solve it is to split the problem into several sub-tasks that can be executed on many processors concurrently to save time. This is essentially what supercomputing is. Cloud computing is a commercial version of this, for those of you familiar with it.
The steps followed by JLOH, as described above, are carried out by functions in JLOH’s extract module. How fast these steps run is dependent on how fast the data can be processed. To leverage the power of multiple processors, JLOH uses Python’s inbuilt multiprocessing module to analyze the data. Multiprocessing opens a ‘pool’ with multiple processes that can run separately on separate processors at the same time. JLOH splits some of its more resource-intensive tasks, runs them on separate processors and when they are complete, the main process resumes.
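As a rough sketch of this pattern (the function and data below are illustrative stand-ins, not JLOH's actual code), a pool can map a per-chromosome analysis function over the input:

```python
# Illustrative multiprocessing pattern: one worker task per chromosome.
from multiprocessing import Pool

def count_snps(chromosome_variants):
    # Stand-in for a resource-intensive per-chromosome analysis step.
    return sum(1 for v in chromosome_variants if v == "SNP")

chromosomes = [["SNP", "INDEL", "SNP"], ["SNP"], ["INDEL"], ["SNP", "SNP"]]

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # each list is analyzed in a separate process, concurrently
        counts = pool.map(count_snps, chromosomes)
    # the main process resumes here once all workers are done
    print(counts)  # [2, 1, 0, 2]
```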
To test this, I split the genome samples that we are using into subsets of incremental size, with 1, 4, 8, 12, and 17 chromosomes respectively. I then ran JLOH on these sets as job scripts on a single node of the MareNostrum4 cluster, keeping the allocated memory and processing cores constant. For each run, I recorded the time JLOH took to complete the analysis. The results are graphed as shown.
A graph of the time JLOH takes to analyze the subsets of the Saccharomyces cerevisiae data set. The complete data set is 17 chromosomes.
The red line represents the real time, which is the actual time taken by the program from the moment we run the command until we get the output, as if timed with a stopwatch. The blue line, on the other hand, is the user time, i.e., the cumulative time taken by the processors to run the program to completion. This is how long JLOH would take to complete its analysis had it been running serially on a single processor, which, thanks to parallelization, is not the case. Notice that the overall time taken increases steadily as more input data is passed, showing JLOH’s scalability with increasing input size.
I took this further: using a single data set, I ran JLOH again, but this time I varied the number of processor cores allocated for each run. The results for the time taken, graphed against the number of processor cores, are shown below. For this set, the real time decreases steadily from about 16 minutes using 1 core until it flattens out at about four minutes with 4 cores, which are sufficient to run the analysis on this set. The user time stays roughly constant, but using multiple processors cuts down the real time. Because its parallelization strategy is CPU-based, JLOH needs little memory to run. Overall, JLOH performed well in most of the test cases that I ran, proving to be scalable with increasing input size and processor count.
JLOH’s run time when using a varied number of CPU cores and the 12 chromosome dataset of Saccharomyces cerevisiae.
That is it for this post and my Summer of HPC project. This has been an enriching, although fast-paced experience, where I have learned a lot about genomics and high-performance computing. I am grateful to my mentors from the BSC Comparative Genomics Lab for their guidance, to everyone at BSC for hosting me, and to PRACE for making all of this possible and allowing me to work on this project. I would highly recommend SoHPC to any university students interested in research computing.
It’s unfortunately the end of the Summer of HPC 2022. It was an absolute blast to work on High Performance Quantum Fields for the past two months!
As mentioned in the previous blog post, the aim of this project was to benchmark the Dirac operator from Lattice Quantum Chromodynamics (Lattice QCD). There, we derived the equation that the Staggered Dirac operator yields for each element of the output source term χi(n).
Therefore, for each point n on the lattice, we perform 8 matrix-vector multiplications between an element in SU(3) and a complex vector with 3 elements as the lattice is a 4-dimensional hypercube. This forms the basis of the benchmark kernel.
Implementation in Kokkos
The single-node benchmark was developed using the STREAM convention, which sums the total bytes read in by the Dirac kernel and the total bytes written out. This was straightforward to implement across both the CPU and GPU using Kokkos, thanks to its API, which was discussed in my first post. The results of the benchmark are included in the figure below:
Staggered Dirac Kernel for multiple Execution Spaces.
It can be clearly seen, especially for the A100 GPU, that as the lattice volume increases, memory bandwidth utilization increases. This indicates that the Staggered Dirac kernel scales well with volume. It is notable that a drop in memory bandwidth utilization is seen for the CPU, which indicates that the memory channels have become saturated. One way to alleviate this is to set OMP_PROC_BIND=spread and OMP_PLACES=threads for OpenMP version 4.0 or greater.
Adding distributed-memory communication with MPI was a much simpler process than anticipated. As Kokkos is a shared-memory library, it complements MPI quite nicely. Internode communication for the CPUs and GPUs (along with intranode communication for the GPUs) was done using the halo exchange method. Since the Kokkos::subviews used here are guaranteed to be contiguous, Kokkos::deep_copy was used to copy into and out of the halo buffers, without having to implement custom MPI_Datatypes.
A Final Recommendation
Please don’t forget to check out the following video presentation I made with my colleague Apostolos Giannousas about the project. I also highly recommend reading Apostolos’ work on extracting the mass of the pion from Lattice simulations.
Welcome to my final blog post! It was a great pleasure to share all my progress with you during SoHPC 2022. It has truly been a wonderful journey, both learning-wise and in building connections. In this blog post, I will share with you the final results and concluding remarks from my SoHPC 2022 project “Neural networks in chemistry – search for potential drugs for COVID-19”. I will also share my whole experience in SoHPC 2022 and of working with my SoHPC partner and project mentors. Finally, I will answer whether it is worth participating in SoHPC or not.
Tuning of the Parameters of the Molecular Descriptors and Neural Networks
Coulomb matrix for methanol
In my previous blog post, I mentioned that we would work with molecular descriptors to transform the candidate molecules into a mathematical representation. During the project, I worked with two molecular descriptors: the Coulomb Matrix (CM) and the Many-Body Tensor Representation (MBTR). CM is a simple matrix representation of the molecular interactions, which can be used to describe candidate compounds for drug development, whereas MBTR is a more complex molecular descriptor that depends on many parameters, such as the angle, the grid size, the maximum and minimum grid values, etc.
To obtain the best model, both the parameters of the descriptors and those of the neural networks (NNs) need to be tuned. Once the best parameters were obtained for the molecular descriptors, the NNs were tuned over a set of values for each parameter. Due to the large number of parameters, tuning the hyperparameters is time-consuming and cumbersome. The final training on the COVID-19 data was done with the best optimized parameters. Even though the final training did not produce a perfect model, it still predicted reasonably good docking scores for the molecular compounds.
Furthermore, we also tested the speed-up of the creation of the molecular descriptors. Speed-up is defined as the ratio of the time taken by a program run on one processor to the time taken when run on ‘n’ processors. The scripts for the different descriptors were run with different numbers of processors (n = 1, 2, 4, 8, 16, 32). We plotted the speed-up versus the number of processors, which showed us that complex descriptors such as ACSF and MBTR can be parallelized, achieving roughly a 2x speed-up.
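In code, the speed-up calculation is simply a ratio. The timings below are made-up numbers for illustration, not our measured results:

```python
def speedup(t_one_proc, t_n_procs):
    """Speed-up = run time on 1 processor / run time on n processors."""
    return t_one_proc / t_n_procs

# e.g. a descriptor script taking 120 s on 1 core and 60 s on 16 cores
# corresponds to the roughly 2x speed-up we observed for MBTR and ACSF
s = speedup(120.0, 60.0)  # 2.0
```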
For a complete review of our project, do not forget to watch our video presentation here or on Youtube.
Conclusions
From the final results, it is clear that the final model obtained after tuning and training was not the best. More work is required to understand the molecular descriptors and their parameters in details, and to obtain the best model for the COVID-19 problem. There is a great scope for drug development using advanced tools such as Artificial Intelligence and Machine Learning.
Worth it or not?
During SoHPC 2022, I learned many new skills and polished old ones. Not only did I improve my coding skills and gain knowledge of MPI and GPU programming, but I also learnt how to work collaboratively. Working with Gabriel Cathoud has been amazing. We had many discussions and meetings during the SoHPC journey, in which we discussed not only things related to the project but also the future of drug design using AI. In collaborative work, partners should know which skills each of them excels at. We split our work in a way that was not only best for us but also productive for the project. Furthermore, our mentor Dr. Marián Gall and co-mentor Dr. Michal Pitoňák guided and supported us throughout the project. Dr. Marián taught us the concepts of neural networks and showed us how to write AI and ML scripts in Python. We were going at a pace that was just perfect for any student. Dr. Michal helped us in writing the final report and making the video presentation. I very much appreciate his valuable feedback.
Coming from a Physics background, and more specifically Astrophysics, I had zero experience with drug design, and no idea that AI could even be used for such a purpose. In the beginning, my only goal was to improve my Python skills and to be able to write AI scripts in Python. But now, after doing this project, I can say out loud that designing a new drug for a disease is as interesting as finding an Earth-like planet. I am overjoyed to have done this project, and I will certainly learn more about the field in the future. So for me, the end of SoHPC 2022 is the beginning of learning more about drug design.
I will take the skills I have learnt in this journey on to my next one: a PhD in the UK, starting next month. Although I’ll be working on AGB stars, who knows when the complexities I faced in drug design might help me tackle a problem in AGB stars. I am very thankful to PRACE and Dr. Leon for all the support. Now, to answer the question I asked at the beginning of this blog post, whether to apply to SoHPC or not: yes, it is 100% worth it. You will learn new skills, and you may discover a completely new subject that you will fall in love with in two months.
Thank you for reading my post. I hope you enjoyed my blog posts and found the subject interesting. In case you have queries regarding the project or my SoHPC journey, feel free to contact me or write a comment below.
In my previous blog post, I gave you a quick crash course on how to simulate seismic wave propagation in an HPC environment. This is done through semi-discretization, meaning that we solve the spatial variables using a form of Finite Element Method called Spectral Element Method (SEM), and reformulate the solution to a (simpler) ordinary differential equation.
To solve the ODE, we use a time-stepping scheme, where we encounter some problems with stability. This issue can be solved by calculating the largest eigenvalue of the matrix M⁻¹K obtained from the SEM formulation. This may sound straightforward, but as always when it comes to HPC, there are a few important aspects we need to take into account, which I will discuss in this blog post!
The Problem with Matrix Representations
There exist several numerical methods that solve the eigenvalue problem (power iteration, the QR algorithm, the Jacobi eigenvalue algorithm). In practice, however, we would typically like to use an existing library such as LAPACK, Eigen, or SciPy. Not only will these spare us tedious hours of implementing our own method, but they have also been highly optimized to an extent that our homemade implementations won’t be able to compete with. Unfortunately, the above-mentioned libraries aren’t very appropriate for HPC applications.
The problem with most existing libraries is that they require an explicitly defined matrix: either a 2D array or a library-specific data structure. It might sound strange that we’d even consider storing our matrices any other way, but in HPC it’s not very common to store full matrices. One of the biggest reasons for this is that most HPC applications work with large sparse matrices. This means that most matrix elements are zeros. If we’re working with a 5×5 matrix, storing a few zeros might not be a problem, but when we’re working with larger matrices (say 100 000×100 000), these zeros represent a lot of useless data and wasted memory. Memory usage is a very important aspect in HPC as memory I/O is one of the slowest operations and greatest bottlenecks in most programs, so never ask the HPC-coder to switch to a full matrix representation!
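A quick back-of-the-envelope illustration with SciPy (used here only for the example; SEM3D has its own Fortran data structures): a compressed sparse row (CSR) matrix stores just the nonzero values plus two index arrays, while a dense array stores every zero.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n = 10_000
# a random sparse matrix: only ~1 entry in 10 000 is nonzero
A = sparse_random(n, n, density=1e-4, format="csr", dtype=np.float64)

dense_bytes = n * n * 8              # full 2D array of float64: ~800 MB
sparse_bytes = (A.data.nbytes        # nonzero values
                + A.indices.nbytes   # column indices
                + A.indptr.nbytes)   # row pointers: well under 1 MB total
```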
Does this really matter to us? Of course!
The wave propagation solver that my project revolves around, SEM3D, is not an exception when it comes to strange matrix representations. When choosing an eigenvalue library for my project, there were two important aspects to take into consideration:
SEM3D is a large software project with over 650 000 lines of highly optimized HPC code. Adapting the existing implementation is out of the question, so my solution to this problem must be compatible with the existing code (mainly the data structures and communication interfaces).
All the essential code is written in Fortran 90, so my solution as well as the used library must also be available for Fortran.
Example of alternative matrix representations in HPC applications, specifically in the wave propagation software SEM3D
FORTRAN fun fact for electronics nerds like myself: FORTRAN has been around since 1957 and was first run on vacuum-tube computers – the predecessor to the modern-day transistor computers we know today!
Not All Heroes Wear Capes: Introducing ARPACK!
The magic component in this entire project is ARPACK, the amazing library that fulfills both our criteria in an elegant way. ARPACK is written in FORTRAN 77 and was developed throughout the 90s to solve large-scale eigenvalue problems without requiring explicit matrix representation. The library implements the Implicitly Restarted Arnoldi Method and is used by many popular software and computing environments such as SciPy, MATLAB, and R.
One of the most important features of the ARPACK library is what’s called a Reverse Communication Interface, which means that the library doesn’t act directly on the matrix. Instead, it requires the user to implement a matrix-vector product that’s called between iteration steps, using data from the previous iteration step:
Pseudocode of the reverse communication interface used for the ARPACK eigenvalue solver
At first glance, this might not seem so remarkable, but it means that the library doesn’t care how (or even if) the matrix is stored. Everything related to the matrix is kept far from the library! In SEM3D, where the diagonal mass matrix M is stored as a vector and the stiffness matrix K is not stored at all (but implemented as a function that calculates the internal forces Fint=Ku), the reverse communication interface lets us pass the relevant data to the ARPACK solver without modifying the existing data structures or matrix representation!
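The same matrix-free idea can be sketched in a few lines of Python using SciPy's eigs, which calls ARPACK under the hood. As in SEM3D, the mass matrix lives as a vector and the stiffness matrix exists only as a function; the small 1D Laplacian operator below is a made-up stand-in for the real SEM formulation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigs

n = 100
m = np.linspace(1.0, 2.0, n)  # diagonal mass matrix M stored as a vector

def internal_forces(u):
    # Stand-in for an Fint = K*u routine: K itself is never stored,
    # only the action of a simple 1D Laplacian stencil on u.
    f = 2.0 * u
    f[:-1] -= u[1:]
    f[1:] -= u[:-1]
    return f

# ARPACK only ever asks for matrix-vector products v -> M^-1 K v
A = LinearOperator((n, n), matvec=lambda u: internal_forces(np.ravel(u)) / m)
lam_max = eigs(A, k=1, which="LM", return_eigenvectors=False)[0]
```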
That’s it for this blog post! Although each problem is unique, I hope that this gave some insight into the special type of obstacles one may face when working with software for HPC applications!
In this blog post, I will explain the fundamental theory behind simulating fermions on the lattice. In particular, I will discuss the application of staggered fermions in creating a kernel for the Staggered Dirac Operator using the Kokkos C++ template library.
Fermions on the Lattice
Firstly, let us say a few words about the setup from the perspective of Lattice Quantum chromodynamics (Lattice QCD). Instead of using the continuum description of 4-dimensional spacetime, we instead choose to represent spacetime as a 4-dimensional hyper-cubic lattice, with each lattice point n having integer coordinates nμ, μ = 0,1,2,3.
Quark fields ψ are defined at every lattice site n and are described using Dirac spinors ψ(n). Component-wise, the spinor is denoted ψαi(n), with Dirac/”spin” index α = 1,2,3,4 and gauge/”colour” index i = 0,1,…,N-1, where N is the dimension of the gauge group. For Lattice QCD, the gauge group is SU(3), so i = 0,1,2. Thus, each spinor ψ(n) can be thought of as having 12 independent components, which are Grassmann variables.
Applying the staggered transformation to the quark fields diagonalizes the action in the spin indices, splitting it into 4 parts (one for each Dirac index). Therefore, in the staggered action, we can reduce the quark field to a 3-component vector ψi(n) with complex entries.
Gauge link variables between two lattice sites.
To couple the quark fields to a gauge field U, we introduce directed links, which connect a site n with a site n+μ̂. Thus, associated with each ψ(n) are link variables Uμ(n) along the μ̂ direction emanating from site n. These link variables are elements of the gauge group, so for Lattice QCD they are 3×3 complex matrices; in component form, Uμab(n). Using these, we can write the Staggered Dirac Operator D acting on a quark field ψ as
Staggered Dirac Operator. Summation over repeated indices is implied.
Implementation
2D slice of the 4D lattice. These indicate the two fundamental fields of LQCD.
As we can see in the above equation, for each lattice site n we perform 8 matrix-vector multiplications Uψ (8 because we have eight nearest neighbours in a 4-dimensional hypercube). Therefore, in the Kokkos kernel, we must perform these multiplications over all N lattice sites to get an output quark field χ(n).
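As a heavily simplified sketch of this operation (plain NumPy here rather than the actual Kokkos C++ kernel): the example below uses a 1D lattice with periodic boundaries, so each site has only 2 neighbours instead of 8, and the staggered sign factors are omitted. What remains is the core pattern of one matrix-vector multiplication per neighbour direction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16  # lattice sites; each carries a 3x3 complex link matrix and a 3-vector
U = rng.standard_normal((N, 3, 3)) + 1j * rng.standard_normal((N, 3, 3))
psi = rng.standard_normal((N, 3)) + 1j * rng.standard_normal((N, 3))

def dirac_apply(U, psi):
    chi = np.zeros_like(psi)
    for n in range(N):
        fwd, bwd = (n + 1) % N, (n - 1) % N   # periodic neighbours
        # one matrix-vector multiplication per neighbour direction
        chi[n] = U[n] @ psi[fwd] - U[bwd].conj().T @ psi[bwd]
    return chi

chi = dirac_apply(U, psi)
```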
With Kokkos, we can make many choices about how we perform these operations in parallel. The choice of execution policies is critical in maximizing bandwidth. In the next blog post I will post some results from runs performed at the Jülich Supercomputing Centre.
References
[1] C. Gattringer, C.B. Lang, Quantum Chromodynamics on the Lattice, Springer Berlin, Heidelberg, 2010
As it’s time to say goodbye to my journey with PRACE, I would like to talk about my final days here and my experience with quantum algorithms. Quantum algorithms are algorithms that run on quantum computers. We are interested in studying them because they can solve certain problems faster than any known classical algorithm. Several quantum algorithms have been developed by now; Grover’s algorithm is one of them, used to search through an unsorted list of numbers. Let’s see how it solves the search problem.
How does the algorithm work?
Without introducing any mathematical details, I can simply say that Grover’s algorithm works in two steps. Let’s say you are looking for a number w in a list of N numbers. The algorithm will first flip the phase of the quantum state corresponding to w, and then it will amplify that state’s amplitude. The resulting effect is a higher probability of measuring w than any of the other numbers. More mathematical details are provided here. For now, let’s understand it graphically, as in the picture below. As you see in the second picture, w has the highest amplitude.
A simple illustration of how Grover’s algorithm works.
Now, to execute the algorithm, we need to build a quantum circuit. The two main sections of the circuit are the oracle and the diffuser: the oracle executes the first step of Grover’s algorithm, and the diffuser executes the second. Let’s see how we did it in our project.
The final goal of my project
The goal of my project was to implement Grover’s algorithm to search for 13, a 4-bit number. In this case, our ‘full unsorted list’ is all the possible states of 4 qubits, i.e., the states |0000⟩, |0001⟩, …, |1111⟩. Therefore, my task was to construct a circuit implementing both the oracle and the diffuser in such a way that the state |1101⟩ ends up with the highest probability. After some effort, I was able to construct one.
A Grover circuit to search for a 4-bit number.
You may notice the two mysterious boxes labeled oracle and diffuser. How they are built, or what’s inside them, isn’t important for now, so I’ll skip it and instead turn your focus to something more interesting.
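What I can show is the effect of those two boxes, simulated with a plain NumPy state vector. This is a sketch of the algorithm itself, not our actual circuit:

```python
import numpy as np

n_qubits = 4
N = 2 ** n_qubits
marked = 0b1101                       # the number 13 we are searching for

state = np.full(N, 1 / np.sqrt(N))    # uniform superposition over 16 states

n_iter = int(np.pi / 4 * np.sqrt(N))  # optimal iteration count (~3 for N=16)
for _ in range(n_iter):
    state[marked] *= -1               # oracle: flip the phase of |1101>
    state = 2 * state.mean() - state  # diffuser: reflect about the mean

probabilities = state ** 2            # |1101> now has probability ~0.96
```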
Can we run a test on a quantum computer?
These days, IBM provides free access to several simulators and quantum computers. So you literally have access to quantum computers! We ran several tests with our circuit on a simulator and on a real quantum computer. The results are usually obtained in the form of a histogram showing the probabilities of all possible outcomes. Here is what we obtained with qasm_simulator.
A simulation obtained for Grover’s algorithm.
As you can see, the state |1101⟩ has more than 90% probability, which is what we expected! While the simulators worked quite well, the difficult part was testing on real quantum computers. We tried several machines from IBM’s computing cloud and noticed that the results were affected by noise. Noise is still an issue with today’s quantum computers, but we can surely hope for better hardware in the near future.
I believe that you learned something from my blogs. Don’t hesitate to ask questions in the comments if you have any.
Goodbye HPC!
Goodbye and thanks to those who followed my posts.
My Summer of HPC has been centered around the FMM algorithm, which was roughly explained in a previous post. Studying its fundamentals and its sequential implementation was very interesting, but it’s time to say goodbye. I will use this post to give a brief overview of my experience.
Getting started
The first task was to get familiar with FMM terminology and the basic principles that inspired it, some of which we’ve already covered. After that, we studied the layout of the octree that enables the reduced complexity FMM offers. This was especially important because our goal was to come up with an MPI version in order to leverage a large number of nodes.
Some difficulties, such as dealing with the octree nodes’ complex representation (parametrized by many template parameters) or deciding how to distribute those nodes in a sensible manner, motivated us to focus on the near-field pass of the algorithm.
Speeding up the near field pass
The near-field pass is pretty similar to a naive Coulomb solver; they differ only in that the near-field pass restricts each box’s interactions to the box itself and its close neighbours. Our first approach was to come up with the simplest implementation possible and measure how much the communication times grew as we scaled up the number of nodes.
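To make "naive Coulomb solver" concrete, here is a toy all-pairs version in Python (an illustrative stand-in; the actual project code is a C++ MPI implementation). The near-field pass would simply restrict the inner loop to sources in the target's own and neighbouring boxes.

```python
import numpy as np

def coulomb_potentials(positions, charges):
    """Naive O(n^2) potential: phi_i = sum over j != i of q_j / |r_i - r_j|."""
    n = len(charges)
    phi = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                phi[i] += charges[j] / np.linalg.norm(positions[i] - positions[j])
    return phi

positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
charges = np.array([1.0, -1.0, 1.0])
phi = coulomb_potentials(positions, charges)
```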
The results showed that the simplest schema performed surprisingly well on the data distribution part but extremely badly in the results collection. This indicates that broadcasting is very well optimized and that if we want to scale up in number of nodes we have to fix the reduction times.
At this point, we had three hypotheses:
The message size is too big
Using a Reduce operation somehow chains the delays, which eventually add up to a significant amount as the number of nodes grows.
There is a load imbalance in spite of the workload being highly homogeneous
Sadly, every explanation I came up with proved wrong: there was no load imbalance, and greatly reducing the message size and sending it directly to the master node didn’t provide any significant speedup.
Conclusions
Our results show that our goal is not as easy as one might think at first. This is not good news but it also means that there is much work left to be done, and that should make you happy! Weird and complex problems leave you room to try crazy ideas and that is definitely fun.
On a personal note, Summer of HPC showed me the difference between tackling a very complex problem without a clear path and solving a guided university assignment. This change can sometimes be frustrating, but it is what it is: doing research is difficult, and you can’t expect to solve everything at first sight.
The experience has been great and much fun, and I would like to thank Ivo and the SoHPC team for making this possible.
Hello everyone! It’s been a while since my first post. And I look forward to giving you more detailed information about the data I’m working on.
How is the data structured?
There are multiple resources that can be used as data sources in the system analytics project: EAR, XALT, SLURM, and Prometheus. I am working with Prometheus data. Prometheus is an open-source systems monitoring and alerting toolkit. The data coming from Prometheus contains a lot of information about HPC systems at the node level. Prometheus also creates a new datapoint about each node every 30 seconds, which means that we can get information about the nodes in the Lisa cluster every half minute. In order to obtain a uniform dataset, I worked on 6 days of data obtained from more than 200 CPU nodes, which amounts to more than 4 million datapoints. Each datapoint contains 70 different features, from file system usage to core temperatures and many more. This data, of course, was waiting for me in a compressed state in a database. (I am sharing a small section of the first datapoint below.) I needed to decompress this data first and then make it suitable for PCA analysis, which would allow us to extract useful features for our project.
Yes, this is the small portion of a single datapoint!
A bit of parallelization
In order to use the data in the PCA analysis, all of the features had to be numerical values and had to be normalized. First, I wrote serial code for this. It took approximately 20 hours for this code to run over the dataset of more than 4 million datapoints. But of course, our time is precious! So I parallelized the code and made it run 4x faster, even on my local computer. (I’m going to be a little proud of myself here, because I’m doing parallel programming for the first time :))
PCA Results
After pre-processing, I had 4,787,342 datapoints, each containing 74 features. I then normalized this dataset and applied PCA. For ease of visualization, I primarily used 3 components. We expected to see some clustering in the PCA results, but the result instead has a linear structure. Could the analysis somehow have inferred the time ordering, even though we didn’t include the timestamps in it?
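For readers who want to try this at home, here is a minimal PCA sketch using the SVD (the random matrix below is just a stand-in for the real 4.7M x 74 normalized dataset):

```python
import numpy as np

def pca(X, n_components=3):
    # Center the data, then use the SVD: the principal components are the
    # right singular vectors, and the explained-variance fractions follow
    # from the singular values.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Xc @ Vt[:n_components].T, Vt[:n_components], explained[:n_components]

# Toy stand-in for the normalized Prometheus matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 74))
scores, components, explained = pca(data, n_components=3)
```

The `scores` array is what gets plotted in the 3-component visualization, and `explained` tells you how much of the variance each component captures.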
We also obtained other interesting results. Looking at the PCA components, we noticed that almost none of the file system features are useful for explaining the data. Apart from that, features reported in similar units, such as node1, node5 and node15, are almost equally important and rank higher when we order the components.
What is next?
Based on this study, we decided to continue with a single node, to remove some outliers using PCA, and then to continue with a new dataset. I will share the details with you in my next post!
Welcome to my third and final blog post. At the end of the previous post, I briefly explained the goals of this project: running LQCD simulations to extract a pion mass, and finding an input quark mass that would give a pion mass value close to the experimental one. Of course, all of this may once again sound ambiguous, so let me describe it step by step.
What Is Hadron Spectroscopy?
In QCD theory, almost every problem has to do with the calculation of various physical quantities using the path integral. However, those integrals are exactly what perturbative techniques fail to calculate, which forces us to use the LQCD formalism. In our case, we will focus on a class of those problems called hadron spectroscopy. In these problems, we intend to find the mass of a hadron (e.g. a pion) based on specific input parameters.
The most essential input parameters are the gauge field configuration and the quark mass. To obtain a solution that converges to that of the path integral, we must carefully select a quark mass and conduct the same experiment with many different gauge configurations. Next, we can take the average of all outputs, which should lead to a relatively accurate solution, depending on how well we chose those input parameters.
Path Towards Pion Mass Extraction
Since we have discussed the necessary theoretical background, we can move forward with the procedure I followed to complete my project. Initially, I selected a quark mass equal to -0.01 and ran simulations with many slightly different gauge configurations. Before you ask: yes, the mass can be negative here! That happens due to an additive mass shift introduced during discretization, but it shouldn’t bother us. I should also clarify that all masses in the experiments are expressed in lattice units, which makes them dimensionless. That is yet another mathematical trick to make the calculations easier for the computer.
Upon completing those simulations, I extracted their pion correlators and calculated their average and standard error. A correlator is a function describing the relation of microscopic variables (e.g. spin) at different positions. The statistical observables we calculated are the key to the final step of this process: performing an exponential fit to acquire the pion mass. To achieve this, I utilized the gnuplot tool. However, its pion mass error estimate was slightly inaccurate, as it doesn’t consider the correlation among configurations. To remedy this, I replaced it with the jackknife error, which is much more precise in such cases.
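A minimal sketch of the jackknife error for the mean of an observable (say, a correlator at a fixed time slice) over gauge configurations might look like this:

```python
import numpy as np

def jackknife_error(samples):
    # Leave-one-out resampling: recompute the mean with each gauge
    # configuration removed, then estimate the error from the spread
    # of those leave-one-out means.
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    loo_means = (samples.sum() - samples) / (n - 1)
    center = loo_means.mean()
    return np.sqrt((n - 1) / n * np.sum((loo_means - center) ** 2))
```

For a plain mean this reproduces the usual standard error; the advantage of the jackknife shows up when the resampling is propagated all the way through the exponential fit, where correlations among configurations matter.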
Figure 2: Graph with the line representing the relation between the pion mass squared and the quark mass. The horizontal blue line is at the desired pion mass squared.
Finally, despite the already decent estimate of the pion mass value, I tried to get closer to the experimental value, which amounts to 135 MeV and translates into a lattice mass of 0.129. To this end, I exploited the approximately linear relation between the square of the pion mass and the quark mass to find an ideal target quark mass. I started by repeating the simulation with another quark mass equal to -0.05. Then, I created the line defined by the pairs we found, plugged in the desired pion mass squared, and ended up with a quark mass of -0.087.
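The interpolation step is simple enough to sketch in a few lines. Note that the pion-mass-squared values below are illustrative stand-ins, not my actual measured numbers; only the quark masses and the 0.129 target come from the text:

```python
import numpy as np

# Hypothetical (quark mass, pion mass squared) pairs in lattice units,
# standing in for the values measured in the two simulation runs.
m_q = np.array([-0.01, -0.05])
m_pi_sq = np.array([0.090, 0.045])   # illustrative numbers only

# Fit m_pi^2 = a * m_q + b, then invert for the desired lattice
# pion mass of 0.129.
a, b = np.polyfit(m_q, m_pi_sq, 1)
target_m_pi_sq = 0.129 ** 2
target_m_q = (target_m_pi_sq - b) / a
```

Because the pion mass squared shrinks with the quark mass, the target quark mass comes out below both simulated values, which is exactly why the solvers start to struggle there.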
Final Results
Figure 3: Table with results extracted from experiments with 4 different quark masses
Using such a low quark mass as -0.087 dramatically increased the complexity of the underlying linear systems, and the solvers in chromaform struggled to make the solutions converge. For this reason, the lowest mass I could try was -0.07, though it still provided a satisfactory outcome, as seen in Figure 3. If there were more time, I would probably search for a solution to this problem by modifying other aspects of the chromaform tool (e.g. the precision or the maximum number of iterations of the linear system solvers), but this can be an interesting future project!
A Journey That Came To An End
As we all know, every good thing ends one day. Reaching the end of my journey with PRACE SoHPC, I reflect on the whole experience and consider it one of the most precious of my academic life. During the program, I learned many new things and broadened my horizons. It is an undeniable fact that PRACE SoHPC is a worthwhile experience, which I recommend to everyone.
P.S. Don’t forget to check out the following video presentation that my colleague Christopher Kirwan and I made about the project. Also, I strongly recommend checking out Christopher’s really interesting work on benchmarking with the Kokkos API.
High Performance Quantum Fields Video Presentation
In my last update, I mentioned that I applied PCA to all datapoints and shared the results. But I kept an important detail hidden for this post 😉 When we added up the explained variance percentages from the PCA, we could only reach 40%. This means that we were only able to capture 40% of the information about the datapoints.
Less Data, More Components
Then, in order to get more successful results, we decided to shrink the dataset we applied PCA to and increase the number of components until the explained variance percentage improved. Thus, I started working on a dataset of nearly 200 thousand datapoints containing information about 10 nodes. We chose these nodes according to their CPU usage rates, trying to select the nodes with the most job submissions within the selected 6 days.
Afterwards, I increased the number of components one by one and reran the PCA. As a result, we reached 98% explained variance using 15 components.
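Instead of increasing the component count by hand, the search can be automated from the cumulative explained-variance curve; here is a sketch (on toy low-rank data, not the real node dataset):

```python
import numpy as np

def components_for_variance(X, threshold=0.98):
    # Cumulative explained-variance fractions from the singular values;
    # return the smallest number of components reaching the threshold.
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)
    cumulative = np.cumsum(S ** 2) / np.sum(S ** 2)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy check: data of intrinsic rank 5 embedded in 30 dimensions should
# need at most 5 components to explain ~100% of the variance.
rng = np.random.default_rng(1)
low_rank = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 30))
k = components_for_variance(low_rank, threshold=0.98)
```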
Outlier Detection by PCA
Error rates for one of the nodes. The node has 16 thousand datapoints, shown on the x-axis, while the y-axis shows the error rate.
After that, we used 15 components in all of our PCA analyses. We decided to clean up the dataset a bit with the help of PCA. For this, we used the inverse transforms of the principal components obtained from the PCA. Then, using the distance between these reconstructed values and the original datapoints, we removed some of the outliers from the dataset. While doing this, we set the maximum error rate to 5; in this way, we kept 95% of the data in the dataset and eliminated 5%.
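The reconstruction-error idea can be sketched as follows (a toy stand-in with a plane in 10 dimensions; the real analysis used 15 components on the 74-feature node data):

```python
import numpy as np

def pca_outlier_mask(X, n_components=15, max_error=5.0):
    # Project onto the leading principal components, reconstruct
    # (the inverse transform), and keep only the datapoints whose
    # reconstruction error stays below max_error.
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]
    reconstruction = Xc @ V.T @ V + mean
    errors = np.linalg.norm(X - reconstruction, axis=1)
    return errors <= max_error  # True for the datapoints we keep

# Toy check: 16 points living in a 2-D plane of a 10-D space,
# plus one point sticking out of the plane.
plane = [[a, b] + [0.0] * 8 for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)]
X = np.array(plane + [[0.0] * 9 + [4.0]])
mask = pca_outlier_mask(X, n_components=2, max_error=1.0)
```

Points well explained by the leading components reconstruct almost perfectly, while a point outside the learned subspace keeps a large residual and gets dropped.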
Count of error rates in certain ranges. Most datapoints have an error between 0-1 or 1-2. The x-axis shows the error ranges, while the y-axis shows the count of datapoints whose error rate falls within each interval.
Didn’t We Already Find the Outliers?
But of course, the question here is: why did we clean this dataset? Wasn’t our goal to find outliers anyway? Yes, our goal is to find outliers, but our plan for doing so is as follows: train a machine learning model on the time series data so that it can predict the features of the next step. We will use this clean dataset to train the model, and then find anomalies using the model’s predictions.
Next Step
We’ve come to the coolest part of the project: selecting a machine learning model and training it on the dataset we cleaned. If you can’t wait to see the GPUs get to work on this, see you in the next post!
LSTM, or long short-term memory, is an artificial neural network model that can process data sequences. This is especially important for time series prediction, as LSTMs can remember long-term observations. But since LSTMs are powerful models, we have to be careful when using them: the given input and the hyperparameters are very important. Even when the problem is a time series, we should not forget that the model may still fail to produce proper results.
What do I want to do with the LSTM?
We have a dataset of Prometheus information from 10 nodes, cleared of outliers. Using this dataset, we want to make predictions with the help of an LSTM, and find anomalies in the system based on those predictions.
Some of the questions we will face when making predictions with an LSTM are: How much historical data will we feed in, and how far ahead do we want to predict? Is our dataset large enough, or will it cause overfitting? What should the hidden layer size be? What should our batch size be?
We first decided to predict only CPU usage, because CPU usage usually follows a trend: if it’s high in the current step, it will likely be high in the next step. Therefore, we focused primarily on this feature alone to get relatively good accuracy.
Look at the Results!
When we started our experiments, we only used the data of one node (the node with the most jobs run). Since this node was in constant use, its data included unexpected spikes, so it was not an easy node to predict. We tried to obtain the next 50 predictions by feeding in the last 100 observations. We fed the entire training dataset in each epoch and trained the LSTM for approximately 200 epochs.
Of course, the first results we got were not quite as we wanted. But among the many parameters we can change, which one should we start with?
First of all, we stopped feeding in the last 100 observations: since they cover all the node’s activity over the last 50 minutes, we reduced the window to 8. We also started predicting only the next step’s CPU usage to reduce the likelihood of the model overfitting. We experimented with hidden layer sizes from 2 to 64, started using dataloaders, and tried batch sizes from 16 to 1024. Again, to prevent overfitting, we added dropout, applied it at different rates, and experimented.
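To make the setup concrete, here is a sketch of the windowing step (an 8-observation window predicting one step ahead, matching the settings above); the LSTM itself is omitted here:

```python
import numpy as np

def make_windows(series, seq_len=8, horizon=1):
    # Slice a 1-D CPU-usage series into (input window, target) pairs:
    # seq_len past observations are used to predict the value
    # `horizon` steps after the window.
    X, y = [], []
    for i in range(len(series) - seq_len - horizon + 1):
        X.append(series[i:i + seq_len])
        y.append(series[i + seq_len + horizon - 1])
    return np.array(X), np.array(y)

# Toy series: 12 fake CPU-usage readings.
windows, targets = make_windows(np.arange(12.0), seq_len=8, horizon=1)
```

Shrinking `seq_len` or growing `horizon` is exactly the kind of knob-turning described above: shorter windows give the model less to memorize, and a longer horizon makes the task harder but more useful.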
In short, we ran experiments with different hyperparameters hundreds of times, and finally we reached some baseline results :))
The baseline results of the LSTM on the test dataset
Looking at the results, we see that the model is able to predict up and down trends in CPU usage. This is good news for us! So, is there still room for improvement? Could we achieve this accuracy using fewer features? Is it possible to get a better test loss by changing the fed sequence length or playing with the hidden size? Wait for the next post for the answers to these questions!
Hello, and welcome back to my blog! Since I published my first blog post, where I introduced this project, I have been learning about genomics and how to analyze DNA on a High Performance Computing cluster. My work has revolved around JLOH, a toolkit for analyzing genetic data. In this blog post, I will go deeper into how this works and tell you more about JLOH.
DNA is a double-stranded sequence of four bases: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), which are paired in a double helix as A-T and G-C and are collectively called nucleotides. Just like the order of letters in a word determines its meaning, the order of these bases encodes instructions for the cell machinery to build proteins; such an instruction is what we call a “gene”. Genes make up a chromosome, and several chromosomes constitute a genome.
Working with DNA is complicated because DNA sequences tend to be long. For context, every single cell in the human body contains a complete copy of the human genome, whose haploid size is approximately 3 billion base pairs (3 Gbp). Humans are diploid, so the actual DNA content is double (6 Gbp). To handle this complexity, scientists have come up with different file formats for representing the correspondingly large genomic data so that it can be analyzed in an HPC environment.
File Formats
To analyze a genome, we must first establish its exact sequence through DNA sequencing. For this project, we are working with hybrid yeasts, sequenced using Illumina paired-end sequencing. This technology produces a pair of sequences, called “reads”, for each DNA fragment; these come as a pair of files in FASTQ format, with each file containing millions of reads. FASTQ is a text-based format in which the information for each sequencing read is organized in four lines: 1) an identifier, 2) the actual sequence, 3) a separator “+” sign, and 4) an ASCII-encoded quality score for each sequenced nucleotide.
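The four-line record structure makes FASTQ easy to parse. Here is a minimal sketch (real FASTQ files are usually gzip-compressed and streamed rather than held in memory, so this is only for illustration):

```python
def parse_fastq(lines):
    # FASTQ records are groups of four lines:
    # 1) "@"-prefixed identifier, 2) sequence, 3) "+" separator,
    # 4) per-base quality string (same length as the sequence).
    records = []
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines), 4):
        ident, seq, sep, qual = lines[i:i + 4]
        assert ident.startswith("@") and sep.startswith("+")
        records.append({"id": ident[1:], "seq": seq, "qual": qual})
    return records

sample = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "FFFF"]
records = parse_fastq(sample)
```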
FASTQ is ideal for representing short sequences and their quality scores. A leaner and simpler method to represent longer DNA sequences without any cumbersome quality scores is the FASTA format, which contains only the sequence name and the sequence itself. Just like FASTQ, FASTA is a text-based format that uses the single-letter codes A, C, T, and G to represent the bases in a DNA sequence. Before high-throughput sequencing technologies such as Illumina, this was the only way to represent a sequence, as traditional sequencing methods, e.g. Sanger sequencing, did not produce per-base qualities. Now it is mostly used to represent reference sequences.
The first five lines of the FASTA file for the Saccharomyces cerevisiae reference genome. The first line, starting with a greater-than sign “>”, contains a description of the sequence. The other lines contain the sequence.
The next step is establishing the location of the sequencing reads within the genome. To do this, the raw reads are aligned onto a reference sequence. This generates a tab-delimited text file called a Sequence Alignment Map (SAM). SAM files tend to be enormous so they occupy a lot of disk space and may take a long time to process. For easier analysis, SAM files are converted to a compressed binary format called BAM (Binary Alignment Map), which despite not being human-readable is smaller in size and easier to access using a computer.
The SAM format (Source: Samtools Documentation)
Often when studying the genetics of a population, most positions in the genomes of the individuals under study will contain the same base. Such fixed sites are not very informative and can be dropped from the analysis. Instead, we keep track of how many differences exist between the reads and the genome (i.e. “variants”), and where they occur within our reference genome. This is done using another tab-delimited text format known as the Variant Call Format (VCF), generated from mapping the reads to a reference sequence and identifying the variants.
An example VCF file viewed in a spreadsheet. VCF enables us to record variants (single nucleotide polymorphisms or SNPs, insertions, and deletions), and efficiently compress the data so that it can be indexed, accessed, and extracted quickly.
(Source: EMBL-EBI Website)
Now that the data is in text format, it can be parsed and manipulated using a scripting language such as Python.
Enter JLOH…
JLOH from the block
Developed at my host lab, JLOH identifies regions in a genome where stretches of the differences between two genomes have been lost during evolution through a process called “loss of heterozygosity”.
The sample we are using for testing is a hybrid of the yeast species Saccharomyces cerevisiae and Saccharomyces uvarum. When used with a hybrid, the JLOH algorithm needs three inputs:
Two reference sequences in FASTA format, containing the genomes of the parents of the hybrid species.
Two BAM files with the mapping records of the hybrid genome reads onto the two parental genomes.
Two VCF files containing the SNPs or single base pair substitutions, representing the differences between the sequenced reads of the hybrid and the two parental genomes.
With these files, JLOH outputs blocks of the two parental genomes where some alleles have been lost, i.e. LOH blocks. JLOH produces several output files, the most important one being a TSV file containing the LOH blocks, aptly named jloh.LOH_blocks.tsv. These output files, together with the input BAM files, can be viewed in a genomics viewer to interpret the results; the view would look something like the image below.
JLOH output viewed in a genomics viewer (source: JLOH GitHub)
That is it for this post. It is a long read so if you have read it all through, you deserve a pat on the back. I hope you now have a basic understanding of how a genome can be analyzed so stay tuned for the next post where I will tell you a little more about JLOH’s core functionality and what I have been doing to make sure it is fast, resource-efficient, and can scale on an HPC cluster. In case anything is unclear, please leave your question in the comment section below and we can discuss it.
Hello everyone! It’s been a while since my previous post, as PRACE SoHPC has kept me busy! Before introducing my project to you, I would like to explain some basic concepts to clarify in advance some of the ambiguous scientific stuff you are going to come across in my posts. So grab your favourite snack or beverage and let me talk about some cool physics concepts!
Particle Physics Basics
All existing particles are divided into elementary particles and composite particles: the former are not comprised of other particles, while the latter are. All charged particles are considered to have an antiparticle with the same properties but an opposite charge. In order to classify the elementary particles based on their properties and introduce the rules of their interactions, we resort to a theory called the Standard Model. This theory divides them into fermions and bosons depending on a particle property called spin. Fermions have half-integer spin and obey the Pauli exclusion principle. In contrast, bosons have integer spin and don’t obey the Pauli exclusion principle.
The bosons are what we call force mediators. This means that every particle interaction can be viewed as an exchange of a boson. The only exception is the newly discovered Higgs boson, whose field is associated with giving other particles their inertial mass. Other than that, the remaining bosons are the photon, the gluon, and the W & Z bosons, which are the carriers of the electromagnetic, the strong, and the weak force respectively. Moreover, each of those forces has an associated charge which particles must carry in order to participate in that interaction: the electric charge for the electromagnetic force, the colour charge for the strong force, and the weak charge for the weak force. An interesting fact is that, unlike electrically neutral photons, gluons and W & Z bosons carry the charge corresponding to the force they mediate. As a result, they can interact with themselves.
The fermions are the particles responsible for making up matter and are in turn separated into quarks and leptons. Their main difference is that quarks have colour charge while leptons do not. However, both come in six flavours that can be categorised into three generations, where particles in different generations have similar properties but different masses. Specifically, the quarks are: down, up, strange, charm, bottom, and top. The leptons are: electron, muon, tau, electron neutrino, muon neutrino, and tau neutrino. All those particles except for the neutrinos have an electric charge.
Basic hadron categorisation
Now that we have been introduced to the elementary particles, we should also make a quick reference to the composite particles. Even though these include hadrons, atomic nuclei, atoms, and molecules, we will mainly be interested in hadrons, since the rest are combinations of hadrons and elementary particles. Hadrons are made of quarks and are divided into baryons, which have an odd number of quarks (usually three), and mesons, which are made of an even number of quarks (usually one quark and one antiquark) and are relatively unstable. Due to the spin resulting from their number of quarks, baryons are fermions while mesons are bosons. The most popular baryons are protons and neutrons, whereas the most popular mesons are pions and kaons.
What is Lattice Quantum Chromodynamics?
All neutral colour charge combinations
In the above section, we mentioned a new type of charge called colour charge and associated it with the strong force. Initially, we should clarify what exactly the strong force does. As we discussed, quarks are the main constituents of matter since they form the protons and neutrons, while gluons mediate the strong force just like photons mediate the electromagnetic force. However, one should wonder how quarks are bound together since their small mass could enable them to move constantly at speeds close to the speed of light. The answer to this question is that they are held together by the strong force, which is indeed the strongest of the four forces!
Now that we have understood the strong force, we need to shed some light on the colour charge. This charge comes in three colours (red, green, blue) and three anticolours. A particle made up of all three colours, or all three anticolours, or one colour and its corresponding anticolour, is colour-neutral. The colour charge can also explain the existence of some composite particles, such as the delta baryon, which is constructed from three up or three down quarks. Without the property of colour charge, it would contain quarks with the same spin in the same state, which would violate the Pauli exclusion principle. Consequently, just like Quantum Electrodynamics (QED) is used to explain electromagnetic interactions, the theory of Quantum Chromodynamics (QCD) describes the action of the strong force.
In physics problems, it is quite common to rely on what are referred to as perturbative solutions. This entails finding an approximate solution to a problem by starting from the exact solution of a related, simpler problem. For instance, you can think of the Taylor series expansion of a challenging function. That would be a great way to solve QCD problems, but there’s a catch. The highly nonlinear nature of the strong force and the large coupling constant at low energies make finding perturbative solutions extremely complicated. For this reason, physicists tend to formulate QCD in discrete rather than continuous spacetime by using a lattice. Lattice QCD (LQCD) is a non-perturbative approach that simplifies the daunting task of solving QCD problems.
What is my goal with the project?
To conclude our extensive discussion of the fascinating world of LQCD, it is about time I described my project’s objective: to efficiently use a tool called chromaform with the help of HPC resources and run several QCD simulations. Their target is to successfully extract the mass of a pion and then find an input quark mass that would result in a pion mass close to the experimental one. Those experiments will give us a good idea of how much the input quark mass and some omitted quark-loop effects influence the resulting pion mass. In my next blog post, I will explain the whole process much more thoroughly and present my results at each step.
Modern high performance computers have diverse and heterogeneous architectures. For applications to scale and perform well on these modern architectures, they must be re-designed with thread scalability and performance portability as a priority. Orchestrating data structures and execution patterns between these diverse architectures is a difficult problem.
The Kokkos programming model, broken into constituent parts. [Source: kokkos.org]
Kokkos is a C++ based programming model which provides methods that abstract away parallel execution and memory management. Therefore, the user interacts with backend shared-memory programming models (such as CUDA, OpenMP, C++ threads, etc.) in a unified manner. This minimizes the amount of architecture-specific implementation details a programmer must be aware of.
So, how is this done?
Kokkos defines 3 main objects to aid in this abstraction: Views, Memory Spaces and Execution Spaces.
Views: A templated C++ class that is a pointer to array data (plus some metadata). The rank/dimension of the view is fixed at compile time, but the size of each dimension can be set at compile time or runtime. Each view also has its own layout, either column-major or row-major, depending on which memory space the view stores its data in (e.g. the host CPU, CUDA, etc.).
The view dev allocated to the CUDA memory space. A shallow copy of the metadata is made from the default host space (CPU). [Source: Kokkos Lectures, module 2]
Execution Space: This is a homogeneous set of cores with an execution method, or in other words a “place to run code”. Examples include Serial, OpenMP, CUDA, HIP, etc. The execution patterns (such as Kokkos::parallel_for) are executed on an execution space, usually the DefaultExecutionSpace, which is set at compile time by passing the -DKOKKOS_ENABLE_XXX parameter to the compiler. However, one can change the space on which a pattern is executed, as patterns are templated on execution spaces.
Memory Space: Each view stores its data in a memory space, which is set at compile time. If no memory space is passed to the view, this is set to the default memory space of the default execution space.
This is only a very brief overview of the main concepts of Kokkos. A lecture series exploring these concepts and much, much more is available here.
What has been done so far?
The main aim of this project has been to explore the use case of Kokkos for Lattice Quantum Chromodynamics simulations, mainly by developing staggered fermion kernels and comparing them to handcrafted code for specific architectures. Working benchmarks of the staggered action and the staggered Dirac operator have been developed. As Kokkos is based on shared memory, the current stage of development has been to extend these kernels to work across multiple nodes and GPUs using MPI.
Teleportation might seem a fancy expression, and you might think it’s only possible in the fictional world. However, it’s a reality in quantum communication. No quantum information is actually transferred between sender and receiver during the process, except for the necessary classical data, which is sent through a classical channel.
What is quantum teleportation?
In my first blog post, I promised to come back with my first month’s experience of the HPC project. By this time, I have learned two important algorithms and one protocol as initial training. It’s quite hard to talk about all of them in this short post, so I chose to give you a quick walkthrough of the quantum teleportation protocol.
Quantum teleportation is basically a protocol that facilitates the transfer of quantum information, i.e. an unknown quantum state |φ⟩ = α|0⟩ + β|1⟩,
from a sender to a receiver. Say the sender is named Alice and the receiver is named Bob; then the information transfer between them requires two main components:
A source to produce an entangled qubit pair or EPR pair.
A classical communication channel to transfer classical bits.
In case you are unfamiliar with the concept of an entangled state, it’s a quantum state such as (|00⟩ + |11⟩)/√2.
Clearly, this state has a 50% chance of being measured in the state |00⟩ and a 50% chance of being measured in the state |11⟩. The most important implication of such a state is that measuring one qubit tells us the state of the other, as the superposition collapses immediately after measurement. For example, if we measure the top qubit and get the state |1⟩, the overall state of the pair would be |11⟩.
This implication is significant for quantum teleportation because, even with a separation of light-years, measuring one qubit of the entangled pair appears to have an immediate effect on the other. Now let us focus on the protocol itself. A schematic is presented below.
An animation showing quantum teleportation [Credit: Source].
As you can see, Alice (Source) and Bob (Destination) each receive one qubit of an entangled pair (AB). Alice then performs some operations (a Bell state measurement) on her end and sends the results to Bob over a classical channel. Upon receiving the classical information, Bob performs some more operations on his end and finally has the state D. Let’s now look more closely at what’s happening in each box.
What’s inside the black boxes?
We will now build the quantum circuit for the protocol step by step. The fundamental components of any quantum circuit are quantum gates. Our protocol needs four gates: the Hadamard gate, the CNOT gate (which acts on two qubits), the X gate, and the Z gate. I would suggest you look here to learn more about quantum gates.
Quantum teleportation circuit using qiskit [Credit: Source].
Let’s go through the following steps and see how the circuit is built.
Step 1: The first step is to create an EPR pair of the qubits of Alice and Bob. It is done before the first dotted barrier by applying a Hadamard gate followed by a CNOT gate.
Step 2: In this step, Alice applies two gates (a CNOT followed by a Hadamard gate) to the state |φ⟩ and her own qubit. For the CNOT, |φ⟩ acts as the control qubit and Alice’s qubit as the target qubit.
Step 3: Now Alice measures both the qubit to be transmitted and her own qubit. She then stores the outcomes in two classical bits and sends them to Bob through the classical channel.
Step 4: Depending on the classical bits Bob receives from Alice, he applies the corresponding corrections to his qubit: an X gate if the second bit is 1, and a Z gate if the first bit is 1.
And finally!! The state Bob receives is exactly the one Alice had. And that’s how the protocol enables a piece of quantum information to be sent to a distant receiver. If you are further interested in the mathematical proof that the state Bob receives is the same state |φ⟩, you can give this a read.
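If you would rather check the protocol numerically than algebraically, here is a small statevector sketch in plain numpy (not the qiskit circuit from the figure): it runs the steps above and checks that, for every possible measurement outcome, Bob’s corrected qubit equals |φ⟩.

```python
import numpy as np

# Single-qubit gates and the identity.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])
I = np.eye(2)

def apply_1q(state, gate, qubit):
    # Apply a one-qubit gate to a 3-qubit statevector.
    # Qubit 0 is the most significant bit of the basis index.
    ops = [I, I, I]
    ops[qubit] = gate
    return np.kron(np.kron(ops[0], ops[1]), ops[2]) @ state

def apply_cnot(state, control, target):
    # Permutation matrix for a CNOT between two of the three qubits.
    U = np.zeros((8, 8))
    for b in range(8):
        bits = [(b >> 2) & 1, (b >> 1) & 1, b & 1]
        if bits[control]:
            bits[target] ^= 1
        U[bits[0] * 4 + bits[1] * 2 + bits[2], b] = 1
    return U @ state

def teleport(phi):
    # phi = (alpha, beta): the unknown state |phi> on qubit 0.
    state = np.zeros(8, dtype=complex)
    state[0], state[4] = phi                  # |phi> (x) |0> (x) |0>
    state = apply_1q(state, H, 1)             # Step 1: EPR pair on qubits 1, 2
    state = apply_cnot(state, 1, 2)
    state = apply_cnot(state, 0, 1)           # Step 2: Alice's CNOT + Hadamard
    state = apply_1q(state, H, 0)
    outcomes = {}
    for m0 in (0, 1):                         # Step 3: the four possible results
        for m1 in (0, 1):
            # Project onto qubits 0, 1 = (m0, m1); what remains is Bob's qubit.
            bob = np.array([state[m0 * 4 + m1 * 2], state[m0 * 4 + m1 * 2 + 1]])
            bob = bob / np.linalg.norm(bob)
            if m1:                            # Step 4: Bob's corrections
                bob = X @ bob
            if m0:
                bob = Z @ bob
            outcomes[(m0, m1)] = bob
    return outcomes

phi = np.array([0.6, 0.8])                    # any normalized pair works
results = teleport(phi)
```

All four measurement branches end in the same state (0.6, 0.8), which is exactly the claim of the protocol: whatever Alice measures, Bob’s corrections recover |φ⟩.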
I hope you learnt something interesting. Please do let me know in the comments if you have any questions. I am yet to talk about my findings for this project, so check out my upcoming blogs if you want to learn more exciting stuff about quantum algorithms.
N-body simulations are fundamental in many areas of research. For example, if we want to predict how galaxies evolve or how molecules interact with each other, we have to compute the gravitational or Coulomb interactions.
Computational complexity
The first problem with these N-body simulations is that the naive approach has O(n²) complexity, as it considers every pair of particles. To solve this we can use the Fast Multipole Method (FMM), bringing the complexity down to O(n). The algorithm exploits the fact that these interactions decrease rapidly with the distance between particles.
Another important feature is that it keeps the error bounded, and given the peculiarities of floating-point arithmetic it can even produce more accurate results than the naive approach in some specific situations.
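For scale, here is what the naive O(n²) approach that FMM replaces looks like: a direct-summation sketch of gravitational accelerations, with a small softening length (all units and values are arbitrary, for illustration only):

```python
import numpy as np

def direct_forces(pos, mass, G=1.0, eps=1e-3):
    # Naive O(n^2) gravitational accelerations: every pair of particles
    # is visited, which is exactly the cost FMM avoids. `eps` is a small
    # softening length that prevents division by zero for close pairs.
    n = len(pos)
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = pos[j] - pos[i]
            d = np.sqrt(np.dot(r, r) + eps ** 2)
            acc[i] += G * mass[j] * r / d ** 3
    return acc

# Two unit masses one unit apart attract each other symmetrically.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mass = np.array([1.0, 1.0])
acc = direct_forces(pos, mass)
```

Doubling the particle count quadruples the work in this double loop; FMM instead groups distant particles into multipole expansions so the cost grows only linearly.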
Exploiting the supercomputer’s hardware
Once we have a good FMM implementation, the next step is to parallelize it using constructs such as threads, tasks to be executed independently, or even accelerators such as GPUs. This works great and has given good results in terms of efficiency and speedup, but it has a big problem: it cannot scale beyond a single node! Scaling is important because we want to simulate increasingly complex scenarios, so we have to figure something out.
We can solve this issue with MPI, but doing so requires very hard work because we have to explicitly distribute the workload between nodes while keeping the number of communications to the minimum.
It is definitely a challenge but we’ll do our best! I hope you enjoyed the post and thank you for reading.
It goes without saying that the physics of an earthquake is complicated. This, in combination with the complex structure of the earth’s crust, makes the effects of earthquakes difficult to predict. Luckily, high performance computing and numerical models are the perfect tools to simulate complex earthquake scenarios!
In my first blog post I briefly mentioned that I would work on this project – an HPC wave propagation solver called SEM3D. This software is developed at my host institution and efficiently solves the 3D wave-propagation problem using the Spectral Element Method (a type of Finite Element Method) together with a time-stepping algorithm.
So far so good, but how does this actually work? In this blog post, I will try to summarize my first few weeks of learning, and hopefully give a good high-level explanation of how earthquakes can be simulated in an HPC environment!
Note: The Finite Element Method (FEM) is a popular method for solving differential equations numerically by splitting a large domain of interest (such as the earth’s crust) into smaller, simpler elements (for example cubes). The solution to the differential equation is then approximated by simpler equations over the finite elements. In this blog post, I will assume that the reader has some basic knowledge of FEM and numerical methods for solving (partial) differential equations.
Semi-Discretization of the Wave Equation
The wave equation is a time-dependent second-order partial differential equation, meaning that in addition to the spatial variables, one must also take time into account. Anyone who has ever written FEM code is fully aware of the increased implementation effort required when going from 1D to 2D, or worse, 2D to 3D.
Since we’re already working with a 3D domain (the earth’s crust), we will not treat time as the fourth dimension in our FEM code, but instead, opt for semi-discretization and a time-stepping scheme. This essentially means that we formulate the spatial dimensions of the equation as a FEM-problem, but express time propagation as an ordinary differential equation which can be solved with a simpler numerical time-stepping scheme.
Spectral Element Method
The Spectral Element Method (SEM) is very popular for simulating seismic wave propagation. Unlike the simpler finite difference method, it can handle the complexity of 3D earth models, while (for this problem) offering better spatial discretization, lower calculation costs, and a more straightforward parallel implementation than normal FEM.
SEM and FEM are very similar in many aspects: both, for instance, use conforming meshes to map the physical grid to a reference element using nodes on the elements and basis functions. These are some of the key features that distinguish SEM from FEM:
SEM typically uses higher-degree polynomials for the vector fields, but lower-degree polynomials for the geometry (whereas FEM often uses the same lower-degree polynomials for both)
The interpolation functions are Lagrange polynomials defined at the Gauss-Lobatto-Legendre (GLL) points (see figure).
SEM typically uses hexahedral elements in combination with the tensor product of 1D basis functions
For numerical integration, SEM uses the GLL rule instead of Gaussian quadrature. The GLL rule is not exact, but accurate enough.
Left: Lagrange interpolants at the Gauss-Lobatto-Legendre points on the reference segment [-1, 1]. Right: The (n+1)² Gauss-Lobatto-Legendre points on a 2D face for n=4.
Although it may sound strange to limit the implementation to hexahedra, GLL points, and inexact integration, the combination of the three leads to an enormous benefit: a fully diagonal mass matrix. I really want to emphasize that this simplifies the algorithm and complexity drastically.
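A quick way to see why the mass matrix becomes diagonal: the Lagrange basis functions at the GLL points satisfy ℓ_i(x_k) = δ_ik, so under GLL quadrature the mass matrix entries M_ij = Σ_k w_k ℓ_i(x_k) ℓ_j(x_k) collapse to diag(w). A small numerical sketch in Python (my own illustration, not SEM3D code):

```python
import numpy as np
from numpy.polynomial import legendre as leg

def gll_points_weights(n):
    """GLL points: the endpoints -1, 1 plus the roots of P_n'(x)."""
    c = np.zeros(n + 1)
    c[n] = 1.0                                        # coefficients of P_n
    interior = leg.legroots(leg.legder(c))            # roots of P_n'
    x = np.concatenate(([-1.0], np.sort(interior), [1.0]))
    w = 2.0 / (n * (n + 1) * leg.legval(x, c) ** 2)   # GLL quadrature weights
    return x, w

n = 4
x, w = gll_points_weights(n)

# Since ell_i(x_k) = delta_ik, M_ij = sum_k w_k ell_i(x_k) ell_j(x_k) = w_i delta_ij,
# i.e. the mass matrix is simply diag(w):
M = np.diag(w)
```

The GLL rule is exact only up to degree 2n-1 (hence "not exact, but accurate enough"), and that small sacrifice is what buys the diagonal M.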
Time Stepping
After obtaining the mass and stiffness matrices M and K from the semi-discretization and SEM formulation, we’re left with a second-order ordinary differential equation Mu'' + Ku = F which allows us to propagate over time.
There exist several different time-stepping schemes, most of which are relatively straightforward. SEM3D uses a second-order accurate velocity Newmark scheme in which the velocity (u’) at the next time step is a function of the known velocity and displacement at the previous time step.
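As a toy illustration of time stepping for Mu'' + Ku = F (a simple explicit update of my own, not the actual SEM3D Newmark implementation), note how a diagonal mass matrix turns the acceleration solve into a cheap element-wise division:

```python
import numpy as np

# hypothetical tiny system: diagonal mass (as SEM produces) and a stiffness matrix
M = np.array([2.0, 1.0])                  # stored as a vector since it is diagonal
K = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
F = np.zeros(2)

dt = 0.1                                  # must satisfy the stability condition
u = np.array([1.0, 0.0])                  # initial displacement
v = np.zeros(2)                           # initial velocity

for _ in range(1000):
    a = (F - K @ u) / M                   # diagonal M: just element-wise division
    v += dt * a                           # explicit (symplectic-Euler-style) update
    u += dt * v
```

With a full mass matrix, the first line of the loop would require a linear solve at every time step; this is the payoff of the diagonal M discussed above.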
Stability
As in all numerical time-stepping schemes, stability (which depends on the time step, dt) is crucial. Since a larger time step means fewer iterations and fewer calculations, we want to use the largest possible time step without losing stability. This is where I come in!
The classical CFL condition can be estimated for homogeneous materials and structured meshes. When adapted to heterogeneous media, however, the method often becomes unstable. To avoid instabilities, SEM3D uses a safety factor which results in an extremely small time step, making propagation much slower than necessary.
The better alternative, and my job to implement, is an advanced stability condition based on the largest eigenvalue of the matrix M-1K. This advanced stability condition will allow us to calculate the maximum allowed time step without the safety factor, which means that the overall performance of the wave propagation solver will increase significantly!
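The idea can be sketched with power iteration, which estimates the largest eigenvalue of M⁻¹K using only matrix-vector products, making it well suited to FEM/SEM codes where the matrix is never assembled. This is my own simplified sketch, not the SEM3D implementation:

```python
import numpy as np

def largest_eigenvalue(apply_MinvK, n, iters=500, seed=0):
    """Power iteration: repeatedly apply M^-1 K and normalise."""
    x = np.random.default_rng(seed).standard_normal(n)
    lam = 0.0
    for _ in range(iters):
        y = apply_MinvK(x)
        lam = np.linalg.norm(y)   # valid here since the eigenvalues are positive
        x = y / lam
    return lam

# toy example with assembled matrices (real codes apply M^-1 K matrix-free)
M = np.diag([2.0, 1.0, 1.5])
K = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
lam_max = largest_eigenvalue(lambda x: np.linalg.solve(M, K @ x), 3)
dt_max = 2.0 / np.sqrt(lam_max)   # stability limit of an explicit central scheme
```

Once λ_max is known, the maximum stable time step follows directly, with no safety factor required.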
That’s it for now – hopefully, you have a rough understanding of how seismic wave propagation can be simulated! In my next blog post, I will focus more on my own work and dive deeper into the implementation aspects of the advanced stability condition in Fortran 90 and an HPC environment.
Due to recent changes in travel restrictions and programme funding, PRACE is planning to change the fully virtual Summer of HPC 2022 into a hybrid mode, in which some PRACE hosting sites will provide on-site placement and face-to-face (F2F) mentoring for around 15 projects, while 7 remain online.
Travel, accommodation and a stipend will be covered for one student per project, as in the times before Covid-19 restrictions. For PRACE hosting sites that remain in virtual (online) mode, mentoring two students per project, financial support will be given to both students. Students who have already applied will need to consider this change of programme mode and modify their applications if necessary. For this reason, applications are extended for two weeks. See the updated timeline for details.
Although for some sites either mode (F2F or online) is possible, PRACE prefers F2F. The final decision will be agreed by first offering F2F; if that is not preferred, the two selected students will be offered online participation.
2201 and 2202 at BSC, Barcelona, Spain, either way (F2F possible) but with a certain preference for online
2203 at CC SAS, Bratislava, Slovakia, Online
2204 and 2205 at CINECA, Trieste and Bologna, Italy, either
2206-2211 at IDRIS and partners, France. Project 2207 online; the rest can be either way, with a slight preference for the online version for project 2211
2212 and 2213 at IT4I, Ostrava, Czechia, online
2214 and 2215 at FZJ, Juelich, Germany online only
2216 at STFC, Warrington, UK, online
2217 at SURF, Amsterdam, Netherlands, F2F
2218 at University of Ljubljana, Slovenia, F2F
2219, 2220, and 2221 at University of Luxembourg, either way
2222 at VSC, Vienna, Austria, either (F2F on-site or online)
List of the projects:
2201 Leveraging HPC to test quality and scalability of a genetic analysis tool
2202 Fusion reactor materials: Computational modelling of atomic-scale damage in irradiated metal
2203 Neural networks in chemistry – search for potential drugs for COVID-19
2204 Automated Extraction of Satellite BAthymetric data by Artificial Intelligence strategies
2205 A dashboard for on-line assessment of jobs execution efficiency
2206 Designing a Julia Parallel code for adaptive numerical simulation of a transport problem
2207 Assessment of the parallel performances of permaFoam up to the tens of thousands of cores and new architectures
2208 Optimization of neural networks to predict results of mechanical models
2209 Implementation of an advanced STAbility condition of explicit high-order Spectral Element Method for Elastoacoustics in Heterogeneous media
2210 Turbulence Simulations with Accelerators
2211 High Performance Data Analysis: global simulations of the interaction between the solar wind and a planetary magnetosphere
2212 Fundamentals of quantum algorithms and their implementation
2213 Heat transport in novel nuclear fuels
2214 High Performance Quantum Fields
2215 Chitchat, Gossip & Chatter – How to efficiently deal with communication
2216 Scaling HMC on large multi-CPU and/or multi-GPGPUs architectures
2217 High Performance System Analytics
2218 Parallel big data analysis within R for better electricity consumption prediction
2219 Computational Fluid Dynamics
2220 Performance Comparison and Regression for XDEM Multi-Physics Application
2221 Designing Scientific Applications on GPUs
2222 HPC-Derived Affinity Enhancement of Antiviral Drugs
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe to late-stage undergraduate and master’s students. Up to 44 top applicants from across Europe will be selected to participate in pairs on 22 projects supported and mentored online and on-site at 11 PRACE hosting sites. Participants will spend two months working on projects related to PRACE technical or industrial work and produce a report and video of their results. See recent update on “Hybrid 2022” mode.
Up to 44 top applicants from across Europe will be selected to participate in pairs on 22 projects supported and mentored online from 11 PRACE hosting sites.
Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. PRACE will financially support the selected participants during the programme, which will run from 4 July to 31 August 2022.
Late-stage undergraduate and master’s students are invited to apply for the PRACE Summer of HPC 2022 programme, to be held in July and August 2022. Consisting of a training week and two months of online participation at top HPC centres around Europe, the programme offers participants the opportunity to share their experience and learn more about PRACE and HPC.
Applications are open until 26 April 2022 (extended from 12 April). Applications are welcome from all disciplines. Previous experience in HPC is not required, as training will be provided. Some coding knowledge is a prerequisite, but the most important attribute is a desire to learn and share experiences with HPC. A visual flair and an interest in blogging, video blogging or social media are desirable.
The programme will run from 4 July to 31 August 2022. It will begin with a kick-off online training week organised by University of Ljubljana and PRACE Summer of HPC hosting sites – to be attended by all participants.
Applications will be open in February 2022. See the Timeline for more details.
The PRACE Summer of HPC programme is announcing its projects for 2022 for preview and comments by students. Please send questions directly to the coordinators by mid-February. Clarifications will be posted near the projects in question or in the FAQ.
About the Summer of HPC programme:
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 40 top applicants from across Europe will be selected to participate in pairs on 24 projects supported and mentored online from 11 PRACE hosting sites. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. The programme will run virtually from 29 June to 31 August 2022.
For more information, check out our About page and the FAQ!
Ready to apply? (Note, not available until mid February, 2022)
As you can tell from the title, this will be my final blog post. I will be giving more details about the MCPU, as promised in my previous blog post, which you can read here if you haven’t.
So let’s get started!
The Memory Tile
Illustrated above is the basic structure of the memory tile I worked on simulating using Coyote. The memory tile houses the MCPU (Memory Central Processing Unit), which can loosely be described as the ‘intelligence’ of the memory tile, responsible for organizing the resources needed to perform the different memory operations. These resources are obtained from the microengine, the vector address generator (VAG) and within the MCPU itself. The microengine is responsible for generating transactions for the instructions, whereas the vector address generator generates the memory requests. Another impressive feature of this memory tile is that it allows the reuse of already implemented functionality. For example, a scalar load operation is handled like a unit-stride vector load with a loop iteration of 1.
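That reuse idea can be sketched in a few lines of Python (hypothetical names and an assumed 8-byte element size, purely for illustration; the real memory tile does this in hardware):

```python
def vector_addresses(base, stride, count, elem_size=8):
    """Generate the addresses of a strided vector memory operation."""
    return [base + i * stride * elem_size for i in range(count)]

# a vector load of 4 elements with unit stride...
vector_addresses(0x1000, 1, 4)   # [0x1000, 0x1008, 0x1010, 0x1018]
# ...and a scalar load, handled as the same operation with a single iteration
vector_addresses(0x1000, 1, 1)   # [0x1000]
```

One address-generation path thus serves both scalar and vector operations, which is exactly the kind of resource sharing the MCPU coordinates.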
My primary objective was to understand how instructions, commands and data packets are to be received into the memory tile. Once an overall understanding of the architecture was established, our goal was to simulate the different load and store operations (shown below) and analyse their output and performance.
Below is the video presentation my partner Aneta and I submitted for our project. In it, we discuss how the memory operations and scheduling processes are implemented in Coyote:
Before I leave …
The past two months have been the most exciting for me this year. I have had the opportunity to learn a lot from the internship and my able mentors. I have also developed an interest in high-performance computing, and I look forward to exploring it further in the near future. To you who are reading, I hope you enjoyed reading my blogs as much as I enjoyed writing them, and I also hope you learnt something new. I would like to thank my mentors at BSC, my partner Aneta, PRACE, and you, the readers, for the guidance and support 🙂
Here we are again, for the third and last time! What an amazing summer it was! Now it is September, new projects will start soon, and new adventures are waiting for me. But before moving on, I would like to use my last blog post to talk about something useful I learned and why you should apply for this programme.
Computational Fluid Dynamics Tools
In general, a fluid dynamics computation is made up of three different steps: pre-processing, simulations, post-processing.
Pre-processing: in this part, we generated our mesh, composed of up to 46 million tetrahedral elements. To create the mesh, we used ANSA. Unfortunately, this is not an open-source tool, but it is available in a student version. My teammate, Benet Eiximeno Franch, who has been a student during this programme, provided the team with the meshes for the three different geometries.
Simulations: this is the crucial part of the work. We chose OpenFOAM, an open-source CFD solver widely used in research and industry. We used the simpleFoam and pimpleFoam algorithms to evaluate the solution for the steady-state and transient simulations, respectively. Following the advice of our supervisor Ezhilmathi Krishnasamy, we selected a RANS model based on the two-equation k-Omega SST turbulence model. We also tested other turbulence models and compared the results in the final paper. What we noticed is that the k-Epsilon and k-Omega SST results agree in the wake region, whereas the standard k-Omega differs considerably from the others: it is designed to predict the flow near the wall well, while the wake is a free region far from the wall, where k-Epsilon computes the solution well.
Post-processing: like in all the HPC simulations, we used Paraview, which is a free open-source tool. With it, you can obtain some nice images. But pay attention! CFD is Computational Fluid Dynamics, not Color Fluid Dynamics! Each color must have a physical meaning.
Q-criterion
This is one of the typical visualizations used to identify and draw vortices. To understand the main idea, I will explain how we computed it. Starting from the definition of the gradient of the velocity field, we can split it into two tensors: one symmetric and one antisymmetric. The first is associated with the strain rate, while the second is linked to the rotational capabilities of the fluid element we are considering.
Splitting the velocity gradient.
From these definitions, we can build the scalar Q value as shown
Scalar Q-value
where “tr” stands for the sum of all the diagonal elements of the matrix. In this way, it is easy to understand what it shows when it is positive or negative. In particular, we have
Q<0: areas of higher strain rate than vorticity in the flow
Q>0: areas of higher vorticity than strain rate in the flow
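In code form, the same decomposition and the resulting scalar Q look like this (a numpy sketch of my own with hypothetical velocity gradients, using the common form Q = ½(tr(ΩΩᵀ) − tr(SSᵀ))):

```python
import numpy as np

def q_criterion(grad_u):
    """Split the velocity gradient into strain rate S and rotation Omega."""
    S = 0.5 * (grad_u + grad_u.T)       # symmetric part: strain rate
    Omega = 0.5 * (grad_u - grad_u.T)   # antisymmetric part: rotation
    return 0.5 * (np.trace(Omega @ Omega.T) - np.trace(S @ S.T))

# pure rotation gives Q > 0, pure strain gives Q < 0
rotation = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
strain = np.array([[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]])
```

Evaluating Q at every grid point and drawing an isosurface (e.g. Q = 10, as below) is what produces the vortex pictures.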
The figure below shows this for Q=10 for all three configurations: Fast-, Estate-, and Notchback.
Steady-states Q-criterion for the three back geometries.
Overview of turbulence modeling
I already discussed it in my previous blog post. You can check it here.
Blunt body incompressible aerodynamics
Cars are blunt bodies. For that reason, when the Reynolds number is very high and we are considering an incompressible turbulent regime, a particular phenomenon appears: the so-called drag crisis (crisis of resistance), in which the drag coefficient (CD) suddenly decreases. This is also why golf balls have small dimples: they accelerate the laminar-to-turbulent transition of the flow, triggering the effect just mentioned and reducing the drag. It is also the reason why the CD remains almost constant as the Reynolds number varies. We performed several simulations for all the configurations at three different Reynolds numbers; in the picture, you can see these effects. If you are interested, you can ask us for the final report paper and we will send you all the details.
CD-Reynolds graph. Comparison of F, E, and N.
Rear Mirror streamlines
We also analysed the rear mirror effects on the three different geometries. You can follow the video to understand them better. Anyway, I would like to give you a short introduction with the following image.
Rear mirror streamlines.
Final results
Our final results are published in our final report that you can find on the main page of the PRACE Summer of HPC programme. Here, I would like to share with you our recorded final presentation that we made on the 31st of August 2021.
Final presentation.
Why should you join SoHPC?
Generally speaking, SoHPC is an enjoyable programme. In the first week, you will learn the main concepts of High-Performance Computing through lectures hosted by one of the most important HPC research institutes in Europe. In our case, we had the pleasure of being welcomed by ICHEC, in Ireland. We learned many important things in a short time, such as different programming techniques, Python, and cluster key access.
Then, you will be divided into teams, one for each project. You will have the pleasure of working with and meeting other students from all over Europe with passions similar to yours: HPC, coding, and, in my case, fluid dynamics. The HPC projects that SoHPC offers cover many scientific areas of interest: fluid dynamics, big data, machine learning, FEA, Earth observation, parallelization of codes, etc.
In the project you work on, you will have access to one of the HPC clusters in Europe. You can work on what you love and learn a lot of surprising things. Apply and let me know!
You can contact me on LinkedIn: Paolo Scuderi. I am looking forward to talking with you!
After two months of hard work, the time has come to say goodbye to this project. I have had the opportunity to work on the visualisation of one of the most powerful supercomputers in Europe and I could not be more grateful for that.
Since I started my Multimedia Engineering degree I have been fascinated by the different ways to generate graphics and how useful they can be, from a web page or a mobile application to a C program. This time I used Python and the VTK package to generate these graphics, and I am very happy with the result, as it was a technology I had not tried before and I achieved the objectives. Here is the final presentation video so you can take a look:
Finally, since it has not been possible to do so for the last two months, I am going to visit CINECA and I will be able to see in person the supercomputer I have been working with. I think there can be no better ending than this farewell trip.
So, before beginning, I want to remind you that my project involves the convergence of HPC and big data; more details can be found here. The project’s main aim is to explore the benefits of each system (i.e., big data and HPC) in such a manner that both benefit from each other. Several frameworks have been developed to this end; the two most popular are Hadoop and Spark, and for my use case I have worked with Spark.
Case study (IMERG precipitation datasets)
At the same time, I have a case study from my field of interest, i.e., precipitation. The use case deals with analyzing the Integrated Multi-Satellite Retrievals for GPM (IMERG), a satellite precipitation product from the Global Precipitation Measurement (GPM) mission by the National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA). The dataset comes in multiple temporal resolutions, such as half-hourly, daily, and monthly. Here, I have used the daily dataset from 2000 to 2020, which is approximately 230 GB (each daily file is around 30 MB). The main reason I’m using this dataset is that I have faced several problems working with these large amounts of data, especially in R.
Although HPC systems are often equipped with enough memory to load and analyse such datasets, R is not often the first choice for big data analysis on HPC. But since I already have my script in R and I’m very comfortable with R, I want to stick with it to analyse the datasets. This is where Spark becomes useful. Although Spark initially supported Scala, Java, and Python, it later introduced an R interface. To work with Spark from R, there are two main libraries: SparkR (developed by the Spark core team) and sparklyr (developed by the R user community). Both libraries share some similar functions and will converge into one soon, so we are using both.
The bottleneck and result
The main bottleneck of dealing with NetCDF datasets is that the data is stored in multiple dimensions. Therefore, before starting any analysis, the datasets should be converted to a more user-friendly format such as an R data frame or CSV (I prefer the .RDS format). The R script to convert NetCDF to an R data frame was slow, taking approximately 235 seconds for 30 files of around 30 MB each on the Little Big Data (LBD) cluster, which has 18 nodes, each with 48 cores. As the R script was designed for a single machine, it does not really use the benefits of a multi-core cluster.
Therefore, the same R script was applied through the spark.lapply() function, which parallelizes and distributes the computations among the nodes and their cores. Initially, we started with a small sample (30 files), and the benchmark results (5 runs) are shown in Table 1. SparkR was significantly faster than R: on average, R takes 235 seconds to complete the process (i.e., write an .RDS file), whereas SparkR (writing to Parquet format) took just 5.58 seconds, approximately 42 times faster.
Method   Minimum   Mean     Maximum
R        231.17    235.34   239.57
SparkR   5.39      5.58     5.70
Table 1: Benchmarking R versus SparkR in reading and extracting one month of IMERG daily datasets (units are in seconds)
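The spark.lapply() pattern used above can be mimicked in plain Python to show the idea (a hedged sketch with a placeholder conversion function and hypothetical file names; Spark additionally distributes the tasks across the cluster’s nodes rather than one machine’s cores):

```python
from concurrent.futures import ThreadPoolExecutor

def convert(path):
    # placeholder for the real NetCDF extraction (e.g. with netCDF4/xarray),
    # returning rows of (file, lon, lat, precipitation)
    return [(path, 0.0, 0.0, 0.0)]

files = [f"imerg_day_{i:03d}.nc" for i in range(30)]   # hypothetical names
with ThreadPoolExecutor() as pool:
    # one independent conversion task per file, exactly like spark.lapply()
    tables = list(pool.map(convert, files))
```

Because each file is converted independently, the work is embarrassingly parallel, which is why the Spark version scales so well.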
So, that’s it for blog#2!
Hope you enjoyed reading it, and if you have any questions, please comment and I will be happy to answer.
As of yesterday, we presented our final results and submitted the report on the project we have been working on during this summer. It was a very fun day, as we also got to see what the other groups have done, and it feels nice to be able to relax a bit before the studies start again in the autumn. Let’s talk about how the final weeks went and pick up the loose ends from the last blog post!
This project has been a lesson in how to readjust your project in line with new results, and since the last blog post we moved the goalposts quite a bit. As it turned out, we could not extrapolate much between parameter values for different matrices, as the correlations simply were not present in the small data set we analysed. Because of this, we turned our focus to finding optimal parameters for single matrices instead.
We then found three problems with the Bayesian sampling technique we had chosen:
The sampling of the real results was sometimes uneven in workload, which gave rise to outliers with inaccurate reward values and, as a result, a skewed model.
Because the algorithm always sampled the theoretical maxima of the number of points, it tended to get stuck early in a local maximum, pointing toward a need for more exploration.
The function surface of the statistical model was sometimes poorly fitted for our needs, as fitting one good reward value was given lower priority than fitting many clustered samples.
As the project had already undergone a couple of changes when this was discovered, and since gathering data was such a long process, we could not look further into these issues beyond pointing them out to give future research some thoughts to start with. While a more conclusive positive result would have been a fun way to end the summer, we are still happy with what we managed to do given the time frame. For a more detailed summary of the final algorithm and its issues, please see the video linked above with our final presentation.
If you have followed the SoHPC blogs during this summer, I hope you have had as interesting a journey as we have had, and that you have learned a lot!
After these two months as part of the PRACE Summer of HPC, contributing to MPAS (Model Prediction Across Scales) together with Jonas Eschenfelder, I learned a lot: how to run the MPAS atmosphere model on the national supercomputer ARCHER2, how to investigate the performance of simulation runs depending on the number of cores and on set-ups such as simulation time and physical parameters, and how to process the output and visualize it with Python. I also experienced how well remote collaboration across Europe can work, both with our project mentors Evgenij Belikov and Mario Antonioletti from EPCC in Edinburgh and with my project partner Jonas Eschenfelder from Imperial College London.
At the beginning, my excitement about having the possibility to work on ARCHER2, and to run a new atmosphere model on this supercomputer, was over the top. Even when installation and set-up problems came up, I stayed optimistic that we would overcome them, hoping that once these barriers were broken down, the results would be worth the persistence. And that proved true: we were not just able to run performance tests, but also to write a visualisation script in Python. The latter was my favourite part of our project, so I will shortly describe the steps behind this visualisation script.
The MPAS simulation runs output compressed netCDF files, from which I plotted the physical parameters directly onto a sphere (see also my second blog post). In that script, it was necessary to distinguish between the different locations where values are stored (cell, vertex, edge). However, it is possible to transform these three location types to longitude and latitude, which gives a consistent connection between the output parameters and their locations. It also makes it easier to use common Python map packages, such as Cartopy, to plot a map in the background for better orientation. This helps not only for global simulations, but especially for regional ones; see the figures below.
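As a small illustration of that coordinate transform (hypothetical values; in MPAS output the cell, vertex and edge coordinates are stored in radians, with longitudes in 0..2π):

```python
import numpy as np

# hypothetical MPAS-style cell coordinates in radians (the real arrays
# lonCell/latCell, lonVertex/latVertex, lonEdge/latEdge come from the netCDF file)
lonCell = np.array([0.10, 0.20, 6.20])
latCell = np.array([0.70, 0.80, 0.90])

lon_deg = np.degrees(lonCell)
lat_deg = np.degrees(latCell)
# shift longitudes from 0..360 to -180..180 so they plot directly on a Cartopy map
lon_deg = np.where(lon_deg > 180.0, lon_deg - 360.0, lon_deg)
```

Applying the same conversion to the vertex and edge arrays puts every field on one consistent coordinate system for plotting.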
Temperature plot over 8 hours as a GIF. The larger temperature drop in the Alpine and Pyrenees regions can be seen as a validation that the grid is correctly centred, and also indicates that the simulated data are realistic. Meridional wind plot over 8 hours as a GIF. The dark spot in the middle of the figure is a well-known wind pattern, the so-called Mistral wind. That this is visible in our output data again reflects the validity of our MPAS run.
During the PRACE Summer of HPC, I not only learned a lot of new skills; the experience was also enriching on a personal level and made me more self-confident. Experiencing success when working on things that are completely new to you, as long as you persist, will, I am sure, also encourage me in future projects.
We have come to an end now, and we all gained a lot in both hard and soft skills. We’ve done so many things, and I will try to cover them all in this post.
First, as I said in my last post, I want to tell you about the Precision Based Differential Checkpointing (PBDCP) technique, which is the method I implemented during the project.
PBDCP is implemented for float and double type data sets. Floating-point numbers map decimal numbers to a unique bit representation and have 3 parts: sign bit, mantissa and exponent. The idea of Precision Based Differential Checkpointing is to cap the last bits of the mantissa, the least significant bits, according to the given precision value. As an example, in the figure below, if the precision value given by the user is sixteen, it truncates the last seven bits of the mantissa (i.e., sets them to 0).
Example of pbdcp operation for ieee754 floating point representation
Therefore, PBDCP lets us benefit from the dCP share, which reflects the difference between stored checkpoints, even when changes are small. The advantage is transferring less data and avoiding continuous traffic for such minor changes.
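The bit-capping idea can be sketched in Python for single precision (my own illustration using the standard IEEE 754 layout; the actual FTI implementation operates on the checkpoint buffers in C):

```python
import struct

def truncate_mantissa(x, precision):
    """Zero out the trailing (23 - precision) mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (~((1 << (23 - precision)) - 1)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

# with precision 16, two nearby values collapse to the same bit pattern,
# so differential checkpointing sees no change and transfers nothing
a = truncate_mantissa(3.14159265, 16)
b = truncate_mantissa(3.14159300, 16)
```

Capping more bits makes more nearby values collide, which is exactly the precision/dCP-share trade-off measured in the experiments below.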
Usage
To use this mechanism, there are some basic additions to the configuration file. Firstly, the enable_pbdcp value should be set to 1 and a pbdcp_precision value should be entered. Then, in the application, when calling the FTI_Checkpoint function, the user should specify the PBDCP level, which is FTI_L4_PBDCP.
Experiment Results
As expected, the smaller the precision, the more benefit we get in dCP share, since values remain unchanged over a certain interval; but the larger the RMSE (root mean square error) between uncapped and capped values. This can be seen in the charts (Precision against DCP share and RMSE) at the top of the page, which use dCP block size = 1024, iterations = 200, and checkpoint interval = 5 (i.e., 40 checkpoints).
Comparing pure dCP and precision-based dCP: again as expected, the dCP share is lower with PBDCP. An example result of an execution with block size = 1024, precision = 4, iterations = 200, and checkpoint interval = 5 is shown in the graph below.
So this is the work I studied on during this internship. My project mate Thanos and I prepared a video presentation for SoHPC which you can watch here…
Ending
I had a great two months. I was a bit nervous in the beginning, but only until I met the mentors, coordinators, the BSC group, the other students, and so on. They are all so helpful, warm-hearted and sincere that there was no need to hesitate at all. I am grateful to everyone for giving us the chance to be part of this awesome adventure together.
Today I want to tell you what I was working on over the last few weeks. As I mentioned in my previous blog post (https://summerofhpc.prace-ri.eu/how-to-distinguish-different-surfaces-of-molecules/), one goal of this year’s project is to distinguish different disconnected surfaces automatically. It can happen that the computation yields several very small surface components which are physically irrelevant. For example, in the image above we see one large surface (grey) and several tiny surface components inside of it. The goal is to sort out these small surfaces automatically. To do that, we need a criterion for when a surface should be sorted out. The approach we used in our implementation is to compute the volume of each single surface component. If the volume falls below a certain threshold value, the corresponding surface can be sorted out.
In this blog post, I want to describe how we computed the volume of a molecule. The divergence theorem, also known as the Gauss-Ostrogradsky theorem (https://mathworld.wolfram.com/DivergenceTheorem.html), gives us a formula for doing this.
Let S be the surface of a molecule. Then, the theorem leads to the following formula:
Volume of the molecule = \frac{1}{3}\oint_S (x, y, z) \cdot \vec{N} \, dS
Here, for each point (x,y,z) on S, \vec{N} is the unit normal pointing outwards of the surface S at (x,y,z). To use this formula for our computations, we approximate the integral by a finite sum:
Volume of the molecule \approx \frac{1}{3}\sum_i (x_i, y_i, z_i) \cdot \vec{N}_i \, A_i
Our program computes a triangulation that approximates the surface of a molecule. Each (x_i, y_i, z_i) is a triangle vertex of this triangulation, \vec{N}_i is the average normal of all the triangles containing it, and A_i is one-third of the sum of the areas of the triangles containing it (see the image below).
The average normal (yellow) is the average of the triangle normals (pink) of those triangles that contain the vertex.
In our program, we compute the volume of each disconnected surface. As a threshold value, we chose the volume of a molecule, which is approximately 14 Ångström³. We tested the computation of the volume for several molecules. For example, the grey outer surface of the molecule in the title image has a volume of 4642 Ångström³, and the volumes of the small surfaces inside lie in the range of 0 to 0.7 Ångström³. So, with our computation, we can sort out these small surfaces automatically.
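The same divergence-theorem idea can be sketched in a few lines. The snippet below is an illustration, not our actual implementation: instead of the per-vertex averaged normals described above, it uses the equivalent per-triangle form of the surface integral, which is exact for a closed, consistently oriented triangulation.

```python
import numpy as np

def surface_volume(vertices, triangles):
    """Volume enclosed by a closed triangulated surface, via the divergence
    theorem: V = (1/6) |sum over triangles of (v0 x v1) . v2|."""
    v = vertices[triangles]                       # shape (n_triangles, 3, 3)
    return abs(np.einsum('ij,ij->', np.cross(v[:, 0], v[:, 1]), v[:, 2])) / 6.0

# A cube [-1, 1]^3 as a minimal test surface (true volume = 8)
verts = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                 dtype=float)
tris = np.array([[0, 1, 3], [0, 3, 2],   # x = -1 face
                 [4, 6, 7], [4, 7, 5],   # x = +1 face
                 [0, 4, 5], [0, 5, 1],   # y = -1 face
                 [2, 3, 7], [2, 7, 6],   # y = +1 face
                 [0, 2, 6], [0, 6, 4],   # z = -1 face
                 [1, 5, 7], [1, 7, 3]])  # z = +1 face

print(surface_volume(verts, tris))  # 8.0
```

Each disconnected component's volume would then be compared against the threshold (14 Ångström³ above) to decide whether it is kept or sorted out.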
This was a rough overview of the computation of the volumes. If you are interested in more details, feel free to ask your questions in the comments below.
So it actually happened. I’m sitting here on a rainy day, wearing a sweater and thinking summer is over. The Summer of HPC is over at least; whether the sun ever comes out again is questionable as well (thank you, London weather). But what have we been up to in the last few weeks of the project?
Performance of MPAS atmosphere for a regional cut. Notable is the superlinear speedup and efficiency above 60% even after the run time plateaus.
We ran our final experimental set-up of MPAS. This time, instead of running full global simulations, we only used a regional cut; this is a fairly common method in atmospheric modelling and allows very detailed set-ups to be run quickly. We chose a cut above the western Mediterranean, with cell sizes varying between 60 km and 3 km, running for between 1 and 8 hours. The results were very promising for MPAS: we saw both good strong (see the figure to the right) and weak scaling behaviour, which allowed us to simulate 8 hours of weather in under 20 minutes! We go into more detail in our final presentation, which you can watch here.
Animation of changing wind speeds for an 8-hour simulation starting at 6pm on January 1st, 2010. Note: negative wind speeds indicate southward direction.
This run was also our chance to test the new visualisation script we developed in our project. MPAS used to rely on NCL, a plotting language created by NCAR (who also developed MPAS), for any visualisation. However, they recently announced that development of NCL will stop in order to move all plotting towards Python. Since a model is of limited use without the possibility of visualising its outputs in a way that humans understand, we decided to write visualisation scripts that can be used for MPAS. This worked very well and allowed us to create beautiful animations of the regional experiment. My favourite one is shown here: the wind speeds of an 8-hour run. The main feature is a persistently strong wind in the Gulf of Lion. We believe this is the Mistral, a persistent wind pattern bringing cold, strong winds to the south of France. Moreover, when looking at where the Alps should be, the wind almost seems blocked by them. As an Earth scientist, I really enjoy staring at this, finding patterns and trying to explain what they show. It also shows very well just how powerful these kinds of weather models are.
Overall, I really enjoyed my time working on this project. I had never imagined myself working on a supercomputer, especially not on one of the most powerful machines in Europe. Wanting to go into research later in my career, this helped me discover a whole new side of research, and I’m very thankful for this opportunity. Having had only very little knowledge about computing, it was at times quite challenging, but I learned so much about programming. Also, whenever I truly didn’t know how to solve an issue, our amazing mentors Evgenij Belikov and Mario Antonioletti helped guide us through it all. Even though we sadly weren’t able to work on this project in Edinburgh and had to work remotely, it was a lot of fun and a great show of how important cooperation across Europe is for research. So this is it from me for the Summer of HPC. Thank you to all of you who have read through my articles and came with me on this journey!
Welcome to my third blog post, where I will be talking about my experience with the Summer of High Performance Computing (SoHPC). Once again, my name is Ioannis Savvidis; if you would like to get to know me a little better and see what we have done up until now, check out my first post, which is an introduction to who Ioannis is, and my second post, where I write a bit more about the project I work on.
In my last post I explained how we managed to find a working implementation of a sparse BLAS library and successfully integrated it into our existing code. After some testing on a methotrexate molecule, the results that came back were not looking good: the new code with the sparse BLAS implementation was slower than the old code.
For our code to shine, we knew that we had to go for bigger molecules, where the matrices are even more sparse. That’s how we started testing on six different alkane molecules, running the old and new code and checking whether the run times improved. The alkanes that we used, as I said in my second post, were C6H14, C12H26, C18H38, C24H50, C30H62, and C36H74. A schematic, together with methotrexate, can be seen on the left.
Normalized run times of the new code for different alkanes on different number of cores
At first, a single node of a supercomputer was used exclusively for our tests. Each node of the supercomputer has 4 CPUs with 8 cores each. So we measured the timings of the codes for different numbers of cores, specifically 1, 2, 4, 8, 16, and 32. Among our test alkane molecules, almost all had their fastest run times at either 4 or 8 cores, with the majority at 8 cores, while the slowest timings came from the 32-core runs.
From all the data we collected, we were able to determine that for C36 we had an average run-time improvement of 4.2% with our new code.
After that point, and with more nodes at our disposal, we tried to push the parallelism further by testing how the run times scale from 1 to 4 nodes using 4 and 8 cores each. Unfortunately, the timings weren’t consistent, and further troubleshooting is needed. Finally, in our last week on the project, we increased the number of basis set functions for the C30 and C36 molecules and tested the new code. In the end, we ran into some numerical errors and did not have time to run more tests.
That concludes our journey with SoHPC. I want to say that I’m really happy with the outcome of our project, and I feel really grateful for the opportunity to participate in this year’s programme. I learned a lot of new things under the guidance of our mentor Dr. Jan Simunek and with the help of my colleague Eduard Zurka. Also, a big shout-out to Dr. Leon Kos and PRACE for organizing this programme, and to everyone who helped us from the training week until the end. Ultimately, I want to say to everyone who reads this post and has an interest in HPC, or wants to learn more about HPC and computational materials science/chemistry: apply to next year’s SoHPC.
As stated previously, after searching for possibilities to accelerate our Python programs, we have now summarized the results for you: first in serial form, and second in parallel. Come and check it out! It could provide you with the information you need to combine HPC and Python.
Serial programming and the time in seconds per iteration
Scale factor 256! A huge problem size, with a Reynolds number of 2.0 setting the viscosity.
Look how slow the naive Python version is; we even had to adjust the graph. The NumPy version is nearly 800 times faster than naive Python, and the performance of the NumPy version after NumExpr optimization improves greatly, even compared to the C and Fortran versions. This indicates that NumExpr optimizes NumPy array calculations well at this problem size. Compared to the serial CPU versions of the CFD program, the GPU-based Python Numba implementation is 6 times faster than the NumPy version and even slightly faster than the C and Fortran versions. It is important to ensure that the data copying from CPU to GPU and back is minimised.
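To show where a gap like that comes from, here is a minimal sketch (not the actual SoHPC CFD code) of the same kind of stencil update written twice: once with plain Python loops and once vectorized with NumPy. The NumExpr and Numba versions would start from the NumPy form.

```python
import numpy as np
import time

def jacobi_naive(psi, niter):
    # plain Python loops: one arithmetic operation at a time, per grid point
    psi = psi.copy()
    m, n = psi.shape
    for _ in range(niter):
        new = psi.copy()
        for i in range(1, m - 1):
            for j in range(1, n - 1):
                new[i, j] = 0.25 * (psi[i - 1, j] + psi[i + 1, j]
                                    + psi[i, j - 1] + psi[i, j + 1])
        psi = new
    return psi

def jacobi_numpy(psi, niter):
    # vectorized: whole-array slice arithmetic runs in compiled loops
    psi = psi.copy()
    for _ in range(niter):
        psi[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                                  + psi[1:-1, :-2] + psi[1:-1, 2:])
    return psi

grid = np.zeros((64, 64))
grid[0, :] = 1.0                       # a simple boundary condition

t0 = time.perf_counter(); a = jacobi_naive(grid, 50); t1 = time.perf_counter()
b = jacobi_numpy(grid, 50);            t2 = time.perf_counter()
print(np.allclose(a, b), f"naive {t1 - t0:.3f}s vs numpy {t2 - t1:.3f}s")
```

Both functions perform exactly the same Jacobi update, which is why the speed difference is purely down to where the inner loops run.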
The limits of serial programming are now clearer. Since we work on a computer cluster, we should use its potential and spread our code over multiple cores and maybe also nodes!
Split the work and run in parallel!
Speed-up on one ARCHER2 node (up to 128 processors)
Overall, the speed-up of all parallel programs increases with the number of ranks. For the Python programs, the performance improvement is small with a small number of ranks (1 to 16), as it is for C and Fortran. The NumExpr-optimized version is close to C and Fortran at 16 ranks. With a large number of ranks (32 to 128), the performance of all programs improves significantly. The Python NumPy MPI version consistently lags behind the other three, while the performance of the NumExpr-optimized version remains close to C and Fortran. We see that beyond 16 ranks, the scalability of the NumExpr version is better than that of the non-NumExpr version.
… going even further with previously unreleased results (only here):
Running MPI CFD versions on ARCHER2 nodes; SF=512, Re=0
The results are given relative to the time spent on one iteration with Python MPI on just one full node (128 processors). Interestingly, the speed-up is super-linear for all versions. Using 64 nodes on ARCHER2, we accelerate Python MPI by a factor of 356 relative to using one node; Fortran and C are around 750 times faster! We still think the Python MPI results are quite convincing. If you are interested in more results, you are welcome to leave a comment!
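For readers who want to reproduce the bookkeeping: speed-up and parallel efficiency are simple ratios of measured times. The numbers below are made up for illustration (they are not our ARCHER2 measurements); an efficiency above 1.0 is what "super-linear" means.

```python
# hypothetical per-iteration times (seconds) against node count
nodes = [1, 2, 4, 8]
times = [10.0, 4.8, 2.3, 1.2]

for n, t in zip(nodes, times):
    s = times[0] / t        # speed-up relative to one node
    e = s / n               # parallel efficiency (1.0 = ideal linear scaling)
    print(f"{n} nodes: speed-up {s:.2f}, efficiency {e:.2f}")
```

Super-linear behaviour usually comes from cache effects: with more nodes, each rank's share of the problem fits better into fast memory.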
After all the reading, check out our short summary video:
Thanks for enjoying the blog. See you next time at more HPC events!
Thanks to David Henty and Stephen Farr for the huge support and for making the project so interesting! And of course to Jiahua, my partner during the Summer of HPC!
Hello again. This is a little retrospective blog, as my project (2102) winds up and we finalize our report. The Covid-19 pandemic has obviously dictated a lot of how the project played out, but despite this I’ve had a brilliant time. I’ve spent it working with the fusion group at the Barcelona Supercomputing Center (BSC), on a project modelling defect cascades in tungsten during nuclear fusion (see my previous blog here).
The fusion group were really supportive and let us join their weekly meetings, which really helped us feel part of something beyond reports and presentations! It also gave me insights into the different directions the group pursues. Our project supervisors, Julio Gutiérrez Moreno and group organizer Mervi Mantsinen, were both fantastic: endlessly patient with our issues and quick with clever solutions.
From the get-go we were introduced to the HPC system at the BSC, MareNostrum, and the LAMMPS setup Julio had prepared to get us started. I had never used LAMMPS before, so it was fun to get to grips with it as I began to plot out my project. Over the weeks we were exposed to the full gamut of research work, from testing code and debugging our simulations to reviewing the literature and diving into current research to situate our work. It gave me a really great insight into the roles individuals play in a large research group.
The other project student, Paolo, and I got on well, which really helped make the project a happy one. We approached it from different ends, with him working on the calculations of thermal conductivity while I produced defect structures. We were able to support each other when issues arose, and the cooperation made the presentations and video we produced very enjoyable.
In terms of results, I was able to establish a successful procedure for the cascade. A picture of it in process is shown below (Figure 1). Initially, many atoms are displaced in a wave rippling across the structure, but after having time to settle, most of these return to regularly occupied positions, leaving a smaller number of permanent defects. In line with the literature, the number of defects formed was proportional to the energy of the cascade up to 200 keV. Various potentials for tungsten were tested and showed slight differences in the number of defects formed, but as the differences are small, more repeats are being performed to give more statistically validated results. This allows future work to tie the projects together, calculate thermal conductivity for these cascades, and move on to various alloys of tungsten which are also of interest, e.g. WTa.
Figure 1 – Snapshots of a 60,000-atom cascade simulation in the middle of a cascade (left) and after being allowed to settle (right). The larger atom is the initial cascade atom, given a high velocity.
A side of things I really enjoyed was getting to grips with the literature on fusion, such as learning about the state of the art in fusion technology and the position of various large experimental reactors in their long-running timetables. The overlap of these massive engineering projects with our atomic-level theoretical chemistry is a fascinating area to study.
Wrapping things up has left me a bit surprised, the months went by so quickly! Overall it was a great way to spend the summer!
by Mario Gaimann & Raska Soemantoro (joint blog post)
Hi everybody! It’s Raska and Mario here, back for a joint blog post. As we’re nearing the end of Summer of HPC 2021, we’ll be talking about how the project has gone since the beginning.
If you’ve been following our previous updates, you’ll know that Raska’s given a quick overview of our project and how our novel seabed classification system works (if you haven’t, be sure to check it out here). Since then, we’ve come up with newer features and developments to our software.
Previously, we explained that we use a convolutional neural network (CNN) to perform classification tasks on a set of labelled training data. To do this, we use Marconi100’s GPUs (Graphics Processing Units), as explained by Mario in his last post (check it out here). As their name suggests, GPUs were designed to process graphics such as videos and images: two types of data that can be quite heavy to operate on. For this reason, GPUs perform extremely well on large volumes of data, such as our training data.
Typical output for training a model using the TensorFlow framework.
Our system would not have been possible without the power of such a computing device. Our training data initially consisted of a database of arbitrary image ‘cuts’ from the provided map, each containing a seabed relief. We decided to go even further with these cuts in our latest development: we employed a method called Selective Search, which scans for proposed areas where seabed reliefs may exist. This gives us much more data with much higher accuracy. Because these areas equate to regions of interest, this technique is also known as Region-Based Convolutional Neural Networks (R-CNN).
Furthermore, we also decided to train two models: one that learns whether a relief exists, and one that learns the type of relief (should it exist). This helps because relief recognition on a seabed map actually consists of two tasks: relief detection and relief classification. The classification task applies only to training data that has been detected as a relief in the previous task. In programming terms, we effectively nest the second model within the first. For both of these tasks, we managed to achieve accuracies of over 90%, which enabled us to make geohazard predictions with high confidence.
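In code, the nesting is just a conditional: the second model is queried only when the first one fires. The sketch below uses stand-in functions (the names and the stub logic are assumptions for illustration, not our TensorFlow models) to show the control flow.

```python
def detect_relief(cut):
    # model 1 (stub): does this image cut contain a seabed relief?
    return cut["score"] > 0.5

def classify_relief(cut):
    # model 2 (stub): what type of relief is it? Only called if model 1 fires.
    return cut["kind"]

def recognise(cuts):
    results = []
    for cut in cuts:
        if detect_relief(cut):                    # task 1: detection
            results.append(classify_relief(cut))  # task 2: classification
        else:
            results.append(None)                  # no relief detected
    return results

cuts = [{"score": 0.9, "kind": "pockmark"}, {"score": 0.1, "kind": "none"}]
print(recognise(cuts))  # ['pockmark', None]
```

The same structure holds during training: only cuts labelled as reliefs ever reach the second model.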
Lessons (machine) learned from our AI & HPC adventure
Coming up with our seabed relief recognition architecture has been a tough but exciting challenge. We’re very glad that our physicist-engineer partnership worked very well: we could always depend on each other’s support. For most of the project, everyone involved was situated all across Europe; two of our mentors in Italy, one mentor in Southampton (Veerle subsequently went on a marine science expedition to Cabo Verde; check it out here), and us mentees split between Manchester and Munich. Communication via Slack and Zoom was just not the same as talking in person; sometimes there were misunderstandings about the scope of what needed to be implemented by whom. Even so, we were able to tackle this with weekly catch-ups with our mentors and daily updates between the two of us.
Zoom call with our mentors Veerle Huvenne (NOC) and Silvia Ceramicola (OGS); Massimiliano Guarrasi (CINECA) is missing here.
Defining the scope of our project (devising software that is able to automatically recognize subsea structures, based on the MaGIC dataset for the Calabrian coast) helped us stay focused on our goal. With our work plan, we knew how much time we had for each phase of the project and what realistic progress would look like. This was particularly useful for assessing how much time we could spend on exploring architectures and tuning hyperparameters, for example. When there were any issues, we met up and resolved them quickly. In the end, our project management enabled us to deliver on the target set out in the project description: to build an automated tool to recognise many different types of seabed structures from oceanographic data (click here for the full project description).
The tight time frame of the Summer of HPC (only eight weeks!) influenced how we approached our project. Once we had researched the relevant technologies to use, we were quick to get our hands on the implementation, knowing that composing the complete machine learning pipeline from scratch would consume quite some time. Our working mode was quite dynamic: we discussed our strategies and subsequently implemented our code independently, which was followed by a phase of integrating all parts into one unified software.
With more time and people involved, doing a more fundamental literature review as well as defining the software design before implementing it would be certainly useful.
Sightseeing in Manchester: Manchester Gay Village and the Old Quadrangle at Manchester University.
In Manchester, United — Our Real-life Meetup!
In the end, our Summer of HPC was a summer full of coding, learning and, of course, fun. We even got to meet up in real life in Manchester at the end of August! (Seeing what we looked like in full, after all these Zoom sessions where only our heads were displayed, was in fact really interesting.) During this time, we worked hard on our final video in and around Manchester University Library. For this, we even play-acted as marine geographers, quarrelling over where geohazards would be located on a subsea map. We also explain step by step why you should care about locating geohazards, so be sure to check it out here!
In Manchester Museum we marveled at giant fossils and tiny frogs.
Besides work, we spent some time exploring Manchester. We visited the University of Manchester’s historic Old Quadrangle with its beautiful, ivy-covered buildings, and checked out some other faculties and departments. During one of our lunch breaks, we visited Manchester Museum, a university museum dedicated to natural history. Apart from dinosaur skeletons and minerals we also visited the Vivarium, a section focused on conserving reptiles and amphibians, where we enjoyed watching little, colorful poison-dart frogs, iguanas, and other species. Not to forget, we lived with our very own reptile friend, the lizard Dana. Carrying out a lizard-sitting mission for one of Raska’s friends, we can say that Dana became the mascot of our project, and we had lots of fun playing with her (check our video to see more of Dana).
Our project mascot, the lizard Dana.
More to come!
With a finalized tool for the automated recognition of seabed structures, lots of HPC impressions and hands-on experience, our Summer of HPC came to a happy end. But is this really the end of our submarine geology adventure? Well, maybe not! There is still a lot to do: we would really like to explore the full potential of our tool, fine-tune the parameters of our models, try out different architectures, and improve plotting our map of geohazard predictions, just to name some of our ideas. Both of us are ready and motivated to write the next chapter of our “AI geologist” story, together with our mentors Silvia, Veerle and Massimiliano.
At this point we would like to thank you for following our HPC journey! We hope that you enjoyed it and that we passed some of our enthusiasm for AI and HPC on to you. Stay curious!
The summer of HPC is coming to an end and this will be my last post. Therefore, I think this is the perfect occasion to present the results of my work on the Boltzmann-Nordheim equation and to summarize my experiences of the last two months. As mentioned in my first blog post the goal of my project was to improve the computation of the collision term in the simulation of the equation.
How to measure improvement
“Improving the computation” is very unspecific, so let us have a closer look at what we were aiming for in our project. When running the computation of the collision term, we can measure the time the computer needs to execute the code. This is one thing we wanted to improve. Additionally, we can also look at the scaling of the code. In my second post, I explained that we want to use multiple processes of a computer to speed up the computation. When investigating the scaling of a code, we investigate how effective it is to use more processes (or “a bigger computer”) to execute the code faster.
Good vs. bad scaling
If we code a program that behaves like building Lego Grogu in the video, we have achieved good scaling: doubling the number of builders halves the building time. Likewise, we expect half the execution time of a program when we double the number of processes used. We try to avoid problems which do not have an execution time proportional to their resources. An example of such behaviour is playing Beethoven’s Symphony No. 1: no matter how many musicians are playing it, it will always take the same amount of time.
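The two analogies correspond to the extremes of Amdahl's law (my addition for illustration; the serial fraction `s` below is not a quantity from our project): with serial fraction 0, doubling the workers halves the time, like the Lego builders; with serial fraction 1, extra workers change nothing, like the orchestra.

```python
def run_time(t1, serial_fraction, p):
    # Amdahl's law: the serial part is unaffected by p, the rest divides by p
    return t1 * (serial_fraction + (1.0 - serial_fraction) / p)

for p in (1, 2, 4, 8):
    lego = run_time(100.0, 0.0, p)       # perfectly parallel task
    orchestra = run_time(100.0, 1.0, p)  # entirely serial task
    print(f"{p} workers: lego {lego:.1f}s, orchestra {orchestra:.1f}s")
```

Real programs sit somewhere in between, which is why measuring the scaling, as done below, matters.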
During the Summer of HPC, I tried to improve the scaling of a part of a program, such that it scales more like the gardeners’ problem.
About communicating processes
In my project, I focussed on the term Q1q, which currently takes the longest to compute. The key idea to improve it is to use a new pattern for communication between processes and to distribute the data among the processes in a new way. We assume that we have M×N processes and, similar to an M×N matrix, we can split them along rows and columns. Moving data along the rows and columns leads to an even and quick distribution of the data. In the following, we analyze the results in more detail.
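The distribution itself can be sketched independently of MPI. The helper below is an illustration of the idea, not the project's actual code: it computes which sub-block of the grid a given rank owns when the processes are arranged as an M×N grid.

```python
def block_bounds(n, parts, idx):
    """Start/stop indices of block `idx` when n items split into `parts` blocks."""
    base, extra = divmod(n, parts)
    start = idx * base + min(idx, extra)
    stop = start + base + (1 if idx < extra else 0)
    return start, stop

def owned_block(grid_shape, proc_shape, rank):
    """Sub-block of a 2D grid owned by `rank` in an M x N process grid."""
    M, N = proc_shape
    row, col = divmod(rank, N)      # rank's coordinates in the process grid
    rows = block_bounds(grid_shape[0], M, row)
    cols = block_bounds(grid_shape[1], N, col)
    return rows, cols

# 64x64 grid over a 2x4 process grid: rank 5 sits at row 1, column 1
print(owned_block((64, 64), (2, 4), 5))  # ((32, 64), (16, 32))
```

Communication along a row (or column) of the process grid then only involves ranks that share the same `row` (or `col`) coordinate.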
To analyze the results of two months of coding, I ran two experiments. For the first experiment, I ran a simulation on a grid with 64×64 grid points, initially using 2 processes. For the following tests, I doubled the number of processes, and each time I measured the time needed to compute the term Q1q for the new (hybrid) and the old (classical) computation.
In the second test, I fixed the number of processes to 8 but increased the grid size, starting from a 16×16 grid up to 128×128. That way, we can analyze two influences: first, how effective the scaling of the computation is, i.e. how much using more processes helps to speed up the computation; and second, how increasing the grid size slows the computation down.
The results of the experiments. On the left, the results from a simulation on a 64×64 grid with a different number of processes. On the right are the results of the test with 8 processes but a variation in the size of the grid.
In both graphs, we can see that the new hybrid method is an improvement over the initial (classical) version. We see that the hybrid method benefits more from using additional processes than the initial version does. For two processes, the runtimes are very similar, but as the number of processes increases we can compute the Q1q term faster.
Looking at the second graph, we see that the grid size still has a huge impact on the computation, but we still manage to be faster than the classical computation.
A summary
As this will probably be my last post, I also want to draw a personal conclusion about the Summer of HPC. For me, it was two months of intensive coding, which I enjoyed very much. I want to thank my two supervisors, Alexandre Mouton and Thomas Rey, who supported me during this project. It was a great opportunity for me to improve my skills in MPI programming and coding in general. Apart from the coding part of the Summer of HPC, I had a lot of fun writing these blog posts and getting creative with the videos. I hope you enjoyed them as much as I did. If you are thinking about taking part in the next Summer of HPC, I can recommend it to you! It was a great experience for me.
If you want to read more about the Boltzmann-Nordheim equation, have a look at the Blog of my project Partner Artem!
Goodbye (and good luck on your project, if you are applying for Summer of HPC 2022).
Hi there! If you are here, it is probably because you read my previous blog posts, or perhaps you are searching for a tutorial on how to calculate the thermal conductivity of a crystal… In any case, you are in the right place, because in the following I will explain the simulations I performed for my Summer of HPC project ‘Computational atomic-scale modelling of materials for fusion reactors‘. To understand the context of these simulations, I highly suggest reading my previous blog post.
First of all, we need to understand what thermal conductivity is. Usually denoted by the letter ‘k’, the thermal conductivity of a material is a measure of the ability of that material to conduct heat, and it is defined by Fourier’s law for heat conduction, q = -k∇T, where q is the heat flux and ∇T the temperature gradient. A material with a high thermal conductivity conducts heat efficiently, and that is why it is important for its application in a nuclear fusion reactor that a material like tungsten has a high thermal conductivity.
In this simple image the relation between the heat flux and the temperature difference at the two extremes of the solid is given by the thermal conductivity of the conducting solid.
How can we calculate the thermal conductivity of a material? The easiest way is to do it experimentally, applying a heat flux to the material and measuring the resulting temperature gradient; but this approach does not allow us to investigate the effect of vacancies and defects at an atomic level, and that is what we are interested in, since these defects will be formed by neutron bombardment. This is where molecular dynamics simulations come into play: with an atomistic simulation it is possible to recreate these kinds of configurations and investigate how they affect the thermal conductivity of the material. But first we need a reference value, so the thermal conductivity of a perfect tungsten crystal must be computed.
To do so, in this project the LAMMPS code has been used, and the simulations have been performed on the MareNostrum supercomputer of the Barcelona Supercomputing Center (BSC). A parallelepiped simulation box is defined and tungsten atoms are placed in a body-centered cubic (BCC) crystalline structure. Then an equilibration run is performed at a given system temperature; we used 300 K to compare the results with previous papers. To keep the system at constant temperature, LAMMPS uses a modified version of the equations of motion, a procedure that goes under the name of the Nosé-Hoover algorithm. Once equilibrium is reached, two regions are defined along one direction of the simulation box, at equal distance from the box borders and of equal volume. Using an appropriate command, a positive heat flux q starts heating one of the two regions, while an equal but negative heat flux -q starts cooling the other one.
Scheme of the simulation box and the two, positive and negative, heat fluxes.
In this way the average temperature of the system stays constant, but the first region will locally have a higher temperature and the second a lower one; we then have a temperature gradient ∇T between these two regions. Knowing the value of the heat flux q and calculating ∇T, we obtain the value of the thermal conductivity k from Fourier’s law. This procedure has been performed using different potentials and system sizes, obtaining results that agree with previous papers.
Example of a temperature profile of a system of tungsten atoms. The black line is the average temperature, while the green one is a fit of the central region from which the temperature gradient is obtained.
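The gradient extraction in that figure boils down to a linear fit over the central region of the profile. Here is a minimal sketch with synthetic numbers (the positions, units and heat-flux value are assumptions for illustration, not the MareNostrum results):

```python
import numpy as np

# synthetic temperature profile along the heat-flux direction
z = np.linspace(0.0, 10.0, 50)        # position (arbitrary units)
T = 300.0 + 2.0 * (z - 5.0)           # temperature in K, gradient 2 K/unit
center = (z > 2.0) & (z < 8.0)        # fit only the central, linear region

grad_T = np.polyfit(z[center], T[center], 1)[0]  # slope dT/dz of a linear fit
q = -100.0                                       # imposed heat flux (assumed)
k = -q / grad_T                                  # Fourier's law: q = -k dT/dz
print(round(k, 6))  # 50.0
```

In the real simulation the profile is noisy, which is exactly why the fit is restricted to the central region away from the heated and cooled slabs.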
We then proceeded to add an empty bubble of radius r at the center of the system, resembling the damage caused by neutron bombardment, and repeated the same procedure, obtaining a new result for the thermal conductivity of the system. We investigated the relative variation of the thermal conductivity with respect to the value for a perfect crystal while increasing the size of the empty sphere at the box center. The result is a linearly decreasing thermal conductivity as a function of the transverse area of the sphere, πr² (transverse with respect to the heat flux direction); it is shown in the following graph. For large sphere radii we start to see the effect of the finite size of the simulation box, and the behaviour shifts away from the linear one.
Variation of the thermal conductivity, with respect to the perfect crystal value, as a function of the empty sphere cross area relative to the simulation box cross area.
These vacancies have an important effect on a crystal system like this one. To give you an idea, the first point of the graph corresponds to the removal of 11 atoms from a half-million-atom system, and the thermal conductivity already drops by 5%. This is why it is so crucial to study and understand the effect that these vacancies have on a system, since these structures will also be formed in real-life conditions, with real effects. The results obtained in this project can be used as a basis to explore this type of defect further, for example by injecting hydrogen or helium atoms inside the empty bubbles, resembling what really happens when these atoms escape from the nuclear fusion reactor core.
I’m really grateful to every person that made my participation in the Summer of HPC 2021 possible. It has been a wonderful experience, learning new computational skills and techniques and meeting amazing people; it is a fantastic experience that I would recommend to anyone interested in computational science and its applications. I hope you found my blog pages interesting and helpful; if so, you will surely like the posts my colleague Eoin wrote on the formation of defect cascades. Have a wonderful day!
Hi, stranger on the internet! If you clicked on this webpage you're probably interested in materials science, or maybe you were just intrigued by the title, or perhaps you just misclicked… Anyway, in the following I will briefly explain how my project 'Computational atomic-scale modelling of materials for fusion reactors' is structured and what my objectives are, so feel free to check it out if you're interested.
You have probably heard, in one way or another, about nuclear fusion: it is the main process that happens inside our Sun and all the other stars, and it generates the energy that makes life possible here on Earth. Today, different projects involving the world's most powerful countries are trying to reproduce nuclear fusion on Earth and turn it into a new source of energy, possibly changing our energy production methods forever. In these nuclear fusion reactors, different branches of physics, chemistry, and engineering come together to tackle an incredibly difficult task, one that could lead to one of the greatest achievements of mankind; one of these branches is materials science. A nuclear fusion reactor requires specific materials that can sustain the incredibly harsh conditions inside it. To give an example, ITER is currently the largest nuclear fusion experiment under construction, and at its core it will reach temperatures of around 150 million degrees, ten times higher than the temperature of the core of the Sun!
Concept image of the ITER nuclear fusion reactor
Source: ITER official web page https://www.iter.org/proj/inafewlines
One of the materials used in a nuclear fusion reactor is tungsten, thanks to its high thermal conductivity and the highest melting point of all metals. This makes it well suited for the harsh conditions it is subject to, but it faces a critical problem: the damage caused by the particles present in the fusion reactor. These are the two hydrogen isotopes, deuterium and tritium, which are the input particles of the fusion reaction, and helium atoms and neutrons, which are the products of the reaction alongside energy. Because of their high energy, given the temperature of the core, some of these particles can hit the material and cause damage at the atomic level. For example, from an initial collision with a tungsten atom, a neutron can initiate a chain of events called a defect cascade, which results in the displacement of several atoms from the lattice structure and the formation of vacancies.
These changes in the structure will then affect the properties of the material, and such variations need to be studied to have a clear picture of what will happen in nuclear fusion experiments like ITER. While the work of my colleague Eoin focused on reproducing defect cascades inside tungsten, mine centred on the thermal conductivity of tungsten and, in particular, the effect of defects on this property, such as the presence of large vacancies, which can also be filled by hydrogen and helium atoms. To do so, molecular dynamics simulations were performed using the LAMMPS code. The basic idea of molecular dynamics is to create a system of individual atoms, each of which evolves in time according to Newton's classical laws of motion and a potential chosen for the material of interest. Millions of atoms can be simulated this way, and even though quantum effects are not considered, the large number of particles, which would not be possible in a quantum simulation, makes it possible to study properties like thermal conductivity very well.
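The time-stepping idea behind molecular dynamics can be sketched in a few lines. Below is a minimal velocity-Verlet integrator (the kind of scheme used by MD codes such as LAMMPS), applied here to a toy one-particle harmonic well rather than a real tungsten potential:

```python
import numpy as np

def velocity_verlet(x, v, force, m, dt, steps):
    """Integrate Newton's equations of motion with the velocity-Verlet
    scheme: update positions, recompute forces, then update velocities."""
    a = force(x) / m
    traj = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt**2   # position update
        a_new = force(x) / m               # force at the new position
        v = v + 0.5 * (a + a_new) * dt     # velocity update
        a = a_new
        traj.append(x)
    return np.array(traj), v

# toy system: a single particle in a harmonic well (not a real W potential)
k, m, dt = 1.0, 1.0, 0.01
traj, v_end = velocity_verlet(1.0, 0.0, lambda x: -k * x, m, dt, 1000)
energy = 0.5 * m * v_end**2 + 0.5 * k * traj[-1]**2  # stays near 0.5
```

Real MD codes do exactly this, but for millions of atoms in parallel and with empirical many-body potentials instead of the toy spring force.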
Example of an empty bubble inside a Tungsten atoms system in a molecular dynamics simulation. Structures of this kind will change the material properties!
(the colouring of the image is used to give a sense of depth)
If you are interested in the topic you can take a look at my next blog post, which will focus on a procedure to calculate the thermal conductivity of a crystal using molecular dynamics and my results with tungsten. If the idea of defect cascades has caught your attention, my colleague Eoin has written some blog posts on the topic on his page. In any case, thanks for your attention and I hope you enjoyed this blog post. Have a good day, dear internet stranger!
Welcome back to my third and final blog post for the Summer of HPC. This is the final week of SoHPC, and between the presentations and the report writing, I decided to write my final blog post to explain my work in more depth. In the previous blog post, I explained what fault tolerance is and why it is essential; I also introduced the FTI library and mentioned lossy compression as a technique to limit the IO bottleneck. In this blog post, I will explain more extensively the implementation of lossy compression in FTI, the results of the experiments so far, and my future expectations. If you haven't already, I suggest reading my previous blog post so that you can understand the basic concepts of the FTI library in which the implementation was made.
Implementation
To execute FTI one must specify certain parameters in the configuration file (configuration example). To use the lossy compression feature, three extra parameters must be specified: compression_enabled, cpc_block_size, and cpc_tolerance. The first is simply a boolean value which, when set to 1, allows the checkpoints to be compressed. The second specifies the maximum memory size that can be allocated for the compression and is used to protect memory-restricted systems. It is suggested that the block size is not significantly lower than the actual data size, because otherwise it will interfere with the compression and may decrease the compression ratio. The last parameter defines the absolute error tolerance for the compression: the actual error tolerance is 10⁻ᵗ, where t is the integer cpc_tolerance specified in the configuration file. The user must specify a tolerance suited to their application: high enough to maximize the compression rate, but low enough that the results won't be altered.
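Putting the three parameters together, a configuration fragment might look like this (the section name and the values are purely illustrative, not taken from a real FTI configuration file):

```ini
; lossy-compression settings (illustrative values, hypothetical section)
[advanced]
compression_enabled = 1          ; 1 = compress checkpoints
cpc_block_size      = 67108864   ; max bytes allocated for compression
cpc_tolerance       = 4          ; absolute error tolerance = 10^-4
```

With cpc_tolerance = 4, any value in the restored checkpoint may differ from the original by at most 10⁻⁴.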
Experiments
Figure 1: In this graph, we can see the size of the checkpoint file of rank zero for different tolerances in comparison with the checkpoint size without compression.
Figure 2: In this graph, we can see the write time of the checkpoints for different tolerances in comparison with the checkpoint write time without compression.
As a principle, lossy compression targets large-scale applications. Although its performance differs between data sets, in programs with a small memory footprint lossy compression may even slow down checkpointing, due to the time needed to perform the compression and the possibility of hard-to-compress data. Therefore, to test the implementation we had to employ a very scalable application. For our experiments, we ran LULESH on BSC's MareNostrum cluster, with a size of 615.85 megabytes per process, for 512 processes on 16 different nodes, using different values for cpc_tolerance. The experiment can be considered successful, since a measurable decrease in the time to write the checkpoint was observed. As you can see in Figures 1 and 2, the write times of the checkpoints for any tolerance between zero and eight are around half the write time of the uncompressed checkpoint. We can also see that the tolerance doesn't really affect the write time, since different tolerances don't produce large differences in the checkpoint file size. This may change in other experiments with different data consistency or size.
Video Presentation
For the closing meeting of SoHPC, my partner Kevser and I created a five-minute video presentation explaining our project.
Project video presentation
Finally
That summarizes our work in SoHPC; I hope you found it interesting. As this experience comes to an end, I would like to express my gratitude to the organizers and project supervisors of SoHPC for the great work they are doing. During the summer I learned a whole lot of new things, gained experience working as a researcher, and collaborated with many inspiring people. Finally, I wholeheartedly encourage every student interested in HPC, and computing in general, to apply to SoHPC, because it is a very positive and unforgettable experience, which I am grateful to have had during my studies.
Welcome to my second post about my experience with the Summer of High Performance Computing. First of all, I hope that you are doing well and that you had an enjoyable summer. If you haven't already, go check my previous post, so you can get to know me a little better.
Hartree-Fock Algorithm
As I mentioned in my previous post, this summer I was working together with Eduard Zsurka under the guidance of RNDr. Ján Šimunek, implementing and testing a Fortran version based on BLAS (Basic Linear Algebra Subprograms) for an existing Hartree-Fock code written by Ján Šimunek et al. What is Hartree-Fock, you may ask. Simply put, Hartree-Fock is a method that gives us an approximation of the wave function and energy of a quantum N-body system in its ground state. Even more simply put, this method computes an approximate solution of the Schrödinger equation.
There are two time-consuming steps in this method. One is the formation of the Fock matrix, which is an iterative approximation to the single-electron energy operator for a given basis set and is derived from integrals and the density matrix. The second is the diagonalization of the Fock matrix, which is an unwanted step for linearly scaling methods and parallelization. Dr Šimunek et al. managed to eliminate the diagonalization step by replacing it with matrix-matrix multiplications. For larger molecules the electron integral matrices are sparse. Sparse matrices have the advantage that only the non-zero elements need to be stored, so the matrices occupy less memory.
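The memory saving from storing only the non-zero elements can be illustrated with a toy example in Python/SciPy (a random stand-in matrix, not the actual electron-integral data):

```python
import numpy as np
from scipy import sparse

# toy stand-in for a mostly-zero matrix, like the electron-integral
# matrices of a large molecule
rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
idx = rng.integers(0, 1000, size=(5000, 2))
dense[idx[:, 0], idx[:, 1]] = rng.standard_normal(5000)

# CSR format keeps only the non-zero values plus index arrays
csr = sparse.csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
```

For a matrix with roughly 0.5% non-zeros, the CSR representation needs well under a tenth of the dense storage, and the same idea carries over to Fortran Sparse BLAS formats.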
Our task was to find Fortran libraries that can handle sparse matrix multiplications. The first two candidates were a library from the National Institute of Standards and Technology (NIST) and one from Netlib. With both of these libraries we had difficulty integrating them into the existing code, because of lacking documentation or failure to compile with newer versions of Fortran. While we were searching for another way to implement sparse matrix-matrix multiplication from those libraries, or even to use Intel's Sparse BLAS library, we found a Stack Overflow post with a guide to a working implementation of Netlib's library.
Alkanes and methotrexate used as testing molecules.
After we managed to get a working test code from said library, we tried it on a methane molecule and found that it worked. Following that, we started testing the code on methotrexate (C20H22N8O5) and six different alkanes: C6H14, C12H26, C18H38, C24H50, C30H62, and C36H74. Starting with methotrexate, the results weren't great, since our code was actually slower than the original. These results didn't disappoint us at all, since we knew we had to go to bigger molecules, where the sparse matrices are even bigger.
This is the end of this post. We managed to get a working code and we are looking forward to more results. In my next post, we are going to see whether we can decrease the running time of the code for bigger molecules. Thank you for reading my post.
The Navier-Stokes equations are a set of PDEs that cannot be solved analytically. They are one of the Millennium Prize Problems, whose resolution will earn an award from the Clay Mathematics Institute. Nevertheless, they completely describe a fluid's behaviour in all situations. The Navier-Stokes equations are a system of three equilibrium laws: mass, momentum, and energy. The first is also called the continuity equation, the second corresponds to Newton's second law, while the last is the first principle of thermodynamics. They can be written in different forms, referring to a finite control volume (integral form) or an infinitesimal control volume (differential form). In addition, this control volume can be considered fixed in space and time (Eulerian approach, conservative form) or moving with the fluid (Lagrangian point of view, non-conservative form).
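For reference, a standard differential form of the incompressible mass and momentum equations (constant density ρ, kinematic viscosity ν, body force f; in the incompressible case the energy equation decouples) reads:

```latex
\begin{aligned}
\nabla \cdot \mathbf{u} &= 0 , \\
\frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u}\cdot\nabla)\,\mathbf{u}
  &= -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{u} + \mathbf{f} .
\end{aligned}
```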
Incompressible Navier-Stokes equations. d is the dimension of the problem: d=1 for one-dimensional problems, d=2 for two-dimensional problems, d=3 for three-dimensional problems.
To understand and analyze the behaviour of an external or internal fluid flow, we have two possibilities: experimental investigation or a numerical approach. The first was the most used in the early years, when computer architectures were not so advanced. Nowadays, with HPC systems, the numerical approach is becoming the most important, and this is where CFD comes in. Computational Fluid Dynamics uses numerical techniques to discretize the Navier-Stokes equations. In other words, with CFD we search for the solution at specific points, which depend on the numerical method in use, and then we reconstruct the solution elsewhere with a technique that, again, depends on the numerical method selected. There are many numerical methods: finite differences, finite elements, finite volumes, spectral methods, characteristic lines, the hybrid discontinuous Galerkin method, etc. All these methods use the same idea explained above, and all are already implemented in different solvers. The one we are using is the free, open-source OpenFOAM.
The usage of HPC systems in fluid dynamics became particularly interesting with the study of turbulent flows. Turbulence is described by the Navier-Stokes equations and is a regime of fluid motion. Even though there is no precise definition of the turbulent regime, we can describe it as a flow with three-dimensional, unsteady, and non-linearly interacting vorticity.
Effect of the shape on the drag aerodynamics of the car. Sketching of vortices and turbulence in the wakes. The picture was taken from https://www.reddit.com/r/Design/comments/krgg3o/advertisement_from_the_1930s_showing_the_advanced/
Several methods for modelling turbulence are present in the literature. Generally speaking, the main question is which details we want to resolve: an increase in detail brings an increase in computational cost. For this reason, DNS is not used in industry; in general, industry goes at most up to LES, even if lately many are considering DES. Let's describe these approaches from the bottom to the top:
Turbulence modeling
RANS: the acronym for Reynolds-Averaged Navier-Stokes equations. With this strategy we decompose the entire flow field into the sum of two contributions: the first is constant in time and associated with the mean flow; the second is time-dependent and associated with fluctuations. Using this decomposition for all the quantities present in the equations (velocity, density, pressure, etc.), a new term appears in the momentum equation. It is similar to a stress and, for this reason, is called the Reynolds stress. This term needs to be modelled, hence new equations must be added. Many turbulence models try to close the system: the classical one-equation model is Spalart-Allmaras, while two-equation models include k-Epsilon, k-Omega, and k-Omega SST. URANS stands for Unsteady RANS.
DES: it stands for Detached Eddy Simulation. It is a hybrid technique that combines the RANS and LES approaches. The idea of DES is to reduce the computational cost of LES while improving on the description given by RANS. It fixes a turbulent length scale based on the turbulent kinetic energy (TKE) and the turbulent dissipation rate, and then compares this length with the mesh size: if the mesh size is smaller than the turbulent length, we are far from the wall and interested in resolving all the detached eddies, i.e. LES is applied. If, instead, the mesh size is greater than the turbulent length, we are near the wall and no detached eddies are present, i.e. RANS is used.
LES: in Large Eddy Simulation, we resolve all the large eddies without any approximation. Then, for the subgrid contribution, we need to introduce a model. Among the scientists who made major contributions to LES are Smagorinsky and Massimo Germano, from my home university.
DNS: Direct Numerical Simulation resolves all the scales, from the integral scales down to the Kolmogorov scales, where dissipation takes place. No modelling approximation is needed, but the computational cost increases enormously. Nowadays DNS is only used for academic work with simplified geometries, to understand the behaviour of turbulent flow under particular conditions.
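The Reynolds decomposition at the heart of RANS, and the extra term it produces in the momentum equation, can be written compactly (a standard textbook form, with the overbar denoting the time average and the prime the fluctuation):

```latex
u_i(\mathbf{x},t) = \overline{u}_i(\mathbf{x}) + u_i'(\mathbf{x},t),
\qquad
\tau_{ij}^{\mathrm{Re}} = -\rho\,\overline{u_i' u_j'} .
```

It is this Reynolds stress tensor that the Spalart-Allmaras, k-Epsilon, k-Omega, and k-Omega SST models are built to approximate.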
In our project, we focused on the RANS approach with different turbulence models: Spalart-Allmaras, k-Omega, k-Omega SST, and k-Epsilon. We used the OpenFOAM solver, which is based on a finite volume discretization.
Streamlines and coefficient of pressure. You can see our final report for more detailed numerical results.
In the picture, you can see the Q-criterion applied to the three different geometries (fastback, estateback, and notchback) of the DrivAer, all without wheels, following our supervisor's suggestion to reduce the complexity of the simulations. The simulations were run with a 10 m/s inlet velocity, a zero pressure gradient in the outlet region, the k-Omega SST turbulence model, and steady-state conditions. The Q-criterion and the vorticity were computed directly in ParaView. It is easy to notice that the three geometries produce three different turbulent structures in the wake. Obviously, a big wake means big drag, i.e. a higher fuel demand and hence more pollution.
Q-criterion. You can see our final report for more detailed numerical results.
Hi everyone! So yeah, this will be my final post on the SoHPC page. It's been quite fun participating in this program, but before saying farewell let's talk about what @ioanniss and I managed to do this summer.
So, if you’ve read my previous post Making it work! you might have a rough idea what Project 2108 is about. In a nutshell, we have to modify an implementation of the Hartree-Fock algorithm by replacing a dense matrix-dense matrix multiplication, with a special sparse matrix-dense matrix multiplication. We expect that this step will have some effect on the runtime of the code.
We did the testing on a supercomputer in Žilina, Slovakia, and played around with different settings to see how they affect the runtime. The molecules used in these calculations are six different alkanes: C6H14, C12H26, C18H38, C24H50, C30H62, and C36H74.
Figure. 1
Firstly, we observed that the runtime is proportional to the cube of the size of the molecule, where we take the number of carbon atoms as a measure of the size of the molecule. This is to be expected, since the most numerically intensive operations in the code are matrix-matrix multiplications, which scale with the cube of the size of the system.
Following this step we decided to see whether our new code is faster than the old one. We had 4 computing nodes at our disposal, each equipped with 32 CPU cores. We tested the runtime of the codes for 1, 2, 4, 8, 16 and 32 cores on one node. In Figure 1 we can see the improvement in runtime for each molecule, and in Figure 2 the runtimes with different numbers of cores averaged for each molecule.
The results are not conclusive, but there is a possibility that for large enough systems the runtime is slightly smaller. For this reason, we are still running tests, but with more basis functions than before. These calculations take quite a long time, but we hope to have data on these systems to present in the final report.
Figure. 2
We also decided to see how using different numbers of cores for the parallel calculation would affect the runtime of the code. As expected, the code ran slowly on only a few cores, but it ran fastest not on 32 but on 4 or 8 cores. This can be attributed to all the cores trying to access the memory of the node at once, thus slowing down the code. Also, the 32 cores are arranged in 4 CPUs, so it's not surprising that the runtime is very good for 8 cores. These results are condensed in Figure 3.
Figure. 3
Finally, we are also studying how using multiple nodes for the same calculation affects the runtime, and how the sparsity is linked to the performance of our new code. Hopefully we can do everything we set out to do before the deadline arrives, and I hope I will be able to show you some interesting results.
It was very pleasant taking part in this programme. My team was great: we managed to communicate quite often and move the project forward at quite a fast pace. I think we never really had any problems dividing our tasks, and Ján was always ready to help us. All in all it was a fun experience, and I recommend SoHPC to anyone wanting to take part in a short scientific project over the summer. Finally, thank you to the organizers for making SoHPC possible. Good luck with whatever you are doing and have a nice day!
Hello everyone! I'm back (albeit quite late) to tell you a bit more about the project that @ioanniss and I have been working on. In my previous post I gave a rough outline of the project and our task; now I will go into a bit more detail.
Within the project we are tasked with modifying an implementation of the Hartree-Fock algorithm written by Noga and Šimunek [1]. In the simplest terms, the Hartree-Fock algorithm gives us an approximation of the ground-state wave function and energy of a molecule. The implementation by Noga and Šimunek [1] provides a new way of obtaining the solution by removing the step of diagonalizing the Fock matrix. This has the effect of speeding up the calculation, since for linearly scaling methods and/or parallelization the diagonalization is an unwanted step [1]. Thus the diagonalization is replaced by a matrix-matrix multiplication. One of the matrices involved in this multiplication is the electron density matrix, which for larger systems can be classified as a sparse matrix. Therefore, a natural step in further speeding up the algorithm is replacing the matrix-matrix multiplication with a sparse matrix-dense matrix multiplication.
The first step in solving this problem was getting acquainted with Fortran, which I hadn't used before. This was followed by a prolonged search for an implementation of sparse matrix-dense matrix multiplication in Fortran. After a few weeks of unsuccessfully trying different implementations of the Sparse BLAS library, we stumbled upon a Stack Overflow post which contained a working copy of Sparse BLAS for Fortran. The Sparse BLAS library contained the multiplication algorithm we were searching for. Replacing the matrix multiplication was quite an easy step. We had to replace this
with (aside from a few matrix allocations, the sparse matrix creation, and a few matrix transpositions) a function call that looks like this:
call usmm(YDEN_SPARSE,A_T,RES,istat)
Very exciting, right? This step was pretty straightforward. We had a lot more difficulty understanding the intricacies of the code that performs the calculations; we had a hard time understanding the myriad of parameters and functions involved at each step. Eventually we managed to arrive at a working code that contained the multiplication function from the Sparse BLAS library. We used a methane molecule to check that the code was even working, and after that we started looking at other molecules. We use six different alkanes: C6H14, C12H26, C18H38, C24H50, C30H62, C36H74, and a molecule called methotrexate (C20H22N8O5), which is shown in Fig 1.
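For readers more familiar with Python than Fortran: the Sparse BLAS usmm routine computes C ← α·A·B + C for a sparse A and dense B. A SciPy sketch of the same operation (illustrating the semantics only, not the Fortran interface) looks like this:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
A = sparse.random(200, 200, density=0.02, random_state=1, format="csr")
B = rng.standard_normal((200, 200))
C = np.zeros((200, 200))

# usmm semantics: C <- alpha * A @ B + C, with A stored in sparse format
alpha = 1.0
C += alpha * (A @ B)
```

Because A is stored in a compressed format, only its non-zero entries participate in the multiplication, which is where the potential speed-up over a dense-dense product comes from.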
Fig 1. Methotrexate (C20H22N8O5).
In my next post I will tell you how we used these molecules to test the performance of the new code, and how we compared it to the original code. I also plan to have a retrospective look at what was like taking part in Summer of HPC and what was like working in a team in a “home office” setup. Good luck to everyone for the final presentation!
[1] Jozef Noga, Ján Šimunek, J. Chem. Theory Comput. 2010, 6, 9, 2706–2713.
Hello, my name is Omar, I am 23 years old, and I am a studious and enterprising person. I have studied various areas of computer science: I hold a degree in computer engineering and a specialized Master's in the same field, and I am currently doing a second Master's in cybersecurity and data intelligence in conjunction with my PhD.
From Canarias (Spain) to the Irish Center for High End Computing
In this blog, I want to tell you about my experience in the Summer of HPC, which I have had the privilege of taking part in from the comfort of my island, Tenerife, in the Canary Islands (Spain), which enjoys a totally paradisiacal climate that I can wholeheartedly recommend. The programme was carried out through different online platforms, such as Zoom, through which the instructor met with us every day with the best disposition, encouraging us more and more not only to internalize the knowledge but to take part in it. This experience was quite enriching: as my university instructors had told me beforehand, this opportunity allowed me to broaden my knowledge of computer science and go a little deeper into certain topics related to my PhD.
A little over halfway into the project, I've finally gotten to grips with it. In this piece I'm going to introduce my project, titled 'Computational atomic-scale modelling of materials for fusion reactors.'
It all comes down to energy generation. Climate change presents a growing global issue and requires a complete shift away from the current energy production systems, which rely on fossil fuels. Nuclear fusion is one of the ideal options to solve this problem, as it offers a way to generate carbon-free power. It has been one of the ongoing international goals of science since the 1970s (Knaster et al., 2016), as a replacement for fission. The big advantage of fusion is that its major waste product is harmless helium gas. As well as this, it is a controlled process without the dangerous chain reaction seen in nuclear fission. The process occurs in the sun, in which hydrogen atoms are brought together with enough force to fuse, releasing large amounts of energy (Figure 1).
Figure 1 – Simple equation for nuclear fusion using deuterium and tritium with mass numbers shown. Other fusion routes exist, for example using lithium.
A practical reactor (i.e. one which produces more energy than it consumes) has so far remained out of reach. While fusion has been performed, it is not yet sustainable, and as the world lacks a comparably high-energy neutron source, there has been only limited work to investigate material performance under prolonged fusion (Chapman, 2021). Here, of course, HPC can provide great predictive work.
Our project aims to support these efforts. The temperature and pressure conditions of such industrial reactors will be extreme and require high-performance materials. Evaluation of such materials cannot currently take place, as there is a lack of sufficiently high-energy neutron sources (Knaster et al., 2016), limiting testing. Here computational modelling offers a way to perform these investigations in advance of real-world testing, helping to speed development. Computational modelling can occur at a number of scales: from 'coarse' representations presenting materials as continuous blocks, to full-depth representations using quantum mechanical calculations to include the effects of electrons, and therefore chemical bonding. Our work takes place at the atomic level using so-called molecular dynamics. This scale allows atomic-level resolution while avoiding computationally costly electronic structure calculations. It does this by representing atoms using classical physics, viewing them as a series of spheres with a number of parameters controlling the forces they exert on each other, from which trajectories are calculated over a series of time steps. This allows large systems (millions of atoms, or micrometre scale) relevant to radiation damage to be represented.
One of the favoured materials for fusion reactor components is crystalline tungsten. It has the highest melting point of any element and a high resistance to sputtering (breaking apart) due to energetic particles. Molecular dynamics allows the investigation of damage cascades under neutron bombardment: the result of atoms receiving energy from oncoming neutrons and flying out of position in the crystal, disrupting and mobilizing increasing numbers of adjacent atoms. This gives cascades similar to those seen in Figure 2.
Figure 2 – Evolution of two cascades in tungsten at given energies over time. Sourced from J. Knaster, A. Moeslang & T. Muroga, Nature Physics, 12, 424–434 (2016).
My colleague Paolo is working on thermal conductivity measurements of such systems, and it's been really interesting to see his work and perspective on things. My side of things has been to generate cascades such as those shown in the header. It's been a challenging but enjoyable project. My personal experience has been one dominated by problem solving, attempting to find out how a simulation set-up has gone wrong. Though at times this is frustrating, it is rewarding to find solutions and to expand my knowledge of the HPC systems I am using, especially now that I've finally been able to execute defect cascades at the million-atom scale.
Moving forward, Paolo and I hope to further our analysis of these reactor components and defect cascades. I will focus on comparing cascades at different energies and using various potentials (determiners of tungsten interactions), to be presented at the end of the project. This will be an opportunity to really show how our work has developed and contributes to the existing research on the massive undertaking of developing nuclear fusion. Overall, I am happy to be contributing to this area of investigation and I am hopeful for the future of nuclear fusion and how it can improve the sustainability of our world.
Bibliography
J. Knaster, A Moeslang & T. Muroga, Nature Physics, 12, 424–434 (2016)
I. Chapman, Putting the sun in a bottle: the path to delivering sustainable fusion power | The Royal Society, https://www.youtube.com/watch?v=eYbNSgUQhdY, 2021.
Using HPC clusters is not rocket science, but you still have to know how to do it. That's exactly where we started, before going on to examine the speed of our code and the two HPC systems ARCHER2 and Cirrus.
What code did we use to investigate its performance and to test the two HPC systems?
Our performance benchmark is an example from Computational Fluid Dynamics (CFD). It simulates fluid flow in a cavity and shows the flow pattern. On a two-dimensional grid, the computer discretises partial differential equations to approximate the flow pattern. The partial differential equations, describing a stream function, are solved using the Jacobi algorithm, an iterative process which converges to a state with an approximated solution. Every iteration towards the stable state adds precision to the results. Additionally, increasing the grid size leads to a more precise result, but is also more computationally intensive.
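The Jacobi iteration described above can be sketched in a few lines; the grid size and boundary values below are toy choices for illustration, not the actual benchmark settings:

```python
import numpy as np

def jacobi_step(psi):
    """One Jacobi sweep for the discretised stream-function equation:
    each interior point becomes the average of its four neighbours,
    while the boundary values stay fixed."""
    new = psi.copy()
    for i in range(1, psi.shape[0] - 1):
        for j in range(1, psi.shape[1] - 1):
            new[i, j] = 0.25 * (psi[i + 1, j] + psi[i - 1, j]
                                + psi[i, j + 1] + psi[i, j - 1])
    return new

# toy grid: fixed non-zero values on part of one edge (a stand-in for
# the inflow boundary condition), zero elsewhere
psi = np.zeros((32, 32))
psi[0, 8:16] = 1.0
for _ in range(200):
    psi = jacobi_step(psi)
```

With every sweep the boundary values diffuse further into the grid, which is exactly the "adds precision" behaviour described above. The explicit double loop is the naive version; it is also the part that makes plain Python slow.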
What does it look like?
Approximation of the flow through a square. The flow comes from right centre and leaves at bottom centre. With an increasing number of iterations the approximation is becoming more accurate.
What language did we deal with and what did we focus on?
C, Fortran and Python. The code in C and Fortran already existed, so we had to translate C or Fortran into Python and make sure that they do the same thing. We focused on Python as it is a widely used language in science and is more and more often used for running code on HPC systems. In comparison to C and Fortran, as you might know, Python is slow: interpreted execution and dynamic typing – determining variable types at runtime – slow down performance and make Python unattractive for complex calculations with big problem sizes. At the same time, Python is popular and has a big community, which means that there are solutions!
What did we use to improve the performance of Python?
Optimise: Numerical Python, called NumPy, is a specialized package with highly optimized numerical routines. (Surely, by the end of the project we can post some evidence!)
Divide your problem: MPI is a typical way to parallelize code. mpi4py is a package made for this purpose in Python.
Parallelize it with many more cores: offload your problem to GPUs.
What did we measure and how?
The core of the algorithm is an iterative process. With every iteration the algorithm improves the approximation of the fluid flow, and every iteration performs the same numerical calculation. Therefore, the mean time per iteration gives us a good measure of performance.
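Since every iteration does identical work, the measurement itself is simple. Here is a hedged sketch of how such a per-iteration timing could look (the `work` function below is a made-up stand-in for one Jacobi sweep, not our actual code):

```python
import time

# Sketch: time a fixed number of identical iterations and report the mean.
def mean_iteration_time(step, state, iterations=50):
    start = time.perf_counter()
    for _ in range(iterations):
        state = step(state)
    elapsed = time.perf_counter() - start
    return elapsed / iterations

# Hypothetical stand-in for one sweep of the solver.
work = lambda x: [v * 0.25 for v in x]
t = mean_iteration_time(work, [1.0] * 10000)
print(f"mean time per iteration: {t:.2e} s")
```

Averaging over many iterations smooths out timer noise and one-off effects such as caches warming up, which is why the mean is a stable metric here.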
We coded. We tried things out. We played around with alternatives. What to expect from coding on HPC systems and what performance improvements you can achieve with Python – NEXT TIME we will provide you with all the insights!
Greetings everyone. My name is Ioannis Savvidis and I come from a beautiful city named Kavala in Greece, where I am writing this post. I am a Materials Science and Engineering undergraduate student at the University of Ioannina with an interest in computational materials science. Growing up I was interested in programming and its applications, and I greatly enjoyed playing computer games with friends. For my studies I didn't choose a programming-related degree because I had a bigger interest in chemistry, physics and engineering. That led me to the MSE department, where in my last year I chose to follow the path of computational materials science. Even though this is not a "mainstream" pathway in my department, I decided to follow it because I see a big opportunity in the advancement of new material discovery and applications.
Picture of me!
My other interests are related to sports and cooking. Before the COVID-19 pandemic, I used to work out a lot at the gym and train in Brazilian Jiu Jitsu, a ground-based martial art focused on grappling and leverage leading to submissions. Other sports I enjoy are snowboarding in winter and sailing in summer. During the pandemic we had to stay at home, so I spent most of my free time cooking and watching TV series like Peaky Blinders or Better Call Saul, two of the best series I have ever watched, to tell the truth.
Around that time, a teacher involved in computational materials science suggested that I look into SoHPC and apply for an internship if any project interested me. That was the first time I learned about HPC and its applications, and I was intrigued and wanted to participate and learn more.
When I got the acceptance email I was very happy to get the opportunity to work on a supercomputer and learn a lot of new things. During the first training week, which I found very informative, I learned about applications on supercomputers and parallel programming. So this summer I will be working as a team with Eduard Zsurka on Project 2108: Efficient Fock matrix construction in localized Hartree-Fock method, under the guidance of Dr. Jan Simunek.
In my last post I mentioned NASA's OSIRIS-REx, a one-of-its-kind mission to collect samples from an asteroid and return them to Earth, and the mission that forms the crux of my project. Personally, it is surreal to imagine such a feat being accomplished in our time. With forays such as these, space exploration surely seems to be on an exciting trajectory once again. I am writing this post to provide some context.
Asteroids offer a fascinating insight into the early solar system. They are thought to contain substances from the formation of our solar system, possibly providing more information than one hopes to obtain from planets. Bennu, a primitive, carbon-rich asteroid, is thought to contain organic materials that could very well provide new insight into the origin of life. Situated around 200 million miles from Earth, it is classified as a Near-Earth Object (NEO). Interestingly, it is also thought to have a 1-in-2,700 chance of colliding with Earth in the 22nd century. A combination of such factors led to it becoming the focus of the OSIRIS-REx mission.
In our project, we focus on the material that makes up the surface of such asteroids: the regolith. Assumed to be a loose, grainy material, it is thought to preserve information regarding a celestial body's geophysical history. Chunks of such a surface can be likened to randomly packed spheres. A study of thermal behaviour in such a model can be related to the physical properties of the regolith, which in turn provides a better understanding of the celestial body in focus — Bennu, in the OSIRIS-REx mission's case. The crux of our project is the finite element method, a numerical technique for solving the governing differential equations of a physical model. Using it, we can predict thermal properties such as the temperature distribution or conductivities.
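To give a flavour of the finite element idea in the simplest possible setting, here is a hypothetical 1D steady heat conduction sketch (our actual model is 3D and far more complex): discretise the domain into elements, assemble a linear system, and solve it for the nodal temperatures.

```python
# Illustrative 1D FEM sketch for steady conduction -k u'' = 0 on [0, 1]
# with fixed end temperatures (NOT the project's 3D packed-sphere model).
def fem_heat_1d(n_elems, k, T_left, T_right):
    """Linear elements give the tridiagonal system (k/h) * [-1, 2, -1] u = rhs."""
    n = n_elems - 1                      # number of interior unknowns
    h = 1.0 / n_elems
    a = [-k / h] * n                     # sub-diagonal
    b = [2 * k / h] * n                  # main diagonal
    c = [-k / h] * n                     # super-diagonal
    d = [0.0] * n                        # load vector (no heat source)
    d[0] += (k / h) * T_left             # boundary contributions
    d[-1] += (k / h) * T_right
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    u = [0.0] * n
    u[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (d[i] - c[i] * u[i + 1]) / b[i]
    return [T_left] + u + [T_right]

# With no heat source the exact solution is a straight line between the ends.
temps = fem_heat_1d(10, k=1.0, T_left=300.0, T_right=250.0)
```

In 1D the discretised system is tridiagonal and trivially cheap; in the 3D packed-sphere geometry the same assemble-and-solve pattern produces the huge sparse systems that make HPC necessary.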
This mosaic of Bennu was created using observations made by NASA's OSIRIS-REx spacecraft. Credit: NASA/Goddard/University of Arizona
Launched in 2016 under NASA's New Frontiers program, OSIRIS-REx reached Bennu in 2018. Initially, the probe mapped the asteroid extensively, while the team analyzed these observations to choose an appropriate site to collect samples from. The team shortlisted at least four such sites. One of these, termed Nightingale, was earmarked for grabbing material (regolith) from the asteroid. In October 2020 the probe performed a Touch-and-Go (TAG) sampling event, in which the probe collected debris after contacting the surface for about 5 seconds.
The TAG event consisted of a series of maneuvers and was specified to take 4.5 hours. After descending from an altitude of 770 m above Bennu's surface, the probe fired one of its pressurized nitrogen bottles to agitate the asteroid's surface. The resulting dust was then collected using the probe's collector head. Once this sequence was complete, the probe fired its thrusters to reach a safe altitude again. It was expected to collect 60 grams of sample, and the extra pressurized nitrogen containers were provided in case the sample collected was insufficient. However, the opposite problem occurred: the sampler is thought to have impacted the asteroid too hard, leaving particles wedged around the rim of its container. The mission later reported that some of the material may have flown off into space, and that the quantity of sample collected is now unknown, as a scheduled spin maneuver that would have determined the mass of the collected material had to be aborted. The sample remains safely sealed, and the probe has departed from the asteroid.
OSIRIS-REx in the clean room at Lockheed Martin in April 2016 after the completion of testing. Credit: University of Arizona/Christine Hoekenga
One of the more recent updates from the mission places the spacecraft at a distance of 528,000 km from Earth. It fired its engines on May 10 to initiate the return trip, and is on course for a scheduled arrival on 24th September, 2023. Additionally, it was recently revealed that the data collected by the spacecraft helped refine orbital models of the asteroid. Before OSIRIS-REx, scientists had expressed uncertainty owing to an unclear assessment of the influence of Earth's gravity on the asteroid. The study suggested that Bennu could make a close approach to Earth in 2135, and identified Sept. 24, 2182, as a significant date in terms of potential impact. Nevertheless, it must be stressed that there exists no apparent threat of collision.
Hi everyone! My name is Edi, I just finished my Master's degree in condensed matter physics and I was lucky enough to get accepted to project 2108. I was born in Romania, but I did my studies in Budapest, Hungary, at ELTE University. Above you can see a view of the city from Gellért Hill. On the right, in the distance, you might recognize the iconic Parliament building.
During my studies I had the opportunity to get involved in 2D materials research. I was always interested in the crossing point of physics and chemistry, where chemistry can help us design materials with specific physical properties. This can also go the other way round, when we use physics to determine the structure and properties of different molecules. The latter is the case in project 2108: Efficient Fock matrix construction in localized Hartree-Fock method. I know, I know, the name is quite intimidating, but the principle is quite simple: take a guess for the electron density of a molecule, do some simple algebra like matrix multiplications and matrix diagonalization, and in some magical way you obtain a guess for the ground state energy of the molecule.
In reality this is quite complicated due to the electron-electron interaction part of the energy: simply put, it's basically impossible to guess how multiple electrons will arrange themselves when each of them interacts with the others. In this aspect, our situation is similar to the three-body problem. We first have to guess the electron density of the system and verify whether it is energetically favorable; if not, we modify it, and rinse and repeat. Fortunately, we don't have to understand every step of the algorithm (which is an implementation of the Hartree-Fock method) in this project. Instead, we are looking at how different numerical algorithms can help speed up the calculations, and how we can use the architecture of the supercomputer in Zilina, Slovakia to get the fastest results. In short, the task is not the easiest, since we must have a basic understanding of the algorithm we are modifying, but we can do it without an in-depth knowledge of the physics. We have already come quite a long way: we managed to create a new working code, but all kinds of testing and optimization steps still lie ahead. I plan to write about this in more detail in the next post.
To close, I will add a photo of myself below. I hope you are doing well with your projects, and I can't wait to see what we manage to achieve during SoHPC 2021. Have a nice summer!
Last year, a first program was implemented. This program computes a triangulation that approximates the surface of a molecule. This year we are focusing on improving efficiency and adding new features. I will now present one of these new features:
Figure 1: The surface is a composition of an inner surface (red) and an outer surface (blue).
It may happen that the result of the computation yields several disconnected surfaces (see figure 1). However, in applications one is often only interested in the outer surface. If we look at the picture, it is easy for us humans to say which triangle of the triangulation belongs to which surface. For a computer, however, it isn't as easy as it might seem at first sight. For applications, it is important to distinguish different connected surfaces automatically. Therefore, we need an algorithm to tackle this problem (and this algorithm should ideally be efficient).
For the computer, the computed triangulation is just a bunch of triangles. An example of what a triangulation looks like is depicted in figure 2. The computer has knowledge of local connectivity in the sense that the vertices of a single triangle are connected to each other and must therefore lie on the same surface. But if we look at two arbitrarily chosen vertices of the triangulation (for example vertex 1 and vertex 9 in figure 2), it is not apparent whether these two vertices lie on the same surface just by looking at the list of triangles and vertices stored in memory. But can the computer combine the local knowledge to obtain global information about the surfaces? And if so, how is this done most efficiently?
Triangulation stored in Computer
Visualised Triangulation
Figure 2: Example of a triangulation
One solution to this problem is the disjoint-set data structure (also known as the union-find data structure or merge-find set; for more details see https://en.wikipedia.org/wiki/Disjoint-set_data_structure). This data structure stores a collection of non-overlapping sets. In our case, each set consists of all vertices of one connected surface. If we look again at figure 2, we would have the sets {1, 4, 5, 7, 8, 9} and {2, 3, 6}.
In our program, the disjoint-set structure is implemented as a so-called "disjoint-set forest". A forest is a set of trees (see figure 3). In our algorithm, we start with an "empty" forest (i.e. we have no trees at all). By going through the list of triangles sequentially, the trees are constructed little by little. If you are interested in more details I can recommend reading https://www.geeksforgeeks.org/disjoint-set-data-structures/ and http://cs.haifa.ac.il/~gordon/voxel-sweep.pdf. In our example from figure 2, the resulting forest consists of two trees. Each tree contains the vertices of one surface. As soon as we have constructed those trees, distinguishing different surfaces is relatively easy. Starting from a vertex on a surface, we can traverse the corresponding tree up to the root vertex. If two vertices lie on the same surface, the corresponding root vertices match. Conversely, if the vertices do not lie on the same surface, the root vertices differ.
Figure 3: Example of a forest consisting of two trees.
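The procedure above can be sketched as a small union-find in Python. Note this is a simplified illustration, not our actual implementation, and the triangle list below is made up so that it reproduces the two sets from figure 2:

```python
# Sketch of the disjoint-set forest: merge the vertices of each triangle,
# then two vertices share a root iff they lie on the same surface.
parent = {}

def find(v):
    """Follow parent links up to the root, compressing the path on the way."""
    parent.setdefault(v, v)
    while parent[v] != v:
        parent[v] = parent[parent[v]]   # path compression: skip a level
        v = parent[v]
    return v

def union(u, v):
    parent[find(u)] = find(v)

# Hypothetical triangle list consistent with figure 2's two surfaces.
triangles = [(1, 4, 5), (4, 5, 7), (5, 7, 8), (7, 8, 9), (2, 3, 6)]
for a, b, c in triangles:
    union(a, b)
    union(b, c)

same_surface = find(1) == find(9)   # vertices 1 and 9: same surface
```

Each triangle contributes only two cheap union operations, so the whole pass is nearly linear in the number of triangles, which is exactly the kind of efficiency we need.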
I hope you enjoyed reading this blog post. Describing all the details of how the algorithm works would blow up the length of this post, but if you are interested in further details you are welcome to ask questions in the comments section.
Hello again! I am halfway through my Summer of HPC adventure and want to update you on what I've been doing during the last few weeks. If you read my introductory post (if you didn't, I encourage you to do so), you already know I am working on a project focused on deploying machine learning applications on edge devices.
What are edge devices?
There are multiple different (but similar) definitions of edge devices, but the one I am using is that an edge device is a device physically located close to the end user. It can be a laptop, tablet, smartphone, single-board computer like an Arduino, or even a simple sensor like a hygrometer that is capable of saving and sending measured data. In our project, we aim to deploy a machine learning application on an average smartphone.
What are deep neural networks?
The machine learning model we chose to deploy is a deep neural network (DNN). A neural network is a model loosely inspired by the human brain. It consists of multiple layers of nodes (called neurons) which are connected with each other. The adjective "deep" means that the network consists of multiple layers (usually five or even more). Each neuron in each layer has weights, which are trainable parameters representing the knowledge that the neuron gained from data during the training process. The picture above shows a simplified schema of a shallow neural network. A schema of a DNN wouldn't be much different; it would just have more layers in the hidden layer(s) part.
Deep neural networks are nowadays the most popular and usually the best performing machine learning models. They aren't ideal, though. Some of the cons of neural networks are lack of interpretability, very long training time, long inference time and, very often, huge size of the files containing the pre-trained network. The latter two cons are the biggest problems when it comes to deploying neural networks on edge devices.
Why do we want to deploy machine learning apps on edge devices?
Theoretically, we could leave pre-trained models on HPC clusters or cloud servers, but in practice it's often not a good choice. For example, if we want to pass sensitive data to a model deployed in the cloud, we have to send it through the internet, which always raises security concerns and the risk of a sensitive data leak. Moreover, storing the model on an edge device allows us to use it even when no internet connection is available. Even if a connection is available, local inference avoids the latency that is an inherent part of communicating with a server over the internet.
So, what can we do to deploy DNN on the edge?
There are multiple techniques for compressing and optimizing DNNs for the inference process. While during the first three weeks I focused on the basics of training DNNs and an introduction to GPU and distributed training, lately I have been trying out and investigating one of the most popular ways to both compress the model and make inference faster – pruning. In its simplest form, model pruning sets the specified fraction of DNN weights with the smallest absolute values to 0. This makes perfect sense: when we pass data into the DNN, the data is multiplied by neuron weights to get the output. Obviously, multiplying a number by a value (weight) almost equal to 0 yields a result that is almost equal to 0, so by setting such weights to 0 we are not losing much information and we don't change our model too much. On the other hand, a model which contains a lot of zeros is easier to compress and its inference process can be optimized – many multiplications by 0 are fast to perform as their result is always 0. The picture below illustrates the idea by visualizing the model from the picture above after pruning that set roughly 50% of the neurons' weights to 0.
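The simplest form of this, magnitude pruning, can be sketched in a few lines of NumPy. This is a hand-rolled illustration of the principle, not the framework machinery I actually use for the experiments:

```python
import numpy as np

# Sketch of magnitude pruning: zero out the given fraction of the weights
# with the smallest absolute values (illustration only).
def prune(weights, sparsity):
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)        # number of weights to zero out
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

# Hypothetical weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))
w_pruned = prune(w, 0.5)
achieved = np.mean(w_pruned == 0.0)      # fraction of zeroed weights, ~0.5
```

In real use the pruning is usually applied gradually during training so the network can adapt to the missing connections, rather than in one shot as here.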
Below, you can see some charts from the pruning experiments I conducted. First of all, I have to mention that I experimented with a shallow model (fewer than 10 layers) and a small dataset, as I couldn't afford to train a deep model on proper data – it would take much more time, and I wouldn't manage to do much besides it. That's why training a single model took only around 11 minutes and the unpruned model had a size of around 10 MB. The plots below visualize how the compressed model size, model accuracy and training time change with the final model sparsity – the fraction of model weights set to 0, e.g. sparsity 0.75 means that 75% of the weights were set to 0.
I think that’s everything I wanted to write in this post! It became pretty long anyway (I tried to be as concise as possible but as you see a lot is happening), so I hope you didn’t get bored reading it and you reached this point. If you did, thank you for your time and stay tuned, as in upcoming weeks I will train deep model on big dataset using Power9 cluster and experiment with other methods of model compression & optimization!
Welcome to my second post! We are now in the fifth week of the program and everything is going well so far. Our mentors always watch our progress and are ready to solve any problems we face.
Coming to the project I am working on: since the first week, we have been assigned tasks and have tried to complete them within the time given to us. In the first couple of weeks, we studied the Fault Tolerance Interface (FTI) library and did some implementations and tests to get familiar with it. We also connected to the supercomputer and worked on it, running our applications and tests on MareNostrum4. Then we discussed the process and decided to research and implement lossy compression and approximate computing techniques for checkpoints, and to compare them in the next step. First, we each prepared a presentation on our topic – mine on approximate computing and my project mate Thanos Kastoras's on lossy compression. Over the last two weeks, we have been working on implementing the selected strategies in the library.
In my first post, which is here in case you haven't read it yet, I described the project in general terms. Now, let me go a bit deeper.
FTI is an application-level checkpointing library which allows users to select the datasets to protect, for efficiency in terms of space, time and energy. There are different checkpoint levels:
Level 1: Local checkpoint on the nodes. Fast and efficient against soft and transient errors.
Level 2: Local checkpoint on the nodes + copy to the neighbor node. Can tolerate any single node crash in the system.
Level 3: Local checkpoint on the nodes + copy to the neighbor node + RS encoding. Can tolerate correlated failures affecting multiple nodes.
Level 4: Flush of the checkpoints to the Parallel File System (PFS). Tolerates catastrophic failures such as power failures.
There is also a level for differential checkpointing (dCP), called L4-dcp. In differential checkpointing, only the differences from the previous checkpoint are stored, thus decreasing the space usage. The difference between no checkpointing, classical checkpointing and differential checkpointing can be seen in the simplified figure above.
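The core idea of differential checkpointing can be sketched in a few lines. This is a concept illustration only, not FTI's actual implementation: split the protected data into blocks, fingerprint each block, and write only the blocks whose fingerprint changed since the last checkpoint.

```python
import hashlib

# Concept sketch of differential checkpointing (not FTI's implementation):
# hash fixed-size blocks and report only the ones that changed.
def diff_checkpoint(data, previous_hashes, block_size=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    hashes = [hashlib.md5(b).hexdigest() for b in blocks]
    changed = {i: blocks[i] for i, h in enumerate(hashes)
               if i >= len(previous_hashes) or previous_hashes[i] != h}
    return changed, hashes

state = bytearray(b"aaaabbbbccccdddd")       # toy application state: 4 blocks
_, hashes = diff_checkpoint(state, [])       # first checkpoint writes all blocks
state[5:6] = b"X"                            # the application modifies block 1
changed, hashes = diff_checkpoint(state, hashes)
# only the single modified block needs to be written this time
```

When only a small part of the state changes between checkpoints, this reduces the written data from the full state down to the changed blocks, which is exactly where the space savings come from.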
Now, let me say a bit about the important functions which should be added to an application to enable FTI.
Firstly, at the beginning of the code, the function FTI_Init(const char *configFile, MPI_Comm globalComm) should be placed; it initializes the FTI context and prepares the heads to wait for checkpoints. It takes two parameters: the FTI configuration file and the main MPI communicator of the application.
Then, the user should call FTI_Protect(int id, void *ptr, int32_t count, fti_id_t tid) to protect data, so that it is stored during a checkpoint and loaded during a recovery in case of a failure. It registers a pointer to a data structure, its ID, its number of elements and the type of the elements.
There are two functions that perform checkpointing. FTI_Snapshot() makes a checkpoint at given intervals and also recovers data after a failure; it takes the information about checkpoint level, interval, etc. from the configuration file. With FTI_Checkpoint(int id, int level), on the other hand, the level to use is given as a parameter. This function writes the checkpoint data and creates the metadata and the post-processing work. Since it only performs checkpointing, two more functions are needed for the recovery part: FTI_Status(), which returns the current status of the recovery flag, and FTI_Recover(), which loads the checkpoint data from the checkpoint file and updates some basic checkpoint information. Together, these functions let the application detect a failure and start the recovery if needed.
Finally, after all is done, FTI_Finalize() should be used to close FTI properly on the application processes.
These were some of the important functions for adding FTI to an application; if you are seeking more information, you can find it here.
Coming to the new functionality we would like to add to the library:
Approximate computing promises high computational performance combined with low resource requirements, such as very low power and energy consumption. This goal is achieved by relaxing the strict requirements on accuracy and precision and allowing behavior to deviate from exact boolean specifications to a certain extent.
Checkpoints store a subset of the application's state, but sometimes reductions in checkpoint size are needed. Lossy compression can reduce checkpoint sizes significantly, and it offers the potential to reduce the computational cost of the checkpoint/restart technique.
That is all for now. In my next post, I will tell you more about precision-based differential checkpointing, the functionality I am currently trying to implement.
With a few weeks of progress into the program, it's high time to explain the project more in depth. The project title has given away a few key details – AI, linear algebra, and advanced algorithms – but how does it all fit together in the end, and how does it connect to the turtles from the title? Read on to find out.
Background
Assume that you are working on an advanced problem, perhaps the strength properties of a mechanical component which is itself part of an even larger problem, and that through some amount of simplification and parameterization you have turned this into the problem of solving a large system of linear equations. Now, if this system is sufficiently large (our starting toy example is a sparse matrix in the range of 500k by 500k entries, or 2.5e11 entries in total), even supercomputers have difficulties solving it in reasonable time. For this reason a number of algorithms have been developed to first approximate the solution as something known as a pseudo-inverse or preconditioner, all in the name of speeding up the solving step. We are specifically going to cover a Monte Carlo and a stochastic gradient descent algorithm, but as the algorithms are not the focus of the project, the specifics are left as an exercise.
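To see why a preconditioner helps at all, here is a deliberately tiny toy example (the project's Monte Carlo and stochastic gradient descent preconditioners are far more sophisticated than this): the simplest preconditioner of all, symmetric Jacobi scaling with M = diag(A), can shrink the condition number of a badly scaled system by orders of magnitude, and a smaller condition number generally means fewer iterations for an iterative solver.

```python
import numpy as np

# Toy illustration: diagonal (Jacobi) preconditioning of a badly scaled
# symmetric system via D^{-1/2} A D^{-1/2}, with D = diag(A).
A = np.array([[1000.0, 1.0,   0.0],
              [1.0,    1.0,   0.001],
              [0.0,    0.001, 0.001]])
d = 1.0 / np.sqrt(np.diag(A))
A_prec = A * d[:, None] * d[None, :]   # symmetric diagonal scaling

cond_before = np.linalg.cond(A)        # huge: diagonal spans six orders
cond_after = np.linalg.cond(A_prec)    # close to 1 after scaling
```

On our 500k-by-500k sparse systems, of course, even computing a good preconditioner is itself expensive, which is what motivates the approximate algorithms above.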
The important detail is that the performance of these algorithms depends entirely on the choice of a couple of parameters, creating yet another layer of complexity. Our job is now to approximately find the best parameter choice for the approximate solution to an approximate system. This is where we start to approach the famous mythological concept of the world resting on the back of a turtle, a turtle which, so as not to collapse, must rest on another turtle, which of course must rest on another turtle, and so on… It's turtles (algorithms) all the way down, as the saying goes.
Bayesian sampling based parameter search
We must stop this infinite chain of parameter search algorithms, and we do this with what is essentially a final, thorough parameter search over all possible matrix configurations, finding the function that maps a set of matrix features to the optimal parameters. This is close to simply running a bunch of experiments to generate a lookup table, but as each evaluation takes so much time, it is not possible to do this exhaustively, which calls for a more refined approach.
The solution is to build the model of the aforementioned function at the same time as you are doing the sampling. In each iterative step this allows you to find the preliminary optimal parameters, try sampling this configuration, and then refine the model specifically in that area by updating it with the new data point – a technique called Bayesian sampling. The algorithm can also be seen as a form of reinforcement learning, as it learns how to respond with an action (a set of parameters) to an environment (matrix features), although I would argue that term is more typically used for a more iterative process than the one studied here.
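The fit-sample-refine loop can be sketched in a few lines. To keep the sketch runnable I use a plain quadratic fit as the surrogate model and a made-up cost function; a real Bayesian approach would use a probabilistic surrogate such as a Gaussian process, whose uncertainty estimates also guide where to sample next.

```python
import numpy as np

# Toy sketch of the fit-sample-refine loop (quadratic surrogate standing in
# for a proper probabilistic model; expensive_run is a made-up cost function).
def expensive_run(x):
    """Stand-in for timing a solver with parameter x."""
    return (x - 0.3) ** 2 + 0.1

xs = list(np.linspace(0.0, 1.0, 3))          # a few initial samples
ys = [expensive_run(x) for x in xs]
for _ in range(5):
    a, b, c = np.polyfit(xs, ys, 2)          # fit the surrogate model
    x_next = np.clip(-b / (2 * a), 0.0, 1.0) # surrogate's predicted optimum
    xs.append(float(x_next))                 # evaluate there and refine
    ys.append(expensive_run(x_next))

best = xs[int(np.argmin(ys))]                # converges near the true optimum
```

The key property carries over to the real setting: each expensive evaluation is spent near where the current model believes the optimum lies, instead of being wasted on an exhaustive grid.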
This solution, however, also introduces a set of sub-problems in what is hopefully the last complexity layer of the problem: we need to choose suitable matrix features which really are closely correlated with the parameters, a suitable statistical model for regression, and a sampling strategy based on the statistical model for deciding on the next points to study. Finally, we also need to do all of this efficiently enough to train the model with enough data points. These questions will all be answered in the third blog post, so stay tuned!
Hello everyone, welcome to my second blog post for the SoHPC 2021! I hope you are fine and enjoying the summer. It’s been a while since my first post so I will try to briefly inform you about what I have been working on for the past few weeks. For starters, I will introduce the concept of fault tolerance and the checkpoint-restart method. Then, I will explain lossy compression’s role in accelerating said method, which is the goal of my work. Enjoy!
The Need for Fault Tolerance
Current HPC platforms, consisting of thousands or even millions of processing cores, are incredibly complex. As a result, failures become inescapable. There are many types of failures, depending on whether they affect one or several nodes and whether they are caused by a software or hardware issue or a third factor, but this digresses from this blog post's scope. The mean time between failures (MTBF) of a large-scale HPC system is about a day; approximately once a day the system faces a failure. Of course, all processes running during the failure will be killed and their data will be lost. This can be devastating for large-scale applications such as complex numerical simulations, which may execute for days, weeks, or even months. Also, as HPC systems increase in scale (see exascale systems), the MTBF decreases rapidly, creating many problems. Therefore, the development of fault tolerance mechanisms is necessary to be able to continue improving supercomputers' performance.
This is a model of the MTBF as a function of the number of nodes, for values from 1,000 to 10,000 nodes. The MTBF of one node is assumed to be 20 years, which means that a single node will fail once every 20 years. We can see that when approaching 10,000 nodes (a usual number for current HPC systems), the MTBF is less than 20 hours.
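The number in the caption is easy to reproduce: assuming independent node failures, the system MTBF is simply the single-node MTBF divided by the number of nodes.

```python
# Reproducing the figure's estimate: system MTBF = node MTBF / number of nodes
# (assuming independent node failures).
mtbf_node_hours = 20 * 365.25 * 24      # 20 years per node, in hours
nodes = 10_000
mtbf_system_hours = mtbf_node_hours / nodes
print(f"{mtbf_system_hours:.1f} hours")  # 17.5 hours, i.e. under a day
```

This is also why the situation worsens so quickly towards exascale: doubling the node count halves the expected time between system failures.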
Checkpoint/Restart Method
Until now, the most common fault tolerance method has been Checkpoint/Restart. Specifically, an application stores its progress (i.e. the important data) periodically during execution, so that it can restore said data when a failure occurs. Typically the data is stored on the parallel file system (PFS). The disadvantage of this method is that in large-scale applications there is an enormous amount of data to be stored, and due to the file system's bandwidth restrictions there is a high risk of creating a bottleneck and dramatically increasing the processing time.
To fight such limitations, certain mechanisms have been developed in order to minimize the data which is written to the PFS, easing the file system’s work. Some of those mechanisms are listed below:
Multilevel checkpointing. This method uses different levels of checkpointing, exploiting the memory hierarchy. This way fewer checkpoints are stored at the PFS, while the others are stored locally on the nodes.
Differential checkpointing. Instead of storing all the data in every checkpoint, this method updates only the parts of the already stored checkpoint which have changed, decreasing the amount of data that needs to be written to the PFS.
Compressed checkpoints. This method compresses the checkpoints before storing them, again decreasing the amount of stored data. Usually lossless compression isn't effective, since it doesn't achieve a large reduction in data size. Lossy compression, on the other hand, is very promising and will be tested during our project.
FTI: Fault Tolerance Interface Library
Our project is based on FTI, a library that implements four levels of fault tolerance for C/C++ and Fortran applications. The first level stores each node's data on local SSDs. The second level uses adjacent checkpointing, namely pairing the nodes and storing a node's data both on itself and on its neighbor. This level reduces the possibility of losing data, since for that to happen both nodes of the same pair must die. The third level uses Reed-Solomon encoding, which ensures that if we lose fewer than half of our nodes – no matter which – we can still restore our data. The fourth and last level stores the data on the PFS. The goal of this design is that, since the first three levels cover a large variety of failures, we can minimize the number of fourth-level checkpoints. Also, a differential checkpointing mechanism is implemented at the fourth level of FTI.
Lossy Compression’s Role
Wikipedia defines lossy compression as the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content. Namely, it's a category of compression algorithms that do not store all the information of the compressed data, but a good approximation of it. The main advantage of lossy compression is that its error tolerance is a variable, so we can adjust the compression's error tolerance to the accuracy of the application. As long as the error tolerance is lower than the application's accuracy, the application's results remain correct. For higher error tolerances, we can achieve higher compression rates and increased efficiency. My job is to implement and test lossy compression in FTI, using the lossy compression library zfp (by Peter Lindstrom, LLNL). After that, we will compare my results with the ones from my partner's implementation of precision-based differential checkpointing, and draw conclusions.
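The trade-off between error tolerance and compression rate can be demonstrated with a deliberately simple scheme. To be clear, this is not how zfp works internally; it is just a uniform quantizer followed by a lossless compressor, which already shows the key property that the reconstruction error stays bounded by the chosen tolerance:

```python
import math
import zlib

# Generic illustration of lossy compression (NOT zfp): round values to a
# chosen tolerance, then losslessly compress the much more regular stream.
def lossy_compress(values, tolerance):
    quantised = [round(v / tolerance) for v in values]
    payload = ",".join(map(str, quantised)).encode()
    return zlib.compress(payload)

def lossy_decompress(blob, tolerance):
    return [int(q) * tolerance
            for q in zlib.decompress(blob).decode().split(",")]

# Toy "checkpoint" data: a smooth array of floats.
data = [math.sin(i / 10) for i in range(1000)]
blob = lossy_compress(data, tolerance=0.01)
restored = lossy_decompress(blob, tolerance=0.01)
max_error = max(abs(a - b) for a, b in zip(data, restored))  # <= tolerance / 2
```

Raising the tolerance coarsens the quantisation, which makes the payload more repetitive and thus more compressible, at the price of a larger (but still bounded) reconstruction error, exactly the tunable trade-off described above.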
A Quiz for You!
At the end of this blog post, I have a quiz to challenge you! Which of the aforementioned methods do you think is most effective at reducing the I/O bottleneck of HPC applications, and why? Multilevel checkpointing, differential checkpointing, or compressing the checkpoints? Provide your answers in the comments, and don't forget to explain your reasoning! No answer is wrong, so don't be afraid!
One step forward, two steps back. That is the reality of programming: one moment you are making progress as easily as standing dominoes end to end, so that a nudge to the first one sends hundreds toppling by themselves; the next, you spend three days trying to fix a problem, not even sure there is anything to fix at all.
Me in the ‘Computerstraße’ (=computer street) in Vienna.
After the training week I started with the system setup on the UK national supercomputer, ARCHER2, required for the MPAS atmosphere model runs. This entailed installing new modules, setting up directories for libraries such as netCDF and HDF5 in a parallel environment, and preparing the Makefile for the MPAS installation. A lot of errors came up, but I persevered, and so was able to move on to the next step: preparing MPAS for small runs, meaning a small input mesh and a short simulation time. In principle there are two steps: first, create an initial condition file from a mesh and the input parameters; second, run the simulation itself based on that initial file. To do this well, and with a view to future model runs, you try to script everything you can. You of course need a job script, which tells the cluster the details, such as how many nodes to run the simulation on and how and where to store the output, but especially the run command itself. It is also very useful to write pre-processing, run and data-visualization scripts, because doing this 'by hand' for every run would be very time consuming. Over the previous week I concentrated on the visualization script, which I decided to write in Python, because on the one hand Python is widely used and easy to work with, and on the other hand it provides an interface to the netCDF library.
netCDF archive structure. Credit: Hoyer, S. & Hamman, J. (2017). xarray: N-D labeled Arrays and Datasets in Python. Journal of Open Research Software, 5. doi:10.5334/jors.148.
This is important because the output of the MPAS runs consists of netCDF files, in which the 3D mesh coordinates and the physically relevant output parameters, such as temperature at different pressures, wind velocity, vorticity, etc., are stored in n-dimensional arrays.
In my visualization script I load these datasets, assign them to one-dimensional Python arrays, and can then plot them with matplotlib like any other array. Of course, this is only part of the story; the next step is to connect the visualization script to the pre-processing and run scripts, and to automate it so that you only have to type in which parameters you want plotted and the rest happens by itself.
First visualisation attempt: spherical MPAS mesh with 2562 grid cells, colour-coded vorticity at 200 hPa.
Before we can visualize any data, we have to produce a dataset. We structure our test runs into small, medium and large ones, where the size refers not only to the mesh size of the model run but also to the simulation time and the core count we use. Additionally, we will investigate how different run parameters and some physical parameters influence the calculation time of our simulations, as part of understanding the puzzle of MPAS and computational efficiency on ARCHER2.
Top left: me (Carla N. Schoder), Top right: Co-mentor Dr. Mario Antonioletti, Bottom left: Mentor Dr. Evgenij Belikov; Bottom right: Project partner Jonas Eschenfelder
My job is to create a virtual replica of the Marconi 100 supercomputer in which you can visualise data such as the status and temperature of each node. To do this, we first need to know the technical specifications of this machine.
Architecture of Marconi 100
Marconi 100 is a supercomputer built by IBM and launched in 2020. It is composed of 55 racks, 49 of which are used for computing nodes. Each rack has 20 nodes, named using the rack number followed by the node number; for example, node 5 in rack 210 is named r210n05. Each node in turn contains 2 CPUs with 16 cores each and 4 GPUs. As a curiosity, with its 32 Pflop/s it was the ninth fastest supercomputer in the world in 2020, and it runs the Red Hat Linux distribution as its operating system.
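As a small illustration, the naming scheme can be expressed in a couple of lines of Python (the rack numbers below are made up for the example; the real racks are not necessarily numbered 1 to 49):

```python
# Hypothetical helper mirroring the node-naming scheme described above:
# rack number followed by a zero-padded node number, e.g. r210n05.
def node_name(rack: int, node: int) -> str:
    return f"r{rack}n{node:02d}"

# 49 computing racks with 20 nodes each gives 980 nodes in total.
racks = range(1, 50)  # illustrative rack numbers only
names = [node_name(r, n) for r in racks for n in range(1, 21)]

print(node_name(210, 5), len(names))
```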
The data is stored in an ExamonDB database. Since it is a very large amount of data and we want to make it easier to manage, it is split into a separate file for each node, which means 980 files.
The database holds information going back to when the supercomputer was first started: the temperatures of both the CPU cores and the GPUs; the power, voltage and memory consumed by each CPU; and, most importantly for us, the status, which indicates whether a node is working properly or has a problem. A value of 0 means there are no problems, and any other value signals a fault.
Creating the 3D visualisation
To create this visualisation we used the VTK (Visualization Toolkit) package for Python, together with Pandas to handle the data in table form. The CINECA visualisation team provided me with a 3D model of Marconi 100, which I exported in STL format, as it is very easy to load with VTK's vtkSTLReader class.
The VTK visualisation pipeline consists of the following parts. Sources are used to read raw data. Filters can then modify, transform and simplify that data. Mappers are responsible for turning the data into drawable objects. At the end of the pipeline are the Actors, which encapsulate these objects behind a common interface with many freely modifiable properties; the Actors are what we pass to the renderer to show them in the scene. On top of all this, more specialised classes can be created that allow us, among other things, to build a more interactive environment. In my case, I have used this to create keyboard shortcuts for moving along the time axis when displaying data.
VTK Pipeline
Using all this I have created a first 3D visualisation of Marconi 100. In this version I show the status of each node with a coloured plane (green if it is working properly, red otherwise), along with a text showing the consumed power. For quicker identification of each node, the rack number is displayed above it. You can see it in the video below.
The next step is to improve the graphics by adding textures and new data such as core temperatures, which could be interpolated as a temperature point cloud. In addition, data loading can be sped up by parallelising the process.
Hello! Welcome to my blog post #2, where we get deep into the MEEP! If you have not yet read #1, you can do so here. In this post I will give a more comprehensive outlook on the project and talk about my progress as well.
The Memory Hierarchy
Before delving into the nitty-gritty of the project, it is important to understand what the memory hierarchy is and why it is a fundamental factor in the performance of any computing system.
Imagine there's a large store in your city (this is obviously before the times of online shopping). It is well stocked and has everything you can think of, from a lawnmower to kitchen napkins. But there's a catch: shopping in this store takes quite some time, because:
Since it has so much stuff, you have to go through many aisles to get to what you need.
The store is located far away from your home.
Time is money, so your city's municipality decides to build a smaller supermarket near your neighborhood. It is not as well stocked as the mega store, but it has most of the things you need frequently. Furthermore, if you ever need anything that is only in the larger store, it can be fetched and brought to the small supermarket. Pretty cool, right?
This is basically how the memory hierarchy operates: we have smaller, faster memory levels placed closer to the CPU (Central Processing Unit), and they contain only the data and code we need at the moment, hence the faster processing.
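The supermarket analogy can be turned into a toy Python model: a small LRU "cache" in front of a big, slow "memory", with made-up access costs (the numbers are illustrative, not real hardware latencies). A loop that reuses a small working set pays the slow cost only on the first pass.

```python
# Toy two-level memory hierarchy: small fast cache, large slow memory.
from collections import OrderedDict

class Cache:
    def __init__(self, capacity, backing, hit_cost=1, miss_cost=100):
        self.capacity = capacity
        self.backing = backing          # the big, slow store
        self.store = OrderedDict()      # LRU: recently used data kept close
        self.hit_cost = hit_cost
        self.miss_cost = miss_cost
        self.cycles = 0                 # total access cost so far

    def read(self, addr):
        if addr in self.store:
            self.store.move_to_end(addr)   # mark as recently used
            self.cycles += self.hit_cost   # cheap: data is nearby
        else:
            self.cycles += self.miss_cost  # expensive: fetch from slow memory
            self.store[addr] = self.backing[addr]
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least recently used
        return self.store[addr]

memory = {addr: addr * 2 for addr in range(1000)}
cache = Cache(capacity=8, backing=memory)

# Reuse a small working set: after the first pass, every access is a hit.
for _ in range(100):
    for addr in range(4):
        cache.read(addr)

print(cache.cycles)  # 4 misses (400 cycles) + 396 hits (396 cycles) = 796
```

Without the cache, the same 400 reads would have cost 40,000 cycles, which is the whole point of keeping frequently used data close to the CPU.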
Need for Speed
This project proposes various placement and management policies to optimize the movement of data and instructions through the memory hierarchy. These novel policies have to be tested and experimented with before being cast onto silicon, and MEEP offers an excellent platform for doing so.
MEEP (MareNostrum Experimental Exascale Platform) is an FPGA (Field-Programmable Gate Array) based platform that enables us to evaluate new architectural ideas at speed and scale, and also enables software readiness for new hardware. It is easiest to think of MEEP as a foundational prototype used to test the viability of a given framework or architecture. One of MEEP's unique features is a self-hosted accelerator: data for the calculations can reside in the accelerator memory and does not have to be copied to or from the host CPU. This accelerator is part of the chiplet architecture called ACME (Accelerated Compute and Memory Engine). A key differentiator of ACME is its decoupled architecture, in which memory operations are separated from computation. While dense HPC workloads are compute-bound, so that the accelerator tiles are the bottleneck, sparse workloads are memory-bound, as the vector elements need to be gathered and scattered using multiple memory requests.
Coyote Simulator
The cherry on top of this accelerator of the future is the Coyote simulator, which is responsible for its performance modelling and analysis. My role in this project centres on this simulator, specifically the MCPU simulation. Coyote is built on existing simulators (Spike and Sparta) and improves on them by addressing their shortcomings, especially in the HPC domain, where the number of resources to be simulated is high, making it a powerful modelling tool. The name "Coyote" was adapted from the Looney Tunes cartoon series "Wile E. Coyote and the Road Runner".
These past two weeks I have been busy setting up the simulator repository and its dependencies on my PC, and working on scheduling policies for load and store instructions within the MCPU. The latter is still a work in progress, and I plan to expound on it in my next blog post.
Emulation … Simulation … what’s the difference?
Simulation involves creating an environment that mimics the behavior of the actual environment, and is usually software-only. Emulation, on the other hand, involves duplicating the actual environment in both hardware and software. Below is an illustration of how the MEEP project implements both emulation and simulation.
And that’s it for blog post #2. This was a long one so congrats if you made it this far. I hope you have learnt something new or gained a clearer perspective of our project. I’d be happy to answer any questions you may have about the project so feel free to comment below.
With the third week of work in the books, it’s high time for another update here. We’ve been busy getting everything set up on ARCHER2 to be ready to run proper simulations and get into the performance measurement of the MPAS atmosphere model.
After some technical issues with downloading and compiling MPAS during the first week, we set our eyes on writing scripts to help automate future runs. As we intend to simulate a wide variety of scenarios using different parameters, such as mesh size, model run time and number of nodes, not having to change these manually for each run will save a lot of time in the future.
Figure 1: An idealized baroclinic wave simulated for 150 hours propagating through the world’s atmosphere
Last week we were finally able to start running the first simulations. This wasn't a real test scenario yet, but a simple idealized case. Figure 1 shows the result of a Jablonowski & Williamson baroclinic wave modelled for 150 hours (just over 6 days) propagating across the Earth. This is a very simple test, often used to check whether an atmospheric model fundamentally works: it models a single pressure wave propagating zonally around the Earth.
Figure 2: Plot of average run time by number of cores for a 150-hour simulation of a JW baroclinic wave.
While not very exciting to look at, these kinds of tests help us see whether MPAS is working properly and give us important insight into how best to design our upcoming experiments to make good use of our time and computing budget. They also give us a first look at the performance of the MPAS atmosphere model. As seen in the figure, the average run time dropped from 1 to 3 cores, showing better performance with more computing power, but increased again when 4 cores were used. This is most likely because any performance gain from further parallelization is outweighed by the extra time spent in communication between the tasks. These are the kinds of insights we hope to find later in our performance analysis, on a much bigger scale, and we will investigate them using profiling tools to pinpoint where performance is lost.
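This trade-off can be mimicked with a toy model (the numbers below are made up for illustration, not measured ARCHER2 data): compute time shrinks with the core count while communication overhead grows with it, so the total run time has a sweet spot.

```python
# Toy performance model: total time = compute share + communication overhead.
def run_time(total_work, cores, comm_cost_per_core):
    compute = total_work / cores            # work divided among the cores
    communicate = comm_cost_per_core * (cores - 1)  # overhead grows with cores
    return compute + communicate

# Illustrative parameters chosen so the optimum lands at 3 cores,
# mirroring the behaviour described above.
times = {c: run_time(total_work=120.0, cores=c, comm_cost_per_core=15.0)
         for c in range(1, 5)}
print(times)  # run time falls from 1 to 3 cores, then rises again at 4
```

In the real analysis, profiling tools take the place of this guesswork by measuring where the communication time actually goes.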
With the experiences gained from the last few weeks, we are now ready to start the proper performance analysis of MPAS, next up is debugging some of the automation scripts and running small scale simulations for our first real scenario. So, stay tuned for more updates!
Hi everyone, my name is Rajani Kumar Pradhan, and I am from India (Andhra Pradesh). Currently, I am pursuing my PhD in the Department of Water Resource and Environmental Modelling, at the Czech University of Life Sciences Prague. My research interest is centred around satellite precipitation, climate change, hydroclimatic variability and the global water cycle.
Before starting my PhD I worked as a Junior Research Fellow at Banaras Hindu University, India, and before that I completed my master's in environmental science at the Central University of Rajasthan, India. Although I had some general idea about high-performance computing and big data, the real encounter started when I faced some huge datasets. That is the moment I realized the importance of HPC in data analysis, and I thank my supervisor Yannis Markonis, who introduced me to the PRACE Summer of HPC 2021.
The journey started with a one-week training programme, going from the basics of HPC to some advanced topics. It is unfortunate that we could not meet or attend in person given the pandemic, but meeting online over Zoom was also a unique experience. During the training week I was introduced to several new terms, such as MapReduce, OpenMP and MPI, most of which I had never heard before. To bring some more fun into the programme, hands-on exercises were scheduled after the lectures.
Coming to our main project: the programme offered several excellent, diverse projects. Among them I selected project number 2133, The convergence of HPC and Big Data/HPDA, which I found the most interesting and very relevant to my needs. The main motivation behind the project is how to tackle the challenges in converging big data analysis with HPC.
In the following week I started on the main project with Giovana Roda and my colleague Pedro Hernandez Gelado. We began with another training week and a basic introduction to Hadoop, HDFS and Spark, along with exercises to practice on the Vienna Scientific Cluster (VSC-4). Initially we worked on the Little Big Data (LBD) cluster and later moved to VSC-4. In the upcoming weeks we will work on some case-study datasets on the cluster, and I could not be more excited about it! Stay tuned; I will be back with new updates.
In my last blog post, I presented the Boltzmann-Nordheim equation and explained that supercomputers can help us solve it. This time, we will look at the supercomputer itself and how we can use it.
Why do we need Supercomputers?
Before looking at supercomputers, we might ask ourselves what kind of problems we need them for. One major application of supercomputers is simulation. A current example is the COVID-19 pandemic, where PRACE supports projects focussing on different aspects of the crisis, such as simulations of the spread of the disease or research to understand the virus itself. Another example is the weather forecast, where supercomputers use the current weather to predict the weather to come. You will find many more examples where supercomputers play an important role.
Could my computer solve these problems?
For some of these problems, your personal computer might be able to compute a solution, but not to the precision that would satisfy scientific needs and standards. Furthermore, your computer would simply take too much time to finish the task. In the last weeks, I tried to run the code that simulates the Boltzmann-Nordheim equation on my computer at home; after an hour of waiting for a response, I just stopped the computation. Luckily, my supervisors gave me access to a more powerful computer, where I did not have to wait hours for a result.
But what makes it faster?
It is not just that supercomputers have a faster CPU or more RAM than your PC (Fugaku, currently the fastest supercomputer, has petabytes of RAM!). Their major strength is that they have many more CPUs than your PC. You can think of them more as many computers working together than as a single big machine! In the video below we can see how we benefit from having multiple CPUs.
The "PRACE" mesh was generated via png2mesh, a C++ library by Johannes Holke that produces meshes from .png files. The backbone of the project is t8code, a library to manage the adaptive mesh refinement in your project.
And how does that relate to your project?
In the video of the last post, I showed that I am focussing on the part of the equation that describes the collision of the particles. My supervisor, Alexandre Mouton, has developed a method that further subdivides the collision term into pieces, such that they can be computed in parallel. My task is to gain a good understanding of this method and to add it to the current state of the code. In the last weeks, I looked at his code and installed it on my computer. Before editing the code, I need to understand it, and with thousands of lines of code, that was not easy. Luckily, I do not have to read and understand all of it, as I do not have to adapt the whole project. Recently I started to implement the new method and planned which parts of it I will focus on in the coming weeks. For now, I will head back to coding, but I will keep you up to date on every bit of progress in the upcoming posts.
Hello, my name is Martin Stodůlka and I will use this blog post to introduce myself. I am 25 years old and was born in, and still live in, the town of Brno in the Czech Republic. I earned my Master's degree (the Ing. title in my country) this year, with a specialization in HPC, having completed both my Bachelor's and Master's at the Brno University of Technology – Faculty of Information Technology.
Even before university I attended the grammar school Vídeňská, which specialized in programming and gave me my first experiences with it. Outside of school, ever since I was 18 I have worked part time during the summer holidays at Tescan, a company that manufactures and develops electron microscopes and the software that operates them. I mostly developed tools for processing large data and worked on some image-processing methods as well.
Outside of programming I am just an average guy who likes tourism, sightseeing, PC building and gaming. I used to go fishing and collect minerals when I was younger. Lately I have been thinking about getting into airsoft, to get some more physical exercise and keep in touch with my childhood friends.
Why have I chosen SoHPC and what do I expect from it?
I chose Summer of HPC to test and expand the skills I learned at university. I am also hoping to explore a new environment, meet new people and perhaps find new opportunities. I learned about Summer of HPC during one of my lectures on HPC, where previous participants shared their experiences.
The project I am working on
I have been assigned to a project called Designing Scientific Applications on GPUs together with my colleague Theodoros; my mentor for this project is Dr. Ezhilmathi Krishnasamy. For the rest of the summer I will be working on optimizing an application for large-scale topology optimization called TopOpt, which uses the PETSc library for highly parallel scientific applications modeled by partial differential equations. My goal is to locate slow parts of the TopOpt code and speed them up using CUDA/GPUs, replacing CPU PETSc code with its GPU variant.
This is Paolo Scuderi from Italy. Well guys, I am 25 years old and no longer a student: I graduated from Politecnico di Torino just a few days ago, at the DIMEAS Department. But let me introduce who I am and how I came into contact with PRACE.
Something about me
I am an Aerospace Engineer and graduated cum laude on the 14th of July 2021. I spent the last year developing my Master's thesis in collaboration with the von Karman Institute for Fluid Dynamics in Belgium. Unfortunately, due to the strange and hard Covid period, I was not able to visit the Institute and spend some days on-site; I worked online. At the beginning of that experience I was sad. Of course, traveling and working on a project that you love is the happiest combination any student-researcher could wish for. Anyway, having spent many hours in front of my PC, I joined the CINECA Academy. With their continuous support I encountered PRACE, and I discovered a new passion for High-Performance Computing! And why is this the perfect combination? I studied Aerospace Engineering and specialized in Fluid Dynamics. As you probably know, the Navier-Stokes equations cannot, in general, be solved analytically. Hence, only two approaches are possible: the experimental one and the numerical one. The former is very expensive and hardly usable during a period like this, so my Master's thesis followed the numerical path. In addition, given the high computational cost of these simulations, HPC systems have many applications in the fluid-dynamics world. For these reasons, when PRACE was presented to me, I felt incredibly motivated and started to look for projects related to my field of studies. I found Summer of HPC, an amazing and interesting programme whose aim is to give students a good working knowledge of HPC systems around Europe. What could be better? Well, there was only one problem: finding a project related to Fluid Dynamics. Fortunately, one came from the University of Luxembourg, where I am currently working.
Aerodynamic project at the University of Luxembourg
My project aims to study the behavior of the external flow over the DrivAer car model developed at TU Munich. There are several configurations based on the shape of the top geometry, the underbody geometry, the mirror configuration, and so on. From an aerodynamic point of view, cars are blunt bodies. The two elementary contributions to the aerodynamic forces are pressure and shear stress, and for a blunt body the first is larger than the second; this is the case for the DrivAer problem. For such bodies, a big wake forms in the region behind the car. The wake is associated with a loss of momentum, which in turn relates to the component of the total aerodynamic force projected along the direction of the oncoming flow: the drag. This is only one of the contributions to the total drag of a car; the wheels and the underbody region, for instance, also contribute to it. Consequently, to keep moving, a car must develop a motive force large enough to overcome the drag. Cars burn fossil fuels to develop this force: the chemical energy released by the reactions is transformed into mechanical energy by the engine, and as a result a large amount of waste is released into the environment. One of the most important aims of car aerodynamics is therefore to find mechanisms to reduce drag, and with it the carbon dioxide and NOx released into the Earth's atmosphere. These considerations do not hold for motorsport cars, where the main goal is to develop an aerodynamic body that reaches the highest possible speed. Drag is the dominant aerodynamic force at very high speed, so for urban cars it matters more on the motorway than in the city. Even so, aerodynamic studies were not considered important until the energy crisis of the 1970s.
Nowadays, the aerodynamics of urban cars is important to save our environment. I am studying all these aspects in my project!
Something about you
And now, some questions for you. Are you ready? Feel free to answer and contact me, there are no wrong answers. We’re all here to learn something new!
Who are you? What and where are you studying?
Can you think of other problems, like the Navier-Stokes equations, that are currently unsolved? And are you sure the Navier-Stokes equations can never be solved analytically?
Do you have any idea how long it takes to resolve all turbulence scales (DNS) for a complex geometry? Why do we have to use HPC systems?
Do you have any idea about the aerodynamics difference between a classic car and a motorsport car?
Do you have any idea what percentage of the carbon dioxide in the atmosphere is related to cars?
Hello, I'm Eoin Kearney. I'm 23, Irish, and I've just completed my master's in chemistry in the Erastova group [website here] at the University of Edinburgh, working on molecular dynamics simulation. My project involved modelling uranium adsorption in swelling clay at various pH setups, using Eddie, the HPC in Edinburgh.
I’d always enjoyed computational work, but science indulged my curiosity for the wider physical world, and so I embarked on my chemistry degree in 2015. A combination of changing trends in science and education towards computers, and COVID-19, set me up for a great computational project this year. Though unexpected initially, I genuinely enjoyed it and once I had a taste was eager to take it forward and build on my experience. On top of that nuclear science has always captivated me and so when I saw PRACE had computational opportunities attached to nuclear fusion, I was sold.
My project, shared with another student, is titled 'Computational atomic-scale modelling of materials for fusion reactors.' It involves atomistic modelling of tungsten reactor vessel walls, especially around defect sites induced under the extreme conditions of fusion. It's at a similar scale to my master's work, though focused more on collisions than on adsorption behavior.
In my free time I enjoy hiking and camping with my friends. Living in Scotland has given me a few opportunities to see some of the lovely, rugged landscape in the highlands, and I'd like to see more of the world. My next goal there is to complete the old pilgrimage route, the Camino de Santiago in Spain, or at least to see the Pyrenees. Otherwise I enjoy reading, though my master's has restricted non-chemistry-related topics so far!
So I applied to SoHPC, was luckily accepted, and now here I am. I've enjoyed the training week so far. As far as text editors go, I've always defaulted to nano, and it's a hard habit to break. Vim is intimidating, but I've finally learned how to exit it, so that's progress. Seeing the perspectives of different training courses also reinforces the basic key concepts.
The next step is learning the personality of the HPC cluster at the Barcelona Supercomputing Center, MareNostrum 4 [more info here]. It's scary just how many facets I've skated by, like checking the available modules on the HPC! Previously I had gotten into the habit of using only the necessary ones, and it's interesting to shake up my workflow and see the diversity of applications these molecular modelling programs are put to. It will be a lot to learn, but I think that's the strongest point of PRACE: it's a fast and deep dive into a complex area. It is what I will make of it.
I had some time in Scotland to see a good bit of the country, this is up near Ullapool
“The only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle.”
– Steve Jobs
This is me in my Summer of HPC t-shirt.
Hello, I am Aneta, 24 years old, and currently an undergraduate computer science student at Matej Bel University in Banská Bystrica. I decided to take up computer science three years ago. Before that I was mostly focused on the humanities, but I figured out that I am bad at memorizing and wanted a change in my life. So far, this decision is one of the best I have ever made. Even though I am still not sure which field of computer science I want to pursue a career in, I know that I am on the right path.
During my studies there were a few subjects where High Performance Computing (HPC) was discussed and I was introduced to parallel programming. I must say that I enjoyed all of them, so when one of my university teachers brought up Summer of HPC, it immediately piqued my interest. For those who are not familiar with the concept, HPC "refers to computing systems with extremely high computational power that are able to solve hugely complex and demanding problems", as stated on the European Commission's webpage.
In May I found out that I had been accepted by the Summer of HPC committee, and I felt great! I was selected for project number 2101, which is about the analysis of data-management policies in HPC architectures. The project is run by the Barcelona Supercomputing Center which, needless to say, is located in Barcelona (Spain), and is part of the MEEP project, whose objective is to create an emulated software-development platform for exascale systems. To put it simply, my job is to analyze and compare ways of moving, storing and accessing data through the different levels of the memory hierarchy in order to achieve high performance and efficiency.
Over the next two months, I hope to learn about how teams work on large projects, to gain new skills that I may put to use in the future and maybe to find some new friends.
If you enjoyed this article, please share it with your colleagues and friends. Also, do not hesitate to leave a comment below.
Hello there! Welcome to my blog, my name is Maria Li López Bautista and I will be posting about my journey in the Summer of HPC program.
With the end of the training week and the start of the projects, it’s time for the presentations. So, here we go…
Introducing me
To begin with, I'm a 20-year-old undergraduate student of theoretical physics. I'm currently in my third year and thinking about my final degree project, what a nightmare having to choose! But fortunately I've been accepted into this amazing summer program, which is full of knowledge about the programming world and its scientific applications. So I hope it helps me decide... fingers crossed!!!
Well, let’s leave my student sorrows behind and proceed to properly introduce myself. First of all, I was born and bred in Barcelona, a city with a never-ending supply of things to do. If I had to describe myself with one word, it would be curious, because ever since I was a child I have needed to know what lies beyond first appearances. Maybe I should also say that I’m a bit stubborn, because once I have an idea fixed in my head I don’t stop until it comes true. Finally, about my hobbies, I have to say that I love listening to music and reading science-fiction books, as well as hanging out with my friends.
About the Project
SURF Research Cloud: a workspace’s components, including datasets
To sum up, we’ll learn how to handle storage for data processing in scientific research, how to apply it in cloud computing, and how to use Research Data Management principles in practice on real cases. Therefore, we’ll be working in the Research Data Management field and getting into the SURF Research Cloud.
(Virtual) Greetings from Barcelona! I hope to see you around here!
Hello from Vienna! My name is Carla Nicolin Schoder, and my master’s thesis project in Astrophysics at the University of Vienna is not yet finished. I study the influence of dense galaxy cluster environments on dwarf galaxy behavior, using simulations performed on the Vienna Scientific Cluster and, later on, comparisons to observations of the Virgo galaxy cluster.
Background: Owl nebula in narrow-band filter (OIII & Hα) with Vienna Little Telescope (0.8m), Credit: C.Schoder. Front: Infrared-Image of me.
My path to Astronomy, computing and natural science started when I was a child in the countryside of Austria, in the dark nights. Even then I was a night owl, so I packed my blanket and my dog and left into the dark; the fascination of the unbelievable vastness of space and the beauty of the stars ignited between me and Astronomy the spark.
Significantly later, programming found its way into my heart. I still remember my tantrums with my first computer, which took half an hour to start. I have always seen computers as a tool to answer my scientific questions, but in the meantime I have discovered that I love to lose myself in programming sessions.
When I started my master’s project a few months ago, naively thinking I was an Astronomer, I had not yet guessed that I would soon be a supercomputer programmer. As it turned out, working with supercomputers would not be limited to my dwarf galaxy investigation, because I was accepted into this amazing PRACE Summer of HPC as part of the MPAS atmosphere model project, in collaboration with Jonas Eschenfelder (project partner) and Evgenij Belikov (project mentor).
The Model for Prediction Across Scales (MPAS) is able to model the atmosphere both globally and within small geographical regions, to investigate, for example, air quality and atmospheric chemistry, or to predict hurricanes and seasonal weather. Its special capability is its unstructured spherical centroidal Voronoi mesh (structured in the vertical direction), which allows smooth selective mesh refinement with increased accuracy and flexibility.
The goal of the MPAS summer project is to run this atmosphere model on ARCHER2, the new UK National Supercomputer, and to push the MPAS model to its scalability limits, hopefully contributing to the model’s promising future. I am joyfully looking forward to joining the PRACE Summer of HPC for the next weeks, excited to work with a UK supercomputer and the MPAS atmosphere model, and to collaborate with other computer geeks.
My name is Joseph, I am 23 years old and I am a Computational Physicist studying at the University of Edinburgh, about to undertake my final Master’s year. It has not been a typical university experience with many interruptions to our learning (strikes, extreme weather, and some sort of pandemic), but my love for the logical simplicity of coding and the mesmerising beauty of Scotland has made up for these roadblocks many times over.
I happened upon PRACE in the midst of my search for internships this summer and found the vast array of projects incredibly appealing, particularly those involving Machine Learning. Having taken an introductory course in Artificial Intelligence, my interest in training models with big data to invert the problem-solving method led to my choice of Neural Networks in Quantum Chemistry, under the stewardship of Dr. Marián Gall at the Slovak Academy of Sciences. To elaborate on the project: typically, scientists discover chemicals in nature and use their known molecular structure to analyse their chemical properties with a quantum-mechanical treatment. However, this process is computationally dense and gets exponentially more expensive the more complicated the molecular structure is. Thus, with the help of artificial neural networks, we will replace the traditional analytical method with a machine learning approach, using molecular descriptors generated via DScribe as input. Moreover, in the name of high-performance computing (HPC), we will aim to parallelise the neural network using both GPUs and CPUs, comparing their relative speed-ups.
Learning that we would have travelled to Dublin for our training week in a “normal” year was quite the dampener as I have always longed to visit Ireland, but this disappointment was quickly swept aside by the wisdom being presented to us remotely by the Irish Centre for High-End Computing (ICHEC). In particular, the techniques to speed up Python code using tools such as Numba and Cython were especially insightful (and would have been incredibly useful a year ago when I was conducting week-long computer simulations of DNA!). Being able to exploit the power of their supercomputer, Kay, was also very exciting and provided a great opportunity to brush up on my remote desktop Linux commands via software such as Vim, Bash and Slurm.
Once again, I am incredibly eager to begin working on my project and look forward to a very enlightening Summer of HPC!
Hey there! My name is Artem, and I am a master’s student of Mathematical Engineering at the University of Padova, Italy. This year I am finishing my thesis work on parallel preconditioners; more precisely, I am working on a GPU implementation of the FSAI preconditioner for general matrices. My story with High-Performance Computing (HPC) began with a course from my curriculum (try to guess its name..), and since then I have been fascinated by parallel algorithms. Moreover, I am interested in numerical methods. A good thing for me is that parallel algorithms go hand in hand with numerical methods.
At the beginning of this year, I was not so sure about my future, so I decided to look for some opportunities, and after a short search matching my interests I had already bookmarked the PRACE summer programme. Among many attractive projects, I found project 2122, “Numerical simulation of the Boltzmann-Nordheim equation”, which was in my area of interest and closely related to my curriculum.
The Boltzmann-Nordheim equation, or quantum Boltzmann equation, describes the time evolution of a gas. When a gas formed of bosons is cooled down to a low temperature, we can observe a Bose-Einstein condensate, which is one of the most striking quantum phenomena in nature. It is also described as the fifth state of matter. The study of this equation is still ongoing and can bring us closer to an understanding of this quantum effect.
Since this equation is written in a phase space with potentially 7 dimensions (1D time + 3D space + 3D velocity), it is a challenge to come up with a numerically fast, accurate, and stable scheme that solves the problem. The purpose of this summer project is to extend the present work of the KINEBEC project to the non-homogeneous case.
With the help of PRACE summer of HPC program this year I will join together with my fellow teammate David Knapp the KINEBEC project!
Hi! So you’re probably wondering why I, a guy who spent the last three years mostly looking at rocks, am now here playing around with the UK’s new national supercomputer? Well, I’m happy to tell you all about that…
But first a little bit about myself. My name is Jonas; I’m originally from Bavaria in Germany but am now in my third year studying geophysics at Imperial College London. My interests in science lie with impact craters around the solar system, the changing environment right here on Earth, and how we can use modelling to understand both better. When I’m not fretting over school work, I like to spend my time watching fast cars go in circles in Formula 1, playing Dungeons and Dragons for hours on end with my friends, or going hiking if the British weather ever permits it.
As a teenager, I didn’t know what I would later want to do in life. I was interested in all sorts of things, from history and sports to science; everything seemed cool to me. But that changed during a year abroad in the US, where I took my first geology course. Learning how you could figure out what a place might have looked like millions of years ago, just by looking at features in the rocks there today, fascinated me, and I became obsessed with geology. So I applied to Imperial College London and was fortunate enough to get a place at the Earth Science and Engineering department there to pursue an MSci in Geology. Early in my first year, though, my interest shifted from the actual rocks to the ‘bigger picture’: the Earth’s oceans and atmosphere interacting across the globe, entire continents moving around, and giant meteors crashing into planets suddenly seemed so much more exciting, and so I switched to Geophysics.
A crater cluster on Mars. This cluster also shows signs of ice on the floor Image from NASA/JPL/UArizona
Before university, I barely knew how to turn my laptop on, and CPUs or GPUs only mattered when it came to gaming, but my new direction towards geophysics led me to learn how to code. From the first course onwards I enjoyed it; seeing the intersection between the computing world and the natural sciences especially excited me. Because of this, I did a research internship at Imperial in my second year, looking at how clusters of impact craters form when larger meteors fragment in Mars’ atmosphere. Seeing how the models influenced our search for patterns, and vice versa, was exciting, and I wanted to learn more about modelling in and of itself. So next year my master’s thesis is going to model how river chemistry changes across the Clyde Basin in Scotland, with the hope of figuring out where pollutants are introduced into the water.
Now you might ask yourself how all of this leads to PRACE and the Summer of HPC, and that is a good question. I found out about this program through a professor at university and at first was hesitant to apply, since it seemed way too advanced for me. But the projects sounded interesting, so I applied after all, hoping to challenge myself, learn about the pinnacle of computing power, and understand how some of these large models, whose results I so often just read, actually work in the background. I got accepted and will work with Carla Schoder on the MPAS atmosphere model at EPCC. Sadly we won’t be able to taste-test a deep-fried Mars bar in Edinburgh, as the entire program is remote. But I’m still excited about the project and about working with new and interesting people.
So what’s next for me? Well, last week we finished an intense training week and now I actually know what a supercomputer is and how to operate one. We will work together to figure out how to best test MPAS and then get to play around with it on ARCHER2, the UK’s new national supercomputer. I will of course keep you updated here with all the cool stuff we’ll be doing and I’m learning, so stay tuned.
Hello everybody! My name is Marc and I am writing from the Mediterranean city of Barcelona, home of several personalities such as Antoni Gaudí, Mercè Rodoreda and Leo Messi. I was born in this city 23 years ago. Currently, I am studying physics and maths at the University of Barcelona.
About Me
I would describe myself as an active person, maybe a little bit impatient too. I am passionate about science and maths, and I am interested in the huge number of opportunities that computer science gives us to study physical systems. I also play guitar (extremely badly, actually); in particular, I enjoy playing jazz. If I had to choose a film, it would be Midnight in Paris by Allen; a book, 1984 by Orwell; and a song, Isadora by Christian Scott.
About Summer of HPC
Regarding my virtual stay in the Summer of HPC, I am involved in project 2120, the High Performance Quantum Fields program. Specifically, I am beginning a project about Carbon Nanostructures (CNS). The main goal of this project is to collaborate with the CNS group at the Jülich Supercomputing Centre, creating a C++ library and an algorithm to model CNS using fermionic matrices. I will learn, and try to post, about C++, OpenMP, GPU programming and, in general, scientific program design.
I hope to get to know more about all of you. And if you plan to visit Barcelona, do not hesitate to contact me: I have a list of the best restaurants!!
Hi all, this summer I am doing the “Summer of HPC” programme. In this blog, I want to share my experiences and thoughts with you. But first, let me introduce myself:
Introduction
Me, doing a handstand with the new “Summer of HPC” T-shirt
My name is Miriam and I am 28 years old. Until 2018 I studied Mathematics. After that, I started working as a software engineer. While I enjoyed the programming part of my work, I wanted to learn more about the theory and applications of Computer Science. Therefore, I decided to do a Master’s in Computer Science, which I am currently pursuing at the University of Edinburgh.
In my leisure time, I love doing sports, gardening, cuddling my cat and baking cakes.
Motivation
High-performance computing (HPC) combines my interests in Computer Science and Natural Sciences due to its wide range of applications. What I particularly like about HPC is that it is often necessary to use advanced programming and software engineering techniques and that it is therefore closely related to research.
The Summer of HPC programme is a combination of HPC training and practical work. I am very enthusiastic about increasing my knowledge of high-performance computing and getting practical experience in applying this knowledge to real-world problems. Therefore, when I first heard of the Summer of HPC programme I knew immediately that I wanted to apply. I am very pleased that my application was successful and very excited about what I will learn during this summer.
First training week
The first training week covered the usage of Python in HPC, OpenMP, and MPI. What I enjoyed most during this week was the mixture of theory and practice, which made the learning outcome accordingly high.
From next week on, I will be working on the project “HPC Implementation of Molecular Surfaces”, organised by the VSC Research Center in Vienna (https://summerofhpc.prace-ri.eu/hpc-implementation-of-molecular-surfaces/). I am looking forward to all the experience and insights I will gain during the next two months!
Finally – a small quiz for you
To keep my brain working during the first week, I definitely needed enough cake! So that I am not the only one who benefits from this cake, I implemented a small rebus (https://en.wikipedia.org/wiki/Rebus) for you on it (see the picture below).
Photo of my workplace
Marzipan-Nut-Cake. Can you guess the phrase in the Rebus?
Can you guess what I have “written” on the cake? You are welcome to comment below if you found a solution. If it is too difficult I can also provide some tips.
Hello everyone! My name is Athanasios Kastoras (but I go by Thanos) and I am really excited about the opportunity I’ve been given, to participate in this year’s PRACE Summer of HPC. I will be working on Precision based differential checkpointing for HPC applications with the Barcelona Supercomputing Center, but more about that later.
Who am I?
I was born on October 19th, 2001, in the city of Volos, Greece, where I was also raised and now study. I loved maths from a young age and was curious about science, but I never got in touch with Computer Science and programming until I went to university. When I entered the Electrical and Computer Engineering Department at the University of Thessaly, I was seduced by the C programming language and later by the amazing fields of hardware and computer architecture. Now I’ve just finished the second year of my studies and I’m already passionate about computational speed, software optimization, and HPC system programming. This is why I applied to the PRACE SoHPC program: I believe it is the best starting point for my journey into the limitless world of high-end computing.
If you think you can do a thing or think you can’t do a thing, you’re right.
Henry Ford
Some facts about me
I love coding! Whether we talk about scripting or low-level programming, I can look at lines of code for hours and never be bored.
I am a Linux enthusiast! In the last two years, I’ve switched between Ubuntu, Kubuntu, Fedora, CentOS, and Manjaro Linux distributions and I always love to experiment with new Operating Systems on my personal laptop.
I practice Karate! Since the age of seven, I’ve been practicing Shotokan Karate. So far, I’ve earned a black belt with a second Dan, along with many experiences and exciting skills.
I love camping and nature! I always look for a chance to spend time in nature and relax away from the fast city life.
My experience in SoHPC so far
Today, I finished the SoHPC training week. We learned many interesting things during the week, about the Python language in HPC and about parallelizing C and Fortran code using OpenMP and MPI, and, more importantly, we ran experimental programs on an actual supercomputer. It is amazing to know that the code you wrote on your personal laptop is running on a supercomputer in Ireland with just the press of a button (actually, the button was a script, but anyway).
This is the Barcelona Supercomputing Center. I recently learned that it is built inside a church. Isn’t that cool? Unfortunately, the program is remote, but this place definitely goes on my travel bucket list!
The project I will be working on
During the summer I will be working on a project called precision-based differential checkpointing for HPC applications. I will be dealing with FTI, a library that helps programmers make HPC applications less vulnerable by providing four levels of checkpointing (i.e. storing the state of the execution so that it can be restored in case of an error). Levels that cover a wider range of failures are more expensive than those that cover simpler ones, so it is necessary to find the best balance between them. Our goal will be to implement differential checkpointing in FTI and explore precision-based differential checkpointing. Sounds confusing? In the following blog posts I will present the project in more detail; no prior knowledge is needed to understand it.
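To give a rough idea of what “differential” means here, the toy sketch below saves only the pieces of state that changed since the previous checkpoint. This is a conceptual Python illustration, not FTI’s actual API; all names (`diff_checkpoint`, the state blocks) are made up for this example.

```python
import hashlib
import pickle

def diff_checkpoint(state_blocks, last_hashes):
    """Conceptual differential checkpoint: return only the blocks that
    changed since the last checkpoint, plus the updated hashes.
    (Illustrative only; FTI works on registered memory regions.)"""
    saved, new_hashes = {}, {}
    for name, block in state_blocks.items():
        digest = hashlib.sha256(pickle.dumps(block)).hexdigest()
        new_hashes[name] = digest
        if last_hashes.get(name) != digest:
            saved[name] = block  # would be written to stable storage
    return saved, new_hashes

state = {"grid": [0.0] * 4, "step": 10}
saved1, h = diff_checkpoint(state, {})   # first checkpoint: everything is new
state["step"] = 11                       # only one variable changes
saved2, h = diff_checkpoint(state, h)    # second checkpoint: only 'step' saved
print(sorted(saved2))
```

The point is simply that unchanged data need not be rewritten, which shrinks checkpoint time and I/O; a real implementation does this at the level of memory blocks, and the precision-based variant additionally decides how many bits of each value are worth saving.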
Finally
I am really excited about this summer and so far it is going really well! Unfortunately, we weren’t able to travel to the sites we will be working with, but I will try to make the best out of this experience. Do you have any experiences like this? If yes, feel free to share your experience or anything you want to discuss in the comment section.
Finally, I hope you enjoyed my post and found it interesting. More posts are coming, so please follow me on my LinkedIn account if you want to be updated.
Hello, my name is Adrian and I am attending the Summer of High Performance Computing program doing the project Hybrid AI Enhanced Monte Carlo Methods for Matrix Computation on Advanced Architectures. It might seem like a lot of words, but essentially it boils down to replacing some part of different algorithms for solving very large linear equation systems with reinforcement learning. I will expand on this in upcoming posts, but for now all I can say is that it feels like a very exciting project and that I will definitely learn a lot!
As for my background, I just finished my first year as a master’s student in the Complex Adaptive Systems programme at Chalmers University of Technology in Gothenburg, Sweden, following three years in the Engineering Mathematics bachelor’s. This means that I have studied plenty of linear algebra, algorithms, and programming, making this project a perfect fit for me. Also, the master’s thesis and the resulting end of my studies are on the horizon, so SoHPC might be a hint of what sort of work I will be doing after that. When I am not doing that type of work, I like to go climbing, watch a nice movie, or have a coffee with a dear friend.
This is my first blog post submission for this year’s PRACE Summer of HPC where you get to know me, my application experience and also get a sneak peek of what I will be working on in the next seven weeks.
#define REGINA_MUMBI
I am a 23-year-old student, born and raised in Nairobi, Kenya. I am currently pursuing my bachelor’s degree in Computer Engineering at Ankara Yıldırım Beyazıt University in Turkey. My main motivation for applying to this year’s SoHPC was my newly found interest in computer system architecture, sparked by a class I took last semester.
How I got here
It all started when a friend sent me a Facebook post announcing that this year’s Summer of HPC applications were open. Before that, I had never heard of PRACE, nor did I know a lot about high performance computing, but I had heard of the Barcelona Supercomputing Center from Dan Brown’s “Origin”, one of my favorite books. This year, there were 33 amazing projects to choose from. I put in my application and waited…
My ordeal with the spam folder
One lazy afternoon, a week before the results were released, I decided to check my junk email out of boredom, and to my surprise there were five follow-up emails from one of the SoHPC coordinators. They were enquiries about my project choices and my reference letter submission. The deadlines to make changes to my application had already passed, and even as I tried to salvage the situation by responding to the emails, I knew for sure that my chances of getting selected were close to nil.
I was wrong.
On the 1st of May, I woke up to an email informing me that I had been selected for my first-choice project. It felt so surreal. I was ecstatic!
Project 2101
This summer, I will be working on the analysis of data management policies in HPC architectures together with my project partner Aneta Ivanicova. This MEEP project, situated at the Barcelona Supercomputing Center, involves assessing various data management policies that will allow faster data access and movement through the different levels of the memory hierarchy (more about it here). We will be guided by our mentor Borja Perez. This past week, I have had the opportunity to virtually attend the rigorous training at ICHEC (Irish Centre for High-End Computing), where we covered various topics such as memory parallelization, interfacing C code with Python, and performance analysis on HPC systems. I look forward to sharing more with you about what I learn from the project as time goes by.
And that’s about it from me for post #1. Make sure to stick around for more blog posts detailing my progress in the coming weeks. Also, don’t hesitate to comment below with any questions or remarks that you may have about participating in SoHPC or the application process.
Meanwhile, you might want to check your junk folder…
Is ‘catching’ the Sun really possible? Or just a fantasy?
Yes, you’ve read that right: creating the Sun here on Earth is not a sci-fi fantasy but a reality that has never been so close. Why such a thing? Recreating here on Earth the fusion reactions that happen in the core of the Sun could forever change our worldwide need for renewable sources of energy. Several projects involving the world’s greatest countries are in development, like ITER, and the world’s scientific community is thriving around all the different aspects involved in what could be one of the greatest achievements of humanity.
How can we study such complex theories and experiments? Well, we need very powerful tools, and here HPC (High Performance Computing) comes into play. Among its different projects and partnerships, PRACE (Partnership for Advanced Computing in Europe) makes a strong contribution, and the Summer of HPC programme is an example of this.
Summer of HPC is a PRACE programme in which late-stage undergraduate and master’s students participate in pairs on different projects related to PRACE technical or industrial work, supported and mentored by several PRACE hosting sites, spending two months working on those projects and producing a report and video of their results. One of these projects is ‘Computational atomic-scale modelling of materials for fusion reactors’, mentored and supported by the BSC (Barcelona Supercomputing Center) Fusion group, and it is the one I am taking part in. But who is talking? Let me introduce myself.
This is me, with the beautiful SoHPC 2021 T-shirt!
Hi, I’m Paolo Settembri, a 22-year-old from Italy. I’m a student at the University of L’Aquila, where I obtained a bachelor’s degree in Physics, and I’m currently attending the first year of the master’s degree course in ‘Condensed Matter Physics: Nanotechnologies and Fundamentals’. In my childhood I was very curious and had a strong passion for Astrophysics, reading Stephen Hawking’s children’s books, so when I grew up I chose a ‘Liceo Scientifico’ for high school, which is a type of high school in Italy focused on Maths, Physics and Science in general. After graduation I joined the bachelor’s degree course in Physics, and my passion switched from Astrophysics to Solid State Physics. In my bachelor’s thesis I simulated a collision of a dark matter particle in a Sodium Iodide crystal at low temperatures.
It was the professor who mentored me during my thesis who notified me of the possibility of joining SoHPC 2021. Searching through the projects, I found the one perfect for me, ‘Computational atomic-scale modelling of materials for fusion reactors’, and I decided to apply. I was really sad and disappointed when no confirmation e-mail arrived, even with only a few days left before the final deadline; but then, maybe as I should have done earlier, I checked the Spam inbox, and there it was: my confirmation letter, sent to me days before. I could not believe it. I almost didn’t make it into the program because I didn’t check the spam inbox, and in a few days I would have been discarded; luckily this did not happen! I was a little scared of not having enough computational science knowledge to take part in the program, but the training week really helped me, complementing my basic knowledge with high-level information; and now I look forward to the next weeks, in which I will start working on my project.
Schematics of the plasma inside the reactor (left), polycrystalline tungsten metal structure (right), a material of interest, and MareNostrum-4 supercomputer (background).
In the project, another candidate and I, guided by the BSC Fusion group, will study materials used in fusion reactors, using LAMMPS molecular dynamics simulations to investigate some of their properties. These materials can be used as protective layers in fusion reactors, but to do that they have to withstand really harsh conditions, with temperatures of 10⁸ °C inside the reactor. Computer simulations are fundamental in cases like this, where experimental data are unavailable or difficult to obtain, and the use of HPC technology will be key to obtaining larger and longer simulations.
I can’t wait to start working on this project, which could make a small contribution to what will possibly be one of the greatest achievements in human history, and I’m really thankful to PRACE and the SoHPC organizers for this opportunity!
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe to late-stage undergraduate and master’s students. Up to 66 top applicants from across Europe will be selected to participate in pairs on 33 projects supported and mentored online from 14 PRACE hosting sites. Participants will spend two months working on projects related to PRACE technical or industrial work and produce a report and video of their results.
Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. PRACE will be financially supporting the selected participants during the programme, which will run from 1 July to 31 August 2021, with the amount of €1300 per student for the summer.
Late-stage undergraduate and master’s students are invited to apply for the PRACE Summer of HPC 2021 programme, to be held in July & August 2021. Consisting of a training week and two months of online participation with top HPC centres around Europe, the programme offers participants the opportunity to share their experience and learn more about PRACE and HPC.
Due to the Covid-19 pandemic, the programme will exceptionally run fully online during the summer of 2021, for as long as Europe is under strong mobility limitations. Two prizes will be awarded to the participants who produce the best project and best embody the outreach spirit of the programme.
Applications are open until 12 April 2021. Applications are welcome from all disciplines. Previous experience in HPC is not required as training will be provided. Some coding knowledge is a prerequisite, but the most important attribute is a desire to learn, and share experiences with HPC. A visual flair and an interest in blogging, video blogging or social media are desirable.
The programme will run from 1 July to 30 August 2021. It will begin with a kick-off online training week organised by the Irish Centre for High End Computing (ICHEC), to be attended by all participants.
Applications are open from 15th of January 2021 to 12th of April 2021. See the Timeline for more details.
PRACE Summer of HPC programme is announcing projects for 2021 for preview and comments by students. Please send questions to coordinators directly by the 11th of January. Clarifications will be posted near the projects in question or in FAQ.
About the Summer of HPC programme:
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 66 top applicants from across Europe will be selected to participate in pairs on 33 projects supported and mentored online from 14 PRACE hosting sites. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. PRACE is looking to financially support the selected participants during the programme, which will run from 1 July to 31 August 2021.
For more information, check out our About page and the FAQ!
Ready to apply? Click here! (Note: not available until January 10th, 2021)
Most people have heard of the exponential function that maps an arbitrary real (or even complex) number $x$ to $e^x$, but what happens if $x$ is not a number but a matrix? Does the expression $e^A$ with a square matrix $A$ even make sense?
The answer is: Yes, it does!
In order to understand what the expression means, we take a step back to the exponential function for scalars. When we have a look at the power series of the exponential function,

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}, \quad x \in \mathbb{C},$$

we can see that only multiplications, additions and divisions by a scalar are involved. These operations can be generalized to matrices easily. Hence, we can define the exponential of a matrix $A$ as

$$e^A = \sum_{k=0}^{\infty} \frac{A^k}{k!}.$$
The next question is: How can we compute the matrix exponential for a general complex matrix?
There exist several different algorithms, our project focuses on two of them: Taylor series and diagonalization.
The most intuitive one is to use the representation above and replace the infinite sum by a finite one, obtaining a truncated Taylor series. The number of summands that one has to compute depends on the accuracy that is needed – although it is only an approximation, it serves its purpose in many applications.
The second approach to compute the exponential of a matrix in our project is diagonalization. At first the matrix A is decomposed into a product of three matrices V, D and V⁻¹,

A = V D V⁻¹,

where the columns of V contain the eigenvectors of A, D is a diagonal matrix with the corresponding eigenvalues stored on the diagonal, and V⁻¹ is the inverse of V. With this decomposition, the computation of the matrix exponential is very easy because the following equality holds:

e^A = V e^D V⁻¹.

The only expression that has not been calculated yet is e^D, and this matrix is again a diagonal matrix holding the exponentials of the diagonal entries of D. If we multiply the matrices V, e^D and V⁻¹, we obtain the matrix exponential of A.
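Both approaches are easy to sketch and cross-check in a few lines of NumPy (a toy illustration, not production code; the matrix A below is just an example):

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Truncated Taylor series: e^A ~ sum_{k=0}^{terms-1} A^k / k!."""
    result = np.zeros_like(A, dtype=float)
    term = np.eye(A.shape[0])          # A^0 / 0! = I
    for k in range(terms):
        result += term
        term = term @ A / (k + 1)      # next summand: A^(k+1) / (k+1)!
    return result

def expm_diag(A):
    """Diagonalization: A = V D V^-1  =>  e^A = V e^D V^-1."""
    eigvals, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(eigvals)) @ np.linalg.inv(V)).real

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(np.allclose(expm_taylor(A), expm_diag(A)))   # True: both methods agree
```

For this particular A the exact answer is known (e^A has cosh(1) on the diagonal and sinh(1) off it), which makes it a handy sanity check for both implementations.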
Hi everyone! Welcome to my fourth and last blog post about my work on SoHPC 2020. Today, I will explain one last optimisation and then I will share the project’s video presentation with you. So let’s get started!
Optimised CUDA C version
In my last blog post, I presented a CUDA C program that launches a single cooperative kernel (function executed on the GPU) for all iterations to avoid the overhead of launching multiple kernels on the GPU. To achieve that, I needed to use the CUDA runtime launch API which provides synchronisation through every single GPU thread.
#include <cooperative_groups.h>
//launching a cooperative kernel
cudaLaunchCooperativeKernel(kernel, blocks, threads_per_block, args);
However, I found out that the API applies some limitations to the number of GPU blocks and consequently to the number of threads. That means it cannot launch one thread for each element of the matrix (at least for large matrices), which would be the ideal situation. So, when the available threads are fewer than the elements of the array, the only solution is to assign multiple elements to each thread. But, of course, this increases the work that each thread has to do, and for large scale factors this causes a performance drop.
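In other words, when threads are scarce, each thread walks over the array with a stride equal to the total thread count (a grid-stride loop). Here is a plain-Python sketch of just the index mapping (the thread and element counts are made up for illustration; the real code is CUDA C):

```python
# Each of the n_threads "threads" processes elements t, t + n_threads, t + 2*n_threads, ...
def grid_stride_indices(t, n_threads, n_elements):
    return list(range(t, n_elements, n_threads))

n_threads, n_elements = 4, 10
work = [grid_stride_indices(t, n_threads, n_elements) for t in range(n_threads)]
print(work)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every element is covered exactly once, but some threads do three elements' worth of work instead of one, which is exactly the extra per-thread load mentioned above.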
To further investigate this, I developed another CUDA C version, which does not use the above API and launches multiple smaller kernels per iteration.
//standard way to launch kernels on CUDA C
kernel<<<blocks, threads_per_block>>>(args);
After that, I used a profiler to see where this GPU program spends its time and noticed that the time for launching kernels is only a small portion of the total time.
For a large problem (scale factor = 192), the program spends merely 0.18% of the time to launch the kernels.
Eventually, it turned out that the overhead of launching multiple kernels instead of one is minor. Additionally, in this new CUDA C code there is no limit on the number of GPU threads, meaning that we can launch one thread for each element, which explains why we get the best performance.
Final performance graph, updated with the new version, C CUDA – separate kernels
Video and Conclusion
My partner, Alex, and I have prepared a video presentation in which we describe our progress over the entire summer. Let us know in the comments what you think of it!
Video presentation of the project
PRACE Summer of HPC 2020 was full of beautiful moments, but I think one of the most memorable ones is when our mentor told us in the end that our performance results went beyond expectation and that he was really satisfied with our work.
All in all, I am glad I participated in a programme that offered me creativity, knowledge and excitement and I am sure this will be an unforgettable experience for the years to come.
The last week of PRACE’s Summer of HPC is just about wrapping up: my colleague and I have just submitted our video presentation and are now working on cleaning up the code and writing up the final report.
Although all the models have been finalized and tested, this week’s work is just as important as all the others. In these final days we are structuring the code and writing documentation so that the work we have done isn’t lost over time, but can be taken over, integrated and used, as intended, by our supervisors at the Hartree Centre.
That sense of peace you get, when you are walking around late at night, the roads are deserted, you have published your video and you are finalizing your work – that sense you get when everything is coming together.
We are also working on the final report, which has to be written in a popular-science style. I have to say that I like this aspect of the programme: both the presentation and the report are to be written in a way that is accessible to a wider audience. Of course, we have to include some technical terms to explain everything we did in our models, some of which can be complex to the untrained eye, but we do our best to explain these terms and why we chose the methods that we did.
This has been the style that I have also been using in my previous posts, which you can find here. I can’t say if I would have found SoHPC’s program last year or not, without this initiative, or if I would have been immediately attracted or not, but what I can say now, is that I hope some other student stumbles upon these blogs, and our videos, and thinks: wow this is cool.
If you are reading this and debating whether or not to apply, or if you have already been accepted and are debating whether or not to go, I can say it’s worth it. Sure, it will be a lot of work, and over your summer break, but the experience gained and knowledge learned are without question worth it.
Well, thanks for following along my journey, it is goodbye for now …
In the last blog, we discussed what a CNN is, what “flavor” of CNN we were using, and how we solved the cost-function issue by adapting the cosine similarity to work on our problem. Now it’s time to show how all of this worked out in the training process and what outcome we achieved.
The training process
To train the neural network we needed to structure the data in a way that is easy for the network to go through and understand. For that, we created a dataset with the normalized labels of the galaxies and the corresponding images. Once that was done, we needed to initialize the network with values for all its parameters. When the network has weights and biases, we can feed data through it, see the results (at first they are all going to be random guesses), and compute our fancy cost function to see how far we are from the ground truth. Knowing the error we made, we can update the network (change the weights, biases and kernels) with our optimizer: we chose stochastic gradient descent (SGD), with backpropagation providing the gradients.
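Schematically, the loop described above can be sketched in a few lines of NumPy (a toy linear model trained with plain gradient descent; the data, model and learning rate are all invented for illustration, and our real code is of course a full CNN):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # stand-in for the "images": 3 features each
y = X @ np.array([1.0, -2.0, 0.5])       # stand-in for the normalized labels

w = np.zeros(3)                          # initialize the model's parameters
lr = 0.1                                 # learning rate

for epoch in range(200):
    pred = X @ w                         # feed data through the model
    grad = 2 * X.T @ (pred - y) / len(y) # gradient of the mean squared error cost
    w -= lr * grad                       # gradient-descent update of the weights

print(np.round(w, 3))                    # recovers roughly [1, -2, 0.5]
```

The real training differs in scale and in the loss (our adapted cosine similarity instead of MSE), but the cycle of forward pass, cost, gradient, update is exactly this.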
Improving the dataset quality
As in any scientific process, we didn’t achieve perfect results on the first try. The first training runs were quite a mess. We started using ImageNet pre-trained weights, but then we realized that our images were too similar to each other and the pre-trained network was always predicting the same output for all of them – it was too stable to learn properly. We then changed to a randomly initialized network and things went better (though not by much). After that, we focused on one of the most important things in machine learning: the data. We restricted our dataset to just the galaxies more massive than 10^11 solar masses to have enough quality in our images, and that made a significant difference in our results.
Figure 1: Comparison between one galaxy of around 10^10 solar masses (left) and one of 10^12 solar masses (right). The different quality of the images is visible.
When 24 CPUs are not enough
As we advanced in the project, the training runs took longer and longer each time on the CPU cluster. As this is a highly parallelizable task, based on exactly the kind of operations GPUs excel at, we moved the whole training process to a GPU cluster with 4 NVIDIA V100s. We went from 5-hour runs to 27 minutes… Not bad. But this did not come without a cost: we needed to adapt our code to work on multiple GPUs and to be extremely careful with the way we pass our data and update the weights to avoid possible conflicts.
The results
At this point, you may be thinking… Okay, but what did you get??!! Okay, here come the results. The metric we used is the adapted cosine similarity, where -1 is a perfect score and 0 is totally off, and this is what we found:
Figure 2: light blue and red are the validation and training curves of the ResNet18 network; dark blue and orange are the validation and training curves of the AlexNet network.
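For reference, the adapted metric can be sketched like this (a toy version; the real loss lives inside the training code, and the absolute value reflects that a 180° flip of the axis also counts as perfect):

```python
import numpy as np

def axis_cosine_score(pred, label):
    """-1 = axes perfectly aligned (either direction), 0 = totally off."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    cos = pred @ label / (np.linalg.norm(pred) * np.linalg.norm(label))
    return -abs(cos)   # flip sign so "perfect" is -1; a 180° offset also scores -1

print(axis_cosine_score([0, 0, 1], [0, 0, 1]))    # -1.0
print(axis_cosine_score([0, 0, 1], [0, 0, -1]))   # -1.0 (flipped axis, still perfect)
print(axis_cosine_score([1, 0, 0], [0, 0, 1]))    # -0.0 (orthogonal: totally off)
```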
The images obtained with labels and predictions for the validation dataset are:
Figure 3: The green arrow represents the label and the red arrow the prediction. Notice that the first image is also a perfect score: since we are looking for the axis of rotation, a 180º offset from the angular momentum vector is still perfectly aligned with that axis.
From here, armed with everything I learned in SoHPC, I will be looking for ways to improve these results and take them to another level… This was the end of the road, but now I’m in the unexplored, the wild territory where real exploration happens. And thanks to SoHPC, I’m ready to face it.
It’s been just over a month of working on my project and I’ve just about got my head around the concepts of GPU programming using CUDA. In this post, I will try to explain some concepts of my work using everyone’s favorite analogy – FOOD!!! Let’s get into it then.
CPU vs GPU: What’s the difference?
Imagine that you are the head chef in a 5-star Italian restaurant, which serves the best lasagna on the planet. Of course, the lasagna is going to be the most ordered dish in your restaurant. And due to the soaring popularity of your restaurant, you begin to receive a crazy amount of orders (say around 1000 lasagnas) every hour. You could sweat it out and make each lasagna, one by one, baking each of them to perfection. However, at the end of one week, you’re absolutely stressed out and not able to handle this volume of orders every day. Now, this is where your manager comes in and helps you out, telling you that you can recruit 100 junior chefs to help you get those orders out faster. However, the catch is that the junior chefs are not ‘that’ well trained to make decisions on their own. You can, however, give them instructions and they will follow them to a T (sometimes even faster than yourself!).
This is the difference between a CPU and GPU. You are similar to a CPU. You can operate independently using your logic and get work done, but only one order at a time. The group of 100 junior chefs you’ve recruited, that’s how a GPU works. There’s not much independent logic and everyone in the group does the same work that they’ve been instructed to do. But given the right instruction and facilities, you can get a whole lot more work done using the group. Now, let’s see how the actual work can be distributed among our new recruits.
Making Lasagna vs Moving Plasma Particles
A VERY ROUGH(!) ANALOGY OF THE PARTICLE IN CELL (PIC) CODE I’M WORKING WITH
The entire Lasagna recipe can be roughly broken down into three important stages as can be seen above. Let’s try to distribute the work in the three stages.
Pre-Processing
Let’s assume that this stage involves cutting and sautéing the vegetables and meat, and preparing the sauce. So before you can give instructions to your army of chefs, you need to assemble all the chefs together to explain exactly what needs to be done. You also need to distribute the ingredients equally among each of the 100 chefs. Once the junior chefs have the ingredients and the instructions, they can each go to their individual workstations and get to work!!!
Is this the best way to go ahead? We need to see whether the time taken for you to work alone is less or more than the time taken to assemble the chefs, distribute the ingredients and then for the chefs to finish. We can safely assume that distributing the cooking among 100 chefs would easily take less time than one person working on 100 dishes, despite the extra time required for assembling the chefs and distributing the ingredients. So yes, you can breathe easy now. But wait, your manager tells you that the number of orders is doubling every day!!!! Now, the amount of ingredients you need to distribute becomes humongous and this alone takes a lot more time than before.
One of the reasons you have been made head chef is that you can come up with solutions to such problems, and EUREKA!!! You have come up with a very simple solution to it all!!! Why not just give each of the chefs the list of ingredients and ask them to get the ingredients on their own. Then, you remove the distribution step completely and they can start cooking immediately. YOU GENIUS!!!!!!
How does this compare with the particle-in-cell (PIC) code I’m working on? When we need to call the GPU to do some work, we first need to call the special code (instructions) we’ve written for the GPU and then transfer all the data for each ‘worker’ in the GPU to work with (distribute the ingredients) from the CPU. These codes run with a huge number of particles (>1000000000 particles!!!!) and you can imagine that transferring this data takes a lot of time, relatively speaking. To solve this problem, we simply create the particles on the GPU itself (similar to how each junior chef gets their own ingredients).
Processing
Now the lasagna needs to be baked. However, the problem is that the restaurant is still using the one oven which you have in your workstation. So each of the half-finished lasagnas need to be brought to you first. But luckily the oven you have is absolutely high-end and efficient and you are still able to manage the large amount of orders!!! PHEW!!! But afraid that the number of orders might increase further, you approach the manager asking if we can afford individual ovens for each of the junior chefs’ workstations. She says that we could do it, but the ovens wouldn’t be as good as the current one. You’re now stuck in a dilemma! Is it worth investing in 100 individual slower ovens or do we continue with the extremely fast oven we still use? You also understand that it depends on the number of orders you receive. If that keeps increasing, it may be a good idea to invest in this. If the orders stay the same or decrease, you could still continue working as before without further investment.
Again, let’s compare this with our PIC code. The code for this step currently uses a very serial (it can be done only in a certain sequence) but extremely efficient and fast algorithm to solve for the fields. However, you can try to change the algorithm completely, in such a way that each ‘worker’ in the GPU can work in parallel. It might not be the most efficient algorithm, but it could be faster than the initial version if the number of particles is large enough. On the other hand, you’ll need to give extra instructions to the GPU, which also takes some more time.
Post-Processing
Now that the lasagna is ready, you need to make it a 5-star dish. And since you’re the head chef, only you can add the finishing touches and check if everything is right with each lasagna. Hence, if each of your junior chefs has his/her own dish, you’ll need all of them to assemble and go through each of them one by one. Unfortunately, this can only be done by you and hence, this step cannot be avoided.
Some of the cool visualisations from the OOPD1 PIC code!!!!
Similarly, in a PIC code, all the calculated results in the GPU need to be transferred to the CPU so that the required data can be extracted and presented to the user in a cool, visual manner. Unfortunately, this can be done only on the CPU for now.
Final Word
Throughout the last month, my teammates Victor and Paddy and I have tried out all these different methods of work distribution between the CPU and GPU and checked which ones give us the best results. (If you have more novel ideas for similar work distribution, please mention them in the comments below!!!) We’ve managed to get some exciting results which we will present at the end of this month. Stay tuned for this!!!! All this talk about food has got me craving some authentic Italian food. I’m off to satisfy my hunger along with a cold beer to beat the summer. Till then, Ciao Adios!!!!
In machine learning we often talk about accuracy as a metric for how good our model is. Yet a high accuracy isn’t always a good way to measure model performance, nor does it always correlate with an accurate model. In fact, an accuracy rate that is too high is something to be skeptical about. In machine learning this is something that comes up time after time: you build a model, you get a great accuracy score, you jump with joy, you test it further, you realize that something was wrong all along… you fix it and get a far more reasonable answer…
Side note for the inexperienced reader: if any model or any person tells you that their model predicts the target variable with 100% accuracy… don’t believe them! How big is their sample size? Are they training and testing their model on the same dataset, such that the algorithm simply learns the answer? Have they tested their algorithm on all data available for the topic throughout human history? In short, what’s the catch?
In this post I will be talking about how the accuracy metric wasn’t useful in the models I have been building to predict SLURM job run times. In fact, the accuracy metric could even be counterproductive in some scenarios. If you remember from the previous post, estimating predictions of actual job run times could greatly improve the efficiency of Hartree’s cluster scheduling algorithm.
The first model I built was a regression model: given information from the user, like how many CPUs and nodes they request from the cluster, and what they expect the run time to be, the model tries to predict the actual run time of each job while minimizing the prediction error. Now, in this case the classic accuracy score isn’t applicable; for regression predictions, metrics like the mean squared error (MSE) are used to estimate the goodness of fit achieved by the model. In the figure below, on the left-hand side, we can see that the red line is the line of best fit for the scattered points. This line minimizes the distance between estimated and actual points. In our case, the line of best fit becomes a bit more complicated. The algorithm my colleague and I are building is obliged to predict run times such that no run time is under-predicted. That is because at Hartree, as in many clusters, if the actual run time of a job goes above the time limit provided by the user, the job is killed. So the first and foremost requirement for our model is that it cannot predict run times lower than the actual time, otherwise the job would be killed. On the right-hand side of the figure we can see what this would look like: the line of best fit is shifted upwards such that no points fall above it.
A typical line of best fit on the left-hand side; on the right-hand side the line of best fit is modified such that it lies above all points.
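One simple way to obtain such a line is to fit ordinary least squares first and then shift the intercept up by the largest residual, so every point ends up on or below the line. A toy NumPy sketch (the data is invented; our actual model is more involved than this):

```python
import numpy as np

rng = np.random.default_rng(1)
requested = rng.uniform(1, 10, size=50)             # e.g. user-requested run time
actual = 2 * requested + rng.normal(0, 1, size=50)  # actual run time (toy data)

# Ordinary least-squares fit: actual ~ a * requested + b
a, b = np.polyfit(requested, actual, 1)

# Shift the intercept up by the largest residual (plus a tiny margin
# for floating point) so no point lies above the line
residuals = actual - (a * requested + b)
b_safe = b + residuals.max() + 1e-9

predictions = a * requested + b_safe
print((predictions >= actual).all())   # True: no job would be under-predicted
```

The price of this guarantee is that most predictions are now biased upwards, which is exactly the trade-off between safety and tightness discussed here.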
Now, this figure represents a simplistic example, but it is a good way to visualize the problem. In real-world problems the data is much messier; there are outliers, in both directions, that would skew the line of best fit. For our dataset we have an additional complication: even users sometimes make errors. Users will occasionally submit jobs with run-time limits that are too low, causing the scheduler to kill their jobs. These jobs are highly unpredictable and will almost never be accurately estimated by the algorithm. Due to this limitation, we decided to construct another model, this time to tackle a classification problem. The idea is that if we were able to classify jobs into two categories, predictable and unpredictable, we could simply put aside the unpredictable jobs and run our regression model on the rest, without risking under-predictions and having jobs killed off.
As such, we ran our regression model and defined accuracy by whether the run time was predicted correctly above the actual elapsed time (but below the user’s time limit), or incorrectly below it. With these two classifications we constructed and tested other models, like a Random Forest classifier and a Naïve Bayes classifier, to predict the “predictability” of a job. With the accuracy defined before, roughly 92% of jobs were classified as predictable, and 8% as unpredictable.
In the first run of a classifier algorithm the code spat out 92% accuracy!!… Well, that was easy? In fact, too easy. Since algorithms are trained to achieve the highest possible accuracy score, Naïve Bayes simply predicted every job as predictable… oh, so we got 92% accuracy, but in reality it was useless? Yes: the whole point was to distinguish the jobs that are unpredictable, and even with such an “accurate” algorithm, we were nowhere nearer to achieving that task.
This is one of those scenarios where accuracy scores are misleading. In this case we want to achieve 100% accuracy in predicting “unpredictable” jobs… but remember, what’s the catch? To do this we have to sacrifice accuracy in predicting the jobs which actually are predictable, and misclassify some of them into the other category. The image below helps to clarify. In our case, think of jobs being predictable as the positive class, while jobs that are not predictable are part of the negative class. Traditional accuracy only looks at the diagonal values of jobs correctly classified as positive or negative. Other metrics, like recall and precision, take into account the other diagonal of false predictions. In our case we want precision to be 100%, meaning that we do not tolerate any false positives (no unpredictable job is ever labeled as predictable). To achieve this, though, we will generate many false negative cases.
A confusion matrix displays the true and false categorization of classes. In this case the classes are only two, positive and negative.
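For the curious, these quantities are easy to compute by hand. A tiny Python sketch with made-up labels (1 = predictable, 0 = unpredictable; the numbers are invented, not our real job data):

```python
# y_true: actual class, y_pred: model's class (1 = predictable, 0 = unpredictable)
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 1, 1]   # one unpredictable job slips through

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of jobs predicted "predictable", how many really are
recall = tp / (tp + fn)      # of truly predictable jobs, how many we caught

print(accuracy, precision, recall)   # 0.9 0.875 1.0
```

Here accuracy looks great at 90%, yet the single false positive drags precision below 100%, and in our setting that one job would be killed. That is exactly why we are willing to trade false negatives for perfect precision.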
This is it for now, folks; we have cornered the best tactic to achieve no under-predictions, and what remains is to optimize the models as best we can and test them on the cluster. We are slightly past the mid-programme mark, but I can see the deadlines for the final week approaching. Working on this project has been an amazing experience so far and I have been learning so much thanks to SoHPC, but it’s not time to relax yet…
As announced in the last post, here you can understand the connection between sorting the ducks and the GPU performance.
Rotation
When using the Fast Multipole Method, the goal is to obtain a certain physical quantity, e.g. a force. It needs to be obtained at a particular point in a particular box (the target box) of the considered system. To do so, all the little boxes in the system need to be rotated and shifted to the target box. Here, only the rotation is considered. The implementation can be described with the following steps:
Build a pyramid that contains the rotation coefficients, which are based on the fancy math.
Build a triangle containing the multipole to be rotated.
Build another triangle to store the rotated multipole.
Multiply the pyramid entries with the corresponding entries of the multipole that needs to be rotated, as shown in the image below.
Data structures used for the rotation. To obtain the box in red of the rotated multipole (on the left), red boxes in the pyramid and the input triangle on the right need to be multiplied.
After spending some time playing with these and similar blocks, and sorting out in their minds how it all works, students have shown unusual behavior patterns while carrying out basic everyday routines. An instance of this behavior occurred after a long day of coding, during a student’s attempt (let’s call the person Student X) to brush his teeth before sleeping. Namely, all the lights were off and Student X spent three full minutes switching the bathroom light on and off, contemplating the possibility of the light bulb being burned out already, only to realize that the bathroom door was closed. A basic behavioral analysis was enough to diagnose that these building blocks remain trapped inside the student’s mind even after working hours. At a late hour of the night, these blocks transform into little rubber ducks and other shapes, and start rotating around and multiplying, making the student act silly. Student X happens to be myself.
Therefore, the goal is to sort the ducks out to improve the GPU performance. Whether or not it is accomplished, will be visible in the next blog post where you can expect some performance graphs!
Let me know in the comment section below if a certain part should be further clarified and thank you for reading!
Is it possible to account for all the particles in the system without actually iterating through each of them every time? Read this post to get an idea about my summer project and the groundwork for the answer to the title question!
Problem/Motivation
Particle interactions are a common subject of simulations in fields like molecular dynamics and astrophysics. Imagine computing the forces on each planet in our solar system caused by all the other planets. For each of the planets, one would need to add up the contributions of all the others to the total force, one at a time. Assuming all the data is available, the task is computationally trivial. However, in typical simulations, where particles are counted in trillions, computing anything with this approach would take a lifetime even on the most sophisticated computing architectures.
Solution
A different approach is used in one of the most important algorithms of the 20th century – the Fast Multipole Method (FMM). It allows faster computation of particle interactions by grouping them together. Each group is represented by a single expression called the multipole expansion. Multipole expansion is an approximation of the impact the considered group has on its environment. When computing a force at a certain point, instead of accounting for each particle in the group, only this single expression is considered. That is how the complexity is reduced!
3D visualization of a multipole in front of a Coulomb potential surface from a real-world simulation.
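To see what FMM is saving us from, here is the naive alternative it replaces: summing every pairwise interaction directly, which costs O(N²) in the number of particles. A toy Python sketch (gravitational-style forces; the positions, masses and units are invented for illustration):

```python
import numpy as np

def direct_forces(pos, mass, G=1.0):
    """Naive O(N^2) pairwise forces: every particle against every other."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = pos[j] - pos[i]                      # vector from i to j
            dist = np.linalg.norm(r)
            forces[i] += G * mass[i] * mass[j] * r / dist**3
    return forces

pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mass = np.array([1.0, 1.0, 1.0])
F = direct_forces(pos, mass)
print(np.allclose(F.sum(axis=0), 0))   # True: internal forces cancel in total
```

With three particles the double loop is harmless; with trillions, those nested loops are exactly what makes the multipole grouping indispensable.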
How to Implement Such a Thing?
FMM includes a series of steps, but for brevity, I will focus only on one for now. In real-world simulations, domains of interest are the three dimensional ones. Imagine a cubic box which I will only refer to as a box. This box is divided into a lot of small boxes where each is represented by a single multipole expansion. Consider a point in which we want to compute the potential. Consider the box in which the point is located. It is required to compute the impact of all other boxes onto our box. The impact computation is done by a so called Multipole-to-Local (M2L) operator which is exactly the segment to be improved in my summer project.
Recipe for Speed
The current implementation, developed at FZ Jülich, rotates the multipole twice, followed by a translation into the box of interest. It can be shown that, instead of the two rotations, we can rotate in a different manner six times to reduce the complexity. Each of these six rotations is individually cheaper than the two currently implemented, and when everything is put together, I expect to see some speed gained.
Multipole expansion representation up to the third order. Each block holds a complex number.
Ducks?
It requires some effort to grasp the maths behind the creation, rotation and shifting of multipoles, which makes it too broad for the scope of this blog. However, implementation-wise, it comes down to building data structures to represent these concepts, and that is exactly where the fun starts! Above you can see how the fancy multipole from the first picture can be represented in the computer. The coefficients of the multipole expansion are stored in the blue boxes: the first box on the left holds the monopole coefficient, the two boxes stacked on top of each other hold the dipole coefficients, etc.
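As a rough illustration, that triangle can be sketched in a few lines of Python (purely a toy; the real code stores these quite differently): each successive order of the expansion holds one more complex coefficient than the last, so the rows form a triangle.

```python
# Triangle of multipole coefficients up to order p:
# row l holds the l + 1 complex coefficients of order l
def empty_multipole(p):
    return [[0j] * (l + 1) for l in range(p + 1)]

mp = empty_multipole(3)          # up to the third order, as in the picture
print([len(row) for row in mp])  # [1, 2, 3, 4]: monopole, dipole, ...
```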
To win exclusive bragging rights, write your idea on how the ducks are linked to the story in the comments section below! Otherwise, you can find the answer in the second part of this post. For the outcome of the new way of rotation, stay tuned for further posts!
Hey everyone! I’m Cathal and I’m super excited to partake in the Summer of High Performance Computing (SoHPC) 2020. I have recently graduated from NUI Galway in Ireland, completing a BSc in Computer Science & Information Technology. In my spare time I like to exercise by running, climbing a mountain or anything that works up a sweat. Additionally, I have been playing the banjo since the innocent age of 8. Irish traditional music is one of the things that make Ireland very special! Here’s a video to give you a taste of the music.
Myself and friends playing a few tunes in McGann’s pub situated in Doolin, Co. Clare.
Training Week
I intend to pursue an MSc in HPC with Data Science at the University of Edinburgh this coming September. I guess applying to SoHPC was intended as a good introduction and learning curve for what will come in the masters. The remote training week was definitely the educational experience I needed, introducing me to HPC technologies such as supercomputing, parallelization, OpenMP, MPI and my favourite topic, CUDA programming for GPUs. Hats off to all the organizers and mentors who made the training week happen so flawlessly.
This blog would never be complete without a ‘hello world’ example!
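In that spirit, here is a tiny parallel “hello world” sketch in Python (a stand-in for the MPI/OpenMP versions from the training week; the thread “ranks” below are simulated, not real MPI ranks):

```python
from concurrent.futures import ThreadPoolExecutor

def hello(rank):
    return f"Hello world from rank {rank}!"

# Four "ranks" saying hello in parallel, MPI-style
with ThreadPoolExecutor(max_workers=4) as pool:
    greetings = list(pool.map(hello, range(4)))

for g in greetings:
    print(g)
```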
My SoHPC Project
I will be partaking in project #2018, called ‘Time Series Monitoring for HPC Job Queues’, hosted by SURFsara in Amsterdam. The goal of the project is to create a monitoring system which captures real-time information about the jobs running on HPC clusters, while integrating the work with DevOps and Continuous Integration practices and tools. Job queue information will be collected, processed and stored as time series, then summarized and made available to users and administrators at SURFsara in the form of graphs. I hope my next post will be able to describe the tools, technologies and inner workings of the project more in depth.
Thank you for reading, Cathal.
The less technical explanation of my project… meme format.
I really would have liked to write this post while I was enjoying the cool morning mist in Edinburgh. But here I am, at my home in Izmir, Turkey; suffering from a heat stroke.
I am a computer engineering student here at my hometown’s university, IZTECH. I’m trying my hand at every subject of CS that I can: I played with web development and machine learning, and then I learnt about the opportunity to actually spend my summer working in a whole new area, in a whole new country. Needless to say, I was overjoyed. A work opportunity that actually requires you to travel, to learn a new technology, to go beyond your boundaries in every way possible.
Then a pandemic occurred and my expectations were flipped on their head. Yeah, great. I thought I was going to miss the most exciting summer I was ever going to have. And I would have, if it weren’t for SoHPC going online. Surely, it can’t compare to changing continents… But it really is the most amazing opportunity. I have access to a cluster miles away from here, and I can actually make changes on it. It really blows my mind that all I need is my computer and a connection.
So, how hard has it been? I have the greatest mentors and colleagues over at EPCC, and I’m working with Fulhame (project 2006, if you would like to check it out). But it’s really hard to feel like you’re doing something worthwhile for your project when you can’t see the results or interact with the people involved. I’m testing a new set of instructions and… they work. But it doesn’t feel solid to me.
Enough of the morbid whining though. I really am grateful for such an experience, creating beautiful memories and unforgettable bonds with new people. I will soon be writing a whole new post about Fulhame (and clusters in general), and maybe the differences between Intel-based and ARM-based HPC instruction sets.
Since my last post, and over the last 2-3 weeks, I have been mainly familiarising myself with the GALILEO Cluster at CINECA, running basic models and simulations, and getting my hands dirty with sample models from the PLUTO Code for Astrophysical Modelling.
The PLUTO Code is a fluid dynamics code capable of reproducing magnetohydrodynamic (MHD) and supersonic flows, such as the plasma which exists in stars and Supernova Remnants (SNr).
Once I was comfortable running sample simulations on GALILEO, I began modelling my own simulations of Supernova Explosions (SNe), and their resulting SNr. Firstly, I began running simple, 3D, spherically symmetric explosions, and monitoring the evolution of the blast over the course of ~1900 years.
See the photos below, which show the evolution of the blastwave, with clear forward and reverse shockwaves present. Forward and reverse shockwaves are created when the supernova blastwave interacts with the surrounding Interstellar Medium (ISM). The forward shock continues to expand into the ISM, while the reverse shock travels back into the freely expanding SNr.
Snapshots of a spherically symmetric expanding Supernova Explosion simulation. The above photo shows the density profile of the blast after 100, 450, 1000, and 1900 years.
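As a rough order-of-magnitude check on these timescales (this is not part of the actual PLUTO setup, and a young remnant may still be partly in free expansion), the classic Sedov–Taylor solution for an adiabatic blast wave predicts R(t) ∝ (E t² / ρ)^(1/5). The energy and density values below are canonical illustrative numbers, not those of the real simulations:

```python
import numpy as np

# Sedov-Taylor scaling for an adiabatic blast wave: R(t) = xi * (E t^2 / rho)^(1/5).
E = 1e51              # explosion energy [erg], a canonical supernova value
n = 1.0               # ambient number density [cm^-3]
rho = n * 1.67e-24    # ambient mass density [g cm^-3]
xi = 1.15             # dimensionless constant for an adiabatic index of 5/3

year = 3.156e7        # seconds per year
pc = 3.086e18         # centimetres per parsec

# The same snapshot ages as in the figure above
for t_yr in (100, 450, 1000, 1900):
    t = t_yr * year
    R = xi * (E * t**2 / rho) ** 0.2
    print(f"t = {t_yr:5d} yr  ->  R ~ {R / pc:.1f} pc")
```

For these numbers the remnant reaches a radius of a few parsecs after ~1900 years, which is the right ballpark for young SNr such as the ones simulated here.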
Once I was happy with the shape and evolution of the simple SNe described above, it was time to introduce some realistic physical conditions because, in reality, no supernova looks that perfect. One of my supervisor’s suggestions was to introduce a torus around the expanding blastwave (imagine the star was surrounded by a ring, similar to Saturn’s). This torus-like feature makes for a much more realistic SNe simulation, in which the expanding blastwave interacts with a region of high density somewhere out in the ISM.
In the photos below, you can clearly see the interaction of the blastwave with the surrounding torus feature, and the effect this additional condition has on the evolution of the SNe, relative to the previous photos.
Snapshots of a spherically symmetric expanding Supernova Explosion simulation, interacting with a surrounding ‘torus of matter’. The above photo shows the density profile of the blast after 350, 950, 1500, and 1900 years.
In the coming weeks, I plan on introducing more realistic physical conditions, such as random clumps of matter scattered around the SNr, among other things. It’s also worth noting that the images in this post are 2D slices of a 3D model.
If you have any questions about the project, please feel free to comment and I’ll be more than happy to answer.
Hi, my name is Aarushi Jain. I am a computer science and engineering student from Indore, India, in the pre-final year of my degree. I am working on project no. 2024, “Marching Tetrahedrons on GPU”, under SoHPC 2020. I made a tangential entry to this program: I was selected for an internship at VSC through another route, but it did not materialise due to the pandemic. I was very disheartened, but by the grace of god I got this opportunity to work under SoHPC 2020. It was a blessing in disguise for me.
This project will be completed under the supervision of Project Mentor Siegfried Hoefinger.
I was meant to join the summer training in Vienna, but due to the pandemic we were told that the program had been shifted to online mode. I was happy and sad at the same time: I was getting to learn new skills from experts in the comfort of home, but on the other hand I would miss the chance to visit new places and meet people from diverse cultures.
My first exposure to HPC began during my training at VECC (Variable Energy Cyclotron Centre) Kolkata. There I learned the fundamentals of parallel processing using pthreads and worked on a project for the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany.
As a kid, I visited the International Centre for Theoretical Physics (ICTP) in Trieste, Italy, with my parents for a month. I would like to share a video of ICTP where I got an award for being a good listener. Watch the video below and try to recognise me :-).
If you are still here, I would like to share my hobbies, which are photography and swimming.
A picture I took today at my home for this blog (Nikon D5600)
Thank you for reading, and I will keep you updated about my project here.
My name is Shiva Dinesh. I come from India, but I am right now in the beautiful Franconian city of Erlangen. I am currently pursuing my Masters in Computational Engineering at the Friedrich-Alexander-Universität Erlangen-Nürnberg in Germany. In my free time, I love playing strategy games.
How I met HPC and SoHPC:
I got introduced to High-Performance Computing in a course in the first semester of my Master’s. I was intrigued by the concept of HPC and realized its need and importance. I then took another course on HPC, in which we were encouraged to attend ISC Frankfurt – an HPC event. This was where I came to know about PRACE, and they made me aware of SoHPC. I undoubtedly wanted to take part in SoHPC, so I decided to apply in the next application period. Voila! I am here writing a blog on their website.
Me awkwardly posing at ISC Frankfurt 2019 at the release of the 53rd edition of the Top500 list. This was the first time all top 500 systems were faster than one petaflop.
The Project:
The idea of simulating the flow around submarines captured my interest, so I chose project #2021: Submarine Computational Fluid Dynamics. Considering my background in mechanical engineering, I believed this would be the ideal topic in which to integrate fluid mechanics and HPC.
Training Week in Vienna:
Due to the pandemic, most summer schools were canceled but, luckily for me, not SoHPC. I am glad that the organizing team managed to set up an online version of the summer school and has done a commendable job so far. The training week was well scheduled; it introduced us to the basics of parallel programming and gave us hands-on experience on the Vienna Scientific Cluster (VSC). The exercise and yoga breaks between the lectures were well thought out, as they relieved the stress of sitting in front of a screen for long hours. The week also gave me an opportunity to interact with like-minded people from the community, and I had fun doing so.
First Weeks at Luxembourg:
I am still daydreaming that I am in Luxembourg, and we will let it be that way for the moment. After the training week, I started working on my project, mentored by researchers from the University of Luxembourg. Alongside my project partner Matt, I began getting acquainted with the CFD concepts and visualization tools required to simulate the flow around submarines. So, for the next six weeks, I will delve into the realms of HPC and CFD.
If you are still here, thank you for reading, and I will keep you updated about my project here.
“There is only one way to learn. It’s through action. Everything you need to know you have learned through your journey.”
– Paulo Coelho, The Alchemist
Hoping to have a great journey here and learn a lot of new things. Have a nice day, and stay safe! See you soon.
Hi, I’m Andres Vicente, but everyone calls me Andreu. I’m an astrophysicist (official since yesterday, in fact) based in the Canary Islands. I heard about the Summer of High Performance Computing programme (SoHPC for short) in a programming techniques course at my university, and I couldn’t resist joining. I can tell you that it was an awesome decision. Before I talk about my project and how I plan to use Deep Neural Networks (DNNs) in the field of galaxies, let’s start from the beginning:
Let’s start by warming up our neurons: Training week
SoHPC started with a training week where we were instructed in the basics of HPC and where our brains started getting used to thinking about performance and parallelization. The four days of training flew by even quicker than the executions of our programs on the supercomputers, and then we were on our own, facing the project we dreamed about. But we are not worried; we don’t feel alone – our mentors and peers will make the journey easier.
The objective of the project is to use DNNs to detect objects in images, and we are lucky because, in the astrophysics world, a very big portion of the data obtained from the Universe comes in the form of images. In the past, most of the classification of galaxy morphologies was done by simple human inspection. Nowadays, we have better tools to classify the galaxies, but almost all of them require fitting an analytical model to every single observation, which is obviously time-consuming and computationally expensive. DNNs and their object detection capabilities open a new world of possibilities in astronomy and astrophysics, because we will not only be able to detect the morphologies of galaxies (which is a rather easy task), but we could also go beyond that and infer physical properties of a galaxy just by looking at the raw image!!
In our case, we will try to detect the orientation of galaxies (which is closely related to the angular momentum vector, if you are wondering). This is not an easy task, since we don’t have “training data” to feed our network: we don’t know the ground truth of this quantity in the observed galaxies. But… here comes HPC to rescue us again.
We can mock the observed data with high-resolution simulations of galaxies where we know all the parameters. These simulations are done by experts in the field on the most powerful supercomputers in the world, for example the Illustris simulation run on PRACE supercomputers: https://prace-ri.eu/universe-simulation-illustris-is-an-ongoing-success-story/
From those simulations, we can render images of galaxies in any orientation and plot the angular momentum as a vector, as we can see here:
Figure 1: Rendered Images from the NIHAO simulation with the angular momentum superimposed as a green arrow. These images have been created to obtain our dataset to train the network.
These physical properties of the galaxies are relevant because they tell us the story of galaxy evolution – how each galaxy has formed and evolved. This helps us understand our own galaxy and, somehow, why the Universe is as beautiful as it looks.
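The key trick is that the label comes for free from the simulation. As a toy illustration (nothing to do with the real NIHAO rendering pipeline – the shapes and names below are purely made up), one can generate a mock dataset of oriented blobs whose known orientation vector serves as the training target:

```python
import numpy as np

def mock_galaxy(angle, size=64, axis_ratio=0.4):
    """Render a toy edge-on 'galaxy' as a rotated elliptical Gaussian.

    Returns the image and the in-plane orientation of the spin axis
    (perpendicular to the major axis) as the training label."""
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    c, s = np.cos(angle), np.sin(angle)
    u = c * x + s * y              # coordinate along the major axis
    v = -s * x + c * y             # coordinate along the minor axis
    img = np.exp(-(u ** 2 + (v / axis_ratio) ** 2) / 0.1)
    label = np.array([-s, c])      # unit vector perpendicular to the major axis
    return img, label

# Build a small labelled dataset of galaxies at known orientations
images, labels = zip(*(mock_galaxy(a) for a in np.linspace(0, np.pi, 100)))
X = np.stack(images)[..., None]    # (100, 64, 64, 1) image tensor for the network
Y = np.stack(labels)               # (100, 2) orientation targets
print(X.shape, Y.shape)
```

A network trained on such (image, vector) pairs can then be applied to observed galaxies where the ground truth is unknown – exactly the strategy described above, only with far more realistic simulated images.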
With the idea and the training data, it’s time to get our hands dirty!!
We plan to use a convolutional deep neural network to do the job. The full architecture is not clear yet but the general scheme (simplifying a bit) will look something like figure 2.
Figure 2: Illustration of the DNN for computing the angular momentum of a galaxy.
I’m very excited to see how the project evolves and I hope you will join my journey and share some of my emotions to unveil the mysteries of the Universe through HPC!!
Keep an eye on the SoHPC blogs if you want to stay posted.
Greetings, dear blog intruder! This blog belongs to Anssi, from the northern forests of Finland. My home office for this summer will be in secluded Kuopio. At this point, I would like to rectify a common fallacy: there are no polar bears here.
The PRACE Summer of HPC (HPC actually stands for high-performance computing) consists of annually organized projects in different supercomputing centres around Europe. As I am always watchful for new mathematics-related knowledge, it was an easy decision to apply for SoHPC.
Figure 0: Transition between states of mine (Quantum incoherence was faint)
Enigmatic problems emerging from all kinds of disciplines have always intrigued me – such as the statistical inverse problems I have been studying in the computational physics department at the University of Eastern Finland. Although this kind of problem often cracks one’s head and diminishes the remaining sanity, it also offers dire moments of euphoria. Similar challenges arise when designing well-functioning HPC algorithms and protocols.
I consider myself a sloppy coder; nonetheless, it is still nice to manifest computed results at times. From SoHPC, I expect to learn some basic concepts of HPC and to extend my programming toolkit for the future. Furthermore, there is also a hazard of comprehending neat new stuff from physics/mathematics.
The project under code name 2011 and the training week:
My project’s organizer is the Jülich Research Centre in Germany, for which I’ll be working remotely for the next seven weeks. The project’s essence is the lattice discretization of Quantum Chromodynamics (LQCD), which allows one to compute the behaviour of gluons and quarks by machinery (see https://summerofhpc.prace-ri.eu/high-performance-quantum-fields/). As the lattice tends to have an enormous number of estimated parameters (e.g. colours and spins), large-scale parallel computing is harnessed to drastically reduce the computational time requirements.
So far, SoHPC has offered the training week for the participants, in which some essential HPC concepts were introduced – namely, how to use OpenMP, MPI and CUDA, and how HPC architectures are often constructed. Overall, the training week was an enjoyable experience and had some refreshing daily activities embedded, such as yoga with a professional (thanks to Cornelia, the yogi).
Figure 1: Not all heroes wear blades, but this one does. Kept me cool during the training week.
Since the programme is done remotely, I am not currently capable of providing any local views from Jülich. Instead, you can check out the Jülich blogs from previous years. The plan is to publish scenery from my own vicinity as soon as I get outside.
To finalize the post, I would like to provide a few futile facts (someone could call these opinions).
The best GPU: Nvidia GeForce GTX 770 GPU – Has served me well during the years, with a minimum amount of protests.
The best juggling pattern: Mills mess (4 balls) – Fairly easy and splashy.
The best basis: Bell basis – Entanglement for everyone.
The best (prog) music composer: Steven Wilson – Dude just knows how to amass the instruments.
The best type of graph: Markov random fields – I just like the approach.
The best paradox: Banach–Tarski paradox – Virtually solves the world’s food shortage problem.
The best type of toothbrush head: One with medium-strong brushes – Works gently.
The best type of book: Hardcover – Allows one to tap audible patterns.
Within the upcoming weeks, you can follow my progress via this blog. Also, check out the blog of Aitor, who is also working on the high-performance quantum fields project.
To keep things interactive, there’s a brief juggling video below.
Video 01: Providing (below) mediocre juggling, because why not?
Hi, my name is Cathal Maguire. I am a Physics & Astrophysics graduate from Dublin, Ireland. I obtained a First Class Honours Degree in Physics & Astrophysics from Trinity College Dublin in 2020. I am also a big soccer and Gaelic football fan; realistically, though, I’ll watch any competitive sport.
As part of SoHPC 2020, I will be working on Project #2003 – ‘Visualization of Supernova Explosions in a Magnetised Inhomogeneous Ambient Environment’. This project will be completed under the supervision of Project Mentor Salvatore Orlando, in collaboration with my fellow Trinity classmate and friend, Seán McEntee.
Project #2003
Project #2003 was due to take place in the CINECA Computing Centre in Bologna, Italy. However, due to the Global COVID-19 Crisis, the Summer of HPC 2020 is being carried out remotely. Although I would have enjoyed the summer sunshine of Northern Italy, the wind and rain of West Dublin will have to suffice.
I have previous experience running MHD (Magnetohydrodynamic) computer simulations, such as that outlined in the project description. As part of my final year, I undertook a Research Project investigating the circumstellar environment of the red supergiant star Alpha Orionis (aka Betelgeuse), which was ultimately accompanied by a thesis titled: ‘What can radio emission tell us about the stellar winds of Alpha Orionis?’.
Instead of cancelling in pandemic times, the programme decided to double the number of participants (50 selected for 24 projects) and organise remote training and project mentoring. Applicants welcomed the decision with comments like:
Indeed we are lucky that computing is one of the things we can still do in this extraordinary time. I am still very enthusiastic about taking part in SoHPC!
I am currently working remotely for college so continuing to do so is not an issue. My whole family is working remotely as this is and will be the new normal method of working for the foreseeable future so I fully support the decision to switch the project to a remote one
I’m not happy about that but I understand that Covid-19 is a serious threat.
I think it would be the best moment to learn more about programming, as staying at home might be recommended or even mandated. Also, I am now getting used to this type of work, as my final degree projects require cluster and local programming. Furthermore, I am habituated to communicating with partners and tutors weekly, as we work with the same framework and data.
Remote is not a problem; it also has some good sides. I really want to take part in this program.
Those are the times of unfortunate events, but it’s also good for academic improvement since we all stay at home and focus on study. I’d happily attend SoHPC distance education. Creating a group of students is also good for starting new friendships and an academic network. I hope I can be part of this great organization. Best regards.
While it is a bummer the program cannot continue as planned, given the recent developments, I fully understand the need to make this decision and I am on board with it.
I would be more than happy to participate in the programme remotely, if selected.
Despite the exceptional circumstances, project subjects are intriguing enough to be done remotely. On the spot or not, PRACE Summer of HPC remains as an excellent opportunity to learn plenty of new stuff during the internship. Into the bargain, I’ll be given a fancy t-shirt 🙂
Right now I’m even more excited about participating. Working remotely in a team will be part of any job in the future, and getting experience of this kind of situation may be very useful. Plus, I can still achieve the goals I set for myself. I can see only positive aspects in this kind of solution. Thanks for still giving us an opportunity and for all your work.
Considering the past events, I was concerned that SoHPC might be cancelled this year, so I am happy and hopeful, as I still have a chance of being selected for the program! I am now comfortably at home, and I have both the chance and the high motivation to complete a project remotely if selected. I believe that recent events will not condition my ability to perform high-quality work if selected.
Although I regret not being able to live the full SoHPC experience, I think that it is still an excellent opportunity to work with HPC experts of the best centres around Europe. Moreover, working in groups could make the experience more challenging and enriching, given that selected students should cooperate with people of other countries and with different academic background. Last but not least, dealing with cooperating seems an excellent training for the future, both for academic and professional careers.
In tough times we take exceptional measures.
I would still be delighted to participate in the SoHPC programme – to get the fantastic opportunity to work on one of the projects. The remote structure sounds very good – I think it would be great to work in a team with other SoHPC participants!
I’d be happy to participate in the project under any circumstances.
Participation in the programme is still a valuable experience, doing it remotely is not optimal, but given the circumstances I think is for the best. I will try to make the most of what it is and go through it with the best of my abilities.
All of us had to give up our normal lives because of this disease that is everywhere in the world. We are currently successfully distance learning from our schools. I believe that remote work will be successful too.
Many events and flights are cancelled and schools are closed due to the pandemic. In this chaos, it would be unexpected for a big event like SoHPC to take place. However, instead of cancelling the event, running the training remotely will still be a great opportunity for us, because what SoHPC offers does not change. We will still be working on a project with the finest HPC centres. Of course, it is a little disappointing not to meet people in person and not to have a cup of coffee with them to discuss ideas. However, I believe everyone will do their best to work on their project and keep alive the spirit of SoHPC.
Thank you for making this internship possible, earlier I was thinking that it will not take place. But glad it will take place and happy to be part of it.
Applications are open from 11th of January 2020 to 26th of February 2020. See the Timeline for more details.
PRACE Summer of HPC programme is announcing projects for 2020 for preview and comments by students. Please send questions to coordinators directly by the 11th of January. Clarifications will be posted near the projects in question or in FAQ.
About the Summer of HPC program:
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 24 top applicants from across Europe will be selected to participate. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. The programme will run from June 31st to August 31st.
For more information, check out our About page and the FAQ!
Ready to apply? Click here! (Note, not available until January 10th, 2020)
Hello ladies and gentlemen, as we start our descent, please make sure your seatbelt is securely fastened.
It was a great month for me, full of awesome stories. SoHPC 2019 introduced me to new topics, nice friends and beautiful places.
Have you ever noticed that the waiting time for machines is decreasing day by day in our daily lives? Of course, nobody wants to wait for slow automated machines or have a slow smartphone, but do you remember how long you used to wait for old devices to start up?
How do we express real-world problems to machines?
We can describe problems with the help of linear algebra and solve them by expressing them as systems of linear algebraic equations. Machines understand things as numbers in a matrix, and any manipulation means matrix operations. The speed of these calculations can be increased; however, speed can be the enemy of crucial calculations that must be precise at the same time.
Let’s get back to my project
In my project, the aim is to reduce the residual error of the inverse produced by “Markov Chain Monte Carlo matrix inversion” using a stochastic gradient descent method.
What is Stochastic Gradient Descent (SGD)?
Gradient Descent is an iterative method that is used to minimize an objective function. If a randomly selected subset of the samples is used in every iteration of gradient descent, it is called stochastic gradient descent.
With the help of the mSGD method proposed in the paper, the error of the inverse obtained from the MCMCMI method is decreased. After successful results in Python, the algorithm was implemented in C++.
”The Cherry on the cake”
“Life is not fair. Why should random selection be fair?”
In mSGD, the rows are selected with uniform probability. The results are slightly better if the probability of selecting a row is proportional to the norm of the row.
The results of the implementation.
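This is not the paper’s exact mSGD, but the flavour of norm-weighted row sampling can be seen in a toy sketch that refines an approximate inverse with randomized Kaczmarz-style row projections (all names, sizes and constants below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def refine_inverse(A, X0, n_iters=20000, norm_weighted=True):
    """Refine an approximate inverse X0 of A by row-sampled updates.

    Each column x_j of the inverse solves A x_j = e_j; one step projects
    x_j onto the constraint given by a single sampled row of A. Rows are
    drawn with probability proportional to their squared norm."""
    n = A.shape[0]
    X = X0.copy()
    row_norms = np.sum(A * A, axis=1)
    probs = row_norms / row_norms.sum() if norm_weighted else np.full(n, 1.0 / n)
    I = np.eye(n)
    for _ in range(n_iters):
        i = rng.choice(n, p=probs)            # sample a row index (norm-weighted)
        j = rng.integers(n)                   # pick a column of the inverse
        r = I[i, j] - A[i] @ X[:, j]          # residual of constraint (i, j)
        X[:, j] += (r / row_norms[i]) * A[i]  # projection step onto that row
    return X

A = rng.normal(size=(20, 20)) + 20 * np.eye(20)   # well-conditioned test matrix
X0 = np.eye(20) / 20                              # crude initial guess
X = refine_inverse(A, X0)
err0 = np.linalg.norm(A @ X0 - np.eye(20))
err = np.linalg.norm(A @ X - np.eye(20))
print(err < err0)
```

The residual ‖AX − I‖ shrinks steadily, illustrating how a cheap stochastic refinement can polish a rough inverse like the one MCMCMI produces.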
“Last touch: Adding Batches to Parallelization”
Stochastic Gradient Descent and batching are as like as two peas in a pod. Instead of using the whole matrix A for the method, the rows are divided into subsets, one per process, in a hybrid MPI/OpenMP parallelization. When the rows cannot be divided equally, it is a good trick to give fewer rows to the master process, since it has more work to do than the others.
As I elaborated in my last two blog posts on what the Intel Neural Compute Stick is and how to use it in combination with the OpenVINO toolkit, I will describe in this blog post what the “dynamic” part in the title of my project stands for. As explained in my first blog post, the Neural Compute Stick is meant to be used in combination with lightweight computers such as the Raspberry Pi to speed up computations for visual applications. Often enough, devices such as the Raspberry Pi are found on the edge. This could be a device attached to satellites in space running earth observation missions, or an underwater robot conducting maintenance on critical structures such as underwater pipelines. One problem that remains though, even with the Neural Compute Stick accelerating computations, is that once a model is deployed on an edge device, it is hard to repurpose the device to run a different kind of model. This is where the “dynamic” part of the title of my project comes into play. The Neural Compute Stick is a highly parallelized piece of hardware, with twelve processing units independently running computations. This way, it is possible to have not only one but several models loaded into its memory.
Even more so, it is possible to swap these models and load new ones into memory. This allows the device to adapt to new situations in the field, such as when the feature space changes or the objects to be detected change. The simplest case of such an occurrence might be the sun setting or bad weather conditions coming up. Another motivation to switch models might be to save power, as edge devices tend to have limited capacities when it comes to energy sources. Instead of deploying one big model that is supposed to cover all cases that could occur in a production environment, it would be possible to have many small models that can be loaded in and out of memory at runtime.
With this in mind, I went ahead, investigated the feasibility of doing so, and implemented a small prototype that switches models at runtime. For this prototype I used two models, one detecting human bodies and one detecting faces, and had the prototype switch between the two. Both are so-called single-shot detector MobileNets, networks that are well suited to deployment on lightweight devices such as the Raspberry Pi. These networks localize and classify objects in a single pass through the network and draw bounding boxes around the objects they detect.
I used OpenCV for this task, a library featuring all sorts of algorithms for image processing that is best described as a “Swiss army knife” when it comes to visual applications. Next to OpenCV, I had OpenVINO running as a backend to utilize the Neural Compute Stick in my application.
I eventually tested this model-switching prototype by loading and offloading models in and out of the memory of the Neural Compute Stick. I did this at a very high frequency of one switch per frame, to determine what the latency of such a model switch would be in a worst-case scenario. The switching process includes reading the input and output dimensions of a model from the XML representation of its architecture and then loading it into the memory of the Neural Compute Stick. On average, this switch caused an extra overhead of about 14 percent of the overall runtime. To put this into perspective, on average it took my application half a second to capture and generate an output for an image, whereas a model switch in between would add a little less than a tenth of a second to this time. Of course, there is a lot of room for improvement given these numbers. One such improvement concerns the parsing of the model dimensions. I used a simple XML parser and had to read in the input and output dimensions of a model on every switch. Doing this once at application startup for all models that will potentially be used on the Neural Compute Stick, and saving the dimensions into a lookup table, could cut the switch time almost in half. Further speedup could be achieved by conducting the switch asynchronously: while the model is being loaded onto the Neural Compute Stick, the next frame can already be captured instead of waiting for the switching process to finish.
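To make the lookup-table idea concrete, here is a small Python sketch. The XML below only imitates the general shape of an OpenVINO IR file (the real IR schema is richer, and the model name and layer names are made up); the point is that each model’s dimensions are parsed once at startup and cached, instead of being re-read on every switch:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for an OpenVINO IR description (illustrative only)
FACE_XML = """
<net name="face-detection">
  <layers>
    <layer id="0" name="input" type="Input">
      <output><port id="0"><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></port></output>
    </layer>
    <layer id="42" name="detection_out" type="DetectionOutput">
      <output><port id="0"><dim>1</dim><dim>1</dim><dim>100</dim><dim>7</dim></port></output>
    </layer>
  </layers>
</net>
"""

def model_dims(ir_xml):
    """Extract each layer's output dimensions from an IR-style XML string."""
    root = ET.fromstring(ir_xml)
    dims = {}
    for layer in root.iter("layer"):
        port = layer.find("./output/port")
        if port is not None:
            dims[layer.get("name")] = [int(d.text) for d in port.findall("dim")]
    return dims

# Parse every model once at startup; model switches then hit this cache
# instead of re-parsing XML, which is where the time saving comes from.
DIMS_CACHE = {"face-detection": model_dims(FACE_XML)}
print(DIMS_CACHE["face-detection"]["input"])
```

At switch time, the application would just look up `DIMS_CACHE[model_name]` before loading the corresponding network onto the stick.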
A performance breakdown of my application with the pink part depicting the part of my code responsible for the model switch.
All in all, I found that although in its current state this prototype is not yet applicable to real-time applications, given the potential for improvement it could get there. And if no hard real-time constraints are imposed, as is the case for many applications, it is deployable already.
As the number of frames between switches increases, the performance of the application starts to drastically improve.
With this I would like to sum up my findings on this project. If you would like to learn more, feel free to have a look at my blog on the website of the PRACE Summer of HPC 2019. Lastly, I would like to thank my supervisors for their amazing support throughout this whole project, and in general the staff at ICHEC for welcoming me and making this stay such a great experience!
Today I will tell you how to speed up a programme running on a GPU. Do you remember the accumulation tree example from my last post? I was provided with a working version of it, with a global queue that stores tasks ready to be launched.
The protagonists of our story are the following. The threads are our Roman soldiers. They work on Floating Point Units (FPUs) and do the computation. They are grouped in units of 32 threads called warps. Threads within the same warp should execute the same instruction at the same time, but on different data. Warps are themselves grouped into blocks. Threads within the same block can communicate quickly through shared memory and can synchronize with each other. However, no inter-block communication or synchronization is allowed except through the global memory, which is much slower.
The tree is another player. It lives in the global memory, since all threads should be able to access it. Lastly, there is the queue, also living in the global memory. It is implemented as a linked list, meaning that its size can adapt to the number of tasks it contains. To avoid race conditions, accesses to the tree and the queue are protected with a mutual exclusion system (mutex). We rely on a specific implementation of a mutex for GPUs, which allows only one thread at a time to access the data it protects. To avoid deadlocks (a kind of infinite loop where two threads each wait for an event that will never occur), only one thread per warp tries to enter the mutex. We call it the warp master, and in this early version, only the warp masters can work, as follows:
1. Fetch a task from the queue
2. Synchronisation of the block
3. Execute the task
4. Synchronisation of the block
5. Push new tasks (if any)
Step 5 is done only if the task done at step 3 resolves a dependency and creates a new task.
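The five steps above can be sketched as a sequential, host-side Python model (all names here are illustrative; the real code is a CUDA kernel with a mutex-protected queue):

```python
from collections import deque

# Host-side sketch of the warp-master loop (illustrative names, not the
# real CUDA kernel). Tasks are callables that may return newly ready
# tasks when they resolve a dependency.
global_queue = deque()
results = []

def make_task(value, spawns=()):
    def task():
        results.append(value)                  # step 3: execute the task
        return [make_task(v) for v in spawns]  # tasks unlocked by this one
    return task

def warp_master_loop():
    while global_queue:
        # step 1: fetch a task (mutex-protected on the GPU)
        task = global_queue.popleft()
        # steps 2 and 4: on the GPU, the whole block synchronises here
        new_tasks = task()
        # step 5: push new tasks, if any
        global_queue.extend(new_tasks)

global_queue.append(make_task(1, spawns=(2, 3)))
warp_master_loop()
print(results)  # [1, 2, 3]
```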
Cutting allocations
The first improvement was to replace the queue implementation with my own, relying on a fixed-size queue. The reason is that memory allocation is very expensive on a GPU, and with an adaptive-size queue you are allocating and freeing memory every time a thread pushes or pops a task.
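The idea can be illustrated with a minimal ring buffer (a Python sketch, not the actual GPU code): the storage is allocated once, up front, and push/pop only move indices.

```python
# Sketch of a fixed-size queue as a ring buffer (illustrative; the real
# version lives in GPU global memory). No allocation happens per push/pop.
class FixedQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity     # allocated once, up front
        self.capacity = capacity
        self.head = 0                    # index of the next pop
        self.count = 0                   # number of stored tasks

    def push(self, task):
        if self.count == self.capacity:
            return False                 # full: caller must handle it
        self.buf[(self.head + self.count) % self.capacity] = task
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None                  # empty
        task = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return task

q = FixedQueue(4)
for t in ("a", "b", "c"):
    q.push(t)
print(q.pop(), q.pop())  # a b
```

The trade-off is that a full queue must be handled explicitly, which is exactly what the fallback logic in the next section is for.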
Reestablishing private property
The second idea was to reduce contention on the global queue, since working threads are constantly trying to access it. I added a small private queue for each block, which can be stored either in the cache or in the shared memory for fast access. The threads within a block use the block's private queue by preference, and the global queue as a fallback if the private one is full (push) or empty (pop).
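A rough sketch of this two-level policy (illustrative Python, with plain lists standing in for the GPU queues):

```python
# Sketch of the two-level queue policy (illustrative): threads prefer
# their block's private queue and fall back to the global queue when
# the private one is full (push) or empty (pop).
def push_task(task, private, global_q, private_capacity):
    if len(private) < private_capacity:
        private.append(task)         # fast path: cache / shared memory
    else:
        global_q.append(task)        # fallback: slower global memory

def pop_task(private, global_q):
    if private:
        return private.pop()         # fast path
    if global_q:
        return global_q.pop()        # fallback when private is empty
    return None

private, global_q = [], []
for t in range(4):
    push_task(t, private, global_q, private_capacity=2)
print(private, global_q)  # [0, 1] [2, 3]
```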
Solving the unemployment crisis
For now only a few threads (the warp masters) are working. It's time to put an end to that. First, since threads within a block are synchronised, accesses to the private queue happen at the same time and are thus serialised by the mutex. I decided that only one thread per block would access the queue, in charge of fetching the work (step 1) and resolving dependencies (step 5) for all the threads.
Breaking down the wall
Now that all threads are working, they are all waiting to enter the mutex protecting the tree, even if they are trying to access different parts of it. So I removed the mutex and ensured that all operations on the tree are atomic. That is a bit as if there were a mutex on each node of the tree, making it possible for several threads to access the tree concurrently.
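The effect of per-node atomics can be mimicked on the host like this (a sketch: a per-node lock stands in for the GPU's atomic add, so threads touching different nodes never contend):

```python
import threading

# Sketch of per-node atomic updates (illustrative). On the GPU this is
# an atomicAdd on each tree node; here a per-node lock stands in for the
# hardware atomic, so two threads only contend when they hit the same node.
class Node:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()    # stand-in for atomicAdd

    def atomic_add(self, delta):
        with self._lock:
            self.value += delta

nodes = [Node() for _ in range(4)]

def worker(node, n):
    for _ in range(n):
        node.atomic_add(1)

# two threads per node, 1000 increments each
threads = [threading.Thread(target=worker, args=(nodes[i % 4], 1000))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([n.value for n in nodes])  # [2000, 2000, 2000, 2000]
```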
Fighting inequality
Removing the mutex from the tree resulted in a huge gain in time, so I tried to also get rid of it for the shared queue access. First, I split the queue so that there is one shared queue for each block.
Because the queue is fixed in size, the operations of pushing and popping a task are independent. A block master can pop only from its own queues (private and shared), but can push to its own private queue, its own shared queue, and also other blocks' shared queues.
Thus, we do not need mutex protection for the popping operation.
Last but not least, the work must be equally shared between blocks (that is called load balancing). I provided a function that tells a calling thread in which shared queue a newly created task should be pushed.
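Such a function might look like the following (a guess at the policy, here "push to the least-loaded shared queue"; the project's actual rule may well differ):

```python
# Hypothetical load-balancing rule (illustrative): push a newly created
# task to the least-loaded shared queue so every block keeps getting work.
def target_queue(shared_queues):
    lengths = [len(q) for q in shared_queues]
    return lengths.index(min(lengths))   # index of the emptiest queue

queues = [[1, 2, 3], [4], [5, 6]]
print(target_queue(queues))  # 1 (the shortest queue)
```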
How big is the speedup?
Quite high, actually. The optimized version I wrote runs 450 times faster than the version I was provided with. For an octree of depth 5, the execution time is reduced from 700 ms to 1.5 ms, enabling it to be used in real applications.
Is your graphics card able to run N-body simulations in a smart way? A complex tree algorithm, a sophisticated tasking system: is all that a task for a GPU? No, some will say, a graphics card can only do basic linear algebra operations. Well, maybe the hardware is capable of much more…
It is now time to give you some insight into my project. The main goal is to make progress towards the implementation of a smart algorithm to solve the N-body problem. To learn the main idea behind this algorithm (without going into all the dirty details), you can check the video I made for PRACE there.
To put it in a nutshell, the Fast Multipole Method (FMM) is an algorithm to compute long-range forces within a set of particles. It makes it possible to solve the N-body problem numerically with linear complexity, where it would otherwise be quadratic (when computing the interactions between each pair of particles). Doubling the number of particles only doubles the computation time instead of quadrupling it. However, the FMM algorithm is hard to parallelize because of data dependencies. Tasking, meaning splitting the work into tasks and putting them into a queue, helps a lot to give work to all threads. A working tasking framework for the FMM has been implemented on Central Processing Units (CPUs) by D. Haensel (2018). Will such a tasking framework run efficiently on General Purpose Graphics Processing Units (GPGPUs, or more simply GPUs)?
The answer is not obvious at all, because CPUs and GPUs have very different architectures. My goal this summer was to shed some light on that topic.
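As a quick numeric illustration of the complexity claim above (pure arithmetic, nothing FMM-specific):

```python
# The direct method touches every pair of particles, so doubling N
# roughly quadruples the work; an O(N) method like the FMM only doubles it.
def direct_pairs(n):
    return n * (n - 1) // 2   # number of pairwise interactions

for n in (1000, 2000, 4000):
    print(n, direct_pairs(n))
# 1000 -> 499500, 2000 -> 1999000 (~4x), 4000 -> 7998000 (~4x again)
```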
A smooth start
Let me first introduce the problem that we will use to test the tasking framework. It is a simplified version of one of the operators of the FMM, and we named it the accumulation tree.
The principle is very simple: the content of each cell is added to its parent. So one task is exactly "adding the content of a cell to its parent". You can already see that dependencies will appear, since a node needs the results from all its children before its own task can be launched. Imagine we have 10 processing units (GPU or CPU threads); the computation of the accumulation will proceed as follows.
Initialisation
All leaves are initialized with 1. All tasks that are ready, that is all tasks from leaves, are pushed into the queue.
Round 1
Ten blue tasks are done. Two new green tasks are ready; so they are pushed into the queue.
Round 2
The last six remaining blue tasks are done, as well as two green tasks. The last two green tasks are ready and pushed into the queue. Here we can see that tasking maximizes thread usage, since all tasks that are ready can be executed.
Round 3
The last two green tasks are executed. We get the correct result, hip hip hooray!
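The whole accumulation can be sketched in a few lines of sequential Python (illustrative, ignoring the parallelism and the queue):

```python
# Sketch of the accumulation tree on a complete binary tree (illustrative).
# Leaves start at 1; each task adds a cell's content to its parent, and a
# parent only accumulates after all its children have contributed.
def accumulate(depth):
    # array-based complete binary tree: node i has children 2i+1 and 2i+2
    size = 2 ** (depth + 1) - 1
    first_leaf = 2 ** depth - 1
    tree = [0] * size
    for i in range(first_leaf, size):
        tree[i] = 1                      # initialise leaves with 1
    for i in range(size - 1, 0, -1):     # children before parents
        tree[(i - 1) // 2] += tree[i]    # the "add to parent" task
    return tree[0]                       # root ends up counting the leaves

print(accumulate(3))  # 8 leaves -> root == 8
```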
And on GPUs ?
Such a tasking system works well on a CPU. Why can't we just copy-paste the code, translate some instructions and add some annotations to make it work on a GPU?
Because many assumptions we can rely on with CPUs no longer hold on GPUs. The biggest of them is thread independence. You can compare the CPU to a barbarian army: only a few strong soldiers, each of them able to act individually.
credits: Wikimedia CC-BY-SA
A graphics card, however, is more like the Roman army, with a lot of men divided into units. All soldiers within the same unit are bound to do exactly the same thing (but on different targets).
Even if I'm sure you are looking forward to knowing whether it is possible to tame this powerful army to implement a tasking system, you will have to wait for my next blog post. Be well in the meantime!
I must say I am quite satisfied, as I eventually managed to modify the DL Meso DPD code in order to run simulations with 3 billion particles. This was quite nasty, as there were several hidden variables that needed to be modified to store very big numbers: we are talking about long integers, for those who are fond of computer science.
In conclusion, the DL Meso DPD code scales pretty well on very large GPU architectures. This result is amazing: by running these simulations, based on a mesoscale approach, we can jump several length scales and approximate pretty well the continuous nature of fluids.
To understand this better, check the following video about similar work based on a molecular dynamics perspective:
Moreover, this allowed me to run some very large jobs (up to 2048 nodes) on the Piz Daint supercomputer, which is quite a story to tell.
So here we are: SoHPC is over and I couldn't be any sadder. Not only has it been a great opportunity to work in an innovative environment on cutting-edge research, but it was also the occasion to meet amazing people and live in the amazing city of Liverpool.
So, thanks for having followed this blog and, for the last time, in the immortal words of the fab four: “Hello, Goodbye”
In my one + three blog posts, I tried to present a different aspect every 20 days: me, plus three relatively different topics that, combined, provide the basis of my unique Summer of HPC project.
I also had a video presentation of my project, explaining nuclear physics, with stupid jokes, simple animations, puns and tv series references. https://www.youtube.com/watch?v=P12FqpXB7Yg
I improved my coding skills a lot, and with the help of the guys in the lab I learned much more than I thought I would. I also managed to create a GUI without prior experience. I learned about nuclear fusion and plasma. I also learned about and used HPC systems.
…but there are more things in life
There are more things in life, like getting a song playing all night, or enjoying Ljubljana, meeting people and traveling around, and receiving Slovenian hospitality.
1 Ljubljana
Ljubljana is very peaceful and small, yet there are a lot of opportunities for young students. Big music festivals, open-air cinema, parties, music shows. I loved cycling here (with an annual subscription that costs 3€, you get access to a fully equipped bike rental system for trips that last less than an hour), and I also enjoyed just walking around the city, looking for parks and recreation. The river Ljubljanica, besides having the cutest river name, can be very crowded by the main bridges, but it also passes through sites where you can enjoy serenity. The castle also provides an awesome view, from which in some cases I guess you can see the whole Slovenian countryside (yes, Slovenia is very small).
2 Meet & Travel
Since this was the first international program I participated in (I never managed to participate in any Erasmus program), it was the first time I got the opportunity to meet so many people from all over the world, and I was very fascinated by this. There were 25 participants in 12 cities this year, and we managed to find time to meet again with some of the participants of SoHPC who were living in nearby countries. So, almost every second weekend I had the opportunity to be in a different country with people I had a good time with.
In two months I got the opportunity to meet most of the participants of SoHPC in Bologna, to learn about and appreciate Indian culture in Ljubljana, to get Balkan and Pasok vibes in Zagreb, to constantly hum "fly me to the moon" in Vienna, to learn about Amaziɣ in Budapest, and to talk about feminism in Bratislava (and get back with a 12-hour delay through Munich). All very exciting experiences, and I am quite thankful for having them.
3 Slovenian hospitality
My roommate and fellow SoHPC participant Khyati and I lived for two months in a very comfortable university dormitory apartment. We also received great hospitality from our mentors and site coordinators. In particular, we had the opportunity to have an adventure in nature with the SoHPC coordinator Leon Kos, where he proved to be the most athletic of us when he succeeded in shallow-river walking, climbing and hiking in Rakov Skocjan, and canoeing on the Slovenian-Croatian border river.
Conclusion
It was an awesome experience; I got much more than I expected. Everything was quite well organized, and if you are here thinking about it, don't think about it. Apply for SoHPC.
In the German movie Good Bye, Lenin!, a proud socialist woman falls into a coma in October 1989. When she wakes up again, the Wall has already fallen. To protect her health, her son decides to hide this historical episode from her; she would suffer a huge anguish if she knew her beloved old regime had disappeared. So the son builds a parallel reality inside their apartment, where nothing has changed: a bastion of the old German Democratic Republic. How is this related to my Summer of HPC project? We'll see…
Project recap
As we have learnt in the previous blog posts, drug discovery is a very expensive (~US $2.8 billion) and slow (12 to 15 years) process. Free Energy Perturbation (FEP) calculations coupled with Molecular Dynamics (MD) simulations may allow us to speed it up and reduce the cost and time of efficacy optimization (one of the goals of lead optimization, which in turn is the most expensive pre-clinical phase of drug discovery). These simulations allow us, in theory, to computationally screen many compounds, select those with better binding affinity to their target, and synthesize only the most promising ones, thus saving valuable resources and time. The goals of this project have been 1) to compare FEP results with experimental data to assess whether it is accurate enough to be used in industry, and 2) to determine whether High-Performance Computing (HPC) is necessary for FEP/MD calculations. All simulations were performed on ARIS, the Greek national supercomputer.
Project Results
Comparison between experimental and GAFF2 results for different analogs.
Our test case has been CK-666, a micromolar inhibitor of the protein Arp2/3. This protein is involved in cell movement and in tumor cell migration; however, when CK-666 binds to it, the protein is inactivated. Different modifications of CK-666 (analogs) were studied.
After completing FEP/MD simulations for 11 analogs against CK-666, we could generate a correlation plot against the experimental results. How did our predictions correlate with experiments? That's the big question.
Looking at the final correlation plots, we can observe how, for example, FEP simulations using the Gromacs software and the GAFF2 force field correctly predicted that molecules ai003 and ai007 are favored over the reference ligand (CK-666), because their relative free energy is smaller than zero. For other analogs, such as ai015 or ai101, this was not the case. However, the Mean Absolute Error was 1.13 kcal/mol, which allows us to say that FEP can predict whether an analog will be better or worse than the originally synthesized molecule with a ΔΔG of ~1 kcal/mol. Thus, this technique has the potential to greatly reduce the cost and time of lead optimization before a drug enters clinical trials.
Finally, we showed that HPC resources are essential for FEP/MD. On average, 13.5 hours and 1680 cores were needed for the complex phase of FEP, while 6.5 hours and 210 cores were necessary for the solvent phase (see here the hardware specifications). And that is for one single compound. Since the goal of FEP is to screen many analogs to select only a few of them to be synthesized, let’s imagine the scenario in which we want to analyze 100 analogs in 1 day. This would require 5250 cores for the solvent phase of the simulations, and 84000 cores for the complex phase.
One of the last stages of the project included creating a popular video in which the entire work performed is explained, together with the results and conclusions. If you are eager to learn more about what I have done, please check it here!
Farewell
Hopefully, the day when drug discovery costs are reduced thanks to the integration of computational tools in the pipeline will arrive in the near future. However, there is still work to be done by industry and academia in this direction. Meanwhile, from next week on, the only good times I will be missing will be my two months in Athens. Perhaps I will ask my family to pretend I am still there…
Hello dear readers! Today, we will write the ending of this story. A story that lasted for two months. TWO MONTHS. For me, July was full of surprises and new friends, August was full of wonders and adventures!
But let's start with a summary of my project:
Last time I talked about Gradient Boosted Decision Trees (GBDT) and quickly addressed the parallelization methods. This time we will dig deeper into this aspect! So how can we parallelize this program?
Well, when it comes to the concept of a parallel version of a program, many aspects have to be taken into account, such as the dependencies between various parts of the code, balancing the workload among the workers, etc.
This means that we cannot parallelize the gradient boosted algorithm’s loop directly since each iteration requires the result obtained in the previous one. However, it is possible to create a parallel version of the decision trees building code.
The algorithm for building Decision Trees can be parallelized in many ways. Each of the approaches can potentially be the most efficient one depending on the size of data, number of features, number of parallel workers, data distribution, etc. Let us discuss three viable ways of parallelizing the algorithm, and explain their advantages and drawbacks.
1. Sharing the tree node creation among the parallel processes at each level
The way we implemented the decision tree allows us to split the effort among the parallel tasks. This means that we would end up with task distribution schematically depicted in the figure below:
You have already seen this figure in Episode 3!
Yet, we have a problem with this approach. Imagine a case where 50 "individuals" go to the left node and 380 go to the right one. We would then expect one processor to process the data of 50 individuals while the other processes the data of 380. This is not a balanced distribution of the work, with some processors doing nothing while others may be drowning in work. Furthermore, the number of tree nodes that can be processed in parallel limits the maximum number of usable parallel processes… So we thought about another way.
2. Sharing the best split search in each node
In our implementation of the decision tree algorithm, there is a function that finds the value that splits the data into groups with minimum impurity. It iterates (for a fixed column, i.e. variable) through all the different values and calculates the impurity of the split. The output is the value that reaches the minimum impurity.
Sharing the best split search in each node, for each feature
As can be seen in the figure above, this is the part of the code that can be parallelized. So every time a node has to find a split of individuals into (two) groups, many processors compute the best local splitting value, and we keep the minimum over the parallel tasks. Then the same calculations are repeated for the right data on one side and the left data on the other.
In this case, a parallel process does its job for a node and, when done, can directly move to another task on another node. So we got rid of the unbalanced workload, since (almost) all processes are constantly given tasks to do.
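The split-search function described above can be sketched like this (illustrative Python using a Gini impurity for binary labels; the project's actual impurity measure and code may differ):

```python
# Minimal sketch of the split search: for one feature, try every distinct
# value as a threshold and keep the one with the lowest weighted Gini
# impurity (illustrative, not the project's exact code).
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)        # fraction of class 1
    return 2 * p * (1 - p)

def best_split(values, labels):
    best_value, best_impurity = None, float("inf")
    n = len(labels)
    for v in sorted(set(values)):
        left = [l for x, l in zip(values, labels) if x <= v]
        right = [l for x, l in zip(values, labels) if x > v]
        imp = len(left) / n * gini(left) + len(right) / n * gini(right)
        if imp < best_impurity:
            best_value, best_impurity = v, imp
    return best_value, best_impurity

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))  # (3, 0.0): splitting at 3 separates the classes
```

It is this inner loop over candidate values (one per feature) that the parallel schemes below distribute.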
Nevertheless, this approach also has a notable drawback: the cost in communication is not always worth the effort. Imagine the last tree level, where we would have only a few individuals in each node. Communication costs are paid in both directions (the data has to be sent to each process, and the output received at the end), and the communication will eventually slow down the global execution more than the parallelization of the workload speeds it up.
3. Parallelizing the best split search on each level by features
The existing literature helped us merge features of the two aforementioned approaches into one that reduces or eliminates their drawbacks. This time, the idea is to parallelize the function that finds the best split, but over each tree level: each parallel process calculates the impurity for a particular variable across all nodes within the level of the tree.
This method is expected to work better because:
The workload is now balanced, because each parallel process evaluates the same amount of data for "its" feature. As long as the number of features is the same as (or ideally much larger than) the number of parallel tasks, none of the processes will idle.
The impact of the problem we described in the second concept is reduced, since we are working on whole levels rather than on a single node. Each parallel task is loaded with much larger (and equal) amounts of data, so the communication overhead is less significant.
The global concept is shown in this figure:
Parallelizing the best split search on each level by features
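A minimal sketch of this level-wise, by-feature scheme (illustrative Python, with threads standing in for the parallel processes and a Gini impurity standing in for the project's measure):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: worker i owns feature i and scans it across every
# node of the current level; a final reduction keeps, per node, the best
# (impurity, threshold, feature) triple.
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(xs, labels, t):
    left = [l for x, l in zip(xs, labels) if x <= t]
    right = [l for x, l in zip(xs, labels) if x > t]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def scan_feature(feature, level):
    """Best (impurity, threshold) of `feature` for each node in the level."""
    out = []
    for X, labels in level:              # one entry per node of the level
        xs = [row[feature] for row in X]
        out.append(min((split_impurity(xs, labels, t), t) for t in set(xs)))
    return feature, out

# one level with two nodes; feature 0 is informative, feature 1 is constant noise
level = [
    ([[1, 5], [2, 5], [8, 5], [9, 5]], [0, 0, 1, 1]),
    ([[3, 7], [4, 7], [6, 7], [7, 7]], [1, 1, 0, 0]),
]
with ThreadPoolExecutor() as pool:
    per_feature = dict(pool.map(lambda f: scan_feature(f, level), [0, 1]))

# reduction: for each node, pick the feature with the lowest impurity
best = [min(per_feature[f][i] + (f,) for f in per_feature)
        for i in range(len(level))]
print([b[2] for b in best])  # feature 0 wins for both nodes
```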
And now let’s see one way to improve all of this. We will focus on the communication, particularly its timing.
While we cannot completely avoid losing some wall-clock time in sending and receiving data, i.e. communication among the parallel processes, we can rearrange the algorithm (and use proper libraries) to facilitate overlap of computation and communication. As shown in the figures below, in the first case (a) the processor waits until the communication is done to launch the calculations, whereas in the second case (b) it starts computation before the end of the communication. The last case (c) shows different ways to achieve total computation-communication overlap.
Timing of communication and computation steps: (a) no overlap, (b) partial overlap. Timing of communication and computation steps: (c) cases of complete overlaps
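The overlap of case (b) can be mimicked in plain Python, with a background thread standing in for the asynchronous transfer (a toy model, not GPI-2 or MPI code):

```python
import threading
import time

# Toy illustration of computation/communication overlap: the
# "communication" runs in a background thread while the process keeps
# computing, so wall-clock time approaches max(comm, comp) instead of
# comm + comp. All names here are illustrative.
def fake_comm(result, delay=0.2):
    time.sleep(delay)                    # stands in for a data transfer
    result["received"] = [1, 2, 3]

def compute_local():
    return sum(range(1000))              # work needing no remote data

result = {}
comm = threading.Thread(target=fake_comm, args=(result,))
comm.start()                             # non-blocking "send/receive"
local = compute_local()                  # overlap: compute while waiting
comm.join()                              # wait only for what is left
print(result["received"], local)
```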
One of the libraries that allows asynchronous communication patterns, and thus overlap of computation and communication, is GPI-2, an open-source reference implementation of the GASPI (Global Address Space Programming Interface) standard, providing an API for C and C++. It provides non-blocking one-sided and collective operations with a strong focus on fault tolerance. Successful implementations of parallel matrix multiplication, K-means and TeraSort algorithms are described in Pitonak et al. (2019), "Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms."
So we're done now! We have discussed several possible parallel versions of this algorithm and their efficiency. Unfortunately, we did not have enough time to finalize the implementations and compare them with the JVM-based technologies, or with XGBoost's performance.
Future work could focus on improving and continuing the implementations and comparing them to the usual tools. Including OpenMP in the parallelization could also be a very interesting approach and lead to better performance.
Another aspect that could be considered is using the fault-tolerance functions of GPI-2, which would ensure better reliability for the application.
Now what?
Now I am going back to Paris for a few days. Full of energy, full of knowledge and with a great will to come back here. This summer has been crazy. In two months I met some incredible people here and I am sooo grateful for all of this.
The first week in Bologna was a great introduction to the HPC world. We learned a lot and got the chance to meet and know more about each other. I really hope we will keep in touch in the future! I’d love to hear about what everyone will be working on, etc.
The TEAAAAAM in Bologna! <3
Then Irèn and I got the incredible chance to work with the CC SAS team here in Bratislava. People are so nice and the city is lovely! I’d like to take this opportunity to thank my mentors and the whole team for their help and kindness (and the trips!).
Living in Bratislava does not only mean living in a lovely city. It also means being in the middle of many other countries and big cities. So I was able to visit Vienna with Irèn (where we actually met three SoHPC participants who came from Ostrava and Ljubljana, and we spent a great weekend together!), then I visited Prague (and fell in love with it) and Budapest!
Five HPC participants in Vienna! (Belvedere Palace) From left to right: Pablo (Ostrava), me, Khyati (Ljubljana), Irèn (Bratislava), and Arsenios (Ljubljana).
Well, well, well. I think that's all I have to say. You should apply and come discover it for yourself! I can promise that you won't regret it.
Check my LinkedIn if you want some more information and drop me a message there!
All the world’s a stage, and all the men and women merely players: they have their exits and their entrances…
William Shakespeare
I admit this is a bit pathetic, but maybe it's the end of these great two months that makes me a bit emotional! However, I think one can describe this summer as a three-act drama, with the acts represented by my blog posts:
The first one was the introduction, where you got to know all important characters of the play.
The second one was the confrontation, where I had to fight against the code to get it to work!
And this is the third one, the resolution, where I’ll reveal the final results and answer all open questions (Okay, probably not all, but hopefully many).
Okay, to be honest, the main reason for seeing it this way is that I can cite Shakespeare in an HPC blog post, which is quite unusual, I believe! But there is some truth in it…
Now it's time to stop philosophizing and start answering questions. To simplify things, I will boil my whole project down to one question:
Is GASNet a valid alternative to MPI?
Allow me one last comment before finally showing you the results: I must mention that what we've done here by replacing MPI with GASNet was quite an unusual way of working with GASNet's active messages. We don't really make use of the "active" nature of GASNet, so one could probably redesign the applications in a way that makes better use of them. But our goal was to find a simple drop-in replacement for MPI. So if the answer to the above question is no (spoiler alert: it will be), then I mean just that specific use-case, not GASNet in general!
Below you see two plots of the performance of two very different applications:
I won't explain all the details of these applications here; for that, check out my video or my final report. There are only two things I want to point out here:
The scaling behaviour: on the x-axis, you see in both plots the number of cores (once with log scale, once linear). If we now increase the computing power via the number of cores, we have to decide whether we want to change the amount of work the application has to do:
If we don't change the amount of work, the runtime will hopefully decrease. In the best case, twice the number of cores results in half the runtime. Here we can see whether our application scales well. This is called strong scaling.
If we scale the amount of work with the number of cores, the time should stay constant; this is called weak scaling. If the time does not stay constant, it is a sign that the underlying network is causing problems. This behaviour you can see above: in the left plot the time is pretty constant, while in the right one it increases a bit at the end.
The communication pattern: this is what we call the kind and number of messages a program sends. The stencil usually sends few but big messages, whereas the graph500 sends many small ones. And here we see the big performance difference: in the first case, MPI and GASNet are almost equal, which is totally different in the second one. We assume the reason for this is a difference in latency: GASNet simply needs longer to react to a message send, which has much more impact with many small messages.
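The two scaling checks described above boil down to simple arithmetic (a sketch; t1 is the baseline runtime, tp the runtime with more cores, and the numbers are made up for illustration):

```python
# Strong scaling: fixed problem size, so the speedup t1/tp should ideally
# equal the factor by which the core count grew.
def strong_scaling_speedup(t1, tp):
    return t1 / tp

# Weak scaling: problem size grows with the cores, so t1/tp should ideally
# stay at 1.0; anything below hints at network/communication overhead.
def weak_scaling_efficiency(t1, tp):
    return t1 / tp

print(strong_scaling_speedup(100.0, 50.0))    # 2.0 on 2x cores: perfect
print(weak_scaling_efficiency(100.0, 110.0))  # ~0.91: some overhead appears
```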
However, GASNet is not able to beat MPI in either of these cases, so in the end we must conclude: it is not worth using GASNet's active messages as a replacement for passive point-to-point communication via MPI. But that doesn't mean there are no cases in which GASNet can be used. For example, according to the plots on the official website, the RMA (Remote Memory Access) component of GASNet performs better.
But it is not as if the project failed because of these results. Besides the fact that I learnt many new things and now feel very familiar with parallel programming, these results could also be interesting for the whole parallel computing community! So I'm leaving with the feeling that this wasn't useless at all!
Goodbye, Edinburgh!
Unfortunately, it's not just my time with GASNet that is ending now, but also my time in Edinburgh. I really fell in love with this city over the last weeks. And it was not so much the Fringe (which is very cool, too), but more the general atmosphere. We also explored the area around the city in the last weeks; it is such a beautiful landscape!
I’m really sad, that this time is over now, but I’m very grateful for this experience. I want to thank all the people who made this possible: PRACE and especially Leon Kos for organising this event, Ben Morse from EPCC for preparing everything in Edinburgh for our arrival, my mentors Nick Brown and Oliver Brown for this really cool project, and last but not least all the awesome people I spent my time with here, mainly my colleagues Caelen and Ebru, but also other young people like Sarah from our dormitory! Thank you all!
Hello, my friends. Today we will learn how to cook the dish shown below, in Python, using PyQt5. I don't know if you see it yet, but below is a dish of pasta with tomato sauce and cheese.
Pasta with tomato sauce and cheese.
Still don't see it? Let's have a closer look at the recipe.
Ingredients
The ingredients will be provided by the (product placement) PyQt5 library. Specifically, we will need from QtWidgets:
1 QApplication (table)
1 QWidget (serving platter)
1 QGridLayout (dish)
3 QPushButton (pasta, tomato and cheese)
and we will need the backends from matplotlib, plus pyplot (which we will call plt for short), the yum-yum numpy as np, and of course sys.
matplotlib.pyplot or just plt (pot)
just a tip from np (salt)
a slice of sys (bread)
Below, you can see a picture of those ingredients.
from PyQt5.QtWidgets import (QApplication, QWidget,
QGridLayout, QPushButton)
import numpy as np
import sys
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.backends.backend_qt5agg import NavigationToolbar2QT as NavigationToolbar
Method
We will first cook the pasta, the base component of our dish. Then we will prepare the tomato sauce, and afterwards our cheese. We will treat all of them in a very delicate way. Finally, we have to combine everything together on our dish. We could leave things that way, but a proper chef takes care of the whole serving procedure, so we will also take care of the dish on its way to the table.
1 Boil water.
First, we have to boil some water. So, let’s grab some pre-boiled water from the fridge and let’s go.
self.figure = plt.figure() # figure out which pot you will use
self.canvas = FigureCanvas(self.figure) # convey water to pot
toolbar = NavigationToolbar(self.canvas, self) # navigate to hob
Just kidding. We grabbed it from the s(h)elf.
2 Get the pasta ready
Now, we will prepare the pasta. Let's choose the desired pasta, and let's think of what we'll do with it!
button_pasta = QPushButton('Pasta') # Choose what pasta we will use
button_pasta.clicked.connect(self.add_pasta) # think of preparation
But what is the actual preparation? Let's have a better look at it.
def add_pasta(self):
# prepare new pot
self.figure.clear()
# add a little bit of salt
chi = np.random.uniform(0, 10, 400)
psi = np.random.uniform(0, 10, 400)
# get boiling water
ax = self.figure.add_subplot(111)
plt.axis((0, 10, 0, 10))
# put pasta in pot and wait
plt.plot(chi, psi, color='gold')
# get pasta ready/dry/drawn
self.canvas.draw()
3 Get tomato sauce ready
Again, we choose the desired tomato sauce and we think about how we will incorporate it into our dish.
def add_sauce(self):
# add just a small tip of salt to the tomato sauce
chi = np.random.uniform(1, 9, 40)
psi = np.random.uniform(1, 9, 40)
# heat olive oil in a pot
ax = self.figure.add_subplot(111)
plt.axis((0, 10, 0, 10))
# add tomato sauce to the heated olive oil
plt.plot(chi, psi, 'o', color='red', ms=35, alpha=0.5)
# get sauce out of pot
self.canvas.draw()
4 Get the cheese ready
Finally, we have to prepare the cheese. It should be very easy for you by now.
The preparation is now simpler, but still very important.
def add_cheese(self):
# taste how salty cheese is but you can't do much about it
chi = np.random.uniform(2, 8, 40)
psi = np.random.uniform(2, 8, 40)
# grab a grate
ax = self.figure.add_subplot(111)
plt.axis((0, 10, 0, 10))
# grate the cheese with a specific grate size
plt.plot(chi, psi, '.', color='yellow', mec='black', mew=0.5)
# get cheese ready in a bowl
self.canvas.draw()
5 Get things on dish
The last part is the dish preparation.
# choose a nice dish
layout = QGridLayout()
# start by adding pasta, tomato sauce and cheese in that order
layout.addWidget(button_pasta, 2, 0)
layout.addWidget(button_sauce, 2, 1)
layout.addWidget(button_cheese, 2, 2)
# now hope that pasta will be hot enough
# and cheese will melt
layout.addWidget(toolbar, 0, 0, 1, 3)
layout.addWidget(self.canvas, 1, 0, 1, 3)
# smell the result
self.setLayout(layout)
6 Serve dish on the table
A proper dish is only good when you have added it to the table. So here, in this final step, let’s add this dish to our table with a slice of bread too, for any customer/friend/partner that wants to try our food!
if __name__ == '__main__':
    # add a slice of bread to the table
    app = QApplication(sys.argv)
    # incorporate the dish recipe as we described it above
    main = PastaWithTomatoSauce()
    # present the dish on a serving plate
    main.show()
    sys.exit(app.exec_())
If you execute this recipe, you will get the following result.
An empty table
Don’t worry, you didn’t spend 10 minutes for nothing, even though you might see just a table. You just have to click the buttons in the order you want to add the ingredients. If you want to add more sauce or cheese, feel free to click multiple times! And voilà!
A dish with only pasta
Pasta with a lot of tomato sauce. I like tomato sauce.
Our full dish
Bon appétit!
Here is the full recipe code.
from PyQt5.QtWidgets import (QApplication, QWidget,
                             QGridLayout, QPushButton)
import numpy as np
import sys
import matplotlib
# for full compatibility we use the specific renderer
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.backends.backend_qt5agg import NavigationToolbar2QT as NavigationToolbar

class PastaWithTomatoSauce(QWidget):
    def __init__(self, parent=None):
        super(PastaWithTomatoSauce, self).__init__(parent)
        # create figure, canvas and toolbar objects
        self.figure = plt.figure()
        self.canvas = FigureCanvas(self.figure)
        toolbar = NavigationToolbar(self.canvas, self)
        # add button objects and
        # connect the click signal to relevant functions
        button_pasta = QPushButton('Pasta')
        button_pasta.clicked.connect(self.add_pasta)
        button_sauce = QPushButton('Sauce')
        button_sauce.clicked.connect(self.add_sauce)
        button_cheese = QPushButton('Cheese')
        button_cheese.clicked.connect(self.add_cheese)
        # create the layout and add all widgets
        # on proper positions
        layout = QGridLayout()
        layout.addWidget(button_pasta, 2, 0)
        layout.addWidget(button_sauce, 2, 1)
        layout.addWidget(button_cheese, 2, 2)
        layout.addWidget(toolbar, 0, 0, 1, 3)
        layout.addWidget(self.canvas, 1, 0, 1, 3)
        # assign the layout to self, a QWidget
        self.setLayout(layout)

    def add_pasta(self):
        # clear the canvas for a new dish
        self.figure.clear()
        # generate random data
        chi = np.random.uniform(0, 10, 400)
        psi = np.random.uniform(0, 10, 400)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color
        plt.plot(chi, psi, color='gold')
        # update canvas with current figure
        self.canvas.draw()

    def add_sauce(self):
        # generate random data
        # tomato should be on pasta so we limit the boundaries
        chi = np.random.uniform(1, 9, 40)
        psi = np.random.uniform(1, 9, 40)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color, marker size, and transparency
        plt.plot(chi, psi, 'o', color='red', ms=35, alpha=0.5)
        # update canvas with current figure
        self.canvas.draw()

    def add_cheese(self):
        # generate random data
        # cheese should be on tomato so we limit the boundaries
        chi = np.random.uniform(2, 8, 40)
        psi = np.random.uniform(2, 8, 40)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color, marker edge color, and width
        plt.plot(chi, psi, '.', c='yellow', mec='black', mew=0.5)
        # update canvas with current figure
        self.canvas.draw()

if __name__ == '__main__':
    app = QApplication(sys.argv)
    main = PastaWithTomatoSauce()
    main.show()
    sys.exit(app.exec_())
It’s hard to believe that I have less than a week left here in Luxembourg. The past few days have been quite busy between filming my presentation, working on my final report, tidying up my code, and fleshing out the documentation but, despite being kept busy, it’s hard not to feel strange about the fact that Summer of HPC is almost over. However, I have been able to make a reasonable amount of progress on my main project. A number of additional experiments have been run to build on the results described in my last post. Many of these have been examining whether my previous findings are also valid when using CPUs instead of GPUs for training (short answer: they are, but everything is quite a bit slower). I’ve also been able to add support for another framework, namely MXNet, to my code, although it doesn’t seem to want to run on more than one node at the moment. However, as my final report will be available on this site soon, I thought I would take this opportunity to review what I have learned over the past two months and encourage anyone who wants to read about all the technical details to have a look at that when it’s available.
Over the course of the last two months I have gained a relatively large amount of experience in deep learning. This is an area in which I had a small amount of prior knowledge, but I feel that the chance to spend a summer working on one substantial project and to apply my knowledge on a machine far more powerful than a standard laptop was extremely beneficial in helping me become more experienced in what is becoming a very important topic. I also feel that I am becoming more proficient with the most common software packages used in this area. Indeed, the fact that my project was based around comparing different frameworks means that I now have at least some experience with a lot of the libraries used in the field. This might be helpful, given that machine learning will probably feature quite a bit in the master’s program I’m starting in less than two weeks.
I also feel that I learned a lot about HPC in general this summer. While the training week in Bologna now seems like a long time ago, the tutorials in MPI and OpenMP, among other topics, were very interesting and informative and I plan to keep practicing and build on my knowledge of this material for any future HPC related projects I may undertake. Working on my project has also helped me to become familiar with the SLURM scheduler and generally get used to thinking about how simple pieces of code have to be adapted to work on a large scale. It’s also quite hard to spend two months developing code on a cluster without picking up a few new Linux and vim keyboard shortcuts. Overall, I feel that my knowledge of HPC has improved substantially this summer and I look forward to any opportunities that may arise to use what I have learned in the future.
View of Echternach cathedral in the distance from near the end of the hike.
Personal Update
Over the past eight weeks, I have had a number of opportunities to explore Luxembourg city, which is a very pleasant place to wander around, with many events happening over the summer. I also had the chance to go on two daytrips to neighbouring countries. The first was to Trier, the oldest city in Germany, birthplace of Karl Marx and home to an impressive amount of Roman ruins for somewhere this far North. The second was to Metz in France, another extremely old town with an extremely large cathedral as well as an impressive modern art gallery. Both trips were very enjoyable and, in general, I felt that this summer was a great opportunity to spend some time in an area which is not exactly a standard tourist destination.
In our last full weekend in the country, Matteo and I spent the day in Echternach, near the border with Germany, walking part of the Mullerthal trail, Luxembourg’s main hiking route. We only managed to do a small section of the trail, which is over 100km long in total, but it was still a very good way to relax and enjoy the good weather after a busy week sorting out some of the final details of our projects.
My time in Amsterdam is coming to an end. Not only did Allison and I explore a lot, but we also learned a lot and are now finalising our projects. So, for one last time, I guess it’s time to tell you what I’ve been up to for the last few weeks.
I want to start off talking a bit about encryption. There are different encryption algorithms and encryption modes. The algorithms I looked at in my project were AES, Twofish, Serpent and CAST5. The differences between them are not too important, so instead, let’s take a look at their most significant similarities:
they all remain unbroken and are therefore secure, and
they are all block ciphers.
The term block cipher refers to the idea of splitting a plaintext that you want to encode into blocks of the same length and encrypting them separately. This is necessary because otherwise, you would always need a key of the same length as the plaintext. If you’re not sure why that is, take a look at the Vernam cipher. By encrypting blocks of a fixed length, you only need a key of that exact length and can use it for all of the blocks. There are different ways of applying a key to a block, i.e. different encryption modes. In my benchmarks, I looked at three such modes, namely ECB, CBC and XTS. The following is a short overview of these modes.
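To make the block idea concrete, here is a toy sketch of my own (not a real cipher; a plain XOR stands in for the actual block encryption): the plaintext is cut into fixed-size blocks and the same short key is applied to each block separately.

```python
# Toy sketch (NOT a real cipher): XOR stands in for the block encryption.
BLOCK_SIZE = 8  # bytes per block

def to_blocks(data: bytes) -> list:
    """Split the plaintext into equal-length blocks, zero-padding the last."""
    padded = data + b"\x00" * (-len(data) % BLOCK_SIZE)
    return [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]

def encrypt_block(block: bytes, key: bytes) -> bytes:
    """Apply the same short key to one block (here: a simple XOR)."""
    return bytes(b ^ k for b, k in zip(block, key))

key = b"8bytekey"  # the key only needs to be as long as one block
ciphertext = [encrypt_block(b, key) for b in to_blocks(b"attack at dawn")]
plaintext = b"".join(encrypt_block(c, key) for c in ciphertext)
print(plaintext.rstrip(b"\x00"))  # b'attack at dawn'
```

Because XOR is its own inverse, applying the key again decrypts each block; a real cipher would use a keyed permutation instead, but the block structure is the same.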
ECB mode
CBC mode
XTS mode
ECB: The Electronic Code Book mode is the simplest and fastest way to encrypt blocks of plaintext. Every block is encrypted separately with the chosen algorithm using the key provided. This can be done in parallel. However, it is highly deterministic: identical plaintexts have identical ciphertexts.
CBC: The Cipher Block Chaining mode is less deterministic, but slower. The encryption is randomised by using an initialisation vector to encrypt the first block. Every subsequent block’s encryption depends on the block before, i.e. also on the initialisation vector. This cannot be done in parallel.
And finally, XTS: The XEX-based tweaked-codebook mode with ciphertext stealing uses two keys: The first one is itself encrypted by an initialisation variable and diffuses both the plaintext and the ciphertext of every block. It is also changed slightly for every block. The second key is used for actually encrypting the block. XTS also uses ciphertext stealing which is a technique for when the data length is not divisible by the block size: The plaintext of the last block is padded with the ciphertext from the previous block. Again, this mode can work in parallel.
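The determinism difference between ECB and CBC can be seen in a few lines. The following is my own toy sketch; again a plain XOR stands in for the real block cipher, so it illustrates only the chaining structure, not real security.

```python
import os

BLOCK = 8

def enc(block, key):
    # toy stand-in for the real block cipher: XOR with the key
    return bytes(b ^ k for b, k in zip(block, key))

def ecb(blocks, key):
    # ECB: every block is encrypted separately, so repeats stay visible
    return [enc(b, key) for b in blocks]

def cbc(blocks, key, iv):
    # CBC: each plaintext block is mixed with the previous ciphertext
    out, prev = [], iv
    for b in blocks:
        c = enc(bytes(x ^ y for x, y in zip(b, prev)), key)
        out.append(c)
        prev = c
    return out

key, iv = os.urandom(BLOCK), os.urandom(BLOCK)
blocks = [b"SAMEBLK!", b"SAMEBLK!"]  # two identical plaintext blocks

print(ecb(blocks, key)[0] == ecb(blocks, key)[1])      # True: ECB leaks repetition
print(cbc(blocks, key, iv)[0] == cbc(blocks, key, iv)[1])  # False: CBC hides it
```

This is exactly why identical plaintexts produce identical ciphertexts under ECB, while the initialisation vector and chaining randomise CBC's output.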
Now you have a basic idea of what to consider when encrypting a disk: Not only are there different algorithms, but also different modes. Since they work quite differently, they might also have a different impact on the performance. To look into this, I set up a simple test environment out of two virtual machines: One acts as the host for the virtual machine to be benchmarked, and one acts as an NFS-Server which contains the encrypted disk. For more details on this, I suggest you have a look at my final report.
To benchmark the setup, I transcoded a 17-minute-long video of a German news programme. I did this on eight differently configured virtual disks: seven that were each encrypted differently, and one that was not encrypted. In every configuration, five runs were performed. I plotted the results in the following two graphs, and the configurations were:
Configuration No. | cipher  | mode | key size | hash function
1                 | None    | None | None     | None
2                 | AES     | ECB  | 256      | SHA256
3                 | AES     | XTS  | 256      | SHA256
4                 | AES     | CBC  | 256      | SHA256
5                 | AES     | CBC  | 128      | SHA256
6                 | Twofish | CBC  | 256      | SHA256
7                 | Serpent | CBC  | 256      | SHA256
8                 | CAST5   | CBC  | 128      | SHA256
Here you can see how the encryption impacted the user performance. The non-encrypted setup is the green bar at the bottom. The impact of encryption on the user time is very low; the time needed is only increased by 0.5 to 3 percent.
User performance in test setup when benchmarking the transcoding of a video.
Things look different in system time: The performance is decreased by 13-30%. While this sounds like a lot, it is not completely crazy; if you really need to secure your data, you’ll be okay with it. Also, the system time takes up only a small part of the real time.
System performance in test setup when benchmarking the transcoding of a video.
Looking back at the process, I can say it is fairly simple to set up an encrypted disk for a virtual machine. The benchmark results show that the encryption does impact the performance, but only to a moderate extent. Also, they suggest that performance demands can be balanced with security demands since different configurations of the encryption lead to different timings. So basically, yes, you can have it all!
Sadly, this is my last post for the Summer of HPC. It’s been a great two months! Today, I’d like to talk about the final results of my project, namely the use of my animation framework for the coastal defence/wave demonstration, and the web framework I created for the simulation. First, a video of the coastal defence simulation in action!
Coastal Swapping
Haloswapping in coastal defence simulation
The main operation used in the coastal defence demonstration is what is known as a “haloswap”. This is an operation at a slightly higher level of abstraction than typical MPI communications, but is very simple. To start, you need a grid, or matrix, or array – whatever terminology you prefer.
In the wave simulation, you need to split this grid across many different Pis to run a simple differential equation solver on it (which I won’t go into here). So far, so good. In the video above, we split this grid into 16 equal horizontal strips and give one strip to each Pi (the output from the simulator shows slightly different vertical lines). But how does a wave move from the top of the simulation to the bottom? There is some kind of local relationship in the grid that means changes in one area affect those nearby. How do we pass this local information to other Pis?
The size of this local area of influence is known as a halo, and as I’m sure you’ve guessed from the video, we simply swap the halos between the Pis, known as a “haloswap”. Under the hood, this is a simple send and receive both ways, of an array of numbers (a buffer of data). Once the local information is properly exchanged, the simulation can transfer waves between strips. Wonderful!
Diagram showing a haloswap
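For a feel of what happens under the hood, here is a minimal serial sketch of my own: on Wee Archie the exchange is done with MPI sends and receives between Pis, whereas here every “Pi” is just a list entry holding a strip with one halo row on each side.

```python
import numpy as np

def haloswap(strips):
    """Exchange one-row halos between vertically adjacent strips.

    Each strip has shape (rows + 2, cols): row 0 is the top halo and
    row -1 is the bottom halo. In the real simulator this is a send
    and receive both ways; here it is a plain serial copy.
    """
    for upper, lower in zip(strips, strips[1:]):
        upper[-1, :] = lower[1, :]   # receive the neighbour's top edge
        lower[0, :] = upper[-2, :]   # receive the neighbour's bottom edge

# 4 strips of a 16x8 grid: 4 interior rows each, plus 2 halo rows
strips = [np.zeros((6, 8)) for _ in range(4)]
strips[0][4, :] = 1.0   # a "wave" on the bottom interior row of strip 0
haloswap(strips)
print(strips[1][0, 0])  # 1.0: the wave has reached strip 1's halo
```

After the swap, each strip's solver can read its halo rows as if the neighbouring data were local, which is what lets waves cross strip boundaries.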
This does have to happen after every iteration (tick of time), so many thousands of haloswaps will occur during the simulation. This would be absolutely impractical to visualise, so there needs to be some kind of limitation on what is shown. For the ease of getting a nice visual output in this project, I’ve just shown every 100th haloswap, a number which would vary depending on the speed of the simulation. Better solutions exist: you could, for example, compare the speed of the simulation against the display rate of the visualisation and sample haloswaps at the correct rate to keep the two in sync, but this is a complex operation given how little it offers in return.
In order to show the haloswap, I added a custom function to my Wee MPI framework (now available in Fortran, as that’s what this and many other HPC simulations are written in), which allows a user to play any named animation on the server, allowing lots of easy flexibility for custom animations such as that shown above.
A Web Interface?
In my previous post, I discussed the creation of a web framework for running simulations. To recap on the current situation (the framework used for the current version of the wave simulator), all computations are managed by a web server (using Flask) on the top left Raspberry Pi. This Pi dispatches simulations and collects results using MPI, and then continuously returns these results over the network. Currently the client is also a Python application with an interface made using WxWidgets, but this doesn’t have to be the case!
I’ve ported the network code from the client’s Python to JavaScript, developed as a NodeJS package – meaning I can now start and download simulations in the browser. I used React to build the user interface around this and the website it’s embedded in (optimised and built by Gatsby). With this, I’m able to start all of my MPI tutorial demonstrations from the webpage directly, and in theory, also the wave demonstration.
Homepage
MPI Tutorial
Coastal Defence Demonstration
The website is mainly functional – I didn’t spend long perfecting looks. As the focus of my project is the visualisation of the parallel communications, I didn’t want to spend too long on web development! I developed a simple version of the whole site, with placeholders in place of some things. For example, while network communications are fully implemented for the MPI demonstrations, handling data in the browser is not the same as in Python!
As a result, I wouldn’t be able to fully re-implement the wave simulation without a bit more work than I had time for, though you can see an initial version, which allows you to place blocks and shows the benefit of using a webpage as the user interface. Many further features could be added from here, such as a detailed control panel for Wee Archie operable by a student or non-HPC expert, further tutorials and demonstrations, even a virtual version of Wee Archie – WebGL is getting pretty fast nowadays.
Alternatives include a more modern UI framework than WxWidgets, but that would demand more specialised knowledge from the average Wee Archie developer than web development does. I’ll leave the final decision on which path to take up to the maintainers of Wee Archie here at EPCC!
Goodbye!
I’d like to thank everybody here at EPCC and PRACE for such a wonderful experience, in particular my mentor, Dr. Gordon Gibb. I’ve learned a lot, from MPI to video editing in these two months, and I’m sure that I’ll never program in serial again if I can help it.
Before too long, all of my code will be available on the Wee Archie Github repository. If you’d like to see my final video presentation, I’ve embedded it below. My final report will be available on this site also. If you’d like to follow my work in future, check out my website!
I hope you’ve enjoyed my blog posts, goodbye for now!
In my last blog post I introduced the Intel Movidius Neural Compute Stick and sketched out a rough idea of what it is good for. In this post, I would like to build on that and describe another very important component that accompanies the Neural Compute Stick: the OpenVino toolkit.
To make use of the Neural Compute Stick, it is necessary to first install the Open Visual Inferencing and Neural Network Optimization (OpenVino) toolkit, which I briefly introduced in my last post. This toolkit aims at speeding up the deployment of neural networks for visual applications across different platforms, using a uniform API.
I also touched in my last blog post on the difference between training a Neural Network and deploying it. To recap, training a Deep Neural Network is like taking a “black box” with a lot of switches and tuning those for as long as it takes for this “black box” to produce acceptable answers. During this process, such a network is fed with millions of data points. Using these data points, the switches of the network are adjusted systematically so that it gets as close as possible to the answers we expect. This process is computationally very expensive, as data has to be passed through the network millions of times. GPUs perform very well on such tasks and are the de facto standard when it comes to the hardware used to train large neural networks, especially for tasks such as computer vision. As for the frameworks used to train Deep Neural Networks, there are many that can be utilized, like Tensorflow, PyTorch, MxNet or Caffe. All of these frameworks yield a trained network in their own inherent file format.
After the training phase is completed, a network is ready to be used on yet unseen data to provide answers for the task it was trained for. Using a network to provide answers in a production environment is referred to as inference. In its simplest form, an application will just feed a network with data and wait for it to output the results. However, there are many steps in this process that can be optimized, which is what so-called inference engines do.
Now where does the OpenVino toolkit fit in concerning the two tasks of training and deploying a model? The issue is that algorithms like Deep Learning are not only computationally expensive during the training phase, but also upon deployment of a trained model in a production environment. Ultimately, the hardware that an application utilizing Deep Learning runs on is of crucial importance. Neural Networks, especially those used for visual applications, are usually trained on GPUs. However, using a GPU for inference in the field is very expensive and doesn’t pay off in combination with an inexpensive edge device. It is simply not viable to use a GPU that might cost a few hundred dollars to build a surveillance camera or a drone. When Deep Neural Networks are used in the field, most of them actually run on CPUs. There are many platforms out there using different CPU architectures, which of course adds to the complexity of developing an application that runs on a variety of these platforms. That’s where OpenVino comes into play: it provides a unified development framework that abstracts away all of this complexity. All in all, OpenVino enables applications utilizing Neural Networks to run inference on a heterogeneous set of processor architectures.
The OpenVino toolkit can be broken down into two major components, the “Model Optimizer” and the “Inference Engine”. The former takes care of the transformation step to produce an optimized Intermediate Representation of a model, which is hardware agnostic and usable by the Inference Engine. This implies that the transformation step is independent of the hardware that the model will eventually run on; it depends solely on the model to be transformed. Many pre-trained models contain layers that are important for the training process, such as dropout layers.
These layers are useless during inference and might increase the inference time. In most cases, these layers can be automatically removed from the resulting Intermediate Representation.
Depiction of a workflow utilizing the Model Optimizer and Inference Engine.
Even more so, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with just one. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time. This Intermediate Representation comes in the form of an XML file containing the model architecture and a binary file containing the model’s weights and biases.
Locality of different software components when using the OpenVino toolkit in combination with an application.
As shown in the picture above, we can split the locality of an application and the steps that utilize OpenVino into two parts. One part is running on a host machine, which can be an edge device or any other device “hosting” an accelerator such as the Neural Compute Stick. Another part is running on the accelerator itself.
After using the Model Optimizer to create an Intermediate Representation, the next step is to use this representation in combination with the Inference Engine to produce results. This Inference Engine is, broadly speaking, a set of utilities that allow Deep Neural Networks to run on different processor architectures. This way a developer is capable of deploying an application on a whole host of different platforms, while using a uniform API. This is made possible by so-called “plugins”. These are software components that contain complete implementations of the inference engine for a particular Intel device, be it a CPU, FPGA or a VPU as in the case of the Neural Compute Stick. Eventually, these plugins take care of translating calls to the uniform API of OpenVino, which are platform independent, into hardware-specific instructions. This API encompasses capabilities to read in the network’s Intermediate Representation, manipulate network information and, most importantly, pass inputs to the network once it is loaded onto the target device and get outputs back again. A common workflow using the inference engine includes the following steps:
Read the Intermediate Representation – Read an Intermediate Representation file into an application which represents the network in the host memory. This host can be a Raspberry Pi or any other edge or computing device running an application utilizing the inference engine from OpenVino.
Prepare input and output format – After loading the network, specify input and output dimensions and the layout of the network.
Create Inference Engine Core object – This object allows an application to work with different devices and manages the plugins needed to communicate with the target device.
Compile and Load Network to device – In this step a network is compiled and loaded to the target device.
Set input data – With the network loaded on to the device, it is now ready to run inference. Now the application running on the host can send an “infer request” to the network in which it signals the memory locations for the inputs and outputs.
Execute – With the input and output memory locations now defined, there are two execution modes to choose from:
Synchronously to block until an inference request is completed.
Asynchronously to check the status of the inference request while continuing with other computations.
Get the output – After the inference is completed, get the output from the target device memory.
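Put together, the steps above look roughly like the sketch below. This is my own hedged illustration: exact function names vary between OpenVino releases, the model paths, image and device name are placeholders, and the import sits inside the function because the toolkit and a target device are required to actually run it.

```python
def run_inference(model_xml, model_bin, image, device="MYRIAD"):
    """Sketch of the inference workflow described above.

    model_xml/model_bin are the Intermediate Representation files and
    image is a preprocessed array; treat the API names as approximate,
    as they differ between OpenVino toolkit versions.
    """
    from openvino.inference_engine import IECore  # requires OpenVino installed

    ie = IECore()                                              # step 3: Core object
    net = ie.read_network(model=model_xml, weights=model_bin)  # step 1: read IR
    input_blob = next(iter(net.input_info))                    # step 2: I/O layout
    output_blob = next(iter(net.outputs))
    exec_net = ie.load_network(network=net, device_name=device)  # step 4: load
    result = exec_net.infer(inputs={input_blob: image})        # steps 5-6: sync infer
    return result[output_blob]                                 # step 7: get output
```

An asynchronous variant would use a start-async/wait pattern instead of the blocking infer call, letting the host continue with other computations in the meantime.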
After laying the foundations for working with the Intel Movidius Neural Compute Stick and the OpenVino toolkit in this and my last blog post, I will elaborate on what the “dynamic” in my project title stands for. So stay tuned for more!
First of all, I wanted to share the final presentation of my project, in which some of the concepts mentioned in previous articles are explained in a more visual way.
After two articles in which the basics of aerodynamics and CFD in HPC systems were discussed and a presentation in which everything was wrapped up, now it is the perfect occasion to build on top of all that and take a different approach. This article will get more technical than the previous ones and will basically cover how to have a good start in CFD for Formula Student (or any other similar project).
Motivation: Reproducibility of the Results
One of the motives for writing this article is to make it easier for anyone wishing to perform a study similar to the one I developed during this summer. The details regarding the scripts and modified pieces of code used throughout the project will not be shared here although, if someone is interested, they can be provided. This means that the article only accounts for the geometry preparation, not the CFD setup itself, for which information is available in many blogs such as CFD Online.
The list of software used along the project contains:
SolidWorks to handle the CAD files
Oracle VM VirtualBox Manager to set up a Linux-Ubuntu Virtual Machine
OpenFOAM to perform all the CFD meshes and simulations
ParaView and EnSight to carry out the post-processing
How to Start? Pre-processing of the Geometry
The base for obtaining an accurate enough result is building an adequate mesh. But an adequate mesh can never be achieved without a good pre-processing of the solid (car) geometry.
The starting point of the project was a complex CAD version of the TU Ostrava Formula Student car in STP format. However, it must first be acknowledged that CAD formats cannot be used as an input to generate a mesh: what the meshing algorithm needs is a cloud of points defining a surface. CAD files present two problems: (1) they do not contain a set of points, but rather a set of operations and mathematical functions that define the geometry of the solids; and (2) they contain unimportant information, such as the material or colour of the different parts, that is meaningless for the CFD process.
The geometry of the car should, therefore, be converted into a more suitable format such as STL. This was achieved by importing the STP file into SolidWorks and then exporting it as an STL file. The exporting process is also important:
The small details that are not relevant to the solution should be removed. Overall, the geometry must be carefully cleaned. This includes all kind of bolts, nuts and screws in the car, for example. It is true that they can have a small effect on the aerodynamic performance of the car, but it is not significant enough so as to compensate for the associated increase in computational time/resources.
The file must be exported as an ASCII file type due to functionality reasons.
The reference coordinates system must be chosen carefully: it will define the absolute coordinates of the file and its location in the mesh.
The exporting quality should be high enough. Otherwise, it will be possible to observe even at first sight that the geometry is not continuous. The mesh will never improve a bad STL surface quality; at most, it will be able to match it. To get this right, one must analyse the maximum accuracy that will be possible to achieve on the surfaces of the body in the meshing process, and then generate the surface file from the CAD in such a way that the characteristic size of its elements is smaller than that of the smallest mesh elements (so that the geometry is not the limiting factor), but not much smaller (to avoid a very large geometry file that would slow down the whole process). It could take several iterations to find the sweet spot.
The geometry assembly (in this case the car) should not be saved as a single STL file, since this would make it impossible to analyse afterwards the forces created at each of the parts. The best approach is, therefore, to save the geometry in as many parts as there are relevant subdivisions in the geometry. In the formula car case, the geometry was split into the following parts:
Front wheels
Rear wheels
Suspension assembly
Chassis assembly
Engine
Driveshaft
Driver
Front wing
Rear wing
Underbody
Splitting the geometry into different parts implies that they have to be merged to be input as a single file (the internal split into the different parts will be kept). This is done via a merging bash script.
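As an illustration of that merging step: ASCII STL files are just text, where each part is a `solid <name> … endsolid <name>` section, so a multi-part file can be produced by plain concatenation. The sketch below is my own (the part file names are made up for the demo; the real script operates on the exported part files) and creates two tiny parts before merging them:

```shell
#!/bin/sh
# Demo parts: in the real workflow these come from the SolidWorks export.
printf 'solid front_wing\nendsolid front_wing\n' > front_wing.stl
printf 'solid rear_wing\nendsolid rear_wing\n' > rear_wing.stl

# Merge: concatenation keeps the internal split into named solids.
out=car.stl
: > "$out"
for part in front_wing.stl rear_wing.stl; do
    cat "$part" >> "$out"
done
grep -c '^solid' "$out"   # counts the named solids in the merged file
```

Because each solid keeps its own name, the post-processing tools can still report forces per part after the merge.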
Once all these steps have been fulfilled, the STL file is ready to be used.
Conclusions
If these steps are followed carefully, a base will have been built to start performing CFD simulations.
The next step would be following one of the numerous OpenFOAM tutorials to install the software. Subsequently, one can start running actual simulations and adjusting their parameters to improve the modelling of the problem up to the moment in which a satisfactory level of accuracy is reached. This article is just an introduction, but more details can be found both in the final presentation and the final report of the project.
The time has come to talk about parallelization and MPI, the heart of our topic and my project. First, the problem the parallel Radix Sort can solve will be presented. Then, we will see the algorithm before talking about my implementation. Lastly, it will be interesting to expose performance results.
Problem the parallel Radix Sort can solve
Every day we collect more and more data to be analyzed. With today’s large-scale data and large-scale problems, parallel computing is a powerful ally and important tool. While processing data, one recurrent step is sorting. In my second blog post, I highlighted the importance of sorting algorithms. Parallel sorting is one of the important components of parallel computing. It allows us to reduce the sorting time and to sort more data, amounts that can’t be sorted serially. Indeed, when we want to sort a huge amount of data, it may not fit on one single computer, because computers are memory bound. Or it may take too much time using only one computer, time we can’t afford. Thus, for memory and time reasons, we can use distributed systems, have our data distributed across several computers, and sort them that way. But how? Let’s first define properly the context and what we want.
What we have
P resources able to communicate among themselves, and to store and process our data. They can be CPUs, GPUs, computers… Each resource has a unique rank between 0 and P-1.
N data items, distributed equally across our resources. This means that each resource holds the same amount of data.
The data are too huge to be stored or processed by only one resource.
What we want
Sort the data across the resources, in a distributed way. After the sort, resource 0 should hold the lowest part of the sorted data, resource 1 the next part, and so on.
We want to be as fast as possible and to consume as little memory as possible on each resource.
Like all distributed and parallel systems, we have to be careful about communication between the resources. It has to be kept to a minimum to make parallel algorithms efficient; otherwise, we spend too much time communicating instead of treating the problem itself. Through communication, the resources exchange data and information, and the more data we exchange, the longer it takes.
There are plenty of parallel sorts, among them sample sort, bitonic sort, column sort, etc. Each of them has its pros and cons, but as far as I know, not many satisfy all of our requirements above. Often, at one point or another, they need to gather all the data, or a huge part of it, on one resource to be efficient; this is not suitable. They can be adapted to avoid gathering the data on one resource, but most of the time they are then no longer efficient enough, because of the communication overhead involved. The parallel Radix Sort is one of those that meet our requirements while staying efficient. It is currently known as the fastest internal sorting method for distributed-memory multiprocessors.
Now, we will use the word processor instead of resource because it is the word often used in HPC.
Parallel Radix Sort
I recommend reading my two previous blog posts, where I detailed the serial Radix Sort, because the parallel version is entirely based on it. So if you don't know the Radix Sort, it is better to read them first. The notation used here was introduced in those posts.
In general, parallel sorts consist of multiple rounds of serial sort, called local sort, performed by each processor in parallel, followed by movement of keys among processors, called the redistribution step. Local sort and data redistribution may be interleaved and iterated a few times depending on the algorithms used. The parallel Radix Sort also follows this pattern.
We assume that each processor holds the same amount of data; otherwise, the processors' workload would be unbalanced because of the local sort. Indeed, if one processor had more data than the others, it would take longer to finish its local sort and the total sorting time would be greater. If the data are equally distributed, all processors take about the same time for their local sort, none of them has to wait for another to finish, and the total sorting time is shorter.
The idea of the parallel Radix Sort is, for each key digit, to first sort the data locally on each processor according to the digit value; all the processors do this concurrently. Then we compute which processors have to send which portions of their local data to which other processors, so that the distributed list is sorted across processors according to that digit. After iterating up to the last key digit, the distributed array is sorted as we want. Below is the parallel algorithm.
Input: rank (rank of the processor), L (portion of the distributed data held by this processor)
Output: the distributed data sorted across the processors with the same amount of data for each processor
1. for each key digit i, from the least significant digit to the most significant digit:
2. use Counting Sort or Bucket Sort to sort L according to the i-th key digit
3. share information with the other processors to figure out which local data to send where, and what to receive from which processors, so that the distributed array is sorted across processors according to the i-th key digit
4. carry out these exchanges of data between processors
Each processor runs the algorithm with its rank and its portion of the distributed data. This is a “high-level” parallel Radix Sort algorithm: it describes what to do but not how to do it, because there are many ways of doing steps 3 and 4, depending on parameters like the architecture, the environment and the communication tool used. Let's go through an example, and then I will explain how I chose to implement it.
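The steps above can be sketched in a few lines of Python, simulating the P processors as plain lists (this is an illustration of the logic only, not my actual C++/MPI implementation; the redistribution step is simulated by a stable global merge and an equal re-split):

```python
def digit(key, i, base=10):
    """Value of the i-th digit (0 = least significant) of key."""
    return (key // base**i) % base

def parallel_radix_sort_sim(chunks, n_digits, base=10):
    """Toy simulation of the parallel LSD Radix Sort.

    chunks[p] plays the role of processor p's local data. Steps 3-4
    (the redistribution) are simulated by concatenating the locally
    sorted chunks in rank order, stable-sorting by the current digit,
    and re-splitting equally, which respects the processor order.
    """
    P = len(chunks)
    size = len(chunks[0])          # same amount of data per "processor"
    for i in range(n_digits):
        # step 2: each processor sorts locally (stable) by the i-th digit
        chunks = [sorted(c, key=lambda k: digit(k, i, base)) for c in chunks]
        # steps 3-4: redistribute so the distributed list is globally
        # ordered by the i-th digit (stable, ranks visited in order)
        merged = sorted((x for c in chunks for x in c),
                        key=lambda k: digit(k, i, base))
        chunks = [merged[p * size:(p + 1) * size] for p in range(P)]
    return chunks
```

After the last iteration, concatenating the chunks in rank order yields the fully sorted array, with each "processor" still holding the same amount of data.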
Figure 1. Unsorted versus sorted distributed array across three processors. Our goal with the parallel Radix Sort is to get the sorted one from the unsorted. Source: Jordy Ajanohoun
We start with the distributed unsorted list above to run our parallel Radix Sort. P equals 3 because there are 3 processors, and N equals 15. We will run the example in base 10 for simplicity, but don't forget that we can use any number base we want; in practice, base 256 is used, as explained in my previous posts. Also for simplicity, we deal with unsigned integer data and the sorting keys are the numbers themselves. To sort signed integers and other data types, please refer to my previous article.
Figure 2. First iteration of the parallel Radix Sort on a distributed array across three processors. The numbers are sorted according to their base 10 LSD (Least Significant Digit). The first step (local sorting) is done concurrently by all the processors. The second step (data redistribution) can be computed and managed in several ways depending on the architecture, environment and communication tool used. One iteration according to their base 10 MSD is remaining to complete the algorithm and get the desired final distributed sorted array. Source: Jordy Ajanohoun
The keys with fewer digits (2, 1 and 5) have been extended to the same number of digits as the longest key in the list (two here) by adding zeros as MSDs; this doesn't change their values (if you don't understand why, please refer to my second article). The challenging part when implementing is the redistribution step: for a given processor, we have to think about what information from the other processors is required to figure out where to send the local data. It is not complicated. We want the distributed array to be sorted according to one particular key digit (the i-th digit), whose value in our example is between 0 and 9. Since we have sorted the data locally on each processor, each processor knows how many of its keys have their i-th digit equal to a given digit value. By sharing this information with the other processors, and receiving theirs, we can determine which data each processor has to send and receive. In the example above, all the processors (from rank 0 to P-1) first send their local data whose key LSD equals 0 to processor p0, until p0 can't receive any more data because it is full. There are no such data, so we continue with the next digit value, which is 1. Processor p0 keeps its own data whose key LSD equals 1, then receives those from processor p1, and finally those from processor p2 (which has none). The processor order really matters and has to be respected. Once done, we repeat the same with the value 2, then 3, and so on until 9. When p0 is full, we continue by filling p1, and so on. Careful: the redistribution step has to be stable too, like the sort used in each iteration of the Radix Sort, and for the same reasons. This is why we said that the order matters and has to be respected; otherwise it doesn't sort.
The algorithm is still correct if we swap the local sort step and the data redistribution step. In practice, however, this is not suitable: to send data efficiently, we usually need it to be contiguous in memory, and two items with the same i-th digit will most probably be sent to the same processor, so it pays to sort the data locally beforehand.
We still have one iteration to do according to the algorithm. Let’s finish.
Figure 3. Last iteration of the parallel Radix Sort on a distributed array across three processors. The numbers are sorted according to their base 10 MSD (Most Significant Digit). The first step (local sorting) is done concurrently by all the processors. The second step (data redistribution) can be computed and managed in several ways depending on the architecture, environment and communication tools used. At the end, the algorithm is completed and the distributed array is sorted. Source: Jordy Ajanohoun
We have done the same thing but according to the last digit this time. The algorithm is now finished and, good news, our distributed array is sorted as we wanted.
I have presented here the parallel LSD Radix Sort, but, as in the serial case, there is a variant called parallel MSD Radix Sort, which is the parallelization of the serial MSD version. I implemented both to sort the int8_t data type, and my performance results were better with the LSD version. This is why I continued with the LSD version when generalizing to other integer sizes, and it is also why I have focused on the LSD version from the beginning without going into much detail on the MSD.
My implementation
MPI is used for communications between processors.
I used the Counting Sort and not the Bucket Sort, because my Bucket Sort implementation carried extra overhead due to memory management. Indeed, unless we make a first pass through the keys to count them before moving them into the buckets, we don't know the length of the buckets in advance. Therefore, each of my buckets was a std::vector, and although std::vector is very well implemented and optimized, I still lost performance to memory management. The problem is absolutely not the std::vector class itself; it comes from the fact that on each processor, each bucket has a different size, depending on the key characteristics, and we can't predict them. So, instead of making a first counting pass to find out the length of each bucket and create them with appropriate sizes, I opted for the Counting Sort, which in the end is almost the same: we also count before moving the data, but instead of buckets we use a prefix sum.
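To make the prefix-sum idea concrete, here is a Python sketch of a stable counting sort by one byte (my real code is C++, so this only mirrors the idea): the counts play the role of the buckets' sizes, and the prefix sum gives each key's destination slot directly:

```python
def counting_sort_by_byte(keys, b):
    """Stable counting sort of `keys` by byte b (0 = least significant).

    One counting pass over the keys, a prefix sum over the 256 counts,
    then one placement pass: no buckets, no dynamic allocations.
    """
    counts = [0] * 256
    for k in keys:
        counts[(k >> (8 * b)) & 0xFF] += 1
    # prefix sum: offsets[v] = index where the first key with byte v goes
    offsets = [0] * 256
    for v in range(1, 256):
        offsets[v] = offsets[v - 1] + counts[v - 1]
    out = [0] * len(keys)
    for k in keys:
        v = (k >> (8 * b)) & 0xFF
        out[offsets[v]] = k
        offsets[v] += 1
    return out, counts
```

The counts returned here are exactly the per-processor local counts that the redistribution step needs.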
For step 3 of the algorithm, I save the local counts from the Counting Sort on each processor and share them with the other processors via MPI_Allgather. This way, each processor knows how many keys with their i-th byte equal to a given value there are on each processor, and from that it is easy to figure out where to send which data, as explained in the example of the Parallel Radix Sort section above.
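A small Python sketch of that bookkeeping (an illustration of the logic, not my C++ code; the function name is mine): once every processor has all the counts, each (rank, digit value) block gets a global start index, and dividing a global index by the local array size tells you the destination processor.

```python
def block_starts(all_counts, local_size):
    """all_counts[p][v] = number of keys on processor p whose current
    digit equals v (what MPI_Allgather provides to everyone).

    Blocks are laid out value by value and, within a value, rank by
    rank: this ordering is what makes the redistribution stable.
    Returns the global start index of each (rank, value) block.
    """
    P = len(all_counts)
    R = len(all_counts[0])        # number of digit values (256 in practice)
    start = {}
    g = 0
    for v in range(R):
        for p in range(P):
            start[(p, v)] = g
            g += all_counts[p][v]
    return start
```

For example, processor p's block of keys with digit v goes to global indices start[(p, v)] onward, so its first destination processor is start[(p, v)] // local_size.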
Step 4 is managed using MPI one-sided communication instead of send and receive, which are two-sided. I tried both, and most of the time performance was either similar or better with one-sided communication. MPI one-sided communication is more efficient when the hardware supports RMA operations. We can use it in step 4 of the parallel Radix Sort because we don't need synchronization for each data movement, only once they are all done: the data movements in step 4 are totally independent and can be made in any order, as long as we know where each element has to go.
In terms of memory, on each processor I use a static array (std::array) of 256 integers for the local counts. In addition, I keep a “duplicate” of the input array: the original local array is used to receive the data from the other processors in step 4, so that, at the end, the array is sorted in the same input buffer, while the copy stores the locally sorted data to be sent to the right processors. It is possible to implement this without duplicating the array, but it would require a huge number of communications and synchronizations, and the communications would no longer be independent. In my opinion, that is too much time lost for the memory gained.
As said previously, there are several ways of doing steps 3 and 4. It is also possible, for example, to build the global counts across processors (for step 3) using other MPI collective communications such as MPI_Scan. To send the data in step 4, we could also use MPI_Alltoallv instead of one-sided communication, but that requires sorting the data locally again after receiving. I tried several alternatives, and what I have described here is the one that gave me the best performance.
Performance analysis
As hardware, I used the ICHEC Kay cluster to run all the benchmarks presented in this section. The framework used to measure the execution times is Google Benchmark. The numbers to sort are generated with std::mt19937, a Mersenne Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits. For each execution-time measurement, I use ten different seeds (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73), chosen randomly, to generate ten different arrays of the same length to sort. The mean of the execution times is then recorded as the execution time for that length. I proceed this way because the execution time also depends on the data, so averaging gives a more accurate estimate.
Strong scaling
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
Figure 4. Sorting time of 1GB int8_t as a function of the number of processors. dimer::sort is my parallel Radix Sort implementation. The execution times given using boost::spreadsort and std::sort are for one processor only. They appear here just as a reference to see the gain we have with dimer::sort. They are not parallel algorithms, therefore, strong scaling doesn’t make sense for them contrary to dimer::sort. The perfect scalability curve represents the best execution time we can reach for each number of processors. Theoretically, it is impossible to do better than this perfect scalability. It is defined as the best serial execution time for the problem, divided by the number of processors.
We can see that my implementation holds up well: in this case it is faster than std::sort and boost::spreadsort whatever the number of processors used, and it stays very close to perfect scalability. My implementation is faster than boost::spreadsort even in serial (with only one processor) because, when the array is not distributed, I use ska_sort_copy instead of boost::spreadsort. ska_sort_copy is a great and very fast implementation of the serial Radix Sort; it is actually the fastest I have found, which is why it is my reference for computing the perfect scalability.
The problem with the previous graph is the scale: it is difficult to distinguish the perfect scalability and dimer::sort curves because their times are very low compared to the std::sort execution time. This is why we often use a log-log scale to plot execution times in computer science.
Figure 5. Exactly the same as Figure 4 but in log-log scale this time.
With a log-log scale we can see better what is happening. Now we can tell that, up to 8 processors, the perfect scalability and dimer::sort curves coincide: up to 8 processors, theoretically, we can't do better than what we already have to sort 1GB of int8_t. Beyond 8 processors this is no longer the case, most probably due to communication overhead as more and more processors are involved. Let's verify this with another graph.
Distribution of execution time
Figure 6. Distribution of dimer::sort execution time as a function of the number of processors. The same 1GB int8_t is sorted but with different numbers of processors. This 1GB int8_t is the same for figures 5 and 4.
Indeed, when the number of processors increases, the communication part in the total execution time increases too. The local sort step (in blue) contains zero communication. All communications are done during the data redistribution step (in orange). These communications are one MPI_Allgather and several MPI one-sided communications.
Until now, we have seen performance results only for int8_t. That is good but limited, because the int8_t value range is too small to be representative of real-world problems and data. What about int32_t, the “normal” integer size?
int32_t, sorting time as a function of the array length
Figure 7. Sorting time as a function of the number of items to sort. The items here are 4-byte integers. The dimer::sort curve shows the execution times of my parallel Radix Sort implementation run with 2 processors.
Figure 8. Same as figure 7 but in log-log scale.
The performance here is also satisfying. Using dimer::sort, we are faster than anything else once the array is big enough, which is what we expect from HPC tools. For small array sizes we don't really expect an improvement, because they can be handled serially easily and quickly enough. Moreover, treating the problem in parallel adds extra steps, so when the problem size is small it is faster serially, without those extra steps. The parallel version becomes faster once these extra steps are a small workload compared to the time needed to sort serially.
Demonstration
Let’s finish by visualizing my implementation! We easily distinguish the two different steps of the algorithm in this visualization.
Visualizing dimer::sort (with 3 processors) using OpenGL
Just a quick info: This blog post was written by Allison Walker and Kara Moraw. Due to a missing plugin, the shared authorship is not displayed correctly, so we’re letting you know like this instead.
** This post is intended for future participants in the SoHPC program, and for anyone planning a visit to Amsterdam. **
SURFsara
SURFsara is a member of the SURF cooperative. It brings together many educational and research institutions in the Netherlands to drive innovation and digitization. SURFsara is an institution that provides a variety of computing services: networks, storage, visualization and, of course, supercomputers.
SURFsara’s supercomputer is called Cartesius. It ranks 455th on the TOP500 list and 158th on the GREEN500 list. It offers Dutch researchers the opportunity to process their data and compute large simulations on national resources, and it is designed to be a well-balanced system for maximum usability.
For the hardware nerds among you, here are the juicy details: Cartesius runs on different Bull nodes with the bullx Linux operating system. The nodes sum up to 47,776 cores and 132 GPUs with a theoretical peak performance of 1.843 Pflop/s. Its power consumption is about 706.00 kW, and in case there is a power cut in Amsterdam, there sits a power generator ready to keep the systems up for 24 more hours. Brace yourself for an insanely high number: This power generator would use 500,000 tons of diesel… Let’s hope they’ll never have to use it!
Cartesius
Cartesius cables
Cartesius Cables
Cartesius Cooling
Back of Cartesius
A supercomputer is insanely loud…
Cartesius Cases
We have had a really wonderful time working here. The people are lovely, the environment is open, and coffee is better than average. Our only complaint: the air conditioning seems to be consistently a few degrees too cold.
Airbnb
Possibly the most important piece of advice that we can offer is: book your accommodation early! Amsterdam is one of the most popular tourist destinations in Europe, and the Summer of HPC program takes place during its busiest months: July and August. For a variety of reasons, our accommodation was not arranged until mid June, and as a result we had to pay a pretty exorbitant price! But be prepared, even an early booking doesn’t mean you will get a great deal. Amsterdam is so popular that hosts can pretty much charge what they want. Also, you can’t really be picky about the location – in our case, we’re lucky to live quite close to the centre in Amsterdam West, but on the other hand, cycling to work takes us 40 minutes every day.
If you know someone in Amsterdam, maybe ask them to look at notice-boards at the University of Amsterdam. It is just across from SURFsara and so is their housing. Maybe you’re lucky and find some student renting out their apartment over the summer.
Bike etiquette
Amsterdam is the fifth most bicycle-friendly city in the world. There are more than 400 km of bike paths, endless parking options and many bike rental companies catering to locals, expats, and tourists alike. Given the distance from our accommodation to SURFsara, the high cost of public transport, and our general eagerness to live this summer as the Dutch do, we chose to cycle to and from work each day (10 km and 40 minutes in each direction). We rented some great little bikes through Swapfiets for less than 20 euro per month. Swapfiets is a great option for anyone visiting the city for more than a few weeks: they provide well-maintained bikes and a repair service that will come to you.
Meet our Swapfiets bikes – Poppy and Sally!
Of course, it’s important to be aware of the rules in Amsterdam. Cyclists rule the roads. They (we) have little tolerance for pedestrians or fellow cyclists who don’t know what they’re doing, so if you don’t learn the rules quick smart you risk getting knocked over:
Keep to the right. Cyclists often pass each other, and you don’t want to be the person holding up all of the people behind you on the bike path.
Abide by the road rules! Even if the locals don’t…Stop at the lights, watch out for trams and tram tracks, don’t cycle on footpaths, signal when turning, make sure you have good lights.
Always lock your bike (twice). Amsterdam is notorious for bike theft, so take the necessary precautions.
The Buienradar weather app was our saviour this summer. Amsterdam is known for its rainy and unpredictable weather, and it's always best to check the forecast before heading out on a ride.
Public Transport
Of course, you want to be a real Amsterdam local so you’ll mostly find your way through the city cycling. It’s fast, it’s cheap, and it’s a nice exercise. Oh, and it’s a very good way of getting rid of your sleepy mood in the morning. But you’ll find that summer days in the Netherlands can be quite rainy. You’ll get used to a light rain, but you should really stay off your bike when it’s pouring! Just take your favourite book and catch a tram.
Public transport in the Netherlands is a bit different from what you might be used to. For trams, buses and metros, you can’t get a ticket based on where you’re going. You have got two options:
Get a one-hour-ticket for 3,20 €. For this, just get in at the front and ask the driver for it. You can only pay by card! 1 € of this amount is a fee because you’re using a paper ticket (shame on you!).
Get an OV-Chipkaart. You can get it at all the bigger stations (e.g. Science Park Station close to SURFsara), and it’s 7,50 €. With the OV-Chipkaart, you check in when you enter your chosen means of transport and check out when leaving. You only pay for the distance travelled, which means at least 1 € less than with a one-hour-ticket because you’re not using a paper ticket. You pay with the OV-Chipkaart itself, it works just like a Prepaid SIM card. You can top it up at a lot of supermarkets and tram stations. For checking into a tram/metro/bus, you need to have at least 4€ credit on it. You can check your credit online. At the end of your stay, you can get the rest of the money you have on it at the NS office of Amsterdam Centraal, as long as it’s less than 30 €. Or just come back within the next five years and use it again!
Trains:
If you opted for an OV-Chipkaart (which we definitely recommend), you can also use it on trains. Beware that to check into a train, you need to have a minimum credit of 20 €. If you buy a one-way or day train ticket instead, you again pay the extra fee of 1 € for using a paper ticket.
Weekend tickets:
If you have friends or family coming to visit for the weekend, these are all good options (depending on what you want to do):
Have them rent a bike. Renting a bike for a day is about 10 € and you can be sure there are at least three different places to do that on the way from your home to the next supermarket.
The multi-day ticket. It can be bought at all metro stations and is valid in all metros, trams and buses in Amsterdam. There are different options but the 72-hours-version costs 19 €.
The Amsterdam & Region travel ticket. It takes you a bit further than the multi-day-ticket, e.g. to the beach or Haarlem. Again, there are different options but the 3-day-version is 36,50 €. It can be purchased in I Amsterdam visitor centres, at Amsterdam Centraal and other places. Just google it.
Exploring A’dam
As mentioned, Amsterdam is one of the most popular European destinations, and with good reason. The city truly is beautiful and culturally very diverse. The museums, sites, and shopping are enough to draw any adventurer to this beautiful town. Below are some of our favourite experiences as Amsterdam expats:
Pride Canal Parade Amsterdam
Canal in Amsterdam
Another canal in Amsterdam
A particularly beautiful house boat in Amsterdam
Amsterdam Pride
Amsterdam Canal Pride Parade
Another canal in Amsterdam
The A’dam tower
The flower market
Explore the canals (if you’re here during the Pride parade be sure to pay a visit!)
Anne Frank house
Rijksmuseum
Van Gogh museum
Stedelijk museum
Food
Look, ‘Dutch’ isn’t necessarily a cuisine that comes to mind when you think about interesting/delicious/exotic dishes. But saying that, we have discovered some delicious treats that are native to the country. Our must-tries are:
Bitterballen: breaded and fried meatballs, often served with mustard.
Herring: a classic Dutch street food, herring can be found all over Amsterdam. Typically served on its own with onion and pickles, or in a sandwich.
Dutch Pancakes: bigger and less fluffy than their American counterparts, these pancakes can be topped with both sweet and savoury toppings. Try out the Pannenkoekenboot (Pancake boat) for 75 minutes of all-you-can-eat pancakes while cruising around Amsterdam.
Croquettes: you can find these in all varieties; meat, cheese, fish…
Cheese (Gouda being the classic Dutch variety)
Rijsttafel: directly translated to rice table, this Indonesian food is traced back to colonial history between the Netherlands and Indonesia. This dish is made up of dozens of small and shareable dishes.
Stroopwafel: two waffles with a gooey caramel centre, mmmmm.
Day trips
Something you’ll notice right away is that Amsterdam gets very busy, especially on the weekends. It’s not the ~800,000 citizens, it’s the 18 million tourists. It’s insane how many tourists roam around the city centre, and that’s why Amsterdam stopped advertising the city as a tourist destination. If you’re looking for a quieter weekend, consider exploring other parts of Holland. We got more suggestions from our colleagues here at SURFsara than we could fit into our two months, just take a look!
Utrecht:
We went to Utrecht on a sunny Sunday to avoid the crowds in Amsterdam. We took a train from Amsterdam Centraal which cost us about 9€ and arrived there within an hour. Something to have in mind about day trips on Sundays: Most of the shops open between 11 and 12. This means that until about lunchtime, the city is sleepy and quiet. If you want to enjoy a nice walk through the centre, just arrive before that, but if you’re out for a shopping trip, don’t bother arriving before 12.
Street in Utrecht
Canal in Utrecht
Trash Whale in Utrecht
After grabbing a coffee at Anne&Max, we stumbled upon the Domtoren, a gothic bell tower of 465 steps, and the Domkerk itself. Around the corner, there is also a cozy little garden. From there on, we just wandered around, but there are two places you should definitely check out: De Zakkendrager is a restaurant with an exquisite Bittergarnituur, a collection of Dutch food. This was our first taste of Dutch food, and we loved it! Later, we stopped at the famous Dapp Frietwinkel for some Dutch fries (eat them with mayonnaise, that's what makes them Dutch). Allegedly, you can get the best fries in all of the Netherlands there!
The Hague:
The Hague (or Den Haag, as the Dutch call it) is a bit further away than Utrecht. We took a train from Amsterdam Sloterdijk for about 12€ and arrived an hour and a half later. From the station, it’s a quick walk into the centre. Our highlights were the Peace Palace (home to the international law administration) which has two towers that look like they were inspired by the Disney castle, and the Binnenhof and Buitenhof, two historical squares just where the parliament and the Prime Minister of the Netherlands meet. We also enjoyed some herring just in front of Binnenhof – we were told this is where the politicians go for lunch, and we can’t blame them: it was delicious.
Den Haag
Peace Palace Den Haag
Parliament Den Haag
Haarlem:
Haarlem is a lovely town right next to Amsterdam: It’s perfect when you have a friend visiting for the weekend or just aren’t up for a full day trip. You can easily bike to Haarlem, but if you choose to go on a day with winds gusting up to 90km/h like us, you might want to opt for the train instead. The Sprinter from Amsterdam Sloterdijk costs 3,70€ and is there in less than 10 minutes. South from the station is the Nieuwe Gracht where the historical centre begins. East from here you can find a beautiful windmill (you can also enter it, but that’s 5€), and then turn west into the centre. All the surrounding streets and alleys are lovely, quiet, cozy and adorned with greenery. Towering over all the houses is the Grote Kerk (big church), or St. Bavokerk, which was definitely worth entering. It’s a beautiful and very unique cathedral built in the 16th century.
Windmill in Haarlem
View in Haarlem
Another view in Haarlem
Sailing houses in Haarlem
Grote Kerk in Haarlem
Canal in Haarlem
Another canal in Haarlem
Zaanse Schans:
Zaanse Schans is apparently a ‘must do’ when in Amsterdam. After reading about its popularity and crowds, we chose to take a day off work and cycle the 15km to this beautiful little town. It is famous for its historic windmills and distinctive green wooden houses. While there, we visited the wooden clog carving workshop, the cheese factory, and the chocolate workshop. We also stopped into the saw-mill: one of the six still-functioning mills in the town.
While this town was indeed quaint and interesting, be aware that the crowds are very overwhelming. Honestly, our day-trip felt more like a visit to Disneyworld than a visit to a historic Dutch village. It’s worth the visit, but it’s also worth timing your visit to avoid peak crowds (ie. choose a weekday and visit in the evening).
Windmills in Zaanse Schans
Other places worth a visit:
There are a lot more places worth exploring around Amsterdam. Here are a few which we had on our list, but didn't have the time to visit:
The Python programming language, whose main philosophy is “code readability”, entered our lives in 1991. After Python's creator, Van Rossum, decided to pursue his career at Google, Google became a tower of strength for Python, and Python's popularity began to rise steadily. One common question arises in most people's minds: how has Python become so popular although it is slow? If you can't believe it, or don't want to believe it, you should examine the numerical data.
I got caught up in this popularity trend, and when applying to the Summer of HPC program I chose a project where I could use the Python programming language. Which would you say is faster, C or Python? Of course, everyone will say C. But I'm not trying to re-derive a known result; I'm working on how much speed we can gain through code optimization in Python. That's when I got to know NumPy more closely.
in my dream python with numpy
in reality
NumPy (Numerical Python) is a mathematical library that allows us to perform scientific calculations quickly (many people confuse it with a module or framework, but it’s a library). NumPy arrays form the basis of NumPy. They are similar to Python lists, but are more useful in terms of speed and functionality. Also, unlike Python lists, NumPy arrays must be homogeneous: all elements in the array must be of the same data type.
You know that you can write an algorithm in more than one way in a programming language. So I have written this problem in three different ways:
version 1 – Using the Numpy library
version 2 – Not using the Numpy library
version 3 – Using the Numpy library and never using for loop
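To give a feeling for the difference between the versions, here is a toy sketch (not the actual project code) of a Jacobi-style update, loosely in the spirit of the CFD problem, written once with explicit Python loops and once with vectorized NumPy slicing:

```python
import numpy as np

def jacobi_python(psi, niter):
    # Pure-Python version: explicit for loops over the grid cells
    m, n = len(psi), len(psi[0])
    for _ in range(niter):
        new = [row[:] for row in psi]
        for i in range(1, m - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (psi[i-1][j] + psi[i+1][j]
                                    + psi[i][j-1] + psi[i][j+1])
        psi = new
    return psi

def jacobi_numpy(psi, niter):
    # Vectorized version: whole-array slicing, no Python-level loop over cells
    psi = psi.copy()
    for _ in range(niter):
        new = psi.copy()
        new[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                                  + psi[1:-1, :-2] + psi[1:-1, 2:])
        psi = new
    return psi

grid = np.zeros((32, 32))
grid[0, :] = 1.0  # a hypothetical boundary condition
out_py = np.array(jacobi_python(grid.tolist(), 10))
out_np = jacobi_numpy(grid, 10)
print(np.allclose(out_py, out_np))  # both versions give the same answer
```

Both compute the same result; the difference is that the NumPy version pushes the inner loops down into compiled code, which is where the speedup comes from.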
I mentioned the CFD problem in my previous post. We have run many experiments by changing the parameters of this problem. There were 3 different parameters that could affect the speed of the algorithm. You can see the values:
Scale factors of 1, 4, 16, 32 and 128
For each scale factor, two cases: without Reynolds number and with Reynolds number
Let’s imagine you have a working code and now you are producing some results. What do you feel? You are very happy, of course, but you feel the urge to improve your code. It’s always a good idea to parallelize it, and in the meantime you can think about visualization and how to impress your colleagues and friends.
Parallelization
In my previous blog post I wrote about the already existing MPI parallelization of the code. Briefly, we have to solve the eigenvalue equation of the Fock matrix, which has huge dimensions, but the matrix is, fortunately, block diagonal. The blocks are labelled with a k index, and each block is diagonalized by a different process. According to the original project plan, my task would have been the further development of the parallelization to make the code even faster. What have I done instead? I extended the code with a subroutine that performs a complicated computation for a lot of grid points. Yes, the electron density computation made the program run longer. Well, it’s time for parallelization.
We get several electronic orbitals for each k value, and the electron density is computed for each orbital. I decided to implement the MPI parallelization of the electron density computation the same way as it was done for the Fock matrix diagonalization. We use the master-slave system: If we have N processes, the main program is carried out by the Master process (rank=0) that calls the Slave processes (rank=1,2,…,N-1) at the parallel regions and collects the data from the Slaves at the end.
Each process does the computation for a different k value. We distribute the work using a counter variable that is 0 at the beginning. A process does the upcoming k value if its rank is equal to the current counter value. If a process accepts a k value, it increases the counter by 1, or resets it to 0 if it was N-1. This way, process 0 does k=1, proc. 1 does k=2, proc. N-1 does k=N, then proc. 0 does k=N+1, proc. 1 does k=N+2, and so on. The whole algorithm can be described like this:
Master: Hey Slaves, wake up! We are computing electron density! Slaves: Slaves are ready, Master, and waiting for instructions. Master: Turn on your radios, I am broadcasting the data and variables. Slaves: Broadcasted information received, entering k-loop. From now on, the Master and the Slaves do the same: Everybody: Is my rank equal to the counter? If yes, I do the actual k and update the counter; if not, I just wait. At the end of the k-loop: Slaves: Master, we finished the work. Slaves say goodbye and exit. Master: Thank you Slaves, Master also leaves electron density computation.
The first implementation of parallelization
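The counter logic can be sketched without any MPI at all. Here is a small Python simulation (names are mine, not the actual Fortran code) of how the k values end up distributed among the ranks:

```python
def assign_k_values(n_procs, n_k):
    """Simulate the counter-based distribution: every process watches a shared
    counter and takes the next k value when the counter equals its rank."""
    assignment = {rank: [] for rank in range(n_procs)}
    counter = 0
    for k in range(1, n_k + 1):
        assignment[counter].append(k)  # the process whose rank == counter takes this k
        counter = 0 if counter == n_procs - 1 else counter + 1  # wrap around at N-1
    return assignment

work = assign_k_values(n_procs=4, n_k=10)
print(work)  # {0: [1, 5, 9], 1: [2, 6, 10], 2: [3, 7], 3: [4, 8]}
```

This reproduces the pattern described above: process 0 does k=1, process 1 does k=2, and after a full round process 0 picks up k=N+1.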
There is a problem with the current parallelization: you cannot utilize more processors than the number of k values. It is very effective for large nanotubes with lots of k values, but for small systems you cannot use the full potential of the supercomputer.
Next idea: let’s put the parallelization in the inner loop, the loop over the orbitals with the same k. In this case we distribute the individual orbitals among the processes. If we have M orbitals in each block, the first M processes start working on the k=1 orbitals, and the next process gets the first orbital for k=2. This way we can distribute the work more evenly and utilize more processors.
The second implementation
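In the same sketchy spirit, the orbital-level distribution of the second implementation can be simulated like this (again a toy model, not the production code):

```python
def assign_orbitals(n_procs, n_k, n_orb):
    """Distribute individual (k, orbital) pairs round-robin, so that more
    than n_k processes can be kept busy (hypothetical indexing scheme)."""
    assignment = {rank: [] for rank in range(n_procs)}
    for idx in range(n_k * n_orb):
        k, orb = idx // n_orb + 1, idx % n_orb + 1  # flatten (k, orbital) to one index
        assignment[idx % n_procs].append((k, orb))
    return assignment

# With M = 3 orbitals per k and 4 k values, all 12 processes get work:
work = assign_orbitals(n_procs=12, n_k=4, n_orb=3)
print(all(len(v) == 1 for v in work.values()))  # True
```

With the first scheme, 8 of these 12 processes would have sat idle.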
I tested the parallelization on a nanotube model that has 32 different k values. The speedup as a function of the number of processors is shown in the figure below. The blue points stop at 32, because we cannot use more processors than the number of k values. The speedup is a linear function of the number of processors up to about 64, but after that it does not grow as much with the number of processors.
Test of the parallelization
Visualization
Motto: 1 figure = 1 000 words = 1 000 000 data points
Before I started this project I thought that the visualization part would be the easiest one, but I was really wrong. It can be very difficult to produce an image that captures the essence of the scientific result and is pretty as well. (I have even seen a Facebook page dedicated to particularly ugly plots. I cannot find it now, but it’s better that way.) I do not have a lot of experience with visualization, as I had only done it for lab reports. For a lab report all we had to do was make a scatter plot, fit a function to the points, explain the outliers and, most importantly, have proper axis labels with physical quantities in italic and with proper units. The result? A correct but boring plot and the best mark for the lab report.
Now I am trying to make figures that can show the results and capture the attention, too. If you have a good idea, any advice is greatly welcomed.
The first test system was the benzene molecule, and I computed the electron density only in the plane of the molecule. I plotted the results with Wolfram Mathematica using ListPlot3D and ListDensityPlot.
Electron density of benzene orbital using ListPlot3D in Mathematica
The same benzene orbital but plotted with ListDensityPlot
But the electron density is a function in space! How can I plot the four-dimensional data? The answer is simple: let’s make cuts along the x-z plane, and make a plot for each cut. Here you can find 1231 cuts for the R=(6,0) nanotube. I apologize if you really tried to click on the “link”, it was a bad joke. I do not expect you to reconstruct the 3D data from plane cuts, because there are better ways of visualization.
What we can plot is the electron density isosurface, a surface where the electron density is equal to a given value. Quantum chemical program packages can make such plots if we give them the input in the proper format. I have x-y-z coordinate values and the data, and I convert this to Gaussian cube file format, then I use the Visual Molecular Dynamics program to make the isosurface plots. Here you can see some examples for the R=(6,0) nanotube.
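As a rough sketch of that conversion step, a minimal cube writer could look like the following (a hypothetical toy version; real tools also handle Bohr/Angstrom units and the atom records much more carefully):

```python
import numpy as np

def write_cube(filename, origin, axes, data, atoms=()):
    """Minimal Gaussian cube writer. `data` is a 3-D numpy array of density
    values; `axes` are the three voxel step vectors; `atoms` is a sequence of
    (atomic_number, (x, y, z)) tuples."""
    nx, ny, nz = data.shape
    with open(filename, "w") as f:
        f.write("electron density\ngenerated by a toy cube writer\n")
        f.write(f"{len(atoms):5d} {origin[0]:12.6f} {origin[1]:12.6f} {origin[2]:12.6f}\n")
        for n, ax in zip((nx, ny, nz), axes):  # voxel counts and step vectors
            f.write(f"{n:5d} {ax[0]:12.6f} {ax[1]:12.6f} {ax[2]:12.6f}\n")
        for z_num, (x, y, z) in atoms:  # one record per atom (charge field set to 0)
            f.write(f"{z_num:5d} {0.0:12.6f} {x:12.6f} {y:12.6f} {z:12.6f}\n")
        for i in range(nx):  # density values, at most six per line
            for j in range(ny):
                for k in range(nz):
                    f.write(f"{data[i, j, k]:13.5e}")
                    f.write("\n" if k % 6 == 5 or k == nz - 1 else " ")

density = np.random.rand(4, 4, 4)
write_cube("density.cube", origin=(0, 0, 0),
           axes=[(0.5, 0, 0), (0, 0.5, 0), (0, 0, 0.5)], data=density)
```

A file in this shape is what VMD then turns into the isosurface plots.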
This is all for now, I hope I could tell you something interesting. I will come back with the final blog post next week.
Hello dear passengers, the new topic about my project is equality!
I believe the topics in SoHPC should be comprehensible for everyone, so I decided to make a game to explain how HPC and Deep Learning work.
” Everything should be made as simple as possible, but not simpler.”
Albert Einstein
While designing the game, the aim was to convey the idea as simply as possible for every age and gender.
There are two modes of the game: -Training Mode: You can train the deep neural network by sliding the images into the Deep Learning Module. -Inference Mode: The trained deep neural network will try to guess the images that you slide.
Also, there is a button where the magic happens! HPC Button: You can “Speed Up” the training and inference process by increasing the number of “HPC Workers”.
Don’t be shy! Just try the game, it is fun:
You can access the game with the full resolution by clicking here or you can download the PC version here. I made the game with Unity.
The training and inference of deep learning are heavy computational processes. While extracting information from images (they are just numbers in computers’ eyes), many matrix operations (like addition, multiplication and padding) are happening. The process is long; however, the magic of HPC helps.
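To give a taste of what that means, here is a toy sketch (not the game’s actual code) of the kind of operation a network layer performs on an image: pad the array, then slide a small filter over it, multiplying and adding at every position.

```python
import numpy as np

# A 4x4 "image" is just a grid of numbers in the computer's eyes
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)

kernel = np.array([[1, -1], [1, -1]], dtype=float)  # a toy edge-detecting filter
padded = np.pad(image, 1)                           # the "padding" operation

# Slide the filter over every position: multiply element-wise, then add
out = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        out[i, j] = np.sum(padded[i:i+2, j:j+2] * kernel)
print(out.shape)  # (5, 5)
```

A real network does millions of these multiply-and-add steps per image, which is exactly the workload HPC hardware accelerates.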
You can feel the powerful touch of HPC in the game with the button.
Very big thanks to my friend Çağdaş for helping me with his expertise in this project.
-Behind the idea-
Besides working and learning new academic things together in the beautiful surroundings of the STFC Hartree Centre,
Office
River near the STFC Hartree Centre.
Small lake near the centre.
SoHPC event gave me a great opportunity to meet awesome people and visit fantastic places.
At Edinburgh Fringe Festival with Benjamin and Ebru.
At Arthur’s seat, Edinburgh
Fringe Festival
Edinburgh
After one inspirational month in this event, I started to think that what we do and feel should be shared with everyone, as simply as possible.
The photos in the game:
Adolf – My brother’s best friend
Ramses – My best friend
Şermin – Our beloved friend in the school.
Alice – My sister’s best friend
A seagull photo taken by my cousin in Spain.
Atak – My cousins’ best friend
A beloved friend at Izmir Institute of Technology, we have many friends like him. You can support our student branch by following @iytehaydos .
In loving memory of my best friends, the bird “Tekir”, and the dog “Kuzu Kulak”.
Quite some time has passed since my last blog post, and a lot of progress has been made on my project! After a very warm welcome by my supervisors and the other colleagues working at ICHEC, I immediately started setting up what would keep me occupied for the next couple of weeks: the Neural Compute Stick by Intel Movidius.
The Intel Movidius Neural Compute Stick
This stick was developed specifically for computer vision applications on the edge, with dedicated vision processing units at its disposal for inference tasks on deep neural networks. That last sentence was loaded with quite a lot of terminology, so I’ll go ahead and try to untangle it a bit.
So, what does inference mean in this context? Once a deep neural network has been trained on powerful hardware for hours or sometimes days, it needs to be deployed somewhere in a production environment to run the task it was trained for. These can be, for example, computer vision applications, which can encompass detecting pedestrians or other objects in images, or classifying detected objects such as different road signs. Simply put, inference refers to the task of taking a trained model and getting predictions from the inputs fed into it. This can happen on a server in a data center, an on-premise computer, or an edge device. Now the crux with the latter is that applications such as computer vision tend to place quite a high computational load on hardware, whereas edge devices tend to be lightweight, battery-powered platforms. It is of course possible to circumvent this problem by streaming the data that an edge device captures to a data center and processing it there. But the next issue, which the term “edge device” already implies, is that these computing platforms are often situated in quite inaccessible places, like on an oil rig in the ocean or attached to a satellite in outer space. Transmitting data is in these cases costly and comes with a lot of latency, rendering real-time decision making nearly impossible. Especially if this data encompasses images, which as a rule of thumb tend to be large. That’s where hardware accelerators like the Neural Compute Stick bring in a lot of value. Instead of sending images or video to some server for further analysis, the processing can happen on the device itself, and only the result, i.e. pure text, is sent to some remote location for storage, statistics, and visualization. Such an approach brings numerous benefits, but most importantly it alleviates latency and communication bandwidth concerns.
Example of running inference on the Neural Compute Stick. The picture shows a combination of object detection and classification.
Now what is this magical Neural Compute Stick all about?
The Intel Movidius Neural Compute Stick is a tiny fanless device that can be used to deploy artificial neural networks on the edge. It is powered by an Intel Movidius Vision Processing Unit, which comes equipped with 12 so-called SHAVE (Streaming Hybrid Architecture Vector Engine) processors. These can be thought of as a set of scalable, independent accelerators, each with its own local memory, which allows for a high level of parallel processing of input data.
Complementing this stick is the so-called Open Visual Inference and Neural Network Optimization (OpenVINO) toolkit, a software development kit that enables the development and deployment of applications on the Neural Compute Stick. This toolkit basically abstracts away the hardware that an application will run on and acts as a “mediator”.
Other than working on my project, I already got a first taste of what it is like living in Ireland, and I find it nothing short of amazing! The weather has been very gentle, bringing way more sunshine than I expected given all the clichés about Irish weather! Living centrally during my stay also brings the advantage that exploring the city is fairly easy and the workplace is just a short walk away!
Fortunately such rainy weather was rather a rare sight in the first couple of weeks!
After this short introduction, in my next blog post I will outline the workflow and setup of the Neural Compute Stick and demonstrate an example application, so stay tuned for more!
Bird pun, check. Now let’s move on to the next installation of Allison’s Summer of HPC blog!
A quick recap: my project is all about improving the accessibility of meteorological radar data for ornithologists at the University of Amsterdam. These data provide new insight into the migratory trends of different flocks, and the reasons for changes in migration patterns. See the posts below for more info on my project and the SoHPC journey so far.
When I say ‘radar data’, I’m guessing that the first image that comes to your mind is a big ol’ circle with different blobs of blue and green on a retro screen in a submarine or airplane or some other militaristic vehicle. Maybe the screen is beeping and updating with every sweep, and the shapes are changing as the torpedo/alien starship is coming closer and closer to Air Force One (#hollywood). Am I right?
The important thing to recognise about this mental image is that this may be the typical imagery associated with radar, but the data certainly isn’t readily accessible in this format. It’s all just numbers and figures with no discernible patterns. If I gave you a table (or worse still, a 1D array) of meteorological data like the one below and told you to find the cloud, how do you think you would fare?
Horrible, ugly, useless
We need visualization to be able to make any use of this data! We need to turn this boring information into a visual that our brains can actually understand. Now what if I gave you this?
Wonderful, beautiful, useful
That’s more like it. Here we can clearly see the radar data represented in a way that is interpretable. Add in a time dimension (like in the gif at the top of this post) and Bob’s your uncle.
There’s a bit more complexity to visualizing radar data than you might first imagine. Image data is typically stored in 3D (height, width, and colour if a colour image), or 2D arrays (height and width if black and white). This is basically a matrix, or ‘gridded’ data, and it’s pretty easy to wrap your head around. Polar data, however, is not stored in this format. Instead of having x and y axes (width and height), a polar dataset basically collects data along different azimuths and radial distances. This data combines to form a circle with 360 data points at each range.
Azimuth: the horizontal angle or direction of a compass bearing. I.e. if you stand in one spot and do a twirl you’ll cover 360 azimuths.
Radial distance: the distance (in any metric) from measurement point to the fixed origin, being the radar.
The polar data structure can make it quite complicated to visualize this data. Most visualization tools require pixels to be stored in a tabular format, i.e. all the horizontal pixels are stored in rows, and all the vertical pixels stored in columns. If you try to shove in polar data and tell the tool to start at the middle and go out in graduated circles: fail.
A crucial step in the data transformation pipeline that I am building is, therefore, ‘gridding’ the data. Conceptually, it’s not too challenging: plot out the circular data that we do have, and then fill in the corners with null values. In practice, this was a lengthy process involving many documentation rabbit-holes.
def polar_to_grid(polar_array):
# lots of code
# more code
return gridded_data
Et voila, we have a grid! The datapoint that was at the top left corner in the table above (ie. at azimuth 0 and radius 0) now becomes the datapoint at the very centre of our new grid.
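For the curious, a heavily simplified nearest-neighbour sketch of this gridding step could look like the following (it assumes a (360, n_range) array indexed by azimuth and range bin; the real pipeline handles projection and geolocation far more carefully):

```python
import numpy as np

def polar_to_grid(polar, n_pix=None):
    """Map a (360, n_range) polar array onto a square Cartesian grid using
    nearest-neighbour lookup; pixels no beam reaches are filled with NaN."""
    n_az, n_rng = polar.shape
    n_pix = n_pix or 2 * n_rng
    # x/y pixel coordinates centred on the radar
    y, x = np.mgrid[0:n_pix, 0:n_pix] - (n_pix - 1) / 2.0
    r = np.sqrt(x**2 + y**2) * (n_rng / (n_pix / 2.0))      # pixel -> range bin
    az = (np.degrees(np.arctan2(x, -y)) % 360).astype(int)  # pixel -> azimuth
    grid = np.full((n_pix, n_pix), np.nan)                  # corners stay NaN
    inside = r < n_rng
    grid[inside] = polar[az[inside], r[inside].astype(int)]
    return grid

polar = np.arange(360 * 50, dtype=float).reshape(360, 50)
grid = polar_to_grid(polar)
print(grid.shape)  # (100, 100)
```

The datapoint at azimuth 0 and radius 0 indeed lands next to the centre of the new grid, and the four corners, which the radar never sees, are null.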
Easier?
With gridded data, things become much “easier”. Now we just have to generate longitude and latitude arrays that accurately reflect the geographical position of each datapoint, with consideration of the altitude of the radar, the elevation of the scan, and the curvature of the globe.
With a little help from some brilliant tools and libraries like Geoviews, Wradlib, Holoviews, Spark and Bokeh, we can create an interactive visualization tool that allows researchers to quickly access and visualize their data. They can stop wasting time on data structures, programming and stack trace errors, and focus on what’s more important for them: ornithology.
Today, as promised, I will present and explain my challenge, called oneGBSortChallenge! Then I will add some complements about the Radix Sort, in particular how to deal with negative and real numbers, and with other kinds of data types. Lastly, I will explain what I have done so far and show some performance results for the Radix Sort I have implemented.
Beat my time record on sorting 1GB of data using MPI and C++. There are Amazon gift cards to be won, and it is also a good opportunity to practice or discover HPC and MPI. Everything is guided and kept simple so that beginners can also take part. So don’t hesitate!
How to take part
Go to the challenge’s git and follow the very simple and short instructions. Everything is detailed there.
If you read my blog posts, you should have some hints on how to defeat me…
Back to Radix Sort
If you paid attention to the previous blog post, when we presented the Radix Sort, all the examples used positive integers (unsigned integers). What if we want to sort a list containing reals, or both positive and negative numbers? Let’s see an example with signed integers.
Sort signed integers
Usually, to implement the Radix Sort, base 256 is used and one digit corresponds to one byte; the reasons for this were explained in my previous post. As soon as we work with the binary representation of numbers for the Radix Sort, whatever the base, we encounter the following issue when sorting signed integers.
What is the issue?
For this example, to keep it fast and simple, we will treat the integers in base 16 and sort int8_t numbers (signed integers stored in one byte). In base 16, one digit corresponds to 4 bits: if you take 4 consecutive bits (half a byte), the value you end up with is between 0 and 15 inclusive (between 0 and F in base 16 notation). Base 16 is often called hexadecimal, or hex.
One byte is composed of 8 bits, so there are two base 16 digits in one byte. Thus, the Radix Sort parameter d we saw in the previous post equals 2 here, because all the integers we want to sort have two digits in the chosen base. Besides, we will use the Bucket Sort in the example, but note that the issue and its solutions are the same with the Counting Sort; the Bucket Sort is just handier for illustration. Below is the list we want to sort.
Figure 1. A list of signed integers we want to sort using the Radix Sort to demonstrate the issue with signed integers. Source: Jordy Ajanohoun
The first row is our list in base 10; the last row is the same list represented in binary signed two’s complement (as the numbers are really represented and stored in a computer). The middle row is the corresponding base 16 representation. According to the Radix Sort algorithm we saw in the previous post, we first have to sort the elements by their last digit (the Least Significant Digit). Then the result is sorted again by the second digit (the Most Significant Digit here).
Figure 2. First Bucket Sort round of the Radix Sort on a signed integer list (list of int8_t integers). The list is sorted according to the base 16 Least Significant Digit of the numbers. The considered base to run the Radix Sort algorithm here is the base 16. The empty buckets are not represented. Source: Jordy Ajanohoun
The empty buckets are not represented here to save space. This example is also a chance to revise how the Bucket Sort works. As d equals 2, we still have one iteration to do with the next digit (the Most Significant Digit here).
Figure 3. Second and last Bucket Sort round of the Radix Sort on a signed integer list (list of int8_t integers). The list is sorted according to the base 16 Most Significant Digit of the numbers. The considered base to run the Radix Sort algorithm here is the base 16. The empty buckets are not represented. The result is wrong, the final list is not sorted. Source: Jordy Ajanohoun
Now the algorithm has finished and our list is not sorted. So, let’s understand why it works fine with unsigned integers but not with signed ones. The MSB (Most Significant Bit) of an unsigned integer, not to be confused with the MSD, is entirely part of the number and contributes to its value, unlike signed integers, where it is used to store the sign. And this makes the difference. If we pay attention, in the final list above, we have the positive integers first and then the negative ones and, whatever the length of the list, we will always have this partition. This is due to the value of the MSB, because 0 is used as the MSB for positives and 1 for negatives. As we always treat each digit as an unsigned integer, the MSD value of negative integers will always be greater than that of positive ones because of the MSB. Thus, in the final iteration, when we sort according to the MSD, positive integers will always come before the negatives, as the value of their MSD is lower. The very good point is that if we look only at the positive section of the list, it is sorted, and the same holds for the negative section. So, the only thing we have to do is to find a way to swap these two parts and we will end up with the desired sorted list. There are several ways to achieve this.
How to fix it?
We could, for example, loop through the final list, find the index at which the negative section begins, and then perform multiple swaps. However, this solution is not suitable because it requires at least one more pass through the entire list and, as a consequence, increases the complexity of the sort.
A more clever and suitable solution is to sort the list still with the Radix Sort, but according to another key. Until now, we have sorted the numbers according to their bitwise representation, which is perfectly fine for unsigned integers but not for signed ones. Plus, the main inconvenience of the Radix Sort is that, by the way it proceeds, it can only sort data that has a bitwise representation. Thus, to sort signed integers, we have to find a suitable binary representation of them such that the Radix Sort, executed using that representation as the sort key, ends up with the correct sorted list. This may sound complex, but it is actually fairly simple. If we take the bitwise representation of the numbers, invert their MSB and use the output as the sort key, our problem is solved. The reason is straightforward: as we have just seen, it is because the MSB of negative integers equals 1 and that of positives equals 0 that this partition appears at the end. By inverting the value of this bit, we invert the final placement of the positive and negative sections in the list. Let’s go back to our example to illustrate.
Figure 4. A list of signed integers we want to sort according to their key and using the Radix Sort. These keys fix the problem we encounter when we want to sort a list of signed integers in the same way that a list of unsigned integers with the Radix Sort. Source: Jordy Ajanohoun
We still want to sort the same list; the numbers are unchanged and we can read them in the first row. The last row contains the keys as explained, and the middle row contains the keys in base 16 representation. As the LSDs of the numbers have not changed, the result of the first Bucket Sort will be the same as previously (Figure 2). Thus, we can take this result (Figure 2) and continue the Radix Sort with the second and last Bucket Sort.
Figure 5. Second and last Bucket Sort round of the Radix Sort on a signed integer list (list of int8_t integers) using correct keys. The list is sorted according to the base 16 Most Significant Digit of the keys. The considered base to run the Radix Sort algorithm here is the base 16. The empty buckets are not represented. The final list is well-sorted. Source: Jordy Ajanohoun
This time the list is sorted. In practice, there are several ways to invert the MSB of a signed integer, and it can also depend on the programming language. Unfortunately, we are not going to go further into the details of this aspect today, because otherwise the post would be too long. However, the issue has been highlighted and a solution has been provided.
There are other ways to tackle the problem. I have presented this one because it is the first I found and managed to implement, and, to me, the simplest. Plus, this solution doesn’t need an additional pass through the list, because we can extract the keys on the fly during the Bucket Sort (or Counting Sort) via a dedicated function; that way, there is no need to store them either. We can get the key of a given signed integer in negligible constant time, so we are not increasing the complexity.
Careful! If we use this process to sort unsigned integers as well, it doesn’t work. By doing that with unsigned integers, we lose information about the value of the numbers, because the MSB is no longer a sign bit. The keys for unsigned integers have to be their binary representation without any transformation; in other words, the numbers themselves.
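Here is a small sketch of this key trick (in Python rather than the C++ of my library; the names are mine). The first sort uses the flipped-MSB key and gives the right order; the second sorts by the raw byte value and shows exactly the positives-then-negatives partition described above:

```python
def key_int8(x):
    """Sort key for int8 values: take the two's-complement byte and flip the
    sign bit, so that, read as an unsigned byte, the order is numerical."""
    return (x & 0xFF) ^ 0x80

values = [-78, 75, -1, 127, -128, 0]
print(sorted(values, key=key_int8))       # [-128, -78, -1, 0, 75, 127]

# Without the flip, the raw bytes put all positives before all negatives:
raw = sorted(values, key=lambda x: x & 0xFF)
print(raw)                                # [0, 75, 127, -128, -78, -1]
```

Inside each of the two sections of the broken ordering the numbers are sorted, which is exactly the observation the MSB flip exploits.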
Ok, now we know how to sort signed and unsigned integers, but what about reals and other kinds of data types?
Sort reals and other data types
Explaining in detail how to sort float and double values using the Radix Sort would be too much for our purpose here. It is not really complex, but it requires knowledge of floating-point representation, which is not so straightforward. It is a bit more tricky than with integers, but the idea remains the same: still a question of playing around with binary representations.
Sorting other data types like dates or hours or whatever is simple with comparison-based sorting algorithms: we can implement a predicate that takes two instances of our data type and returns whether or not the first one is greater than the second. Then we provide this predicate to the sort function along with the list to sort, and it is done. But we can’t generalize this to non-comparison-based sorts because, by definition, they don’t compare elements between themselves. Besides, as said, the Radix Sort needs a binary representation to sort. This is why the Radix Sort has a limited scope; this constraint can be awkward. Dates and hours, for example, don’t have a native binary representation; they are not native data types in computers. Often we use strings, or two to three integers, to represent and store them. So, to sort them using the Radix Sort, the rule is the following: we have to find a suitable binary representation of our data type such that the Radix Sort, executed using that representation as the sort key, ends up with the desired sorted list. By binary representation, here, we mean integers, strings or reals, because these are the native data types and they have a native binary representation. Finally, in that case, instead of providing a predicate to the sort, we provide a function that returns the binary representation of an instance of the data type. Such a function is often called a functor or a callback.
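As a toy illustration of such a callback (in Python rather than C++, with hypothetical names): a function that maps a (year, month, day) date to an integer whose binary representation sorts chronologically.

```python
def date_key(date):
    """Pack a (year, month, day) date into one unsigned integer whose binary
    order matches chronological order: 4 bits for the month, 5 for the day."""
    year, month, day = date
    return (year << 9) | (month << 5) | day

dates = [(2019, 8, 31), (2015, 1, 2), (2019, 8, 4), (1991, 12, 25)]
print(sorted(dates, key=date_key))
# [(1991, 12, 25), (2015, 1, 2), (2019, 8, 4), (2019, 8, 31)]
```

A Radix Sort fed these integer keys would produce the same chronological order without ever comparing two dates directly.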
I started by playing around and implementing some serial versions of the Radix Sort to get familiar with the algorithm and its variants. The goal was not to implement an optimized, ready-to-use serial Radix Sort, because we are not going to reinvent the wheel: there already are some very good libraries for that, like Boost and its spreadsort function, and I will use one of them when needed. As a reminder, my project is to implement a reusable C++ library with a clean interface that uses the Radix Sort to sort a distributed (or not) array, using MPI when it is distributed. So, when the array is not distributed, I can use, for example, Boost’s spreadsort. My project focuses on distributed arrays using MPI and, so far, I have a great parallel version able to sort all integer types. It is my third parallel version: I started with a simple one and then optimized it little by little. My current version also allows users to provide a function returning a bitwise representation of their data, in order to sort data types other than the native ones. Since for now it can only sort integers, the bitwise representation must take the form of an integer, but this already offers flexibility and a huge range of possibilities. Besides, I have also profiled my versions and compared them to std::sort with more than 2 000 nodes using the ICHEC Kay cluster. We will talk about that in the next post!
Many hands make light work (or maybe just Lumos if you’re a wizard). How about many processors?
Howdy, Friends!
Form Gfycat.com
I hope you are all doing great and that you are enjoying your summer (Or any season… Future applicants of SoHPC20 triggered… )
I am making the best of my time/internship here in Bratislava. I am learning a lot about programming, Machine Learning, HPC, and even about myself! In parallel with that (seems like HPC is a lifestyle now), I am really enjoying my free time here visiting Bratislava and other cities around!
I can hear you saying “Learning about programming? And herself? What is she talking about?” Well, humor me and you’ll understand!
Programming
Well, this is something I wasn’t expecting to be this important, but it actually is! I haven’t programmed much in C lately and got more used to object-oriented languages like C++ and Python. I found myself here programming in C an algorithm that my brain was designing only in an “object-oriented” way.
So I had to adapt to the constraints of the C language while implementing my regression tree. This took some time and caused some delays, so I am behind the schedule you saw here, but the steps are the same and I am trying to catch up. Hopefully my final product will meet the expected requirements…
Machine Learning
So, as I told you in my last post (Come ooon, you’re telling me that you can’t remember the blog post you read 3 weeks ago? Shame… on me, for not posting for this long!)…
Excuse: I was really focused on my code’s bugs……..
…I implemented a decision tree. The idea of this algorithm is to find a feature (a criterion) that will help us “predict” the variable we study. As an example, we can create our own (super cool but tiny) dataset as follows:
| IndependentParts | LinesRepeated | LoopsDependency | Parallelizable |
|------------------|---------------|-----------------|----------------|
| 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 |
This is a (totally inaccurate and random) dataset with 4 variables (columns). IndependentParts reflects whether the program has some blocks that are independent of each other. LinesRepeated means that there are some parts it has to compute/do many times (a for loop on the same line, for example). LoopsDependency tells us whether the segments encapsulated in a loop depend on the result calculated in a previous iteration or not. Finally, Parallelizable states whether we can parallelize our code or not. 1 always means yes, 0 means no. Each line represents one program, so we have 7 different programs here.
If we want to predict the fourth one (Parallelizable), our decision tree will have to consider each feature (column) and see which one helps us more to distinguish if a code can be parallelised or not. Then we select it and split the codes according to that variable.
Easy, right? But we’re not done yet! If we stopped here, we could have something helpful, but it would never be accurate, because you wouldn’t know (and neither would your algorithm) what to do in a case like this one, if you chose LoopsDependency, for example:
| IndependentParts | LinesRepeated | LoopsDependency | Parallelizable |
|------------------|---------------|-----------------|----------------|
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
Do you know how it can predict whether the code is parallelizable or not? Because I don’t…
And this is why we have to consider all the variables that can help us reduce this uncertainty, or the errors that can occur (but not ALL the variables every time; some of them could be useless).
BAM! You have a tree now! (Because you keep splitting the “branches”, which we will call nodes. The first one is called the root, and the last ones, which have no children, are called the leaves.)
By the way, if we want to predict a quantitative (continuous) variable, our decision tree is called a regression tree. If it predicts a categorical (qualitative) variable, it is called a classification tree! See the dataset below.
| FavoriteColor | LovesHPC | ReadsMyPosts | Cool |
|---------------|----------|--------------|------|
| Red | 7 | 1 | 8.2 |
| Blue | 3 | 1 | 7.8 |
| Red | 5 | 0 | 3.1 |
| Yellow | 6 | 0 | 6 |
| Razzmatazz | 10 | 1 | 10 |
This is a (very serious?) dataset considering people’s favorite color, their love/interest in HPC and whether they read my blog posts or not to evaluate how cool they are. (Since you are reading this post, I give you 10!)
For the record, Razzmatazz is a real color. 89% red, 14.5% green and 42% blue. I can only imagine cool people knowing this kind of color names, I had to give a 10 here!
So, since we are predicting this continuous variable (Cool), it is a regression tree! But don’t worry if you are not used to this; you can still read the sections below, since we won’t rely on it. Basically, the diagram below shows what we are doing (with only one stopping condition; there are others, and this simplifies the code a lot, on purpose):
A functional portion of the algorithm that considers only the size of the Nodes to stop the recursion and the tree building. Created on Lucidchart.com
Note: If you are really into Machine Learning and would like to know more about the algorithms, check out the tutorials of this great teacher. Now if you want to know more about my C implementation, I would be very happy to talk about it with you and write a detailed post about it! Just contact me!
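If you would rather read code than a flowchart, here is a deliberately simplified Python sketch of a regression tree with that single stopping condition (node size). All names are illustrative and this is not my actual C implementation; it also only handles numeric features, so the color column of the dataset above is left out:

```python
import statistics

def variance(ys):
    # Population variance of the target values in a node (0 for singletons).
    return statistics.pvariance(ys) if len(ys) > 1 else 0.0

def best_split(rows, target):
    """Return (score, feature, threshold) minimising the size-weighted
    variance of the target in the two child nodes, or None if no split works."""
    best = None
    for feat in (k for k in rows[0] if k != target):
        for threshold in sorted({r[feat] for r in rows}):
            left = [r[target] for r in rows if r[feat] < threshold]
            right = [r[target] for r in rows if r[feat] >= threshold]
            if not left or not right:
                continue
            score = len(left) * variance(left) + len(right) * variance(right)
            if best is None or score < best[0]:
                best = (score, feat, threshold)
    return best

def build_tree(rows, target, min_size=2):
    # Stop recursing (make a leaf) when the node is too small to split.
    split = best_split(rows, target) if len(rows) > min_size else None
    if split is None:
        return {"predict": sum(r[target] for r in rows) / len(rows)}
    _, feat, threshold = split
    return {"feature": feat, "threshold": threshold,
            "left": build_tree([r for r in rows if r[feat] < threshold],
                               target, min_size),
            "right": build_tree([r for r in rows if r[feat] >= threshold],
                                target, min_size)}

def predict(tree, row):
    # Walk down the tree until we reach a leaf.
    while "predict" not in tree:
        side = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree["predict"]

# Numeric columns of the "coolness" dataset above.
data = [{"LovesHPC": 7, "ReadsMyPosts": 1, "Cool": 8.2},
        {"LovesHPC": 3, "ReadsMyPosts": 1, "Cool": 7.8},
        {"LovesHPC": 5, "ReadsMyPosts": 0, "Cool": 3.1},
        {"LovesHPC": 6, "ReadsMyPosts": 0, "Cool": 6.0},
        {"LovesHPC": 10, "ReadsMyPosts": 1, "Cool": 10.0}]
tree = build_tree(data, "Cool")
print(predict(tree, {"LovesHPC": 9, "ReadsMyPosts": 1}))
```

The leaf prediction is simply the mean of the node, which is exactly the "reduce the uncertainty" idea from the text: every split is chosen to make the leaves as homogeneous as possible.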
More Machine Learning
With this in mind, we can now go further and see what is a Gradient boosting decision tree.
So, to make it easy-peasy, just consider it as a repetition of what we were doing before (building trees) for a fixed number of times (usually 100). But the trees are built serially, each one depending on the previous ones: in order to build tree No. i, we have to consider the result (especially the errors) of tree No. (i-1).
This is very useful for obtaining more accurate predictions.
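Here is a toy Python sketch of that serial dependence, using one-split "stumps" instead of full trees to keep it short (this is an illustration with made-up names, not the project code):

```python
def fit_stump(xs, residuals):
    """Weak learner: a depth-1 'stump' splitting xs at the threshold
    that minimises the squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm

def gradient_boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Each new stump is fitted to the residual errors left by the
    previous ones -- this is the serial dependence described above."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)
```

Because stump No. i is fitted to the residuals of stumps 1 to i-1, you cannot build the hundred trees at the same time, which is exactly why the parallelisation has to happen elsewhere (see below).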
HPC
I know, I just said that gradient boosting implies a serial repetition of decision trees, where each step depends on the previous one, so parallelising that part is complicated… but we actually won’t parallelise it! What we will parallelise are the loops and independent parts inside the decision trees!
There are lots of ways to parallelise the tree-building part. Since the algorithm searches for the best value to split the observations with a minimum error rate (the threshold in the algorithm you saw above), we could parallelise that search! Another option would be to parallelise the node-building parts. This would give something like this:
An example of how the Node building part could be parallelized. This one is done only on Google Docs.
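As a rough illustration of the first option, the threshold search can be farmed out to a pool of workers. This Python sketch uses threads only to mimic what real C/OpenMP parallelism would do, and all names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def split_error(args):
    """Squared error of splitting the observations at `threshold`."""
    xs, ys, threshold = args
    left = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    if not left or not right:
        return float("inf"), threshold
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    err = (sum((y - lm) ** 2 for y in left)
           + sum((y - rm) ** 2 for y in right))
    return err, threshold

def best_threshold_parallel(xs, ys, workers=4):
    """Evaluate every candidate threshold concurrently, keep the best."""
    candidates = [(xs, ys, t) for t in sorted(set(xs))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(split_error, candidates))
    return min(results)[1]

print(best_threshold_parallel([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
# 10
```

Each candidate threshold is independent of the others, which is what makes this loop a good parallelisation target even though the boosting iterations themselves are serial.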
So the next step is to compare these different parallel versions and see if we can mix some to get better results!
I’m now very excited to see how all of this goes and which version will be better!
More HPC
Haha! You thought it was done! Well, no. But this part is not about the project, but still about HPC! (Not only)
A few weeks ago, we visited a museum located here in the Slovak Academy of Sciences, a few metres away from the Computing Centre. And guess what? It’s all about computers! I had such a gooood time there! The guide was so passionate about his subject, and we saw some great things. I’ll show you two of them; you have to come here to see the others!
And now I’ll just say goodbye! I’ll do my best to publish another post soon. As usual, if you have any suggestions or questions, just send me a message on LinkedIn (linked in my bio).
Welcome, all! I have been working on an industrial electricity consumption prediction project. I received the data from a Slovenian company which sells electricity to its customers; the dataset consists of 15-minute electricity consumption readings spanning one year for 85 end-users. The company wants to build a forecasting system in order to make better spending forecasts from consumers’ consumption history, so that it knows more accurately how much electricity should be bought for a selected time interval, ultimately making the business more profitable. I am developing a system for profiling consumers and algorithms for predicting consumption, which will enable smart ordering and planning of energy consumption and thus great savings.
Exploratory Data Analysis
After identifying the data and their sources, I am investigating the external factors that influence end-user electricity consumption. The focus is on the calendar and the weather. For this, the consumption data is fused with data about these external factors (both on a 15-minute scale), and I first performed an exploratory analysis to understand how the selected factors influence energy consumption. I downloaded the weather data from the website of the Environmental Agency of the Republic of Slovenia (ARSO), and obtained information about holidays in Slovenia from the Time & Date Slovenia web page. Then I derived the long holidays: suppose Thursday is a holiday, then it is highly likely that most people take Friday off as well, making it a long holiday; a similar case applies to Tuesday. This holiday data and the derived information are then used to get insights from the clusters formed using unsupervised learning. With this, I performed a statistical analysis for every consumer.
The consumption varies with temperature; more electricity is required at very low & very high temperatures. The curve is smoothed using the Loess method.
The correlation of various weather factors with consumption.
However, there are many consumers who don’t exhibit this temperature vs. consumption behaviour.
The consumption varies with day & time (every 15 mins, hence a total of 96 data points in a day).
Unsupervised Learning
I performed cluster analysis to group days with similar electricity consumption, using k-means clustering and hierarchical clustering with custom distance metrics. I found the optimal number of clusters for every consumer and estimated the variance within and between the clusters. I also calculated how much the variance is reduced by the clustering.
This dendrogram shows various clusters for one customer.
Using this cluster information, I tried to find relations between various factors like the day of the week and the holidays. For now, I have developed a small procedure: it takes a date for which consumption has to be predicted, finds the cluster to which that date belongs, and returns the cluster average at that time as the predicted consumption. I then calculate the percentage error between the actual and predicted consumption.
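In Python-flavoured pseudocode, that procedure is roughly the following (the dictionaries and their names are hypothetical; my actual implementation is in R):

```python
def predict_consumption(date, time_slot, cluster_of, cluster_profiles):
    """Predicted consumption = the cluster's average profile at that slot.

    cluster_of:       maps a date to its cluster label
    cluster_profiles: maps a cluster label to its average daily profile
                      (96 points, one per 15-minute slot)
    """
    return cluster_profiles[cluster_of[date]][time_slot]

def error_percentage(actual, predicted):
    # Relative error of the prediction, in percent.
    return abs(actual - predicted) / actual * 100.0

# Hypothetical example: two clusters of day profiles.
profiles = {"workday": [10.0] * 96, "holiday": [4.0] * 96}
labels = {"2019-07-22": "workday"}
print(predict_consumption("2019-07-22", 40, labels, profiles))  # 10.0
```

A date outside the training year would first need to be assigned to its most similar cluster, which is the part the time series modelling will eventually improve.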
What’s Next?
So far, my analysis has covered 85 consumers with 1 year of data each. I don’t have direct contact with the company, so I have asked SoHPC to request data for more years, with which I could predict future consumption based on past years. Once I receive more data, I will split it into training and testing parts, perform time series analysis and modelling, and evaluate the results.
Other Updates
I have developed the algorithms for data analysis and electricity prediction in the R environment, and I use the NoSQL database MongoDB to store the dataset, so the system is scalable to large amounts of data. Some of the algorithms have been adapted to big databases for parallel processing on multiple nodes, and I am working on adapting the rest of the code. I will be writing separate blog posts to provide some basic understanding of these topics. So, stay tuned!
Hello my dear reader! Sorry about the clickbait; I know a bad joke won’t save my blog. Anyway, in this post I’m going to talk about my first week, instead of telling you what happened at the California Fire Department.
Unfortunately, I could not attend the training week in Bologna due to bureaucratic processes (I did not get a visa). Don’t worry if one day you qualify for this program and you can’t participate because of country agreements! There will be many possibilities for you to work remotely! And don’t give up, sooner or later you can get a visa 🙂
Edinburgh
What is ARCHER ?
In the first week, I joined the ARCHER Summer School. ARCHER is the latest UK National Supercomputing Service, located in Edinburgh, and its summer school is a free programme for anyone interested in the course content. Here is the link to the training programme: https://www.archer.ac.uk/training/ . ARCHER has about 5,000 nodes and 118,000 cores (like 30,000 quad-core laptops connected together).
ARCHER (Advanced Research Computing High End Resource)
I had the opportunity to attend the “Introduction to HPC” and “Message-passing programming with MPI” courses. The exercises took considerable time, and each participant’s questions were answered individually. The courses differed from ordinary ones by being based on practice instead of just listening, which made them ideal preparation for my project. I have to admit that I learnt a lot.
What is CFD ?
Let’s talk about my project. It revolves around a CFD simulation: the same simulation will be written in different programming languages, and even more than one version per language will be produced; after all, there is more than one way to implement an algorithm. But first, let’s take a look at what CFD is.
CFD Simulation of a Motorcycle
Computational fluid dynamics (CFD) is a branch of fluid mechanics that uses numerical analysis and algorithms to solve for flow patterns and the thermal state of fluids. It is one of the key analysis methods used in engineering applications. Using CFD, you can solve complex problems that involve fluid-liquid, liquid-solid, or liquid-gas interactions. CFD analysis has great potential to save time; it is therefore an inexpensive and quick way to obtain information compared to traditional experiments.
To see the covered subjects and case studies, visit https://www.cfd-online.com/ . When you click the “solve” button in Ansys or Autodesk CFD, the fans sound as if they were fighting inside the computer case. Speeding things up requires a very good processor and (in today’s conditions) at least 16 GB of RAM. This is why CFD is one of the classic uses of HPC.
In the project, I will simulate the flow patterns of a fluid passing through an empty square box. The box has a single inlet and a single outlet that is not on the same axis as the inlet.
I will write several Python versions of the existing C code and then compare their performance on different computers.
Simulation of flow depending on the number of iterations
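The heart of such a solver, in its simplest form, is a Jacobi iteration that repeatedly replaces each interior grid point by the average of its four neighbours. The Python sketch below shows only that kernel, with made-up names; the actual exercise code also applies boundary conditions for the inlet and outlet, which are omitted here:

```python
def jacobi_step(psi):
    """One Jacobi update of the stream function on the interior points:
    each point becomes the average of its four neighbours.
    Boundary rows and columns are left untouched (fixed boundary values)."""
    m = len(psi) - 2
    n = len(psi[0]) - 2
    new = [row[:] for row in psi]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            new[i][j] = 0.25 * (psi[i - 1][j] + psi[i + 1][j] +
                                psi[i][j - 1] + psi[i][j + 1])
    return new

def solve(psi, iterations):
    # Iterate a fixed number of times; a real solver would also
    # monitor the residual to decide when to stop.
    for _ in range(iterations):
        psi = jacobi_step(psi)
    return psi
```

This doubly nested loop is exactly the part whose cost varies wildly between plain Python loops, NumPy array operations, and the original C, which is what the performance comparison will measure.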
Last May I had the opportunity to attend VivaTech, an innovation fair that takes place in Paris. There, Vasant Narasimhan, Novartis’ CEO, shared some thoughts about the present and future of pharma and biotech companies. During his talk, he insisted on the need to use computational tools in drug discovery. That is because, according to Narasimhan’s figures, the cost of drug development is now well above US$1 billion, so the extremely high cost of producing new drugs is an urgent problem that needs to be tackled. Indeed, some studies estimate this cost at more than US$2.8 billion, but with the use of computational chemistry the time and cost of the drug development process can be significantly reduced.
Vasant at his talk at VivaTech
Drug discovery phases
Drug discovery has several phases. For simplicity, we may summarize them into four: discovery of an active compound, called the “lead”; optimization of the lead, in which candidates are synthesized and characterized in preclinical studies to test safety and efficacy in animals; clinical trials to test drug safety and efficacy in humans; and drug launch. The second phase, lead optimization, may account for almost 25% of the total cost of all four phases, according to a study from 2010. Lead optimization includes optimizing drug metabolism, pharmacokinetic properties, bioavailability, toxicity and, of course, efficacy. Drug efficacy directly depends on the binding affinity of the candidate drug for the pharmaceutical target, which is commonly a protein. Optimizing the binding affinity, i.e. the strength of the interaction between the candidate drug and the protein of interest, is the step we would like to improve this summer.
Free Energy Perturbations
To pursue this goal, we are working at the Biomedical Research Foundation of the Academy of Athens on Free Energy Perturbation (FEP) simulations. FEP simulations determine the relative binding affinity of two drug candidates, to discover which of the two binds more strongly to the protein of interest. In the last post, we explained how FEP simulations let us compare the binding of two ligands to their target protein and determine which one is favoured. As explained there, this results from calculating a physicochemical property called the “free energy of binding”, which directly relates to the binding affinity of the ligand-protein interaction. The difference in the free energy (ΔG) of binding of the two ligands is computed from Zwanzig’s equation, which takes into account the difference between the Hamiltonians (the potential and kinetic energies) of the two molecules in the solution environment and in the environment of the protein. Using a thermodynamic cycle, what we calculate is the difference in the free energy of binding (ΔΔG = ΔGΑ − ΔGΒ) between ligand A bound to the protein and ligand A in solution (ΔGΑ), and ligand B bound to the protein and ligand B in solution (ΔGB).
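In symbols, Zwanzig’s equation for the free energy difference between states A and B is usually written as follows, where kB is Boltzmann’s constant, T the temperature, HA and HB the Hamiltonians of the two states, and the angle brackets denote an ensemble average over configurations sampled in state A:

```latex
\Delta G_{A \to B} = -k_B T \,\ln \left\langle \exp\!\left( -\frac{H_B - H_A}{k_B T} \right) \right\rangle_A
```

Evaluating this average well requires extensive sampling of configurations, which is where the computational cost discussed below comes from.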
FEP computations are computationally expensive, so they need to be carried out on high-performance systems, such as ARIS, the supercomputer administered by GRNET. The last simulation I ran on ARIS took around 14 hours on 80 computing nodes, each one with 20 cores at 2.8 GHz, to complete. That simulation just compared lead molecule A with lead molecule B, in which we had substituted a single hydrogen atom with a hydroxyl group (OH). The result was that lead molecule B had a negative free energy of binding relative to lead molecule A. Thus, we conclude that lead molecule B, with the hydroxyl group, is favoured for binding to the protein of interest.
Lead compound (left) vs mutated molecule (right). Spot the one difference!
In addition to high computational needs, FEP simulations still need to be validated as a tool for drug development. This is key and the goal of our project. We are comparing FEP simulation results using different force fields and software tools with the results obtained in experiments. In particular, I am using GROMACS software, and GAFF2 force field. At the end of the day, we will be able to tell how well FEP predicts the experimental results. If the correlation between calculated and experimental results is high, in the future FEP may save time and resources in the lead optimization process of drug discovery.
To compare FEP simulation results with lab experiments, we have decided to use a series of Arp2/3 inhibitors. Arp2/3 stands for “Actin-related protein 2/3”, and it is a very important protein in our cells: it is part of the machinery that makes the cell move. However, in some scenarios we want to slow it down, since it is relevant in tumour motility. And that is what CK666 does: it inhibits Arp2/3 by binding to it. A collaborating group has obtained experimental results for the binding of CK666 and different analogues, which we will compare to our simulated results.
Arp23 structure. Source: generated by the author using the PDB file available at rcsb.org/structure/1k8k
Goodbye
As may have been clear from the above paragraphs, the whole project is a huge challenge for a biomedical engineer like me. Since I am not a chemist, I have to immerse myself in chemical principles. And since I am not a computer scientist, I have to learn to operate new software plus a supercomputer! Luckily, I am working with great colleagues who have been helping me enormously. And of course, as Jack Torrance might say, all work and no play makes Jack a dull boy, so after-work dinners and walks with friends are also a great source of help 😉
Here I am at Filopappus Hill with a beautiful view of Acropolis.
In my last post, I gave a quick overview of the challenges involved in benchmarking deep learning algorithms and hardware, as well as my plans to make a tool for this task to be used on UniLu’s Iris cluster. The main development since then is that I now have a version of this tool up and running. It lets the user import a model and dataset and supply a config file with details of the training, before it automatically distributes everything across the specified hardware, trains for a while, and collects data on how quick and efficient the training may or may not have been. There’s also a script to parse the config file and produce a suitable SLURM batch script to run everything, for when you’re feeling lazy.
As mentioned in the last post, there’s a lot of tweaking that can be done to parallelise the training process efficiently but, after a bit of experimentation and reading up on best practice, it looks like I have found default settings which work as well as can be expected. The general consensus on the subject is well summed up in this article, but the short explanation is to keep the batch size per core fixed and adjust other parameters to compensate. Currently the frameworks supported are Tensorflow distributed with Horovod, Keras (distributed either with Horovod or the built-in, single-node `multi_gpu_model` function) and Pytorch. While this is by no means a complete list of all the deep learning frameworks out there, it’s enough for me to step back and see how well they work before adding more.
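To make the "fixed batch size per core" idea concrete, here is a tiny helper in the spirit of that best-practice article. The function name and return format are mine, not the tool's API; the real tool reads these values from its config file:

```python
def scaled_hyperparameters(base_lr, batch_per_gpu, n_gpus):
    """Keep the per-GPU batch size fixed and scale the learning rate
    linearly with the number of workers (the 'linear scaling rule')."""
    return {
        "global_batch_size": batch_per_gpu * n_gpus,   # grows with workers
        "learning_rate": base_lr * n_gpus,             # compensates for it
    }

# Example: moving a base_lr=0.1, batch=128 configuration to 4 GPUs.
print(scaled_hyperparameters(0.1, 128, 4))
```

As the results below show, this rule eventually breaks down: once the global batch size grows too large, training accuracy suffers no matter how the learning rate is adjusted.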
Now that I seem to have working code, the natural thing to do with it is to run some experiments for a sample problem. The problem in question is image classification. More specifically training a neural network to distinguish between the different categories of image in the CIFAR-10 dataset. This is a standard collection of 32×32 pixel images (some of which can be seen in the featured image for this post), featuring 10 categories of object, including dogs, ships, and airplanes.
Diagram of Resnet Architecture
The neural network being used for these experiments is Resnet, a standard algorithm for this type of problem. The specific version being used has 44 layers and over 600,000 trainable parameters. While this sounds like a lot, training a network of this scale would be considered a medium sized task in deep learning. This is helpful because it means my experiments take a few minutes rather than hours or days to run but it should be noted that many state-of-the-art algorithms are a lot more complex.
The Results are in
This is the section where I include a large number of graphs which came from the results of the experiments described in the previous paragraph. The first one, above, shows the average throughput (number of images processed per second during training) from when the model was fitted over 40 epochs (sweeps through the entire dataset) on varying numbers of GPUs. Metrics of this type are often used by chip manufacturers and organisations who build ML software to explain why their product is worth using.
The main trend which can be seen in this graph is that Tensorflow/Horovod and Pytorch both scale quite a bit better than Keras. This may not be unexpected given that Keras is the most high-level Framework considered and may have some hidden overheads slowing everything down. There’s also the fact that when using the built in Keras multi GPU functionality, only the training is split over multiple GPUs and not any of the other potentially time-consuming steps like loading and processing training data.
While looking at throughput is a very nice, mathematical way to see how fast your setup is running, a metric more likely to help you decide which setup you should actually use is the time it takes for your model to reach a certain accuracy (which in this case is the proportion of images in the test set it guesses correctly). This is shown above for a range of accuracies for the two frameworks with the highest throughput. One trend visible in both cases is that once 4 or more GPUs are used, the benefits of adding more start to look limited. Note that Tensorflow actively slowed down and didn’t reach the 80% mark in its 40-epoch run once 12 GPUs were used. This is likely because the tweaks that make the code scale better, described in the first paragraph, have the effect of increasing the batch size (the number of training examples processed at the same time) to the point where it becomes too large to train effectively. It also appears that the curves for Horovod aren’t quite as smooth as those for Pytorch. This might take longer to explain than is reasonable for one post, but the short answer is that Horovod cuts some corners when handling the weights for something called Batch Normalisation, and this causes the accuracy to bounce around a bit early in a run.
In the last of the graphs for this post, the speedup (performance relative to using a single GPU) is shown for both the throughput and the time to reach 75% accuracy for the two best frameworks. In each case, the results reinforce the evidence seen so far that Pytorch scales better than the other frameworks considered. As far as throughput is concerned, both frameworks scale well for small numbers of GPUs, with discrepancies creeping in once more than 4 are used. However, it’s quite clear from looking at the difference in time to reach 75% that the changes to the training process needed to get decent scalability can slow down training and add to the inefficiencies caused by splitting the work over too many processors.
What comes next?
So now that I have these initial results, the main question is what to do in the three weeks I have left to make this project more interesting and useful. The first item on the agenda is to test the most promising frameworks (currently Pytorch and Horovod) on larger problems to see how my current results scale. More comprehensive CPU-only experiments could also be worthwhile.
Beyond running more experiments, the main priority is adding support for a few more frameworks in the code. The standard built-in method for distributed training in Tensorflow is currently in the pipeline. This could be interesting, given that its perceived flaws served as Uber’s motivation for creating Horovod in the first place. Apache’s MXNet could also be making an appearance in future results, time permitting. Another important job for my last three weeks is to make this code a bit more general, so it can benchmark a wider variety of tasks in a user-friendly manner. Currently everything is fairly modular, but it may not be easy to use for tasks that are radically different from the one described above, so there’s still a bit more to be done to reduce the work needed for other users to use it once I leave.
So, as I mentioned in my previous post, my next task was to set up a virtual machine with KVM, the Linux hypervisor, and libvirt, a virtualization API, and add an encrypted disk to it. This I achieved yesterday with the LUKS encryption mechanism supported by libvirt.
But enough with throwing around ominous acronyms; what I did was fairly simple. As I said, KVM is a Linux hypervisor, which basically means it supervises the virtual machines you have running on your Linux system. To create or edit a VM, you can use libvirt, an API that registers components (like disks or networks) with KVM, creates virtual machines according to settings provided by the user, and hands them over to KVM. Basically, it acts as a bridge between KVM and the user. To tell libvirt what you want, it is easiest to write an XML file as a sort of “recipe” for the new virtual component. For example, if you were to create a new virtual disk, it could look like this:
<volume>
  <name>luksvol.img</name>
  <capacity unit='M'>500</capacity>
  <target>
    <path>/guest_images/luksvol.img</path>
    <encryption format='luks'>
      <secret type='passphrase' uuid='…'/>
    </encryption>
  </target>
</volume>
With this XML file, you tell libvirt that your new virtual disk shall be named luksvol.img and have a capacity of 500 MB; in the host system, it is located at /guest_images/luksvol.img. The encryption field is the most important one for my project: it says to use the LUKS encryption format and specifies the properties of the encryption.
LUKS is short for Linux Unified Key Setup: it was invented to standardize hard disk encryption in Linux. As you might know, there are a lot of encryption algorithms, and before LUKS, every algorithm had its own tool with its own commands, and you needed yet another tool to take care of your key management. Just like when you have different e-mail addresses with different e-mail providers, and on one web portal the logout button is on the right, on the other one on the left. LUKS is the Mozilla Thunderbird of hard disk encryption: it lets you select the disks you want to encrypt and specify the encryption algorithm to use. It takes care of the key management and, regardless of the algorithm you are using, the commands to encrypt a hard disk are always the same.
But what does key management mean? To encrypt a message (and a hard disk is nothing more than a potentially very long message), you need a (master) key. If you are not sure how that would work, have a look at the Vernam cipher. But as I’m sure you are all aware, you should change your passwords frequently. If you did so with this key, the whole message would have to be encrypted all over again, which would take considerable time for a long message. This is why LUKS makes you define a user key, which in turn encrypts the master key generated by the system. The master key is used to encrypt the hard disk. When you change the user key, only the master key has to be re-encrypted; the hard disk stays the same. This is called a two-level key hierarchy.
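Here is a toy Python illustration of that two-level hierarchy, using a throwaway XOR (Vernam-style) cipher in place of the real ciphers and key-derivation functions LUKS uses; all names are made up:

```python
import os

def xor_bytes(data, key):
    """Toy Vernam cipher: XOR with a key of equal length.
    Applying it twice with the same key restores the data."""
    return bytes(d ^ k for d, k in zip(data, key))

disk_plain = b"16 bytes of data"
master_key = os.urandom(len(disk_plain))

# The disk is encrypted exactly once, with the master key.
disk_encrypted = xor_bytes(disk_plain, master_key)

# The master key itself is stored encrypted under the user key.
user_key = os.urandom(len(master_key))
stored_key = xor_bytes(master_key, user_key)

# Changing the password: only the master key is re-encrypted;
# the (potentially huge) disk image is untouched.
new_user_key = os.urandom(len(master_key))
stored_key = xor_bytes(xor_bytes(stored_key, user_key), new_user_key)

# Unlocking with the new password still recovers the disk contents.
recovered_master = xor_bytes(stored_key, new_user_key)
assert xor_bytes(disk_encrypted, recovered_master) == disk_plain
```

Changing the password here touches only the 16-byte stored key, never the encrypted "disk", which is the whole point of the two levels.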
LUKS also takes a few measures for increased security. While I don’t want to go into detail here, all those measures follow the same principle: make the attacker’s life hard. There is no practical way to set an unbreakable password, but there are many ways to slow down the password computation on the attacker’s side, so LUKS makes sure an attacker has to go through many CPU-intensive operations to try each password. That doesn’t make breaking the encryption impossible, but it does make it unlikely.
That’s it for now, next up is some benchmarking on this setup. I’m very excited for that since it is something I have never done!
After World War 2, the story started with “a simple question” from a great scientist named Stanislaw Ulam.
During his convalescence, he entertained himself by playing solitaire. While playing, the question “What are the chances that a solitaire laid out with 52 cards will come out successfully?” appeared in his mind. This question sparked an interest in determining the probability of winning and calculating the outcome of each event in the game. He explained the rest in his own words:
When he shared his ideas with John von Neumann, another great scientist, they decided to name the method after the gambling spot in Monaco: “Monte Carlo”. The first formulation of a Monte Carlo computation for an electronic computing machine was outlined by von Neumann, to solve neutron diffusion and multiplication problems.
Since then, the Monte Carlo method has helped simulate experiments and their outcomes. The magic behind it is repeating the experiment with random sampling: the more you repeat the experiment, the better the results become. Nowadays, Monte Carlo is an important and handy method in finance, physics, and even quantum mechanics. Consequently, the Monte Carlo method became one of the driving factors behind the birth of High-Performance Computing.
Let’s start with the classic example: estimating Pi with Monte Carlo.
Imagine you are playing dart with this funny board:
Figure 1: The Board. [Adapted from the reference: nicoguaro, CC BY 3.0, via Wikimedia Commons]
Imagine the darts are only allowed to hit inside the square with area 1 cm², and only the darts that hit inside the quarter circle, with area π/4 cm², are counted. If the number of darts that hit inside the quarter circle is divided by the total number of darts thrown (n) and multiplied by 4, pi can be estimated.
Figure 2: Simulation of Darts and Estimation of Pi. [Reference: nicoguaro, CC BY 3.0, via Wikimedia Commons]
However, you need to throw a great many darts to get a good estimate, and nobody likes to wait for the calculations. That’s why High-Performance Computing matters here, too.
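Here is a minimal Python version of the dart-throwing experiment (an illustration, not HPC-grade code; the fixed seed just makes the run reproducible):

```python
import random

def estimate_pi(n_darts, seed=42):
    """Throw darts uniformly into the unit square and count the
    fraction landing inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_darts):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_darts

print(estimate_pi(1_000_000))  # roughly 3.14, improving with more darts
```

Each dart is independent of all the others, which is exactly why Monte Carlo methods parallelise so well: every core can throw its own darts and the hit counts are simply summed at the end.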
Did you like the Monte Carlo method and HPC? If you are interested in High-Performance Computing, why don’t you take a look at the PRACE tutorials?
Let’s catch up with what my tasks have consisted of so far, one month into the SoHPC.
Since my very first day at STFC-Hartree, I was introduced to the DL_Meso simulation package, particularly to the Dissipative Particle Dynamics library, or DPD.
Briefly, this code allows simulating the physics of the mesoscale. Let’s assume that we are trying to perform an experiment with one or more fluid phases: a classical example could be a mixture of oil and water.
Representation of a two-phase mixture, for example mixed water and oil.
By mesoscale I mean the range of sizes between molecules (micro-scale) and droplets (macro-scale). The DL_Meso DPD code deploys a finite number of particles, or beads, whose dynamics will be simulated; as a result, we obtain information on our experiment according to the size of our beads and to the simulation volume considered.
Clearly, as the number of beads increases, we have to move to more powerful computers. Moreover, the code can benefit from the usage of GP-GPU accelerators (General-Purpose computing on Graphics Processing Units) by means of CUDA C, therefore its possibilities are enormous in terms of running configurations.
The main task I am facing during SoHPC (https://summerofhpc.prace-ri.eu/author/davideg/) is to understand how the DL_Meso DPD code scales or, in simpler words, how the time required to produce an experimental result is affected by the number of exploited resources.
Strong scaling experiment, where the simulation size is kept constant while the number of resources is increased. In blue the experimental results, in orange the ideal scaling. Ideal speed-up happens if, for doubled resources, we obtain half the execution time.
As you can see from the picture above, which shows the first results obtained, the gap between the ideal speed-up and my data is significant. From this, we understand that the problem is not that simple: many factors influence the performance of our code, above all Input/Output and the number of simulated beads.
In conclusion, exploring the capabilities of the DL_Meso code is keeping me busy during SoHPC. We are talking about Piz Daint, the most powerful supercomputer in Europe, which allows me to use up to 5704 CPU-GPU nodes (https://www.cscs.ch/computers/piz-daint/); some caution is mandatory, especially when launching jobs on 2000 of them! As I gain more confidence with the software and the supercomputer, some extensions will be carried out in order to broaden the possible scaling experiments or verify new features.
After completing the training, my final project has now begun!
DAY 37/60
After the initial week of training in Italy and the month of training in Greece, my final project has begun! I am so grateful for this in-depth training period, as it taught me an array of things ranging from molecular dynamics simulations to the in-depth biology and chemistry that explain some of the science governing them. As a theoretical physicist, this has been very new territory for me. However, it has shown me that the skills I acquired from physics are transferable. In addition, for the past couple of years I have been intrigued by biophysics, and this internship has been inspiring and has assured me that introducing biophysics modules into my master’s degree next year would be of great benefit to me. If they are anything like this internship, then I know I will find them extremely engaging.
This image is of me in the lab whilst working on visualising the K-Ras protein using VMD, a visual molecular dynamics program.
As a recap from earlier in the internship, a Nanoscale Molecular Dynamics (or, as some call it, Not Another Molecular Dynamics program) (NAMD) tutorial was completed, after writing a report on the inter- and intramolecular bonds in proteins and protein-ligand complexes. Last week, the work I had done for these tasks was reviewed and, after meetings with my mentor, helpful updates were included, allowing my work and my understanding of the science to develop further. I also completed another two tutorials that aided my understanding of the process behind drug design and the biochemistry that governs the interactions and bonds in proteins. For example, this included studying haemoglobin, a protein in red blood cells. The protein data bank (pdb) file, which is needed in order to visualise the structure in VMD, a visual molecular dynamics program, was downloaded from the RCSB (Research Collaboratory for Structural Bioinformatics) website. This protein is shown below. A range of other proteins I investigated are also shown in the following images. Some of the images represent the same protein but are displayed using different drawing modes. These different ways to view the same structure have proved very useful throughout my project: depending on the reason for looking at the protein, the most suitable drawing type can be chosen. For example, if the general structure of the protein is to be examined, the ‘NewCartoon’ drawing method is helpful. On the other hand, to see the individual bonds, the ‘licorice’ drawing method may be the most appropriate.
This figure shows the pdb file of haemoglobin (obtained from the RCSB website) visualised in VMD.
This image shows beta sheets of model silk fibroin peptides using the ‘licorice’ drawing method.
This image shows the same protein as the previous image but using the ‘NewCartoon’ drawing method.
My favourite parts of this internship so far have been visualising the proteins and the final simulations at the end of each step in the process. I find the capabilities of these molecular dynamics simulations and supercomputers incredibly fascinating, and I feel so lucky to be able to learn how to utilise them. I am excited to see what the future brings, as I hope to one day join the scientists working to use these simulations, and others like them, to reduce animal testing. I am so pleased to be working on a project that is teaching me essential knowledge for this field of study.
After finishing these tutorials, I completed a Glide and a SiteMap tutorial, the latter of which I will expand on now. The location where a potential drug could bind to a protein is usually known if the protein’s structure is known. However, in some cases the binding site of a protein is unknown, making it hard to know the types of drugs that could bind to it. Therefore, computational studies can be used to help find these potential sites without much prior knowledge of the protein structure. Initially, a couple of regions on the protein’s surface, called ‘sites’, are identified as potentially promising. These are located using a grid of points called ‘site points’. Various further stages follow, which all help ensure the best chance of successfully finding a good binding site. This information is provided to Maestro in order to visualise the process. (Source: SiteMap User Manual, Schrodinger Press, 2009)
The very first task of my final project was to write a document explaining the functional domains, the mechanism of action, and how the mutation affects both of these for the K-Ras4b protein. I found this task very beneficial for clarifying both my understandings and misunderstandings of these processes, and it ensured that I was clear about the properties of this protein. The NAMD tutorial, completed previously, was almost like a trial run for me. The tutorial focussed on the protein lysozyme and took me through multiple steps such as minimisation, heating and equilibration. After learning about these processes, along with the other tutorials, I am now ready to start my final and main project. I will be performing similar minimisation, heating and equilibration steps (just to name a few) on K-Ras4b proteins (one of which is shown below), including both the wild type (referring to the unmutated state) and the G12D mutated protein, meaning that the amino acid in position 12 has mutated from a glycine to an aspartic acid. I will then use SiteMap and normal mode analysis to carry out binding site identification in order to determine whether any cavities exist on the proteins in question.
This is a visualisation of a K-Ras protein using VMD. Source: RCSB
This clip shows a K-Ras wild type protein bound with GDP. It is visualised using VMD.
Still to come… As the final few weeks approach, there may just be a few more tasks to complete (as well as the project itself), but they may be the most crucial of all! There will be one more blog post by me in two weeks’ time, which will summarise my project, give some more details about how it was carried out and, finally, a summary of the results obtained. I promise to make sure it includes plenty of really interesting videos of the simulations and photos to accompany them! I am also in the process of creating a 5-minute video to explain my project from beginning to end, as well as a final report that will be submitted to PRACE. Moreover, I will be delivering a 20-minute presentation and then showing my video to my colleagues at BRFAA, which will round off my internship completely. I hope you can follow me on the remainder of my journey! I would also love to hear if you have done any similar projects this summer, or if you have an interest in something in the same area of study. Let me know your thoughts in the comments below!
[Video] Hi from Dublin everybody! Welcome back to my blog or welcome for newcomers.
If you have missed it, my previous blog post is available here! No worries, there is no need to read it before this one to understand today’s topic. You can read it afterwards to find out more about me if you want, but do follow along until the end of this post, because I have a special announcement to make!
Before explaining how you can make your computer run your applications faster, I just wanted to come back quickly to our training week in Bologna, Italy, and update you on my current situation.
Training week summary
As a picture is worth a thousand words, I let you discover by images the summary of this amazing training week.
SoHPC 2019 training week summary by images
Where am I now?
Since the 6th of July, I have been in Dublin with Igor, another PRACE SoHPC 2019 student. We are both working on our projects at ICHEC (Irish Centre for High-End Computing), an HPC centre. Paddy Ó Conbhuí (my super project mentor at ICHEC) and I are dealing with a parallel sorting algorithm, the parallel Radix Sort, which is my project.
Enough digressions, you are most probably here to read about application and computer speed.
Pursuit of speedup
To make our applications and programs run faster, we first have to find where, in our programs, computers spend most of the execution time. Then we have to understand why, and finally we can figure out a solution to improve it. This is how HPC works:
Identify the most time-consuming part of a program
Understand why it is
Fix it / Optimize it
Iterate this process with the new solution until you are satisfied with the performance
Let’s apply this concept to a real-world example: how can we find a way to improve programs in general? Not a specific algorithm or program, but most of them, in a general way? It is an open and very general question with a couple of possible answers, but we will focus on one: optimizing sorting algorithms. We are going to see why.
How do computers spend their time?
We have to start by asking what takes time and can be improved in computer applications. Everything starts with an observation:
“Computer manufacturers of the 1960’s estimated that more than 25 percent of the running time of their computers was spent on sorting, when all their customers were taken into account. In fact, there were many installations in which the task of sorting was responsible for more than half of the computing time.”
It has become rare to work with data without having to sort it in some way. On top of that, we are now in the era of Big Data, which means we collect and deal with more and more data from daily life: more and more data to sort. Moreover, sorting algorithms are building blocks of plenty of more complex algorithms, such as searching algorithms. On websites or in software, we constantly sort products or data by date, price, weight or whatever. It is a very frequent operation. Thus, it is probably still true that your computer spends more than a quarter of its computing time sorting numbers and data. If you are a computer scientist, just think about how often you have had to deal with sorting algorithms: either writing one, using one, or using libraries that rely on sorting algorithms behind the scenes. Now the next question is: what kind of improvement can be made regarding sorting algorithms? This is where the Radix Sort and my project come into play.
Presentation of the Radix Sort
It has been proved that for a comparison-based sort (where nothing is assumed about the data to sort except that any two elements can be compared), the complexity lower bound is O(N log(N)). This means that you cannot write a sorting algorithm which both compares the data to sort them and has a complexity better than O(N log(N)). You can find the proof here.
The Radix Sort is a non-comparison-based sorting algorithm that can run in O(N). This may sound strange because we usually compare numbers two by two to sort them, but Radix Sort allows us to sort without comparing the data.
You are probably wondering how and why Radix Sort can help us improve computer sorting from a time-consuming point of view. We will see that after explaining how it works and going through some examples.
How does Radix Sort work?
Radix Sort takes in a list of N integers which are in base b (the radix) and such that each number has d digits. For example, three digits are needed to represent the decimal number 255 in base 10. The same number needs two digits in base 16 (FF) and eight digits in base 2 (1111 1111). If you are not familiar with number bases, you can find more information here.
Radix Sort algorithm:
Input: A (an array of numbers to sort)
Output: A sorted
1. for each digit i where i varies from least significant digit to the most significant digit:
2. use Counting Sort or Bucket Sort to sort A according to the i’th digit
We first sort the elements based on the last digit (least significant digit) using Counting Sort or Bucket Sort. Then the result is again sorted by the second digit, continue this process for all digits until we reach the most significant digit, the last one.
Let’s see an example.
Radix Sort example with an equal number of digits for all the numbers to sort Source: https://brilliant.org/wiki/radix-sort/
In this example, d equals three and we are in base 10. What if the numbers don’t all have the same number of digits in the chosen base? It is not a problem: d will be the number of digits of the largest number in the list, and we pad the other numbers with leading zeros until they all have d digits too. This works because it does not change the value of the numbers; 00256 is the same as 256. An example follows.
Radix Sort example with an unequal number of digits for the numbers to sort. We fill the most significant digits with zeros for the numbers which don’t have as many digits as the largest number in the list to be sorted. Here, they all have three digits except 9, which becomes 009. Source: https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/sorting/radix-sort
Keep in mind that we could have chosen any other number base and it would work too. In practice, to implement Radix Sort, we often use base 256 to take advantage of the bitwise representation of the data in a computer. Indeed, a digit in base 256 corresponds to a byte. Integers and reals are stored on a fixed and known number of bytes, and we can access each of them. So there is no need to look for the largest number in the list to be sorted and pad the other numbers with zeros as explained above. For instance, we can write a Radix Sort function which sorts int16_t (integers stored on two bytes) and know in advance (while writing the sort function) that all the numbers will be composed of two base-256 digits. Plus, with template-enabled programming languages like C++, it is straightforward to make it work with all other integer sizes (int8_t, int32_t and int64_t) without duplicating the function. From now on, we assume that we use base 256.
Why use counting sort or bucket sort?
First, if you don’t know Counting Sort or Bucket Sort, I highly recommend reading about them and figuring out how they work. They are simple and quick to understand, but exposing and explaining them here would make this post too long, and it is not really our purpose today. You will find plenty of examples, articles and videos about them on the internet. Sorting algorithms are at least as old as computer science and the first computers, and they have been studied a lot since the beginning of programmable devices. As a result, there are a lot of them, and it is not only good but also important to know the main ones and when to use them. Counting and Bucket sort are among the best known.
They are useful in the Radix Sort because we take advantage of knowing the full range of values one byte can take: the value of one byte is between 0 and 255. This helps us because, under such conditions, Counting Sort and Bucket Sort are super simple and fast! They can run in O(N) when the length of the list is at least a bit greater than the maximum (in absolute value) of the list, and especially when this maximum is known in advance. When it is not known in advance, it is trickier to make Bucket Sort run in O(N) than Counting Sort. However, in both cases, the maximum and its issues have to be managed dynamically. They can run in O(N) because they are not comparison-based sorting algorithms. In our case, we sort according to one byte, so the maximum we can have is 255. If the length of the list is greater than 255, which is a very small length for an array, the Counting and Bucket sorts inside Radix Sort can easily be written with O(N) complexity.
Why not simply use either Counting or Bucket Sort to sort the whole list in one go? Because we would no longer have our assumption about the maximum, as we would no longer be sorting according to one byte. Under such conditions, the complexity of Counting Sort is O(N+k), and Bucket Sort can be worse depending on the implementation. Here, k is the maximum number of the list in absolute value. In contrast, with Radix Sort, you have in the worst case O(8*N), which is O(N), and we will explain why. In other words, since Radix Sort iterates through bytes and always sorts according to the value of one byte, it is insensitive to the k parameter, because we only care about the maximum value of one byte; both Counting and Bucket Sort, on the other hand, have execution times that are highly sensitive to the value of k, a parameter we rarely know in advance.
The last reason we use them in Radix Sort is that they are stable sorts, and Radix Sort relies entirely on stability: we need the previous iteration’s output in stable sorted order to perform the next one. There is no better way to understand why than to try it yourself quickly on a sheet of paper with a non-stable sort. Actually, you can use any stable sorting algorithm with O(N) complexity instead of them. There is no point in using one with a complexity of O(N log(N)) or higher, because you would call it d times, and in that case it would be worse than calling it once on the entire list to sort all the numbers in one go.
Complexity of Radix Sort
The complexity is O(N*d) because we simply call, d times, a sorting algorithm running in O(N). Nothing more. The largest common integer size on a computer is the 8-byte integer, so, assuming we are dealing with such a list of integers, the complexity using Radix Sort is O(8*N). The complexity will be lower if we know that they are 4- or 2-byte integers, as d will equal 4 or 2 instead of 8.
LSD VS MSD Radix Sort
The algorithm described above is called LSD Radix Sort. Actually, there is another Radix Sort called MSD Radix Sort. The difference is:
With the LSD, we sort the elements based on the Least Significant Digit (LSD) first, and then continue to the left until we reach the most significant digit
With the MSD, we sort the elements based on the Most Significant Digit (MSD) first, and then continue to the right until we reach the least significant digit
The MSD Radix Sort implies a few changes, but the idea remains the same. It is a bit out of our scope for today, so we will not go into further detail, but it is good to know it exists. You can learn more about it here.
How can Radix Sort be useful to gain speedup?
The Radix Sort is not used very often, even though, well implemented, it is the fastest sorting algorithm for long lists. When sorting short lists, almost any algorithm is sufficient, but as soon as there is enough data to sort, we should choose carefully which one to use: the potential time gain is not negligible. It is true that “enough” is quite vague; roughly, a length of 100,000 is already enough to feel a difference between an appropriate and an inappropriate sorting algorithm.
Currently, Quicksort is probably the most popular sorting algorithm. It is known to be fast enough in most cases, although its complexity is O(N log(N)). Remember that this is the lower bound for comparison-based algorithms, while Radix Sort has a complexity of O(N*d). Thus, Radix Sort is more efficient than comparison-based sorting algorithms as long as the number of digits (the d parameter) is less than log(N). This means the bigger your list, the more you should consider Radix Sort: beyond a certain length, Radix Sort will be more efficient. We have known the Radix Sort since at least the punch-card era, and it can do much better. So why is it not used more in practice?
Typical sorting algorithms sort elements using pairwise comparisons to determine ordering, meaning they can be easily adapted to any weakly ordered data types. The Radix Sort, however, uses the bitwise representation of the data and some bookkeeping to determine ordering, limiting its scope. Plus, it seems to be slightly more complex to implement than other sorting algorithms. For these reasons, the Radix Sort is not so used and this is where my project comes into play!
My project
My project is to implement a reusable C++ library with a clean interface that uses Radix Sort to sort a distributed (or not) array, using MPI if distributed. The project focuses on distributed arrays using MPI, but the library will most probably also have an efficient serial version of Radix Sort. The goal is for the library to let users sort any data type, as long as they can provide a bitwise representation of their data. If the data are not originally integers or reals, users will have the possibility to provide a function returning a bitwise representation of their data.
Challenge
The time has come to make the announcement! I will soon launch a challenge, with a gift card for the winner, which simply consists in implementing a sorting algorithm that can sort faster than mine, if you can… A C++ code will be provided with everything, including a Makefile; your only task: fill in the sorting function. Open to everybody, with only one rule: be creative. Can you defeat me?
All information concerning the challenge will be in my next blog post, so stay tuned! I am looking forward to telling you more about it.
To be honest, this post should have been written before the previous one. But as they say, better late than never.
If the previous post consisted of describing the theoretical principles behind an aerodynamics simulation, this one is going to deal with the actual user-level experience of doing it in a supercomputer. Again, the aim is that absolutely anyone that ends up in this post is able to understand what I am writing about. Feel free to drop a comment if something is not clear enough.
From now on, let’s assume that, for whatever reason, you have been given the opportunity to use a supercomputer. This post is about what the experience would be like, and what you would have to do in order to use it, as opposed to a regular computer.
How do I turn it on?
That is indeed a logical question. When you want to use your personal computer, the first thing you have to do is push the on/off button. However, things work completely differently in that respect in a supercomputer. A supercomputer is always on.
As I said in the previous article, one can think of a supercomputer as a regular computer that features many more processors, much more RAM and, eventually, a much bigger size. But there are many other differences that I purposely omitted last time.
To begin with, a supercomputer is so large that it is rarely used by just one user at a time. Typically, many different users are connected to it, performing completely independent tasks. And the word connected is highlighted because it is very relevant: you do not physically use the supercomputer, but rather you remotely connect your computer to it. You gain access by typing an address (a username) and its associated password, and then you can use it from your computer as if you were using your own machine. For you, the supercomputer will be just one more window on your computer, with the key difference that the simulations you run will not be physically run by your computer (i.e. you will not hear the sound of the fans revving up). They will be run entirely on the supercomputer, which may be on the other side of the wall, or thousands of kilometres away from you. For those of you who have already connected to another computer via TeamViewer, this concept will feel familiar.
The phrase as if you were using your own machine is also a little optimistic and, essentially, inaccurate. There will indeed be some major differences, especially from a Windows user’s perspective.
The most intuitive difference is that, at first, you will not be able to access the files and programs installed on your personal computer from the supercomputer window. So if you want to use some of your PC/laptop files on the supercomputer, you will have to find a way to transfer them. Of course, there are easy solutions for that task. But it is something that must be kept in mind: even if you see the supercomputer as just a window open on your computer, it is not possible to do things like drag and drop files into it. It is still a different computer.
Windows? What is that?
The other key difference (and this is why I mentioned the Windows-user perspective above) is that the vast majority of supercomputers run Linux instead of Windows or any other operating system, for a variety of reasons (customizability, licensing, or potential users’ knowledge). So if your PC runs Windows and you have never used Linux, this also constitutes a great difference.
Screenshot of my login running session in the Salomon supercomputer, taken at the very moment I was writing this post. Own elaboration.
Of course, Linux is far from being something exclusive of supercomputers, and this post is not intended to describe the differences between Windows and Linux. But still, I think that a general view of how to deal with a supercomputer cannot be provided without saying something about Linux.
When working with Linux, the most efficient way of navigating through the computer and performing tasks is not by using a mouse and double-clicking on icons, which is the conventional way of working in Windows. In Linux, once you are connected to the supercomputer, you enter folders or open files by typing commands in the terminal, as represented in the image above. For example, if we are on the desktop, we can type ls and hit enter, and we will get a list of the folders and files there, even if we cannot see them directly. And if we want to enter a folder, we write cd followed by the name of the folder and hit enter again, instead of double-clicking on the folder. ls and cd are just two of the many commands that mean that, once you have mastered Bash and the shell (which can be said to be the language and the internal program on which all this is based), working in Linux is very efficient.
Another different question is how one can connect to a Linux-based supercomputer from a Windows-based personal computer, which is often the case. However, this is something more difficult to explain than to actually do and does not add much value to the post, so it will be set aside on this occasion.
I want to use a supercomputer. Is this possible?
Yes, you can use a supercomputer if you want to. But you have to truly want it, which means you have to present a project that is deemed worthy of those resources. If you are successful, you will be granted a given amount of core hours, which basically represent the amount of time you can use the supercomputer. The actual time will depend on how many of its processors you use (which is why the time is allocated in core hours and not simply in hours of usage).
For example, the European computers that belong to the PRACE network periodically hold a contest in which the available core hours of the supercomputer for a period of time are assigned to the best projects. In particular, the rules applying to the current contest for the supercomputers’ resources here in Ostrava can be found here.
What happens if the supercomputer is overcrowded?
Since core hours are distributed among users only once every several months, users have a fair amount of freedom to choose when to use the supercomputer. It may happen that, suddenly, many users want to work with the supercomputer at the same time. This happens from time to time and, in fact, it was happening at the moment I took the image above (the capital Q’s mean that the simulations I want to run are in the queue, waiting for others to finish).
This issue is so important that there is a whole field of research around it. Because, how do we decide which user should have priority when using the supercomputer? To make it as fair as possible, complex algorithms (called job scheduling algorithms) are implemented to distribute the time among users in an appropriate way, taking into account factors such as how much time a user has already consumed, how long their simulations will take, etc. A very nice introduction to this topic was given by one of last year’s Summer of HPC participants in their final presentation.
Conclusions
The objective of this post was to let you imagine what it would be like to use a supercomputer, especially if it is a situation you will never face in your life. I do not know what level of immersion I have achieved, but I hope it will at least give you a rough idea of what this world is about. Next time someone tells you about supercomputing, you can answer: “Hey, you are not going to impress me with that. I know what you are talking about!”.
——————————————————————————————————–
Ahora, ¡en español!
¿Cómo se usa un superordenador?
Siendo honesto, este post debería haber sido escrito antes que el anterior. Pero más vale tarde que nunca.
Si el post anterior consistía en describir los principios teóricos de una simulación aerodinámica, este va a tratar sobre la experiencia real de hacerlo en un superordenador a nivel de usuario. Una vez más, el objetivo es que cualquier persona que acabe en este post sea capaz de entender de lo que estoy escribiendo. Siéntete libre de escribir un comentario si algo no está lo suficientemente claro.
De ahora en adelante, supongamos que, por cualquier razón, te han dado la oportunidad de usar un superordenador. El post va a ir sobre cómo sería la experiencia, y qué tendrías que hacer para poder usarlo diferente de lo que harías en un ordenador normal.
¿Cómo lo enciendo?
Esta es una pregunta bastante lógica. Cuando uno quiere utilizar su ordenador personal, lo primero que tiene que hacer es pulsar el botón de encendido/apagado. Sin embargo, las cosas funcionan de forma completamente diferente en un superordenador. Un superordenador siempre está encendido.
Como dije en el artículo anterior, uno puede pensar en un superordenador como un ordenador normal que cuenta con muchos más procesadores, mucha más RAM y, en definitiva, un tamaño mucho mayor. Pero hay muchas otras diferencias que omití a propósito en el último post.
Para empezar, un superordenador es tan grande que rara vez es utilizado por un solo usuario a la vez. Lo normal es que muchos usuarios diferentes estén conectados al mismo tiempo, realizando tareas completamente independientes. Y la palabra conectados está resaltada porque es muy importante: el superordenador no se utiliza físicamente, sino que el usuario se conecta a él de forma remota: se accede a él escribiendo una dirección (un nombre de usuario) y su contraseña asociada, y se utiliza desde el ordenador personal. Para ti, el superordenador será sólo una ventana más en tu ordenador, con la diferencia clave de que las simulaciones que se ejecutarán no serán ejecutadas físicamente por el ordenador (es decir, no se oirá el sonido de los ventiladores de tu ordenador acelerando). Se ejecutarán integralmente en el superordenador, que puede estar al otro lado de la pared, o a miles de kilómetros de distancia. Para aquellos que ya se han conectado a otro ordenador a través de Team Viewer, este concepto será más familiar.
The phrase as if you were using your own computer is a little optimistic and, essentially, inaccurate. In reality, there will be some important differences, especially from the perspective of a user accustomed to Windows.
The most intuitive difference is that, at first, you won't be able to access the files and programs installed on your personal computer while you are in the supercomputer's window. So, if you want to use some of the files from your PC/laptop on the supercomputer, you will have to find a way to transfer them. Of course, there are easy solutions for that task. But it's an example of something to keep in mind: even if you see the supercomputer as a window open on your computer, you can't do things like drag and drop files into it, or simply copy and paste. It is still a different computer.
Windows? What's that?
The other key difference (and this is why I mentioned the Windows user's perspective above) is that the vast majority of supercomputers are based on Linux rather than on Windows or any other operating system, for various reasons (customizability, licensing, and the expertise of potential users). So if your PC runs Windows and you have never used Linux, this will also be a big difference between using your usual computer and using a supercomputer.
Screenshot of my session on the Salomon supercomputer, taken while I was writing this post.
Of course, Linux is far from being exclusive to supercomputers, and this article does not aim to describe the differences between Windows and Linux. Still, I don't think you can give an overview of how to deal with a supercomputer without saying a little about Linux.
When working with Linux, the most efficient way to navigate the computer and perform tasks is not using the mouse and double-clicking icons, which is the conventional way of working in Windows. In Linux, once connected to the supercomputer, you enter folders or open files by typing commands (lines of text) in the terminal, as shown in the image above. For example, if we are on the desktop, we can type ls and hit enter, and we get a list of the folders and files there, even if we can't see them directly. And if we want to enter a folder, we type cd followed by the folder's name and hit enter, instead of double-clicking on it. ls and cd are just two of the many commands that, once you have mastered bash and the shell (roughly speaking, the language and the underlying program all of this is based on), make working in Linux very efficient.
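A minimal session along these lines might look like the following; the folder name simulations is invented for the example (and the first command creates it, so the rest works anywhere):

```shell
mkdir -p simulations   # create an example folder (no error if it already exists)
ls                     # list the files and folders in the current directory
cd simulations         # enter the folder, instead of double-clicking on it
pwd                    # print the full path of where we are now
```

Each line is typed at the terminal prompt followed by enter, exactly as described above.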
A different matter is how it is possible to connect to a Linux-based supercomputer from a Windows-based personal computer, which is usually the case. However, that is harder to explain than to do and doesn't add much value to the post, so this time I won't go into it.
I want to use a supercomputer. Is that possible?
Yes, you can use a supercomputer if you want to. But you have to really want it, which means submitting a project that is considered worthy of those resources. If you succeed, you will be granted a certain number of core hours, which basically represent the amount of time you will be able to use the supercomputer. The actual time you can use it depends on how many of the supercomputer's processors you use: for example, 100,000 core hours could mean 1,000 cores for 100 hours, or 100 cores for 1,000 hours (that is why the time is allocated in core hours, and not simply in hours of use).
For example, the European computers that belong to the PRACE network periodically hold a call in which the core hours available on a supercomputer over a period of time are allocated to the best projects. In particular, the rules that apply to the call for the supercomputing resources here in Ostrava can be found at this link.
What happens if too many people want to use the supercomputer at once?
Since core hours are distributed among users only once every few months, users have a lot of freedom to choose when to use the supercomputer. It may happen that, suddenly, many users want to work with the supercomputer at the same time. This is something that happens from time to time, and in fact it was happening at the moment I took the screenshot above. The capital Qs mean that the simulations I want to run are in the queue, waiting for others to finish.
This topic is so important that there is a whole world of research around it. Because, how do we decide which user should have priority when using the supercomputer? To make it as fair as possible, complex algorithms (called job scheduling algorithms) are implemented to distribute the time among users appropriately, taking into account factors such as the time a user has already consumed, how long their simulations will take, and so on. A very interesting introduction to this topic was given by one of last year's Summer of HPC participants in their final presentation.
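Just to make the idea concrete, here is a deliberately naive sketch of such a policy in Python. The scoring formula, job names and numbers are all invented, and real job schedulers are far more sophisticated:

```python
# A toy fair-share style scheduling policy: jobs from users who have
# already consumed more core hours are pushed back in the queue, and
# shorter jobs are slightly favoured. Purely illustrative.
def priority(job):
    # lower score = runs earlier
    return job["hours_already_used"] + 0.1 * job["estimated_hours"]

queue = [
    {"name": "jobA", "hours_already_used": 500, "estimated_hours": 10},
    {"name": "jobB", "hours_already_used": 20,  "estimated_hours": 200},
    {"name": "jobC", "hours_already_used": 20,  "estimated_hours": 5},
]
for job in sorted(queue, key=priority):
    print(job["name"])   # prints jobC, then jobB, then jobA
```

Here jobC runs first because its user has consumed little time and the job is short, while jobA waits despite being short, because its user has already used a lot of time.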
Conclusions
The goal of this post was to get you to imagine what using a supercomputer would be like, especially if it's something that may never happen in your life. I don't know whether I've achieved a great level of immersion, but I hope it at least gives you a rough idea of what this world is like. Next time people talk to you about supercomputing, you can reply: "Hey, don't think you're going to impress me with that. I know what you're talking about!"
So… Remember the test environment setup I was working on last time? I had to drop it last week. The PCOCC installation took me some time and a lot of patience, but it did run through at the end. Unfortunately, I was already a bit behind my schedule when that happened, and when I couldn’t start up a cluster of virtual machines with it within a few days, it definitely threw me out of my time window. Time to find a new approach…
As frustrating as dropping everything I had worked on for 3 weeks might be, these things can happen. It's holiday season, a lot of people are out of the office, and everyone who stays behind is busy because they have to deal with the work left by those on leave. In my case, there were only a few colleagues with the technical knowledge of PCOCC and all the tools beneath it required for my setup, the documentation linked by PCOCC is a bit outdated, and it doesn't seem to have a big user base (no really, when you google "PCOCC", the fifth result is my project description). I also didn't have access to the SURFsara documentation about their PCOCC setup, and only had a regular user account on Cartesius, so I couldn't look at all the configuration details there either. And that is how I came to spend half a day doing git-diffs on different versions of SLURM to find out which would be compatible with a plugin I needed. In the end, due to the narrow time frame of this project, it didn't make sense to keep working on the test environment for more than half of my time here.
Luckily, my mentor returned from holiday last Wednesday so we could talk it through and he proposed I instead set up a virtual machine with KVM, the Linux hypervisor, and libvirt, a virtualization API, and add an encrypted disk to it. This will act as a proof of concept for HPC with encrypted disks and as a basis for some benchmarking. If you’re interested in some technical details, I’ll describe those in another blog post!
Of course not, but I have input for benzene and I did the first tests of the electron density computation on this system. Before showing the results (Pretty figures!), I would like to tell you about the helical symmetry (something that makes the computation faster) and give some insight into the nanotube code.
Helical symmetry of nanotubes
Tyger Tyger, burning bright, In the forests of the night; What immortal hand or eye, Could frame thy fearful symmetry?
William Blake: The tyger
Symmetry is something that is pleasing to the human eye in general, and it has an important role in chemistry as well. The symmetry of molecules is usually described by point groups that contain geometric symmetry operations (mirror plane, rotational axis, inversion center). For example, the water molecule belongs to the C2v point group: it has a mirror plane in the plane of the atoms, another one at the bisector of the bond angle, a twofold rotational axis, and the identity operator. (I shall note that there is another way to describe the symmetry using permutation of the identical nuclei, which is a more general approach.) Point group symmetry can be used to label the quantum states, and it can be incorporated in quantum chemistry programs in a clever way to reduce the cost of the computation.
In my previous post I wrote about how to “make” a nanotube by rolling up a two-dimensional periodic layer of material. The resulting nanotube is periodic along its axis; it has one-dimensional translational symmetry. In the case of a carbon nanotube with R=(4,4) rolling vector the translational unit cell contains 16 carbon atoms (see figure, yellow frame). However, it is better to exploit the helical and translational symmetry (pseudo two-dimensional approach), where the symmetry operation is a rotation followed by a translational step. In this case it is sufficient to use a much smaller unit cell (only two atoms) for the R=(4,4) carbon nanotube. The small unit cell is beneficial because it makes the computation cheaper. The figure below shows the atoms of the central unit cell in red, and the green atoms are generated by applying the rotational-translational operation twice in both directions.
Helical symmetry of the R=(4,4) carbon nanotube
What does the code do?
So, we have a program that can compute the electronic structure of nanotubes (the electronic orbitals and their energies) for a given nuclear configuration. I would like to tell you how it works, but I cannot expect you to have years of quantum chemistry study and I don’t want to give you meaningless formulas. Those who are familiar with the topic can find the details in J. Comput. Chem. 20(2):253-261 and here.
Basis set: Contracted Gaussian basis. The wave function is the linear combination of some basis functions. From the physical point of view the best choice would be using atomic orbitals (s-, p-, d-orbitals), but for a computational reason it is better to use (a linear combination of) Gaussian functions that resemble the atomic orbitals. The more basis functions we use, the better the result will be but the computational cost increases as well.
Level of theory: Hartree-Fock. This is one of the simplest ways to describe the electronic structure. The so-called Fock operator gives us the energy of the electronic orbitals. It is a sum of two terms. The first contains the kinetic energy of the electron, the Coulomb interaction between the electrons and the nuclei, and the internuclear Coulombic repulsion. The second term contains the electron-electron repulsion. The Fock operator is represented by a matrix in the code, and the number of basis functions determines the size of the matrix.
To get the basis function coefficients and the energy we have to compute the eigenvalues of the Fock matrix. To create the matrix we already have to know the electron distribution, so we use an iterative scheme (self consistent field method): Starting from an initial guess, we compute the electronic orbitals, rebuild the Fock matrix with the new orbitals, and diagonalize it again, and so on. The orbital energy computed with an approximate wave function is always greater than the true value, so the energy should decrease during the iteration as the approximation gets better. We continue the iteration until the orbital energies converge to a minimum value. This is how it is done in a general quantum chemistry program, but in our nanotube program it is different, as you will see in the next section.
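To make the iterative idea concrete, here is a toy sketch in Python. The 2×2 “Fock” matrix and its density-dependent term are invented for illustration and have nothing to do with a real molecule; a real SCF code builds the Fock matrix from actual integrals over the basis functions:

```python
import numpy as np

# Toy self-consistent field (SCF) loop: build a matrix that depends on the
# current orbitals, diagonalize it, rebuild, and repeat until the lowest
# orbital energy stops changing. All numbers here are made up.
H_core = np.array([[-2.0, -1.0],
                   [-1.0, -1.5]])

def toy_fock(C):
    D = np.outer(C[:, 0], C[:, 0])   # "density matrix" of the occupied orbital
    return H_core + 0.5 * D          # core part plus a density-dependent part

C = np.eye(2)                        # initial guess for the orbital coefficients
E_old = np.inf
for it in range(100):
    F = toy_fock(C)
    energies, C = np.linalg.eigh(F)  # diagonalize the Fock matrix
    if abs(energies[0] - E_old) < 1e-10:
        break                        # converged: the energy no longer changes
    E_old = energies[0]
```

The structure (guess, build, diagonalize, compare, repeat) is the same as in a real program; only the matrices are bigger and the convergence criteria more careful.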
Transformation to the reciprocal space. The translational periodicity allows us to transform the basis functions from the real space to the reciprocal space (Bloch-orbitals). This way the Fock-matrix will be block diagonal. We diagonalize each block separately, which is a lot cheaper than diagonalizing the whole matrix. This part is parallelized with MPI (Message Passing Interface), each block is diagonalized by a different process.
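A small numpy demonstration of why this is allowed: the eigenvalues of a block-diagonal matrix are exactly the eigenvalues of its blocks, so each block can be diagonalized independently (and, as in the nanotube code, by a different process). The random symmetric blocks below merely stand in for the Fock matrix blocks:

```python
import numpy as np

# Two symmetric blocks standing in for blocks of a block-diagonal Fock matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)); A = (A + A.T) / 2
B = rng.normal(size=(2, 2)); B = (B + B.T) / 2

# Assemble the full block-diagonal matrix
full = np.zeros((5, 5))
full[:3, :3] = A
full[3:, 3:] = B

# Diagonalizing each block separately gives the same spectrum as the full matrix
per_block = np.concatenate([np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)])
assert np.allclose(np.sort(per_block), np.linalg.eigvalsh(full))
```

Since diagonalization costs roughly O(n³), many small blocks are much cheaper than one big matrix, and they parallelize trivially.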
The results are orbital energies and basis function coefficients for each block. So far the main focus was on the energies (band structure), but now we will do something with the orbitals as well.
Finally: the electron density
Now let’s get down to business. The electron density describes the probability of finding the electron at a given point in space, and it is computed as the square of the wave function. Now we are not interested in the total electron density, but the contribution of each electron orbital. For the numerical implementation we have to compute the so-called density matrix from the basis coefficients of the given orbital, and then contract it with the basis functions. First, we compute the electron density corresponding to the central unit cell in a large number of grid points. The electron density of the whole nanotube is computed from the contribution of the central unit cell using the helical symmetry operators.
Let’s see an example, the benzene molecule. (I know it’s not a nanotube, but it’s a small and planar system that is easy to visualize). The unit cell consists of a carbon and a hydrogen atom, and we get the benzene replicating and rotating it six times, and we do the same way for the electron density. This particular orbital shown in the picture is made of the pz-orbital of carbon and the s-orbital of hydrogen. On the left electron density figure, you can see the nodal plane (zero electron density) of pz between the two peaks; and the merging of the pz and the s-orbital to a covalent bond.
Linear combination of atomic orbital to molecular orbital
Generating the electron density of benzene from one unit cell (Grid points are in the plane of the molecule)
The next steps
Recently I have been testing the parallel version of the electron density computation to see if it gives the same result as the serial code or not. I hope it does, so “parallelization” is not among the next steps. (However, one can always improve the code). The next challenge is the visualization of the electron density in 3D.
This week, I show how I created animations for collective communications and transferred them to Wee Archie (with videos) and introduce the wave simulator. I’ll also talk about issues with my system on Wee Archie and how this will affect my goals going forwards. Make sure not to miss the cute animal pictures at the end!
Animating Collectives
Since my last post, I’ve created a prototype of every type of MPI communication needed for an outreach demonstration. You can see the collective communications in the video below, run on a Wee Archlet. For details on how I made these animations, and what a Wee Archlet is, see the previous post. These aren’t quite the final animations, but are a good proof of concept which I’ll polish as I create the final interface for the demos.
Collective Communications on Wee Archlet
For full context, a collective communication is a one-to-many or many-to-many communication, vs. the simpler one-to-one communications of last post. Some of the most commonly used collectives are shown above. In one-to-many, a master Pi is needed, one which controls the flow, gives out all the data, or receives all the data. Here, this is always the bottom Pi.
Gather will collect data from every computer in your group and drop it all on one. Reduce will do the same, but apply an operation to the data as it goes – here it sums it up. Broadcast will send data from the root to every computer in the group. Scatter will do the same, but rather than sending everything to all of them, each gets a slice of the data.
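If it helps to see the data movement spelled out, here is a toy picture of the four collectives using plain Python lists in place of real MPI calls; each sublist plays the role of one computer's local data, and the values are invented:

```python
data = [[1, 2], [3, 4], [5, 6], [7, 8]]      # four "computers", two values each

# gather: the root collects everyone's data in one place
gathered = [x for chunk in data for x in chunk]

# reduce: the same collection, but combined with an operation (here, a sum)
reduced = sum(x for chunk in data for x in chunk)

# broadcast: the root's message is copied to every computer
message = [9, 9]
after_broadcast = [list(message) for _ in data]

# scatter: the root's data is cut into slices, one slice per computer
root_data = [10, 11, 12, 13]
after_scatter = [[v] for v in root_data]

print(gathered)   # [1, 2, 3, 4, 5, 6, 7, 8]
print(reduced)    # 36
```

Real MPI does this with messages over the network rather than list comprehensions, but the before/after shape of the data is the same.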
Animating for Production
Once I had finished prototyping my animation server on the Wee Archlet, and created animations for many simple operations, it was time to get it working on Wee Archie. Transferring my animations to render on Wee Archie was easy in theory… It ended up taking less time than I thought it was going to, but there were still many complications that needed to be ironed out! Below you can see one of my first demonstrations on Wee Archie.
How would you broadcast a message from one computer to every other one on a network? Would you just get it to send the message to them all, one by one, or try to send to them all at once? Neither is perfect – what if the message is large, or the network is slow to start a connection on? A different approach to sending these messages is shown below, which is the best way to make full use of a network where all the computers are connected to each other, and where the actual processing of the message takes a while compared to the sending.
Broadcasting Explanation on Wee Archie
Here, you can see the broadcast as shown earlier, but broken down so you see how it happens inside MPI (a simplified version, anyway). Here, one of the central Pis will send a message to every other one. There are 16 of them, so with the implementation I show, you can do it in 4 steps! Assuming all of your Pis are connected to each other, and all connections are as fast as each other, this is a very fast way of getting a message across.
Initially, the central Pi will send the message to a neighbour. This means the neighbour also has the message, so they can both pass it on! When they do this to their neighbours, now 4 have the message. Then they will all pass on the message to another neighbour, and then we have 8. If there were more, this would continue on, but we’re done after the next step, when the 8 send on their message, and all 16 Pis have it!
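The doubling scheme described above can be sketched in a few lines of Python – this simulates the rounds rather than doing any real networking:

```python
# Toy simulation of the tree broadcast: each round, every machine that
# already has the message passes it to one machine that doesn't, so the
# number of copies doubles until everyone has it.
def tree_broadcast(n_machines, root=0):
    has_message = {root}
    steps = 0
    while len(has_message) < n_machines:
        without = [m for m in range(n_machines) if m not in has_message]
        # pair each current holder with one machine still missing the message
        for sender, receiver in zip(sorted(has_message), without):
            has_message.add(receiver)
        steps += 1
    return steps

print(tree_broadcast(16))  # 4 steps: 1 -> 2 -> 4 -> 8 -> 16 machines
```

In general the number of steps grows like log2 of the number of machines, which is why this beats the one-by-one approach as the cluster gets bigger.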
This demonstration will be the final one in a series of tutorials building up to it, illustrating the basics of parallel computing on Wee Archie. They include all the previous demos I’ve shown in some form, with explanations and motivation as you go. They’re mostly written now, and when the first version is complete, they’ll all be posted on the Wee Archie Github repository.
The main issue with the current system is lack of synchronisation. Because I use HTTP requests to queue all my animations, they often end up out of sync – starting a connection takes time. A potential fix for this is setting up a network of pipes using WebSockets, which I’ll investigate soon. The other main issue is playback speed – I need a global control for frame rate on the various animations, as well as the ability to change this as the circumstance needs it. Hopefully making this part of the server doesn’t prove too difficult, as it would improve the feel of the animations a lot!
Introducing the Sea
So, the famed wave simulator, or coastline defence simulation – what is it?
Initially developed by my mentor Gordon Gibb, it’s the most recent demonstration for Wee Archie. It asks the user to place a series of coastal defence barriers – represented here by gray blocks. You have a budget, and each block has a fixed cost. The coastline has several different locations pictured, which are more or less susceptible to being damaged, and damage costs various amounts.
For example, the beachfront is cheap and easy to damage. The cliff-top housing is hard to damage and expensive. The library, expensive and easy to damage. The shopping market is somewhere in-between them all.
Wave Simulator (Current Version)
When you place the barriers and set it running, it will run a fluid simulation, and show the waves hitting the shore, calculating the damage they do, and how effective your barriers are! At the end of it, the amount of money saved/spent is your score. It’s quite simple to use, and very popular at outreach venues such as Edinburgh Science Festival.
All the current demonstrations use a framework where Wee Archie runs a server on the access Pi – the left of the two on top in the video. This server will run the simulations when called, and give the data back to the client, a graphical Python application running on the connected laptop.
Upcoming Goals
So this week I’ll be working on getting the wave demo working using my version of MPI, showing the communication as it happens. I won’t be able to show all of them, as to render the simulation, there are hundreds of operations per second, but I’ll display the main logic of the program once or twice out of those hundreds.
The nice thing about visualising communication is that unlike the data in the program, the way the computers communicate stays the same throughout the execution. It also doesn’t need advanced mathematics to understand. This makes it a good thing to sample and explain in detail!
The next goal after this will be to change the graphical interface from Python to running in the browser. Once I set this up, I’ll be able to host a website on Wee Archie, and all the client computer will have to do is connect! The website will be able to host all the tutorials I write and the demonstrations, and easily include extra details, explanations and resources.
If it goes well and porting over the old demos is fast enough, I anticipate it being the future for Wee Archie. If I have all that working by the end of the project I’ll be very pleased with the progress made, as it will have improved the experience of using Wee Archie enormously. Rather than a collection of independent demos, it can be a cohesive whole which lets students explore at their own pace and learn as they go!
Sightseeing Segment
It’s been a couple of weeks, but it’s been very busy. I haven’t had time to see too much, though last weekend I visited the beautiful Pentland Hills regional park. The swan featured in this post is from the reservoir there in Glencorse. It’s a beautiful park, with many options for different visitors, though I stuck to some trekking through the fields – getting a bit sunburned in the process. I hope to go swimming in the reservoir with the others if the weather holds!
Harlaw Reservoir
Dalry Cemetery
Sheep in Glencorse
Swan in Glencorse Reservoir
Sadly, I am not much better at ping pong than I was previously. I don’t think I was cut out for this line of rapid reflex wrist-flicking work. I’ve managed to take one game off Benjamin, but it was in the face of many! I hear the Fringe has some public tables which are better than the one in our accommodation, so perhaps that will help?
I spent the first few weeks at Computing Centre of the Slovak Academy of Sciences getting familiar with the nanotube code. We actually changed the goal of the project a bit. The original plan was the further development of the MPI parallelization, but now I am working on an extension of the code with a new feature. The task is to compute (and visualize) the electron density of the orbitals that come out of the simulation.
In this blog post, I am giving a short introduction to quantum chemistry in general, and I will tell you about the interesting things I have learnt about nanotubes.
How to make a nanotube in 3 easy steps
Due to their extraordinary properties, nanotubes are in the focus of material science, nanotechnology and electronics as well. Carbon nanotubes are cool things indeed: they are one of the strongest materials, they can behave either like metals or semiconductors, they have good thermal conductivity, and they are the material of Vantablack, too.
You can make a nanotube this way in a thought experiment (the real synthesis is not this, of course):
Step 1: Take a layer of a two-dimensional periodic material, for example, a sheet of graphene. Periodicity means that you can construct the whole material by translating a repetitive unit, the unit cell. In the picture, the unit cell (blue) contains two carbon atoms, and has two unit vectors.
Step 2: Define the rolling vector (purple) by connecting the centers of two unit cells of the graphene sheet. Cut out a rectangle whose one side is the rolling vector.
Step 3: Roll up the nanotube by connecting the two ends of the rolling vector.
Congratulations, now you have a carbon nanotube!
This is how we can imagine making a nanotube from a graphene sheet.
Its material and the rolling vector characterize the nanotube. The rolling vector determines the diameter, the structure (armchair, zigzag or chiral), and the conductivity (metallic or semiconducting).
Nanotubes with different rolling vectors
Quantum chemistry in a nutshell
The potential engineering applications made the nanotubes an important topic of computational science. I am working on a code that computes the electronic structure of the nanotube. But before going into details about the particular code, I would like to tell you about quantum chemistry in general.
Some people who asked me about my field within chemistry were really surprised when I told them that I am not working in a laboratory, but doing computer simulations. They thought that chemistry is only about boiling colorful liquids in flasks and experimenting with explosives. But that has not been true, at least for the last few decades, the age of computational quantum chemistry. Applying quantum mechanics to molecular systems helps us explain many experimental phenomena: Why can matter absorb light only at certain wavelengths? Why can two atoms form a bond? What is the mechanism of a chemical reaction on the molecular level?
The first step to answer these questions is to solve an eigenvalue-equation. In quantum mechanics, we always solve eigenvalue-equations, because physical quantities are described by operators whose eigenvalues are the possible values of the physical quantity, and the quantum states are described by the eigenvectors. The Hamiltonian operator is the operator of the energy, and its eigenvalue-equation is the famous Schrödinger-equation.
Unfortunately, the Schrödinger-equation can be solved analytically only for simple (model) systems, for more complicated cases we do it numerically using computers. How is it done in practice? The wave function will be the linear combination of some basis functions and the Hamiltonian is represented by a matrix. Then, we diagonalize the matrix to get the basis expansion coefficients and the energy.
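In practice, "diagonalize the matrix" is a single library call. As a sketch, here is a simple Hückel-style model Hamiltonian for a ring of six sites, loosely inspired by a benzene-like pi-system; the parameter values are illustrative, not fitted to anything:

```python
import numpy as np

n = 6
alpha, beta = -1.0, -0.5          # on-site energy and hopping (made-up values)
H = np.zeros((n, n))
for i in range(n):
    H[i, i] = alpha               # diagonal: energy of an electron on site i
    H[i, (i + 1) % n] = beta      # off-diagonal: coupling to the next site
    H[(i + 1) % n, i] = beta      # keep the matrix symmetric (ring closure too)

# eigenvalues = possible energies, eigenvectors = basis expansion coefficients
energies, coefficients = np.linalg.eigh(H)
print(energies)                   # six orbital energies, lowest first
```

The pattern is exactly the one described above: represent the Hamiltonian as a matrix in some basis, then diagonalize it to get the energies and the expansion coefficients at once.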
Goal: Electron density of the nanotube
So, we have the nanotube code that computes the energy of the electron orbitals. My task is to construct the orbitals and the electron density from the basis function coefficients. Electron density tells us the probability of finding the electron at a given point. This way we can visualize the chemical bonds and the nodes of the orbitals. Now, I am working on the serial code and testing it on simple systems like benzene, and I get plots like the one below.
Testing the code: Electron density for one orbital of benzene. White areas indicate high values.
This is enough for now maybe, I will write about the nanotube code and how we do the electron density computation (and hopefully some results) in the next blog post.
Hello from Luxembourg. If you read my previous post, you may be aware that this is where I’ll be spending my summer as part of the PRACE Summer of HPC programme. The official title for the project I’ll be doing here (at the University of Luxembourg to be precise) is “Performance analysis of Distributed and Scalable Deep Learning,” and I’ll spend this blog post trying to give some sort of explanation as to what that actually involves. However, I’m not going to start by explaining what Deep Learning is. A lot of other people, such as the one behind this article, who know more than me about it have already done an excellent job of it and I don’t want this post to be crazily long.
The Scalable and Distributed part, however, is slightly shorter to explain. As many common models in Deep Learning contain hundreds of thousands, if not millions, of parameters which somehow have to be ‘learned’, it is becoming more common to spread this task over a number of processors. A simple and widely used example of this is splitting up the ‘batch’ of data points needed for each training step over a number of processors, then averaging the relevant results over all processors to work out the change needed in each trainable parameter at the end. That last part can cause problems, however, as, like every other area of HPC, synchronization is expensive and should be avoided if possible. To make things more complicated, a lot of deep learning calculations are very well suited to running on GPUs instead of CPUs, which may add more layers of communication between different devices to the problem.
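As a toy sketch of that batch-splitting idea, with plain numpy standing in for a real framework and a linear model invented for the example:

```python
import numpy as np

# Gradient of a mean squared error loss for a linear model: the "result"
# each worker computes on its shard of the batch.
def gradient(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))              # one "batch" of 32 data points
y = rng.normal(size=32)
w = np.zeros(4)                           # current trainable parameters

shards = np.array_split(np.arange(32), 4) # split the batch over 4 "workers"
local_grads = [gradient(w, X[s], y[s]) for s in shards]
averaged = np.mean(local_grads, axis=0)   # the synchronization ("allreduce") step

# with equal shard sizes this matches the single-device gradient exactly
assert np.allclose(averaged, gradient(w, X, y))
```

The averaging line is the expensive part on a real cluster: every worker must stop and exchange its gradients before anyone can take the next step, which is exactly the synchronization cost mentioned above.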
Despite being a fairly computationally expensive task Deep Learning seems to be quite a popular way of solving problems these days. As a result, several organisations have come up with and published their own, allegedly highly optimized, frameworks to allow programmers to use these techniques with minimal effort. Examples include Tensorflow, Keras, MXNet, PyTorch and Horovod (see logos above). As these libraries all use their own clever tricks to make the code run faster, it would be nice to have a way of working out which is the most suitable for your needs, especially if your needs involve lots of time on an expensive supercomputer.
Myself and Matteo (also doing a project in Luxembourg) outside the university
This takes us to the surprisingly complicated world of Deep Learning benchmarking. It’s not entirely obvious how to evaluate how efficiently a deep learning program is working. If you want your code to run on a large number of processors, the sensible thing to do is to tweak some hyperparameters so you don’t have to send the vast numbers of parameters mentioned in the previous paragraph between devices very often. However, while this can make the average walltime per data point scale very well, there’s no guarantee the rate at which the model ‘learns’ will improve as quickly. As a result, multiple organisations have come up with a variety of ways of benchmarking these algorithms. These include MLPerf, which publishes “time-to-accuracy” results for some very specific tasks on a variety of types of hardware; DeepBench, which evaluates the accuracy and speed of common basic functions used in Deep Learning across many frameworks; and Deep500, which can give a wide range of metrics for a variety of tasks, including user-defined ones, for PyTorch and Tensorflow. There are even plans to expand some of these to cover how efficiently the programs in question use resources like CPU/GPU power, memory, and energy.
My project for the summer is to set up a program which will allow me to run experiments comparing the efficiency of different frameworks on the University’s Iris cluster (see picture) and to help cluster users choose the most appropriate setup for their task. Ideally, the final product will allow you to submit a basic single-device version of a model in one of a few reference frameworks, which would then be trained in a distributed manner for a short time, with some metrics of the kind described above being output at the end. So far, I’m in the final stages of getting a basic version up and running which can profile runs of image classification problems in Tensorflow, distributed using Horovod. The next few weeks, assuming nothing goes horribly wrong, will be spent adding support for more frameworks, collecting more complicated metrics, running experiments for sample networks and datasets, and making everything a bit more user friendly. Hopefully, the final version will help users ensure that an appropriate framework/architecture is being used and identify any performance bottlenecks before submitting massive individual jobs.
The University’s HPC group, Matteo and I at our office
I’m doing this work from a town called Belval, near the border with France. While the university campus where I work is the main place of interest here, it’s only a short train ride from Luxembourg City, with various towns in nearby France, Germany and Belgium looking like good potential day trips. From my visits so far, the most notable feature of the city centre is the relatively large number of steep hills and even cliffs (one of which requires a lift to get down). This makes walking around the place a bit slow, but at least it means I’m getting some exercise. The one low point of my time here was when the heat got a bit excessive in the last week. Irish people are not meant to have to endure anything over 35°C, and I’m not unhappy that I probably won’t have to again this summer. However, there was a great sense of camaraderie in the office, where the various fans and other elaborate damage-limitation mechanisms people had set up were struggling to cope. I imagine it would have been a lot less tolerable if the people around me weren’t so friendly, helpful and welcoming.
Buon giorno! I am currently sitting on a plane headed for Amsterdam where I will begin my HPC project, and I thought I would take this opportunity to compose my next blog update.
We have officially finished our orientation week in Bologna! This past week was full of so much learning. With classes from 9 to 6, Monday to Friday, we managed to cover a lot of content: principles of HPC, Parallel Processing, OpenMP, MPI syntax in C (note: I am not a C programmer), CUDA, remote visualization with Paraview, Blender, CINECA’s Galileo and Marconi clusters, and more…
Hanging out with Marconi
All of the above were essentially foreign to me before this week. The SoHPC coordinators have taken great effort to ease even the least computer science-y of us into the realm of HPC in an engaging way.
My favorite part of this past week has been getting to know my fellow interns. We have since dispersed to different corners of Europe: eleven HPC sites across eleven different European countries host the students for the remainder of the summer, where we are each allocated to our own unique project. But for this first week, we all had the opportunity to get to know one another, help each other out with classwork, enjoy the Italian heat, and bond over (too much) pizza, pasta and gelato!
The SoHPC 2019 crew
If this first week was any indication of what’s to come, I’m up for a fantastic Summer of HPC!
Stay tuned for my next update, coming at you from Amsterdam 🙂
Applications are open from 18th of January 2019. See the Timeline for more details.
PRACE Summer of HPC programme is announcing projects for 2019 for preview and comments by students. Please send questions to coordinators directly by the 15th of January. Clarifications will be posted near the projects in question or in FAQ.
About the Summer of HPC program:
Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 20 top applicants from across Europe will be selected to participate. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. The programme will run from July 1st to August 31st.
For more information, check out our About page and the FAQ!
Ready to apply? Click here! (Note, not available until January 15, 2019)
As I said in my first blog post, I come from a city called “little Amsterdam” because of its bikes. It was therefore obvious that I would buy a bike here, and I wanted the cheapest bike possible. You can buy some old rusty bike for 50–80 euros from someone. But there came the first shock – I had been thinking of Amsterdam as a city where you could leave your bike unlocked and it would still be there two months later. I was totally wrong. Bike theft is a huge problem here, and if you buy a stolen bike, you will have a serious problem yourself. For that reason, I went to a bike shop looking for a second-hand bike. Damn, that is much more expensive. But I was lucky and found one for 100 euros – “the cheapest but one of the ugliest bikes”, as the salesman said. Besides the looks, the salesman had to use a hammer to adjust the seat, but it is a bike – my bike now. From that day on, I never used public transportation in Amsterdam.
On the first day, I discovered the motto of Dutch people: “Why would you sit in a traffic jam in your car when you can sit in a traffic jam on your bike?” Most people here go to work by bike, and you can go everywhere by bike. Amsterdam is owned by cyclists. If you walk on the cycling path they won’t stop and you will be run over. (Not really – however, I am not going to test that.) They even have separate traffic lights for cyclists everywhere and a lot of underground garages for bikes.
It took me two or three days to adapt to their rules, because nobody respects the official ones here. Is the light red? No problem. You have right of way and somebody is crossing your path? No problem – he just accelerates to be faster than you. Once you adapt to these unwritten rules, you will have a nice time on a bike here.
The opposite story is the city center, where most of the cyclists (actually most of the people) are tourists. They don’t know how to ride a bike here, and the pedestrians there do not respect cyclists either. That’s why I parked my bike in the parking boat every time I went to the city center. Yes, park your bike in a place where it is supposed to be, otherwise it is quite possible that your bike will be towed away.
I cycled to work every day – 10 km from where I was staying. If I had used public transportation instead, it would have taken me the same amount of time and I would have had to pay for it. So it is also a free work-out. If only other cities were the same.
I would like to start with a joke: A cat walks into a bar… and doesn’t.
That is Schrödinger’s cat, which shows us how bizarre the quantum world is. Some people understand Schrödinger’s cat experiment to mean that you cannot know whether the cat is dead or alive until you open the box. But in reality, the experiment says that the cat is dead and alive at the same time, and opening the box is what defines the state – alive or dead. So you can actually kill the cat by opening the box. But how can something be dead and alive at the same time? And how can the state change just by looking at it? That was exactly Schrödinger’s point: the quantum world is completely different from our everyday view of the world. When a quantum particle is in many states at the same time, it is said to be in a superposition of these states.
Now let’s introduce quantum probabilities. To do this, we can use our everyday quantum measurement device – polarized sunglasses. If a photon reaches the polarized filter of the sunglasses and it is polarized on the same axis as the filter, it has a 100% chance of going through. Conversely, if the polarization of the photon is at 90 degrees to the filter, it has a 0% chance of going through. And finally, if the angle is 45 degrees, it has a 50% chance of going through. So out of 1000 photons, around 500 stop and around 500 go through. The quantum weirdness is that we cannot know before the measurement which photon will go through and which will stop – we only know the probabilities. And of course, the measurement changes the polarization of the photons. This quantum non-determinism worried physicists for decades; Einstein commented on it with the famous quote “God does not play dice”. He and many other physicists thought that there must be some “hidden variable” which we either have to find or, even if we cannot find it, still has to be there. But Bell’s inequality experiments showed us that there is no such hidden variable.
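The polarizer probabilities above follow Malus’s law: the chance of passing is cos²θ, where θ is the angle between the photon’s polarization and the filter. Here is a short simulation of the sunglasses experiment with that law put in by hand – the function names and photon counts are just for illustration:

```python
import math
import random

def pass_probability(angle_deg):
    """Malus's law: a photon passes a polarizer at angle_deg
    to its own polarization with probability cos^2(angle)."""
    return math.cos(math.radians(angle_deg)) ** 2

def send_photons(n, angle_deg, rng):
    """Simulate n photons; each passes independently with the Malus probability."""
    p = pass_probability(angle_deg)
    return sum(1 for _ in range(n) if rng.random() < p)

print(pass_probability(0))    # 1.0 -> aligned photons always pass
print(pass_probability(90))   # ~0  -> crossed photons never pass
print(pass_probability(45))   # ~0.5 -> fifty-fifty

rng = random.Random(42)
print(send_photons(1000, 45, rng))  # roughly 500 of the 1000 photons pass
```

Note that the simulation can tell you *how many* photons pass, but, just as in the real experiment, it cannot tell you in advance *which* ones.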
Quantum entanglement. In the 1930s, Albert Einstein was upset about quantum mechanics. He proposed a thought experiment in which, according to the theory, an event at one point in the universe could instantaneously affect another event arbitrarily far away. He called it “spooky action at a distance” because he thought it was absurd: it seems to imply faster-than-light communication – something his theory of relativity ruled out. But nowadays we can actually do this experiment, and what we find is indeed spooky. Imagine two entangled photons, which we measure at a 45-degree angle to their polarization. If we measure them at the same time, in the same direction, we get the same result: both photons stop or both go through. The strange part is that entanglement works over any distance, instantaneously. Measuring one photon instantaneously affects the result of measuring the second photon anywhere in our universe.
The quantum internet and quantum computers are based on these strange principles. That is why they are so different from anything we are used to, and why we can do things with them that we could not do before.
Traveling to foreign countries for longer periods of time is always a great experience, especially because you have a chance to really get to know the culture and to meet new people. Making friends abroad can often be a challenging task; however, it is easy if you play Ultimate Frisbee. If you still believe that playing frisbee means people just randomly throwing a frisbee around a park, you had better check out the rules…. Ultimate, as a self-refereed sport, is based purely on the Spirit of the Game, and because it somehow brings together like-minded people, it is very easy to make friends. All you need to do is literally shout out “Ultimate Frisbee” (… and maybe check the local Frisbee Facebook page).
Edinburgh Ultimate Summer League
My first experience with Frisbee in Scotland was a bit hectic. On a Tuesday evening I managed to join a pick-up team for the Summer League that was starting on Wednesday. So the next day I went to a park I had never heard of before, to play a match with a bunch of people I had never met before. Guess what? Yeah, we lost the game. Maybe also because we could barely recall each other’s names. To fix the problem with the names, and also to replace the calories burned, we went to a pub for a burger and a beer. Over the next 5 weeks of the League we won 2 games, finished 4th (out of 6 teams) and consumed dozens of burgers together.
Team White after the last game of Summer League.
Trainings
Selfie with the delicious hand-made cookies we got after training.
Twice a week there are regular training sessions in Edinburgh: one restricted to women only and the other open to anyone. These take place at the Meadows, a huge open-access park right in the middle of Edinburgh, with a beautiful view of Arthur’s Seat as well. Surprisingly, all these sessions were of great quality and were led by highly experienced players – all for free.
Beach
Definitely the best part of Ultimate in Edinburgh is the beach. I mentioned how beautiful Portobello beach is in my first blog post, and it gets even better when you get to play Frisbee on it. Weather allowing, there is a pick-up game twice a week, and players of all levels are most welcome. As romantic as this sounds, it also has a downside: by the end of the game you have sand everywhere, usually even in your ears, and even the short swim in the North Sea we had after each game does not really help to get rid of it.
Frisbee on Portobello Beach.
My ultimate Ultimate Adventure
After a couple of weeks in Edinburgh, I decided that playing here alone was not adventurous enough any more and set out to explore the UK Ultimate scene a bit more. A frisbee friend of mine helped me get in touch with the captain of the Reading women’s team, and just a week later I took off from Edinburgh and landed at Gatwick to get a lift with Avril (whom I had never met before) to Edenbridge, where the tournament took place. So the treat of this summer for me was actually the flight tickets. No regrets at all – it was amazing. Two days of high-level Ultimate and unbearable heat, two nights in a sleeping bag on the floor of a Scout hut, all with people I had met for the first time in my life. An unforgettable experience!
After two months of intensive work here in the heart of the Scottish pride plains, the time has come to conclude the results and verify the initial expectations.
My project was developed as part of a data processing framework for the Environmental Monitoring Baseline (EMB). The idea is simple: there is a plethora of sensors around the UK that measure various environmental indicators such as groundwater levels, water pH or seismicity. These datasets are publicly available on the website of the British Geological Survey.
In terms of data analysis we are far past manual supervision; therefore, in addition to proper dataset labelling and management, there is a need for a robust processing framework that also aids valid data acquisition. The task is fairly simple once we can identify our key needs, but this early stage of targeting the crucial aspects of our future system should not be underestimated, as it tends to have a greater impact the more advanced the work on the system becomes. Fortunately, environmental sensors, being part of the IoT world, exhibit some particular characteristics that more or less translate into the nature of the data that will be consumed by our processing framework. Firstly, it can be expected that the data stream will be quite intensive and, to a first approximation, continuous – when multiple sensors report every few minutes, the aggregated data flow rate becomes a significant issue. Secondly, we need the analysis to happen in real time. The third requirement is that the data might be of use later, so it needs to be stored persistently. Lastly, we need to run our system on HPC because it is Summer of HPC, that’s why. But seriously, systems like these require a reasonably powerful machine to run.
So, either by extensive search on the web or by experience, we can relatively quickly find the necessary tools to match our requirements. While we might not be able to pinpoint precisely every aspect, there is no need to worry, because we will take advantage of a software engineering paradigm called modularity. In the modular approach we want our software components separated and grouped by functionality, so that we can replace one that does not exactly fit our needs with a more suitable one later in the project’s timeline. It is very much like creating our own tiny Lego® bricks, then grouping them and finally putting them together to form the desired shape.
To conclude the above reflections on how important the initial problem identification is, let’s dig into the software-requirement matching phase. All software components are open source, with the majority of them released under the Apache license.
Combining the first requirement with the third one, we have a demand for a high-velocity, high-throughput persistent storage database. To satisfy this, we can use Apache Cassandra, a wide-column NoSQL distributed database that supports fast writes and reads while maintaining well-defined, consistent behaviour. So we can safely retrieve data while simultaneously writing to the database. Real-time analysis on HPC can be managed using a large-scale data processing engine called Apache Spark, which supports convenient, database-like data handling. Additionally, Spark code, being written in a functional style, naturally supports parallel execution and optimizes for minimal memory reads/writes – this is especially visible when we use the Scala programming language to code in Spark. In order to make our processing architecture HPC-ready, reproducible and easy to set up, we will containerize our software components. This means putting each “module” of our modular system in a separate container. Containers provide a way of placing our application in a separate namespace. What this roughly means is that our containerized applications will not be able to mess with our system environment variables, modify networking settings or, as another example, use restricted disk volumes. As a bonus, containers use few system resources and are designed to be deployed and run quickly, with minimal babysitting. Basically, we put our rambunctious kids (applications) in a tiny little playground (a container) with tall walls (an isolated namespace) which also happens to be inflatable, so we can deploy it anytime (rapid, automated startup).
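To illustrate what “one module per container” can look like in practice, here is a sketch of a Docker Compose file wiring the pieces together. The images, versions and service names are my assumptions for illustration, not the project’s actual configuration:

```yaml
# Illustrative sketch only - images and versions are assumptions.
version: "3"
services:
  cassandra:                # persistent, write-optimised storage (requirement 3)
    image: cassandra:3.11
    volumes:
      - cassandra-data:/var/lib/cassandra   # data survives container restarts
  zookeeper:                # coordination service that Kafka depends on
    image: zookeeper:3.5
  kafka:                    # messaging backbone between the modules
    image: wurstmeister/kafka
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
  spark:                    # real-time processing engine (requirement 2)
    image: bitnami/spark
volumes:
  cassandra-data:
```

Swapping one “brick” for another – say, a different database – then means editing one service block rather than rebuilding the whole system.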
How to build a successful processing system. This useless IKEA-style diagram does not explain it. Idk, just get some containers using basic tools, add some messaging and programming magic and it might work. When it doesn’t – no refunds.
Hold on a minute – we have ended up with a modular architecture, but since everything is written in a different language and uses different standards, how can all those modules communicate with each other? This is our hidden requirement – seamless communication across different services – and to solve it we use Apache Kafka. Kafka is very much like a flexible mailman-contortionist. You tell him to squeeze into a mailbox – done; an oddly shaped cat-sized door? – no sweat; an inverted triangle with negative surface area? – next, please. Whatever you throw at him, he will gladly take and deliver with a smile on his face (UPS® couriers, please learn from that). In programming terms, we will inject a tiny bit of code into each of our applications – remember, those cheeky rascals sit in high-security playgrounds, but we are working on connecting all the playgrounds together so they can transfer toys, knives or whatever else between themselves. This tiny bit of code will be responsible for receiving messages, sending messages via Kafka, or both. Moreover, in order to tame our messy, unrestrained communication, we provide schemas which tell each application what it can send or receive and in what format it has to write or read the data.
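To show what such a schema check might look like, here is a toy, library-free sketch in Python: each topic declares the fields and types it accepts, and every producer or consumer validates messages against it. A real deployment would more likely use something like Avro with a schema registry; the topic name and fields below are invented.

```python
import json

# Toy stand-in for per-topic message schemas (invented topic and fields).
SCHEMAS = {
    "groundwater-level": {"sensor_id": str, "timestamp": int, "level_m": float},
}

def validate(topic, payload):
    """Decode a JSON payload and check it against the topic's schema."""
    schema = SCHEMAS[topic]
    message = json.loads(payload)
    if set(message) != set(schema):
        raise ValueError(f"fields {sorted(message)} do not match schema")
    for field, expected in schema.items():
        if not isinstance(message[field], expected):
            raise TypeError(f"{field} should be of type {expected.__name__}")
    return message

msg = validate("groundwater-level",
               '{"sensor_id": "BGS-042", "timestamp": 1533081600, "level_m": 3.7}')
print(msg["level_m"])  # 3.7
```

With a check like this at both ends, a malformed message is rejected at the playground gate instead of crashing a consumer deep inside the processing pipeline.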
By following the breadcrumbs we have arrived at the final draft of our architecture. Putting it all into a consistent framework is a matter of time, coffee and programming skills, but all in all, we have managed to come up with a decent design that meets our needs. Give yourself a hand – you’ve earned it.
It has got a name already — FåKopp. The name roughly translates to “get a cup” (of coffee obviously, or Scotch, since we are in Edinburgh, right?) and relates to the amount of caffeine required to complete the project.
That’s all folks.
Featured image: a sneak peek of the actual architecture.
In my previous post I discussed job scheduling algorithms; now I will tackle how we are approaching working with them.
In my project, the NEXTGENIO research group (which is part of PRACE) has made an HPC system simulator that allows us to implement different job scheduling algorithms and test their effectiveness without having to use ARCHER (Edinburgh’s supercomputer), which is expensive to use. It is not a finished project yet: there is still some software engineering to go through, output features to be added and tests to be done. It also had to be made compatible with macOS, as it had been developed on Linux – which is what I did in my first two weeks; this just involved some software engineering.
My supervisor wanted a selectable output feature added to the project, choosing between two output formats developed at the Barcelona Supercomputing Centre: OTF2 (Open Trace Format) and Paraver (a browser for performance analysis). This has been implemented and written so that other formats can be added to the project with relative ease.
Simulator design.
Algorithms
Initially, the simulator only had the First-Come-First-Serve algorithm. In this project we strive for better performance, so let’s look at some other algorithms. First off, the score-based priority algorithm, which sorts jobs according to scores incorporating a fair-share scheduling weight that adjusts each score based on the total number of compute nodes requested by the user, the number of jobs they are running, their recent history and the fraction of their jobs completed. Next, we have multi-queue priority, which incorporates numerous queues with different levels of priority, with certain conditions required to be in each queue. Finally, we have backfilling, where we opportunistically run low-priority jobs when insufficient resources are available for high-priority jobs. That’s our list of job scheduling algorithms, which sort the order of the job waiting queue. The scheduling algorithm runs on job events – when a job starts, finishes, aborts or arrives – and if there have been no events in the past 10 seconds it runs anyway.
Task-mapping algorithms determine which compute nodes to run the job on. The goal is to minimise the communication overhead and reduce cross-job interference. Random mapping is considered the worst-case scenario. Round-robin keeps the nodes in an ordered list, and when a job ends its nodes are appended to the list. Over time, the fragmentation of the list becomes significant and the communication overhead drastically increases. For Dual-End, we set a time threshold that groups every job into short or long: short jobs search for unoccupied nodes from one end of the list, long jobs from the other end.
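The Dual-End idea fits in a few lines. Here is a sketch of it – the threshold value, node count and function name are invented, and a real mapper would track much more state:

```python
# Sketch of Dual-End mapping: short jobs take nodes from the front of the
# free list, long jobs from the back, so the two kinds of job fragment
# different regions of the machine. All numbers are invented.

THRESHOLD = 3600  # seconds separating "short" from "long" jobs

def allocate(free_nodes, n_needed, runtime):
    """Remove and return n_needed nodes from the appropriate end of the list."""
    if len(free_nodes) < n_needed:
        return None  # not enough resources; the job must wait
    if runtime <= THRESHOLD:
        taken = free_nodes[:n_needed]        # short job: front of the list
        del free_nodes[:n_needed]
    else:
        taken = free_nodes[-n_needed:]       # long job: back of the list
        del free_nodes[-n_needed:]
    return taken

free = list(range(8))            # nodes 0..7, all free
print(allocate(free, 2, 600))    # short job -> [0, 1]
print(allocate(free, 2, 90000))  # long job  -> [6, 7]
print(free)                      # [2, 3, 4, 5] remain, contiguous in the middle
```

Because the two job classes never interleave, the free list fragments far more slowly than under round-robin.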
Each job has a priority value P(J), with the wait queue ordered by highest priority. The priority is the sum of five different heuristics. Minimal requirement specifies details like the number of nodes. Aging helps avoid job starvation: a job’s score grows with how long it has been waiting, scaled by an age factor of choice. Deadline maximises the number of jobs terminated within their deadline. License heuristics give a higher score to jobs requiring critical resources on nodes, such as NVRAM, which hasn’t been implemented yet. Response minimises the wait time for jobs with the shortest execution times via a boost value, which is the backfilling component of this algorithm. This paper goes into good detail about backfilling.
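To make the aging heuristic concrete, here is a minimal sketch of a score-based queue in which a long-waiting job eventually overtakes a high-scoring newcomer. The weights, the multiplicative age factor and the job data are all invented for illustration, not taken from the simulator:

```python
# Sketch of score-based priority with aging: the aging term grows
# multiplicatively with the number of scheduling passes a job has waited,
# so starved jobs eventually jump the queue. All numbers are invented.

AGE_FACTOR = 1.5  # multiplicative growth of the aging term per pass

def priority(base_score, waits):
    """base_score bundles the other heuristics; aging grows with wait count."""
    return base_score + AGE_FACTOR ** waits

jobs = [
    {"name": "big-sim",  "base": 10.0, "waits": 0},  # high score, just arrived
    {"name": "starving", "base": 2.0,  "waits": 6},  # low score, waited 6 passes
]
ranked = sorted(jobs, key=lambda j: priority(j["base"], j["waits"]), reverse=True)
print([j["name"] for j in ranked])  # ['starving', 'big-sim']
```

After six passes the aging term (1.5⁶ ≈ 11.4) dominates the low base score, which is exactly the starvation-avoidance behaviour described above.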
Free time
I’ve been making good use of my time off. First, Jakub and I went up to explore the Royal Observatory on Braid and Blackford Hill, beside where we work. The temperatures have been so high here that there were fires on top of the hill and fire trucks were needed to put them out. The observatory was closed when we went up, but I think we will revisit and see if we can get to the top.
The Royal Observatory.
Candid photo of me catching some zzzz’s on the hill, taken by Jakub.
Fire trucks on top of the hill, there were bush fires due to the heat.
In my previous post I summed up what it is like to build a Raspberry Pi based “supercomputer”. Since the Raspberry Pi is a versatile device, there are many more fun things one can do with it besides just running programs. One possibility is to connect a small LED panel to it to allow, for example, real-time visualisation of computations.
Hardware
All you need besides the Raspberry Pi is an LED backpack. In my case, for the Raspberry Pi cluster, I was provided with a set of 5 Adafruit Mini 8×8 LED Matrix Backpacks, which can be connected directly to a Raspberry Pi:
Adafruit LED Matrix Backpack connected to a Raspberry Pi.
Unfortunately, the Lego cases I have for the Raspberry Pis are not quite suited to the use of LED lights, so my small supercomputer does not look that cool any more. It has turned into more of a random bunch of cables of different colors.
“Supercomputer” with LED lights.
Software
Programming the LED lights might sound difficult at first but it is actually quite simple. The two main pieces of software one needs are:
The freely available Adafruit Python LED Backpack library. It is a Python library for controlling LED backpack displays on the Raspberry Pi and other similar devices, and it provides instructions for both installation and usage.
Python PIL (or PILLOW) library, more specifically Image and ImageDraw modules.
Programming
To avoid lengthy explanations of the mentioned libraries, let’s have a look at an example straight away. The following piece of code implements one of the most basic programs for the LED lights: a simple print of an image consisting of 8×8 1-bit pixels:
Python code.
So first, with the aid of the Python PIL library, an 8×8 1-bit image is created and the desired pixels are set to nonzero values using the ImageDraw functions draw.line and draw.point. Secondly, the LED display is initialized and the created picture is simply printed on it using the Adafruit Python library functions. As easy as it gets, right? Could you guess from the code what the result will look like?
The result 🙂
The Adafruit library provides a few more simple functions such as:
a function that sets a chosen pixel of the 8×8 matrix directly to either on or off, without the need to create a PIL image
horizontal scrolling function that: “Returns a list of images which appear to scroll from left to right across the input image when displayed on the LED matrix in order.”
a vertical scrolling function that works similarly to the horizontal one, but the image appears to scroll from top to bottom
There are, however, other functions that one might find useful, such as rotation of an image or backwards horizontal/vertical scrolling. Even though these functions are not part of the Adafruit library, one can quite easily implement them on one’s own.
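As an example of how little code such a do-it-yourself function needs, here is a hardware-free sketch of a 90-degree rotation. It represents the 8×8 frame as plain lists of 0/1 values rather than a PIL image, so it runs without the Adafruit library or an LED matrix attached; the function name is my own:

```python
# Rotating an 8x8 1-bit frame clockwise, using plain lists instead of a
# PIL image so it runs without any LED hardware attached.

def rotate90(frame):
    """Rotate an 8x8 frame (list of 8 rows of 8 ints) 90 degrees clockwise."""
    # new[row][col] takes its value from old[7 - col][row]
    return [[frame[7 - col][row] for col in range(8)] for row in range(8)]

# A frame with a single lit pixel in the top-left corner.
frame = [[0] * 8 for _ in range(8)]
frame[0][0] = 1

rotated = rotate90(frame)
print(rotated[0][7])  # 1 -> the lit pixel has moved to the top-right corner
```

The same list-of-rows frame could then be copied pixel by pixel into a PIL image and pushed to the matrix with the Adafruit functions shown earlier.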
Provided with these powerful tools, all the rest is up to the user’s creativity. From my personal experience I would say that programming the LED lights is fun. The best part is the fact that you have a visible result, and you can see it almost immediately.
Didn’t understand? Well, that served its purpose then!
This time I would simply like to boast about the awesomeness of my Sensei here at Cineca, who has been teaching me HPC kung fu – indirectly, through some fascinating and intriguing conversations! Well, it is more me grasping than him actually teaching, but you get my point! His official designation for my project is Site Coordinator, but Sensei sounds much more awesome, just like he actually is!
Hello again, this is my 2nd blog post of my journey as a PRACE Summer of HPC 2018 participant. This time I’ll walk you through some sneak peeks of my work so far, as I have developed, debugged and tested the brand new software called Catalyst, which is a subset of the visualization software ParaView.
What does this do? Well, it shows you images (results) right off the bat, while your simulation is running (in my case a CFD simulation). That means that with this bad boy, you can have your initial results as quickly as possible, without having to wait until the last step of the run.
Just think of this software as another game changer for CFD research and workflows! Before Catalyst, this was not possible. Which is why I get to be the lucky one to test it with a full-scale, use-as-many-cores-as-feasible, supercomputer-level HPC test case – so that researchers, academics, or just another curious nerdy kid, can later easily post-process their setups, making their lives easier. Amazing, right!
As to what exactly I am running: an injector, which is slightly different from the ones used in cars, but it is a model after all! And also the damBreak case of OpenFOAM, which comes with the software by default. Both of them involve two fluid phases, so at the end of our test we basically observe one fluid morph into the other as time proceeds.
Easier said than done – the following elements are involved if I am to test this right.
1) The CAD aspect, (the geometry and mesh should be correct, as is required for any CFD setup)
2) The OpenFOAM aspect: this sets up the geometry, or rather the mesh, to be specific; the solver – think of it as a fancy calculator that solves big equations; and the output details, i.e. how often you want your results to be stored, and so on. All of which, of course, should be set correctly if one wants to avoid “garbage in, garbage out” results!
3) The HPC aspects would involve:
a) The supercomputer, which is the giant tool that actually performs the operations and calculations mentioned above, in parallel.
b) Its access, so that you don’t have to wait in line for your calculation.
c) Your partition, the place where you run your job on a supercomputer, which must have enough resources, like processing power and so on.
d) The installed and supported packages (so that apples are compared with apples and oranges with oranges).
The last aspect is especially important, because that’s what I am here for. When you have an OpenFOAM job to submit to a cluster, the software that works with OpenFOAM, namely ParaView, has been widely tested and appreciated. What has not been tested is the Catalyst side of it – a fairly recent development, and hence the need for this research.
So it is natural to run into an error every now and then, which may come from any of the above-mentioned aspects. And because such testing is being done for the first time, the number, instances, flavour and variety of errors encountered increase even more. Mind you, these aspects in actuality have much more complex dependencies with one another, as well as among themselves, than what is mentioned here.
So the other day, I was a couple of hours in, fixing an error I’d got: “Attribute error: Name not defined”. To give you some context: if one wants to find the vorticities of a flow in this software, a filter is required, which singles out only the calculation needed to compute the vorticities. And it didn’t seem to work, no matter what I tried.
So my Sensei comes in and asks for updates. I brief him on the developments up to that point, and he smiles and tells me that this error was not for me to fix! It turns out the feature didn’t exist in Catalyst’s current edition (as of 10.08.2018): it had to be added with a patch file, the software recompiled for the cluster, and then installed afresh for the feature to work.
This is when I saw the HPC kung fu of my Sensei. He very calmly did all of the above step by step, explaining to me what was going on along the way. A few steps were extremely intricate, because the supercomputer already has many versions of some packages/software installed, which he had to navigate through and adjust accordingly, mentally keeping a tab on where he needed to change a setting and simultaneously foreseeing any dependencies that might bite at a later stage of compilation.
And then, after 2–2.5 hours of tweaking (remember, compiling the whole software again is a lengthy process), the thing was installed – well, the workaround at least. Oh, and did I mention it worked on the first attempt?
Just check out the preliminary results!!
I may be stating the obvious here, but just seeing him do his jam was very thought-provoking. Just by watching him, I may have merely scratched the surface of how much consideration goes into software development, especially when you do it at HPC scale. I had never had a mentor before, let alone a one-on-one mentor. And I must admit, striving toward a goal with a mentor is a whole different ball game – one which you have already won, because you have learned so many things along the way.
Which is why I appreciate so much that I got this wonderful opportunity to work at Cineca on this project. Those simple conversations that are the by-product of his experience and wisdom are simply amazing. I, for one, always look forward to striking up a conversation with my Sensei, just to see what new thing I might pick up in the process!
I perhaps would not have been able to learn so many things if it were not for him. The least I can say is:
Arigato, Sensei!!
To deal with QCD, we take a four-dimensional space-time lattice and let it evolve step by step until it can be considered to be in thermal equilibrium, and then take a snapshot every n-th step or so to get a set of lattices (called a Markov chain) to do measurements on.
So why does this require supercomputing?
Well, to start with, we are not satisfied with just using a small lattice. Nature happens to take place in a continuum, and the introduction of a lattice also introduces errors, which get worse the coarser the grid is. So let’s take something reasonable like a size of 8 for each space dimension and 24 for the time (which is still quite small). That way you already have 12288 points, and on each of those lives a Dirac spinor with another 12 complex-valued entries. Now, to let the lattice evolve, we need to, as always, calculate the inverse of a matrix which contains the interactions between all points. So this is some kind of 147456×147456 monstrosity (called the Dirac matrix), which is thankfully sparse (we only consider nearest-neighbour interactions). Oh, and all of this needs to be done multiple times per evolution step. So supercomputing it is.
But to deal with the above, we still need to introduce some trickery. For example, one may notice that you can distinguish between even and odd lattice sites, like on some strange, four-dimensional chess board. Nearest-neighbour interactions then only couple sites of one colour to sites of the other, which allows you to basically halve the Dirac matrix and deal with even and odd sites separately.
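The chess-board colouring is easy to play with in code. Here is a little Python sketch of my own (nothing from the actual simulation code): it colours an 8³×24 lattice and checks that hopping to any nearest neighbour always lands on the opposite colour, which is exactly the property that lets you split the Dirac matrix into even and odd blocks.

```python
import itertools

# 4D lattice dimensions (space 8^3, time 24, as in the example above)
DIMS = (8, 8, 8, 24)

def parity(site):
    # "checkerboard" colour of a lattice site: even (0) or odd (1)
    return sum(site) % 2

def neighbours(site):
    # nearest neighbours in all 4 directions, with periodic boundaries
    for mu in range(4):
        for step in (+1, -1):
            nb = list(site)
            nb[mu] = (nb[mu] + step) % DIMS[mu]
            yield tuple(nb)
```

Since every lattice extent here is even, even the sites that wrap around the periodic boundary flip their colour, so the even/odd split is clean.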
Also, you do not save the entire Dirac matrix, only the interactions between neighbors. These are described by SU(3) matrices, which are quite similar in handling to the 3D rotation matrices your gaming GPU uses to rotate objects in front of your screen. With the introduction of general purpose GPUs, this probably has become an exclusively historic reason, but it sure helped to get some speed up in the early days.
But we have talked enough now, let’s look at some speedups!
Speedup comparison between a smaller and a medium sized lattice.
As you can see, using parallel code is quite pointless for small lattices. It even gets worse at eight GPUs, since you need a second dimension parallelized to support that many nodes (yes, distributing 24 sites over 8 nodes would still work, but you need an even number of sites on each node). But lo and behold, look at the speedup once we use a reasonably sized lattice. This is how we like it. Not a perfect speedup, of course, but it works well enough and is just waiting to be tuned.
So let’s see how well this baby performs at the end of the summer, and stay tuned for the next update!
Bratislava is a great city. The first thing that caught my attention when my airplane landed at Bratislava airport was the big green landscapes. The first impression of the city was really great and I couldn’t wait to explore it. In the first days I was introduced to the people at the Slovak Academy of Sciences, where I was working on my project. They were willing to help me and answer all my questions, not only those related to the project, and I learned a lot from them during my stay. I also visited the centre of the city, which is beautiful, with huge and impressive buildings. But the best part is that on one of the bridges crossing the Danube river there is a tall construction that people call the UFO, because of its shape, from which you can get a beautiful view over the whole city. Fillip, who is the head of the HPC centre in Bratislava, organized a canoe trip along the Little Danube, a branch of the Danube, for us and everyone at the HPC centre who wanted to participate. We started our journey around 10 am. Canoeing through the forest on a beautiful river was a unique experience for me, because I had never done anything similar. I would describe the feeling as hiking with your hands. It is actually pretty similar, only at the end of the trip your arms are the ones that are worn out rather than your legs. The trip was six hours long and we covered more than 20 enjoyable kilometers. Overall, my experience in Bratislava was very good. I am happy that I got to meet such kind people, and I am looking forward to visiting the city again to see the places that I missed this summer.
Grey cast iron, white cast iron, ductile iron, malleable iron… oh my gosh, so many types of cast iron! What is the difference? This was the question that always used to annoy me when I was studying to become a mechanical engineer. Well, the differences are in the chemical composition and the physical properties. All of the cast iron types tend to be brittle, but one of them is malleable. Yes, you guessed it right: it is malleable iron, which has the ability to flex without breaking. Malleability is a common term in materials science and the manufacturing industry and is defined as a material’s ability to deform under pressure. But can an HPC application be malleable too? This is the question I am tackling within my project in the PRACE Summer of HPC programme.
Polycrystalline structure of malleable iron at 100x magnification (Source: Wikipedia)
I had no idea that the term “malleability” is also used in HPC jargon until I started working on this project. Soon I came to know that, like malleable materials, codes running on supercomputers can also be dynamically hammered into whatever size and shape we want. Often we have a lot of jobs submitted to the supercomputer, but because they can’t fit into the available compute nodes, they keep waiting in the queue for a long time. This reduces the throughput of the cluster, and users also have to wait longer to get their results. But if the already running jobs can be resized dynamically, they can make room for other incoming jobs to fit in the cluster, expanding or shrinking according to the available resources. More throughput and lower turn-around times can reduce the cost of the HPC system. Isn’t it amazing? To look into this problem, let’s take a glance at the basic categories of jobs according to their resize capability.
Five categories of jobs
Rigid: A rigid job can’t be resized at all after its submission. The number of processes can only be specified by the user before its submission. It will not execute with fewer processors and will not make use of any additional processors. Most of the job types are rigid ones.
Moldable: These are more flexible. The number of processes is set at the beginning of execution by the job scheduler (for example, Slurm), and the job initially configures itself to adapt to this number. After it begins execution, the job cannot be reconfigured. It has already conformed to the mold.
Evolving: An evolving job is one that can dynamically change its resource requests during its runtime. The job scheduler then checks whether the requested resources are available and allocates or deallocates nodes according to the request.
Malleable: Now comes the fourth type, the jobs which are malleable. These can adapt to changes in the number of processes during their execution. Note that, in this case, it is the job scheduler that takes the decision to resize the jobs in order to maximize the throughput of the cluster.
Adaptive: This kind of job is the most flexible one. The application can itself take decisions whether to expand or shrink, or it can be hammered by the job scheduler according to the status of the available resources and queued jobs.
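To make the malleable category more concrete, here is a toy scheduler sketch (entirely made up for illustration; Slurm and the DMR API are far more sophisticated) in which the scheduler shrinks a running malleable job towards its minimum size to make room for a queued job:

```python
# Toy cluster: if a queued job does not fit, the scheduler shrinks running
# malleable jobs down towards their minimum size to free nodes for it.

TOTAL_NODES = 16

class Job:
    def __init__(self, name, nodes, min_nodes=None, malleable=False):
        self.name, self.nodes, self.malleable = name, nodes, malleable
        self.min_nodes = min_nodes if min_nodes is not None else nodes

def free_nodes(running):
    return TOTAL_NODES - sum(j.nodes for j in running)

def try_schedule(running, job):
    if free_nodes(running) >= job.nodes:
        running.append(job)
        return True
    # it is the scheduler (not the job) that decides to resize: "malleable"
    for r in running:
        if r.malleable and r.nodes > r.min_nodes:
            reclaim = min(r.nodes - r.min_nodes,
                          job.nodes - free_nodes(running))
            r.nodes -= reclaim
        if free_nodes(running) >= job.nodes:
            running.append(job)
            return True
    return False
```

With only rigid jobs the queued job would simply have to wait; with a malleable job in the mix, the scheduler reclaims nodes and the cluster throughput goes up.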
Malleability seems conceptually very good, but does the scheme work in a real scenario? To answer this question, I needed a production code to test it on. So my mentor decided to use LAMMPS to test malleability. LAMMPS stands for Large-scale Atomic/Molecular Massively Parallel Simulator and is maintained by Sandia National Laboratories. It is a classical molecular dynamics code which is used throughout the scientific community.
A dynamic reconfiguration system relies on two main components: (1) a parallel runtime and (2) a resource management system (RMS) capable of reallocating the resources assigned to a job. In my project I am using the Slurm Workload Manager as the RMS, which is open-source, portable and highly scalable. Although malleability can be implemented with just the RMS talking to the MPI application through MPI’s spawning functionality, that requires considerable effort to manage the whole set of data transfers among processes in different communicators. So it was better to use a library called the DMR API [1], which combines MPI, OmpSs and Slurm, and which substantially enhanced my productivity in making the LAMMPS code malleable. Thanks to my mentor for developing such a wonderful library!
The most time-consuming part of my project was understanding how LAMMPS works. It has tens of thousands of lines of code, with good online documentation. But once understood, it was not very difficult to implement malleability. In fact, it only needed a few hundred lines of extra code using the DMR API.
I can show you a basic rendering of malleable LAMMPS output. The colour represents the processor on which the particles reside. The job was set to expand dynamically after every 50 iterations of the Verlet integration run. The force model was Lennard-Jones, and the rendering was done using Ovito.
I am now in the end phase of my project, testing the code and getting some useful results. In the next blog post, I will show some performance charts of malleability with LAMMPS. I leave you here with a question, and I want you to think about your HPC application – is your code malleable?
References:
[1] Sergio Iserte, Rafael Mayo, Enrique S. Quintana-Ortí, Vicenç Beltran , Antonio J. Peña, DMR API: Improving cluster productivity by turning applications into malleable, Elsevier 2018.
[2] D.G. Feitelson, L. Rudolph, Toward convergence in job schedulers for parallel supercomputers, in: Job Scheduling Strategies for Parallel Processing, 1162/1996, 1996, pp. 1–26.
What is an HPC system? High-performance computing (HPC) is the use of parallel processing to run advanced application programs efficiently, reliably and quickly. Typical users are scientific researchers, engineers and data analysts.
In the race towards Exascale supercomputing systems, there are significant difficulties that limit the efficiency of a system. Above all, the end of Dennard scaling means that power and energy consumption are beginning to limit the peak performance and cost-effectiveness of supercomputers.
Researchers have presented a new methodology based on a set of hardware and software extensions for fine-grained monitoring of power, with aggregation for rapid analysis and visualization. To measure and control power and performance, a turnkey system is recommended that uses an MQTT communication layer, a NoSQL database, fine-grained monitoring and, in the future, AI technology. This methodology has been demonstrated as an integrated feature of the D.A.V.I.D.E. supercomputer.
D.A.V.I.D.E. consists of 45 efficiently connected nodes, with an InfiniBand EDR 100 Gbit/s network, a total peak performance of 990 TFLOPS, and an estimated power consumption of less than 2 kW per node. Each node is a 2 Open Unit (OU) Open Compute Project (OCP) form factor and hosts two IBM POWER8 processors with NVIDIA NVLink and four Tesla P100 data centre GPUs, with an internal node communication scheme optimized for best performance. It stands at #440 of the TOP500 list and #18 of the GREEN500 list of November 2017.
Energy efficiency is one of the most pressing problems in the management of HPC centres. This problem involves many technological issues, such as web visualization, interaction with HPC systems and their schedulers, big data analysis, virtual machine manipulation, and authentication protocols. Data analysis and web visualization can help HPC system administrators to optimize the energy consumption and performance of the machines, and to avoid unexpected malfunctions and anomalies.
Some example graphs in Grafana
Grafana is one of the tools used for data analysis and web visualization. It is an open platform for beautiful analytics and monitoring, with a general-purpose dashboard and graph composer. Grafana supports many different storage backends for time series data (Data Sources). Each Data Source has a specific Query Editor that is customized for the features and capabilities that the particular Data Source exposes. Although the use of Grafana may seem a bit complicated at first, you will see that the documentation simply describes everything you’ll need. The link for you is: http://docs.grafana.org/
How can I use Grafana? D.A.V.I.D.E. has two databases: CassandraDB and KairosDB. KairosDB is set up as a data source for Grafana. Basically, I can connect to the Grafana server and create and manage my dashboards from my script. At first I managed my dashboards dynamically, but I also need to manage them statically, i.e. without a live data source; to visualize the results of more data, it is possible to use Grafana snapshots.
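For the scripting part, Grafana exposes an HTTP API for creating and updating dashboards. Below is a minimal sketch of how such a script can look (my own illustration, not the project code: the server URL, the API key and the “KairosDB” data source name are placeholders you would replace with your own setup). It builds a tiny dashboard JSON and posts it to the standard /api/dashboards/db endpoint using a Bearer token:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"   # assumed address of the Grafana server
API_KEY = "REPLACE_WITH_API_KEY"        # an API key created in the Grafana UI

def build_dashboard(title, metric):
    """Build a minimal dashboard JSON with one graph panel."""
    return {
        "dashboard": {
            "id": None,                  # None -> create a new dashboard
            "title": title,
            "panels": [{
                "type": "graph",
                "title": metric,
                "datasource": "KairosDB",   # placeholder data source name
                "gridPos": {"x": 0, "y": 0, "w": 24, "h": 8},
            }],
        },
        "overwrite": True,               # replace a dashboard of the same title
    }

def push_dashboard(payload):
    """POST the dashboard to Grafana's HTTP API."""
    req = urllib.request.Request(
        GRAFANA_URL + "/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling push_dashboard(build_dashboard("Energy load", "System Power")) would then create or overwrite the dashboard on the server.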
Visualization of fan power and system power for node davide43, using a line graph
Visualization of the total power of each metric separately, for nodes running a job, using a pie chart
This Friday the JSC’s summer students arrived, which meant that I was able to tag along for their introductory tour of the Forschungszentrum Jülich. We cycled around the campus with our guide, Irina Tihaa, a PhD student studying “neuroelectronics”. I wanted to hear a bit more about her field, so she kindly agreed to sit down with me after the tour for a small interview! So what is her group actually studying?
PhD student Irina Tihaa and me after our tour.
“In our institute we are mainly doing electronics for the brain. This is mostly fundamental research; we look at which kinds of materials and electronics we can use, how can we listen to the brain and understand the signals it is sending, and how can we communicate with the brain, via these electronics. First of all, we need to work on the different electronics, but we also need to understand the language of the brain. I’m working mostly on the latter.”
And what kind “medium” is this communication done across? What kind of data is involved?
“A typical example would be that we take a chip, with lots of electrical channels. We cultivate our cells on top. Neuronal cells communicate via electrical signals, so then we can measure the voltage difference across our chip. I also have a lot of optical data. You can introduce fluorescent sensors into the sample, which are activated by light and also emit light. From this you get a lot of video data, because you’re measuring the difference of the intensity.”
Quite an inter-disciplinary field! I asked Irina about the kinds of scientists she works with during a typical week.
“In my institute we are mostly physicists actually, because we’re a bio-physics institute located in a physics department. We have some biologists, we have some nanotechnologists, we have quite a lot of chemists… Bio-medical engineers… Some medics also, and electrical engineers of course. And then software developers! Almost all of our electronics is institute-made, which means we need software developers and a lot of electrical engineers to construct it. When it comes to analysing the data, because neuronal communication is quite complex, we need complex algorithms. I don’t have the background for this, so I also talk to people from the computational neuroscience institute, who have the expertise to create these algorithms for analysing neuronal signals.”
The research center’s fire brigade at our guest house after someone’s cooking was slightly distracted…
And these kinds of interactions are not at all uncommon at the FZJ. During our tour, every stop reminded me why research facilities with different institutes and people from different fields are so important. Nuclear research means that there is a fire station and highly trained guard dogs on campus. Those dogs also help us forgetful scientists by tracking lost keys when needed! Medical engineers are helping the biology department to do MRI scans of live plants. And as it turns out, Irina’s institute, like mine, also has a graphene connection! Some of the chips she and her team have been building have used graphene. As a carbon-based material, she told me it is chemically quite suitable for our carbon-based bodies. But how does one start studying a field like “neuroelectric interfaces”? It’s hardly something that you come across during your high school or bachelor studies. I was very curious to hear about Irina’s path to Jülich.
“Originally I’m from Kazakhstan, but I’ve been living here in Germany for a long time, over 20 years. I was always impressed by the natural sciences; physics, chemistry, biology, mathematics… But the brain kind of impressed me the most, because at some point I realised that even small changes can affect behaviour so much! Like if one part of the brain is growing faster than another during puberty for instance, this can lead to aggressive behaviour. Then it’s changing back, because the other parts are taking back control. This really fascinated me, so I decided to study biology, with a focus on neurobiology. I still had a lot of interest in the other sciences, like modelling, simulations, and physics, so when I met my professor and saw this institute, I realised it was perfect, because there is everything: chemistry, biology, nanotechnology (with physics inside), and engineers working on the electronics. I can really introduce my point of view there.”
And of course, data analysis and simulations require computers. But not quite on the HPC level yet!
“Up to now we are working more on the smaller level. The simulations that we’re running take a few days at most on a very good “normal” computer. But this is just the beginning. Right now, we’ve been mimicking brain parts on a chip, so in 2D. But we want to of course go 3D, since the brain is three-dimensional! Maybe you have heard of “organoids”, small structures that are similar to the organs we have? And they’re in three dimensions. As you get data from three dimensions, simulations of course get a lot bigger. So at some point also the super-computer will probably be involved.”
I ended by asking Irina if she had any “words of wisdom” for young people considering the sciences.
“I would encourage it, because first of all it is the future! Earlier, we had experiments and we had theory. These were the two pillars of research. But nowadays we have this third new pillar, simulation science, which is getting more and more important, and I think it will play a big role in the future!”
Welcome back everyone! Finally, we are ready to talk about the project. Recalling my previous post: when doing lattice QCD simulations, we need to let the quarks “evolve”. More precisely, we need to compute the propagators of the quarks from one point of the grid to another. This is done by solving the linear system Dx = b.
Let’s explain what each element means: D is the Dirac matrix, which contains the information about the interaction between the quarks and the gluons; b is the source, which tells us where we put the quark at the beginning; and x is the propagator, what we want to compute.
Naively, one would be tempted to solve this problem by just computing x = D⁻¹b. But if I tell you that the Dirac matrix is usually a 10⁸ × 10⁸ matrix with a lot of elements equal to zero (a sparse matrix), you will see that computing the inverse is not an easy step (it is practically impossible).
There are multiple methods that can handle this type of matrix (BiCGStab, GMRES, …), but the one that we are interested in is the multigrid method.
With this method, as its name says, we create multiple grids: the original one (level 0), and then coarser copies (level 1, 2…). If we want to move between different levels, we use two operators: the prolongator (from coarse to fine) and the restrictor (from fine to coarse). For example, if we apply the restriction operator on the system, we get a smaller grid, which means that the matrix D is going to be smaller, so it will be easier to solve:
Schematic representation of the multigrid method.
After solving the reduced system, we have to project it back to its original size using the prolongator. The construction of these projectors is not trivial, so I won’t go into details, but it requires a lot of time to construct them (comparable to the time it takes to solve). So, to sum up, the steps are the following:
Construct the projectors.
Start with an initial guess of the solution, and do a few steps of the iterative solver on the fine grid (called pre-smoothing).
Project this solution to a coarser grid, use it as an initial guess, and solve the coarser system (should be easy).
Project it back to the original grid, and do some steps of the iterative solver (called post-smoothing).
To convince you that this algorithm really increases the speed of the solver, let me show you a plot comparing different methods:
Mass scaling of different methods, comparing the time it takes to compute a propagator [1].
Looking at the plot, as we decrease the mass of the quarks (x-axis), the time it takes to solve the system increases (the y-axis is logarithmic!). But compared to the other methods, the multigrid method (the one labelled DD-αAMG) is always the fastest, in particular at lower values of the quark masses (the physical value for the up quark is labelled mu, and for the down quark md).
So, if the multigrid method is already fast, what can I do to make it even faster? The solution is pretty simple, and it’s the main purpose of my project: instead of solving the linear system for one b at a time, solve it for multiple b‘s at the same time. And how is it done? You’ll have to wait for the next blog post to learn about vectorization, which consists, in a nutshell, of making the CPU apply the same operation simultaneously to more than one element of an array. Cool, right?
Much has been going on here in Athens, as I have now delved into my project. I’m studying the effects of the E545K mutation of the PI3Kα (phosphatidylinositol 3-kinase α) protein using molecular dynamics with enhanced sampling techniques, in particular metadynamics.
The protein PI3Kα is an enzyme that, upon binding to membrane proteins, catalyzes the phosphorylation of phosphatidylinositol bisphosphate (PIP2). The product of this reaction (PIP3) then participates in a cascade of intracellular communications that has implications in cell growth, metabolism, survival, etc. Because of the key role of PI3Kα in promoting all these processes, mutations in this protein are actually the most common protein mutations leading to the development of cancers. The E545K mutation in particular causes the disruption of interactions between two domains of PI3Kα, facilitating the detachment of these domains from one another and leading to overactivation.
Detachment of the helical and nSH2 domains of the PI3Kα protein with the E545K mutation. The wild-type protein would have amino acid 545 as Glu instead of Lys.
Knowing exactly the effects that the E545K mutation provokes at the molecular level is fundamental to develop drugs that can target specific structures in the protein and prevent the overactivation that leads to cancer. Molecular dynamics simulations can help towards accomplishing this goal, as they provide a detailed atomic description of the dynamic evolution of the mutated protein.
However, standard MD simulations, even on modern hardware and large clusters, can only achieve timescales of up to microseconds. This is not enough to study large protein domain movements such as the one we’re interested in. This is where metadynamics comes in. It is a way to more efficiently explore the energy surface of our system, in order to arrive at relevant stable (minimum energy) configurations that describe our process, along predetermined geometric/functional variables (so called collective variables).
Sounds complicated, but let me describe the basic principle with an analogy. Imagine you are Noah, and you are stuck in a valley with the Ark, having already saved all the animals in there. You can’t push the Ark out of the valley by yourself of course, the mountains are impassable. But then God lends a hand. He starts filling the valley with water. Suddenly, the Ark starts floating and keeps going up. Finally, when that valley is completely flooded, you are able to surpass the mountains, into the next valley. You explore this new valley, saving all the creatures in there. Then God repeats the process, filling in that valley and allowing you to escape to the next one. After a while everything will be flooded, but there will be nowhere left to explore either. I refer you to this excellent Youtube video where you can observe what I just described in a simple way.
In this analogy, the valleys are the energy minima of our system, and the Ark is our MD engine, which allows us to explore the “energy landscape”. Metadynamics is the flooding process: in technical terms Gaussian functions are constantly added to our original energy surface, preventing us from being stuck in the minima surrounded by very large energy barriers (“mountains”). It’s a very powerful method, and has various parallel implementations as well. I’ve been analyzing results from metadynamics simulations already made, and I’m setting up a simulation myself to run on ARIS.
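To make the flooding analogy concrete, here is a toy 1D “metadynamics” sketch of my own (just an illustration of the principle, nothing to do with the real parallel implementations): a walker relaxes on a double-well landscape, periodically deposits a Gaussian wherever it sits, and is eventually pushed over the barrier into the second valley.

```python
import numpy as np

def V(x):
    # double-well "energy landscape": valleys at x = -1 and x = +1
    return x**4 - 2*x**2

def bias(x, centers, h=0.1, w=0.2):
    # the deposited Gaussians: the "flood water" raising the valley floor
    b = np.zeros_like(x)
    for c in centers:
        b += h * np.exp(-(x - c)**2 / (2 * w**2))
    return b

def relax(i, U):
    # the "Ark": slides downhill on grid U until it rests at a local minimum
    while True:
        j = i
        if i > 0 and U[i-1] < U[j]:
            j = i - 1
        if i < len(U) - 1 and U[i+1] < U[j]:
            j = i + 1
        if j == i:
            return i
        i = j

xs = np.linspace(-2, 2, 401)
centers, visited = [], []
i = 50                              # start the walker in the left valley (x = -1.5)
for _ in range(250):
    U = V(xs) + bias(xs, centers)   # biased landscape
    i = relax(i, U)                 # settle into the nearest (biased) minimum
    centers.append(xs[i])           # deposit a Gaussian right there
    visited.append(xs[i])
```

Without the bias the walker would sit at x = -1 forever; with it, the left valley slowly fills up until the walker spills over the barrier into the right one, exactly like the Ark floating over the mountains.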
I leave you with another picture from my various trips across Athens, this time an incredible view of the city from the top of Mount Lycabettus. See you next time!
The official title of my project is “Parallel Computing Demonstrations on Wee ARCHIE”.
Wee Archie
One would therefore probably expect me to be spending most of my time playing around with Wee Archie. For those who don’t know yet, Wee Archie is a small, portable, suitcase-sized supercomputer built out of 18 Raspberry Pis. EPCC uses it to illustrate how supercomputers work to a general audience. It has become so popular that it is quite hard to get to work with it now. Therefore, to be able to develop new applications for Wee Archie, my initial task was actually to build a smaller version of Wee Archie on my own.
What is needed
On my very first day at EPCC, my co-mentor Oliver gave me a box full of stuff and told me that everything I needed to build a Raspberry Pi cluster was in there. He also said that a ten-year-old kid could do it, so it should not be a problem for me. This is what was in the box:
All you need to build a small cluster.
set of 5 Raspberry Pi B+ (quad core) with power cables
5 lego-compatible Raspberry Pi cases
ethernet switch
6 ethernet cables
5 micro sd cards with adapters
HDMI cables (useless if you don’t have a monitor to connect to)
Building the cluster
To be honest, Oliver was right. Physically building up the “cluster” is really easy, provided you read the instructions carefully, step by step. Not doing so might have some unpleasant consequences. My beloved friend James summed up my experience nicely in a meme:
A take-home message? A Raspberry Pi has no built-in storage, and trying to “ssh” into it without plugging in an SD card with an operating system will most likely not be successful, no matter how hard you try.
Networking
The network setup, however, was a bit more challenging of a task, especially when using Ubuntu and the very first instruction part for Linux only said: TODO. Luckily, Raspberry Pis seem to be so popular that Stack Overflow and other geek forums have answers to literally any problem one can have with them. Otherwise, the instructions were quite clear, and now I believe that a smart ten-year-old kid could do it on their own within a couple of hours. For pure mathematicians like me, it might take a bit longer, but two days later I finally had a working “supercomputer” connected to my laptop.
Running programs
Once all the setup is done, running programs on a Raspberry Pi-based cluster is no different from doing it on a standard cluster. Except, maybe, for the fact that running a parallel program on all 20 cores of this small “supercomputer” is still slower than running the same program sequentially on a standard laptop. Performance is, however, not what we are looking for with such a cluster. It’s the fun one can have playing around with this cute device, such as running the parallel “wind tunnel simulation” that EPCC developed and now uses for outreach purposes:
Hi from Bologna ! 🙂 It has been 5 weeks since I came to Bologna.
It is a small and enjoyable city. I stay in the city centre; CINECA, Italy’s HPC centre, is a bit far from the centre, so I take the bus to CINECA every day during the week. Bologna is a distinctive city with its houses and streets. The houses are old in style and generally colourful. The city is famous for delicious food such as pasta, pizza and ice cream – all of which I have tried 🙂 I think “Special Bologna pasta” is the name of their spaghetti bolognese – which for me is the best. Bologna is a musical city like other European cities, and I always see someone playing an instrument in the centre when I return. It makes me smile. The people who live in Bologna are very helpful and cheerful. The city doesn’t have too many tourists, which makes it calm.
Spaghetti Bolognese !
CINECA is a non-profit consortium made up of 70 Italian universities. It is located in Casalecchio di Reno, Bologna, and hosts the most powerful supercomputers in Italy, as stated in the TOP500 list of the most powerful supercomputers in the world. One of these is Marconi, composed of Intel Xeon Phis and ranked in 14th position on the list as of June 2017, with about 6 PFLOPS of computing power.
When I saw the supercomputers for the first time, they were (and still are) very special machines for me. They are very big, with many nodes, and they need a lot of power. I often wonder: how can they be so powerful, and how can they be kept at a low operating temperature? How can they run more quickly? How can they be kept cool? Actually, these questions all come down to the question “What is HPC?”
I was learning everything step by step. I had previously worked with supercomputers without seeing them; I just imagined them. After seeing them: absolutely yes, they are incredible.
I met many cheerful people who work at CINECA. My site coordinator’s name is Massimiliano Guarrasi. He introduced me to everyone. He is also very helpful 🙂 My project mentor works at Bologna University, so he is not at CINECA every day. Up to now, we have defined my project plan. When I have a problem with the project, I talk with my supervisor by email or Skype. I work at CINECA, but my project mentor is always with me and replies to my requests.
My project goal is to implement web visualization of the energy load in HPC systems. I use Grafana for the visualization, and Python to manage and implement the system. How can I do this? What is Grafana? What does the data look like when visualized? Which supercomputer do I use? I will explain all this in another post.
It has been a bit more than a month since my project began. Initially, the learning curve was rather steep, as the project I am working on, ABySS, a bioinformatics software package for assembling DNA sequences, is complex and consists of many intertwined parts and processing stages. Still, as I have got to explore its workings, at this point there is measurable progress.
Assembling a genome is a long process and involves manipulating large amounts of DNA data, especially if the genome is long, as is the case with humans. As such, performance is of great importance: you might be assembling DNA multiple times (or for various species), and you only have so much time to do that. These jobs can last days, so even a seemingly unimportant performance boost can save you a day or two.
My first step was to determine the bottlenecks: parts of the code whose execution time directly impacts the execution time of the whole program. To speed up the process, I contacted the original authors of the program at Canada’s Michael Smith Genome Sciences Centre. They were happy to respond quickly and point out possible improvements, as well as areas I could investigate further.
My first task was to improve an in-house DNA file reader. DNA is stored in the FASTA or FASTQ file format, a text format holding nucleobase strings (A, C, T, G), along with optional DNA read quality in the case of FASTQ files. The main problem with the existing reader was that buffering was not done properly: the reader was issuing more file reads than necessary, and file reads are slow. What you ideally want to do is read files in big chunks and then process them in memory. There is a library that does exactly this for FASTA/FASTQ files, called kseq, so my goal here was to integrate it with the existing reader. The integration was done in such a way that the original implementation persists, but the program will prefer to use the new library when possible (based on the file content).
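The idea behind the fix can be sketched in a few lines of Python (this is not the kseq or ABySS code, just an illustration of reading in large chunks and parsing the buffered data in memory):

```python
def read_fasta(path, chunk_size=1 << 16):
    """Parse a FASTA file by reading large chunks and splitting records
    in memory, instead of issuing many small file reads."""
    records = {}
    name, parts = None, []
    buf = ""
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)   # one big read instead of many small ones
            if not chunk:
                break
            buf += chunk
            lines = buf.split("\n")
            buf = lines.pop()            # keep a possibly incomplete last line
            for line in lines:
                if line.startswith(">"):
                    if name is not None:
                        records[name] = "".join(parts)
                    name, parts = line[1:].split()[0], []
                elif line:
                    parts.append(line.strip())
    # flush whatever is left in the buffer after the last read
    if buf and not buf.startswith(">"):
        parts.append(buf.strip())
    if name is not None:
        records[name] = "".join(parts)
    return records
```

The real kseq library does essentially this in C, with a fixed-size buffer and support for FASTQ quality lines as well.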
The contents of a FASTA file.
After benchmarking the new file reader, I found that it was twice as fast as the old one, so it was a success. However, that was only the performance of reading single files without any processing whatsoever. Upon testing the whole program on real data, I found that the performance improvement isn’t as large (up to a 10% difference in wall-clock time), but that was to be expected, as reading files is only a small part of the whole pipeline.
After this, I moved on to see if I could make more significant speedups. It turns out, after profiling the execution, that there is a particular data structure that slows down the process a lot. So-called Bloom filters, probabilistic data structures for testing whether an element exists in a set, have to be constructed from the DNA data. These filters are built directly from the FASTA/FASTQ files and basically tell you whether a particular DNA sequence exists in a set. As these filters are large, their build process is multithreaded in order to speed it up, but to make sure the process works correctly, each thread has to exclusively lock the part of the data structure it’s building, to prevent other threads from meddling. This causes contention between threads and a massive slowdown, as threads have to wait for one another to access the different parts.
The solution? Lockless filters. Get rid of locks altogether (almost). Instead of having threads fight one another for access rights, what I have done is give each thread its own copy of the Bloom filter to build, and then merge the copies at the end. The speedup was very noticeable: the process now takes half the time it did before. There is a catch, however, which I only realized later. The size of these filters varies from 500 MB up to 40 GB and is a configurable parameter. If you are dealing with 40 GB filters on a 128 GB RAM node (which is the case on Salomon), you cannot have more than 3 copies of the filter at a time. But it is possible to alleviate this problem: you can specify how many threads get their own private copies of the lockless filter, while the rest of the threads share a single filter with locks. This way, the more memory you have, the better the speedup, with a copy of the filter for every thread in the ideal case of sufficient memory.
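The merge trick works because a Bloom filter is just a bit array, so the union of two filters is a bitwise OR. Here is a toy Python sketch of that idea (the real filters in ABySS are far larger and filled by actual threads; the hash scheme and sizes below are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter; each worker builds its own copy lock-free,
    and the copies are merged at the end with a bitwise OR."""
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = 0  # a big integer used as a bit array

    def _positions(self, item):
        # derive n_hashes bit positions from a cryptographic hash
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

    def merge(self, other):
        self.bits |= other.bits  # union of the two sets

# each "thread" builds a private filter over its share of the k-mers ...
shards = [["ACGT", "TTGA"], ["GGCC", "ACGT"]]
partials = []
for shard in shards:
    bf = BloomFilter()
    for kmer in shard:
        bf.add(kmer)
    partials.append(bf)

# ... and the private copies are merged once at the end, with no locking
merged = BloomFilter()
for bf in partials:
    merged.merge(bf)
```

Because the OR merge is a single pass at the end, the expensive build phase needs no synchronization at all.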
While I was doing all this work, I received an email from my mentor: I was to give a presentation about my progress to him and maybe a few other people. Sure, why not; it would be a good opportunity to sum things up for myself as well. A few days later, when I was about to present, I realized it was more than a few people! Most of the seats were filled, and the IT4I Supercomputing Services director, Branislav Jansík, himself came. The presentation lasted about an hour, and most of the people present had several questions, which was pretty cool, as they managed to follow the presentation even though not all of them were in technical positions.
Giving a presentation on my work so far.
The presentation was a very useful experience though, as it was, in a way, practice for the final report. It was far from what a final report should look like in terms of quality, but thankfully, I have received a lot of feedback from my mentor that will help me get the final presentation right. So big thanks to Martin Mokrejš, who also took the photo above!
It’s been a bit more than a month here in Castello de la Plana, Spain, for the PRACE Summer of HPC programme, and one thing I have realized for sure is that time passes really fast when you are having fun. But what do I mean by fun? Let me start by explaining how I spend my weekends here. Trips to the towns near Castello (Valencia, for example), walks around the city and visits to the beach are some of the things I do after I finish work. It’s really relaxing.
Palau de les Arts Reina Sofia, Valencia.
But SummerofHPC is not only about trips on weekends, of course; it is also about working on interesting projects during the week. So, what is my project about? It’s called ‘Dynamic management of resources simulator’, and I know the title doesn’t say much; that’s why, in the rest of this blog post, I will try to explain it.
Let’s decode the title. First, what does ‘management of resources’ mean? In HPC facilities, many applications can run concurrently and compete for the same resources. Because of this, ‘something’ has to boss these applications around, giving the resources to some applications while others wait. Here is where the resource manager (the ‘something’) kicks in: a program with only one specific job, to distribute the resources among the applications based on a policy.
So far so good; now we understand the part of the title which says ‘management of resources’. What about ‘dynamic’? Something that is not obvious from the project’s title is that in this project we will deal with malleable applications, or better said, with malleability. Malleability is nothing more than a characteristic an application can have: the ability to change the number of its resources dynamically while it is running (on the fly). Thanks to this, an application can run with fewer resources, allowing more applications to run concurrently and thereby increasing the global throughput. It makes sense, then, to have a resource manager that exploits malleability.
We have explained 80% of the title; the other 20% is the word ‘simulator’. Imagine that you want to run some applications on an HPC facility. What do you have to do? First you have to log in to the cluster, wait for nodes to be freed (to allocate the nodes), then upload all the applications you want to run on the cluster, and run them. Finally, pray to God that no one interfered with your results. Too much effort. What about having the same behavior even on a single-core machine (in 2018, it’s hard to find a single-core one…)? This is what the simulator is for. Let’s say that after this blog post you get very interested in malleability, and after some time you come up with an algorithm that exploits it. Will you test your algorithm on the actual system? Well, of course you will at some point, but first you can test it on the simulator without any cost.
To summarise, the purpose of this project is to introduce the concept of malleability into the field of resource managers by developing a simulator that executes a workload. By tuning different reconfiguration policies, we can determine the best configuration for fulfilling a given target without running the workload on an actual system. We are using Python and SimPy (a simulation framework). We have implemented the simulator, and right now we are in the testing phase. I am sorry that I don’t have any pictures related to my project, but I am writing code that I run on my laptop, not even on an HPC system. On the other hand, since I am close to Valencia, I can share some paella…
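To give a flavour of what a malleability simulator computes, here is a stripped-down, pure-Python toy (our real simulator uses SimPy and much richer policies; the equipartition policy and the “core-steps” work unit below are illustrative assumptions, not the project’s actual code):

```python
def simulate_malleable(jobs, total_cores):
    """Toy discrete-time simulator: every job is malleable, so at each
    time step the cores are re-partitioned equally among running jobs.
    `jobs` maps job name -> remaining work in core-steps.
    Returns {job name: finish time}."""
    remaining = dict(jobs)
    t = 0
    finish = {}
    while remaining:
        # dynamic reconfiguration: split the machine among running jobs
        share = total_cores // len(remaining)
        for name in list(remaining):
            remaining[name] -= share      # each job advances by its share
            if remaining[name] <= 0:
                del remaining[name]       # job done: its cores are freed
                finish[name] = t + 1      # ... and redistributed next step
        t += 1
    return finish
```

For example, with jobs A (8 core-steps) and B (4 core-steps) on 4 cores, B finishes at step 2, after which A inherits the whole machine and finishes at step 3. Swapping in a different reconfiguration policy and comparing the finish times is exactly the kind of experiment the simulator enables without touching a real cluster.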
Me, Sukhminder, our supervisor, paella and the SpongeBob SquarePants tablecloth…
SOLID2000 is a simulation program for systems that have symmetry in three dimensions, such as crystals. The code is written mostly in Fortran 77 and Fortran 95.
My project is to parallelize, using MPI, the calculation of the band structure, which is basically the energy of the electrons inside the solid. Such solutions are impossible to find analytically for real-world systems; that’s why we have to use numerical methods to sample part of our solution.
The denser our sampling is, the better our plot will be, and the more calculation time is required. For every sample we take, we have to solve an NxN matrix, i.e. find its eigenvalues. Those eigenvalues are the solution we desire, i.e. the energies of the electrons, and the matrix is called the Hamiltonian matrix, which describes the total energy of our system: potential energy + kinetic energy. Formally, the size of the matrix is infinite, so it’s up to us to “cut” the matrix down to a finite size.
The time required to calculate the eigenvalues scales as N³, and therefore we want our matrix to be as small as possible, but not too small, because then the accuracy of our results would be very low. Our strategy for parallelizing the band structure is to assign sample points to each process until all the sample points are done.
This is done with a message that the master sends to all the slaves to signal that the band structure calculation is about to start, after which each process calculates its sample points. The work is distributed equally among the processes in order to achieve maximum efficiency. At the end of the calculation, the slaves send their samples to the master so that the master can write the band structure to a file. The next picture shows an example of the band structure of a carbon nanotube, generated using this code.
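The distribution strategy can be sketched in Python (the real code is Fortran with MPI; the serial loop over ranks below stands in for actual MPI processes, and the two-band Hamiltonian is a made-up illustration):

```python
import math

def eigvals_2x2(a, b, c):
    """Eigenvalues of the symmetric 2x2 Hamiltonian [[a, b], [b, c]],
    via the closed-form solution of the characteristic polynomial."""
    mean = (a + c) / 2
    disc = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return (mean - disc, mean + disc)

def band_structure(kpoints, n_ranks):
    """Assign each k-point to a rank round-robin (as the MPI version
    would), solve the eigenvalue problem for the points each rank owns,
    then gather the results back in k-point order on the master."""
    gathered = {}
    for rank in range(n_ranks):
        my_points = kpoints[rank::n_ranks]     # this rank's equal share
        for k in my_points:
            # toy two-band Hamiltonian depending on k (illustration only)
            gathered[k] = eigvals_2x2(math.cos(k), 0.5, -math.cos(k))
    return [gathered[k] for k in kpoints]      # master writes in order
```

The key property is that the result is identical regardless of how many ranks the k-points are split across; only the wall-clock time changes.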
Band structure of a carbon nanotube, where the y-axis is the energy and the x-axis is the k-point (or sample point).
James and I went on a tour to visit the machines we’re working with, Salomon and Anselm supercomputers. We were guided by the IT4I institute director Branislav Jansík, and it was soon apparent he is acquainted with just about every detail of the inner workings of the centre and its supercomputers.
Salomon Supercomputer
The tour started in a meeting room, behind thick glass through which we could see the computers running. And you can believe that glass was thick, as we could only faintly hear the noise coming from the other side. As we later learned, supercomputers are so loud that you would have trouble holding a conversation next to one without it looking like a game of telephone. If that’s not enough, the noise produced by all the components in a centre like this can very much be heard outside the actual building.
Soon after, we were led into the spacious room where the supercomputers stood. Before entering, however, we were warned that the oxygen level in the room was around 15%, compared to the normal ~21%, which can cause dizziness or even passing out. There’s a good reason for this, however: oxygen at this level prevents fire from spreading in case of an emergency. As a matter of fact, if you were to light a candle, it would soon be smothered by the lack of oxygen. Fortunately, we didn’t experience any problems and proceeded to view the components of a supercomputer.
Branislav Jansík showing us a rack with storage.
The components of a supercomputer are neatly divided into storage, networking and processing units. Most of the racks you see in the picture are processing units, which do the heavy lifting of high performance computing. A few others are dedicated to connecting these nodes in such a way that they can communicate efficiently, which usually involves multidimensional geometry to make sure no node is topologically too far away from another. And of course, there are storage nodes which hold all the data. There is actually so much data, more than a petabyte in this case, that it’s impractical to back it all up, so usually only a small portion of it is.
The computer room is far from being the only large part of a supercomputing centre, as the next stop was the infrastructure required to sustain this beast.
The heating system of IT4I.
After leaving the computer room, we headed towards the roof and stopped at a section which deals with drawing away all the generated heat. And there is heat being generated, alright! There is so much of it that they use it to keep the building warm, even during the winter, with no additional source! Moreover, just 10% of the generated heat is sufficient for this purpose.
The electricity bill is no joke, however, and if you’re interested, it’s in the order of tens of thousands of euros a month. Moving on, we went to the roof.
Roof of IT4I.
Arriving atop the building, we were mostly surrounded by fences around the cooling towers and chillers, and we happened to find a plant thriving without a care in the world among all the metal around it.
Anyway, this is where the tour ended; from there, Branislav showed us a secret shortcut door that led straight to our office, in which we could, again, tap into Salomon’s power solely through a console interface. It’s incredible to see what actually goes into the machinery that we have been using for weeks now through a very convenient and straightforward terminal on our laptops.
Life is fantastic in Ljubljana. Working in the surroundings of wonderful mountains makes me feel like a hero of a fairy tale. We’ve changed the month in the calendar but I still can’t believe how fast time passes.
During the PRACE Summer of High Performance Computing programme I am working on big data using RHadoop. The data set is a set of 80 million ionizing radiation level measurements in the world. To explain everything clearly, I have to do a little introduction.
Beautiful evening near Ljubljana castle.
Theory
First of all we have to see the difference between types of radiation. We can divide them into two basic types:
– Non-ionizing radiation – electromagnetic radiation that does not carry enough energy per quantum to ionize atoms or molecules.
– Ionizing radiation – carries enough energy to liberate electrons from atoms or molecules, thereby ionizing them.
Ionizing radiation can be both our ally and our enemy – it all depends on the size of the dose. It comes from the cosmos, industry, mines and even from other people. It is also a very useful tool in medicine: there are many procedures based on ionizing radiation, such as X-rays or computed tomography.
In most types of examination connected with it, the dose is so small that it would take hundreds or thousands of repetitions to threaten our health. The most important thing for us to know is how much radiation we can receive.
There are many units describing radiation levels (rad, gray, curie, sievert) – I will use sieverts (1 Sv = 1 J/kg).
Exposure to 100 mSv a year is the lowest level at which an increase in cancer risk is clearly evident. For example, a chest X-ray delivers only 20 µSv, while a mammogram procedure delivers 3.0 mSv.
People come into contact with radiation every day in many ways. The yearly dose from the natural potassium in the body is 390 µSv, and even eating one banana comes with a dose of 0.1 µSv.
Dependence of wavelength on energy
HPC adventure
We’ve installed the RHadoop software on the HPC infrastructure, which allows us to manage really huge data sets. In my case, the text file holds 10 GB of data, but RHadoop is ready to manage much bigger files. The data set contains radiation measurements provided by a volunteer science project, ‘SAFECAST’.
What I intend to do with the data set is use clustering algorithms to detect the regions where high levels of radiation occur most frequently. Different clustering algorithms let us look at the data from many angles – for example, we can check the highest (or lowest) radiation levels in January only.
We use RHadoop, which allows for concurrent processing. First, a master node divides the data into smaller, independent chunks. Then each worker node works on one of them, using a Map function which performs filtering and sorting. The next step is the Shuffle: worker nodes redistribute the data based on the output keys. The output then goes to the Reduce function, which processes each group of output data, per key, in parallel.
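The Map–Shuffle–Reduce pipeline described above can be sketched in a few lines of Python (the records, keys and reducer below are hypothetical examples; the real processing is done by RHadoop on the cluster):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer, n_workers=2):
    """Minimal MapReduce: split the input into chunks, map each chunk,
    shuffle the pairs by key, then reduce each key group."""
    # split: the master divides the data into independent chunks
    chunks = [records[i::n_workers] for i in range(n_workers)]
    # map: each worker emits (key, value) pairs for its chunk
    mapped = [pair for chunk in chunks for rec in chunk for pair in mapper(rec)]
    # shuffle: redistribute pairs so each key's values end up together
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # reduce: process each group, one key at a time (in parallel in Hadoop)
    return {key: reducer(key, values) for key, values in groups.items()}

# hypothetical radiation records: (region, dose reading)
readings = [("EU", 30), ("US", 12), ("EU", 50), ("US", 20)]
max_dose = map_reduce(
    readings,
    mapper=lambda rec: [(rec[0], rec[1])],   # emit (region, reading)
    reducer=lambda key, values: max(values), # keep the peak per region
)
```

In RHadoop, the mapper and reducer are R functions passed to `mapreduce()`, but the dataflow is exactly this one, distributed over the cluster.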
So What’s next ?
I am planning to use machine learning ideas to support the clustering algorithms and to open an independent line of research. To be honest, I can’t wait to compare the results!
Some of the first results of Hadoop processing. European places and CPM (Counts Per Minute – another radiation unit) values in logarithmic scale.
It’s already August. This summer is going by really quickly. We have already done a lot of work, but there are still many things to do before the end of this month!
About myself and about Slovenia, almost nothing has changed since the last post. Ljubljana is still wonderful and Slovenia is still fantastic. The more places I visit, the more I like it.
Slovenia is full of beautiful caves!
The weather is not so hot and everything is really green. Also, the people in the faculty here are very friendly, and it’s always nice to work with them.
As for my project, we have already created not only a plugin for ParaView, but also a tool that makes it very easy to create such plugins! We have also managed to do everything in Python, which has some advantages over the “traditional” C++ way:
Comparative scheme between the “C++ way” and the “Python way” of creating a plugin for ParaView.
With this plugin, we can easily take data from ITER’s so-called IMAS database and play with it. For example, we can actually extract information about quantities such as temperature or magnetic field strength and plot them on the grid, or study the evolution of the dataset over time.
A simple animation showing the evolution of the magnetic field in ITER’s vacuum vessel during a shot, loaded with our plugin.
We will continue upgrading the plugin and adding more features to it in the following days. Making things easier is not an easy task!
If you want more information, I’ve been recently interviewed about this project in a funny TV show that actually does not exist. The following video contains the full interview. Enjoy!
I’m well underway into the PRACE Summer of HPC program and my project is going well. I am working with Dr. Nick Johnson (link) on the NEXTGENIO project (link). Here is an introduction to my project.
HPC systems are expensive machines, and running jobs on them uses up precious resources and funding. The study of how jobs get allocated to different nodes could maximise the usage of these resources. The goal of my project is to explore how we can utilise our resources in an optimal way by testing different job scheduling algorithms.
When a person wants to do some work on an HPC system, they are allocated some nodes on which they can run their code for whatever their purposes may be; typically, a person may be given 8 nodes. To give some sense of scale, the HPC system called ‘Archer’ at EPCC has 4920 nodes on which users can run simulations, calculations or whatever they desire. Each node has 24 processors, so that’s over 100,000 processors. There are lots of jobs running on Archer all the time, and also lots of requests for nodes to be allocated. So it is of vital importance to study how these nodes are allocated to different users, so that we minimise the number of inactive nodes during a time slot and maximise the number of jobs performed.
Diagram of jobs running on an HPC system. Some nodes are left vacant while others perform multiple jobs. Note how the diagram highlights the time spent reading and writing. This can be reduced by using NVRAM if the same nodes are used again. This is one of the topics of investigation during this project.
Different users may request different numbers of nodes (depending on the size of their project) and different lengths of time for which they will need them. Suppose Alice requests 100 nodes for 8 hours and then Bob requests 10 nodes for 2 hours. If we were to operate on a first-come, first-served basis (a very naïve approach), then both Alice and Bob must wait for the 100 nodes to become free before Alice can run her jobs, and only then can Bob run his. However, suppose 20 nodes were free the whole time and Alice must wait more than 2 hours before 100 nodes are free to begin her job. A more efficient algorithm would allow Bob to skip ahead of Alice in the queue and perform his job on the nodes that she can’t use. He will not delay the time at which she begins her jobs, as she will still be waiting for the other 80 of her 100 nodes to become free when Bob finishes. This is a very simple two-job example to demonstrate how we can have two different job scheduling algorithms. Clearly, the second algorithm reduces the overall time to perform the two jobs in this situation.
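The Alice-and-Bob example can be made concrete with a toy Python model of the two policies (this is a deliberately simplified sketch, not the scheduler actually studied in the project; the numbers match the example above):

```python
def start_times(jobs, free_now, rest_free_at, backfill):
    """Toy scheduler model: `free_now` nodes are idle right away and the
    rest of the machine frees up at time `rest_free_at`.  Jobs arrive in
    queue order as (name, nodes, hours).  With backfilling, a later job
    may start immediately on the idle nodes, provided it finishes before
    the job at the head of the queue can start (so it delays nobody)."""
    name0, nodes0, hours0 = jobs[0]
    # the head job waits until enough nodes are free
    t_head = 0 if nodes0 <= free_now else rest_free_at
    starts = {name0: t_head}
    head_end = t_head + hours0
    for name, nodes, hours in jobs[1:]:
        if backfill and nodes <= free_now and hours <= t_head:
            starts[name] = 0          # backfill: slip in on the idle nodes
        else:
            starts[name] = head_end   # first-come first-served: wait in line
    return starts
```

With 20 nodes idle and the rest freeing up at hour 3, first-come first-served starts Bob only at hour 11 (after Alice finishes), while backfilling starts him immediately without delaying Alice at all.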
This is a simple case, but it serves as an example of the importance of job scheduling on an HPC system; otherwise, queues for jobs could be days long. If you’d like to learn more, here is a paper I’ve been reading about it (link).
Free time
I’m a very keen golfer, having represented my university for 4 years and captained the team. Being in Scotland, the glorified home of golf, is pretty special, especially at this time of year, when the Scottish and British Opens are on in back-to-back weeks and in close proximity to Edinburgh. So it was on my agenda to make sure I went to one of golf’s 4 majors, the Open at Carnoustie golf club. My dad flew in and we made a trip of it. Seeing Tiger Woods return to some of his old form, McIlroy hit driver-sand wedge to a par 5, Jason Day hit a wedge stone dead to a foot, and the eventual winner Francesco Molinari were the standout moments. A great day to get the head cleared after a tough week of coding.
Me at the British Open
The Iconic 18th hole, where tournaments are won and lost; just ask Molinari and Van de Velde.
Hi everyone! Welcome back! It has been three weeks since we arrived in Nicosia. As an introduction to Cypriot culture, our site coordinator (shout-out to Stelios!) took us out for a meze the night we landed. I’ve never seen so many dishes for just 3 people! On top of that, everything tasted amazing, especially the halloumi, a mixture of goat’s and sheep’s cheese which is usually served grilled.
The next day, we walked around the old part of Nicosia (inside the Venetian walls), and it was quite difficult to find our way, since it looked like a labyrinth. But now we know it like the back of our hand. Also, who would have known that the use of Greek letters in physics problems would turn out to be really useful (even though we don’t understand Greek)?
Fig. 1: Walking route marker found around the old part of Nicosia. Notice the shape of the Venetian Walls.
Back to work. As I mentioned in the introductory post, here I’m going to explain what my project is about and what I’ll have to do. But on second thought I’m going to give you guys a little bit of background.
First of all, we need to know a bit of physics (don’t be scared). In school, we all learned that there are 4 fundamental forces that govern how particles interact with each other: gravity, electromagnetism, strong and weak. The one that I’m interested in is the strong interaction, and the theory behind it is called Quantum Chromodynamics (or QCD).
With only two ingredients, quarks and gluons, QCD tries to explain a wide range of phenomena: from the collisions taking place at the LHC (very high energy, around the TeV scale) to how protons and neutrons are squeezed together to form the nucleus of the atom (very low energy, around the MeV scale). The problem is that, while in the first case we can manage to understand the process using pen and paper, in the second it is very difficult, since all our mathematical tools break down in this energy regime.
This obstacle has not stopped physicists from making predictions at this scale, even though they can no longer be from first principles and depend on experimental data. If we want to use QCD directly and compute (for example) how the mass of the proton emerges from the interaction between the quarks and the gluons, the only way to do it is by using computers. Or, to be more precise, supercomputers.
To do that, we need to put QCD on a computer. The way to do this has been known for a couple of decades: lattice QCD (LQCD). It has the word lattice in it because what we do is discretize space-time into a 4-dimensional grid, placing the quarks on the nodes of this lattice and the gluons on the links. Then, if we want to simulate the proton, we put 3 quarks (2 up quarks and 1 down quark) on the lattice, let them evolve, and see whether at the end we get a proton or not. But why is it so computationally demanding? We have to evaluate a path integral; in other words, we have to compute every possible path that a quark can take from one point of the lattice to each and every other. And then repeat this computation many times to gather enough statistics.
As an example, in Fig. 2 the masses of different baryons (particles made up of 3 quarks, like the proton) calculated by different collaborations are compared with experimental data:
Fig. 2: The colored points are the masses of the baryons computed using LQCD, and the black line (with gray band) is the experimental value [1].
As you can see, LQCD gives results in perfect agreement with the experimental data. But simulations of this type couldn’t be done in previous years. In the beginning, LQCD had to use higher values for the masses of the quarks than the physical ones, since that makes the simulations easier and faster. To get to the physical point, new algorithms had to be developed.
One of these is the multigrid solver. To know what it is you’ll have to wait for the next post, where I’ll also introduce my project. See you then!
So I’m over here in Edinburgh working on a project in the field of “High Performance Computing” (HPC), but what does that mean? I didn’t know what that was when I was first introduced to it a year ago, so I understand if it’s raising a few eyebrows. Maybe what’s running through your head is the scene from House of Cards where Lucas Goodwin tries to hack into AT&T’s server farm and is arrested by the FBI. That’s actually what a high performance computer looks like, below is a picture of the supercomputer ARCHER at the University of Edinburgh! But first, what is HPC?
ARCHER, EPCC’s supercomputer.
What is HPC?
First off, high performance computing is the aggregation of computers with the goal of achieving higher performance than one would get from a single machine. If we view each computer (imagine your own laptop) as a node, an HPC system is a cluster of nodes, each with its own operating system. So, as I mentioned in the first blog post, if we have a job to do, we can optimise performance by sending different parts of the job to different nodes. This is known as parallel computing.
How do you work on one of these computers?
It’s like working normally in the terminal: you log onto these computers remotely through SSH. For example, ARCHER (EPCC’s supercomputer) is 10 km south of Edinburgh, at the Advanced Computing Facility, but you can work on it from anywhere with an internet connection. When working on the HPC system, you can write your programs in any coding language you want: C++, Fortran and Python are the popular options. However, optimising performance is always the end goal, so if possible people tend to avoid Python, because it is “cripplingly slow”, as I learnt in the training week. Furthermore, one must also deal with the added complexity of communication between the nodes being used, for which MPI (Message Passing Interface) is the standard.
Entrance to the Advanced Computing Facility. I was surprised it was on Google maps!
Where are these computers?
We actually got to visit Edinburgh’s Advanced Computing Facility and take a tour around. As you can see, it’s not a very welcoming site. We weren’t allowed to take photos inside because it’s seen as a security risk; however, they said they would supply me with stock photos of the supercomputers, hahaha. It was a very thorough tour, and we were accompanied by a group of people who had been summer students here 20 years ago, in the midst of their reunion. They told us stories about doing research on computers back then, when they had no portable machines, so they have memories of pulling all-nighters in the computer lab in order to get their work done. I don’t envy them!
Security is very high at the ACF, much higher than my home university where the smaller HPC system is on campus behind a locked door.
How much do they cost?
These computers are insanely expensive: ARCHER, the latest supercomputer, cost £43 million, and its electricity bill is about £1 million a year. These computers are so powerful that the majority of the electricity is used for their cooling systems; without them, fires would occur.
What work is done on them, and does it affect us in any way?
The final question to answer in this quick overview is: what work is done with HPC systems, and will it affect our lives? Many branches of science and industry make use of them: weather forecasting, molecular dynamics and aerodynamics; McLaren is doing research into developing a ‘hypercar’ using HPC, and the NHS uses HPC to manage its database and analyse big data. In fact, we also had a seminar on how HPC is used for drug design. Martin Lepsik from CERMAV Grenoble described computer-aided drug design and how they can perform virtual screens of drugs, simulate how they will bind with proteins, and perform calculations to score the strength of these bonds. This saves lots of time which would otherwise be spent on clinical testing, and enables them to try many more combinations than would be possible in a laboratory. I found it quite interesting, as these scoring functions had a link to some of the work I did in my final-year thesis involving quantum correlation energies.
Picture of Jakub and Eva on our first day at EPCC, the James Clerk Maxwell Building is in the background.
Now that the training week is over, we’ve begun our projects and met our supervisors. Here is a picture of Jakub and Eva as we start our first day of work at the James Clerk Maxwell Building.
In my next post I will discuss my project. If you have any questions, just ask in the comments and I will happily reply.
Doing a summer internship might feel like skipping a well-deserved vacation after a long year at uni, but fortunately most supervisors are pretty flexible about giving us students some days off. Last week I flew back to Manchester for a couple of days for my graduation ceremony, and even though I did feel slightly silly asking for time off only a week after getting to Jülich, everyone was very understanding. And it was definitely worth it! My family came over from Helsinki, and we were all in awe of the mysterious British robes and hoods everyone was wearing. Did ancient British students have three Simpsons-like fingers, or what is the story behind those sleeves?
After three years in Manchester!
After returning, I quickly got back to work. Three blog posts in, and I finally have something to say about my project! My supervisor and his colleagues here at JSC have developed C++ code that performs simulations of electron motion in graphene. You can read their paper here, since I don’t want to bore anyone by rambling about Hybrid Monte Carlo simulations and all that. The important idea is this: we have C++ code that generates descriptions of where electrons are in a graphene lattice at a given time, and how they are interacting. All of that information can be stored into a matrix (essentially a grid of numbers).
The fermion matrix for graphene is obviously a LOT bigger.
But the thing about computing is that every branch of science has its own preferred programming language for analysing data. It’s kind of like a mother tongue: speaking anything else seems to take a lot of effort for some people. Without trying to offend anyone (there are plenty of exceptions!), the stereotypical conversation goes something like:
Paula the Physicist: Python is great! It’s easy to read and write, so I can make nice plots of my data really quickly, and move it around in different files.
Matt the Mathematician: Oh no, Matlab is much better! You can write everything in vector-form, so that it’s very intuitive to read. I mean who doesn’t love to read things like V = 1/12*pi*(D.^2).*H?
Chloe the Computer scientist: What, Matlab? Python? Properly compiled languages are sooo much faster. C, C++ or Fortran are much better.
Fortunately, the eternal debate creates lots of small projects for us summer students. This week I’ve worked out how to save the matrix created by the C++ code we already have, so that it can be imported into Matlab. The idea is then to pass that data on to Matt, who will be happy that we are speaking his language, as it were. Now that we have this starting point, the critical question in High Performance Computing is whether we can do it faster. If you run a program on a small data set, a 10% speed increase does not matter much; it only makes you wait maybe a fraction of a second less. However, run the same program in parallel, on many computers and with much more data, and the absolute time saved by that same 10% becomes enormous: together with further optimisations, code that would take weeks to execute can be done in hours. More work for the rest of the summer then!
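As a rough sketch of what “saving a matrix so Matlab can read it” can look like: Matlab loads plain whitespace-delimited text files directly with `load('matrix.txt')`. The snippet below is Python rather than the project’s actual C++, and the matrix is made up, but the idea is the same:

```python
# Hypothetical sketch of the export step (the real code is C++ at JSC):
# write a matrix as plain ASCII text, one row per line, values separated
# by spaces. Matlab reads such files directly with load('matrix.txt').

def save_matrix_ascii(matrix, path):
    with open(path, "w") as f:
        for row in matrix:
            # %.17g keeps full double precision, so Matlab sees the same numbers
            f.write(" ".join(f"{x:.17g}" for x in row) + "\n")

# A tiny stand-in matrix; the real fermion matrix is of course much bigger.
example = [[1.0, -0.5], [-0.5, 2.0]]
save_matrix_ascii(example, "matrix.txt")
```

Matlab then reads it back with `M = load('matrix.txt');`.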
Hi, my name is James Lowe, I am 23 years old and I was born and raised on the north side of Dublin, Ireland. I have just this summer completed my bachelor’s degree in Electronic and Communications Engineering at the Dublin Institute of Technology. I am taking part in the Summer of HPC in Ostrava, Czech Republic, at IT4Innovations, where I am working on a tool for visualizing performance data.
Me myself and I
As part of my bachelor’s degree I was required to complete a final year project: simulating the Ising model of ferromagnetism, which looks at how the magnetic state of a material changes with temperature and applied magnetic fields. The model required an HPC solution, as simulations would take far too long on consumer-grade PCs. To tackle this I built my own Beowulf cluster from old PCs and also applied for time on Ireland’s national supercomputer, Fionn, with the Irish Centre for High-End Computing (ICHEC). This was my first introduction to high-end computing and I really enjoyed it! On the back of this, my thesis supervisor Dr. Kevin Berwick recommended that I apply to SoHPC, and here I am today!
The Summer of HPC kicked off with a training week in Edinburgh, Scotland, where David Henty gave a great introduction to parallel computing using MPI. Edinburgh is a beautiful city: we got to see such sights as Arthur’s Seat and the café where J.K. Rowling wrote Harry Potter, and we also did a bus tour around the city.
The project I am working on in Ostrava is the visualization of performance data. There are many tools for measuring performance data, such as Scalasca. They allow programmers to optimize their code by showing the time a given function takes to execute, along with other performance metrics. From this it is possible to make alterations to the code and improve its efficiency. However, the feedback from the current tools is not intuitive. The aim of this tool is to take the data they produce and present a view of it based on the communication model.
Ostrava is a beautiful industrial city with a lot of character. There are many interesting places here that I would like to visit. I had the pleasure of going to see the coal mines, which I found fascinating. The people at IT4Innovations are friendly and welcoming, and so far I am thoroughly enjoying my stay here.
After graduating with my bachelor’s in Mechanical Engineering, the quest for the best possible version of myself began. I have always had a passion for understanding the physics of the objects around me. Fortunately, I found a job where I was able to participate in building and simulating a plethora of automotive assemblies, using conventional front-end tools, tested and validated against various criteria. Though fascinated in the beginning, I started to realize that it is more fun working on the back-end intricacies, like solving a complex problem numerically and bringing out a visual appeal from the computed data.
My Master’s program in Computational Materials Science provided that opportunity by introducing me to multi-scale material modeling and computational mathematics, through which I gained insights into how to approach the art of problem-solving. In recent months, the field of machine learning came to my notice while I was browsing the contents of materials informatics. It piqued my curiosity to explore how differently it works from typical simulations, where the equations governing the pertinent phenomena are generally solved by conventional methods.
Fortunately, I found myself in a position to work as an intern in the PRACE Summer of HPC programme. I strongly feel that this opportunity can lift my knowledge and confidence before starting my Master’s thesis, and bridge the gap between my technical expertise and my shortcomings. My project is on the hybrid scaling of Convolutional Neural Networks using High-Performance Computing.
St. Margaret’s Well, Edinburgh
A better start than I could have asked for was the training week in Edinburgh. The lectures were on point, followed by a chance to run code on ARCHER, the UK national supercomputer. Along with the classes on MPI and its applications, it was a fun-filled week. The spontaneous afternoon beach plan, the mid-week trek, restaurant hunts, midnight walks, friends, hills, castles: a trip that will go down memory lane.
From the land of bagpipes and convivial people, we took a cup of kindness and left for our destined sites. In my case, that is Amsterdam (trust me, I’m not on vacation!). Until the end of August, I will be accessing Cartesius, the Dutch national supercomputer at SURFsara, to run my simulations. In the coming posts, I will write more about my project. Until then, goodbye from Amsterdam!
I am a 21-year-old Computer Science student at Poznań University of Technology. In September I’m going to start the last semester of my BSc degree. In a nutshell, I’m a geography and floorball lover, never too tired to try something new (isn’t life all about trying new things?).
Me and George during the bus tour around Edinburgh.
My programming journey started in my last year of high school. In the beginning I claimed that programming was boring, but then I saw some people writing code for calculation purposes, and I must say I got interested. Underestimating programming was a mistake, but also a valuable lesson: I learned to dig deeper into things. That was a turning point which changed my life plans.
During my studies, I try to be active in many fields. My main programming language is Python. I am also familiar with frameworks like Django for Python and jQuery for JavaScript. The big challenge for me is my Bachelor’s thesis, which aims to classify artifacts in crystallography photos using machine learning tools. It has introduced me to the concepts of neural networks, decision trees, regression algorithms and statistics in general.
I am a participant in the machine-learning circle GHOST (check it out! http://ghost.put.poznan.pl/en/ ). When our tutor Mateusz Lango showed me the SoHPC site, I was very excited. I imagined how effective it would be to work with such huge computing power. I was inspired by supercomputers, and that’s why I am part of the PRACE Summer of HPC programme right now.
During the training week in Edinburgh, I gained more experience in C++ and MPI day by day. Despite already being familiar with MPI, I still discovered and learnt some really interesting things.
The capital of Scotland was the place where I met people from many fields. It was very inspirational to observe physicists, mechanical engineers and computer scientists developing their code. Moreover, I had a chance to experience many foreign habits, like the ‘irresistible need to keep sunglasses with you wherever you are’ (thank you George). I couldn’t believe that someday I would use the word ‘lunch’ (in Poland we generally don’t use it), but now it’s in my blood. My family has tried to help me gain weight for many years, but what they really should do is send me to Edinburgh to have some lunches. It helped. Really.
It took me a bit to get used to the left-hand traffic in the UK.
All good things must come to an end. From Edinburgh I flew to Ljubljana, where I started a new adventure with HPC. You won’t believe how beautiful this city is…
At the University of Ljubljana I am working on data about radiation levels around the world, using Hadoop and the R programming language. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop builds on the MapReduce algorithm, which allows me to extract the most important information from big data (e.g. text files of 20 GB).
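The MapReduce idea itself fits in a few lines. This is just an illustration with made-up (city, reading) records, not the actual Hadoop job, but the shape is the same: a map step turns each input line into a key-value pair, and a reduce step aggregates all values that share a key.

```python
from collections import defaultdict

# Made-up (city, radiation reading) records; the real input is ~20 GB of text.
records = ["Paris 0.11", "Ljubljana 0.09", "Paris 0.13", "Ostrava 0.10"]

# Map phase: turn each input line into a (key, value) pair.
mapped = [(line.split()[0], float(line.split()[1])) for line in records]

# Shuffle + reduce phase: group values by key, then aggregate each group
# (here the aggregate is the maximum reading per city).
grouped = defaultdict(list)
for city, value in mapped:
    grouped[city].append(value)
max_per_city = {city: max(values) for city, values in grouped.items()}
```

Hadoop does exactly this, except the map and reduce steps run in parallel across the cluster, so the 20 GB never has to fit on one machine.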
I am really glad that I have a chance to meet the world of HPC!!! Who knows, maybe one day you will be able to say the same words!
(Apologies that this is quite late in being posted; I’ve had a few hectic weeks of moving accommodation, getting started on the project and just adjusting to living here.)
Enough excuses, back to the blog post.
Well, my name is Conor O’Mara and I am 22 years old (this is giving me secondary school French oral flashbacks, ahhhh!). I am from Dublin, Ireland and have just finished a Bachelor’s in Theoretical Physics at Trinity College Dublin. I have been lucky enough to be awarded a place on the PRACE Summer of HPC programme, where I will be based at EPCC, the parallel computing centre at the University of Edinburgh.
During my final year thesis I got some exposure to High Performance Computing (HPC), using Trinity’s Kelvin cluster (named after Lord Kelvin, the Scots-Irish scientist honoured by the absolute temperature scale, the kelvin) to perform some calculations. These calculations were in the field of quantum many-body theory, under the supervision of Professor Charles Patterson. While I was doing my final year project, another final year student, Dovydas Mickus, who had the same supervisor, was working on a project focused on parallelising some of the code in order to increase the efficiency and speed of these calculations in the allocated node time. As we met as a group to discuss our projects with our supervisor, this was my first exposure to the idea of parallelising calculations across many nodes (i.e. sharing the job by splitting it into many pieces) to speed them up. So I have both of those guys to thank for my introduction to HPC and parallel computing.
Here is a picture of (L-R) myself, Dovydas Mickus and Prof. Charles Patterson.
In Edinburgh, I will be working on a project which simulates and tests different job scheduling algorithms for how an HPC system allocates nodes. The project will encompass some software engineering, testing and some work with algorithms. It is quite a computer-science-heavy project, which is a little daunting for me with my physics background. However, I am looking forward to the challenge, as I view it as a good opportunity to develop my skills and further my knowledge in this field.
Other aims of this blog post are to share my experience during my time here and also to advertise this programme, so that hopefully more people from Ireland will apply to the PRACE Summer of HPC. There are 23 people from across Europe who have been accepted to work at different locations across Europe: 3 of us are based here in Edinburgh, while others are in places ranging from Nicosia to Bologna to Barcelona. PRACE fully supports us with a stipend for accommodation and living expenses, and academic supervisors to help us with our projects. So far, I would highly recommend applying if you are from a Physics, Chemistry, Maths or Computer Science background.
The programme began in the first week of July, when everyone met in Edinburgh for a training week. The week itself was great: the standard of teaching at the university was very good, and I got a nice overview of HPC that knitted in well with my existing knowledge while also covering lots of new material I had never seen before (perhaps I will discuss a topic or two in a future blog post). But the emphasis was shared with social elements too: we ate out together most nights, visited most of the main tourist attractions and, most importantly, had a laugh the whole time, which is the type of laid-back environment I like to work in.
Here are some photos from the first week featuring an aesthetic African restaurant (thanks to Chef Mo for having us), a Rick & Morty themed bar, Edinburgh castle, the group on top of Arthur’s seat and some other gems.
I’ll leave you until next time with a quote from the famous Edinburgh writer Sir Arthur Conan Doyle, which will stand as a reminder for me to blog more.
“A trusty comrade is always of use; and a chronicler still more so.”
It’s going to be an awesome journey fellas, and yes the title could be made into a movie!
Hi, my name is Atul and I am 25 years old. I hail from India, and I am currently enrolled as an MSc student in Computational Science and Engineering at the University of Rostock, Germany. These are (and will be) a set of posts on my thoughts, experiences and work as a participant of the PRACE Summer of HPC 2018 programme, where I’ll be working on an HPC (High Performance Computing) project for 2 months at one of the supercomputing centers in Europe. So come with me as I embark on this journey and take you through one of the most amazing experiences of my life.
Not so long ago, I was building cars and all-terrain vehicles as college projects.
The Invader, a project I was a part of during my Bachelors. Looks Familiar, right?
That was during my Bachelor’s degree in Automobile Engineering. As a participant of this programme, I have now moved on to building a small infrastructure for HPC applications (think software). This is again in line with my Master’s, as it explores the simulation side of engineering and the sciences.
How did I get here? I suppose it has more to do with the positive cosmic forces that drive our universe. I may have caught and synched with an appropriate frequency.
The PRACE Summer of HPC programme is one such beautiful instance of an “event” (get it?) which came my way, and each day has been absolutely phenomenal so far. I’ve had the privilege and blessing of visiting the beautiful city of Edinburgh along with 19 other participants from various European universities. Oh, and did I mention, I am also going to spend the better part of this summer in Bologna, Italy? Not every day does a student get such an amazing opportunity to work on a challenging project and see Europe as a detour from work. Surely the makers and organizers of this programme deserve applause for thinking this through, or at least a big shoutout from one very happy participant (or perhaps 20, each year)!
I remember the conversation I had with myself when I applied. The guy with the thorns on his head said, “Why bother! You won’t stand a chance against all the PhDs going for this, from across European institutions, which is as good as saying students from all over the world.” His demeanor was understandable: of the projects open for applications, I understood only a few and was eligible for only a couple. However, the guy with the ring on his head said, “Look, dude, if you don’t apply you definitely won’t get it, but if you ‘just’ try, you may well get it. That is statistically more likely, if not highly likely” (yeah, my subconscious talks in probabilities sometimes)! The choice was obvious. I had to give it a shot!
I was busy running errands for my brother’s wedding (yes, that was in April of 2018) when I got accepted. The day just became brighter and more lively (as is the case with most Indian weddings), and of course even more fulfilling, because I was already in a place, both emotionally and physically, where I could celebrate with my whole family. My sister seemed happier than me: “Wow, you’ll be seeing Italy now, eh!” she said. To which I replied, “I’ll be living there, sista!”
I couldn’t have asked for more; every aspect of it has been amazing so far. The design is just spot on, accompanied by the apt resources and infrastructure to make it such a brilliant success for both the participants and the hosting organisations. The training week in Edinburgh gave me (and us) a taste of the power of a supercomputer and what it can do when used properly (and also of how my laptop is not as powerful as I once thought as a kid while buying it).
The awesome training week, at the awesome city of Edinburgh, with an awesome supercomputer, awesomely named Archer with other fellow awesome participants!
The exercises during that week introduced us to, and educated us on, how we as participants should leverage these powerful resources to the best of our abilities for the projects assigned to us.
Post classes, there was of course, Edinburgh to explore. Just look at this animation google made for me, it will basically give you a “feel” for it. Spoiler alert, that’s Edinburgh in the backdrop when the seagull takes the epic nosedive!
After the training week, I flew to Bologna, where Cineca, a supercomputing center and my placement site, is located. I must say there is something new to learn every single day. As cliché as it sounds, getting to do the real thing beats all classroom lectures. Even my classroom projects look cute compared with what I learnt in the first week on one partition of a fully fledged supercomputer. There is simply no comparison.
I’ve been invited to work on a CFD project that is currently, and has been for quite a while now, a hot research topic. (Think of CFD as simulating and learning about the behavior of fluids, such as air or water, interacting on or with some object. There are of course other complexities involved.) This matters especially because a large part of it depends on computation that should produce results as quickly as possible, which makes a supercomputer extremely useful in such an endeavour. The applications are many. For example, an airplane’s takeoff leaves behind a vortex so strong that the next plane must wait for it to diminish before taking off. Understanding this vortex could help better organize takeoff times at airports, perhaps increasing airport throughput and revenue.
Simply put, my project is to create programs that help visualize such computations while they are still being computed on the supercomputer! So think: saving resources on a machine that saves time. There is a name for it: “in-situ visualization”. Sounds awesome, right? Well, that’s because it is.
Ladies and gentlemen, presenting, with 20 PFlops/s and featuring more than 300,000 cores, the one and only… Marconiiiiiiiii
Not to mention that I get to do it at Cineca on Marconi, currently the 18th most powerful supercomputer in the world (check the top 20 here).
I simply feel blessed to be able to work on it with such amazing individuals helping me along my learning curve. The people running such centers are obviously the best in their fields (most of the time Mathematics, with honors or magna cum laude, I’ve observed). This means that the water-cooler moments also become interesting, engaging and of course fun (nerd jokes are not only generously appreciated, but almost always answered with another one).
Well, I know by now you must be totally intrigued, but cliffhangers are always awesome as they, well, leave us hanging on a cliff!
So that’s all for my introduction for now and for this post, stay tuned as I’ll keep publishing the goodies in the time ahead as this project develops and this beautiful summer blossoms. For this is, and is going to be a “Summer of Highly Piquant Circumstance” of mine!
Hi everyone! My name is Pedro Santos, and I’m participating in the PRACE Summer of HPC 2018 programme.
I’m from Coimbra, Portugal, home to one of the oldest universities in Europe (dating back to 1290), with very rich and unique academic traditions that I carry in my heart. I obtained a degree in Chemical Engineering at the University of Coimbra, finishing my MSc thesis in September 2015 in the area of polymer science. Afterwards, I worked for a year in that field, but I grew dissatisfied with the routines of a chemistry laboratory. I wanted something more challenging.
That’s when I was introduced to the exciting world of molecular modelling. It became a way to express my love for chemistry and computer science at the same time. This is a rapidly growing field with immense opportunities for research. Since March 2017 I’ve been working on molecular simulation for materials design, as part of a research project, and I’m starting my PhD in this area next year, at the University of Coimbra.
It was in this period that my supervisor first told me about PRACE Summer of HPC. As I was still taking my baby steps into molecular simulation techniques, I immediately thought this was the perfect opportunity to take a big step forward in my learning curve. On one hand, I get to learn about High Performance Computing (HPC), the “beating heart” that drives all chemical simulations today (without it, the exponential growth in that field would have been impossible) and, as a bonus, brush up my programming skills. On the other hand, being accepted on a project that involves molecular dynamics and metadynamics, two techniques I will use extensively in my PhD, is an amazing opportunity to learn from people with much more experience in those particular areas.
PRACE Summer of HPC also presented somewhat of a personal challenge to me. This was the first time I traveled abroad and by airplane. At first I was afraid, but I took on the spirit of the Portuguese discoverers of 500 years ago and surged on into the unknown. I actually had a lot of fun in all my travels, and even got a sneaky peek into Paris in my first connection. Still, I would say I prefer the quiet and cosy setting of Edinburgh, which I had the pleasure to explore in great company during the training week. It was an intense but rewarding week, as I flexed my programming muscles with MPI, and shared stories with wonderful people from around the world.
And now, as I settle in Athens and in the lab supervised by Dr. Zoe Cournia at BRFAA (to whose members I thank for the warm welcome), a new adventure is beginning. I will be working on an exciting and extremely relevant project, regarding the simulation of a mutation in the PI3Ka protein, which is known to be related to carcinogenesis. I’m sure I will learn many concepts and expertise that will be useful for me in the future. I will also have to face the daily challenges of living in a foreign country, with a language unknown to me (even a completely different keyboard layout! o.O) – but that will only make me a better man in the end. And, most of all, I will explore the best this city has to offer and have some fun!
A photo I took of the most iconic landmark of Athens: the Acropolis!
For now, I leave you with a fitting piece of wisdom, that emerged from this country 2500 years ago:
“There is nothing permanent except change.” (Heraclitus)
I am a student of the School of Electrical Engineering, University of Belgrade majoring in Software Engineering. The Summer of HPC programme was introduced to me by my Professor who attended it a few years ago, and who thought it was an awesome experience. After looking at the offered projects, I got pretty excited, as most of them deal with real problems and data, and on a large scale too.
So far, the people I’m with in Ostrava, which is where I’m situated, have been nothing but amazing. I think I got lucky with my mentor, Martin, as he’s very responsive to any questions I have, whether it’s about the project or simply getting to know the surroundings. He’s away for a few days now, but I’m getting him a beer as soon as he comes back. The other student, James, who’s with me in Ostrava for the summer, is this really laidback guy, extremely easy to talk to. So yeah, I’d say I’m in pretty good company for the weeks to come.
James and I in our IT4Innovations working environment.
As for the project, I’m dealing with software that assembles a genome from a large number of short DNA sequences. In other words, there’s lots and lots of data, and it can take days to finish processing. This is where I come in: my main job is to improve the efficiency of the implementation, whether by making small improvements to the DNA file readers or by making sure that as many stages of the assembly as possible run in parallel, multithreaded and multiprocessed. The whole bioinformatics field is absolutely new to me, so initially it took me a bit of time to get acquainted with the mechanics of the algorithms being used and various new terms. I’d say it has paid off, though, as the approaches used to handle DNA data are rather clever, particularly from an engineering perspective.
As far as I’m concerned, this is a good start and I’ll be updating with new information as the situation progresses.
I am Eva, a Numerical and Computational mathematics student from Prague, Czech Republic.
My main field of interest is numerical linear algebra, specifically all kinds of matrix calculations.
Just recently, I started to focus on discrete inverse problems as a part of my Masters thesis that studies algebraic methods for single particle reconstruction, i.e. particle shape reconstruction from projections. Outside of the university, I enjoy playing sports, mainly Ultimate Frisbee (which is not just a random throwing of the frisbee in a park!), puzzle solving, travelling, eating chocolates, cooking and flipping pancakes. I am slightly addicted to coffee and my favourite movie character is Dave the Minion.
This summer I will be working at the Edinburgh Parallel Computing Center (EPCC) on Parallel Computation Demonstrations on Wee Archie. To start with, let me share my impressions from the first two weeks here.
Besides getting to know the other participants, the first week was dedicated to training, mainly focused on an introduction to HPC and programming with MPI. The highlight of the week for me, however, was a spontaneous afternoon trip to Portobello beach (you just cannot leave a seaside city without going to the beach, right?!). It always sounds promising when a group of 9 young people from all around the world who barely know each other’s names decide to do something adventurous. The take-home message from the trip is that if a guy from India leaves the group claiming he will be back in 2 minutes, it means that, if you are lucky, he will turn up sometime during the next half an hour. Maybe.
Portobello beach. Thanks Wojtek Laskowski for the picture.
The plan for the second week was to slowly get to know our mentors, colleagues and to settle down at the EPCC and in our new accommodation at Haymarket. Unfortunately, the actual plan for me changed to getting some medicine and resting in bed. It seems to take some skill to catch a flu during a heatwave in Edinburgh but yeah, it is possible. By the way, did you know that a pharmacist in the UK might not actually speak English at all? Pantomime, however, seems to work anywhere in the world.
My name is Petteri. I am from Finland, where I am currently finishing my MSc in chemistry.
I started my studies at the University of Turku in 2013, studying geology. After taking my introductory course in physical chemistry, I decided to pursue a degree in chemistry (although I did finish my BSc in geology on the side, because rocks are nice).
My initial plan of specializing in physical chemistry changed in 2015, when I joined David Palmer’s research group in Strathclyde University for a few months. After three exhilarating months of molecular dynamics (MD) and supercomputers, I knew there was no turning back. I returned in 2016, this time to learn the (very) basics of 3D-RISM and to deepen my knowledge of MD.
Late last January I received an email from PRACE, which advertised the Summer of HPC program. Although 91% of all the project topics looked alien to me, I could spot two topics related to computational chemistry. I decided to apply for project #14, named High throughput Virtual Screening to discover novel drug candidates. The motive here was quite simple: I had done a bit of classical stuff and a bit of non-classical stuff. Virtual screening was (and still is) an area where I have no experience whatsoever.
Late last March I got the good news from the programme coordinator: I was selected. Fast-forward a few months, and I found myself in Edinburgh participating in the programme’s training week, which was enjoyable. I made a few friends, learned quite a few new things about message-passing programming, and enjoyed a few nice local lagers in good company.
After the training week I flew to Athens, where I started working in Zoe Cournia’s group, based at the Biomedical Research Foundation of the Academy of Athens (BRFAA). During my stay here I will mostly focus on using virtual screening methods to try to find an isoform-selective inhibitor for an enzyme known as ALDH7A1 (aldehyde dehydrogenase, seventh subfamily, member A1).
ALDH7A1 and its first solvation shell.
The motive for this project stems from the fact that overexpression of ALDH7A1 (with other isoforms such as ALDH1A1, ALDH1A3, ALDH2, and ALDH4A1) has been found in multiple types of cancer. In these cancers ALDHs promote drug resistance and amplify the amount of cancer stem cells (a group of cells which can initiate and propagate tumors, among other things).
To remain impartial, I should probably add that the ALDH family members also do non-cancerous things which include, but are not limited to, catalyzing reactions which detoxify reactive aldehydes in the body (such as acetaldehydes), and help reduce oxidative stress.
Stay tuned for more updates and cool figures!
Hen egg-white lysozyme – just something from one of the tutorials I was working on earlier. The figure shows how the protein is moving throughout the simulation.
Further reading
C. Chan, J. W. Y. Wong, C. Wong, M. K. L. Chan and W. Fong: Human antiquitin: Structural and functional studies. Chem.-Biol. Interact. 2011, vol. 191 (1–3), p. 165–170.
C. van den Hoogen, G. van der Horst, H. Cheung, J. T. Buijs, J. M. Lippitt, N. Guzmán-Ramírez, F. C. Hamdy, C. L. Eaton, G. N. Thalmann, M. G. Cecchini, R. C. M. Pelger and G. van der Pluijm: High Aldehyde Dehydrogenase Activity Identifies Tumor-Initiating and Metastasis-Initiating Cells in Human Prostate Cancer. Cancer Res. 2010, vol. 70 (12), p. 5163–5173.
M. Luo, K. S. Gates, M. T. Henzl and J. J. Tanner: Diethylaminobenzaldehyde is a covalent, irreversible inactivator of ALDH7A1. ACS Chem. Biol. 2015, vol. 10 (3), p. 693–697.
V. Vasiliou, D. C. Thompson, C. Smith, M. Fujita and Y. Chen: Aldehyde dehydrogenases: From eye crystallins to metabolic disease and cancer stem cells. Chem.-Biol. Interact. 2013, vol. 202 (1–3), p. 2–10.
When I was a kid, we only had my brother’s computer at home. I was 5 years old and was only allowed to play with the computer for 15-20 minutes every day. Then one day, my brother asked me “What do you want to be in the future?” and I answered “A Computer Engineer! Then I can play with the computer whenever I want.” And this is how a dream started: I have wanted to be a Computer Engineer since childhood.
On the Hallstatt Skywalk
My name is Enes, and I’m from the beautiful Republic of Turkey. I just graduated with a BSc in Computer Engineering. I’m planning to pursue a Master’s degree and PhD outside of Turkey. But apparently I’m not the only one with this plan, so finding a scholarship for a Master’s degree is challenging. Consequently, I’m planning to skip the Master’s degree and begin a PhD directly, because it is easier to find funding that way.
I see myself as a lifelong student, because I have a hunger to learn new programming tricks and I really enjoy programming as a hobby. I quickly get bored when I work on the same thing for a long time, so I have looked into different areas over the years. I also love competitive programming contests and hackathons, and have won prizes in them. Until last year, I was working in web technologies as a full stack developer. I’m experienced in many web frameworks and different languages, such as Spring, Django, Node.JS, ASP.NET and so on. I then started a Machine Learning course by Andrew Ng, and that course changed my future plans. Currently, I’m focused on Machine Learning and Neural Networks. I also play with blockchain-related technologies in my free time.
Last semester, I took a Parallel Computing course at my university with our tutor Dr. Mete Akdoğan. We learned about memory architectures, libraries such as OpenMP and MPI, Big Data concepts and so on. We also used a cluster for the first time: the cluster of the National High Performance Computing Center of Turkey (UHeM), which is the PRACE partner in Turkey. Our tutor also told us about PRACE and the Summer of HPC programme. I then started to check the available projects and saw some that were related to machine learning. I was really excited when I applied and while waiting for the decision. SoHPC is a really good opportunity to make friends from other cultures and meet future scientists, to visit at least 2 countries (for most of us), and to work on great projects which are designed to teach us many things.
I was pleasantly surprised by the programme of the training week in Edinburgh – it was really well prepared and everything was on the right track. There were exercises for everybody, at all programming levels. Even though I had followed a similar course at my university, I learned new things in Edinburgh. But my favorite part was walking up to Arthur’s Seat.
On Arthur’s Seat with Nazmiye and Zheqi
Last but not least, I’m very glad to be in this programme. Currently, I’m working on implementing a simple machine learning algorithm, K-Means, with the Global Address Space Programming Interface (GASPI). Last week I implemented it sequentially and then parallelised it with MPI, and currently I’m figuring out the GASPI implementation, GPI-2. So far, I have enjoyed my project and learning new technologies.
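To give an idea of why K-Means parallelises so naturally, here is a minimal sketch (not my actual SoHPC code – the data and names are illustrative only): each iteration only needs per-cluster partial sums and counts, so in the MPI or GASPI version each rank processes its own slice of the points and the two reductions are combined across ranks with an allreduce.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One K-Means iteration. In the parallel version each rank holds a
    slice of `points`; the partial sums/counts below are then combined
    across ranks (e.g. with MPI_Allreduce) before the division."""
    # Assignment step: find the nearest centroid for every point.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k = centroids.shape[0]
    # Update step: per-cluster partial sums and counts (local work).
    sums = np.zeros_like(centroids)
    counts = np.zeros(k)
    for c in range(k):
        mask = labels == c
        sums[c] = points[mask].sum(axis=0)
        counts[c] = mask.sum()
    # <-- parallel version: allreduce `sums` and `counts` here.
    new_centroids = sums / np.maximum(counts, 1)[:, None]
    return new_centroids, labels

# Tiny demo: two well-separated blobs converge in a few iterations.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
cents = np.array([[0.0, 0.0], [5.0, 5.0]])
for _ in range(5):
    cents, labels = kmeans_step(pts, cents)
```

Because only the small `sums` and `counts` arrays cross the network each iteration, the communication cost is independent of how many points each rank holds – which is what makes the algorithm a nice first exercise for MPI and GPI-2 alike.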
I recommend Summer of HPC to everyone who wants to learn more about HPC and meet awesome people!
My name is Wojtek, I’m 25 years old, and I’m thrilled to announce that I’m spending this summer in Barcelona, Spain. Besides leaving home for 2 months for one of the most fascinating places in the world, I’ll also be working on the project “Adaptive multi-partitioning for the parallel solution of PDEs” under the supervision of researchers from the Barcelona Supercomputing Center.
But first things first, let’s put all the excitement on the back burner (for a while). Here’s a short description of myself and how I ended up in the heart of Catalonia.
Photo taken during the training week in Edinburgh. Trying to make the best use of my free time and find the perfect view of the city.
I am originally from Lublin, a small city in the South-East of Poland. Soon after finishing high school I moved to Warsaw, the capital city, to pursue a Bachelor’s degree in Mechanical Engineering. For my graduate degree, I sought a bigger challenge: I decided to move to Copenhagen, Denmark and start Master’s studies in Applied Mechanics at the Technical University of Denmark.
Ever since my undergraduate studies I have been interested in simulating engineering problems (fluid flow, plasticity, etc.) and numerical analysis. But it was only during my Master’s studies that I started to explore more and more of what’s inside the engine of computer simulation software, and slowly exchanged engineering classes (bye bye, commercial programs) for more specific, mathematical and programming-oriented courses (hello, Unix shell).
This is where I stand right now: finishing my Master’s thesis, in which I try to design an efficient iterative strategy to speed up nonlinear wave simulations. With the (hopefully) successful completion of the thesis, I hope to make my own small contribution to the analysis of ocean wave propagation and wave-body interactions.
My interest in Computational Fluid Dynamics was the primary reason I signed up for the PRACE Summer of HPC programme. I believe that my new experiences in parallel programming can make my contribution a little bit bigger, as it allows for even faster simulations. On my way here I only discovered more reasons to join the event. Visiting Scotland for a week with a bunch of very smart individuals and then spending the summer in the most popular Spanish city while working on an interesting project doesn’t sound bad at all, does it?
Another photo from the Training Week. Finally free after a whole day of programming.
When I am not pondering a math problem, I either play or watch football, with breaks to look after my budding cactus plantation (consisting of 2 cacti so far) and play board games with my friends in an unusually competitive manner. However, all of this will now be set aside for 2 months in favor of exploring the Iberian Peninsula in greater depth.
In the gardens of the Jaume I University in Castello de la Plana, Spain
Hello fellow reader,
Let me introduce myself – which is the most boring part of this post, but I promise it gets interesting after that. My name is Klajdi Bodurri and I am 23 years old. I was born in Albania and grew up in Greece. Right now, I am an undergraduate student at the University of Thessaly in Greece. I have been studying Electrical and Computer Engineering for 4 years, and next year I hope to get my diploma (fingers crossed). My field of interest is Computer Science, specifically developing algorithms for Distributed Systems. Currently, I am in Castello de la Plana, Spain for the PRACE Summer of HPC programme. I am enjoying the peaceful life here and working on a very interesting project (I am gonna tell you more about it later, keep reading).
But why did I apply for Summer of HPC? I think the answer is curiosity. Curiosity, this tiny little thing that has built the whole scientific community. For me, it started in the summer of 2015, when I participated for the first time in research and development at my university, and my first impression was “Cool, I should do this every summer”. The next summer, in 2016, my friends and I started building a prototype of a smart auto-locking system for bikes, just for fun. In the summer of 2017, I did my first internship and saw how a start-up company works and what it is like to be part of a huge team. So, this summer, guess what? Summer of HPC!
To be honest though, satisfying my curiosity wasn’t the only reason for applying to Summer of HPC. I was really lucky, because last semester I was an exchange student in Finland. I met people from all around the world, I learned how to live abroad, how to interact with people from cultures totally different from my own, and last but most importantly, I learned how to survive in -30 °C. So, while I was in Finland, I was looking for ways to extend my time living abroad. At that moment, a friend of mine texted me about the PRACE Summer of HPC programme, and I read that it is a programme that offers summer placements at HPC centres across Europe. I said to myself, “Time to learn more about High Performance Computing and meet new people. Let’s do it!”.
And here I am now in Castello de la Plana, Spain, working on the project “Dynamic management of resources simulator”, which will simulate the execution of a workload composed of malleable and non-malleable jobs on a parallel system. Other than that, I am enjoying the Spanish life. Time to take my siesta!
I am finally participating in the PRACE Summer of HPC, a programme I very much wanted to be part of! I followed a parallel computing course during the last semester at university, and the teacher of this course shared the application details of the programme with us. But I only saw them on the last application day, with just a few hours left. I was very excited and rushed to apply to the programme. After submitting the application form, I knew there were some fields I hadn’t filled in. The next day, I received an email from Leon. He wrote: “You didn’t provide your professor’s email, name, surname and institution, so that we can ask him for your recommendation letter. Please send this right away!”. This email made me believe that I could still be accepted to the programme. I quickly sent Leon all the required information.
Three students from my university had applied to this programme, and I was impatiently waiting for the results of my application. The initial communication informed me that I was not accepted to the programme, which disappointed me. However, one of my friends was accepted, and of course I was very happy for him.
About a month later, I received another email from Leon. The selected candidate for Project 1810 could not participate, and I had been identified as the alternative candidate for Bologna, CINECA, Italy. When I saw this email, I was over the moon. The project was also a great fit for me. Everything was so perfect. After this news, I guess I could not sleep for a week :). In little time, I completed the visa process.
The training week of the programme was in Edinburgh. We were trained in parallel programming every day for 5 days. We tried to solve big problems using parallel programming on supercomputers, accessed remotely. I had experience with connecting to remote computers before. We learned the advantages, disadvantages and differences between MPI and OpenMP. MPI, which uses a distributed memory model, was the best solution for us, so we worked through nearly all of our exercises and functions using MPI.
At the end of each training day, we all tried to spend time together so we could get better acquainted. We went on an Edinburgh sightseeing tour and climbed Arthur’s Seat. Of course, we had breakfast, lunch and dinner together. Everyone is wonderful and creative, and I am very glad to have met them. I believe they will all be successful in their projects.
This is my first post, written at the start of my project. I hope that I will be successful in it.
See you in another of my future posts about my project in Web Visualization and Data Analysis of Energy Load of a HPC system.
I am an Electronics and Telecommunications (i.e. Networking) undergraduate at AGH University of Science and Technology in Cracow, Poland. My current research interest revolves around machine learning, Big Data architectures and spintronics. Although my background is in engineering, I hope for a future research career in an exotic mixture of artificial intelligence and physics (quantum machine learning anyone?).
As much as I enjoy myself, there is always a certain discomfort in sharing a personal note. Therefore, to indirectly evade this problem (avoiding problems and responsibilities has always worked for me), I have put up a quick doodle to highlight the key aspects of my inexplicably complex personality.
So far, I’ve had the chance to participate in a training week as part of the PRACE Summer of HPC programme – over 20 potential future researchers who share a passion for and belief in science (so lofty) met in Edinburgh to learn the craft of HPC parallel programming.
I think this is a perfect moment and place to list my expectations about my participation in the PRACE Summer of HPC programme. Defining clear goals and hopes helps in focusing on completing tasks effectively. I plan to make the most of the next two months, so here we go:
chance to work on the newest HPC technologies and participate in the trending research topics
meet interesting and inspiring people
have an opportunity to work with top-notch researchers
experience a vibrant city for 2 months, possibly do lots of sightseeing
In my last post I will attempt to compare the outcomes of my PRACE SoHPC experience against the list above, but for now I’m left with an itching curiosity and a positive attitude.
That’s all folks, see you in the next post!
*footnote was removed due to copyright infringement*
At the top of Kriváň. The most famous peak in Slovakia.
Hi there! My name is Filip Kuklis. I am a 25-year-old IT guy and I come from the Slovak “city of dreams” – Piestany. Piestany is also known as the “Little Amsterdam” of Slovakia because of its bikes. But for now, I live in the second largest city of Czechia, also called the “Silicon Valley” of central Europe.
I first learned about HPC during my Bachelor’s thesis, “Fast Reconstruction of Photoacoustic Images“, which involved accelerating Matlab code using C++/OpenMP as part of the k-Wave project. k-Wave focuses on medical applications of high intensity focused ultrasound. This was the first time I worked with supercomputers (using the IT4Innovations Anselm and Salomon clusters). This year, I finished my Master‘s degree in a branch of Bioinformatics which has a strong connection with supercomputing. The title of my Master’s thesis was “Acceleration of Axisymmetric Ultrasound Simulations“, which is also a C++/OpenMP implementation of Matlab code as part of the k-Wave project. This implementation was also carried out on the Anselm and Salomon supercomputers.
After summer, I am going to enroll in the Ph.D. programme, where I would like to continue working on the k-Wave project. My research is supposed to include many optimization techniques such as evolutionary algorithms, neural networks, deep learning etc. Furthermore, it should use extensive ultrasound models implemented on supercomputers using OpenMP, MPI, and CUDA.
In my opinion, the PRACE Summer of HPC programme is a great opportunity for me to gain a lot of experience. First of all, I would like to improve my HPC skills and learn new HPC approaches. I would also like to improve my language skills, as this will be my first experience working abroad. Because I am very interested in HPC, I am really looking forward to meeting new people who are specialists in HPC and who can teach and motivate me.
I am also very interested in quantum physics so I am happy that I was selected for the Summer of HPC project about quantum computing at SURFsara Netherlands.
When I am not doing stuff on the computer, I really like hiking – especially in the High Tatras (a smaller version of the Alps in Slovakia). Of course, in Brno I like to enjoy a good Czech beer. I also love mountain biking on my beautiful bike “Shrek” (actually it is a TREK). In the winter, I like to ski in the Slovak mountains with my best ski instructor, Daniela. All year round, I enjoy walking around the city and the parks of Piestany. I enjoy all these activities most with my partner Daniela and all of my friends.
Sitting in a lab at Jaume I University in Castellón de la Plana, Spain, working on some C++ code, I thought: why not break the ice and write my first blog post to introduce myself? I am Sukhminder Singh, an MSc student in Computational Engineering at Friedrich-Alexander-University Erlangen, Germany. Before beginning my Master’s studies in Germany, I studied Mechanical Engineering and worked for an automotive company in India. Since my childhood, I have been very passionate about simulation technologies which can help us simulate the real world on computers – for instance, crashing a car in the digital world and predicting what would happen to the occupants in the real world. So, after working for 3 years in manufacturing, I decided to change my career and study Computational Mechanics and High Performance Computing (HPC). Now, it’s been two years of my Master’s studies, and I have already started my two-month journey with the PRACE Summer of HPC 2018 programme.
Last week, I attended the training week at the University of Edinburgh, organised by PRACE at EPCC. The training helped me refresh my basics of parallel programming with MPI. I also got a chance to run code on ARCHER (Advanced Research Computing High End Resource), the UK’s national high-end supercomputing system. Although the training was quite intensive, I got some time in the evenings to see the beautiful city of Edinburgh. The city reminded me of the epic Harry Potter movies I used to watch when I was a kid – it is truly a dreamy place for Hogwarts fans, myself included. The architecture and the design of the city caught my attention. In the middle of the week, I went on a hiking trip to Arthur’s Seat. The views from the peak were phenomenal: 360° views of the city, including the North Sea. With my earphones in, listening to Classic FM (as far as I remember, it was 101.7 FM) and watching the beautiful sunset – I would say that was one of the most wonderful and soulful experiences I have ever had.
Photograph taken at Arthur’s Seat
I should mention one of my hobbies, which I fell in love with recently: I learned to dance basic salsa this semester, and I hope to perfect it next semester. I will try to find some clubs in Castellón to practice. Sometimes, I also like to do nothing at all, which I think is important too.
Now, I am going to work on my project for the next 7 weeks. The goal is to make LAMMPS, a classical molecular dynamics code, malleable – in other words, to enable LAMMPS to be resized, in terms of the number of processes, during its execution. In the coming days, I will write more about my project and share my experiences in Spain with you. Until then, have a nice time!
Hello everyone. My name is George and I am a 24-year-old student from Greece, studying for my Master’s in Computational Physics at the Aristotle University of Thessaloniki.
During my Bachelor’s studies in Physics, I had a strong interest in all kinds of programming, so naturally it became a hobby of mine in my free time. Low-level programming like Assembly or GPU shader programming looked really exotic and appealing to me. Digging for information throughout the Internet, I made my first steps into the huge world of programming. Along the way, I met my supervisor, a computational solid state physicist, who guided me and gave me the opportunity to get involved in programming in a more serious way. Doing my BSc thesis on the visualization and manipulation of crystals, I built my first big program from scratch, which was the initial big step that kept me going and gave me a reason to make coding part of my studies. The thesis turned out to be a big success and was presented at the International Materials Researcher Society Conference in Mexico in 2018. A year after finishing my thesis and my Bachelor’s, we keep building and improving the code and hope to publicly release it one day.
Combining physics and coding was a “dream come true” for me, so I immediately jumped into my Master’s studies in Computational Physics. There, I learned various numerical algorithms applied to fields such as quantum mechanics, electromagnetics, solid state physics, data analysis etc., but more importantly, it was an opportunity for me to meet new people, work and interact with them, and make new friends. I learned that working on what you like is fun, but working with good friends and nice company is even better. One of those friends mentioned the PRACE Summer of HPC programme and insisted that it was a great opportunity for me. So I followed his advice and applied!
My experience in Edinburgh with the PRACE Summer of HPC programme was really great. In just one week, I met a lot of people from all around the world and made some friends. Hanging around with them during that week, for almost the whole day, every day, created some very joyful moments.
The Guesthouse is home to visiting students and researchers.
The Jülich Forschungszentrum lends bikes to visiting students!
Moving countries is always a bit of a hassle, even if you are only coming for a summer internship. Things can, and often will, go slightly wrong, but I don’t think it should stop anyone from signing up to new challenges! Here’s a couple of impressive blunders you can maybe avoid though:
1: Arriving at the airport, grabbing a train with very tight transfer margins, rushing to the wrong platform, and ending up at a station called Kuhbrücke (literally “Cow Bridge”). Bonus points are given for running with heavy luggage in 30⁰C heat, and accidentally smudging your face with dirt at some point, so that you can make a great first impression when meeting your colleague an hour late at the station.
2: Grabbing an ice cream when you go grocery shopping is actually a good idea. However, avoid eating it on a bench where an ant army can invade your grocery bag, providing a nice surprise (for you and your new flatmates!) when unpacking in the kitchen.
3: Big research institutes, like the Jülich Forschungszentrum (= research center), can easily get you confused and/or lost in the first few days. Don’t be surprised if the staircase you took just an hour before suddenly teleports you to the opposite side of the building. These are high-tech research facilities after all, who knows what kind of quantum tunneling experiments they’re conducting?
Slight challenges aside, my first week in Germany has gone really well! My supervisor, Dr. Stefan Krieg, encouraged me to do as much background reading on graphene as I like before we start tackling the computational problem. So, days at work have consisted of reading a textbook and papers on modeling quantum fields on a lattice. When I don’t understand something, I can just knock on someone’s door and chat about it; there’s lots of friendly help available. Hopefully in my next post I’ll have some news on my actual project!
I’ve also been getting to know other students working on their MSc’s, summer projects, and PhD’s at the Jülich Supercomputing Center, as well as my flatmates at the “Guesthouse”, an 11-story building housing students and visitors to the Forschungszentrum. It’s actually very nice, even though my expectations weren’t that high after we were warned that the accommodation is “designed for a student budget”. I have a very large room on the 8th floor, with a window facing the fortress in the middle of the town. I share the kitchen with my three flatmates, but each of us has our own mini-fridge, just like you’d find in a hotel! And downstairs there’s a TV, where we’ve been gathering to watch the football World Cup, and a bookshelf I’ve already started to explore.
On my first day, I was lent a bike by the research institute, and I’ve really loved riding to work each day through the field and woods bordering the Forschungszentrum. I’m looking forward to exploring my surroundings by bike this weekend, and will try to remember to take a few pictures for the blog as well!
Hello world! I am Marius Neumann, a 22-year-old physics student from Germany. Currently I’m studying for my Master’s degree at Bielefeld University. Bielefeld is located close to the border between North Rhine-Westphalia and Lower Saxony (and, despite certain conspiracy theories, does indeed exist).
During my studies I haven’t managed to travel much yet, so I see my PRACE Summer of HPC opportunity as a chance to visit some parts of Europe – which in my case are Scotland and Cyprus, where I expect insight into a culture quite different from Germany’s, hot weather, little rain and much to learn.
While studying physics, I somehow got into the theoretical department by writing my Bachelor’s thesis on Lattice QCD, and since these calculations tend to be quite computationally expensive, I got into high performance computing as well. I like to describe QCD with a quote from Goethe’s Faust as “what holds the world together at its innermost”, since it is the theory of the force which binds quarks together to form protons and neutrons – and these make up the world as we know it.
I see High Performance Computing not only as a way to speed up my calculations, but also as the door to interesting new fields such as Big Data and Artificial Intelligence, which may – or probably will – have an impact on our future. Thus, I consider HPC beyond physics an important field to be trained in.
During the Summer of HPC programme I will spend two months in Nicosia at the Cyprus Institute, where I will try to enable lattice QCD simulations on GPUs by optimizing solver performance on Piz Daint, currently the sixth-fastest supercomputer in the world.
When I’m not debugging code, I enjoy playing chess, grabbing a beer or sometimes even hiking.
I am writing this post from Ljubljana, the capital of Slovenia, where my PRACE Summer of HPC project is taking place. The training week has been fabulous, but now it’s time to pack and start a new individual adventure!
Although we have had much fun together, and we have learnt so much from each other, the past week’s main task has been to significantly improve our HPC knowledge! Each of us has a different background, so from now on, I will talk about me.
I study Physics, I’m twenty years old, and next year I will finish my degree.
It’s me!
My interest in HPC comes from my interest in Computational Physics. I have always included different kinds of visualizations and simulations in my university projects, but as the problem size grows, things become more and more interesting! That’s the reason why I applied for the PRACE SoHPC programme – because combining computation and physics is pretty, pretty good!
Although I am quite interested in HPC, I had never built complex programs with MPI, nor had I ever had access to supercomputers like ARCHER. As for MPI, I needed a few extra hours in my room to properly understand the lectures, but the outcome has been great, and I am very happy with that.
As for ARCHER, now that I know how to use its resources, it would be great to have the chance to get an account and use them in the future (remember: by passing the “driving test” available here). It’s awesome that we have the opportunity to access these kinds of supercomputers.
To sum up, the lectures have been considerably useful and very well organized. The people from EPCC have really done a great job. On the other hand, I have been wondering whether I will someday see the people I met this week again. I really hope we can meet again sometime. We have shared some very fun moments (such as the evening on the beach, the one on the hill in Holyrood Park, or the last night in the common room, among many more), but if I had to choose one, I would definitely select this one:
Even though the ascent was not easy, we had a great time together on the hill in Holyrood Park. I will never forget that day.
On the other hand, I have already started my project at the University of Ljubljana. I am currently learning a huge amount of interesting things which will form the basis of my work, related to the organized storage of data via schemas and its visualization in 3D, and I am getting familiar with some powerful clusters around Europe. Also, my project has an amazing physics background, which is awesome for me. But we will leave that for the next post!
Thank you for reading!
Check out my video on Youtube which is related to this post:
Hi there, my name is Marc, I’m 24 years old, and I’m in the first year of my PhD at the Universitat de Barcelona, in Catalonia. I’ve done all my studies in Barcelona, starting with a Bachelor’s degree in Physics, then a Master’s in Particle Physics, and now a PhD in Nuclear Physics.
Don’t be fooled by the picture I have posted, it was the only recent one I had (I don’t like taking pictures, especially of myself). I’m not a beach person. I prefer the mountains. Maybe this is because I’m from a small town in the center of Catalonia, called “les Masies de Voltregà”, where one can go for a walk without needing to take the car. Or maybe because I get sunburned very easily. One of the two (or both).
My research area, as I’ve said before, is nuclear physics. In later posts I’ll talk more about it, but for now it’s enough to say that I try to simulate the most fundamental particles of matter (quarks and gluons) and see if what we get matches what we have in our “real” world. In this way we can study systems that are very difficult to test experimentally. For that, we need powerful computers to do all the calculations, and that’s why I applied for the Summer of HPC project at CaSToRC, The Cyprus Institute, in Nicosia, where they research exactly the same field of particle physics that I am studying. These two months are going to help me get familiar with supercomputers and with writing parallel code, so that in the future I can use MareNostrum – the Spanish national supercomputer hosted at the Barcelona Supercomputing Center (BSC). Furthermore, being able to work with one of the leading groups in my research area is going to be very challenging, but at the same time very enriching.
When I’m not doing physics, I like to listen to music, read or go hiking, and now I’m traveling quite a lot for my PhD (as you can see from my picture), killing two birds with one stone: I learn about areas of physics I’ve never heard of before and meet new people at the same time. To end, I’ll leave you with a cool time-lapse I took during one of my flights.
Airplane flying through clouds
I hope you like it and see you in Cyprus! (and remember to bring sunscreen, lots of sunscreen!!)
This is a great experience, I am very honoured to participate in the PRACE Summer of HPC 2018 programme.
Not only do I have access to knowledge of high-performance parallel computing, but I have also met people and made friends from all over the world. Through various events in a brief training week – composed of study and life lessons – we really got to know each other.
My name is Zheqi Yu, I am from China and I am studying in the UK. Currently, I am completing my PhD in electrical engineering at the University of Wolverhampton. Electronic engineering is an area I dabbled in as early as the beginning of my college years. To enhance my practical skills, I joined the electronics laboratory of the University of Wolverhampton to conduct research on hardware development – including software development for embedded systems. Exposed to the dynamic environment which characterises the University of Wolverhampton, I have gained an in-depth understanding of computer science, software tools and research methods in computing. This eventually gave me the opportunity to pursue PhD studies in my intended research field. During three years of PhD research, I have developed an embedded system for pedestrian detection; the algorithm and the integrated hardware and software design have been implemented on Xilinx ZYNQ hardware. Moreover, I improved and extended this project into a demo that was recognised as a European finalist at the Xilinx OpenHW 2016 competition.
As for my future academic and career objectives, I wish to achieve something at the frontier of electronic engineering research and fulfil my research aims. After that, I may continue with post-doctoral research, in the hope of keeping up with technological innovation and becoming a pioneer in the field of electronic engineering. In the meantime, I have already launched a road show for my projects – to gain venture investment, with a sincere wish of contributing to society.
I am very happy to participate in the “Automatic Frequency Scaling for Embedded Co-processor Acceleration” Summer of HPC project. For the next two months, I will be studying at the Barcelona Supercomputing Center (BSC). This project provides different experimental methods for my PhD research, allowing me to study hardware energy optimisation more deeply. Finally, I am ready to face any challenges posed by my project at BSC this summer. With the support of the academic foundation built up during my current research, I am optimistic that I will adapt smoothly to a new way of studying in a different culture.
“Hey, thought this might be of interest…” A one line email with a link from my supervisor this spring. The website was of course, summerofhpc.prace-ri.eu, and I did in fact, find it of interest. The opportunity to work in a European research institute, learning more about how we can use powerful computers to model the world around us? I could not wait for the application period to start!
Therefore, when in April I was invited by PRACE to work on graphene modeling at the Jülich Supercomputing Centre (JSC) in Germany over the summer, I was extremely excited! As a University of Manchester student, I already had the chance to hear about graphene from the very best working in the field of nanomaterials – including Nobel laureate Sir Andre Geim himself. Now I would have the chance to explore how its electronic properties are modeled. On the other hand, I suddenly grew nervous. What if all the other participants were computer geniuses, talking gibberish Linux jargon, while I only have a bit of experience with C++ programming?
The Summer of HPC 2018 students.
Well, on Sunday the 1st of July 2018, after arriving in beautiful Edinburgh, I finally met the other PRACE Summer of HPC students. Everyone turned out to be very friendly and actually, there were plenty of other non-computer scientists in our group – physicists, mathematicians, engineers… The first night was reserved for getting to know each other in the traditional British way, i.e. in the pub. Sure, we talked a bit about our projects, but mostly it was like any other night out with a group of students, talking about our home countries and universities, struggling to choose from the long list of burgers…
The first day of training seemed to go well for everyone. We learned about Edinburgh’s ARCHER supercomputer, a system of 4920 interconnected compute nodes which can co-operate to solve large problems. Although I’d never even logged on to a computer using remote access, with the training team’s clear instructions and hands-on exercises I was soon running my first program on ARCHER. I learned a lot, and after a long day involving lots of new terminology (did you know a program can be “embarrassingly parallel”?) and screen-time, I even had time for a jog around Arthur’s Seat, the prominent and beautiful hill just next to our accommodation.
Our second training day was cut short, as we left for a bus tour around the town. Despite the traffic, we had the chance to see Edinburgh castle, and maybe even a glimpse of some royal visitors in the distance (or at least a very large hat). We finished off the day at Illegal Jack’s Mexican restaurant. I have to say, vegetarian haggis actually makes for a great burrito topping!
We still have a couple of days left in Edinburgh, and then each of us will fly off to get started on our research projects. It’s great to know that we will be keeping in touch with each other though, both to hear about everyone’s interesting projects, and of course to maintain and strengthen the friendships already forming. I for one am very much looking forward to this Summer of HPC!
The 23 projects for the Summer of HPC 2018 will start with the selected students listed at the application page. The programme will start with the training week at EPCC in Edinburgh and then continue directly at 11 PRACE hosting sites across Europe.
In an age marked by data-driven knowledge, visualisation plays a major role for exploring and understanding datasets. Visualisations have an amazing ability to condense, analyse, and communicate data as if telling a story with numbers. In contrast to text-based means, the interpretation of visual information happens immediately in a pre-attentive manner. It is worth mentioning that the usefulness of data visualisation was introduced early in John Tukey’s landmark textbook Exploratory Data Analysis (Tukey, 1977).
In this post, I present an interactive visualisation that can help explore how the world’s centre for supercomputing has been changing since 2005. To put it in more detailed words, the visualisation answers which countries dominated the possession of supercomputers from 2005 to 2017. The visualisation can be accessed directly from the URL below:
The rest of the post gives an overview of the visualisation, how it was designed, and my personal reflections on the visualisation outcome.
2. Visualisation Pipeline
The visualisation is delivered as a web-based application and was produced over the stages sketched in Figure 1. First, the data was collected from the Top500.org lists for the years 2005 to 2017. The data was scraped using a Python script that utilised the urllib and BeautifulSoup modules, and included information on the rankings and locations of the Top500 supercomputers in each year. The location info (i.e. country) was used to obtain latitude and longitude coordinates via the Google Maps API.
Subsequently, the data was transformed into the JSON format, again using Python. The JSON-structured data defines the markers to be plotted on the map; such JSON output is described as “Data Layers” by the Google Maps API. The map visualisation is rendered using the Google Maps API along with the JSON data layers. Finally, the visualisation is integrated within a simple web application that also provides interactivity features. All the code, along with the scraped data, is accessible from my GitHub below: https://github.com/Mahmoud-Elbattah/Top500_Viz_2005-2017
Figure 1: Overview of visualisation Pipeline.
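As a sketch of the JSON stage of this pipeline, the snippet below turns a per-country tally into the kind of GeoJSON feature collection that the Google Maps API accepts as a data layer. The country coordinates and counts here are illustrative stand-ins, not the scraped Top500 data.

```python
import json

# Hypothetical sample for one year: country -> (lat, lng, number of Top500 systems).
SAMPLE_2014 = {
    "United States": (37.1, -95.7, 233),
    "China": (35.9, 104.2, 76),
    "Japan": (36.2, 138.3, 30),
}

def to_data_layer(counts, total=500):
    """Build a GeoJSON FeatureCollection usable as a Google Maps data layer.

    Each country becomes a Point feature whose 'share' property holds the
    percentage of the Top500 list it possesses (this drives the circle radius).
    """
    features = []
    for country, (lat, lng, n) in counts.items():
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": {
                "country": country,
                "systems": n,
                "share": round(100.0 * n / total, 1),
            },
        })
    return {"type": "FeatureCollection", "features": features}

layer = to_data_layer(SAMPLE_2014)
print(json.dumps(layer, indent=2)[:120])
```

Serving the resulting JSON as a file and loading it with the Maps API's `map.data.loadGeoJson(...)` is then enough to draw the markers.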
3. Visual Design
The visualisation is provided on top of Google Maps. Map markers are drawn on the map as circles (Figure 2). Specifically, every circle is placed on a country where at least one supercomputer was included in the Top500 list. The circle radius represents the percentage of supercomputers possessed with respect to the Top500 list in a specific year. For example, Figure 2 visualises the Top500 list for the year 2014. At a glance, it can be seen that the USA was the largest incubator of supercomputers in that year.
Figure 2: Visual design.
4. Interactivity
The aim was to provide a flexible way to portray how the world’s centre for supercomputing changed over the years 2005–2017. To this end, a jQuery-based slider serves as a draggable timeline: the visualisation is reloaded automatically as the user slides the years forward or backward. In addition, the map markers show more info (e.g. the number of supercomputers) as the mouse cursor moves over them.
5. Reflections on Visualisation
It is quite interesting how the picture has changed over the years. In 2005, China had only 19 supercomputers in the Top500 list, compared to 277 owned by the USA. Today, China has 160 supercomputers, which puts it on an equal footing with the USA. This illustrates how the world’s centre for supercomputing continues to shift eastwards.
The visualisation also shows what can be described as the rise and fall of some countries in the Top500 list. Two illustrative examples are Israel and Poland. On the one hand, Israel had 8 supercomputers in 2005; that figure fluctuated until 2015, and since 2016 Israel has had no supercomputers in the Top500 at all. On the other hand, Poland did not appear in the Top500 before 2008; since then, the number of supercomputers in Poland has generally increased, reaching 6 in 2017. Those are just examples, and I am looking forward to hearing further interesting observations from the SoHPC community.
References
Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
One of the many candle-lit circles at la Rambla on the 20th paying tribute to the victims.
When I last wrote for this blog we were having a good time at “La Festa” – mild injuries notwithstanding.
This was, as we all know by now, rudely interrupted by the terrible event which took place on La Rambla on the 17th of August. Our thoughts, as the PRACE Summer of HPC team’s Barcelona division, go out to the victims of this senseless act and to the great city, to which I here venture to pay tribute by describing what is always glossed over – day-to-day life in the aftermath of the attack.
The life before
Day-to-day life in Barcelona is perhaps not that different from any other major city in Europe. Doubly true for scientists such as Aleksander and me, since science is our daily bread, seasoned with culture and garnished with adventure here and there. This life is certainly a major break from the quietness and cleanliness of the small town of Tübingen, which I call my home!
The normal size of the crowd at the fountain show in front of the MNAC.
On the whole, one goes to work, enjoys the time after working hours and perhaps complains about the heat. Since the sun is relentless during the day, life in this great city only truly starts after around 17:30, when being outside becomes bearable.
With thousands of tourists being off-loaded by cruise liners periodically and hundreds flying in, the city is mostly geared towards them – at least the city centre. Especially the old city, Ciutat Vella, is almost entirely taken over by tourists now. You can’t go 200 m without hitting a restaurant, ice cream parlour or a coffee shop.
Street performers and pedlars abound, too, in the tourist spots. Small LED launchers are, apparently, the hot item of this summer during the evening hours, closely followed by beer and water.
By the way – our favorite ice cream spot was just in front of Sta. Maria del Pi, a stone’s throw from the Liceu metro station on line 3.
Beware! English-language coverage drops off rapidly with distance from the main tourist spots, and even there it is not exactly perfect. So some basic Spanish or, even better, Catalan is advisable when coming here. You can survive without, but it was quite a challenge for me to piece together the meaning of sentences from my patchy knowledge of French and Italian, with English as glue.
T + 0 days
At 2300 on the 17th people are still milling around in Gracia – fewer than normal, though.
On that day, I decided, once again, to walk home from the BSC instead of taking the metro. Thus I was lucky to avoid being stuck in the metro near the scene, seeing as our everyday transport – line 3 – passes straight under the fated boulevard. It was not until I was home at the Residencia Universitaria de Lesseps that I noticed the warning about the attack, sent to me by Aleksander.
Severe as the incident was, its influence was not immediately perceptible. That is the one positive thing one can say about this type of attack: its potential for causing mass panic in a city as large as Barcelona is rather low. Just 3 km or so from the scene, nothing seemed out of the ordinary.
With the festivities canceled for the day, Gracia was quiet – with a few people still admiring the decorations.
That evening, as is my habit, I went to explore another part of Barcelona by night. I abandoned my original plan of visiting the Cathedral in the Ciutat Vella, wandering instead through Gracia in the general direction of Pl. de les Glories Catalanes. The festivities of the Festa Major de Gracia were suspended that day and the following Friday. The stream of people at midnight was smaller than it normally would have been, but the streets were far from deserted.
T + 1 day
The morning after came around: Friday – the last day of work for the week. Leaving for the BSC at around half past eight, as was my custom, I found the metro a bit less crowded than normal. It halted at neither Catalunya nor Liceu – the terminal stop of the van the day before.
As the clock struck twelve all were suspended in a minute of silence for the victims of the attack.
By the time I was returning home at 1700 both stops were being served again and the metro was as full as ever, though definitely more solemn.
It is a tribute to the local people, that life returned to normal as fast as it did without any major “knee-jerk” reactions. I truly admire the people here for their measured response! Certainly, there was more police deployed throughout the city during our final week in Barcelona, but except for that, life was returning to normal by the evening of the day after.
T + 2 days
The Saturday came around and, after having done my shopping for the week,
The Font Magica de Montjuic was far from deserted on the third day of mourning. Even though no show was to be held.
I decided to visit Pl. Espanya again. The sky promised great opportunities for images. Late to leave, I had missed the sunset and was left to make do with dusk.
The steps before the Museu Nacional d’Art de Catalunya (MNAC) were full, though all official performances were suspended during the three days of mourning. There it was immediately obvious that a terrible incident had taken place, since the crowd was quite a bit smaller than normal.
No fountain show was to be expected, but apparently this had not reached the ears of many tourists, since a lot of those present clearly expected to be entertained. I pity the local police, who had to answer the same question over and over again.
There is a silver lining in that the group’s original plan literally blew up in their faces and the attack carried out afterward was improvised. Had the attacker shown up just half an hour later, or even a day later, the number of victims might well have been much higher, since – due to the weather – the tourist spots do not fill up before 1700.
T > 4 days
Aleksander trying to revive his phone for pictures at the display of Fraternitat de Dalt during the final day of La Festa. Almost midnight and still a lot of visitors!
With the end of the official mourning period the city life definitely returned to normal. Entertainment and vigor were back on the menu inviting any- and everybody to participate in the very lifestyle the terrorists were attempting to disrupt.
On the final Saturday in Barcelona, I decided to scout for some souvenirs and was promptly swept up in the march against terror. That is where I learned the few words of Catalan – the local language of Catalonia – which I shall never forget:
No tinc por – I am not afraid
True to this motto we have continued to pursue the completion of our projects in the little time that we still had in this great city. Nothing shall change the academics’ lifestyle – semester break is when you get work done!
As the PRACE Summer of HPC team’s Barcelona division we stand with this city and declare
This summer I took my first steps into the field of HPC. My participation in the SoHPC programme sparked my curiosity about this field, and I have become eager to learn more about the world’s most powerful supercomputers. The Top500 list has been an interesting source of information for that. I enjoy going through the list and exploring the specifications of those extraordinary computers.
One idea that came across my mind was to apply Machine Learning (ML) to the Top500 list of supercomputers. With ML, we can learn and extract knowledge from data. I was particularly interested in applying clustering to investigate potential structures underlying the data. In other words, to check if the Top500 supercomputers can be categorised into groups (i.e. clusters) based on a mere data-driven perspective.
In this post, I present how I utilised ML clustering to achieve that goal. I believe that the discovered clusters may help explore the Top500 list in a different way. The rest of the post elaborates on the development of the clustering model and the cluster analysis conducted.
2. Questions of Interest
Truth be told, this work was started just to satisfy my curiosity. However, one of the qualities I have learned from doing research is the importance of having questions upfront. Therefore, I always try to formulate what I endeavour to achieve in terms of well-defined questions. With that in mind, below are the questions addressed by this piece of work:
Is there a tendency of cluster formation underlying the Top500 list of supercomputers based on specific similarity measures (e.g. number of cores, memory capacity, Rmax, etc.)?
If yes, how do such clusters vary with respect to computing capabilities (e.g. Rmax, Rpeak), or geographically (e.g. region, or country)?
3. Data Source
First of all, the data was collected from the Top500.org list as per the rankings of June 2017. The data was scraped using Python: web scraping is simplified there by modules such as urllib and BeautifulSoup. The scraped dataset included 500 rows, where each row represented one data sample for a particular supercomputer. The dataset was stored as a CSV file. The Python script and dataset are accessible from my GitHub below: https://github.com/Mahmoud-Elbattah/Data_Scraping_Top500.org
4. Clustering Approach
As described by Jain (2010), clustering algorithms fall into two main categories: i) hierarchical algorithms, and ii) partitional algorithms. Hierarchical algorithms build a hierarchy of clusters representing a nested grouping of objects, with the similarity level changing the grouping scope; clusters can be computed in an agglomerative (bottom-up) or a divisive (top-down) fashion. Partitional algorithms, on the other hand, decompose the data into a set of disjoint clusters (Sammut and Webb, 2011): the data is divided into K clusters such that i) each cluster contains at least one point, and ii) each point belongs to exactly one cluster.
In this piece of work, I used the partitional approach via the K-Means algorithm, one of the simplest and most widely used clustering algorithms. K-Means uses a simple iterative technique to group points in a dataset into clusters that contain similar characteristics. Initially, a number K is decided, which is the number of centroids (i.e. cluster centres). The algorithm then iteratively places data points into clusters by minimising the within-cluster sum of squares, as in the equation below (Jain, 2010). The algorithm eventually converges on a solution when one or more of these conditions is met: i) the cluster assignments no longer change, or ii) the specified number of iterations is completed.
J(C_k) = Σ_{x_i ∈ C_k} ‖x_i − μ_k‖²

Where μ_k is the mean of cluster C_k, and J(C_k) is the squared error between μ_k and the points in cluster C_k.
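To make the assignment/update iteration concrete, here is a minimal Lloyd's K-Means in plain Python. It is an illustrative sketch only – the actual experiments ran in Azure ML Studio – and it uses a naive first-K-points initialisation rather than the random seeding listed in Table 2.

```python
def kmeans(points, k, iters=100):
    """Minimal Lloyd's K-Means on 2-D points. Returns (centroids, labels)."""
    centroids = list(points[:k])  # naive init; real implementations sample randomly
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable -> the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

# Two well-separated groups of points: K-Means recovers them with K=2.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
cents, labs = kmeans(pts, k=2)
```

Each pass through the loop lowers (or keeps) the within-cluster sum of squares, which is exactly the objective J above.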
5. Feature Selection
The dataset initially contained a set of 13 variables (Table 1), which can be considered candidate features for training the clustering model. However, the K-Means algorithm works best with numeric features, for which a distance metric (e.g. Euclidean distance) can be used to measure the similarity between data points. Fortunately, the Top500 list includes a number of numeric attributes such as the number of cores, memory capacity, and Rmax/Rpeak. The remaining variables (e.g. Rank, Name, Country) were not considered.
The model was trained using the following features: i) Cores, ii) Rmax, iii) Rpeak, iv) Power. Though a numeric feature, memory capacity had to be excluded, since it contained a significant proportion (≈ 36%) of missing values.
Table 1: Variables explored as candidate features.
Variables
Rank
Name
City
Country
Region
Manufacturer
Segment
OS
Cores
Memory Capacity
Rmax
Rpeak
Power
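The screening just described – keep the numeric candidates, drop any feature with too many missing values – can be sketched as a small helper. The rows, field names, and 30% threshold below are illustrative stand-ins, not the real dataset.

```python
CANDIDATES = ["Cores", "Rmax", "Rpeak", "Power", "Memory"]

def usable_features(rows, candidates, max_missing=0.3):
    """Keep numeric candidate features whose share of missing values stays
    at or below max_missing (the post dropped Memory at ~36% missing)."""
    kept = []
    for feat in candidates:
        missing = sum(1 for r in rows if r.get(feat) is None) / len(rows)
        if missing <= max_missing:
            kept.append(feat)
    return kept

# Toy rows: Memory is missing in 2 of 5 rows (40%) -> excluded.
rows = [
    {"Cores": 10, "Rmax": 1.0, "Rpeak": 1.2, "Power": 3.0, "Memory": 64},
    {"Cores": 12, "Rmax": 1.1, "Rpeak": 1.3, "Power": 3.1, "Memory": None},
    {"Cores": 14, "Rmax": 1.2, "Rpeak": 1.4, "Power": 3.2, "Memory": 128},
    {"Cores": 16, "Rmax": 1.3, "Rpeak": 1.5, "Power": 3.3, "Memory": None},
    {"Cores": 18, "Rmax": 1.4, "Rpeak": 1.6, "Power": 3.4, "Memory": 256},
]
features = usable_features(rows, CANDIDATES)
```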
6. Pre-processing: Feature Scaling
Feature scaling is a central pre-processing step in ML when the ranges of feature values vary widely. Accordingly, the features of the dataset were rescaled to constrain their values to a standard range. The min-max normalisation method was used: every feature was linearly rescaled to the [0, 1] interval using the formula below:

x′ = (x − min(x)) / (max(x) − min(x))
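A minimal implementation of this min-max rescaling (the guard for a constant feature is my own addition, not something discussed in the post):

```python
def min_max_scale(values):
    """Linearly rescale a feature to [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 40])  # smallest -> 0.0, largest -> 1.0
```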
7. Clustering Experiments
The unavoidable question when approaching a clustering task is: how many clusters (K) exist? The clustering model was therefore trained with K ranging from 2 to 7. The quality of the clusters was first examined via the within-cluster sum of squares (WSS), plotted in Figure 1. This plot underlies the so-called “elbow method”: we choose the number of clusters at the point where adding another cluster no longer gives a much better model of the data. In our case, the quality of the clusters starts to level off at K=3 or 4. Accordingly, three or four clusters best separate the dataset into well-detached cohorts. Table 2 presents the parameters used in the clustering experiments.
Figure 1: Plotting the sum of squared distances within clusters.
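The WSS quantity behind the elbow plot is simple to compute. The toy example below (hand-picked points, centroids, and assignments, not the Top500 data) shows how a good partition drives it down.

```python
def wss(points, centroids, labels):
    """Within-cluster sum of squares: the quantity the elbow plot tracks."""
    return sum(
        (p[0] - centroids[l][0]) ** 2 + (p[1] - centroids[l][1]) ** 2
        for p, l in zip(points, labels)
    )

# Two toy partitions of the same four points: the 2-cluster split has a
# far lower WSS than lumping everything into one cluster.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
one_cluster = wss(pts, [(5.0, 0.5)], [0, 0, 0, 0])
two_clusters = wss(pts, [(0.0, 0.5), (10.0, 0.5)], [0, 0, 1, 1])
```

Plotting such WSS values for K = 2…7 reproduces the elbow curve of Figure 1.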
Table 2: Parameters used within the K-Means algorithm.
Parameter
Value
Number of Clusters (K)
2–7
Centroid Initialisation
Random
Similarity Metric
Euclidean Distance
Number of Iterations
100
To provide a further visual explanation, the clusters were projected into two dimensions using Principal Component Analysis (PCA), as in Figure 2. Each sub-figure in Figure 2 represents the output of a single clustering experiment with a different number of clusters (K). Initially, with K=2, the output indicated a promising tendency towards clusters, with the data space clearly separated into two large clusters. Likewise, for K=3 and K=4 the clusters remain well-separated. However, the clusters start to lose coherence at K=5. Thus, K=4 was eventually chosen.
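Such a two-dimensional PCA projection can be sketched in a few lines of NumPy. This is illustrative only – the actual plots were produced with R and ggplot, and the data below is synthetic.

```python
import numpy as np

def project_2d(X):
    """Project rows of X onto their first two principal components,
    the same trick used to draw the cluster scatter plots."""
    Xc = X - X.mean(axis=0)                 # centre the features
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    top2 = vecs[:, ::-1][:, :2]             # two largest-variance directions
    return Xc @ top2

# Four strongly correlated features; almost all variance lies in one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(50, 1))
X = np.hstack([t, 2 * t, 3 * t, 0.5 * t]) + 0.01 * rng.normal(size=(50, 4))
Y = project_2d(X)  # 50 points in the plane, ready for a scatter plot
```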
The clustering experiments were conducted using Azure ML Studio, which provides a flexible, scalable cloud-based environment for ML and also supports Python and R scripting. The cluster visualisations were produced with R scripts using the ggplot package (Wickham, 2009).
K=2
K=3
K=4
K=5
Figure 2: Visualisation of clusters with K ranging from 2 to 5. The clusters are projected based on the principal components.
8. Exploring Clusters
Now let’s explore the clusters in an attempt to reveal interesting correlations or insights. First, Figure 3 shows the proportions of data points (i.e. supercomputers) within every cluster. It is obvious that there is a pronounced variation. For example, Cluster3 contains more than 60% of the Top500 list, while Cluster4 represents only about 2%.
Figure 3: The percentages of data points within clusters.
Figure 4 and Figure 5 plot the number of cores and Rmax against the four clusters, respectively. It can be seen that Cluster4 includes the most powerful supercomputers in this regard. It is also interesting to spot the far-flung outlier, which represents the supercomputer ranked #1, located at the National Supercomputing Center in Wuxi, China. This gap clearly shows how significantly that supercomputer outclasses the rest of the dataset.
Figure 4: The variation of the number of cores variable within the four clusters of supercomputers.
Figure 5: The variation of the Rmax values within the four clusters of supercomputers.
Now, let’s learn more about Cluster4 that obviously represents the category of most powerful supercomputers. Figure 6 plots the segments (e.g. research, government, etc.) associated with Cluster4 supercomputers. The research-oriented segment clearly dominates that cluster.
Furthermore, Figure 7 shows how this cluster is distributed geographically. It is interesting that the Cluster4 supercomputers are located only in China, Japan, and the US. You may be wondering why the Piz Daint supercomputer (ranked #3) was not included in Cluster4. Going back to the Top500 list, I found that Piz Daint actually has fewer cores than some lower-ranked supercomputers: for instance, the Sequoia supercomputer (ranked #5) has more than 1.5M cores, compared to about 360K for Piz Daint, although Piz Daint has higher Rmax and Rpeak values. I am not an expert in how supercomputers are evaluated and ranked, but the noteworthy point is that this grouping was purely data-driven; incorporating an expert’s viewpoint would certainly add further nuance.
Figure 6: The variation of segments within Cluster4.
Figure 7: The geographic distribution of Cluster4 supercomputers.
9. Closing Thought
Data clustering is an effective method for exploring data without making any prior assumptions. I believe that the Top500 list can be viewed differently with the suggested clusters. Exploring the clusters themselves can raise further interesting questions. This post is just the kick-off for further interesting ML or visualisation work. For this reason, I have made the experiment accessible through the Azure ML studio via: https://gallery.cortanaintelligence.com/Experiment/Top500-Supercomputers-Clustering
References
Jain, A. K. (2010). Data Clustering: 50 Years beyond K-means. Pattern recognition letters, 31(8), 651-666.
Sammut, C., & Webb, G. I. (Eds.). (2011). Encyclopedia of Machine Learning. Springer Science & Business Media.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer Science & Business Media.
After two exciting months, the Summer of HPC comes to an end. However, the last few weeks have been quite intense. On the one hand, Paras and I aimed to visit as many interesting places as possible, but on the other hand we also had to finish our projects – which was also quite interesting as we could finally achieve some results although it indeed took some effort.
A few weeks ago, we decided to travel to Italy and explore Florence and Venice. Culinary and historically it was magnificent!
Paras and me at the Arno in Florence
We also had wonderful day trips with Leon and his wife close to the Italian border, visiting a nature reserve with waterfalls and lovely creeks. On another weekend, Prof. Dr. Janez Povh invited us to go hiking in the mountains, which we also enjoyed a lot!
Janez, Paras and me in the mountains
In my last entry I did not reveal anything about the performance-related results of my project, so I will focus on that now. Since the projected gradient method (PGM) has a computationally more expensive update rule than the fixed-point method (FPM), it is expected that the PGM has a longer execution time than the FPM. We investigated this with a little benchmark, varying the matrix dimension n and measuring the execution time needed.
Comparing the execution times of the FPM and the PGM for different dimensions n on a double-logarithmic scale
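A benchmark of this shape is easy to set up. The sketch below is purely illustrative: `fpm_step` and `pgm_step` are generic stand-ins that merely mimic a cheaper versus a more expensive update rule, not the project's actual solvers.

```python
import time
import numpy as np

def benchmark(update, n, steps=50):
    """Wall-clock a given iterative update rule on an n-dimensional problem."""
    rng = np.random.default_rng(0)
    A = rng.random((n, n))
    x = rng.random(n)
    t0 = time.perf_counter()
    for _ in range(steps):
        x = update(A, x)
    return time.perf_counter() - t0

def fpm_step(A, x):
    # Fixed-point style update: one matrix-vector product per step.
    y = A @ x
    return y / np.linalg.norm(y)

def pgm_step(A, x):
    # Projected-gradient style update: gradient plus projection costs more.
    g = A.T @ (A @ x) - x           # two matrix-vector products
    y = x - 0.01 * g
    return np.clip(y, 0.0, None)    # projection onto the nonnegative orthant

# Execution time per method for a few matrix dimensions n.
times = {n: (benchmark(fpm_step, n), benchmark(pgm_step, n)) for n in (100, 200)}
```

Plotting such timings for a range of n on log-log axes gives a curve of the kind shown above.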
If you are interested in the steps I took to achieve this performance, feel free to watch the following video!
And if you are interested in further details, I encourage you to visit the official Summer of HPC website and have a look at the final report, which will be uploaded in a few days.
At the end, I want to express my gratitude to PRACE, all the members of the organizational team of Summer of HPC, Prof. Dr. Janez Povh and Dr. Leon Kos! Thanks for giving me the opportunity to learn and thanks for a great summer.
Well, I’m writing this from back home in Norway after a great summer in Barcelona! This time, I’ve decided not to write so much hard scientific stuff. For that, I refer back to my two previous posts – one on the basics of the project and AI, and another specifically on Deep Q-Learning – and direct you to the final report of my project, which will soon be available on the Summer of HPC website under final reports 2017.
Overall, it’s been a great summer. You get to see a city in a completely different way when you stay there for two months, working, than when you’re just visiting. Moreover, SoHPC is a great way of getting experience in a field different from your own: working hard in a different area for just two months gives you a whole new skill set. In Edinburgh, I do physics and maths, whilst in Barcelona I used a lot of my maths knowledge of functional analysis to gain a deeper understanding of deep learning and other algorithms. I used my knowledge of statistical mechanics and probability theory to learn many new things and to understand Monte Carlo methods and their proofs. I was able to see first-hand the similarities between optimisation in algorithms and physics in general. Moreover, I shared an office with people working on Monte Carlo algorithms, the other people working on the crowd simulation came from computer graphics, and I often worked on a desktop in the same room as the programming models group!
Work
We got some nice preliminary results using Python, and have started the quest into transforming this into C++ and CUDA, i.e mixed CPU and GPU, heterogeneous clusters as we say. For results, I’ll just refer to my video below, and the final report which will soon be out and available through the Summer of HPC website under final reports 2017.
I realise I could have spoken a bit faster in the video – letting YouTube speed it up to 1.25× actually works reasonably well! The video’s subtitles aren’t perfect, but they work all right; they tend to write “research” when I say “tree search”.
Pictures from Barcelona
I’ll skip all the touristic stuff we all know so well from Barcelona and focus on the less common things. The art expo was a one off thing, the concert hall is Palau de la Musica Catalana, the picture from the train is close to Premia de Mar, and for the festival you can read Anton’s blog post Baptism of Fire.
Yep, that is really how close the R1 line train runs to the water the whole way.
Hi, this is going to be my last post. I am going to introduce to you the machine learning (ML) pipeline in my project 🙂
In short, ML is a set of approaches for making predictions from data using complex models and algorithms. The key idea is to train a system to learn from the data and make relevant predictions. Based on the nature of the data, we often need some understanding of it in order to select features and algorithms and to adjust the evaluation criteria that optimise prediction accuracy. Here, “algorithm” refers to the training method – for instance, whether the problem is one of classification or clustering – and “evaluation criteria” means how you define the threshold for making a prediction. The ML workflow in this project is illustrated below.
Figure 1. ML pipeline workflow: S1: Select data set A according to requirements. S2. Segment data as set A1 and further segment the training set as D1-D5. S3: Training and validation on segmented data set using TP1-TPn and identification of the best TP (highlighted in green), based on accuracy. S4: Training and validation of the whole data set using the best TP. S6. Iterate S2-4 when set A is segmented to A2-A5, known as cross validation. ST: training, V: validation, TP: training parameter, VAL: validation.
As shown in Figure 1, the ML pipeline in this project has a total of six steps. Each data set (e.g. set A) was selected based on some categories from the database (the outer loop) and went through the pipeline via cross-validation (the inner loop). Cross-validation divided each data set into 20% for validation and 80% for training; as one data set can be separated into five distinct validation sets, this training/validation process was iterated five times. The advantage of cross-validation is that all the data participates in both validation and training. For each iteration, the training set was segmented further into D1–D5, each containing an 80/20 training/validation split. The data was learned using training parameters (TPs) 1–n – in this case, the gamma and the cost, the parameters of a nonlinear support vector machine (SVM) with a Gaussian radial basis function kernel for classification. SVMs are supervised learning models used for classification and regression analysis, and kernel methods, named after kernel functions, let them handle problems in high-dimensional, implicit feature spaces. After the TP with the highest accuracy was identified, D1–D5 were merged back into the original 80% training set, the model was trained again using the best TP, and an overall validation followed. This method reduces the computational cost considerably.
Figure 2. Prediction accuracy against gamma (left) and cost (right). The values of the gamma and the cost were picked based on experimental experience.
The effect of gamma (left) and cost (right) on accuracy has been visualised using interactive diagrams, as shown in Figure 2. The outer loop corresponds to the data sets selected based on the features from the database; the inner loop represents the result from each iteration of cross-validation. Normally, a high gamma leads to a low-bias but high-variance model (prone to overfitting), but in this case the gamma had no influence on the accuracy, hence all lines overlap each other in Figure 2 (left). The cost, on the other hand, defines the soft/hard margin in the classification process, and here certain cost values performed better than the rest. By visualising accuracy against the training parameters, it is very easy to find the best parameters for the overall training set.
Do you get it now? 🙂
If you want to learn more about my work, please watch my presentation below:
Like all entropy-generating processes, the Summer of HPC must come to an end. However, you won’t get rid of me that easily, since there’s still one more blog post left to be done! This time I’ll tell you more about the HIP framework, some benchmark results of the code and how the Summer of HPC program looks in retrospect.
Figure 0: Me, when I notice a delicious-looking CUDA GPU kernel, ripe for translation to HIP.
HIP, or Heterogeneous-Compute Interface for Portability, enables writing general-purpose GPU code for both AMD's and Nvidia's compute architectures at once. This is beneficial for many reasons. Firstly, your code is not bound to one graphics card manufacturer, so you may freely select the one which suits your needs best. Secondly, your program can have many more users if they can run it on whatever GPU they have. And thirdly, managing two different code bases is time-consuming and tedious, and as the common saying in high-performance computing goes, your time is worth more than the CPU time.
HIP code is a layer of abstraction on top of CUDA or HCC code, which are the manufacturer-specific interfaces for writing GPGPU code. This design entails a few properties. The HIP compiler, hipcc, converts the source code into CUDA or HCC code at compilation time and compiles it with the respective manufacturer's compiler. As a result, the framework can mostly use only the features which both interfaces have in common, and thus it is not as powerful as either of them separately. It is, however, possible to incorporate some hardware-specific features into the code easily, though such code requires more careful management.
There's also a script included in the HIP framework which converts code in the opposite direction: from CUDA C/C++ to HIP C++. This is a very neat program, because lots of existing GPU code is written in CUDA, which is considered the current "lingua franca" of GPGPU. The script lets you make code much more portable with near-zero effort, at least in theory. I tried it out in my project, but with thin results. The HIP framework is currently under heavy development, so hopefully this is something that will improve in the future.
No fancy translation scripts are needed, though. The HIP framework's design and syntax are very natural to someone coming from CUDA, so if you're familiar with it, you might as well write the whole code yourself. If you're not familiar with GPGPU programming, you can simply follow the CUDA tutorials and translate the commands to HIP. It's that easy. To illustrate this, I created a comic (figure 1) of the two toughest guys of the Pulp Fiction movie, but instead of being hired killers, they're HPC experts.
Figure 1: A discrete time model simulating human communication (also known as: comic). Simulation in question concerns a hypothetical scenario, where characters Vincent Vega and Jules Winnfield from Quentin Tarantino’s motion picture Pulp Fiction had made different career choices.
In high-performance computing, timing is everything, so I did extensive benchmarking of my program. The results are shown in figure 2. The top graph shows a comparison of the runtimes of the program on Nvidia's Tesla K40m and AMD's Radeon R9 Fury graphics cards with single and double precision floating point numbers. The behaviour is quite expected: the single precision floating point calculations are faster than the double precision ones. There's a curious doubling of the runtime on the Tesla at around p = 32. This comes from the fact that 32 is the maximum number of threads running in Single-Instruction-Multiple-Data fashion on CUDA. Such a bunch of threads is called a warp, and its effects could be taken into account with different sorts of implementations.
The bottom graph shows a different approach to the problem, comparing the usual approach and an iterative approach on the Tesla card. In the usual approach, the operator matrices are stored in the global GPU memory and each thread reads them from there. In the iterative approach, the operator matrices are computed at runtime on each of the threads. This results in more computation, but less reading from memory. Surprisingly, the iterative approach using single precision floating point numbers was faster than the other approaches when the number of multipoles was large.
Figure 2: Runtime graphs for my GPU program. The computation was done for 4096 multipole expansions in parallel.
During this study programme, I met lots of awesome people from all around the world, researchers and students alike. Not only did I meet the people affiliated with the Summer of HPC, but also many scientists and Master's or Ph.D. students working at the Jülich Supercomputing Centre and, of course, the wonderful guest students of the JSC Guest Student Programme. My skill repertoire has also expanded tremendously with, for example, the version control software Git, the LaTeX-based graphics package TikZ, the parallel computing libraries MPI and OpenMP, creative writing and video directing. All in all, this summer has been an extremely rewarding experience: I made very good friends, got to travel around Europe and learned a lot. A totally unforgettable experience!
Figure 3: Horse-riding supercomputing barbarians from Jülich ransacking and plundering the hapless local town of Aachen. Picture taken by awesomely barbaric Julius.
After all this, you could ask me which brand I recommend between Nvidia and AMD, or which GPU I would prefer, that is, which GPU would be mine.
My GPU will be the one which says Bad Motherlover. That’s it, that’s my Bad Motherlover.
Two months of this summer have passed by very fast. It is the end of my Summer of HPC project and thus the end of my adventure in Scotland. But is it really the end of this project? I'm going back to Poland to continue my studies and job, but my mentor told me that someone at EPCC will take over the results of my work, develop it further and try to push it onto the production servers. I'll also try to contribute to the development myself in my spare time, because I hope that this visualisation tool will be completed and used by EPCC workers.
What has been done?
My first post explained the idea of the whole project. My second post was about the selection of the best technological stack. In this post, I'll describe my experience with the chosen framework – Angular. I'm not super satisfied with the outcome of my work, but I did manage to finish it.
Angular is known for its modularity. A web app developed in this framework is a root component composed of other components. The great thing is that the JavaScript community codes generic components which are later published in public repositories. I've used two customisable components from NPM – the Node Package Manager: ngx-charts and ng2-nouislider. I've also encountered the worse part of the JavaScript community – lack of documentation and sometimes a mess in the code. However, it led to my first accepted pull request on GitHub, so it turned out well!
Uglified JS code accidentally printed by one of my friends.
I combined these two components, customised them for the needs of this project and included my own components, hand-tailored to the specification. The slider component was a bit trickier because the API of the pure JavaScript slider wasn't carried over to the Angular component. Therefore, I had to give up on the pips below the slider – there was no way that my pull request with the necessary changes would be accepted before the end of the summer.
HPC service arrived at unit
Another great feature of Angular is services. A service is used when a common functionality needs to be provided to various modules. For example, we could have database functionality that is reused among different modules, and hence we could create a service providing that functionality. Services also allow for the use of dependency injection in your custom components.
Dependency injection is an important application design pattern. Angular has its own dependency injection framework, and you really can’t build an Angular application without it. It’s a coding pattern in which a class receives its dependencies from external sources rather than creating them itself.
Since we lacked a back-end server, I had to preprocess the usage and jobs data on the front-end side. Therefore, I coded an Angular service that aggregates, sums, averages and manipulates data in a lot of ways. After that, I "injected" it into the chart component. If EPCC staff ever implement a back-end that does the same things on the server, they will just have to swap that service for an HTTP service which communicates with the server.
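Dependency injection is language-agnostic, so the swap-friendly structure described above can be sketched outside Angular too. Here is a minimal Python analogue (all names are made up; the real services in the project are Angular/TypeScript classes):

```python
class DataService:
    # Aggregates raw job records (a stand-in for the front-end Angular service)
    def average_usage(self, jobs):
        return sum(j["cpu_hours"] for j in jobs) / len(jobs)

class ChartComponent:
    # The component does not create its service; it receives it ("injection").
    # A future HTTP-backed service with the same interface can be swapped in
    # without changing this class at all.
    def __init__(self, service):
        self.service = service

    def render(self, jobs):
        return f"avg CPU hours: {self.service.average_usage(jobs):.1f}"

chart = ChartComponent(DataService())
print(chart.render([{"cpu_hours": 10}, {"cpu_hours": 20}]))
```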
Video presentation
There is a video presentation created by me for this project. It is uploaded to YouTube:
Website spotlight
You can check the website live on the following GitHub page. However, do not be surprised by the example data. It’s dummy data generated for the purpose of this project.
The code is also available on my GitHub repository.
This Summer of HPC in Bratislava is coming to its end. It has been a summer full of adventures and amazing experiences, meeting great people and spending a lot of time working hard on the project and learning from it. We've spent a lot of time on our projects, but we've also had the time for amazing trips, like a boating day on the Danube river. The trip took about 6 hours, and although it required a lot of effort, it was completely worth it.
Us, taking a rest in the middle of the journey
Also, as a result of this summer, I have created a video to help everyone learn about and understand what High Performance Computing is, what Big Data is, their applications and some other things. It also introduces some aspects of my project in a very popular manner, using very general content and a whiteboard-style animation video, so that it reaches the maximum number of people.
Project Results
As my previous post explains, we successfully implemented the proposed algorithms both with traditional HPC approaches (MPI) and with renowned Big Data tools (Apache Spark) in order to measure their computational efficiency, as well as with a fault-tolerant approach (GPI-2) which outperforms Apache Spark while still preserving all of its advantages.
We depicted the results of running K-Means with Apache Spark, the first and second MPI methods, and the mixed MPI/GPI-2 method, with different numbers of cluster centers and maximum numbers of iterations, using the 1 million 2-dimensional point dataset. We used 2 and 4 nodes for this benchmarking.
Execution time of K-Means on a varying number of nodes, with 1000 centroids and 300 iterations over 1 million 2-dimensional points
This experiment was carried out with 1000 centroids and 300 iterations. Apache Spark, as expected, works significantly slower, being about 2.5 times slower than the other approaches. The first MPI approach turned out to be slightly better in terms of computation time than the second MPI approach. Also, the mixed MPI/GPI-2 method, even with the added fault-recovery capability, is slightly better than the second-best implementation (the first MPI approach). The GPI-2 features don't add any appreciable delay to the execution because, at the logical level, they use the asynchronous GASPI methodology to perform all the checkpoint saving and fault detection.
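For context, the per-iteration work that all of these implementations parallelise – assigning each point to its nearest centroid, then recomputing the centroids – can be sketched in a few lines of numpy. In the MPI versions, each rank runs this on its own slice of the points and the centroid update becomes a reduction across ranks. This is an illustrative sketch, not the project code:

```python
import numpy as np

def kmeans_step(points, centroids):
    # Assignment step: distance from every point to every centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

# Toy run: two well-separated 2-D clusters
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
cents = np.array([[1.0, 1.0], [4.0, 4.0]])
for _ in range(5):
    labels, cents = kmeans_step(pts, cents)
```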
Future Plans
Although the progress of the project ended up exceeding the initial goal, my mentor (Michal Pitoňák) and I have more ideas, and we intend to finish a scientific article about the work we've done in this project, so throughout this year we'll continue working hand in hand to finish it.
As the project intensively involved the processing of NetCDF datasets, this section serves as a brief background on the NetCDF format and its underlying data structure. NetCDF stands for "Network Common Data Form". Its creators (Rew, Davis, 1990) defined it as a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It actually emerged as an extension of NASA's Common Data Format (CDF). NetCDF was developed and is maintained within the Unidata organisation.
The NetCDF data abstraction models a scientific dataset as a collection of named multi-dimensional variables along with their coordinate systems and some of their named auxiliary attributes. Typically, each NetCDF file has three components: i) dimensions, ii) variables, and iii) attributes. Dimensions describe the axes of the data arrays; a dimension has a name and a length. A typical NetCDF variable has a name, a data type, and a shape described by a list of dimensions; scalar variables have an empty list of dimensions. Variables in NetCDF files can be one of six types (char, byte, short, int, float, double). Any NetCDF variable may also have an associated list of attributes to represent information about the variable.
Figure 1 illustrates the NetCDF abstraction with an example of the dimensions and variables that can be contained in a NetCDF file. The variables in the example represent a 2D array of surface temperatures on a latitude/longitude grid, and a 3D array of relative humidities defined on the same grid but with an extra dimension representing the atmospheric level.
Figure 1: NetCDF data structure.
2. Problem Description
Climate researchers and institutions can share their NetCDF datasets on the DSpace data repository. However, a shared file can be considered a "black box", which always needs to be opened first in order to know what is inside. In fact, climate simulation models generate vast amounts of data, stored in the standard NetCDF format. A typical NetCDF file can contain a large set of dimensions and variables. With so many files, researchers can spend a lot of time trying to find the appropriate file (if any). Figure 2 portrays the problem of sharing NetCDF datasets on DSpace.
Figure 2: Problem description.
3. Project Objectives
The main goal of the project was to produce explanatory metadata that can effectively describe NetCDF datasets. The metadata should also be stored and indexed in a query-able format, so that search and query tasks can be conducted efficiently. In this manner, we can facilitate the search and query of NetCDF datasets uploaded to the DSpace repository, so that researchers can easily discover and use climate data. Specifically, a set of objectives were defined as below:
Defining the relevant metadata structure to be extracted from NetCDF files.
Extraction of metadata from the NetCDF files.
Storage/indexing of extracted metadata.
Extending search/querying functionalities.
The project was developed in collaboration between GrNet and the Aristotle University of Thessaloniki. GrNet provided us with access to the ARIS supercomputing facility in Greece, and they also manage the DSpace repository. The ARIS supercomputer is usually utilised to run the computationally intensive climate simulation models, and the output of these models was also stored on ARIS.
4. Data Source
As already mentioned, the DSpace repository contained the main data source of NetCDF files. DSpace is a digital service that collects, preserves, and distributes digital material. Our particular focus is on climate datasets provided by Dr Eleni Katragkou from the Department of Meteorology and Climatology, Aristotle University of Thessaloniki. The datasets are available through the URL below: https://goo.gl/3pkW9n
5. Methodology
The project was mainly developed using Python. A set of packages was utilised as follows: i) netCDF4, ii) xml.etree.cElementTree, iii) xml.dom.minidom, iv) glob, and v) os.
Subsequently, the extracted metadata was encoded using the standard Dublin Core XML-based schema. The Dublin Core Schema is a small set of domain-independent vocabulary terms that can be used to describe information or data in a general sense. The full set of Dublin Core metadata terms can be found on the Dublin Core Metadata Initiative (DCMI) website. Figure 3 sketches the stages of project development. Furthermore, the full implemented Python code can be accessed on GitHub via: https://github.com/Mahmoud-Elbattah/Extract_Metadata_NetCDF
Figure 3: Project overview.
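As a rough illustration of the encoding stage, the standard-library XML packages listed above are enough to wrap extracted fields in Dublin Core elements. The field names and values below are made up for the example; the real ones come from the NetCDF headers:

```python
import xml.etree.ElementTree as ET
import xml.dom.minidom

# Hypothetical metadata extracted from one NetCDF file (values are illustrative)
metadata = {
    "title": "Surface temperature, EUR-44 domain",
    "description": "dimensions: lat(128), lon(256), time(360); variables: tas(float)",
    "format": "NetCDF",
}

# Encode as a Dublin Core record: one <dc:...> element per term
DC_NS = "http://purl.org/dc/elements/1.1/"
root = ET.Element("metadata", {"xmlns:dc": DC_NS})
for term, value in metadata.items():
    ET.SubElement(root, f"dc:{term}").text = value

xml_out = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml(indent="  ")
print(xml_out)
```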
6. Extracted Metadata
The project outcomes included the following:
More than 40K metadata fields were extracted.
940 DublinCore-based XML files.
Figure 4 provides an example of the extracted metadata.
Figure 4: Example of extracted metadata.
7. Project Mentors
Dr Eleni Katragkou
Department of Meteorology and Climatology, Aristotle University of Thessaloniki
Dr Ioannis Liabotis
Greek Research and Technology Network, Athens, Greece
8. Acknowledgements
First, I would like to thank my mentors Ioannis Liabotis and Eleni Katragkou for their kind support and help. Further, many thanks to Dimitris Dellis from GrNet who provided a lot of technical support during the project development. Last but not least, thanks to Edwige Pezzulli for her kind collegiality and companionship.
Check out my video presentation about the project:
The wonderful and memorable summer, like every good thing, has finally come to an end. Thus, in this post I'll try to wrap things up with some more remarks on the project and also some interesting pictures from our visit to Italy.
Geometry Cleanup in a nutshell
As explained in the previous post, geometry cleanup, or defeaturing, refers to the process of removing unnecessary details such as small parts (nuts, bolts, screws) and intricate features (fillets, holes, chamfers). The process is shown schematically for the example of a plate in the picture below.
Schematic description of defeaturing of a plate.
What’s a fillet ?
A fillet is a rounding of the sharp edges of a component. For the purpose of defeaturing, a classification based on the rounding at the end of the edge forming the fillet has been employed. The classes currently handled by the utility include:
Isolated Edge Fillet - IEF
Single Connected Edge Fillet - SCEF
Double connected Edge Fillet - DCEF
Different types of fillets (IEF, SCEF, DCEF)
The defeaturing process begins by identification of the fillet face and the neighboring faces, and is explained in detail in the presentation video given at the end of this post.
Wrap up
One thing which I can definitely assert is that this has been one of the best summers of my life to date. The journey, which began with the training week at IT4Innovations, continued with even more fun and learning in Ljubljana. It was enlightening to learn about different aspects of HPC and software development. I also got to hone my skills in Python programming and CAD data extraction using PythonOCC, which I see as useful skills for my future work in the scientific computing domain.
"Travelling is cool" is another take-home for me from this amazing experience. Following my newly developed passion for travel, my SoHPC mate Jan and I made a weekend trip to Italy, visiting the beautiful cities of Florence and Venice. The featured image shows the city of Florence as seen from Piazzale Michelangelo at sunset, and the one below was taken at the famous Piazza San Marco in Venice.
Piazza San Marco in Venice, Italy
Also, here is the final presentation video summarizing the project I did during the Summer, so enjoy the video 🙂
Welcome to my final blog post on the PRACE SoHPC website! First, I will show you a trick I used for building an accurate model from a few data observations. Then I will introduce the video presentation that summarises my work during this summer at the Jülich Supercomputing Centre, which was shown during our final web conference.
Introduction
In my last post we saw that we needed a way to create some kind of model from a set of data points. Specifically, we needed to tune the acceptance rate (the dependent variable) by manipulating the number of MD steps (the independent variable), whose relationship is unknown. Not only is it unknown, but there are other parameters, such as the lattice size, that we know can influence this relationship across different simulations. Therefore, we concluded that an online algorithm would make things simpler, as the model would be created from observations sharing the same combination of the remaining parameters.
However, we are not completely unaware of the relationship we are looking for. From my last post we saw that:
the data have an (almost) sigmoid or S-shape;
the acceptance rate (dependent variable) starts from 0.0 for a small number of steps and ends at 1.0 for a large number of steps.
There are many ways of creating a model, such as regression using a Multi-Layer Perceptron (MLP) – a type of artificial neural network. The problem with such an approach is that neural networks are very agnostic to the shape of the data we would like to model. They can take longer to train and are also prone to overfitting (the noise in the training data contributes to an inaccurate model). The biggest disadvantage of the neural network approach is that, since it doesn't take into consideration what we already know about the shape, it can give a model that fits the current data well but makes very bad new predictions (as with overfitting). Imagine the orange line in the above figure being the result of training an MLP, but with the acceptance rate falling back to values near 0 for any Nmd greater than 300. This is quite possible with MLPs, and it is unwanted. That is one important reason for choosing to fit the data to a specific function by using least squares fitting instead.
Selecting the best function for fitting
In the above picture, we saw an attempt to fit the data to the equation 1/(1+exp(-x)), which was a relatively good attempt. The problem, though, is that the sigmoid function is symmetric. As we can observe, the orange line is more "pointy" than needed at the top part of the figure and less "pointy" than needed at the lower part. The idea was to look for functions that look like a sigmoid but allow for a degree of asymmetry. The solution was to use the Cumulative Distribution Function (CDF) of the skew normal distribution.
The CDF gives the area under the probability density function (PDF) and it actually has the shape we are looking for. Also, the skew normal distribution allows skewness which can make our data better fit the model. In the following graph, you can see a comparison of the normal distribution and the skew normal distribution.
Comparison of the PDFs and CDFs of two distributions.
As we can see, by manipulating the α parameter we can change the shape of the PDF of the distribution and consequently the shape of the CDF as well. Of course, as the selection of the function was done visually, it does not necessarily mean that the physics gives the data a skew normal distribution form, but it seems to suit our cases well. In the following figure, we can see a demonstration of the different CDFs we can obtain by varying the α parameter.
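Concretely, the fitting step can be done with scipy's least-squares fitter and its skew normal implementation. Here is a sketch with invented observations; in the real setup the points come from short trial runs of the simulation:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import skewnorm

def acceptance_model(n_md, alpha, loc, scale):
    # Skew normal CDF: rises from 0 to 1, with asymmetry controlled by alpha
    return skewnorm.cdf(n_md, alpha, loc=loc, scale=scale)

# Hypothetical observations: (number of MD steps, measured acceptance rate)
n_md = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
acc = np.array([0.02, 0.15, 0.55, 0.85, 0.97, 1.00])

# Least-squares fit; p0 is a rough initial guess for (alpha, loc, scale)
params, _ = curve_fit(acceptance_model, n_md, acc, p0=[1.0, 150.0, 50.0])

# Invert the model numerically to pick the Nmd for a target acceptance rate
target = 0.8
candidates = np.arange(10, 500)
best_n = candidates[np.argmin(np.abs(acceptance_model(candidates, *params) - target))]
```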
This "trick" gave a quick and accurate methodology for creating a relatively good model, even with only a couple of observations.
Video Presentation
And now I present to you my video presentation that summarises the work. At the 3:56 timestamp, you can see the resulting visualisation of the tuning in action.
At the time when I wrote my last blog post (check it out if you didn’t read it!) I was quite happy with the state of my Summer of HPC project “Tracing in 4D data“. I had completed the implementation of a 3D version of the object tracking algorithm Medianflow and the test results of the dummy data were quite promising.
That changed when I got the results on real-world data. The problem was not so much the tracking accuracy of the algorithm, which was actually quite good, but the performance. The execution time was very problematic: 25 minutes to track one bubble between two frames? We need to track hundreds of bubbles through hundreds of frames! OK, we can track every bubble independently, so a parallel implementation of the algorithm would scale quite well with the number of bubbles, but that's not true for the number of frames!
I felt that a more advanced tracking algorithm could perform much better, while still retaining the same advantages from parallelization. That is why I considered OpenCV again, to select another algorithm to generalize to three dimensions. I found KCF, which stands for Kernelized Correlation Filters.
Again, if you’re not interested in the math, you can skip directly to the last section where you can see some nice visualizations of the testing results of the algorithm! Or if you don’t want to read at all, just watch the video presentation of my project here:
Kernelized Correlation Filters
The second algorithm I chose to implement is KCF [1]. While MedianFlow is based on old Computer Vision techniques (it builds on the Lucas-Kanade feature tracker [2], originally published in 1981), KCF is inspired by newer statistical machine learning methods. It works by building a discriminative linear classifier (which some people would call artificial intelligence) tasked with distinguishing between the target and the surrounding environment. This method is called 'tracking by detection': the classifier is typically trained with translated and scaled sample patches in order to learn a representation of the target, and then it predicts the presence or absence of the target in an image patch. This can be very computationally expensive:
In the training phase, the classifier is trained online with samples collected during tracking. Unfortunately, the potentially large number of samples becomes a computational burden, which directly conflicts with real-time requirements of tracking. On the other hand, limiting the samples may sacrifice performance and accuracy.
In the detection phase, as with other similar algorithms, the classifier is tested on many candidate patches to find the most likely location. This is also very computationally expensive, and we encounter the same problems as before.
KCF solves this by using some nice mathematical properties in both training and detection. The first mathematical tool that KCF employs is the Fourier transform, taking advantage of the convolution theorem: the convolution of two patches is equivalent to an element-wise product in the Fourier domain. Formulating the objective in the Fourier domain thus allows us to specify the desired output of a linear classifier for several translated image patches at once. This is not the only benefit of the Fourier transform, because interestingly, as we add more samples the problem acquires a circulant structure. To better explain this point, let's consider for a moment one-dimensional images (basically vectors). Let x = (x1, x2, ..., xn) be a 1-D image. The circulant matrix C(x) created from this image contains all possible cyclic translations of the image along its dimension: its first row is x itself, and each following row is the previous one shifted by one element.
As you may have noticed, this matrix contains a lot of redundant data: we only need to know the first row to generate the whole matrix. But circulant matrices also have a really amazing mathematical property: they are all made diagonal by the Discrete Fourier Transform (DFT), regardless of the generating vector! So, by working in the Fourier domain we can avoid expensive matrix multiplications, because all these matrices are diagonal and all operations can be done element-wise on their diagonal elements.
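This property is easy to check numerically. A small numpy sketch (the four-element "image" is arbitrary):

```python
import numpy as np

x = np.array([4.0, 1.0, 3.0, 2.0])   # a tiny 1-D "image"
n = len(x)

# Circulant matrix: row i is x cyclically shifted by i positions
C = np.array([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix (np.fft.fft transforms each row of the identity)
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# Changing basis to the Fourier domain diagonalizes C...
D = F.conj().T @ C @ F

# ...and the diagonal entries are exactly the DFT of the generating vector
print(np.allclose(D - np.diag(np.diag(D)), 0))
print(np.allclose(np.diag(D), np.fft.fft(x)))
```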
Using these nice properties, we can reduce the computational cost from O(n^3) to nearly linear O(n log n), bounded by the DFT. The original cost comes from the ridge regression formula that we need to solve to learn a representation of the target: w = (X^T X + λI)^(-1) X^T y.
In other words, the goal of training is to find a function f(z) = w^T z that minimizes the squared error over the samples x_i (which are the rows of the matrix X in the 1-D example) and their regression targets y_i. This means finding the parameters w.
There is still one last step we can take to improve the algorithm: we can use the "kernel trick" to allow more powerful non-linear regression functions f(z). This means moving the problem to a different set of variables (the dual space), where the optimization problem is still linear. The downside is that evaluating f(z) typically grows in complexity with the number of samples. But we can apply all the properties previously discussed to the kernelized version of ridge regression and obtain non-linear filters that are as fast as linear correlation filters, both to train and to evaluate. All the steps are described in more detail in the original paper [1].
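To make this concrete, here is a 1-D numpy sketch of the kernelized training and detection steps, a toy version of the formulation in [1]. The kernel width, regularization and data are made up, and the real tracker of course works on 2-D (or, in my project, 3-D) patches:

```python
import numpy as np

def gaussian_kernel_correlation(x1, x2, sigma=0.5):
    # Kernel evaluated against all cyclic shifts at once via the FFT
    # (the cross-correlation term becomes an element-wise product in Fourier space)
    cross = np.fft.ifft(np.conj(np.fft.fft(x1)) * np.fft.fft(x2)).real
    d2 = np.dot(x1, x1) + np.dot(x2, x2) - 2.0 * cross
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2 * len(x1)))

def train(x, y, lam=1e-4):
    # Kernelized ridge regression solved element-wise in the Fourier domain:
    # alpha_hat = y_hat / (k_hat + lambda)
    k = gaussian_kernel_correlation(x, x)
    return np.fft.fft(y) / (np.fft.fft(k) + lam)

def detect(alpha_hat, x, z):
    # Filter response over all cyclic shifts of the candidate patch z
    k = gaussian_kernel_correlation(x, z)
    return np.fft.ifft(np.fft.fft(k) * alpha_hat).real

# Toy data: a random 1-D "patch" and a desired Gaussian response peaked
# at the target position (the centre of the patch)
rng = np.random.default_rng(2)
n = 64
x = rng.normal(size=n)
y = np.exp(-0.5 * ((np.arange(n) - n // 2) ** 2) / 9.0)

alpha_hat = train(x, y)
response = detect(alpha_hat, x, x)   # detecting on the training patch itself
```

Since we detect on the training patch itself, the response peaks at the target position; no matrix is ever inverted, only FFTs and element-wise operations are used.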
Results
Again, every step described up to now can easily be generalized to three-dimensional volumes, same as for Medianflow. As soon as I completed the generalization, I tested the algorithm on the dummy data first, and the results were almost instantaneous… but wrong. Not for long, though – I just needed to adjust some parameters! And the speed was still quite impressive. So it was time to test it on real-world data.
This data comes from the tomographic reconstruction of a system consisting of a constriction through which a liquid foam is flowing. It was collected at one of Europe's several synchrotron facilities (big X-ray machines). In general, tracing the movements can allow us to better understand some properties of the material, and it has several applications both in industry (e.g. manufacturing or food production) and in academic research experiments. Unfortunately, as of this time I only have three frames available. You can see a rendering of the data just below.
Rendered animation of the flowing foam system.
It's easy to see why this is a high performance computing problem – look at all those bubbles! But this is not the only reason! These 3D volumes are represented through voxels (the 3D version of pixels in a 2D image) and, while you can't see it in the rendered animation, they keep the full information about the depth of the data. (If you think of pixels as pieces of a cardboard puzzle, voxels would be like LEGO, which by the way is originally from here in Denmark!) This is why they can take up so much disk space, each frame taking up to 15 GB. The algorithm is generalized to these sequences of 3D volumes and designed to work directly on the voxels. But now, without further ado, here are the results of the tracking for a bubble:
Visualization of the tracking results of KCF in each dimension.
It works! And it takes only about 15 seconds per frame for one bubble. That's a 100x speedup compared to Medianflow! While these results look very promising, there is still a lot of work to do. First of all, we need further testing of the algorithm with different kinds of real data. (As I had limited options for this during the summer, I spent more time than I care to admit creating some voxel art animations and testing the algorithms on those… you can check the results in the video presentation at the top of the page.) Then, we still need to implement it in a true high performance computing fashion and make a version of the algorithm that can run on multiple nodes in a cluster. For now, the algorithm runs only on a single node, but it's possible to speed up the computation by using a GPU for the Fourier transform. Only then will my project be truly done. But the summer is almost over and I need to go back home; hopefully I will have another chance to keep working on it!
References
[1] Henriques, João F., et al. “High-speed tracking with kernelized correlation filters.” IEEE Transactions on Pattern Analysis and Machine Intelligence 37.3 (2015): 583-596.
[2] Lucas, Bruce D., and Takeo Kanade. “An iterative image registration technique with an application to stereo vision.” (1981): 674-679.
The summer in Italy does not seem to be ending, but my stay here does. In my last post I showcased my work, which was by then almost done. Since then, the looks haven’t changed much; only the underlying machinery has, to make it faster, more stable and better altogether. The biggest visual change was to the 3D model itself, where we added the “room” so it looks more realistic and closer to the real thing. You can watch it in all its glory in my video presentation of the project.
The Summary
All in all, this summer was the best summer I have ever had and probably ever will have. My buddy Arnau introduced me to proper photography and, as a result, I have already spent more than 1000 euros on my own DSLR and lenses. In fact, my newest lens arrived just today. Thanks Arnau, I am broke now!
My newest lens, the Samyang 14 mm ultra wide angle. Awesome for landscape photos and stars.
Another thing I got this summer was fat. Quite a lot of it. So at least I have something to work on during the autumn. But who wouldn’t get fat with all the superb food around: pizza, pasta, antipasti, meat, beer and wine. You have to enjoy everything!
To shed some of that fat, a bike helped. Our colleagues lent us bikes to get to work, and by a simple calculation we rode more than 210 km during our stay here.
We also visited quite a lot of Italy. We stuffed our stomachs in Bologna (of course), Venice, Florence, Parma and even Rome. And in the meantime, between meals, we took a lot of photos. I actually took more than 2100! One of my favorites is this one from the top of St. Peter’s Basilica in Vatican City.
It is actually 17 photos stitched together, creating an almost 360-degree panorama of Rome, and I am planning to print it. If I want it 30 cm high, the canvas will be 213 cm long. Pretty long, but awesome!
To end on a high note, I will modify a quote from my favorite author, Douglas Adams:
Among the many definitions of visualisation, I prefer the one that describes it as the transformation of the symbolic into the geometric (McCormick, 1987). In this sense, visualisation methods are increasingly embraced to explore, communicate, and establish our understanding of data.
In this post, I present an interactive visualisation that can help explore the Top500 list of supercomputers. The aim was to design the visualisation in an intuitive way that synthesises and summarises the key statistics of the Top500 list. I believe the resulting visualisation can resonate with a wide audience both within and outside of the HPC community. The rest of the post gives an overview of the visualisation and how it was developed. Alternatively, the visualisation can be explored directly at the URL below: https://goo.gl/YywhU7
Overview: Visualisation Pipeline
The visualisation is delivered through a web-based application and was produced over a set of stages, as sketched in Figure 1. First, the data was collected from Top500.org according to the June 2017 list. The data was scraped using a Python script that mainly used the urllib and BeautifulSoup modules. Subsequently, another Python script produced statistics from the Top500 data, such as the market share of vendors or operating systems. The statistics were stored in simple CSV-formatted files. The Python scripts are accessible from my GitHub below: https://github.com/Mahmoud-Elbattah/Data_Scraping_Top500.org
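The statistics step can be sketched with the standard library alone. This is a hedged illustration, not the actual script: the records and field names below are hypothetical stand-ins for the scraped June 2017 list.

```python
import csv
import io
from collections import Counter

def market_share(rows, field):
    """Turn a list of scraped Top500 records into (value, percentage) pairs,
    e.g. the share of each vendor or operating system across all systems."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return [(value, 100.0 * n / total) for value, n in counts.most_common()]

# Hypothetical records standing in for the scraped list.
records = [
    {"vendor": "Cray"}, {"vendor": "Cray"},
    {"vendor": "HPE"}, {"vendor": "Lenovo"},
]

# Write the statistics out as a simple CSV file, as in the pipeline.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["vendor", "share_percent"])
writer.writerows(market_share(records, "vendor"))
print(buf.getvalue())
```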
The visualisation was created using the widely used D3 JavaScript library. Finally, it was integrated into a simple web application that provides some interactivity features as well.
Figure1: Visualisation pipeline overview.
Visual Design
The visualisation mainly uses bubble charts to deliver the statistics of the Top500 list. In particular, the size of a bubble represents its percentage of the whole 500 supercomputers. For instance, Figure 2 shows a bubble chart that visualises the number of supercomputers per country based on the June 2017 list. Clearly, the US and China host the highest numbers of supercomputers worldwide.
Figure 2: Visual design.
Interactivity
The user can choose a category from a drop-down box. The categories include: i) Countries, ii) Regions, iii) Segments, iv) Vendors, and v) Operating Systems.
Furthermore, a pop-up tooltip shows up when the cursor hovers over a bubble. This can be very useful for viewing the information of small bubbles within the visualisation (e.g. Figure 3).
Sometimes, rejecting unfeasible ideas early enough is a crucial step towards actually solving a problem. In my quest to speed up the ocean modeling framework Veros, I considered restructuring the existing pure Python code so that Bohrium could automatically produce more efficient, optimized code. However, that would require deep knowledge of the internals of a complex system like Bohrium, and in practice it is hard to identify such anti-patterns. Rewriting the most time-consuming parts in C or C++ would also potentially improve overall performance, but that choice would not be compatible with the goals of readability and usability the framework has set. A third, compromise solution had to be found.
Cython enables the developer to combine the productivity, expressiveness and mature ecosystem of Python with the bare-metal performance of C/C++. Static type declarations allow the compiler to produce more efficient C code, diverging from Python’s dynamic nature. In particular, low-level numerical loops in Python tend to be slow, and although NumPy operations can replace many of them, some computations are best expressed through explicit looping. Using Cython, these computational loops can reach C performance. Finally, Cython allows Python to interface natively with existing C, C++ and Fortran code. Overall, it provides a productivity and runtime speedup with a minimal amount of effort.
After adding explicit types, turning certain numerical operations into loops and disabling safety checks using compiler directives, I decided to exploit another Cython feature. Typed memoryviews allow efficient access to memory buffers, without any Python overhead. Indexing on memoryviews is automatically translated to a memory address. They can handle NumPy, C and Cython arrays.
But how can we take full advantage of the underlying cores and threads? Cython supports native parallelism via OpenMP. In a parallel loop, OpenMP starts a pool of threads and distributes the work among them. Such a loop can only be used when the Global Interpreter Lock (GIL) is released. Luckily, the memoryview interface does not usually require the GIL, making it an ideal choice for indexing inside the parallel loop.
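As a toy illustration of the pattern described above — and only a sketch, not code from Veros — here is a small Cython kernel combining explicit types, contiguous typed memoryviews, disabled safety checks, and a `prange` loop that releases the GIL so OpenMP threads can share the iterations:

```cython
# cython: boundscheck=False, wraparound=False
# A toy stencil update: typed memoryviews give C-speed indexing, and
# prange releases the GIL and distributes iterations over OpenMP threads.
from cython.parallel import prange

def relax(double[:, ::1] u, double[:, ::1] out):
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = u.shape[0], m = u.shape[1]
    for i in prange(1, n - 1, nogil=True):
        for j in range(1, m - 1):
            out[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                + u[i, j - 1] + u[i, j + 1])
```

Compiling such a module requires passing the usual OpenMP flags (e.g. `-fopenmp` with GCC) to the Cython build.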
Because of its large overhead, Bohrium was not used in the benchmarks. Below is a chart comparing the execution time between the original NumPy version and the one with the critical parts implemented in Cython:
Even though not much time was left, I still wanted to attempt to port a section of the framework to the GPU. Veros solves a 2D Poisson equation using the Biconjugate Gradient Stabilized (BiCGSTAB) iterative method. This solver, along with other linear algebra operations, is implemented in the ViennaCL library, an open-source C++ linear algebra library that supports CUDA, OpenCL and OpenMP. In order to execute the solver on the GPU, I created a C++ method that uses the ViennaCL interface. Cython passes the input matrices to C++, where they are reconstructed and copied to the GPU. The solver runs on the GPU and the result vector is copied back to the host. Finally, it is sent back to Cython without any additional copies. This procedure was implemented for both CUDA and OpenCL.
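The GPU port itself lives in C++, but the numerics can be sketched on the CPU side with SciPy's own BiCGSTAB. This is a hedged stand-in: a small 1D Poisson system instead of the real 2D one solved by Veros.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

# 1D Poisson operator (tridiagonal) standing in for the 2D pressure system.
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")
b = np.ones(n)

# Solve A x = b iteratively; info == 0 signals convergence.
x, info = bicgstab(A, b)
print(info, np.linalg.norm(A @ x - b))
```

ViennaCL exposes essentially the same call on the device side; the surrounding host/device copies are what the Cython↔C++ glue manages.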
Here is the comparison between NumPy, Cython and Cython with the OpenCL solver:
Now that the programme is coming to its end, it is time to share my thoughts on the whole experience. This summer I made some awesome friends and worked as part of an amazing research group; I hope our paths will converge again in the future. I traveled more than ever before and had the chance to explore many different places. Of course, I also had the opportunity to expand my knowledge and improve my programming skills. So if you are interested in computing, consider applying in the years to come: Summer of HPC is one of the best ways to spend your summer!
During the past few weeks I have been able to finalize the project task I was assigned, and I am very excited about that. In a nutshell, a Python-based code was developed for the pre-processing of real well-log data. The borehole data I have been using and testing comes from the Netherlands and the Dutch sector of the North Sea continental shelf. The developed utilities now include: data extraction based on specific geologic parameters such as Gamma Ray, porosity or density measurements; plotting with interpolation features; and data storage in various formats or binary form.
Another feature, which is actually the core of my summer project, was the merging of a large number of different well logs into one file, formatted for data entry into a cool code! This cool code I am referring to is a newly developed well-log correlation method included in the Mines Java Toolkit repository. This effort made it easier to generate real case studies for well-log correlation, and I was able to visualize the results. Well-log correlation looks something like the graph below, with the colourbar among the essentials!
Tools for Geologic interpretation: “13 density well-logs correlated based on relative geologic time in a color-coded manner”
Now it is up to experienced geologists to evaluate this “simultaneous well-log correlation method” and discuss its limitations and accuracy. In the past, well-log correlation was conducted solely by geologists and required many man-hours for an accurate geologic interpretation to be completed. Nowadays, the development of autonomous well-log correlation methods enables that in just a few seconds. But beware 🙂 every method should be thoroughly tested before being put into industrial use.
Future research includes machine-learning methods and deep convolutional networks for solving the exact same problem. Is this possible, you may ask?
Oh well 🙂 methods that facilitate geologic interpretation, and more specifically the problem of facies classification that I mentioned in my previous blog posts, have already been developed, and the results are promising too. Despite that, literature reviews and open-source material in this research area show that a lack of realistic case studies is mostly responsible for the poor performance and accuracy of the results. That is the main reason why my project focused on the generation of real well-log correlation examples. You can check out my video presentation on YouTube, and all the material on my GitHub profile for more detailed documentation.
Last but not least, as my whole coding journey through the libraries of NumPy, scikit-learn, Pandas, Matplotlib and many more comes to an end, I am grateful for the amazing learning experience PRACE’s Summer of HPC proved to be. It was accompanied by many cups of the finest Scottish tea, shortbread treats and exciting weekend trips around Edinburgh.
After all these adventures I am a wee engineer/programmer now :). And remember, if someone tells you that having a vacation while learning and being productive is not possible, prove them wrong. 🙂
As summer ends, I can’t help but remember Jim Morrison (The Doors) singing “this is the end, my only friend, the end…”. This post also represents an end: the end of my Summer of HPC project and hence of my stay in Bologna. Nonetheless, as Mike Oldfield puts it, “as we all know, endings are just beginnings…”. This post also represents a beginning. I will go back home and finish my Ph.D., making use of all the skills I’ve learned this summer. Somewhere in Italy, someone will take the work I’ve done and continue the project.
Some of my best memories from Italy, the Duomo in Firenze.
Some of my best memories from Italy, the Ponte Vecchio in Firenze.
How to efficiently visualize data
In my first post, I explained what my task would be, and in the second one, how I could accomplish it. In this post, I will briefly present the results of my work. Have I succeeded? You can judge for yourselves…
ParaView Plugins
One of the reasons ParaView is such a great piece of software is its ability to grow. Users can easily program their own filters to interact with the data and convert them into plugins by embedding them in XML files. The result is a ParaView filter with a custom interface performing your operations!
So that’s what I did. I created plugins for interacting with OGSTM-BFM NetCDF output data as a time series, where the variables to present can be easily selected. Other plugins perform selections on the data set, e.g., selection of the sub basins in the Mediterranean Sea or selection between “coast” and “open sea”. Finally, some plugins perform post-processing of the data set, such as computation of the Okubo-Weiss criterion.
Pipes and pipelines
ParaView’s chain of operations, i.e., the operations performed on the data set, is called a pipeline. Building a pipeline is necessary to visualize the data. Pipelines can be built interactively or programmatically with Python. For a ParaView web application, the pipeline needs to be specified in Python, in a script that loads the data set on the server. I built a couple of pipeline examples: one gathering the data from Pico and another from Marconi.
OGSTM-BFM Data Viewer with chlorophyll concentration (mg/m3).
OGSTM-BFM Data Viewer with phosphates concentration (mmolP/m3) in the Ionian basin.
The OGSTM-BFM Viewer
“OGSTM-BFM Viewer” is the name of the web application I developed in this project. It is basically a fork of Kitware’s Visualizer application, with some features simplified or tailored to OGS’s needs. In a nutshell, OGSTM-BFM Viewer loads all the OGS custom plugins and a pipeline. The viewer lets the user modify the properties of the filters that affect the visualization, but not the pipeline itself.
Visit to my supervisors at OGS Trieste.
What lies ahead
As I write this post, the application still has to be deployed on the cluster. Once that is done, OGS will have an efficient viewer for data sets produced by OGSTM-BFM that can be easily accessed from a regular browser. The project will be continued by students at OGS and upgraded with the latest features currently being developed.
Quan l’estiu s’acaba i s’acosta anar a currar, Granollers fa festa i no pensa en l’endemà… (When summer ends and it’s almost time to get back to work, Granollers celebrates and doesn’t think about tomorrow…)
I’d like to end this post by mentioning the festival of my hometown, Granollers. It takes place in the last week of August, when summer ends and work starts (as the lines of the title say). This year I’m missing most of it, but soon enough I’ll be jumping under the correfoc and, of course, taking pictures of everything! So stay in touch with me, check out the photos and…
Because we love fire! Photo of a correfoc in Granollers last year.
The last few weeks have been a very interesting time here at the Jülich Supercomputing Centre. Unfortunately, I only have a little more than a week left to spend on these fascinating topics with the inspiring people here.
Vectorization by hand
In this section I present some of my practical exploration of the vectorization capabilities of GCC (GNU Compiler Collection) and ICC (Intel C/C++ Compiler) on a small program that performs a simple operation over vectors of data. The most important outcome of this task was to familiarize myself with code vectorization, so that I will be able to modify and optimize vectorized code in more complex software such as physics simulators.
In the figure below we can see the performance of the application for different problem sizes and compiler optimization levels. Each time the vector size exceeds the size of a cache level, performance drops because the data must be fetched from a higher level of the cache hierarchy, which is more expensive to access.
Performance for different problem sizes and GCC compiler optimization levels on a Jureca node.
In the above experiment I did not insert any vectorization intrinsics into the code. I expected GCC to exploit vectorization, but it did not. This is why inserting intrinsics supported by the processor can be a good idea to ensure that the code is actually vectorized.
Below we can see a similar set of results that compares the performance achieved with and without manual vectorization.
GCC compiler vectorization vs vectorization by hand (FMA intrinsics)
As we can see, at least for problem sizes below the L1 cache size, manual vectorization almost doubles the performance at the O3 optimization level and triples it at a lower optimization level.
For the Intel compiler, though, this was not the case: ICC was able to exploit vectorization on its own. Of course, compiler technology may become very efficient at auto-vectorization in the future, and that would be preferable to code with intrinsics, which can have portability issues across architectures. But for HPC applications it can still make sense to vectorize code manually, given the specialized nature of the software and hardware infrastructure.
Accelerating the simulations with online optimization
Simulations of Lattice QCD (Lattice Quantum Chromodynamics) are used to study and calculate properties of strongly interacting matter, as well as the quark-gluon plasma, which are very difficult to explore experimentally due to the extremely high temperatures required for these interactions (click here for more). The specific simulator we are trying to optimize uses quantum Monte Carlo calculations, originally developed for Lattice QCD, to study the electronic properties of carbon nanotubes [1].
My last activity was to help reduce the runtime of the simulator using data science techniques. The method I describe in this section is independent of the parallelization and vectorization used to fully exploit the computing resources. The simulator is already parallelised and vectorised, perhaps not optimally, but as we are going to see, some algorithmic decisions can accelerate its execution further.
The simulator uses leapfrog integration to integrate differential equations. These equations look like recurrence relations: each new value is calculated from a previous one. The smaller the time step (Δt), the more accurate the integration. Δt is inversely proportional to the number of MD (Molecular Dynamics) steps (Nmd). Therefore, the greater Nmd gets, the more accurate the simulation is, but also the more time it takes to finish. If Nmd is too low, the acceptance rate of the new trajectories is too low to get useful results in reasonable time, even on a supercomputer. If Nmd is too high, the acceptance rate approaches 100% but a significant time overhead is introduced. The goal is to tune this parameter at runtime to select the value that yields an acceptance rate near 66%.
In computer science, an algorithm that processes data serially, as soon as they become available, is called an online algorithm. In our case we need such an online algorithm for tuning Nmd, because each data point is expensive to produce and we do not have the whole dataset from the beginning. Also, we do not know the exact relationship between the acceptance rate and the other parameters, such as the lattice size. We could do an exhaustive exploration of the other parameters, but it could take an unreasonable amount of time on the same computing resources. Of course there are shortcuts to building models when exploring a big design space, but the dynamic approach is less complicated and more general for future experiments, where new parameters could be added without knowing how orthogonal they are.
The idea is to build a model of the acceptance rate versus Nmd each time a new data point comes in, and then select the best Nmd for the next run. In the following figure we can see a set of observations from an experiment, as well as the model an earlier version of my auto-tuner creates. It uses the least-squares fitting functionality of the SciPy Python library to fit a pre-defined equation. As we can see, the observations have an (almost) sigmoid-like shape, so we could fit a function like f(x) = 1/(1+exp(x)).
Observations for different number of md steps
The algorithm SciPy uses to fit the data is the Levenberg-Marquardt algorithm. As a user, it is very simple to tell the algorithm which parameters need to be found. From high-school mathematics we already know that we can shift, stretch or shrink the graph of a function with a set of simple operations:
If we want to move f(x) to the left by a, we use f(x+a)
If we want to shrink f(x) by b in the x-axis, we use f(b*x)
If we want to move f(x) up by c, we use f(x)+c
If we want to stretch f(x) by d in the y-axis, we use d*f(x)
Since we already know that the acceptance rate ranges from 0 to 1, we do not need the last two of the above. Our model thus has the form 1/(1+exp((x+a)*b)), which is fast and accurate to fit even with only a few points (at the beginning of the simulations). This example is a bit more simplistic than my actual algorithm.
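A minimal sketch of this fitting step could look as follows. The observations and parameter values here are synthetic and purely illustrative, and the real auto-tuner is more elaborate than this.

```python
import numpy as np
from scipy.optimize import curve_fit

def acceptance_model(x, a, b):
    # Shifted/scaled sigmoid: bounded between 0 and 1, as a rate must be.
    return 1.0 / (1.0 + np.exp((x + a) * b))

# Synthetic observations: acceptance rate rises with the number of MD steps.
nmd = np.array([2, 4, 6, 8, 10, 12, 16, 20], dtype=float)
true_a, true_b = -8.0, -0.8          # b < 0 makes the curve increase with x
rate = acceptance_model(nmd, true_a, true_b)

# Levenberg-Marquardt least-squares fit of the two parameters.
(a, b), _ = curve_fit(acceptance_model, nmd, rate, p0=(-5.0, -1.0))

# Invert the fitted model to pick the Nmd whose predicted acceptance is ~66%.
target = 0.66
best_nmd = np.log(1.0 / target - 1.0) / b - a
print(round(best_nmd, 1))  # → 8.8
```

After each new (Nmd, acceptance) observation, the fit is simply redone and the next Nmd chosen from the updated curve.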
And here is an animation that shows how the regression progresses through 40 iterations.
Animation: Least-squares fitting in action
Conclusion
This approach has proven useful for our simulator, and in a future post I will explain more about the specific algorithm I used to fine-tune the simulations for a shorter runtime.
Regarding vectorization, I did not have much time for the second task of further optimizing the current implementation, but I learnt a lot during the initial hands-on activity described above and can easily apply it to other problems in the future, as well as use it to identify performance bottlenecks.
Other activities
I have also had a great time visiting Cologne. Public transportation seems very convenient here, especially at our accommodation, where there is a train stop only 300 meters away. In the picture below you can see an example of the level of detail in the decorations of Cologne Cathedral.
References
[1] Luu, Thomas, and Timo A. Lähde. “Quantum Monte Carlo calculations for carbon nanotubes.” Physical Review B 93, no. 15 (2016): 155106.
Decoration of the main entrance of the Cologne Cathedral (you can view/save the 21-Megapixel picture to zoom for more details!)
The outer battlements of Park Güell in the early morning, tourists are already streaming in.
The morning broke, as it always does, with the ringing of the clock reminding me of the fact that I should have made the appointment with my bed sooner. Already, one could tell that insanity was afoot in the neighbourhood since the ringing of the clock was accompanied by gunfire. Attributing those to the holiday, the bed was calling again…
With the sun shining through the slits in the blinds, warning me that I had to make haste to get a few nice pictures of my destination, Park Güell, I abandoned my attempt to go back to sleep.
An aside on Park Güell
One of the better-maintained parks of Barcelona, thanks to its being a major tourist attraction. The lower part of the park is taken up by the monument area, with the Gaudí museum and the iconic terrace among its highlights.
On the upper levels, a beautiful view of the central part of the city opens up to those who make the journey. Since it is one of the major tourist attractions, I recommend visiting in off-peak hours: very early or very late in the day. Those who come before sunrise will be treated to an enjoyable wander and a nice view of Barcelona waking up. The evening, of course, will greet you with fiery hilltops and, if you’re lucky and the weather is not perfect, fiery skies.
An additional benefit for the early birds: the visit to the monument area is free until 08:00. Afterwards, the park fills up with tourists fast.
The morning tour of the park was short – 08:30 is, after all, a tad late. The sun was showing its merciless nature and the tourists were starting to swarm the premises. Having made a mental note to come back at 07:00 with a fresh mind, I was off to the office – the local holiday be damned.
A typical day in the life at BSC. (Image due to PhD comics Jorge Cham www.phdcomics.com)
Bad news was waiting at the office: due to errors in a matrix-conversion routine I had been using, my numerical experiments had not even gotten going. The menu for the day was thus set to “Debugging”.
Off the clock
After a day’s work – a few hours, really – as a pest exterminator of the digital kind, off we went back to the barracks, to catch the parade in our part of Barcelona.
This year, this week, the Gràcia quarter of Barcelona is celebrating its 200th anniversary. Meaning: where once you had trouble strolling along due to tourists, you now have trouble strolling due to colorful stalls.
Note to self: Explore that one further.
Before going off to the parade, I figured I’d change into something more comfortable to move around in, since I expected to run around a lot taking pictures. So I changed into a pair of lightweight shorts and the shoes I had bought specifically for sports. As anyone will know, those are basically purely synthetic – we shall return to this point later. With this done it was time to go, for
The dangers (or joy) of participating in “Festa Mejor” – this is the moment I should’ve noticed that I chose the wrong clothes for this festival!
we didn’t want to miss the beginning of the parade.
La Festa Major de Gràcia.
The week is then filled with minor and major events taking place all over the Barri de Gràcia.
According to a current exhibition on Spain’s major festivals at La Virreina Centre de la Imatge, the Festa Major de Gràcia is based entirely upon popular contributions. The work going on in the workshops in the evenings leading up to the festival is what originally alerted me to its presence. Spoiler alert: most (if not all) of the spectacular decorations found throughout the quarter are made from reused plastic bottles, papier-mâché and a lot of sellotape. The creative process actually made me want to join in; perhaps I should seek a PhD position in Barcelona…
For starters, we situated ourselves not far from the start of the parade, at Pl. Trilla – a nice, open space from which to watch the festivities. When the procession rolled by, the sparks of insanity already started to show themselves: noise and fireworks, sweets and a lot of jumping around – also: children screaming. The people in the parade apparently thoroughly enjoyed chasing bystanders with fireworks. The fireworks clearly weren’t approved by the German TÜV, since their firepower was quite something. The first burns were nursed with the harvest of sweets, and after the procession rolled by my conclusion was:
“Well.. That wasn’t all that great. Each group was more or less the same.”
Being good scientists, Aleksander and I decided to verify our findings with a second observation run.
The dancing devil by La Vella Gracia.
Dance with the devil
And so we cut through Gràcia and meandered to Carrer del Torrent de l’Olla, where we rendezvoused with the head of the parade. Here it was obvious that the procession had merely been revving up at the beginning, for now the noise was deafening! So, with our blood boiling (what else would it do in the heat of Barcelona?), we rejoined “La Festa” at the corner of Carrer del Diluvi.
By this time, I was sick of the ground-level view, and the local garbage bins were looking increasingly enticing.
Long story short: up we go to get a better view! It was worth it.
Alas, I got more than I bargained for, for I bought myself a standoff with the devil himself. Now, a standoff with the leader of hell seldom works out in favor of mortals, and fair enough, I soon had my ass on fire and took flight to hide behind the walking wall called Aleksander.
Apparently, losing clothes and sustaining burns is par for the course at a Spanish festival.
Casualty report
The casualties. PRACE T-shirt: fireproof! (Note the white spots – that’s fireworks residue.) My favorite shorts: not so much. Not shown: me, sports shoes.
Back home, dead tired and with a strange ringing in my ears, the casualty list was composed: a. shorts, b. sports shoes, c. myself.
Note to self: It’s not a good idea to wear anything synthetic when there’s a possibility of playing with fire, unless you need an excuse to go shopping for new clothes.
The rest of the week will probably pass before the infernal ringing stops. The PRACE T-shirt provided a nice surprise, though: apart from a few discoloured spots where the fireworks hit, the shirt was fine!
Conclusion: PRACE wear – fireproof!
The dancing devil by La Vella Gracia.
The final fireworks dance-off of the inhabitants of hell begins in the square. Plenty of space not to shower the audience in sparks.
The dangers (or joy) of participating in “Festa Mejor” – this is the moment I should’ve noticed that I chose the wrong clothes for this festival!
Tourists and locals alike are being showered in sparks by this fire maiden at the beginning of the parade.
The outer battlements of Park Güell in the early morning, tourists are already streaming in.
The replica of the “dragon” at park Güell has arrived bearing fire. Note to self: There’s no place to run when you’re on top of trash containers surrounded by a crowd.
The dancing devil of La Vella Gracia before his descent on the cornered public
The meeting of hell’s inhabitants in Pl. de la Vila de Gràcia.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” – John Tukey
Introduction
Data visualisation continues to change the way we see, interpret and understand the world around us. But it may be surprising to learn that visualisation techniques were embraced long before the age of computing. One example dates back to 1854: John Snow’s visualisation that helped demystify the cholera outbreak in the Soho district of London (Snow, 1855). John Snow, one of the fathers of modern epidemiology, made his famous map (Figure 1) of the Soho district, plotting the locations of deaths alongside the street water pumps in the neighborhood. The visualisation provided the first clear evidence linking cholera transmission to a contaminated water supply.
Here, I introduce a visualisation that helps explore the world’s most powerful computers (i.e. supercomputers). Its main purpose is to enable an interactive visual geo-exploration of supercomputers worldwide. Thus, the visualisation can simply be considered a pictorial representation of the Top500.org list. The rest of the post gives an overview of the visualisation and how it was developed. The visualisation itself is accessible at the URL below: https://goo.gl/sbapBc
Figure 1: John Snow’s Cholera map.
Overview: Visualisation Pipeline
The visualisation is delivered through a web-based application and was produced over the stages sketched in Figure 2. First, the data was collected from the Top500.org rankings according to the June 2017 list, with the aim of covering the top 100 supercomputers. The data was scraped using a Python script that mainly used the urllib and BeautifulSoup modules. In addition to the rankings, the scraped data included location info (e.g. city, country) and specification-related info (e.g. Rmax, Rpeak). The location info was used to obtain latitude and longitude coordinates via the Google Maps API.
Subsequently, the data was transformed into the JSON format, again using Python. The JSON-structured data defines the markers to be plotted on the map; this JSON output is what the Google Maps API refers to as “data layers”. The map visualisation is rendered using the Google Maps API along with the JSON data layers. Finally, the visualisation is integrated within a simple web application that also provides interactivity features.
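As a rough sketch of this transformation step, the scraped records can be wrapped in the GeoJSON FeatureCollection structure that the Google Maps API loads as a data layer. The field names below are illustrative, not the scraper's real schema:

```python
import json

# Hypothetical scraped records: rank, name, Rmax plus geocoded coordinates.
records = [
    {"rank": 1, "name": "Sunway TaihuLight", "lat": 31.5, "lng": 120.3,
     "rmax_tflops": 93014.6},
    {"rank": 2, "name": "Tianhe-2A", "lat": 23.1, "lng": 113.3,
     "rmax_tflops": 33862.7},
]

def to_data_layer(records):
    """Wrap scraped rows in a GeoJSON FeatureCollection - the structure
    the Google Maps API can load as a data layer."""
    features = [{
        "type": "Feature",
        # GeoJSON stores coordinates as [longitude, latitude]
        "geometry": {"type": "Point", "coordinates": [r["lng"], r["lat"]]},
        "properties": {k: v for k, v in r.items() if k not in ("lat", "lng")},
    } for r in records]
    return {"type": "FeatureCollection", "features": features}

layer = to_data_layer(records)
print(json.dumps(layer)[:40], "...")
```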
Figure 2: Overview of visualisation pipeline.
Visual Design
The visualisation is provided on top of Google Maps. The locations of supercomputers are plotted with markers, as shown below in Figure 3. Three colours are used for the markers: green (ranks 1-10), yellow (ranks 11-50) and orange (ranks 51-100).
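The colour scheme can be expressed as a tiny helper function (a sketch, not the application's actual code):

```python
def marker_colour(rank):
    """Marker colour for a supercomputer's Top500 rank, matching the
    three ranking bands described above."""
    if rank <= 10:
        return "green"
    if rank <= 50:
        return "yellow"
    return "orange"

print(marker_colour(1), marker_colour(42), marker_colour(99))
```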
Figure 3: Visual design.
Interactivity
The user can easily apply filters (e.g. Top 10) using choices provided in a drop-down box. Moreover, the map markers are clickable, so that more detailed information on a particular supercomputer can be displayed on demand.
Figure 4: Viewing details on-demand.
References
Snow, J. (1855). On the Mode of Communication of Cholera. John Churchill.
The Guardian. (2013). Retrieved from: https://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map
The algorithm we talk about here made it to the cover!
It’s time for another blog post, apparently! Just as last time, I have decided to fill it with science! This time I will talk mostly about the “Deep Q Network” and the “Stochastic Gradient Descent” algorithm. These are fairly advanced topics, but I hope to present them in a manner that makes them as comprehensible as possible. Of course, I will provide an update on my project as well.
The machine learning part of my project is going very well, and is indeed working as expected. I have implemented several variations of neural-network-based Q-learning, including the Deep Q Network discussed below. In the end, we have replicated most features of the DQN except for the introductory “convolutional layers”. All these layers do is convert the image of the screen into a representation that the computer can understand. Depending on how we finally incorporate the existing simulation, these layers may be put back in.
Our Monte Carlo methods right now are limited to the Stochastic Gradient Descent algorithm and some random sampling during the training phase of the Deep Q Network, as we’ve had some problems with our Monte Carlo Tree Search. This is the Monte Carlo method mentioned in the last post for navigating towards a goal.
It does, however, look like we have found a way to make it work, though some implementation details are still missing. Where we previously tried to make the tree search work at the scale of individual agents, we have now moved to a larger scale, which means we no longer have the problems we had before with the need for communication between agents.
Deep Q Network
This network was first introduced in [1], published in the journal Nature, where the authors managed to play a wide variety of Atari games without having to change a single thing. The Deep Q Network was the first to enable the combination of (deep) neural networks and reinforcement learning in a manner that scales to the size of problems we work on today. The main new feature, called “experience replay”, is inspired by a real neurobiological process in our brain. [1]
Compared to the dynamic programming problem of Q-learning, we do “neuro-dynamic programming”, i.e. we replace the Q-function from the dynamic programming problem with a neural network.
In the following video, based on an open source implementation of the Deep Q Network, we see that in the beginning the agent has no idea what’s going on. However, as time progresses, it not only learns how to control itself, it learns that the best strategy is to create a passage to the back playing area! At the bottom of the page, you can also find links to DeepMind’s videos of their implementation.
Basics of Q learning
In the previous post we talked about Markov Decision Processes. Q-learning is really just a specific way of solving MDPs. It involves a Q-function Q(s, a) which assigns a numerical value, the “reward”, for taking a certain action a in a state s. By keeping track of tuples of information of the form (s, a, r), i.e. the state, the action taken and the reward, we can construct a so-called (action) policy (function).
A policy is a rule which tells you that in a certain state you take some action. We are usually interested in the optimal policy, which gives us the best course of action. A simple solution, though not very applicable in real-world situations, is explicitly storing a table specifying which action to take. In the case of a tabular approach, we simply update the table whenever we receive a better reward than the one we already have for the state. In the case of using neural networks, we use the stochastic gradient descent algorithm discussed below, together with the reward, to guide our policy function towards the optimal policy.
An important thing to mention here is that we shouldn’t always follow the optimal policy, as this would lead to us not obtaining new information about our Q-function. Therefore, every now and then, we take some random action instead of the one given by the optimal policy. A smart thing to do is to do this almost all the time in the beginning, but reduce it over time. We can perhaps see the analogy to growing up: children often do things we, as adults, know not to do; this way they learn, by exploring the state space and seeing if they are rewarded or punished.
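The tabular approach and the exploration trick can be sketched together in a few lines. This toy example (a five-state corridor with made-up hyperparameters, not our project's code) uses the standard temporal-difference Q-learning update with epsilon-greedy exploration:

```python
import random
from collections import defaultdict

# Toy environment: states 0..4 on a line, actions -1/+1, reward 1 for
# reaching state 4. ALPHA/GAMMA/EPSILON are illustrative choices.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
ACTIONS = (-1, +1)
Q = defaultdict(float)                    # Q[(state, action)] -> value

random.seed(0)
for episode in range(500):
    s = 0
    while s != 4:
        if random.random() < EPSILON:     # explore: take a random action
            a = random.choice(ACTIONS)
        else:                             # exploit: current best action
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), 4)
        r = 1.0 if s_next == 4 else 0.0
        best_next = 0.0 if s_next == 4 else max(Q[(s_next, b)] for b in ACTIONS)
        # Standard Q-learning update: move towards reward + discounted best.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy)  # the agent should learn to walk right from every state
```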
Experience Replay
During learning, episodes of sensory input and output are stored and later randomly sampled, over and over again, to combine previous and new experiences. An episode contains information about which action was taken in a state, and what reward this gave. A similar process takes place in the brain’s hippocampus, where recent experiences are reactivated and processed during rest periods, such as sleep. [1]
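A minimal sketch of such a replay memory (not the DQN paper's implementation) could look like this, with each stored experience reduced to a (state, action, reward, next_state) tuple:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of past experiences."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest experiences fall out

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive
        # frames - the repeated re-use described above.
        return random.sample(list(self.memory), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(500):                  # push some fake experiences
    buf.push(t, t % 4, 0.0, t + 1)
batch = buf.sample(32)
print(len(buf.memory), len(batch))    # 100 32
```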
Stochastic Gradient Descent
Let’s, for a moment, skip the stochastic part and concentrate on the gradient descent. Imagine you are on a hill and want to get down as quickly as possible, i.e. take the shortest path, having no access to a map or any information about where you might be or the surroundings. Probably, your best bet is to always go in the direction of the steepest descent.
In the case of machine learning, we are in a somewhat more abstract world, and would like not to get to the ground, but to find the (mathematical) function¹ that most closely resembles our data. One way of doing so is minimising the so-called L² norm, which is the “straightforward” generalisation of Pythagoras’ theorem, c = sqrt(a² + b²), from distances between points in the plane to distances between functions. How this works can be seen, I think, quite easily from the gif below. Simply let a = x and b = y, where x and y are the lengths of the catheti along the x- and y-axes, respectively. Then just bend your mind for a while!
Imagine being somewhere on this graph. You want to get down as fast as possible, so always head in the direction of the steepest descent. There is a risk of hitting a local minimum (white dot). However, this can easily be overcome by having a look around.
The Taylor series approximation for the exponential function.
This gives us a direction to go towards! One problem, in the case of very high dimensional functions, is that this can be very time consuming. Therefore, we often use Stochastic Gradient Descent (SGD) instead, which updates only one, or a few, dimensions at a time. This is done randomly, hence the stochastic in the name.
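As a concrete toy example (not from the project), here is SGD fitting a straight line to synthetic data, updating the parameters from one randomly chosen sample at a time instead of computing the full gradient over all points:

```python
import random

# Synthetic, noiseless data on the line y = 3x + 2.
random.seed(1)
data = [(x / 10, 3.0 * (x / 10) + 2.0) for x in range(-20, 21)]

w, b, lr = 0.0, 0.0, 0.05
for step in range(5000):
    x, y = random.choice(data)       # "stochastic": a single random sample
    err = (w * x + b) - y            # prediction error on that sample
    w -= lr * 2 * err * x            # d(err^2)/dw = 2*err*x
    b -= lr * 2 * err                # d(err^2)/db = 2*err

print(round(w, 2), round(b, 2))      # approaches the true values 3 and 2
```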
How on earth does this relate to the image of the neural net I showed last time, you may ask? Well, this is where the magic that is neural nets comes in. As far as I can see on the internet, no one has explained this in an easy-to-understand fashion; in particular, it would require a very long explanation of exactly what the networks do. You can try to visualise some connection between the dimensions of the function and the neurons (the circles in last week’s picture).
¹ – These functions might not be as simple as most people are used to. For a start, in school we meet functions in 2 dimensions, of the form y = f(x). The functions I’m talking about will often have tens of millions of dimensions.
Well, that’s it. The actual algorithm is presented below, and will not be explained. I refer to the original authors’ paper.
Deep Q-learning algorithm. This version deals with processing images, hence the mention of images. Not intended to be understood by the vast majority, but included for completeness for those who can.
Image is from the cover of Nature, issue 518, Cover illustration: Max Cant /Google DeepMind
Further Reading
I recommend the following papers for anyone with an interest in understanding the mathematical details: Theory of Deep Learning: part I, part II and part III.
For any other aspects, your favourite implementation of PageRank should be able to help you.
Some authoritative textbooks include Reinforcement Learning by Sutton, and the Deep Learning Book by Goodfellow et al. General machine learning books include those by Murphy; Bishop; and Tibshirani & Hastie, in increasing order of mathematical sophistication.
In my last post, I explained what El Niño events are about. In general, El Niño events are part of a bigger oscillation pattern in Earth’s climate called ENSO (El Niño–Southern Oscillation) that changes temperature and precipitation patterns worldwide.
Being able to predict their real consequences around the globe is not easy: it takes a lot of time to evaluate whether a certain weather anomaly in another region is due to mere chance or whether, on the contrary, El Niño events play an important role in it.
Just to give an example, it is believed that during El Niño years the summer monsoon in India is very weak and that the country will suffer from drought that year. Nevertheless, deeper studies show that, although El Niño years tend to be drier than average, the strongest El Niño of the century (1997-1998) produced a monsoon season with above-average rainfall. So, with figures like these, how can one distinguish between causality and coincidence?
Comparison between the Oceanic Niño Index and the Indian monsoon rainfall from 1950 to 2012. Source: https://www.climate.gov/news-features/blogs/enso/enso-and-indian-monsoon%E2%80%A6-not-straightforward-you%E2%80%99d-think
What I have been doing all these weeks in Dublin (apart from getting plenty of rainfall myself) is trying to figure out how these El Niño events have truly made an impact on other parts of the world.
The Pacific Ocean
The first question to answer is how it changes rainfall patterns in the Pacific Ocean (since ENSO originates there, it will obviously change something there) and that can be easily shown in the figure below.
Rainfall during December of several years under different states of the ENSO climate oscillation
We see that, during normal conditions, the weather tends to be drier near the Pacific islands. During the cold phase of ENSO (the La Niña phase), this pattern amplifies and the Pacific Ocean suffers from a dry month in December. However, during 1997 (the strongest El Niño year of the last century), this pattern changes drastically, implying a severe drought near Indonesia and the nearby islands and causing heavy rainfall over the Pacific Ocean and Peru.
Correlation between floods in Florida and ENSO events. Source: https://www.weather.gov/tae/enso
The United States
But ENSO events also change rainfall in the United States, where they can have either bad or good consequences. It is believed that there exists a correlation between rainfall in Florida and El Niño, which causes an increase in the number of floods, although, once again, it is difficult to say whether it is only a coincidence. On the other hand, California, a generally dry place that suffers from drought every once in a while, tends to have more rainfall during El Niño events, which is good news for everyone living there.
Many more places to discover
Of course, apart from the places mentioned in this post (India, the Pacific Ocean, Florida and California), many more places can be affected by El Niño events, and some correlations may not be as straightforward as others. Scientists should be careful about making statements when it comes to ENSO and, once something is discovered, it is important to keep in mind that climate is one of the most unpredictable things the Earth has.
So, if you want to discover which other places are affected by ENSO events and how big this correlation is, do not miss my next post with more news about it. And in the meantime, I will leave you with an animation on how an El Niño event takes place. In the video below you will see how the temperature of the ocean rises in the equatorial Pacific Ocean at the end of 2015, which was one of the strongest ENSO events ever recorded.
Caves, Castles and CAD data extraction – these are the three key subjects of focus in this blog. Now, having spent more than a month in Ljubljana, I have been able to make some progress not only on my SoHPC project but also in exploring this beautiful place. So, let’s take a deeper look at things.
at the Predjama Castle
Over the past month, I got the chance to visit some of the magnificent sights in Slovenia, including the Postojna Cave, the Predjama Castle and the Rakov Škocjan Valley. The feature picture shows the Diamond, the symbol of the Postojna Cave. It is a roughly 5-metre-high stalagmite formation and is snow white – truly shining like a diamond.
What is CAD?
CAD stands for Computer Aided Design, a technical discipline dealing with the application of computers – or supercomputers – to the design of products used in everyday life. The CAD model contains the geometric details of the product and is created using CAD modelling software such as SOLIDWORKS. As mentioned in my previous post, the CAD model serves as the starting point for various computer-based design validation techniques (numerical simulation), including Finite Element Analysis and Computational Fluid Dynamics. The terms “CAD data” or simply “CAD” are often used to refer to the CAD model, a convention followed in this post from now on.
Why is CAD so bad after all?
The CAD data as created by the designer almost invariably contains many small details, such as fillets and holes, and small parts, such as bolts and nuts, which are not of much importance for the numerical simulation. An attempt to use this data directly for numerical simulation would result in overly complex and sometimes even incorrect numerical models. Thus arises the need for CAD data extraction (also referred to as defeaturing or geometry cleanup), wherein we extract the useful data from the CAD model. Usually, this is handled manually within the pre-processor, which is cumbersome and time consuming. As already mentioned in my previous post, this project aims at developing a utility that carries out the defeaturing programmatically and automates the process.
How do we correct it?
Before we go into the technical details of how this utility works, it is essential to know the hierarchy of the topological entities used for modelling in OpenCASCADE.
Solid -> Shell -> Face -> Wire -> Edge -> Vertex
The flow chart below describes the steps followed in this utility for geometry cleanup
Import the CAD data as a .STEP file and sort all the solids present in the model according to their volumes.
Remove all the solids with volumes below a threshold, as these are likely to be nuts, bolts and other small parts not required for the simulation.
Removal of fillets: this step can be subdivided into the following sub-tasks.
Loop over each solid, recognize whether it contains fillets and determine the type of each fillet.
Apply a suitable algorithm for removing the fillet, depending on its type (explained later in detail).
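The first two steps of this flow can be sketched in plain Python. Solids are reduced here to (name, volume) pairs and the threshold is made up; the real utility operates on OpenCASCADE solids read from the .STEP file:

```python
def drop_small_solids(solids, threshold):
    """Sort solids by volume (largest first) and drop those below the
    threshold - likely nuts, bolts and other small parts."""
    ordered = sorted(solids, key=lambda s: s[1], reverse=True)
    return [s for s in ordered if s[1] >= threshold]

# Hypothetical model contents (volumes in cubic millimetres).
solids = [("housing", 1.2e6), ("bolt", 80.0), ("nut", 40.0), ("shaft", 9.0e4)]
kept = drop_small_solids(solids, threshold=1000.0)
print([name for name, _ in kept])  # the bolt and nut are filtered out
```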
For the purpose of debugging (removing the “bugs”, or errors, present so that the code does what it is expected to do), we use simple geometry as input. The input and the modified CAD model for such an example are presented below.
CAD Model of a Box with 4 isolated edge fillets
The modified model with the fillets removed
The next step in the process chain is the preparation of the CFD input files using the modified CAD data. A detailed description of this step along the different types of fillets handled in this utility will follow in the next post. So, stay tuned till then 🙂
Disclaimer: This post will not teach you how to build your own graphics engine. It may, however, delight and confuse you.
In my previous post (which you can find here) I introduced what radiosity is along with why and how we want to use it. In this post I’m going to get into a little bit more technical detail on the bits and pieces that go into computer graphics in general and are used to help in the final calculation of radiosity. To begin with, I’ll talk about how the z-buffer is used to sort out which polygons are visible in a scene. Just to remind you, polygons are the shapes (here I use triangles) that make up any and every object in a virtual scene.
Buffering… Please Wait
So you’ve got your virtual scene, perhaps a room filled with virtual doggos, and you’ve figured out how to put a little camera in it to view the scene (with a fancy perspective projection?) but how do you figure out which parts of the scene are visible to you? In the real world it’s easy, light from objects behind other objects simply doesn’t reach your eye, but when rendering a virtual scene the computer isn’t sending out billions of rays of light and checking what hits its virtual eye, that would be far too expensive. Instead we have to be clever about it and one way to do this is by using the z-buffer.
The basic idea is simple, for every polygon in your view, for every pixel inside that polygon, compare the depth of that pixel (typically the z-coordinate, hence z-buffer) with the value currently in the z-buffer to decide which pixel should be displayed. For two intersecting polygons the final z-buffer and rendered image look like this.
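The per-pixel depth test can be sketched as follows (a toy stand-in with hand-made "fragments", not the engine's actual code):

```python
# A fragment is (x, y, depth, colour): one candidate pixel produced by
# rasterising a polygon. Keep it only if it is closer than what is
# already stored at that pixel.
W, H = 4, 3
INF = float("inf")
zbuffer = [[INF] * W for _ in range(H)]   # start infinitely far away
image = [["bg"] * W for _ in range(H)]

def draw(fragments):
    for x, y, depth, colour in fragments:
        if depth < zbuffer[y][x]:         # closer than the current pixel?
            zbuffer[y][x] = depth
            image[y][x] = colour

draw([(1, 1, 5.0, "red"), (2, 1, 5.0, "red")])    # nearer polygon
draw([(1, 1, 9.0, "blue"), (0, 0, 2.0, "blue")])  # partly hidden polygon
print(image[1][1], image[0][0])  # red blue - the hidden fragment lost
```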
For a sample scene, a room with coloured walls and a white box inside, the z-buffer and final image looks like this.
Bro, Do You Even Clip?
During the projection step, but before we use the z-buffer, we choose a near plane, basically a kind of split in the scene separating polygons that are far from us from polygons that are too close to or behind the virtual camera. Any polygons in front of this plane should be rendered and any behind it should be discarded. However, there is the tricky situation of there being polygons that happen to have points on either side of the near plane. These polygons must be clipped, that is they are cut so that the bit that’s in front of the near plane remains and the bit behind it is discarded.
Below you can see the situations where no clipping is needed, where one point of the triangle is cut leaving two triangles, and lastly where two points are cut leaving just one smaller triangle.
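These three cases can be sketched as one pass of Sutherland-Hodgman style clipping against the near plane z = NEAR, keeping the part with z >= NEAR (the plane value and vertices below are made up). The result has 3 vertices (no clipping), 4 (a quad, i.e. two triangles) or 0 (fully discarded):

```python
NEAR = 1.0

def clip_against_near(points):
    """Clip a convex polygon against the plane z = NEAR."""
    out = []
    for i, a in enumerate(points):
        b = points[(i + 1) % len(points)]
        a_in, b_in = a[2] >= NEAR, b[2] >= NEAR
        if a_in:
            out.append(a)
        if a_in != b_in:                       # this edge crosses the plane
            t = (NEAR - a[2]) / (b[2] - a[2])  # where it crosses
            out.append(tuple(a[k] + t * (b[k] - a[k]) for k in range(3)))
    return out

tri = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 0.0)]  # one vertex behind
print(len(clip_against_near(tri)))  # 4: one corner cut off leaves a quad
```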
Can You Make It Look Less… Tron?
So with the help of the techniques above I’ve actually managed to get a kind of first version completed of the rendering process. Check out the strange looking results below.
There are a few things to notice that are wrong about the previous image. For one, it’s far too dark – radiosity must be being lost somewhere. Also, two neighbouring triangles should have similar shading values, and it’s clearly seen in this image that the triangles seem to be forming a kind of stripy pattern. There are also artifacts appearing between certain triangles because of the drawing method. The last problem I’m not too worried about, but the first two are indicative of some kind of bug in the code. Maybe it’ll get fixed in time, maybe it won’t.
On my first Monday morning in Copenhagen, Konstantinos and I met the guys working at the Niels Bohr Institute and we had a very delicious breakfast together. Seriously, Danish pastries are so good I can understand why the Danes named them after themselves!* Soon after that, my supervisor handed me a paper with some notes about my project, “Tracing in 4D data” (check out my introduction blog post if you didn’t read it!).
By the afternoon of the same day I had already completed 4 of the 6 points on the notes’ list: getting familiar with my project, OpenCV and its implementation of various 2D object tracking algorithms, and having fun testing them with my laptop’s webcam. The next point was “Move to 3D”. That’s what I have been doing so far regarding my project and what I am going to talk about in this blog post.
In the next two sections you’ll find a description of the object tracking algorithms I implemented during my project. If you’re not interested in the math, you can skip directly to the last section where you can see some nice visualizations of the testing results of the algorithm!
*This statement is not true but made only for comical purposes. Use this joke at your own risk. Actually, Danish pastries are called this everywhere except in Denmark, where they are called Vienna bread (wienerbrød), as they were first made in Denmark in 1840 by Viennese chefs! Click here to know more about this.
Median Flow
Forward-backward consistency assumption. [1]
Among the tracking algorithms available in OpenCV, Median Flow is the one I finally chose to generalise to 4D data after reading some academic papers. The algorithm [1] is mainly based on one idea: a tracker should be able to track the object regardless of the direction of time-flow. This concept is well explained by the image on the right. Point 1 is visible in both images and the tracker is able to track it correctly: tracking this point forward and backward results in identical trajectories. On the other hand, Point 2 is occluded in the second image and the tracker localizes a different point. Tracking this point backward, the tracker finds a different location than the original one. The algorithm measures the discrepancies between these forward and backward trajectories, and if they differ significantly, the forward trajectory is considered incorrect. This Forward-Backward error penalizes inconsistent trajectories and enables the algorithm to reliably detect tracking failures and select reliable trajectories in video sequences.
Block diagram of Median Flow. [1]
But how do you actually track the objects? Well, the algorithm’s block diagram is shown on the image on the left. The tracker accepts a pair of images (the current frame and the next consecutive frame of a video) and a bounding box that locates the object to track in the first frame, and it outputs a new bounding box that locates the object in the next frame. Initially, a set of points is initialized on a rectangular grid within the bounding box. These points identify the object and they are the actual elements tracked in the video. To track these points Median Flow relies on a sparse optical flow feature tracker called Lucas-Kanade [2]. You can read more about it on the next section. This is the ‘Flow’ part in ‘Median Flow’. The quality of the point predictions is then estimated and each point is assigned an error (a combination of Forward-Backward error and other measures). 50% of the worst predictions are filtered out and only the remaining predictions are used to estimate the displacement of the bounding box and scale change using median over each spatial dimension. This is the ‘Median’ part!
Median Flow at work. Red points are correctly tracked and the ones chosen to update the bounding box for the next frame.
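The 'Median' step above can be sketched in a couple of lines; the displacement values here are made up, but note how the median makes the bounding-box update robust to the one badly tracked outlier point:

```python
import statistics

def median_displacement(deltas):
    """Move the bounding box by the median displacement of the surviving
    tracked points, taken independently in each spatial dimension."""
    dx = statistics.median(d[0] for d in deltas)
    dy = statistics.median(d[1] for d in deltas)
    return dx, dy

# Per-point displacements between two frames; the last one is an outlier.
deltas = [(2.0, 1.0), (2.1, 0.9), (1.9, 1.1), (15.0, -7.0)]
print(median_displacement(deltas))  # about (2.05, 0.95)
```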
To improve the accuracy of Median Flow we can use other error measures, like normalized cross-correlation, which turns out to be complementary to the Forward-Backward error! [1] The name sounds fancy, but again the concept is very simple: we compare a small region around each tracked point in the current frame and the next one, and measure how similar they are. If the points are tracked correctly, the regions should be identical. This is true under the assumption already made by optical-flow-based tracking.
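The forward-backward check itself is simple to sketch (the two trackers below are hypothetical stand-ins for the optical-flow tracker):

```python
import math

def forward_backward_error(p0, track_fwd, track_bwd):
    """Track a point forward one frame, track the result backward, and
    measure how far it lands from where it started."""
    p_fwd = track_fwd(p0)            # position in the next frame
    p_back = track_bwd(p_fwd)        # tracked back to the first frame
    return math.dist(p0, p_back)     # large error => discard this track

shift = lambda p: (p[0] + 3, p[1])   # consistent motion between frames
unshift = lambda p: (p[0] - 3, p[1])
lost = lambda p: (0, 0)              # e.g. the point got occluded

print(forward_backward_error((10, 10), shift, unshift))  # 0.0
print(forward_backward_error((10, 10), shift, lost))     # large
```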
Lucas-Kanade feature tracker
The Lucas-Kanade tracker is a widely used differential method for sparse optical flow estimation. The concept of optical flow is quite old and it extends outside the field of Computer Vision. In this blog you can think of optical flow as a vector that contains, for every dimension, the number of pixels a point has moved between one frame and the next. To compute optical flow we first have to make an important assumption: the brightness of the region we are tracking in the video doesn’t change between consecutive frames. Mathematically this translates to:

I(x, y, t) = I(x + Δx, y + Δy, t + Δt)

Then, if the movement is small, we can develop the right-hand side using the first order of the Taylor series and we get:

I(x + Δx, y + Δy, t + Δt) ≈ I(x, y, t) + (∂I/∂x) Δx + (∂I/∂y) Δy + (∂I/∂t) Δt

Now from the original equation it follows that:

(∂I/∂x) Δx + (∂I/∂y) Δy + (∂I/∂t) Δt = 0

Dividing by Δt we finally obtain the optical flow equation:

I_x V_x + I_y V_y = −I_t

where V_x = Δx/Δt and V_y = Δy/Δt are the velocities, or the components of the optical flow, and I_x, I_y, I_t are the partial derivatives of the image. We can also write this equation in such a way that it’s still valid for 3 dimensions:

I_x V_x + I_y V_y + I_z V_z = −I_t

The problem now is that this is one equation in two (or three) unknowns and cannot be solved as such. This is known as the aperture problem of optical flow algorithms, which basically means that we can estimate optical flow only in the direction of the gradient and not in the direction perpendicular to it. To actually solve the equation we need an additional constraint, and there are various approaches and algorithms for estimating the actual flow. The Lucas-Kanade tracker applies the least squares principle. Making the same assumptions as before (brightness constancy and small movement) for the neighborhood of the point to track, we can apply the optical flow equation to all the pixels in the window centered around the point. We obtain a system of equations that can be written in matrix form as A v = b, where v = (V_x, V_y)ᵀ is the optical flow and:

A = [ I_x(p₁) I_y(p₁) ; I_x(p₂) I_y(p₂) ; … ; I_x(pₙ) I_y(pₙ) ]   and   b = [ −I_t(p₁) ; −I_t(p₂) ; … ; −I_t(pₙ) ]

And the least-squares solution is:

v = (AᵀA)⁻¹ Aᵀ b
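This least-squares step is easy to sketch with NumPy. The gradient values in the window below are fabricated so that the true flow is known exactly (brightness constancy gives It = -(Ix*Vx + Iy*Vy) for the true flow):

```python
import numpy as np

def lucas_kanade_step(Ix, Iy, It):
    """Solve the over-determined system A v = b in the least-squares
    sense for the flow v = (Vx, Vy) over one window."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # one row per pixel
    b = -It.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v

rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))        # spatial gradients over a 5x5 window
Iy = rng.normal(size=(5, 5))
true_v = np.array([0.5, 0.25])
It = -(Ix * true_v[0] + Iy * true_v[1])   # temporal gradient, no noise

print(lucas_kanade_step(Ix, Iy, It))  # recovers the true flow
```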
Finally! Now we understand the classical Lucas-Kanade algorithm. But when we have to implement it we still have some ways to improve the accuracy and robustness of this original version of the algorithm, for example adding the words ‘iterative’ and ‘pyramidal’ to it [2].
Pyramidal implementation of Lucas-Kanade feature tracker.
By ‘iterative’ I just mean iterating the classical Lucas-Kanade multiple times, each time using the solution found in the previous iteration and letting the algorithm converge. This is what we need to do in practice to obtain an accurate solution. The second trick is to build pyramidal representations of the images: this means resizing every image, halving the size multiple times. So, for example, for an image of size 640×480 and a pyramid of 3 levels we would also have images of size 320×240, 160×120 and 80×60. The reason behind this is to allow the algorithm to handle large pixel motion: the movement of a pixel is a lot smaller on the top level of the pyramid (divided by 2^L, where L is the top level of the pyramid, exactly). We apply the iterative algorithm to every level of the pyramid, using the solution of the current level as the initial guess of optical flow for the next one.
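The pyramid construction can be sketched by repeatedly averaging 2x2 blocks (a simple stand-in for the smoothed subsampling a real implementation would use), reproducing the 640×480 example from the text:

```python
import numpy as np

def build_pyramid(image, levels):
    """Return [image, half-size, quarter-size, ...] down to `levels` deep."""
    pyramid = [image]
    for _ in range(levels):
        h, w = pyramid[-1].shape
        img = pyramid[-1][: h - h % 2, : w - w % 2]   # crop to even size
        # Average each 2x2 block into one pixel of the next level.
        pyramid.append(img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

pyr = build_pyramid(np.zeros((480, 640)), levels=3)
print([p.shape for p in pyr])
```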
Now the Lucas-Kanade feature tracker works quite well!
Reinventing the wheel to move forward
If you read the previous sections and you thought that there is nothing here that cannot be extended to more dimensions, you are correct! Every step described up to now can be easily generalized to three-dimensional volumes, Lucas-Kanade included (the proof is left as an exercise for the reader). Everything I described is already well implemented by the OpenCV library for typical 2D videos. That’s why at first I started diving into the code of the OpenCV implementations. But for me it was too optimized to make something useful out of it. Let’s remember that the OpenCV project was started 18 years ago and the main contributors were a group of Russian optimization experts who worked at Intel – and they have had 18 years to optimize it even more! As an epigraph written and highlighted just outside my office says: “Optimization hinders evolution”, and I guess I experienced this firsthand. To be fair to the OpenCV project, I don’t think it was ever the intention of the developers to implement 3D versions of the algorithms.
But enough rambling! Long story short, I reinvented the wheel and wrote my own implementation of these algorithms in Python using NumPy, then generalized them to work on 3D movies. To test the new Median Flow I generated my own 4D data to make the problem very easy for the algorithm – for example, a volume containing a ball, to which I then applied small translations.
3D Median Flow test results.
It works! Now we can try to make the problem a little more difficult and also have a bit of fun. For example, let’s add gravity to the ball! Or as my professor would say, let’s use a numerical integrator to solve the system of ordinary differential equations of the laws of kinematics for the ball subject to a gravitational field!
It took me way longer than I care to admit to make this.
But it’s not just balls! The beautiful thing about the algorithm is that it doesn’t care about the object it is tracking; we can apply it to any sort of data, for example 3D MRI data! Let’s generate a 3D movie by again applying some translations to make the head move, and then track something, like the nose!
Having fun with 3D Median Flow. Database source: University of North Carolina
These toy problems are quite easy for the algorithm to handle because the object moves very slowly between one frame and the next, and they are also quite easy for my laptop to handle since the sizes of the volumes are quite small. This is not the case for real-world research data. Increasing the size of the data increases the computation required by the algorithm. Also, when the object moves faster between frames, we need to increase the total number of levels of the pyramidal representation and the size of the integration window for the Lucas-Kanade algorithm. This means even more computation – a lot more!
That’s why this is a problem for supercomputers and why I am moving on to the next point of my notes’ list: “High performance computing”. But this is a topic for another blog post.
References
[1] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Forward-backward error: Automatic detection of tracking failures. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2756–2759. IEEE, 2010.
[2] Jean-Yves Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker. Description of the algorithm. Intel Corporation, 5, 2001.
My project has reached a turning point! What I’ve now got is a Multipole2Local segment of the Fast Multipole Method written in C++ using HIP, so it runs on both AMD and Nvidia GPUs. The next step is to start implementing neat memory hacks and data-reuse schemes to optimize the running time.
Most of my time wasn’t spent writing new code, but instead debugging what I had previously written. And I’ve had quite a number of those bugs. At least this many, accompanied by the appropriate reaction, of course. To visualize this work, I could’ve taken screenshots of the error messages, but instead I’ll seize this opportunity to show you what sorts of neat bugs I’ve found around the office!
Epilachninae Syntaxerronia in its natural habitat, the source code. They mark their territory with compiler errors, which makes finding them quite easy. This particular individual has made its presence very clear with in-line comments in the hope of attracting mates, but that also makes it more susceptible to predators.
I know what you’re now thinking: “What’s this ‘Multipole2Local’ anyway?” and “Wait a minute, you’re slacking off! Halfway through the program and you’ve only got one part!”. Don’t worry, I’ll address these issues right away!
The Fast Multipole Method consists of five separate phases: Particle2Multipole, Multipole2Multipole, Multipole2Local, Local2Local and Particle2Particle. As I mentioned in my previous post, the area in which we want to compute the potential is divided into small elementary boxes. The box division is done by dividing the whole area into eight smaller boxes, and each of those boxes is again divided into eight smaller boxes, and so on, until the desired coarseness of the division is reached. The smallest boxes are called elementary boxes. You can then describe the whole structure using an octree.
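A minimal sketch of that subdivision (in Python rather than the project’s C++, with a made-up function name and assuming a unit cube) shows how a point maps to its elementary box at a given octree level:

```python
def elementary_box(point, level, lo=0.0, hi=1.0):
    """Index (ix, iy, iz) of the elementary box containing `point`
    after `level` recursive subdivisions of the cube [lo, hi]^3."""
    n = 2 ** level                       # boxes per edge at this level
    size = (hi - lo) / n
    # min(..., n - 1) keeps points on the upper boundary inside the grid
    return tuple(min(int((c - lo) / size), n - 1) for c in point)

# Level 2: 4 boxes per edge, 4^3 = 64 elementary boxes in total
print(elementary_box((0.1, 0.6, 0.9), 2))   # (0, 2, 3)
```

Walking from a box index at level L to the index at level L-1 is just an integer halving of each coordinate, which is what makes the octree traversal in the shift phases cheap.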
In the Particle2Multipole phase, the spherical harmonics expansion of the potential function is calculated for each elementary box.
In the Multipole2Multipole phase, the individual potential function expansions of the elementary boxes are shifted up the octree and added up to form the potential function of each parent box.
In the Multipole2Local phase, the far-field effects of several spherical harmonics expansions are converted into a local Taylor expansion of the potential function.
In the Local2Local phase, the Taylor expansions are added up and shifted down the octree. Now the far-field potential is represented as a Taylor expansion in each elementary box.
Finally, in the Particle2Particle phase, the near-field potentials are calculated classically, pairwise.
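As a toy illustration of that last phase (in Python rather than the project’s C++/HIP, with hypothetical names), the near-field P2P computation is just a direct pairwise 1/r sum:

```python
import numpy as np

def particle2particle(pos, charge):
    """Direct pairwise 1/r potential, i.e. what the P2P phase
    evaluates for particles in neighbouring elementary boxes."""
    phi = np.zeros(len(pos))
    for i in range(len(pos)):
        r = np.linalg.norm(pos - pos[i], axis=1)
        r[i] = np.inf                      # exclude self-interaction
        phi[i] = np.sum(charge / r)
    return phi

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
q = np.ones(3)
phi = particle2particle(pos, q)
print(phi)   # phi[0] = 1/1 + 1/2 = 1.5
```

This O(n²) sum is affordable only because it is restricted to nearby boxes; the far field goes through the multipole machinery instead.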
The reason for working with just one segment of this algorithm becomes apparent from the following figure:
The runtime measurement of each segment of the algorithm [1].
Most of the runtime is spent converting multipole expansions into local Taylor expansions, so there’s more speedup to be gained from optimizing the M2L segment than from the other parts. This project is an unusual one from the high-performance computing point of view: usually you want to reduce a program’s runtime from several days to several minutes, but here we want to go from 2 milliseconds to 1 millisecond. In order to do that, one needs to utilize all sorts of non-trivial tricks close to the hardware.
A rare picture of Heterocera Segfaulticus. This nocturnal species is quite elusive, as they’re usually encountered during runtime, but its nutrition, the liberal usage of memcpy and funky pointer arithmetic, is found nearly everywhere. The large eyes are optimal for detecting tasty binaries in low-light environments.
The memory bandwidth acts as a bottleneck in this computation, which means that cutting down memory usage or speeding up memory access are the ways to increase efficiency. There are two ways to proceed. One is to look into the usage of the GPU’s constant (or texture) memory, which cannot be edited during kernel execution but can be accessed faster than global memory. On the other hand, the GPU has FLOPs in abundance, so we wouldn’t need to store any data if we could compute it on the fly. These are the approaches I will be looking into during the rest of my stay at the Jülich Supercomputing Centre.
In other news, I took a break from all of this bug-hunting and coffee-drinking and paid a weekend-long visit to Copenhagen, to another Summer of HPC site, where Alessandro and Konstantinos are working on their fascinating projects. Their workplace is the Niels Bohr Institute, the very building where Niels Bohr, Werner Heisenberg and other legendary physicists developed the theory of quantum mechanics in the 1920s. And nowadays the building is overrun with computer scientists; these truly are the end times.
The Niels Bohr Institute in Copenhagen. The vertical rails with smaller sticks protruding out of them are a beautiful lighting installation representing particle measurements at one of the enormous detectors at CERN.
August 7th 5:30 am, Copenhagen Central Station. I spent the weekend exploring the city along with my roommate Alessandro and my friend Antti, who traveled from Jülich to visit us. Among other activities, we walked the helical corridor to reach the top of the Rundetaarn or Round Tower and attended an outdoor chill-out music festival. Unfortunately, it was time for Antti to take his train so we said goodbye and promised to meet soon. Since the day was young, we decided to keep biking and enjoy the sunrise by the waterside. What better place for an excursion than the Langelinie promenade and the world famous statue of The Little Mermaid? Soon we were enjoying the view, clearing our minds while the city was slowly waking up.
10 am, Niels Bohr Institute. With a cup of coffee in my hand, I was ready to address the challenges of the day. But “what is the project you are working on”, you may ask. As I explained in a previous post, my summer mission is to improve the performance of the ocean modeling framework Versatile Ocean Simulation (Veros). Implemented in pure Python, it suffers from a heavy computational cost in the case of large models, requiring the assistance of Bohrium – an automatic parallelization framework that acts as a high performance alternative for the scientific computing library NumPy, targeting CPUs, GPUs and clusters. However, Bohrium can only do so much; even when it outperforms NumPy, simulations are still time consuming. It is clear that an investigation of the underlying issues behind the performance of Veros is essential.
Naturally, during the initial period, my time was mostly spent studying both the Veros documentation and the respective source code. I configured the environment for running existing model setups and experimented with the available settings. Running local tests and benchmarks of various models and backends, including NumPy and Bohrium, helped me familiarize myself with the basic internal components and the hands-on workflow of the ocean simulations.
The golden rule of optimization is: never optimize without profiling the code. Thus, I embarked on a quest to execute large-scale simulations, from hundreds of thousands to millions of elements, in order to compare the performance with and without Bohrium and to analyze the running times of the core methods to identify possible bottlenecks. Such benchmarks are computationally intensive, so I scheduled ocean models of different sizes to run on a university cluster for long periods of time. As expected, Bohrium becomes more efficient than NumPy once the setup contains at least a million elements. Among the most time-consuming methods in the core of Veros are isoneutral mixing and advection, even though they are straightforward number-crunching routines that Bohrium should theoretically accelerate to a greater degree.
In other words, Bohrium makes a difference when the number of elements exceeds a certain threshold but there is still much room for improvement. Towards this direction, there are two possible plans of action: replace certain slow parts with handmade optimized and parallelized code or modify the existing NumPy code in order to improve its acceleration by Bohrium. For the former scenario, I am working on porting methods to Cython by adding static types and compiler directives in order to diverge from the dynamic nature of the language and speed up these specific segments independently of Bohrium. I also intend to take advantage of Cython’s OpenMP support in order to enable parallel processing via threads. For the latter part, I need to study the internal details of Bohrium and examine the C code produced by it for these time intensive methods.
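To illustrate the kind of gap this porting targets, here is a toy comparison (not Veros code) between a pure-Python loop and the equivalent vectorized NumPy expression for a simple upwind-advection-like update; Cython and Bohrium both aim to close or exploit exactly this difference:

```python
import time
import numpy as np

def advect_loop(u, c, dx, dt):
    """Upwind-style update written as an interpreted Python loop."""
    out = u.copy()
    for i in range(1, len(u)):
        out[i] = u[i] - c * dt / dx * (u[i] - u[i - 1])
    return out

def advect_vec(u, c, dx, dt):
    """Same update as one vectorized expression, the form that
    NumPy executes in C and that Bohrium can fuse and parallelize."""
    out = u.copy()
    out[1:] = u[1:] - c * dt / dx * (u[1:] - u[:-1])
    return out

u = np.random.rand(1_000_000)
t0 = time.perf_counter()
a = advect_loop(u, 1.0, 0.1, 0.01)
t1 = time.perf_counter()
b = advect_vec(u, 1.0, 0.1, 0.01)
t2 = time.perf_counter()
print(np.allclose(a, b), f"loop: {t1 - t0:.3f}s, vectorized: {t2 - t1:.5f}s")
```

The loop version typically runs orders of magnitude slower than the vectorized one; Cython with static types would bring the loop form back to compiled speed while keeping the code independent of Bohrium.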
There are multiple possible paths to be considered and tested as solutions to the problem. In the coming weeks I will implement and compare them, making sure that the performance gains from Bohrium are not affected by these changes; if anything, Bohrium should benefit from the code restructuring. Besides working on the project, I plan to visit more beautiful sights of Copenhagen and write a dedicated blog post about the city if there is enough time!
1. The Problem
Sharing data among researchers is usually an afterthought. In our case, data is already shared publicly on a data repository called DSpace. DSpace serves as an open-access repository for scholarly data published in various scientific fields; the main focus here is on climate data.
Through DSpace, climate researchers and institutions can easily share their datasets. However, a shared file can be considered a “black box” that needs to be opened first in order to know what is inside. In fact, climate simulation models generate vast amounts of data, stored in the standard NetCDF format. A typical NetCDF file contains a set of many dimensions and variables. With so many files, researchers can waste a lot of time trying to find the appropriate one (if any).
The goal of our project is to produce intelligible metadata about the NetCDF-based data. The metadata is to be stored and indexed in a query-able format, so that search and query tasks can be conducted effectively. In this manner, climate researchers can easily discover and use NetCDF data.
2. Data Source
As already mentioned, the DSpace repository is the main data source of NetCDF files. DSpace is a digital service that collects, preserves, and distributes digital material. Our particular focus is on climate datasets provided by Dr Eleni Katragkou from the Department of Meteorology and Climatology, Aristotle University of Thessaloniki. The datasets are available through the URL below: https://goo.gl/3pkW9n
3. Background: NetCDF Files (Rew, & Davis, 1990) and (Unidata, 2017)
As the NetCDF format is usually used within particular scientific fields such as life sciences and climatology, it is expected that the reader may not be familiar with it. This section serves as a brief background to the NetCDF format, and its underlying structure.
NetCDF stands for “Network Common Data Form”. It emerged as an extension to NASA’s Common Data Format (CDF). NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. NetCDF was developed and is maintained at Unidata.
Unidata is a diverse community of education and research institutions with the common goal of sharing geoscience data and the tools to access and visualise that data. Unidata aims to provide data, software tools, and support to enhance Earth-system education and research. Funded primarily by the National Science Foundation (NSF), Unidata is one of the University Corporation for Atmospheric Research (UCAR)’s Community Programs (UCP).
The NetCDF data abstraction models a scientific data set as a collection of named multidimensional variables along with their coordinate systems and some of their named auxiliary properties. The NetCDF abstraction is illustrated in Figure 1 with an example of the kinds of data that may be contained in a NetCDF file. Each NetCDF file has three components: dimensions, variables, and attributes.
A variable has a name, a data type, and a shape described by a list of dimensions. Scalar variables have an empty list of dimensions. The variables in the example in Figure 1 represent a 2D array of surface temperatures on a latitude/longitude grid and a 3D array of relative humidities defined on the same grid, but with a dimension representing atmospheric level. Any NetCDF variable may also have an associated list of attributes to represent information about the variable.
Figure 1. An example of the NetCDF abstraction.
4. Project Objectives
Basically, we are attempting to facilitate search and query of NetCDF files uploaded to the DSpace repository. A set of objectives was defined as follows:
Defining the relevant metadata structure to be extracted from NetCDF files.
Extracting the metadata from the NetCDF files.
Storing and indexing the extracted metadata.
Supporting search and query functionalities.
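As a sketch of what these objectives amount to, the header of a NetCDF file can be flattened into one query-able record per variable. The header dictionary and field names below are hypothetical (in practice the header would be read with a library such as netCDF4); only the standard library is used here:

```python
import json

# Hypothetical header of one NetCDF file (what its dimensions,
# variables and attributes might look like)
header = {
    "dimensions": {"lat": 180, "lon": 360, "time": 120},
    "variables": {
        "t2": {"dims": ["time", "lat", "lon"], "units": "K",
               "long_name": "2-metre temperature"},
    },
    "global_attrs": {"model": "WRF", "experiment": "RCP8.5"},
}

def to_metadata(filename, header):
    """Flatten a NetCDF-style header into one indexable record
    per variable, ready to be stored in a search engine."""
    return [
        {"file": filename, "variable": var, **info,
         "shape": [header["dimensions"][d] for d in info["dims"]],
         **header["global_attrs"]}
        for var, info in header["variables"].items()
    ]

records = to_metadata("wrf_t2_monthly.nc", header)
print(json.dumps(records[0], indent=2))
```

Once in this flat form, the records can be indexed so that a researcher can query, say, all files containing a temperature variable on a monthly time axis without downloading and opening each one.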
The project is developed in a collaboration between GrNet and the Aristotle University of Thessaloniki. GrNet provides us with access to the ARIS supercomputing facility in Greece, and they also manage the DSpace repository.
References
Rew, R., & Davis, G. (1990). NetCDF: An Interface for Scientific Data Access. IEEE Computer Graphics and Applications, 10(4), 76-82.
Unidata. (2017). Retrieved on 10/08/2017 from http://www.unidata.ucar.edu/about/tour/index.html
Greetings from Edinburgh’s sunny festival season 🙂
During the past few weeks I have been both exploring this vibrant city and cultivating my Python programming skills. I will give you more details about how my work has been going, but first let me introduce you to some of the adventures I’ve had. From Edinburgh Castle to museums of modern art and street performances, this city has it all. The local pubs have their own vivid rhythm and the traditional, delicious fish and chips combination. We were even brave enough to try haggis and deep-fried Mars bars here. 🙂
But I think the most adventurous day I have had during this trip was hiking in the Pentland Hills Regional Park. The route was challenging and the weather unpredictable, but we managed to complete our mission and climb up and down the hills with the moor flowers.
Fringe Festival 2017
Let’s grab a hot chocolate or a cup of tea now and take a closer look at my journey with Python and geology!!! 🙂
The idea of using automated algorithms to facilitate the work done by geologists (as far as geological interpretation is concerned) is not new, but a recent increase in machine-learning research and the implementation of neural network methods proves that it is time to revisit the topic. That was the motivation for my SoHPC project at EPCC in collaboration with the British Geological Survey (BGS).
Map of boreholes in the Netherlands and Dutch sea
During my summer internship, I have been familiarizing myself with methods for the simultaneous correlation of multiple well logs, and with their current challenges and limitations. This research field is very new to me, but I still find it very exciting. In general, I have focused on applying Python tools and libraries to the preprocessing of well log data from the Netherlands and the Dutch sector of the North Sea continental shelf. Understanding which parameters and measurements will be of greatest use to geologists has also been part of my work, conducted through a publication and literature search.
After the preprocessing is complete I will apply existing well log correlation methods to the preprocessed data. In this way, geologists will be able to evaluate the limitations of the existing methods and the quality of specific well log measurements. The next step would be to create a neural-network-based application for automatic and simultaneous well log correlation. The preliminary design and concept on which I am currently working has proven to be challenging as well.
So I am really looking forward to the final results of this effort in the future and I am more than excited to be participating in such an interesting project.
And just to give you a glimpse of well log correlation results, please see below 🙂
Left: density distribution as a function of depth for various well logs. Right: after well log correlation, the density of the well logs as a function of relative geologic time.
In my last entry I promised to talk about the comparison between the fixed point method and the projected gradient method, so I will dwell on this now. But first of all, let me show you some of the magnificent places we have been to during the last few weeks.
Besides probably the most famous cave in Slovenia (Postojna cave) we also visited a few smaller and less known caves, e.g. in Rak Skocjan, which were all very interesting. In fact, Rak Skocjan is the oldest landscape park in Slovenia and it is basically a valley which is enclosed on all sides by cliffs.
On our way to a cave
Inside a cave that belongs to Rak Skocjan
Another impressive place we inspected was Lake Cerknica, an intermittent lake. During the winter it turns into the largest lake in Slovenia, but during the summer it totally dries out. Luckily, we visited the lake early enough, so we could still find a part that even allowed us to swim.
The totally dried-out Lake Cerknica with its sinks that make it look like something from another planet
At this point I want to thank our supervisor Leon Kos who brought us to these hidden places with his car!
In my last blog entry I briefly introduced the topic of the project. After fixing some bugs in the fixed point method implementation, it is now possible to compare the results with the projected gradient method. In particular, one can look at the quality measure RSE again in order to observe a difference in convergence behaviour. For this purpose, I created a randomly chosen test case and ran 500 iterations of both methods. In the following figure you can see that the fixed point method converges much more slowly to the solution than the projected gradient method.
Trend of the quality measure RSE for both FPM and PGM
Nevertheless, for the PGM we measured a noticeably larger execution time. This can be justified by the rather expensive computation of a step size in each iteration, which leads to accurate results. How much faster the fixed point method is compared to the projected gradient method will be clarified in the next post, where I will come up with some benchmarks.
What is a high-performance computer (HPC)? In simple terms, nowadays it is a cluster of numerous ‘mini’ computers (nodes) interlinked by the same network. When a big calculation is submitted to the HPC, it is fragmented into thousands or millions of small tasks to be completed by individual nodes simultaneously. Hence the speed of the calculation is improved.
However, sometimes you have to wait for the results of one task in order to complete another, so the small tasks cannot be split as evenly as we would like between all the nodes. Therefore, there are always some nodes that are busier than others. But how do we obtain this information and make sure the scheduling process is as fair as possible? Obviously, it is impossible to schedule millions of tasks manually, and without optimization a lot of computing time is wasted, which is neither fast nor economical.
Yes! My group at IT4I has foreseen the problem and started solving it by developing a package containing a C++ core with a Python interface. My job here is to develop Python code to visually describe the traffic between different nodes, as shown in the figure below:
A network diagram showing node traffic
This is a network diagram showing the traffic between the server and 8 workers (0-7). The arrows represent the flow of data between each worker. The width corresponds to the volume of data. Clearly you can see there is a large amount of data being exported from W0 to W5 as well as from W3 to W6. Finally, after huge, tedious communication between all workers, they send their final results to the server and the results are finally returned to us.
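One way to obtain such arrow widths, using made-up traffic numbers and only the standard library, is to scale each directed edge by its share of the heaviest flow:

```python
# Hypothetical traffic volumes in MB between workers: (src, dst) -> MB
traffic = {("W0", "W5"): 840.0, ("W3", "W6"): 910.0,
           ("W1", "W2"): 35.0, ("W4", "W7"): 60.0}

def edge_widths(traffic, max_width=10.0):
    """Scale each directed edge's line width by its share of the
    heaviest flow, so the busiest link gets `max_width`."""
    peak = max(traffic.values())
    return {edge: max_width * vol / peak for edge, vol in traffic.items()}

widths = edge_widths(traffic)
print(widths[("W3", "W6")])   # 10.0, the heaviest link
```

The resulting width dictionary can then be handed to whichever drawing library renders the diagram, which keeps the visual encoding independent of the plotting backend.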
In the process of making this I have compared two Python modules which can generate network diagrams: NetworkX and Graph-tool. NetworkX focuses on network analysis and network-related data mining, e.g. finding the shortest path between two nodes. However, since more or less all nodes communicate with each other, there is no important shortest path for us. Graph-tool, on the other hand, is good at more general data visualisation and less focused on network analysis.
So, do you know which module I chose to draw the graph above? 😉
In my previous post it was all just dry theory and nothing to show (apart from a great food picture). Today the post will only be about presenting the “cool stuff”, and for that, just for you guys, my buddy Arnau filmed my horrendous face showing you what I have been working on.
Scheme of the web application
I’ve been really busy the last couple of weeks developing the web application, and I am glad to say that most of the heavy lifting is done. The web app I developed somehow works (sometimes even I wonder how that is possible).
I used the modern approach to building it: the front-end runs in your browser and only requests data from the server via a REST API and WebSockets. This method has its ups and downs, but let’s say that for this kind of web app it is the best solution.
The figure on the right should help you get the general idea and show the big picture. I am not going to explain the architecture and the boring back-end stuff, but rather show the fancy things on the front-end, which can be separated into 3 parts:
charts (cool)
one single number (fancy)
interactive 3D model of the Galileo supercomputer (super mega awesome)
The Charts & Numbers
Charts are used for viewing historical data in a timeline-based fashion. This way you can see whether the cluster behaves normally as it should, or whether there are problems like huge rises in temperature. What you see below is part of the public overview. These charts and big numbers use a heavily aggregated dataset to show meaningful information in a fast and easy way.
The 3D Model
And now comes the most super cool awesome part of all: the 3D model of the Galileo supercomputer. You can see it in action in the video on top (and if you didn’t watch it, now is the right time). I took a Blender model which the good guys from CINECA prepared for me, made some small changes to it, converted it to a Blend4Web project, extracted the source files and integrated it into the web app. Sounds easy, right? It uses WebSockets and MQTT to get the latest data and color-codes it so you can easily see what is happening with the cluster.
To sum it up, things are looking good and promising, but there is still a lot of work to do. I didn’t even show another part of the job that displays the running jobs on Galileo and the data accompanying them.
In today’s blog post, I’ll talk about the successful (yaaaay!) progress of my project from a technical point of view. So if you are into Computer Science, I’m sure you will find this interesting! This post will be structured using some small pieces of the scientific article we’re writing, but of course I can’t write everything about the research here.
Introduction
Figure 1: Apache Spark architecture
Apache Spark applications written in Scala run on top of the JVM (Java Virtual Machine) and because of this they cannot match the floating-point performance of Fortran/C(++) MPI programs, which are compiled to machine code. Despite this, Spark has many desirable features of (distributed) parallel applications, such as fault tolerance, node-aware distributed storage, caching and automated memory management (see Figure 1 for an overview of the Apache Spark architecture). Yet we are curious about the limits of the performance of Apache Spark applications on High Performance Computing problems. By writing reference code in C++ with MPI, and also with GPI-2 for one-sided communication, we aim to carry out performance comparisons.
We do not expect the resulting code to be truly competitive with MPI, in terms of performance, in production applications. Still, such experiments may be valuable for engineers and programmers from the Big Data world who implement often computationally demanding algorithms, such as machine learning or clustering algorithms.
Some literature review
Figure 2: HDFS architecture
As mentioned in previous work, it is widely known that High Performance Computing frameworks based on MPI outrun Apache Spark or HDFS-based Big Data frameworks (see Figure 2 for an overview of the HDFS architecture), usually by more than an order of magnitude, across a variety of application domains, e.g. SVMs and K-nearest neighbour clustering, K-means clustering, graph analytics, and large-scale matrix factorizations. Recent performance analyses of Spark have shown that computing load was the main bottleneck in a wide number of Spark applications, particularly in serialization and deserialization time.
It has been shown that it is possible to extend pure MPI-based applications to be elastic in the number of nodes using periodic data redistribution among the required MPI ranks. However, this assumes that we are still using MPI as the programming framework, hence we do not get the significant benefits Apache Spark provides for large-scale data processing, such as node fault tolerance.
Unlike the approaches described in previous work, we propose a fair comparison, with similar implementations of the algorithms in Spark on HDFS and in C++ with MPI. We will run both with data distributed across commodity cluster architectures inside a distributed file system, using a one-sided communication approach with native C libraries that provide an interface encapsulating the file system’s functionality, so as to preserve the node fault tolerance exhibited by Spark with HDFS implementations.
Partitioned Global Address Space
To provide our non-Spark applications with fault tolerance features, we’ve used the GASPI API for the Partitioned Global Address Space (PGAS).
Figure 3: GASPI one-sided communication segment-based architecture
GASPI is a communication library for C/C++ and Fortran based on a PGAS-style communication model in which each process owns a partition of a globally accessible memory space. PGAS (Partitioned Global Address Space) programming models have been discussed as an alternative to MPI for some time. The PGAS approach offers the developer an abstract shared address space which simplifies the programming task and at the same time facilitates data locality, thread-based programming and asynchronous communication. The goal of the GASPI project is to develop a suitable programming tool for the wider HPC community by defining a standard with a reliable basis for future developments through the PGAS API of Fraunhofer ITWM. Furthermore, an implementation of the standard as a highly portable open-source library will be available. The standard also defines interfaces for performance analysis, for which tools are being developed in the project. The evaluation of the libraries is done via the parallel re-implementation of industrial applications up to and including production status.
One of the proposed benchmarked algorithms: K-Means clustering
To benchmark the different approaches we propose, we have implemented several algorithms, one of them being K-means clustering. K-means clustering is a technique commonly used in machine learning to organize observations into k sets, or clusters, which are representative of the set of observations at large. Observations (S) are represented as n-dimensional vectors, and the output of the algorithm is a set of k n-dimensional cluster centers (not necessarily elements of the original data set) that characterize the observations. Cluster centers are chosen to minimize the within-cluster sum of squares, i.e. the sum of the squared distance to each observation in the cluster:
min_S Σ_{i=1..k} Σ_{x ∈ S_i} ||x − u_i||², where S_i is the set of observations in cluster i and u_i is the mean of the observations in S_i.
This problem is NP-hard and can be exactly solved with complexity O(n^(dk+1) log n). In practice, approximation algorithms are commonly used to get results that are accurate to within a given threshold by terminating before finally converging, but these algorithms can still take a significant amount of time for large datasets when many clusters are required.
The main steps of the K-means algorithm are as follows:
Select an initial partition with K clusters; repeat steps 2 and 3 until cluster membership stabilizes.
Generate a new partition by assigning each pattern to its closest cluster center.
Compute new cluster centers.
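The steps above can be sketched in a few lines of NumPy. This is a minimal Lloyd's-algorithm toy with a fixed iteration count and hand-picked initial centers, not our benchmarked C++/Spark implementations:

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Alternate the two K-means steps: assign each observation to
    its closest centre, then recompute each centre as a cluster mean."""
    centers = centers.copy()
    for _ in range(iters):
        # Step 2: assign every observation to its closest centre
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its cluster
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

# Three well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.3, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
init = X[[0, 50, 100]]          # one seed point from each blob
centers, labels = kmeans(X, init)
print(np.allclose(centers, [[0, 0], [5, 5], [0, 5]], atol=0.2))  # True
```

A production implementation additionally needs a convergence test on cluster membership, a strategy for empty clusters, and a smarter initialization, but the two alternating steps are exactly these.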
The following illustration shows the results of the K-Means algorithm on a 2-dimensional dataset with three clusters:
We’ve tested our implementations of the algorithm using different datasets, some artificial and some real data obtained from bioinformatics studies by researchers at the Biology Faculty of the Slovak Academy of Sciences. The artificial data goes up to more than 10 million samples with 2 dimensions for some instances (to benchmark computational time), and also includes some high-dimensional (more than 100 dimensions) instances (to benchmark time resilience).
So that’s all for today. If you are really interested in the work that’s going on, just wait for the next blog post or for the article to be published (if we’re lucky enough). Here is a video of the city of Bratislava celebrating my progress with some fireworks:
Over the past decades, the planet has been experiencing worldwide climate change. In particular, the global warming process is characterized by drastic weather events, such as long periods of drought, or short intense precipitation. These can crucially affect vegetation and animal species on our planet, including human beings.
The impact of global warming is expected to be especially strong in the Mediterranean regions, which, together with the North Eastern European regions, emerge as the primary hot-spots. Realistic predictions of how the climate will evolve, and assessments of the relative impact of human behavior on global warming, are essential for devising a strategy to avoid and counteract negative impacts.
Changes in average temperature over Southern Europe. Gray color refers to the current climate (up to 2017). Dots refer to the average annual temperature values, and the line to the 5-year rolling average. Credit: Sofiadis Ioannis, 2017, Study of Climate Change over Europe for the 21st Century with the use of regional climate simulation driven by the RCP8.5 scenario
The international scientific community thus plays a fundamental role in understanding and predicting the effects of climate change. By means of numerical models, it is possible to simulate processes occurring in climate systems, which allows the forecast of their evolution over time. Today, climate models are more complex and computationally expensive than when they first appeared. As a consequence, they are also much more demanding in terms of computing and storage requirements.
In my project I am handling output data from the regional climate Weather Research and Forecasting (WRF) model, a numerical weather prediction system for both atmospheric research and operational forecasting needs, which can generate atmospheric simulations using real data (observations, analyses) or idealized conditions. It has been developed by the National Center for Atmospheric Research (NCAR), the National Oceanic and Atmospheric Administration (NOAA), the Air Force Weather Agency (AFWA), the Naval Research Laboratory, the University of Oklahoma, and the Federal Aviation Administration (FAA).
Probability distribution function trend over time. Credit: Intergovernmental Panel on Climate Change, 2007
The use of visual representations has always been – and is – an integral part of climate science, as it helps scientists to better understand complex climate phenomena. Output of the WRF model is stored in the NetCDF (Network Common Data Form) format, which is very common in the climate community because of its advantages. For instance, a NetCDF file includes information about the data it contains; it can be accessed by computers with different ways of storing integers, characters, and floating-point numbers; a small subset of a large dataset may be accessed efficiently; and data may be appended to a properly structured NetCDF file without copying the dataset or redefining its structure.
In the past three weeks, I focused on temperature variation, which is the most evident feature of climate change. I wrote some pipelines in order to extract all the information from different NetCDF output data of a “pessimistic” simulation. What does a mental attitude like pessimism have to do with climate models? Well, the most relevant impact on climate evolution is anthropogenic, i.e. impact which is a consequence of human activity. In particular, climate models use a wide range of possible changes in future anthropogenic greenhouse gas emission (the biggest culprit of climate change), and the simulation I am dealing with assumes that greenhouse gas emissions will continue to increase until the end of the 21st century, representing the worst case (hopefully not the real one).
I have produced temperature color maps over Central and Southern Europe for different years, with a final animation of local temperature change over time. I have also done this for different altitudes (see the animation, for 1990 only), which carry important information. In fact, global warming often occurs faster at higher altitudes – on mountains, for instance. And what happens on mountains has a deep impact on critical aspects of our economic and social system, such as access to water (re)sources. Warming causes the snow to melt, with a subsequent increase in the temperature of the ground. This implies that snow will accumulate more slowly in the winter and melt faster in the spring, with less – and less fresh – water flowing down the mountains during the summer. Given that more than one billion people already lack daily access to clean water, climate change will definitely aggravate this situation, with serious implications for agriculture as well.
As Jim Miller, professor in the Department of Marine and Coastal Sciences in the School of Environmental and Biological Sciences, said a couple of years ago: “Water is going to be a major problem over the next few decades anyway and climate change is going to exacerbate it. Who gets the water? Are you going to use the water to grow crops or are you going to use the water to fill swimming pools in LA? Those are ultimately social and political decisions. With climate change, those changes could be more dramatic.”
It’s up to all of us to decide how Miller’s questions should be answered.
Pasta, pizza and ParaView are the topics I want to talk about in this post. As a Catalan guy, I’m used to the Mediterranean diet and it is impossible for me not to talk about food when staying in Italy. Of course, I would like to add the fourth P here: paella. But let’s stop daydreaming and get back to business!
Pasta, pizza and more
The homemade pizza made using ingredients from Parma.
Italians say that you cannot get fat from pasta and pizza if they are properly made. And to do that properly, one must visit the city of Parma, home of the famous Parmigiano cheese and prosciutto di Parma, which I bought in abundant quantities. Guess what you can do with all this food: a homemade pizza (and probably the best pizza ever so far)! The outcome? Well, you can guess what happens when top Italian ingredients are mixed. Magic happens! I also visited the beautiful city of Venice (click on the names of the cities to check the photo album).
ParaViewWeb
Demo of ParaViewWeb application “Visualizer” on an iPad.
Now, let’s stop the food talk for a moment. In my previous post, I explained that a simple and efficient way to visualize scientific data is to display it in a web browser. Well, that’s exactly what the ParaViewWeb API is for. ParaViewWeb is a set of tools for building web applications where data can be loaded, visualized and interacted with directly in the browser.
How does it work?
Let us imagine a customer entering a restaurant. The customer looks at the menu and tells the waiter: “I want the pizza ‘chlorophyll concentration on the Mediterranean Sea’”. The waiter replies “very good choice, sir” and goes into the kitchen to tell the chef what the customer wants. The chef prepares the meal and gives it to the waiter, and the waiter serves it to the customer. Finally, the customer gets what he asked for on the table. Here, the customer is the client, the waiter is host 1 and the chef is host 2.
Using this example, I will explain the problems I have to deal with if I want my “restaurant” (or web application) to work properly:
Customers generally don’t want to wait (or wait as little as possible), so the waiter must be fast in delivering the visualization to the client.
Waiters need to remember what each customer wants and not mix up or forget their deliveries.
The chef must be efficient in preparing the data. Here, ParaView Python is the one dealing with getting, loading and pre-processing the data, before it is given to the customer.
As in every restaurant, the customer needs to stick with what is on the menu, i.e., the available apps. Variations to the menu might be easily done but changes can take more time.
What I have done so far and what I will be doing
During these past weeks I’ve been setting up the “restaurant”, i.e., I have set up my host 1 using Apache2. For now, my host 2 is the same machine as host 1; however, PICO will be used as host 2 in the final implementation of my application. In the coming three weeks, I will be developing the “menu”, i.e., the web application to visualize the data-sets from OGS.
A visit from OGS
I cannot end this post without talking about a short visit I received from OGS. Cosimo Livi, who is doing his master thesis in OGS, came to CINECA as he is currently involved in data post-processing using ParaView. Together, we worked to develop a methodology to efficiently convert the data-sets from NetCDF to VTK for ParaView.
Cosimo Livi from OGS and me during his visit at CINECA last week.
In my previous post I mentioned that a team at IT4Innovations developed an interesting Plugin for Blender which is useful for processing Computed Tomography (CT) images. In this post, I will briefly discuss how this tool makes it extremely easy to generate 3D models of bones from CT data. The same tool can also be used to generate models of various organs and calculate the volume of various tissues in the human body.
Blender plugins are Python scripts that the user interacts with using the Blender GUI. The actual image processing algorithms are implemented in C++ with OpenMP, with PyObject interfaces so that they can be called from Python. In brief, the plugin developed by IT4I allows somebody with little knowledge of programming to use a high performance cluster for DICOM image processing.
At the click of a button, the user can load in a series of ordered DICOM images, view the loaded images by using a slider to move through them and select various orientations to view the data. Furthermore, the user can apply various denoising filters to view the data in greater clarity, or to improve the effectiveness of later image processing.
Sagittal view of a human skull
Segmented data using K-Means Algorithm
While only one 2D slice of the data is shown above, the image processing operations are applied to all selected slices of the 3D image data. Once the data has been segmented, it is only necessary to select a segment to be converted into a structural 3D model called a “mesh model”.
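To illustrate what the segmentation step is doing, here is a minimal 1-D k-means sketch that splits intensity values into classes. This is purely illustrative: the plugin’s real implementation is written in C++ with OpenMP and operates on full 3D volumes.

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Minimal 1-D k-means: cluster intensity values into k classes."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        # assign each value to its nearest center
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        # recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return centers, labels
```

On CT data the “values” would be voxel intensities, and the resulting classes roughly correspond to tissue types (e.g. bone vs. soft tissue).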
The level of detail of the mesh model generated by the tool can be controlled by various parameters. Below is a higher detail model generated from the same images as the previous model.
The appropriate level of detail for a model produced using this tool will vary depending on what it is to be used for. Highly detailed models are computationally expensive to render and, as such, may not be suitable for games or interactive tools that require the model to be rendered in real time. However, for generating pre-rendered animations, a highly detailed model may be more impressive, and worth the computational resources required for production.
In my next post I will discuss pre-processing techniques that can be used to improve the results of the K-means segmentation.
Most Summer of HPC participants arrived in Ostrava on Sunday. For them, an adventure that would see them away from home, friends and family for two months had just begun. This adventure would see them working on HPC-related projects in various European HPC centres, where they would learn a lot, furthering their knowledge and skills in HPC, programming and life in general.
The learning began during the PRACE Summer of HPC training week which took place between Monday 3rd – Friday 7th July 2017 at IT4Innovations, Czech Republic.
PRACE Summer of HPC 2017 group photo
During this week, participants learned about PRACE and the Summer of HPC program. Since the program is part of PRACE’s outreach activities, they were taught about social media and its impact. They were also shown the Summer of HPC blog, and were advised on how best to write blog posts and how to share them through social media channels.
The Summer of HPC participants were also introduced to various aspects of HPC. This included hands-on sessions on accessing an HPC system, parallel programming and visualization. Various programming subjects, including an introduction to MPI, an introduction to OpenMP, vectorization and parallel debugging, were also covered.
PRACE SoHPC Participants learning hard
During the week, participants and organisers of the Summer of HPC program took part in various social events. Each night all of them would have dinner together – giving everyone the opportunity to get to know each other better and create friendships.
PRACE Summer of HPC 2017 dinner – one of many
Other team building activities included a visit to the lower area of Vítkovice – a site of industrial heritage which includes ironworks with a unique collection of industrial architecture, and clubbing at Stodolní street. More details of these social events can be found in Dimitra’s “Ostrava : A city with a heart of steel” post.
PRACE Summer of HPC 2017 lower area of Vítkovice visit
After a final barbecue farewell dinner, participants set off for their project sites in various locations around Europe. There they would stay for the next seven weeks, learning more about various HPC related subjects, working hard on their projects and exploring the wonders of the cities and countries they would be based in.
My project has been progressing quite well – but more on that in the next blog post. This time, I present a more detailed description of the beginning of my project. The narrative may have been dramatized in some points.
Business as usual
The clock’s drawing close to midnight, the rain is battering against the office window and the scent of fresh coffee is floating in the air. Usually, this sort of atmosphere would be relaxing, but unfortunately the cup of coffee is the last one I can afford for now. I’ve been without a case for weeks and apparently I’ll have to sell my trusty office chair to satiate my caffeine addiction. I hear that the stand-up working style isn’t too bad though, it’ll keep your legs healthy and posture proper.
My usual day at work: answering old phones and manually browsing analog databases.
The name’s Mikkonen. I’m the current owner of the not-so-successful ZZX Jülich Detective Agency. My specialization is missing HPC implementations of algorithms. Business hasn’t been too good lately, mostly because of all these new competing faces on the scene and because libraries like OpenCL make it too easy to write parallel code for all sorts of platforms. It’s ever more difficult for a man to make do with this profession in this corner of the world. I wonder if competitive coffee drinking is a thing? Or perhaps exotic table-dancing?
Curious encounter
All of a sudden there’s a knock on the door. What a peculiar hour for an insurance salesman to bother me. “Come on in!”, I yell reluctantly. As the slowly-opening door creaks – damn, I need to oil the hinges again – in walks a young woman in a red dress. She’s about in her mid-twenties and her brown, slightly curly hair extends past her shoulders. Her distress is obvious. However, her HIPs instantly capture my attention.
“Have a seat. How can I help you, Miss…?”
“R9 Fury Radeon.”
She might be beautiful, but her trouble-causing capability is supreme.
“I have a problem, Mr. Mikkonen”, she began telling her tale, “and you’re my last hope.” Quite intriguing and alarming, at the same time.
“Go on.”
“I’ve contacted many agencies but none of them wanted to take up my task. You’re the last one on my list. It all started two weeks ago…” Oh, blast, now that I think about it, my agency literally is the last one in the phone book. Maybe that’s why business has been so slow? I’d need to change the name, but what about the memory of Zacharias “Zulu” Xavier, my late mentor, who tragically lost his life in a freak compiling accident…? Oh crap, I’ve got to listen to her.
…and thus I desperately need an extremely fast Coulombic force solver.”
“I’m not too familiar with n-body solvers. How do you expect me to come up with the code?”
“I’m a graphics card from a renowned family, AMD, but lately software developers have focused mostly on our rivals, the Nvidia.” Sorting out family feuds? Not my favourite kind of gig.
“Not all of their work is in vain, though.”
She sets down a small piece of paper onto my desk.
“Here are the access codes to a Git repository with Fast Multipole Method solver written with CUDA. Don’t worry, they’re perfectly legally acquired. I want you to clone it and transform it into something I could use.” How about AM Investigations, or Antti’s Covert Surveillance…
“Um, hello…?”
“Oh, yes! I was already thinking of a possible parallelization scheme! I’ll accept this task. Sounds like a fascinating challenge!” It’s not like I have too many choices here, if I want to avert a coffee-less future.
“Thank you! You’re a brave man, Mr. Mikkonen.”
“Just leave me your contact details, I’ll get in touch with you when there’s been a development.”
“Of course, here’s my card. Report to me as soon as you come up with anything!”, R9 Fury uttered as she hurried off, out of my office.
My hunch tells me that there’ll be some major hindrances which she conveniently forgot to mention. The card she gave me reads:
Hmm, Jurock. That certainly sounds like a proper computation cluster, but it’s located in someone’s office, not in the big machine hall. Her family certainly isn’t doing well at all.
The job
Well, no use just sitting around, there’s coding to be done! My sluggish but trusty personal computation machine boots up and a Git login prompt appears. Let’s see, username: johannes.pekkilae… I peer into the CUDA FMM repository and the source code is jumbled with loop unrolls, preprocessor macros and lots of template magic! A headache emerges just from looking at it. It’s going to take me weeks to port this!
A snippet of Johannes’s code. It’s blazingly fast, but not too pretty or easily understandable. The Atom editor font is nice, at least.
Pondering my next step, my thoughts drift back to that woman. The woman, and her HIPs. Of course – HIP, the Heterogeneous-compute Interface for Portability! It’s a collection of tools which enables the portable development of GPU code. Because of the prevalence of CUDA, HIP contains scripts with which one can ‘hipify’ their code, converting it from CUDA C into more general HIP C++. The perfect utility for this situation!
Except when it’s not. The CUDA function __ldg wasn’t yet available in the HIP framework and the conversion failed. No matter – the function basically just speeds up data reading by using the read-only data cache. By changing it to use global memory instead, we get working code. We lose the performance benefit for now, but optimizations can be done afterwards.
After a successful source code translation, it’s time to compile! Compilation is done with hipcc, a Perl script which calls the nvcc or hcc compiler with the appropriate parameters depending on the target infrastructure. Easy as A-M-D! Wait – “error: expected an expression”, “error: identifier is undefined”, “error: no instance of function template”. It appears that the script is still in development. Well, at least the compiler itself didn’t experience a segmentation fault this time. This night is going to be long, but quite interesting.
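For the curious, the core idea behind such a translation script can be sketched in a few lines of Python. This is a toy of my own making, of course – the real hipify tooling handles kernel launch syntax, headers and much more, not just API name substitution:

```python
import re

# A handful of real CUDA -> HIP runtime API renames (far from complete).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source):
    """Replace known CUDA API names with their HIP equivalents."""
    # longest names first, so no name is clobbered by a shorter prefix
    pattern = re.compile("|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)
```

Running it over `cudaMalloc(&p, n); cudaFree(p);` yields `hipMalloc(&p, n); hipFree(p);` – the same textual spirit, if not the sophistication, of the official scripts.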
Tools of the trade – a GTX1080Ti, my laptop and a set of equations.
The project I have laid my hands on goes under the fancy title of
Hybrid Monte Carlo Method for Matrix Computation on P100 GPUs
Now most will understand the ending, but not the beginning of the above. Thus I’ll endeavor to decipher the statement.
In principle the main goal of the project is to accelerate the computation of a so-called preconditioner using Markov Chain Monte Carlo methods. The latter will then be used to accelerate the solution of linear systems of equations by means of iterative methods.
“Monte Carlo” already suggests some meddling of chance in the process but the remainder of the statement is still cryptic. Still, all shall be explained in due course. First a brief digression:
Linear systems of equations
From a linear system of equations to a matrix.
Linear systems are the workhorse of scientific and everyday computing: whether in computer games or in simulations of the universe, they are everywhere.
A linear system is a set of equations, as depicted in the figure on the right, where the unknowns (x, y, z) appear untainted by any function (i.e., not as x²). In the figure one can also see the process of transforming a system of equations into a shorthand notation using matrices and vectors – something one may remember from high school. The matrix-vector formulation is more common in science.
Every linear system can be solved, in principle, using the well-known Gaussian elimination procedure. However, this method requires ~N³ arithmetic operations (with N being the number of equations) and is thus far too expensive for many applications. In general, engineering and scientific problems yield linear systems with millions or billions of equations. Solving them using direct methods would slow the progress of technology to a crawl.
Hence one resorts to iterative methods. Methods which compute, step-by-step, an increasingly refined approximation of a solution using basically only matrix multiplications. Since matrices from real problems are often sparse (i.e., most of their entries are zero), the cost of iterative methods is a lot less than for direct methods.
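As a toy illustration of an iterative method (my own example, not the project’s code), here is Jacobi iteration in plain Python: each sweep refines the approximate solution of A x = b using only cheap row operations.

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration for A x = b: every unknown is updated from the
    previous iterate (converges for diagonally dominant A)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # build the new iterate entirely from the old one
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

For a sparse matrix, each sweep touches only the nonzero entries – which is exactly why iterative methods beat Gaussian elimination on large, sparse real-world systems.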
Preconditioners – the why and how.
Alas! There is no such thing as a free lunch. The drawback of iterative methods is that the number of steps to reach an acceptable solution depends on the system matrix (A in the above image). The role of a preconditioning matrix is thus to accelerate the convergence of an iterative method. To achieve this it should approximate the inverse of the system matrix well and be easy to compute.
Suffice it to say that the study of preconditioners is difficult enough to warrant a dedicated field of research in mathematics. Many simple and elaborate preconditioners already exist.
Our place at the frontier of science
Since the problem of convergence is not a new one, much has been done in the field of preconditioning already. Generally, however, the preconditioners suffer from bad parallelizability – an undesirable illness in today’s world of high performance computing.
The method which I shall implement in due course may be used to compute a rough approximation to an inverse matrix to be used as a preconditioner. Furthermore, the method is cheap and being a Monte Carlo method, should in principle scale well, i.e., the computation should require less time if more computers are used.
To achieve the desired acceleration, theoretical analysis and practical implementation utilizing the enormous computational resources of the GPU will have to go hand-in-hand.
The random component we will utilize has a twist, however:
Markov Chain Monte Carlo
A simple graph which could be a Markov chain
What is a Markov Chain? For now, the example in the accompanying figure shall suffice. It is a graph (a collection of nodes and edges) where each node represents some abstract state and the edges provide the possible ways (and their probabilities) for a state to change.
We can compute a solution of a linear system essentially by creating a Markov chain (read: graph) from the system matrix, picking a state at random and performing a random walk on the graph. That is, starting from some state, at each time step (say, every minute) we choose one of the paths (edges) available to us, flip a coin, and based on the outcome and the weight of the edge decide whether to go along it or not. If we do this for a sufficiently long time and apply a certain formula, we can compute entries of the preconditioner based on the nodes we visit.
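To make the idea concrete, here is a minimal, unoptimized Python sketch of such a scheme: random walks on H = I − A estimate single entries of the inverse via the Neumann series A⁻¹ = I + H + H² + …. This is only my illustrative take – the actual project uses a more refined formulation and runs on GPUs.

```python
import random

def mc_inverse_entry(A, i, j, n_walks=1000, max_steps=30, seed=1):
    """Estimate entry (i, j) of A^-1 by random walks on H = I - A
    (valid only when the Neumann series converges)."""
    n = len(A)
    H = [[(1.0 if r == c else 0.0) - A[r][c] for c in range(n)]
         for r in range(n)]
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        state, weight = i, 1.0
        contrib = 1.0 if i == j else 0.0  # the k = 0 term (H^0 = I)
        for _ in range(max_steps):
            row = H[state]
            nz = [c for c, v in enumerate(row) if v != 0.0]
            if not nz:
                break
            norm = sum(abs(row[c]) for c in nz)
            # pick the next node with probability |H[state][c]| / norm
            r, acc, nxt = rng.random() * norm, 0.0, nz[-1]
            for c in nz:
                acc += abs(row[c])
                if r <= acc:
                    nxt = c
                    break
            # importance weight: H value divided by transition probability
            weight *= row[nxt] / (abs(row[nxt]) / norm)
            state = nxt
            if state == j:
                contrib += weight  # this visit contributes the k-th term
        total += contrib
    return total / n_walks
```

Each walk is independent of all the others, which is precisely where the good parallel scaling of Monte Carlo methods comes from.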
A (hopefully) more satisfying explanation will have to wait for a follow-up post.
Summa-summarum
The goal is clear:
Accelerate iterative methods by creating a method which is able to quickly compute a good preconditioner using Markov chain Monte Carlo implemented and optimized for the NVIDIA P100 GPUs.
Compare the performance of the resulting implementation with other state-of-the-art methods.
Investigate additional stochastic methods with the same goal in mind.
Here we go, my second official blog post (does this mean I can add blog writing to the skills section of my CV?). I think I’m starting to get the hang of this blog writing stuff (I’d appreciate it if you’d let me know if this is the case).
I’m far from home but at least Vienna is close
I’ve been in Bratislava for two weeks now so I feel I can write a post about my initial impression of the city.
The general consensus I got after speaking to some natives of Bratislava was that the city can be fully explored in 2 or 3 days (I guess that means I’m getting a lot of work done over the summer!). But I have to disagree – it seems that every day there’s something new to see. Below are just some examples of the street performers (who are actually good!) and live music that I’ve seen so far.
In the centre of Bratislava (under St. Michael’s Tower) there’s a zero mile marker that gave me a sudden feeling of being homesick… but some local gelato from Koun (the best in the city, apparently) quickly remedied that!
There’s also much history to be seen in the city, such as Bratislava Castle, which was destroyed by Napoleon. One of the cannonballs used by Napoleon’s army is actually still lodged in the Old Town Hall.
There’s also varied architecture throughout the city, the most prominent being the Communist-era style. But there are some Gothic structures too, such as St Martin’s Cathedral (which has a giant 150 kg gold crown on its spire).
There’s still more for me to see, but at the very worst I can always return home for a decent view.
The view from my room isn’t too shabby
The Fun Stuff
But let’s be honest, you don’t really care about what I’ve been getting up to in the city – you want to know what I’ve been doing on my laptop (or more specifically, on Aurel, the SAS’s supercomputer). Truthfully, not much so far (but that should be changing soon… hopefully!) – I’ve been trying to get familiar with the code I’m working with.
It turns out that MPI has already been implemented in the key bottleneck of the code that I’m working with (so I can kick back and relax for the rest of the summer).
There’s a diagonalisation subroutine that still runs serially, so my first task is to parallelise it (when put like that, it sounds so easy). After successfully completing that, I plan on visualising the crystal orbitals of the nanotube being modelled (although I haven’t fully decided how I’m gonna do that yet…).
I guess this update is a bit light on the science, but I’ve been told always leave your audience wanting more. So you’ll have to come back and check out my next post for that.
Until then I leave you with my favourite street performer in Bratislava…
The bird’s-eye view of Ostrava from the New City Hall Tower
Friends, are you considering applying for the PRACE Summer of HPC program but are concerned that you might not have the relevant experience? Are you worried that your programming skills are not good enough to master HPC and impress the judges? Do you feel like you’ve got nothing except enthusiasm and curiosity? If you do, please take courage and apply. You never know what the program is looking for. I’m going to share with you my own experience on the road to HPC.
When I started my application for the PRACE Summer of HPC program, I thought it would not go anywhere after I ran through a checklist of my own qualifications. However, you’ll never know if you don’t click the submit button.
Michal Colliery: the country’s only National Cultural Heritage Site.
Nationality:
I am a Chinese citizen who is studying in the UK on a British Student Visa. This was actually my biggest concern during my application. Even though the PRACE Summer of HPC program is open to all students who study in a European university, it still wasn’t very clear whether there is a preference for those students who don’t require a visa (e.g. a student who studies in a Schengen country and is going to do project in another Schengen country, meaning that they already have the required visa). But for me, I need a Schengen visa to go to any other country, except the UK. I really appreciate the full support the program provided towards my visa application for this summer, as it does mean much more effort and cost to sponsor me than the others.
Programming background:
I was almost in tears when I finished reading the project requirements and the code test for the first time. Am I actually able to use Fortran, C++/C, Java, etc.? My college had offered an 18-hour intensive C programming course just before I started my application. I was as clueless as a five-year-old toddler – having managed to finish the classic hello-world program and no more. In order to complete the code test, I learnt some basic C++ by watching YouTube videos and reading the documentation from the C course, and finally answered the questions on the fly. I checked my answers in Python, the only language in which I am confident enough to plot a few sets of figures using a couple of loops.
HPC experience:
I use HPC on a day-to-day basis for my own research. Therefore, I know the basics of ssh, the PBS queueing system, etc. – just enough to keep my project going. However, as my project gets more complicated, my progress is held back by my poor programming skills and lack of HPC knowledge. Hence, my desire to join the PRACE Summer of HPC program has grown stronger every day since I first heard about it.
Non-computing experience:
I have a pure chemistry background from my undergraduate studies, involving ZERO programming experience. My one-year long industrial experience only taught me why HPC is important. But again, ZERO programming was involved. Luckily, I’m active in some societies and have acquired a few soft skills.
Music Festival for all – Colours of Ostrava
Here is my story about my journey to the PRACE Summer of HPC program. Guess where I am now. I’m sitting at my desk in the IT4Innovations centre, absorbing the useful new knowledge like a sponge, writing this post for you. Isn’t that great?! Next year, it can be your turn 😉.
Hello, with this post I would like to give an update on my project and other interesting activities. My progress so far mainly involves exercises in vectorization and the use of the Xeon Phi co-processors, so it would be better to discuss the physics involved in my project in a later blog post. Currently I am practicing different methods to accelerate scientific workloads on modern processors. I am also comparing the performance of a dummy application using different compilers and optimization techniques.
Today’s processors feature fast operations over vectors of data. They offer SIMD (Single Instruction, Multiple Data) instructions, which means you can instruct the processor to operate on multiple numbers (e.g. the addition of two vectors of 4 double-precision floating-point numbers) with a single assembly instruction. This is faster because in conventional serial programming each data operation requires its own instruction (SISD – Single Instruction, Single Data), so more instructions must be processed for the same amount of data.
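The idea can be mimicked in plain Python – purely conceptually, since real SIMD happens in hardware registers, not interpreted loops. Here, each loop iteration stands for one vector instruction handling a whole lane-group of elements:

```python
def simd_add(a, b, lanes=4):
    """Conceptual SIMD: each iteration models ONE vector instruction that
    adds `lanes` elements at once, instead of one element per instruction."""
    out = []
    for k in range(0, len(a), lanes):
        # one "instruction": element-wise add of a whole lane-group
        out.extend(x + y for x, y in zip(a[k:k + lanes], b[k:k + lanes]))
    return out
```

For a million-element array with 4 lanes, the "instruction count" drops by a factor of four – which is the whole point of vectorizing a loop.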
Traditionally, computer architects have added smart mechanisms to processors to increase performance, apart from improving the manufacturing process and the transistor count it allows. Such mechanisms include out-of-order execution and prefetchers. Out-of-order execution automatically reorders the last N issued instructions to avoid under-utilization of the CPU due to pending memory access requests. Prefetching tries to guess and fetch the required data just before it is accessed, hiding memory latencies. I would say the combination of the two mechanisms could compete with the vectorization approach for performance increases, but it is less scalable, especially when trying to increase the number of cores on a chip, due to the significant hardware complexity that comes with such adaptive mechanisms. It seems that lately, processor manufacturers favour adding more vector registers and functionality to their chips, so it is very sensible to design software that utilizes these registers for better performance. It is important to note that SIMD instructions have been used in GPUs for many years, thanks to the regular access patterns of graphics software. The real challenge is to exploit vectorization in scientific software, either by hand or by helping the compiler do it automatically.
During the last week I was also involved in other interesting activities, such as a bike trip with Antti, making a time-lapse video of the Jülich sky, and leaving for a weekend to attend my graduation in England.
My apartment is on the top floor of the highest building in the area, with big windows that let me enjoy the landscape. It also happens that I always carry a Raspberry Pi with me – a small Linux computer with endless capabilities. This was the perfect opportunity to program it to make a time-lapse video of a cloudy day here in Jülich. It captured one photograph every 2 minutes for a day and then produced the video using FFmpeg. You can see the resulting video below, or a portion of it in the post thumbnail.
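The setup can be summarized in a short sketch (file paths and encoder flags here are illustrative, not my exact ones): capturing at 2-minute intervals for a day yields 720 stills, which FFmpeg then stitches into a video.

```python
def timelapse_plan(interval_s=120, duration_h=24, fps=25):
    """Return the number of frames captured and an illustrative
    ffmpeg command for stitching the stills into a video."""
    n_frames = duration_h * 3600 // interval_s
    cmd = ["ffmpeg", "-framerate", str(fps),
           "-i", "frames/img%05d.jpg",          # numbered stills
           "-c:v", "libx264", "-pix_fmt", "yuv420p",
           "timelapse.mp4"]
    return n_frames, cmd
```

At 25 frames per second, those 720 stills compress a whole day of sky into under half a minute of footage.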
Visualisation is gaining growing recognition as a pivotal part of the data analysis process. As the title suggests, the project aims to use data visualisation to help the climate research community answer – or discover – interesting questions about the field. The project particularly adopts interactive visualisations in order to adequately explore and communicate the understanding of large-scale climate datasets.
Data visualisation can be described as the transformation of the symbolic into the geometric. Various benefits can be gained through data visualisation: the pictorial representation of data can help answer questions, or uncover further ones. The usefulness of exploratory data analysis using visual techniques was introduced early on in John Tukey’s landmark textbook Exploratory Data Analysis.
However, visualisation has gained particular significance in the wake of Big Data analytics. Exploratory visualisations have become an imperative tool which allows us to discover and summarise the main characteristics of such large-scale datasets. In our case, the project deals with large-scale datasets of climate research. The data were formerly produced by climate simulation models using the WRF software.
We mainly aim to build a dashboard that can foster climate research in a twofold aspect. First, the dashboard can be used to facilitate the organisation and indexing of climate data. As climate simulation models produce huge amounts of data, it is necessary to automate the process of archiving and indexing of data and metadata as well. Second, the dashboard is intended to provide interactive data visualisations. The visualisations are intended to be produced in a web-based format, which can be interactively explored through the web browser. To achieve this, we utilise Python along with the visualisation library Bokeh. The figure below provides an overview of the project development stages.
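As a sketch of the first, organisational aspect, a few lines of Python can walk an output tree and index the NetCDF files by simulation. The directory layout and file naming here are my own illustrative assumptions, not the project’s actual scheme:

```python
import os

def index_outputs(root):
    """Index NetCDF files found under `root` by their top-level
    simulation directory: {simulation_name: [file paths]}."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".nc"):
                # first path component under root names the simulation
                sim = os.path.relpath(dirpath, root).split(os.sep)[0]
                index.setdefault(sim, []).append(os.path.join(dirpath, name))
    return index
```

An index like this is what the dashboard would query to list available simulations and feed files to the Bokeh visualisations.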
El Niño-Southern Oscillation (ENSO) is a major climate pattern consisting of oscillations of the meteorological parameters in the equatorial Pacific Ocean. It happens every 2 to 7 years, but its period is not stable.
Normal climatic conditions and El Niño conditions in the Pacific Ocean
What is ENSO?
Normal weather conditions are the following:
Low pressures (L) in the Pacific islands, causing rainy weather and warm ocean waters.
High pressures (H) near the eastern American coast, causing colder ocean waters and less rain.
When ENSO happens, this situation is drastically changed:
The low and high pressures switch places (this is why it is called the Southern Oscillation), causing the rain to move towards the east.
Because of this, warm ocean waters move towards the east, causing alterations in the fishing patterns in South America.
ENSO is a phenomenon that involves the ocean, the atmosphere and the land on such a huge scale that it affects weather across the entire world, changing precipitation, temperature and wind flows around the Earth. Among all of these consequences, we can highlight the following: massive forest fires in Indonesia, since it stops raining in that zone; flooding in Peru, caused by severe amounts of rain in a short period of time; drought in India, because of abnormally light monsoon months; and a temporary overall warming of the global climate.
All of this makes ENSO an important part of the climate cycle, a phenomenon well worth studying, and that is why I am working at ICHEC in Dublin this summer.
What am I doing in Dublin?
First of all, I will analyze past El Niño events, whether it is possible to find some trend in them, and how future El Niño events are predicted to behave. Secondly, I will study how El Niño impacts other parts of the world, the correlation between all these events, and how large the impact was in the last event. Finally, I will look at the available forecast systems one season ahead, to check their accuracy and precision in detecting future El Niño events.
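To give a flavour of the first task, here is a toy sketch of how El Niño events could be flagged from a (synthetic) monthly sea-surface-temperature anomaly series, loosely following the common convention of a +0.5 °C threshold sustained over several overlapping 3-month means. This is an illustration, not the analysis code I will actually use:

```python
# Toy El Niño detector: flag the start of any stretch where the
# 3-month running mean of the SST anomaly stays above the threshold
# for at least `months` consecutive values.
import numpy as np

def enso_events(anomaly, threshold=0.5, months=5):
    """Return start indices (into the running-mean series) of events."""
    run = np.convolve(anomaly, np.ones(3) / 3, mode="valid")  # 3-month mean
    hot = run > threshold
    events, count = [], 0
    for i, h in enumerate(hot):
        count = count + 1 if h else 0
        if count == months:
            events.append(i - months + 1)  # start of the event
    return events
```

On real data the anomaly series would come from observations or model output rather than an array cooked up by hand.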
Abnormal ocean surface temperatures [ºC] observed in December 1997 during the last strong El Niño. Source: https://commons.wikimedia.org/wiki/File:El-nino.png
Why is it called “El Niño”?
El Niño is a phenomenon that, normally, takes place all year round, but the moment at which the water warms the most off the coast of Peru is in December. This is why fishermen associated the phenomenon with the Christmas holidays and called it “El Niño”, which literally means “the boy”, after the nativity at the end of December.
Almost two weeks have passed since my first visit to EPCC at the University of Edinburgh. I met my mentor in the first week and we’ve started working on my project from the first minute of our meeting. A little background story for the idea of a project:
Online visualisation of current and historic supercomputer usage
Edinburgh Parallel Computing Centre is a supercomputing centre based at the University of Edinburgh. It houses the UK national supercomputer service called ARCHER. There are a lot of projects and communities that rely on this machine in their everyday research. It is a crucial scientific instrument for many different people, not only academics of the University of Edinburgh. ARCHER has an active user base of over 3000 users! It is said that a picture is worth more than a thousand words, so check out this infographic:
In order to build such a data visualisation page, we have to consider a number of topics. First of all, we have to get the data from EPCC servers; the process differs for current and historic data. Next, the data has to be preprocessed by the back-end and stored in a database that allows fast read/write operations and periodic compression, and that can handle large datasets. Furthermore, we must look into one of the basic visualisation problems: What data should we visualise? Which part of the data is the most important?
Moreover, we have to prepare the design of this page and decide which framework we should use for the front-end development. That last question was the biggest one of this past week. D3.js is the obvious choice for SVG manipulation. But this page should contain an extensible user interface that can lead to complex logic, which means that we should use one of the many MVC frameworks. I have some experience with React, Angular and Vue. In my opinion, all of them are great, and all have their good and bad design decisions. However, I like that we can combine the imperative style of Angular with the declarative approach of D3. Furthermore, I hope that gaining experience with Angular will serve my future self well.
After an exciting training week in Ostrava, Paras and I had probably the most convenient travel possibility to our site because Leon brought us to Ljubljana by car. On the way, we made a small detour to listen to a concert of a youth orchestra where Leon’s daughter plays clarinet which was great. I was reveling in nostalgia because it reminded me of the times when I played in a youth orchestra. Also, the scenery was simply amazing with the mountain ridge in the background of the stage. Late in the evening we arrived at our destination and we moved into our beautiful dorm room in Dom Podiplomcev where we even have a balcony!
On Sunday, still tired from the training week, we thought that we could finally catch up on some sleep. According to the plan, our first day at our new working place would start at 9 am, which is actually late enough to sleep in. But this thought was immediately interrupted when Leon called us right before we set the alarm clock for the next day. He announced that the next day, early in the morning, we would travel to the Italian border to visit the largest supercomputer of Slovenia, called Arctur. The good news definitely outweighed the bad. It was very interesting to see how they set up the processors in a literally circular manner around the air cooling system. Unfortunately, we were not allowed to take pictures.
In the afternoon we were introduced to the other students and employees who work in the same laboratory. The working atmosphere is pleasant and we had lots of discussions (not only) about HPC which is nice. The job is very demanding and therefore they offer coffee for free. Since the canteen of the faculty of mechanical engineering is (for certain reasons) closed during the summer, we usually go to other canteens or student restaurants for lunch.
At the dragon bridge
Ljubljana is a very modern city which offers great hospitality and quality of life. You can find parks everywhere and even the city centre is very green. In the first week, Paras and I registered for the public bike sharing system, and since then, every morning we go to university by bike. On the weekend, my girlfriend Alexandra visited me, and so the three of us went to Ljubljana’s castle, which is a key landmark of the town. Inside the castle there’s a museum exhibition on Slovenian history.
Now, I’d like to say a few words about the project itself. My mentor Prof. Povh received a request from the area of biomedicine, where researchers try to investigate the relations between different biological objects containing various information. One can view these objects as networks. To mine (“extract”) all similarities between these networks we use the so-called matrix tri-factorization, which can basically be reformulated as an optimization problem:
$$\min_{G,\,S} \; \| R - G\, S\, G^\top \|_F^2 \qquad \text{s.t.} \quad G \ge 0,\; S \ge 0$$
A typical approach to solve this optimization problem is to derive the Karush-Kuhn-Tucker conditions and apply iterative methods to the first-order condition. Initially, we chose the fixpoint method and the projected gradient method as such iterative methods. During the first week I implemented the C++ code, and in the second week I started using the supercomputer in order to improve the code further and to use the Intel MKL library.
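To sketch the idea (a simplified NumPy illustration, not the actual C++ code), one projected-gradient step for minimising ‖R − G S Gᵀ‖²_F with nonnegativity enforced by clipping could look like this:

```python
# Simplified projected-gradient step for the tri-factorisation
# R ~ G S G^T. Gradients follow from E = G S G^T - R.
import numpy as np

def rse(R, G, S):
    """Relative residual of the factorisation R ~ G S G^T."""
    return np.linalg.norm(R - G @ S @ G.T) / np.linalg.norm(R)

def projected_gradient_step(R, G, S, lr=1e-3):
    """One gradient step on G and S, projected back onto >= 0."""
    E = G @ S @ G.T - R                        # residual
    grad_G = 2 * (E @ G @ S.T + E.T @ G @ S)   # d/dG ||E||_F^2
    grad_S = 2 * (G.T @ E @ G)                 # d/dS ||E||_F^2
    G = np.maximum(G - lr * grad_G, 0.0)       # project onto G >= 0
    S = np.maximum(S - lr * grad_S, 0.0)       # project onto S >= 0
    return G, S
```

The projection is just element-wise clipping because the feasible set is the nonnegative orthant; the real implementation also has to worry about step-size selection and stopping criteria.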
Currently, I am testing the code on real data. To show you some results, the figure below shows the convergence behaviour. We consider the RSE as a quality measure of the factorization and observe it throughout the iteration process. It is defined as the relative residual $\mathrm{RSE} = \|R - G S G^\top\|_F \,/\, \|R\|_F$.
Keep in mind that the RSE should decrease and reach a small value, preferably as close to zero as possible, by the end of the iteration process. But this is only possible if we choose a large enough inner dimension k < n, where n is the size of the square matrices.
So, let’s consider some random sparse matrices which we want to decompose. Suppose k = 7. Then we achieve the following result after 1000 iterations, for randomly chosen starting matrices G and S.
Trend of RSE for 1000 iterations
In my next blogpost, I will go into more detail about the results. Also, I will compare both the fixpoint method and the projected gradient method on real data.
Be it in films, visual arts or computer games, photorealistic rendering of a virtual scene has been a hot topic for decades. The ultimate aim of my project is to create a photorealistic render of a simple scene like the one shown above using what’s known as radiosity. Let’s dive into a brief introduction to the whole thing.
Rend ‘er? I ‘ardly Know ‘er!
So what is rendering? Well, stored in the computer is only the description of the geometry. For the box shown below, only the location of the faces are stored. Rendering is simply the process of turning that description of geometry into an actual image you can see. Boom.
The standard way to render a scene is to choose a virtual camera position, take the geometry that the virtual camera is “seeing” and draw it to the screen. For a simple box, ignoring perspective, that looks like this.
Orthographic view of box wireframe
Using perspective to make the box appear like it would in the real world looks like this.
Perspective view of box wireframe
What about colour? Simply drawing solid colours looks a little strange, see?
Solid filling of faces with flat colour
What we really want is to be able to add a virtual light to the scene and give the box some kind of shading. Adding a light behind the camera produces this kind of image.
Box shaded using Gouraud shading
So What About the Rad Sounds of Radio-City?
The shading used for that last image uses a simple method called Gouraud shading. It’s straightforward and effective, but all it does is calculate how much each bit of the geometry is directly affected by each light in the scene. Take a look at the room shown in the first picture: there’s only one light, a white light, and yet the box on the left is coloured a little bit red because of the red wall. Here we have a case of light bouncing off one object and affecting another, which Gouraud shading doesn’t (easily) take into account. So, we turn to radiosity.
To give a quick definition, radiosity is simply a measure of how much light a little patch of a virtual scene is giving off. This amount of light produced is dependent on two things, how much light is hitting the patch and bouncing off, and how much light the patch itself is emitting. We use this to render a scene by considering how the light given off by each patch in a scene affects every other patch, taking into account how it bounces around the space. We can then properly colour a scene using the calculated radiosity.
And How Can I Tune In to The Visual Nectar That is Radio-City?
Calculating the radiosity of each patch in a scene comes in two parts. Firstly, the spatial relationship between any two patches must be found; this is called the form factor. It basically describes how much one patch can “see” another patch. If two patches are very far away from one another, the form factor between them will be very small; if they’re very close, the form factor will be much bigger. If one patch can’t be seen at all from another patch, the form factor will be zero.
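For the mathematically curious, the form factor between two patches i and j has a classic double-integral definition (r is the distance between points on the two patches, the θs are the angles to each patch’s normal, and V is a 0-or-1 visibility term):

```latex
F_{ij} \;=\; \frac{1}{A_i}\int_{A_i}\int_{A_j}
\frac{\cos\theta_i \,\cos\theta_j}{\pi r^2}\, V \; dA_j \, dA_i
```

In practice this integral is almost never evaluated directly; the hemi-cube method described below approximates it.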
That brings us to the second part. With the form factors known, we can build a large linear system of equations, in the form of a matrix equation, that relates the outgoing radiosity of any one patch to the incoming radiosity from every other patch. By solving this equation we find the radiosity. Almost as simple as getting your car insurance through comparethemarket.com!
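Written out, that system is the classic radiosity equation: each patch’s radiosity B is its own emission E plus the reflected (ρ-scaled) fraction of everything it gathers from the other patches,

```latex
B_i \;=\; E_i \;+\; \rho_i \sum_{j=1}^{n} F_{ij}\, B_j
\qquad\Longleftrightarrow\qquad
\bigl(I - \operatorname{diag}(\rho)\,F\bigr)\,B \;=\; E
```

so “solving for the radiosity” really is just solving one big linear system.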
But How Can I Make My Own Radio-City?
A few different techniques and algorithms are used in the process of calculating radiosity. In order to calculate the form factors, a model-view-projection transformation must be set up, this allows the placement of a virtual camera on a patch in order to see which other patches that patch “sees”.
This transformation is used to project the geometry onto a half-cube sticking out of the patch (this is known as the hemi-cube approximation).
A z-buffer algorithm is then used to sort which patches are in front of other patches and then measure how much area the projection of each of the closest patches covers on the half-cube. This area is equivalent to the form factor.
Calculating the radiosity from the form factors can be done using numerical techniques for solving matrix equations but this requires the storage of n-squared numbers for n patches in the scene, which can end up being a massive amount of numbers, and a massive amount of memory, for complex scenes.
We can alternatively use a technique known as progressive refinement. This is where instead of finding out the radiosity of every patch at one time, the radiosity of a single patch is estimated based on approximate radiosity levels currently known in the scene and then we project that patch into the scene as if it were a kind of light. This process is repeated over the patches until the radiosity for the whole scene gets as close as we want to the true values.
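Here’s a rough sketch of how progressive refinement (“shooting”) might look in code, assuming the form factors are already computed and, for simplicity, that all patches have equal area (which makes the form-factor matrix symmetric). This is an illustration, not the project’s actual implementation:

```python
# Progressive-refinement ("shooting") radiosity sketch.
# E: emission per patch, rho: reflectivity per patch,
# F: form-factor matrix (assumed symmetric, zero diagonal).
import numpy as np

def progressive_radiosity(E, rho, F, iterations=1000):
    n = len(E)
    B = E.copy()    # current radiosity estimate
    dB = E.copy()   # "unshot" radiosity not yet distributed
    for _ in range(iterations):
        i = int(np.argmax(dB))   # patch with the most unshot energy
        shoot = dB[i]
        dB[i] = 0.0
        for j in range(n):       # distribute it to every other patch
            received = rho[j] * F[i, j] * shoot
            B[j] += received     # patch j gets brighter...
            dB[j] += received    # ...and must later re-shoot its share
    return B
```

Each pass picks the patch with the most unshot energy and shoots it into the scene, so bright lights get handled first and the image converges quickly to something viewable, without ever storing or solving the full n-by-n system at once.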
And that’s all she wrote.
Dublin Though, Must Be Tough.
lol
Seriously though, this city is wonderful. The people are as Irish as I was expecting, which is brilliant. There are parks everywhere, nice coffee places around every corner, a plethora of dynamite folk and traditional music happening in at least one bar on every street and to top it all off <suckup>I’ve been working with a bunch of pretty damn great folk at ICHEC </suckup, please offer me a job>. I’ve managed to get to at least a hundred pubs, Dublin Maker (an art/design/technology fair), the National Botanics, the National Gallery, the Natural History Museum (aka the Dead Zoo), a cinema to see the new Spiderman (pretty good, I give it 3.5 out of 5 parallel processes) and even Galway for a few days (didn’t find a Galway girl…). On top of that I managed to see a high speed chase down one of the main streets of Dublin AND I got to play the fiddle of possibly the nicest guy in trad, Cormac, the fiddler from Lankum, check them out below.
How will Earth’s climate change in the next century? This is the kind of question climate researchers ask themselves, and the reason they develop global climate models. These are used to simulate conditions over long periods of time and under various scenarios, including factors such as the atmosphere, oceans and the sun. Understanding and predicting climate behavior comes at a price though – it is clearly a computationally intensive task and heavily relies on the use of supercomputing facilities. This is why climate models estimate trends rather than events, and their results are less detailed than those of similar models used for weather forecasting. For example, a climate model can predict the average temperature over a decade, but not on a specific day. At the same time, climate models no longer automatically scale as computers get bigger; instead, climate research depends on extending the length of simulations to hundreds of thousands of years, creating the need for faster computers in order to advance the field.
Simulation of waves traveling in the Atlantic
This is where my summer project at the Niels Bohr Institute, University of Copenhagen comes into play. A number of ocean simulation models are implemented in the Versatile Ocean Simulation (Veros) framework. Aiming to be the Swiss Army knife of ocean modeling, it offers many numerical methods to calculate the state of the environment during each step of the simulation, and it supports anything between realistic and highly idealized setups. Everything is implemented in pure Python in order to create an open-source model that is easy to access, contribute to and use. However, choosing a dynamically typed, interpreted language over a compiled one like Fortran leads to heavy computational cost, making acceleration necessary for larger models.
How can we accelerate climate models? Parallel computation saves the day and enables us to take full advantage of the underlying architecture. In particular, GPUs can dramatically boost performance because they contain thousands of smaller cores designed to process tasks in parallel. On the other hand, processors such as the Xeon Phi series combine many cores onto a single chip, delivering massive parallelism and significant speed-ups.
Niels Bohr Institute, University of Copenhagen
In the case of Veros, overcoming the computational gap is mainly accomplished through the use of Bohrium, a framework that acts as a high performance alternative for the scientific computing library NumPy. Bohrium seamlessly integrates into NumPy and enables it to utilize CPU, GPU and clusters without requiring any code modifications. Bohrium takes care of all the parallelism in the background so the developer can concentrate on writing a nice, readable ocean model. Also, porting the parallel program to run on another accelerator does not require any code changes as the parallelization is abstracted by the framework.
Bohrium is extremely useful, but naturally there are cases where automatic parallelization fails to significantly improve performance. My mission during the next two months is to profile and analyze the execution of the core methods and models of Veros in order to identify the most time-consuming parts of the framework. The next step is to optimize the NumPy code using Cython, adding static types in order to diverge from the dynamic nature of the language and speed up execution. I also plan to take advantage of Cython’s OpenMP support to enable parallel processing via threads. Finally, the last task of the project is to port these time-intensive methods to accelerators, including GPUs and Xeon Phis, using parallel computation interfaces such as PyOpenCL and possibly PyCUDA. Bohrium will detect that these segments were parallelized by the developer and will not attempt to optimize them. In this way, the power of handmade parallel code is combined with the automatic magic of Bohrium to produce truly accelerated ocean simulations.
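To give a flavour of the first step, here is a minimal sketch of profiling a NumPy-heavy loop with cProfile to find the hotspots worth porting. The `advect` function is a made-up stand-in for a real Veros kernel, not actual Veros code:

```python
# Profiling sketch: time a toy NumPy "kernel" with cProfile and
# print the top entries by cumulative time.
import cProfile, pstats, io
import numpy as np

def advect(field, velocity, dt=0.1):
    """Toy upwind advection step (stand-in for a real Veros kernel)."""
    return field - dt * velocity * (field - np.roll(field, 1))

def profile_model(steps=200, n=100_000):
    field = np.random.rand(n)
    vel = np.random.rand(n)
    prof = cProfile.Profile()
    prof.enable()
    for _ in range(steps):
        field = advect(field, vel)
    prof.disable()
    out = io.StringIO()
    pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()
```

The same approach, pointed at the real model’s time-stepping loop, tells you which methods deserve the Cython or OpenCL treatment.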
Speaking of parallelism, while I will be making progress on the simulation forefront, I intend to explore the beautiful city of Copenhagen along with my friend Alessandro and take as much photographic evidence as possible. Stay tuned for more posts on both parts of the Summer of HPC experience!
If you read my presentation blog post and you have come here expecting to see some nice visualizations about dying blowflies, then keep reading as I have a little surprise for you.
Testing my tracking algorithm first in 2D
As you may already know, my project is about tracing in 4D data. But what does that really mean? Well, it just means that I have to use math to track the position of an object in a 3D movie! The problem is that this task is very well studied and documented in 2D, but not so much in 3D. If you are interested in computer vision or image processing then you may already be familiar with OpenCV, a big and awesome library that comes packaged with every computer vision/image processing algorithm you may need and hope for, even for tracing objects in 2D videos… but not for tracing in 3D. So I need to make my own algorithm! Fortunately, math comes to my rescue, because the same algorithms that work in 2D can easily be extended to 3D: you just need to add one more dimension while the math stays the same! Now I just need to implement them in a parallel and high performance fashion!
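As a toy illustration of that “the math stays the same” point, here is a naive template-matching tracker written with NumPy and SciPy: cross-correlate a small 3D template against a volume and take the best-scoring position. It is just a sketch under my own assumptions, not the project’s actual algorithm:

```python
# Naive 3D template matching: correlate a zero-mean template against
# the volume (via FFT convolution with a flipped kernel) and return
# the (z, y, x) offset of the best match.
import numpy as np
from scipy.signal import fftconvolve

def track_3d(volume, template):
    """Return the (z, y, x) position where `template` best matches."""
    t = template - template.mean()            # zero-mean template
    score = fftconvolve(volume, t[::-1, ::-1, ::-1], mode="valid")
    return np.unravel_index(np.argmax(score), score.shape)
```

Run it once per frame of the 3D movie and you already have a (slow, brute-force) 4D tracker; the interesting part of the project is doing this fast and in parallel on volumes that are 2560³ voxels.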
At this point, you may already have some questions, like: “3D movie? Like Avatar?” Well, not really! 3D films in cinemas are not really 3D but they use stereoscopic photography to give an illusion of depth, while the kind of 3D movies I’m talking about keep the full information about depth representing each 3D volume through voxels (the 3D version of pixels in a 2D image)! That’s why they occupy so much disk space and why they are so computationally intensive to analyze and process! They are often used in the visualization and analysis of medical and scientific data as a result of a volumetric imaging technique (like Computed tomography (CT), Magnetic Resonance Imaging (MRI), or ultrasound). These techniques are now pretty common, for example my dentist asked me to do a 3D CT scan before removing my wisdom teeth!
But the real problem comes in research experiments. Europe has several synchrotron facilities (big X-ray machines) like ESRF, PSI, and MAX-IV. These facilities can produce enormous datasets, especially on medical beamlines. They collect 3D volumes very fast to make 3D movies and study what happens over time. These volumes are typically very high resolution (~2560x2560x2560 voxels), and this means that if, for example, we collect a volume like this 15 times every second for 10 minutes, we get 9’000 volumes with a total size of more than 300 TB! That’s like 10’000 blu-ray discs!
Tomographic reconstruction of a system that is a constriction through which a liquid foam is flowing.
The analysis of this data typically involves tracing one or more objects, such as a heart valve or lung tissue, in 3D over time. Another example is the paper by Rajmund Mokso from which my project was born. Yes! It’s the one about the blowflies! He and his colleagues developed a very fast and high resolution CT imaging technique to study their flight muscles, which swing back and forth 150 times per second! As promised, here you can find a visualization (attached to the paper) of the kind of data they collected. But the cool thing about the algorithm is that it doesn’t depend on the data and could work on any dataset! Just this week I was handed a dataset where the system is a constriction through which a liquid foam is flowing.
So I need to get back to work and I will let you know how it goes in my next blog post!
After the stimulating training week at IT4Innovations, Ostrava, we (me, Jan and our site supervisor Dr. Leon Kos) set forth on a long, exhausting but ultimately very rewarding drive towards Slovenia. Special thanks to Leon, as I got the opportunity to taste some delicious Slovenian food on my very first evening in the world’s first green country. After catching up on some much needed sleep and a sneak peek at the beautiful city of Ljubljana over the weekend, I was all charged up to start on my project work. My HPC journey in Ljubljana started with a visit to the Arctur supercomputer, one of the few privately owned commercial HPC service providers. Learning about another awesome supercomputer invigorated me to start on my project, titled “CAD data extraction for CFD simulation“.
Jan in the Bubble 🙂 trying to play some tricks with soap bubbles
What is the project about ??
As described in my previous blog post, CAE is a vital component of the modern day product design validation cycle. The CAE application we focus on in my project is Computational Fluid Dynamics (CFD), where one studies the flow of fluids, usually around some object – such as an airplane wing. The steps involved in the process are described in the figure below.
The CAD-CAE process chain
As described here, the CAD model (geometry) is fed as input to the pre-processor, where the problem is set up and transformed into an equation system, which is then solved to get the results. One of the trickiest steps is the geometry cleanup, which involves defeaturing (i.e. removal of small features such as holes, chamfers, fillets etc. that are not important for the CFD analysis and make meshing complex and difficult) and repairing surfaces that may not have been properly imported into the pre-processor. Usually this is handled manually within the pre-processor, which is cumbersome and time-consuming. Thus, in order to “let things flow” and speed up the entire process, we aim to develop a dedicated geometry cleanup tool based on PythonOCC, which is nothing but a Python wrapper for the C++ CAD modeling library OpenCASCADE. With this geometry cleanup tool included in the process chain, the entire process up to solving the equation system can be automated with Python scripts.
The CAD-CAE process chain with proposed modification
Since I was new to PythonOCC, it took quite some time to figure out how the CAD data is actually represented in the program, and how different parts of an assembly can be accessed, manipulated etc. Now, almost two weeks in, I have already begun code development. Thus, you can expect some exciting results in my next blog 🙂
Exploring Ljubljana
I feel kind of lucky to have been offered my Summer of HPC project in this beautiful, green city. Last weekend, I got the opportunity to explore some parts of Ljubljana and also to visit Ljubljana Castle which apart from its great architecture also offers a great view of the city.
View from top of the Ljubljana Castle showing the Ljubljanica river
The Summer of HPC 2017 training week has long since passed and we have all started our projects in our respective countries. It has been about 2 weeks since we landed in Edinburgh and within that time I think I can safely say we have all settled in nicely here. I must say I quite like the lifestyle here, although I guess it isn’t much different to the lifestyle in Ireland (my homeland), which is probably why I fit in so well here! Despite this, I’m still getting used to some of the words they use here and a few of the countryside accents.
A carnival parade during the weekend of the Jazz and Blues festival.
Flower displays in Prince’s street gardens. The display on the left is a working clock.
For the few months leading up to this project I joked with friends and family about how I was going to be skipping the summer by living in Scotland for a few months. This (surprisingly) couldn’t be further from the truth! Granted, we have had a couple of days of dull rainy spells, but we have also had some great sunny days for exploring the city. Edinburgh is a great city to explore, and to just go for walks in a random direction (you’ll always come across something cool). From Edinburgh castle to the gardens alongside Prince’s street, to farmer’s markets selling all kinds of things, and even a long beach to relax on, the city has something for everyone. Also, one of the best decisions I’ve made since coming here is purchasing a bike, which is a nice way to travel to and from work. It allows you to take your mind off the day’s work and enjoy the weather (when it’s sunny) and the scenery.
A view of Edinburgh castle from a car park that hosts the farmer’s market every Saturday.
Now to the most important part of life in Edinburgh: the food. The fish and chips here are to die for, either from a restaurant or a take-away (chipper). It has always been one of my favourite meals and here is possibly the best I’ve had it (although the west of Ireland does it pretty well too). This is mainly due to the use of freshly caught local haddock. However, this is not the only thing I’ve been eating (although I’ve had it more times than I’d like to admit), and Scotland has some other dishes to try out. The traditional breakfast here of bacon, sausage, egg, beans and black pudding is great. And by breakfast, I mean the only proper meal you’ll need that day. And of course, let’s not forget the famous haggis, neeps and tatties, which is surprisingly edible and pretty good if you ask me! I’d show you a photo of it but I was starving and ate it all instantly.
Ok, I’ve talked enough about Edinburgh, now it’s time to tell you a bit about EPCC. At this stage, the three of us doing a project here in Edinburgh have met our respective mentors and have begun working on various codes. EPCC has welcomed us with open arms, given us a work space and access to their coffee kitchen (which is always needed), and been a great help in settling into life here (especially Ben Morse!). My mentor, Nick, has already been a huge help getting me started in modifying the visualization end of his MONC weather model, and we have already met several times to discuss ideas, problems/bugs and goals. I don’t feel like there’s much point showing off Nick’s model as it isn’t any of my work, so stay tuned for my next blog post, where I will be showing images/videos of my progress on the weather model!
# import Cup_Of_Tea as English_Tea
# import Blog_Post as Introduction
# Hello everyone :)
I am currently in Edinburgh, home of the biggest celebration of arts and culture on the planet, and of course I am referring to the famous Fringe Festival. But we have plenty of time until the festival begins and my blog starts documenting all the exciting cultural events that happen in this cool city. So it is just the right moment for me to introduce you to my internship project for PRACE’s Summer of HPC.
I have spent the past two weeks working at the Edinburgh Parallel Computing Centre on exciting topics that concern state of the art geological applications and the use of Python libraries – such as sklearn – that support machine learning and scientific computing. It seems that there has been a change of plan as far as the theme of my summer project is concerned. So, instead of earthquake hazard modeling, I will be working on the development of an application that geologists will use to discover the geological time in which sediments with similar properties were deposited over large areas of geophysical interest. These properties have been measured up to this day through well logging. But further exploration of areas for oil, gas or minerals requires many man-hours of well log correlation.
So now, you may be wondering how my work might help geologists and engineers decide which regions are best suited for specific geo-technical exploration. A very useful tool for the experts in this area of study would be simultaneous and automatic well log correlation based on measurements over large areas of interest. Since methodologies applicable to the well log correlation problem have been documented, it sounds reasonable to combine these scientific breakthroughs with machine-learning methods, for a more efficient overall documentation of an area’s lithology in terms of facies classification based on limited well log measurements.
The project as presented above requires a lot of work, but I am more than content to work on the preliminary design and concept of this idea during my internship.
My main goal for this summer will be not only to work with the large amount of available well log measurement data, but also to use existing models of well log correlation within the general framework of machine-learning applications using Python!
Just to give you a glimpse of the data available, I provide you with a visualization of the most common parameters, as functions of depth, that characterize a well log, along with a colorbar giving information on the lithology of the well. Wouldn’t it be awesome if we trained a neural network to identify the lithology from only the 5 parameters shown below?
Parameters as functions of depth [m]: Gamma Ray (GR), Induction Resistivity (ILD_log10), Photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), average neutron-density porosity (PHIND)
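Just to illustrate the idea (on synthetic numbers, not the real well logs), a tiny scikit-learn pipeline for facies classification from those five parameters might look something like this:

```python
# Toy facies classifier: a small neural network on five synthetic
# "log" features. Feature names follow the caption above; the data
# and the two-facies label rule are entirely made up.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

FEATURES = ["GR", "ILD_log10", "PE", "DeltaPHI", "PHIND"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(FEATURES)))   # synthetic log readings
y = (X[:, 0] + X[:, 3] > 0).astype(int)     # synthetic 2-facies label

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
acc = clf.fit(X, y).score(X, y)             # training accuracy
```

With real well logs the labels would of course come from expert-interpreted facies rather than a made-up rule, and we would evaluate on held-out wells, not the training set.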
Well-log correlation terminology in a nutshell :)
Well-logging: the process of recording various physical, chemical, electrical, or other properties of the rock/fluid mixtures penetrated by drilling a borehole into the earth’s crust. A log is a record of a voyage, similar to a ship’s log or a travelogue.
Facies: the sum total of characteristics of a rock, including its chemical, physical, and biological features, that distinguish it from adjacent rock.
It’s been more than a week since the Summer of HPC program began, and these days have been quite interesting: getting to know members of the Computing Center of the Slovak Academy of Sciences and finding out more details about the project I will be working on. The aim of my work here is to ascertain whether it is useful to apply typical Big Data tools to High Performance Computing problems. This will depend on the computational time required using tools such as Scala and Spark, compared to ‘equivalent, more traditional’ approaches such as a distributed memory model with MPI on a distributed file system such as HDFS (Hadoop Distributed File System) or BFS (Baidu File System), with native C libraries that create an interface encapsulating the file system’s functionalities. To be more precise, in order to perform fair comparisons, we’ve chosen the K-means clustering algorithm, rather than the initially proposed Hartree-Fock algorithm. This will be run on variable-size datasets obtained from bioinformatics studies, and we will compare the computational run time and the resilience of both approaches.
The dartboard: the missed throws are mine, probably breaking the typical π/4 Monte Carlo methods proportion
Of course, not everything in Bratislava is centered around work. The guys at the computing center have introduced me to ‘the game of darts’! My colleague Andreas has already played several times before, so he has some advantage. I find darts quite entertaining and useful for disconnecting a little from the project and clearing the mind. I must admit that I am a little bad at darts, as can be confirmed by the image on the left.
The distance from Bratislava to other cities, indicated on the floor in the center of Bratislava
We have also had some time to visit parts of the city, and although it is small, I found it much to my liking, with multiple restaurants where one can taste traditional Slovak food. So far, the traditional dish I have liked the most is Bryndzové Halušky, which is basically potato dumplings with sheep-milk cheese. Also, since it is summer, we have tried some good ice cream in Bratislava, and apparently we have been fortunate enough to try one of the best ice cream places in the city: KOUN. If you want good ice cream in Bratislava, I recommend this place 101%.
To conclude, we also had the opportunity to enjoy the performance of a local music band, playing the song Angels by Robbie Williams – see my video below.
Figure 0: Me, travelling around the research center with my Tesla K40 GPU.
Forschungszentrum Jülich is quite an interesting place to work at. You can think of it as a town in the middle of nowhere where over half of the population are scientists or technicians. The research center even has its own fire department and healthcare center! It is also surrounded by chain link fences with barbed wire, which are probably designed to keep the wildlife on the outside and the researchers on the inside. To get in you have to present your personal ID card, and you ought to have it on your person at all times. The security of the facility appears to be quite high on the priority list. The whole place reminds me of the Black Mesa Research Facility, but with fewer headcrabs. I’m not even sure if I was allowed to take photos, but I did anyway.
Figure 1: My field of view while strolling around. There are lots of these bizarre solar-paneled boxes all over the facility. Nobody I have asked knows what they are. Maybe it’s some ongoing experiment.
You can actually spend quite a lot of time just strolling around the research center. A couple of Master students I share the office with were kind enough to take me on a tour around the campus on a route called The Two Towers. This route visits the two largest radio towers the center has. As you might deduce from its name, the route is almost as long as the trek from the Shire to Mordor. But walking isn’t all I do! I also eat brunches, lunches and second breakfasts while drinking copious amounts of coffee. In addition, during breaks from all of that, I have this pretty cool programming project I’m working on.
The project is basically about calculating gravitational or electric potentials caused by groups of discrete point masses or charges on a graphics card very quickly. These sorts of potentials are encountered for example, when simulating molecular dynamics, behaviour of plasma or formation of a solar system. You want the calculations to be very fast, because simulations usually require an immense number of time steps before they’re of any use. And lastly, we want to utilize GPUs, because nowadays they’re ubiquitous and offer the cheapest FLOPSs.
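To make the problem concrete, here is a minimal serial sketch (my own illustration, not the project's code) of the naive pairwise approach such simulations start from. It is O(N²) in the number of particles, which is exactly why fast GPU implementations matter.

```python
# Naive O(N^2) evaluation of 1/r potentials: the potential at each particle
# is the sum of contributions from every other particle. Units are arbitrary
# (the physical prefactor, e.g. G or 1/(4*pi*eps0), is set to 1).
import math

def direct_potentials(positions, charges):
    """Potential at each particle due to all the others (pairwise sum)."""
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle does not act on itself
            r = math.dist(positions[i], positions[j])
            phi[i] += charges[j] / r
    return phi

# Two unit charges one unit apart: each feels a potential of 1.0.
print(direct_potentials([(0, 0, 0), (1, 0, 0)], [1.0, 1.0]))  # -> [1.0, 1.0]
```

The double loop is what the GPU parallelises, and what the Fast Multipole Method discussed below reduces asymptotically.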
Figure 2: My field of view during a break from the strolling around. Walking’s serious business: note the legs in an optimal resting position. That white thing right there is my personal coffee kettle. That’s right, a liter of coffee, just for me!
More specifically, I convert existing CUDA code for nVidia GPUs to a more general, open-source framework called the Heterogeneous-Compute Interface for Portability, or HIP. With it, you only need to write the source code once, and at compile time you decide which hardware to target. The framework is quite new and under heavy development, which becomes apparent when Googling the various error messages and noting the absence of hits from StackOverflow. In such cases, you need to do the figuring out by yourself! However unreasonable it may sound, this playing in the background has kept me sane while debugging my code.
The algorithm used for the potential computation is called the Fast Multipole Method. The large distribution of point charges (or masses) is divided into small boxes, each of which contains only a few particles. The particles of a certain box and its neighbouring boxes interact classically using the usual formula for 1/r potentials, and this is called the near-field. Everything further away is considered to be in the far-field. As it turns out, one can give an exact expression for the potential outside of one particular box in the far-field as an infinite series of spherical harmonics, which we can of course approximate by keeping just the first few terms of the infinite sum. The further away we are from the source of the potential, the more neighbouring boxes are included in the spherical harmonics series. By this one weird trick you can reduce the time complexity of this problem from quadratic to linear!
The coolest thing is that in some cases, this gives a more accurate result than the classical pointwise calculation of the potentials, by reducing the number of summations and thus diminishing the cumulative floating-point addition error! No wonder the algorithm is said to be one of the most important algorithms of the last century.
Throughout history, humankind has had a clear preference for grouping together in larger and larger groups. On the global scale, 54 per cent of the world’s population are living in urban areas as of 2014. In 1950, this figure was 30 per cent, and in 2050 it is projected to be 66 per cent. [1]
Planning ahead for this massive increase in urban population will be demanding, and having an accurate simulation to test your city’s new infrastructure might save lives. We can, using both modelling and real data, see how people react to various external stimuli such as a catastrophe or a change in the road.
What words are those?
Monte Carlo and Deep Learning Methods for Enhancing Crowd Simulation
Yep, that’s my project title. Loads of fancy, technical words! All this media coverage of artificial intelligence and how it will take over the world is probably not helping the confusion of the “man in the street”: how can a machine alone do my work?! Before I start talking about what I’m doing, I’ll try to quickly explain what Monte Carlo and Deep Learning methods are.
Monte Carlo Methods
Monte Carlo methods refer to methods of obtaining numerical results in which we randomly sample a small part of a whole space that we would otherwise need to compute every point of. A good example is computing the total volume of particles in a box which could fit up to N³ particles. A normal way of computing this involves checking every single location, whereas a Monte Carlo algorithm might only check on the order of N locations, and then extrapolate the result. There are obvious limitations to this, but it works surprisingly well in the real world!
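A classic toy example of the idea: estimating π by throwing random “darts” at a unit square and counting how many land inside the quarter circle, whose area is π/4.

```python
# Monte Carlo estimate of pi: sample random points in the unit square and
# count the fraction falling inside the quarter circle x^2 + y^2 <= 1.
# That fraction approaches pi/4 as the number of samples grows.
import random

def estimate_pi(samples, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

print(estimate_pi(100_000))  # close to 3.14159
```

The error shrinks only as 1/√samples, which is the “obvious limitation” mentioned above, but the method needs nothing beyond the ability to evaluate random points.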
Artificial Intelligence (AI) can be defined as the study of intelligent agents: devices that perceive their environment and take actions that maximize their chance of success at some goal. [3]
Historically, we can divide approaches to AI into groups of thinking or acting humanly or rationally. One way of determining whether a machine has reached human level intelligence is the Turing test. As laid out in Russell and Norvig, a machine needs to possess several capabilities in order to be able to pass the Turing test. Amongst those are understanding natural languages, storing information and using this information to reason, and to be able to learn and adapt to new situations.
An Artificial Neural Network. This specific one can for instance learn the XOR (exclusive or) operation. By Glosser.ca – Own work, CC BY-SA 3.0, Link
This last ability is what we often call Machine Learning (ML) – learning something without being told explicitly what to do in each case. There are many approaches to this; the one we will talk about is Neural Networks (NN). The basic idea is to model the structure of the brain, and also mimic how we humans learn. We split the task into several smaller tasks and then combine all the small parts in the end.
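To make this concrete, here is a tiny network like the XOR example pictured, written as a plain Python sketch. The weights are hand-picked for illustration rather than learned; in practice they would be found by a training algorithm such as backpropagation.

```python
# A minimal 2-2-1 feed-forward network computing XOR with threshold units.
# Each hidden unit solves a small sub-task; the output unit combines them.

def step(x):
    """Threshold activation: the unit fires (1) if its input exceeds 0."""
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit acting as OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit acting as AND
    return step(h1 - h2 - 0.5)  # output: OR but not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

This is the “split the task into smaller tasks and combine the parts” idea in miniature: no single unit can compute XOR, but the hierarchy of units can.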
As explained in the Deep Learning book [3], we try to enable the computers to learn from previous experiences and understand the world in terms of a hierarchy of concepts. Each concept is defined using those concepts further down in the hierarchy. This hierarchy of concepts model allows the computer to learn complicated concepts by building them out of simpler ones.
This hierarchical model, when laid out in graph form such as in the picture on the right, is what has inspired the name Deep Learning: deep learning is when there are a large number of hidden layers.
A visualisation of a Biological Neural Network, on which these artificial neural networks are based.
I feel it’s worth mentioning that most of the fantastic progress seen recently is in this field of AI, and that it is doubtful this transfers directly to so-called general AI, due to the very specialised training required. However, Machine Learning will change what people spend time on, that is for sure. Certain tasks are very well suited to this specific approach, and I suspect we will soon see it commercialised.
A last thing I should explain is the term Reinforcement Learning (RL). It’s not often mentioned in the media, but it is one of the most successful approaches, albeit limited to a few domains, such as games. The basic idea is that we mimic how humans and animals are trained. If the agent does something good, say, doesn’t crash, it gets a reward. If it does something bad, e.g. crashes or breaks a rule, it gets punished. There are many approaches to solving this; they all centre around giving the machine a function which assigns a predicted score to every action in a state. [4]
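As a small illustration (not part of my actual project code), here is tabular Q-learning on a toy corridor world: the agent is rewarded for reaching a goal cell and punished for “crashing” off the edge, and the learned score function ends up preferring the safe move in every state. All states, rewards and hyperparameters here are invented for the example.

```python
# Tabular Q-learning on a 4-cell corridor. The agent starts at cell 0,
# gets +1 for reaching cell 3 (the goal) and -1 for stepping off the
# left edge (a "crash"). Q maps (state, action) -> predicted score.
import random

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(4) for a in (-1, 1)}  # actions: left/right
    for _ in range(episodes):
        s = 0
        while s != 3:
            # epsilon-greedy: mostly exploit the best-scoring action, sometimes explore
            if rng.random() < eps:
                a = rng.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda act: q[(s, act)])
            s2 = s + a
            if s2 < 0:  # crashed: punish this action and restart at cell 0
                q[(s, a)] += alpha * (-1.0 - q[(s, a)])
                s = 0
                continue
            r = 1.0 if s2 == 3 else 0.0
            best_next = 0.0 if s2 == 3 else max(q[(s2, -1)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = train()
# After training, moving right should score higher than moving left everywhere.
print(all(q[(s, 1)] > q[(s, -1)] for s in range(3)))
```

The deep Q-network mentioned below replaces this lookup table with a neural network, so the same idea scales to states far too numerous to tabulate.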
What am I doing?
My project is mainly algorithmic in nature. I try to come up with the underlying rules each individual in the crowd must follow. The main ones are not crashing, leaving the area and going towards any goal you’ve been given. A secondary goal is coming up with a nice visualisation. This is mostly already done, but there will likely be a large amount of work in transferring the algorithm.
Illustration of a Markov Decision Process with three states (green circles) and two actions (orange circles), with two rewards (orange arrows). A POMDP is when some of this information is not given. By waldoalvarez – Own work, CC BY-SA 4.0, Link
For avoiding collisions, we will combine deep learning with reinforcement learning, similar to what Google-owned Deep Mind did with the deep Q-network for playing Atari games, although our network will be much more shallow and far less complicated.
A last thing we hope to try is using another branch of machine learning, called supervised learning, to instruct the agents on various aspects – such as that some city paths are more likely to be taken. This can also be used to say that this road, which is normally the best option, is in fact heavily congested at the time we want to simulate and should be avoided.
The crowd simulations themselves are quite complicated creatures, so to develop the algorithms in the first instance, my primary weapon is pen and paper. After I have derived the idea, I implement it in the Python programming language, using a physics simulator called PyMunk and a game engine called PyGame. The biggest advantage of using Python is in fact Keras, a so called API for Neural Networks. It allows me to code my neural net in just a few seconds. Combined, these allow me to quickly test my ideas in practice.
Status of the project as I left the office on Friday 21st of July. The green dot is the one I’m training and the blue dots are stationary obstacles we try to navigate around. The left and right panels show training data, and in the background is an algorithm I’m trying to see if we can adopt. The algorithm was proposed in the paper Deep Successor Reinforcement Learning, as implemented for a self-driving truck in this Euro Truck Simulator 2 project.
It was a very, very long trip, which began on Sunday at 2 p.m. and ended on Monday at 11 p.m. (with a 15-hour stopover in Milan), but finally, I landed in Athens!
“Athens, the eye of the Greece, mother of arts and eloquence”.
From my first sight of it from the plane, I understood this was going to be true love: a lively city, full of life and a sense of solidarity.
If I were made of ethyl bromide (C2H5Br), I would have evaporated already.
Athens welcomed me with a temperature of 40 degrees. Correction: 41, and it remained this way during the whole stay! The surrounding air reminded me of the feeling I had in Death Valley (California), when the thermometer read 120 °F (almost 49 °C).
After this first impact, things have been much better.
I spent the first training-week with Dimitri and Ioannis at GRNET (Greek Research & Technology Network). GRNET is located at Ambelokipi, a central district which hosts Panathinaikos’s home ground.
I familiarized myself with NetCDF (Network Common Data Form) files, which are used a lot in climate science, and learned how to access (also physically speaking – see Mahmud’s post) ARIS (Advanced Research Information System), the Greek national supercomputer system.
We also spent a great time together. Ioannis and Dimitri invited me and Mahmud to a special Greek dinner, and after a walk through ancient Greek ruins, I ate the tastiest grilled mushrooms I have ever tried. I still dream about having another piece of the organic watermelon I had there: taken directly from Dimitri’s garden!
“Salonique à tout prix!”
After this wonderful first week, I moved to Thessaloniki to switch to the operational part of my project. It rained for the first few days, but now the sun is shining over the cultural pole of Greece. I’m revisiting the relevant literature on climatology, in order to understand which quantities are important to visualize from a physical point of view. I am very enthusiastic about deepening my knowledge of climate models, which help us understand how our climate system works and figure out possible future scenarios for our Earth.
In a nutshell, the objective of my project this summer is to develop a visualisation of real human skeletal motion based on motion capture. I will be working at IT4Innovations, the Czech national Supercomputing Center in Ostrava, where the Salomon cluster is based. The Salomon cluster is currently the 78th most powerful supercomputer in the world, and includes some specialised visualisation nodes which will be useful for my project.
The first step of this project will be to generate a 3-dimensional model of a human skeleton from computed tomography (CT) images. To do this, it will be necessary to segment the skeletal image data into a set of images representing individual bones. The K-Means Segmentation Algorithm is a simple, yet powerful, unsupervised machine learning algorithm that will be suitable for this segmentation.
Animation of the k-means algorithm. Source: wikimedia.org/wiki/File:Kmeans_animation.gif
The standard K-Means Segmentation Algorithm has a time complexity of O(n·k·d·i) – points × clusters × dimensions × iterations – which means that it takes a long time to run this algorithm on a large dataset. Thankfully, this algorithm can be parallelised, and a group of researchers at IT4Innovations have implemented a parallel version of this algorithm using OpenMP. This implementation is optimised to run on the Xeon Phi co-processors that are included in several compute nodes of the Salomon cluster. In plain English, this means that this method for generating models of individual bones from the model of the skeleton requires less time to run using the supercomputer.
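For readers who have not met k-means before, here is a serial sketch of the standard Lloyd iteration on toy 1-D data. This is my own illustration; the IT4Innovations implementation parallelises the assignment step (which dominates the per-iteration work) with OpenMP and is far more sophisticated.

```python
# Serial k-means (Lloyd's algorithm) on 1-D data, e.g. image intensities.
# Each iteration: assign every point to its nearest centre, then move each
# centre to the mean of its assigned points.

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: (p - centres[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its old centre).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

# Two obvious intensity groups; the centres converge to their means.
data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans(data, centres=[0.0, 5.0]))  # -> [1.0, 10.0]
```

In the segmentation setting the “points” are voxel intensities (or feature vectors) and the converged clusters correspond to tissue classes such as bone versus soft tissue.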
Kinematic chain modelling human body (Simplified). Source: wikipedia.org/wiki/Kinematic_chain#/media/File:Modele_cinematique_corps_humain.svg
Once I have generated mesh models from the image data, I will need to use the Kinect 2 to generate a 3-dimensional kinematic chain based on motion capture of a moving human. A kinematic chain is simply a set of rigid bodies connected at fixed points, with various constraints defining how the rigid bodies can move relative to those they connect to. These models are often used in robotics, but can also be useful in other fields including structural engineering and physiotherapy.
IT4I have developed some rather interesting plugins for Blender used for medical image processing on the Salomon cluster. In my next blogpost, I will go into more detail about image processing, the DICOM standard, and generating mesh models with Blender using the Salomon cluster.
As a guy from Central Europe, upon arriving in Bologna my first reaction wasn’t subtle: “Damn, it’s hot!”. After two weeks of getting used to the weather, I can say that I am still boiling. I will return to temperatures once more in this post, but in a completely different way.
Bologna, the city of food. You can guess the first thing we did when we arrived at our apartment: dropped the bags and went right away for the famous spaghetti alla bolognese. And I can tell you, it was superb! The “summer body” term has a completely different meaning for me now…
I can talk about all the amazing food, drinks, architecture and people all day long (the best thing for this is to follow me on Instagram) but let’s get to what I am actually doing this summer – apart from getting fat.
CINECA and Galileo Supercomputer
My Summer of HPC project focuses on visualizing an HPC system’s energy load at the CINECA facility. In less cryptic words, we monitor a lot of metrics across the whole HPC system, from the core counters to the temperatures and cooling of the room where the supercomputers are.
Currently, we are working on the Galileo supercomputer – which is the 281st most powerful supercomputer in the world, and we are monitoring only about an eighth of its nodes (64 of 516).
Now we get to the more technical and interesting part.
The metrics we measure are of several types, e.g. environmental metrics (mainly temperatures at different places and levels), performance metrics (such as core loads or instructions per second) and many others. Actually, there are more than 50 metrics being monitored right now. All of these are usually measured every 2–10 seconds and are sent via the MQTT protocol to a database cluster which stores them for 2 weeks. You may ask why we store these for such a small amount of time. The answer is simple: there is a lot of data. Each day the system produces around 60 GB of data. That is a lot to store and keep in the long run.
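As a rough sanity check of these figures (my own back-of-the-envelope arithmetic, not CINECA's accounting), assume ~50 metrics on the 64 monitored nodes, each sampled every 2 seconds:

```python
# Rough estimate: how large must each stored sample be to reach ~60 GB/day?
# The metric count, node count and sampling period are taken from the text;
# the per-sample size is what we solve for.
metrics, nodes, period_s = 50, 64, 2

samples_per_day = (24 * 3600 // period_s) * metrics * nodes
bytes_per_sample = 60e9 / samples_per_day

print(samples_per_day, round(bytes_per_sample))  # -> 138240000 434
```

Roughly 138 million samples a day at a few hundred bytes each (value plus timestamp, topic and storage overhead) is consistent with the 60 GB/day figure, which makes the two-week retention window understandable.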
Some of the data that we receive in 2-second samples.
Now that we have all the data, what should we do with it? Visualize it, of course! Here’s where I come in as the savior. I am creating a web application that takes the data and shows it in a consumable format, mainly as charts or as one single number. Of course, I am not showing all the gigabytes of data we have, only a very small portion, usually defined by the job that runs on Galileo. This way it can serve a couple of different scenarios: a programmer is more interested in how their program performs, while a system administrator cares about the temperatures and loads. This gives us the uneasy task of deciding what data should be shown.
3D model of one of the cluster rooms at CINECA
All of this will eventually be displayed in a sexy interactive 3D model of the Galileo supercomputer. The good folks from CINECA have already done the boring part, and the model is prepared for me to play with – and still, everything will be in a browser. So once you run a job on Galileo, you can actually see the physical position of the nodes on which it runs, how it behaves and how it influences, e.g., the temperatures and power consumption of the nodes. Cool, huh?
In the long run, this web application can help people around HPC realize how much energy and how many resources are needed to run their applications, and make them focus more on the energy impact of their work.
says Sheldon Cooper when trying to teach some Physics to Penny. My story for the next two months, which I am going to explain in this blog, starts in a similar way…
It was a hot summer afternoon at Bologna airport. The plane had just landed, and my SoHPC mate, Petr, and I were heading to the apartment where we would be spending the next two months. It was tough at first, after a week of training with all the SoHPC students in 25°C weather in Ostrava, to get used to Bologna’s 35°C. Also, there were just the two of us, since everyone else was now scattered around Europe. We had some time to enjoy the beautiful city of Bologna and start our summer diet of pasta and pizza. However, we were really looking forward to meeting the people from CINECA and getting started with our projects. In conclusion, my summer story had just begun.
A typical evening at the center of Bologna.
I could write about Bologna and my stories here for the whole post, but that’s not what I want to present to you today. In this post, I want to introduce you to my SoHPC project which I will be working on during the summer. In my first self-introductory post I mentioned that I am working on the project “Web visualization of the Mediterranean sea”. But, why do we need a web visualization of the Mediterranean sea? What scientific use does it have? Well, this is what I’m going to explain in this post.
CINECA’s supercomputer racks, photo taken during the visit to CINECA’s supercomputer facilities.
Short-term biochemical forecasts that cover around 20 days and are run twice a week on one of CINECA’s supercomputers: Pico. Their inputs, also called “forcings”, come from other Copernicus partners, building a very complex network.
Climate studies that project up to 100 years into the future. In this case, surface satellite colour data can be assimilated into the model, both to improve forecast skill and to produce optimal representations of the biogeochemical state of the Sea over the last decades.
Quite impressive, isn’t it?
So, what is my role then? Concerning numerical simulations, we can usually distinguish three main parts: pre-processing and set-up, running the model and post-processing the results. As you might have guessed, the OGSTM-BFM code generates large amounts of data. However, raw data are of little use generally. We need to post-process this data set and convert it to something that can be easily understood, for example graphs or maps. One of the problems OGS has in this subject is precisely how to optimally visualize the 3D data.
Chlorophyll concentration on the Mediterranean Sea. Post-processed in VTK using python, exported to ParaView and animated with Blender. Credits: “Generated with E.U. Copernicus Marine Service Information”
Tools such as ParaView and Blender can produce 3D visualizations and videos that can be easily understood. The philosophy behind these is: “an image is worth a thousand words”. Recently, these programs (ParaView in my case) have developed web-based viewers that can connect to and access the computers where the data set is stored. Then, 3D visualizations, maps, vertical sections or any kind of graph that can extract information from the selected data set will appear in a browser. In a nutshell, as my SoHPC supervisor, Dr. Paolo Lazzari, put it:
“We need an efficient way to explore the data.”
So, this is what I will be doing this summer, besides eating pizza and pasta. Stay tuned here to learn more about my project and follow my story!
Credits: “E.U. Copernicus Marine Service Information”
The forest is only 5 mins away from the hotel. Perfect to have a run in the morning before work begins
It has been one and a half weeks since the PRACE Summer of HPC program began. I have settled in very well: a quick run in the forest just before breakfast, enjoying the singing of the birds, working on the top floor of the building with nice views, and reading books in the evening. (Since Dimitra and the SoHPC interns from last year have written great posts about IT4I and Ostrava, I’m not going to repeat the same remarkable things again – you can find it all here.)
Training week: We are all studying hard! Lecturers are from various universities and companies, sharing their first-hand experience in learning and using HPC
Because IT4Innovations is the training site this year, I got an extra full weekend to spare, unlike the others, who had to travel to their HPC sites as soon as the training week finished. After seeing off my new friends on their departures, I went to Globus (a big supermarket) with David, Alex, Antti and Philippos. I was totally surprised that Globus could cater to my Chinese tastes, with rice and soy sauce.
I guess I’m also the first one to have met their mentor. On the first day of training, I happened to meet him at lunchtime and we had a brief talk. As my mentor had to go to a conference in Italy the following week, we arranged a meeting just before his departure and summarised a to-do list for the week. The main focus is to get me on board with sufficient reading and tutorials on several Python modules (Bokeh in particular), to understand the end goal of the project, and to be able to reproduce the current data. People in the group are very nice and friendly.
Generally speaking, our group is building a Python module for measuring HPC performance at the moment. At the same time, we are also helping an experimental group to understand the bioactivity trends of their molecules. We’ve received information about the molecules from them, and we are going to apply machine learning techniques to investigate the relationship between the molecules and their bioactivity. Meanwhile, we also have to monitor the computing performance of all our jobs, to make sure that all nodes are working at their highest efficiency. As the only chemist in the group, I may also be able to provide useful insights from a biochemistry perspective.
My name is Jakub Nurski. I am an ambitious B.Eng student of Computer Science at Poznan University of Technology in Poland. I also work there for a small start-up that provides innovative software for photovoltaic companies. Bringing green energy to people is one of my lifelong goals.
Always thinking of something, either computer related or not.
During my studies I’ve participated in many hackathons. “What is a hackathon?” you may ask.
A design-sprint-like event in which computer programmers and others involved in software development collaborate intensively on software projects. It typically lasts between a day and a week.
Wikipedia
Together with my team, we have created video games, a study help app, a first aid app, a social media app, an e-commerce app and many more. 24 hours of work without sleep is really rewarding, especially if you work with determined people and win some prizes.
However, programming is not my whole life (and it never will be – I hope). In my spare time I like to design and play board games. Regular board game meetups with a fun group are a great way to recharge your batteries for the week. Moreover, I prefer to spend my holidays in mountainous areas, which allows me to pursue one of my greatest passions – hiking.
On the summit of Sněžka, Czech Republic.
I feel enthusiastic about the PRACE Summer of HPC program. I hope that I’ll learn a lot during my project about visualisation. I also hope that my work will help the public to understand how important HPC is for science.
Hello Readers, as might be suggested by the title of this post, here I’ll try to introduce myself and also give a glimpse of what this amazing experience called the PRACE Summer of HPC program probably has in store for me.
Me at Karnspitze, Sarntal, Italy during last summer
My name is Paras Kumar. I’m a 27-year-old mechanical engineer turned computational scientist, currently pursuing a Master’s in Computational Engineering (sounds similar to Computer Engineering, but actually it is not, so please click to know more) at the University of Erlangen-Nuremberg in Erlangen, Germany. My scientific interests lie in the development of numerical methods for solving complex problems in Mechanical Engineering, especially Solid Mechanics. Put simply, I might, for instance, be interested in determining under what loading conditions the axle of your car or the landing gear of an airplane could fail. One could build prototypes and do physical tests, but this requires a lot of effort, money and time, and may even be infeasible in certain cases. Thus, we resort to the numerical route, which involves complex mathematical equations that cannot be solved by hand, so one has to write computer programs for that. High Performance Computing (HPC) is concerned with using powerful supercomputers to solve such complex problems.
Born and brought up in New Delhi, India, I did my schooling and undergraduate studies there. After my Bachelor’s, I worked for three years at an automotive company, where I was involved in strength and durability analysis (something similar to what is described above) for the design of motorcycle structural parts. These calculations are based on a mathematical technique called the Finite Element Method, and in an industrial context they are generally implemented using commercially available software. The competencies and passion to excel which I developed while working there were indeed a turning point in my professional life. Avid thinker and curious mind that I have always been, I decided to sound the depths of how these software tools actually work behind the scenes to present the engineer with colorful pictures containing important information. This pursuit of excellence, as I consider it, led me to pack my bags, leave my family and friends and come all the way to Erlangen, Germany, where I have now lived for the past two years.
Besides solving problems in Mechanics, I am also engrossed in hiking, travelling, trying different cuisines, watching movies, and playing chess and table tennis. Last year, I got the opportunity to go hiking in the Sarntal Alps in the South Tyrol area of Northern Italy, and I look forward to seeing new places during my stay in Slovenia over the summer. By the way, I already had the opportunity to taste traditional Slovenian food on my way to Ljubljana, and it was yummy.
At the Triple Bridge in Ljubljana, Slovenia during last weekend
Now, on the verge of completing my Master’s studies, I perceive the PRACE Summer of HPC program as a unique opportunity to further develop my skills in Computational Science and this time apply them to a different application – CFD – as I spend the summer at the Faculty of Mechanical Engineering of the University of Ljubljana, trying to come up with a strategy to automate the complete CAD–CFD process using Open CASCADE and OpenFOAM. Last week, I, along with twenty other participants, attended an introductory training event at the IT4Innovations Supercomputing Center in Ostrava, Czech Republic. This training not only allowed us to try our hands at the Salomon supercomputer, but also let me meet a bunch of amazing people from different parts of Europe, interact with different cultures and make friends. I hope to pursue this further during my stay in the beautiful city of Ljubljana.
Welcome to the PRACE Summer of HPC website! My name is Philippos and I am one of the current participants. I did my Bachelor’s in Computer Science at the University of Cyprus and my Master’s in Advanced Computer Science at the University of Cambridge. My computing interests range from computer architecture and high-performance computing to machine learning. I also find other science subjects, such as physics and mathematics, very interesting. A fascinating fact about the Summer of HPC program is that the majority of the projects combine different disciplines, and this is the main reason I decided to apply.
My home country is Cyprus. In Cyprus we have a supercomputer at the Cyprus Institute that I have used on multiple occasions. It is amazing to think of what can be achieved with parallelisation even on such a relatively small supercomputer. For example, in a typical utilization of the machine you can run a workload of around 10,000 CPU hours in about a day. This essentially means that an experiment that would take over a year on a single-core computer can be run in one day. As the problem size scales, and as we move to bigger supercomputers, we can complete previously unfeasible tasks in a small amount of time.
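The back-of-the-envelope arithmetic behind that claim can be sketched in a few lines (the figures are the illustrative round numbers from above, not benchmarks):

```python
# Illustrative back-of-the-envelope: how a parallel machine compresses
# a 10,000 CPU-hour workload into roughly a day of wall-clock time.

cpu_hours = 10_000            # total work in the workload
hours_per_day = 24

# Cores needed to finish in one day, assuming perfect parallel scaling
cores_needed = cpu_hours / hours_per_day
print(f"cores for a one-day turnaround: {cores_needed:.0f}")   # ~417 cores

# The same workload on a single core, expressed in years
hours_per_year = 24 * 365
single_core_years = cpu_hours / hours_per_year
print(f"single-core runtime: {single_core_years:.2f} years")   # ~1.14 years
```

Real scaling is never perfect, of course, but the orders of magnitude are what make the machine worthwhile.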
In my free time I like to listen to classical music. Some of my favourite composers are Bach, Fauré, Brahms and Duruflé. My music theory knowledge is limited, but I find the mathematical aspects of some pieces very interesting. In addition, I used to play the violin as a hobby, and it is amazing how many different sounds a single instrument can produce. Each violinist has a unique way of expressing a piece of music, and the interpretation may be influenced by the player’s experiences, the wood with which the violin was made, the string selection etc. My question would be whether an artificial neural network could be trained to reproduce a specific player’s playing ability given only the music score for different pieces. It would surely be a complex application, and a single conventional few-core computer would probably be insufficient for this amount of computing.
Computer science is a rapidly evolving field and doing related research can be fascinating. I look forward to beginning experimentation with my project at Forschungszentrum Jülich in Germany.
Friends quote: “A Cypriot programmer that has enthusiasm about a variety of things such as mechanical watches and computers.”
You can find more of my interests and some publicly available software (including a 3D N-Body simulator and a twilight calculator) by checking out my personal website: ucy.philippos.info.
My name is Edwige Pezzulli, and I am a girl from the Eternal City, Rome, where I live with my wonderful dog, Moja. Currently, I am finishing my Ph.D. in Astrophysics at the University La Sapienza.
Me in Monument Valley
During my studies, I visited many cities and spent three months in Paris, a city I fell in love with.
My research studies the formation and growth of the first supermassive black holes – monsters with enormous masses, up to 10 billion times the mass of our Sun, formed when the Universe was very young.
However, stars and galaxies are not my sole interest. I love traveling all over the world (my last trip was to the “autentica Cuba”) physically, but also by reading. I am a cinephile, and I like sports such as rugby and boxing/MMA – a full-contact combat sport I only recently discovered. I do volunteer work with young inmates, and I would like to hold astrophysics seminars in jails. I am vegetarian, love animals, nature, enthusiasm and knowledge, and I am firmly convinced that the key to progress resides in the promotion of diversity.
Moja in Parc de Sceaux, Paris
I am pretty excited to work with climate visualization, which is the topic of my project. I will spend the next two months at the Aristotle University in Thessaloniki, a city full of cultural life and events. The project has many possible applications, much more concrete than my Ph.D. studies – but I don’t like to ask myself whether something is merely “useful”.
Visualizations give researchers intuition about how climate systems behave and are also important in communicating the problem of climate change to the public. In fact, possible advancements towards a greener future start from knowledge of the effect of our actions on Earth’s climate and ecosystem.
My name is Mahmoud, originally from Egypt, and currently a PhD student at the National University of Ireland Galway (NUIG). I have been enjoying my time very much in Ireland. I love the vibrant campus of NUIG, and the relatively simple life in smaller towns like Galway City. I am joining a Summer of HPC project at the Aristotle University of Thessaloniki in Greece.
Being on the lookout for research-related projects, I am very glad to join the Summer of HPC program. I particularly get a thrill out of doing research and developing new ideas. I enjoy it even more in the case of interdisciplinary research projects that jointly synthesize different fields (e.g. computing and life sciences). I expect to experience such interdisciplinary advantages during the summer project, where I will be working closely with a climate scientist.
The Quadrangle, NUIG
With regard to my PhD research, it mainly involves simulation modeling along with machine learning (ML). We attempt to develop a hybrid approach that integrates simulation models with ML. At its core, our approach is based on the premise that a system’s knowledge can be partially captured and learned in an automated manner, aided by ML models. We conceive that the proposed approach can help lead to self-adaptive simulation models that can learn to change their behaviour in accordance with changes in the real-world system’s behaviour.
Besides computing and research, I have always been keen on learning about arts, particularly drama and theatre arts. I enjoy reading plays, and watching theatre very much. I also studied drama at the Academy of Arts in Cairo. I deeply believe that studying a quite different discipline has widened the breadth of my knowledge.
My favourite quote is “An investment in knowledge pays the best interest”, Benjamin Franklin.
With the SoHPC participants Leon, Zheng, and Karina inside IT4I, Ostrava.
Have a look at my intro video, shot inside the IT4Innovations supercomputing facility in Ostrava, Czech Republic:
“A typical German guy”. That is me, Anton Lebedev, according to my co-conspirator Aleksander Wennersteen, who will be joining me at the Barcelona Supercomputing Center (BSC) in Spain to work in the domain of general purpose graphics processing units (GPGPU) programming.
And now, true to the description given above, the description of myself and my project shall follow suit.
A first-generation immigrant to Germany from Ukraine, I have finally obtained an MSc in physics at the University of Tübingen after an initial stint at ETH Zürich in Switzerland. Now I consider myself a free radical at my current institution – a student still, but looking for an interesting PhD position. I pledge no allegiance to anybody anymore.
Since at the time of this post the summer term and lectures are still ongoing in Tübingen, I will be a remote teaching assistant during half of the PRACE Summer of High Performance Computing (SoHPC) program.
My field of study was and remains theoretical electrodynamics with elements of general (or geometric) relativity. In the course of my work and due to careful theoretical analysis
Thinking (or pretending to think) about the code during the introduction week.
I was able to reduce the computational cost of the numerical methods used in my thesis, so there was no need to use high performance computing in the not-so-distant past.
This reduction in complexity left me wanting to work in high performance computing again – at least for a while. Thus I have applied for the PRACE “Summer of High Performance Computing” program.
I was quite surprised to have been selected for the project I will be doing in the upcoming weeks, since I do not consider myself to be a prolific programmer. I aim, in part, to use this opportunity to compare and evaluate the work ethics and methodology I have acquired in Tübingen in an international setting.
As to the project, I will be porting and optimizing existing parallel code which implements Markov-chain Monte Carlo methods for computing approximate solutions to linear systems of equations, so that it takes advantage of NVIDIA GPUs. These solutions will then be used as so-called preconditioners for iterative solvers (e.g. the conjugate-gradient solver) for very large linear systems, which occur, among other places, in almost all engineering simulations.
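To make the pipeline concrete, here is a minimal sketch of a preconditioned conjugate-gradient solve in plain Python. In the actual project the preconditioner is an MCMC-built approximate inverse running on GPUs; here a simple Jacobi (diagonal) preconditioner stands in for it, purely for illustration:

```python
# Preconditioned conjugate gradient (PCG) for a symmetric positive-definite
# system A x = b. The preconditioner is applied as z = M^-1 r each iteration;
# a better M^-1 (e.g. an MCMC-estimated approximate inverse) means fewer
# iterations. Here M^-1 is just the inverse diagonal of A (Jacobi).

def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                    # residual r = b - A x, with x = 0
    z = [M_inv_diag[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [M_inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

# Tiny SPD test system: exact solution is x = (1/11, 7/11)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
M_inv = [1.0 / A[i][i] for i in range(2)]       # Jacobi: inverse of the diagonal
x = pcg(A, b, M_inv)
print(x)
```

Production solvers for systems of this size obviously live in optimized libraries; the point of the sketch is only where the preconditioner plugs in.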
The performance analysis of the resulting code will be carried out on the technologically bleeding-edge cluster that is MareNostrum 4 at the BSC.
Adjusting the settings for optimal pictures at the beautiful Nasir al Mulk mosque in Shiraz (Iran). (Image provided by Marc Sindlinger)
In my time off I enjoy a few things. Chief amongst them is road-racing. Along with Aikido, it helps to keep me in shape – at least to some degree. As to the arts, I practice photography whenever I expect interesting motifs to be found and time permits.
A particular quirk of mine is my focus on cybersecurity and privacy, which makes any publication a drawn-out iterative procedure. But I still think it is worth the extra work, for the less the internet knows about oneself, the better.
Bike tour to a place with a collection of well-preserved historic windmills and houses during my ERASMUS semester in Amsterdam
Hi, my name is Jan Packhäuser, and I am a 23-year-old student from Germany. I was born and raised in a small Bavarian town called Miltenberg, which is located close to the border of the federal states Baden-Württemberg and Hessen. After graduating with a university entrance diploma from a school with an IT profile, I decided to move to Ulm to study mathematical economics.
While studying at the University of Ulm I was interested in numerical mathematics, especially in solving optimal control problems. At an early stage of my studies I noticed how time-consuming computations can be and this channeled my interest more and more toward the field of high performance computing.
During the PRACE Summer of HPC program, I will spend two months at the Faculty of Mechanical Engineering of University of Ljubljana, to work on a parallel algorithm for non-negative matrix tri-factorization.
In my leisure time I enjoy playing chess or doing sports. I also like music a lot and play the trumpet.
I am really looking forward to this summer. Besides working on an interesting research project, I want to discover everything that Slovenia has to offer. On our way from Ostrava to Ljubljana, Leon, Paras and I already spent an enjoyable afternoon in the mountains. It was a promising foretaste of what is to come.
In the mountains, north-western area of Slovenia, together with Leon in the middle and Paras to the right.
Here you can find a small video taken during the introduction week in Ostrava at the supercomputing center IT4Innovations.
I’m 23 years old and in my third year of the five year MPhys in Mathematical Physics at the University of Edinburgh. This summer, I will be participating in the Summer of HPC, based at the Barcelona Supercomputing center.
Originally from Norway, I couldn’t tell you the precise reason I decided to follow the thousand-year-old Norwegian tradition of conquering the British Isles; nevertheless, it is going according to plan. I shall finish my ancestors’ work!
Academically, my interests range from the abstract corners of Mathematics, through Physics to Computer Science. This is the reason I have decided to leave the safe cocoon of theoretical physics and maths to go to the Barcelona Supercomputing Center (BSC-CNS) to learn more about Deep Learning (DL) and visualisation. My project title is “Monte Carlo and Deep Learning Methods for Enhancing Crowd Simulation“.
I chose to participate in the PRACE SoHPC program both for the cool project I got and for the prospect of meeting more people from around Europe. We represent most of Europe, from Finland to Spain and the UK to Greece. The main thing separating SoHPC from other Europe-wide summer programs is the wide variety of fields we come from: there are naval engineers, computer scientists, chemists, physicists, mathematicians and so on. In academia, you seldom meet people from such a wide range of fields, so this is truly an amazing experience.
Apart from my wide academic interests, I consider myself an avid traveler and hiker. Below are pictures of some of my more recent highlights.
Walked up this random hill near Munich for an hour or so, then we found this view!
Actually, this was taken just a 15 minute walk from one of the main north south roads in Norway!
Awright troops, my name’s Jamie and I’ve been from Glasgow my entire life.
Currently I’m doing my PhD at the University of Glasgow, researching the way viscosity in the solar atmosphere interacts with the magnetic field there. As part of this work I attempted to secure funding for a research trip to the sun but apparently it’s too warm this time of year, even at night.
Since my funding fell through, I decided to feed my inner masochist and apply for the PRACE Summer of HPC (high performance computing) program, mainly for three reasons. Firstly, being forced to be paid to work on a fascinating project on radiosity in computer graphics in another country sounds simply awful. Secondly, I unfortunately love playing music, as you can read below, so the idea of heading to Dublin, a hub of Irish folk music, fills me with the most foul feelings of dread. Lastly, I really do hate people so traveling to Ostrava in the Czech Republic to meet and work with a horribly wonderful bunch of people just seems like hell.
Seriously though, as we say in Glasgow, I’m pure mad buzzin’, pal.
Jimbles The Science Guy
So you already know my main work in my PhD, in a field called magnetohydrodynamics, but I’ve dabbled in a fair few areas over the years. I did my undergraduate integrated masters in both Maths and Physics so for my Masters thesis I investigated double diffusive convection, where a difference in temperature and a difference in salinity in water, for example, leads to some interesting fluid flows and some rather cool patterns. Moving a little bit earlier, my Bachelors dissertation focused on 2-dimensional topological quantum field theories. Turns out 2D topological quantum field theories are not so interesting to me and not particularly useful to anybody else.
A Wee Bitty Fiddlin’
At the tail end of 2013 I picked up the fiddle (violin) again and started playing traditional music. First the kind of cheesy trad(itional) tunes you find at a typical Scottish ceilidh, then I branched into the wonderful Glasgow trad session scene (sessions are held in pubs or cafes and you go along and just sit down and play, often for a few free drinks, kind of like a cooperative open mic night) and found all sorts of amazing contemporary tunes, along with some banging tunes written yonks ago. In 2015 I was privileged to lead the Glasgow University Folk Music Group, and since then my friend Ayden and I have started a duo, creatively named Jamie ‘n’ Ayden. We’ve played a number of gigs around Glasgow, playing mainly our own compositions, and we even managed to find ourselves on the radio during this year’s Celtic Connections, Glasgow’s folk music festival! Check out one of our tracks below and you can find our Facebook page here, our soundcloud here and some of my own compositions here!
If you want a wee taste of some other interesting Scottish folk, check out Imar:
Or Elephant Sessions if you fancy a bit of rock fusion:
Or even EDM style Niteworks:
Art? Aye, Art.
Just recently I’ve gotten extremely interested in computer art, specifically generative art produced by ideas from mathematics and the natural sciences. It’s a wonderful mix of coding, science and art that allows me to express myself creatively beyond my music, and to explore artistic ideas without being bogged down by my truly abysmal drawing skills. As part of a talk I recently gave on this kind of computer art, I wrote a few wee online toys for making some of the neat images you can check out below. They’re nice to look at but it’s more fun to go play with them yourself, so go check them out here!
Sound wave-like pattern from the y-coordinate of a double pendulum. This is an example of taking data from a chaotic system and making it visually interesting. Have a wee play about yourself here.
A Julia fractal, one of the classic examples of mathematical art. This was made while trying to introduce myself to the web programming tool WebGL so please don’t look at the code on my github if you’re that way inclined. It’s awful. Do play with the fractals yourself here though!
Chaotic particle paths in the famous example of chaos theory, the Lorenz attractor. This was one of the first ever examples of a chaotic system but is extremely simple, consisting of just three short equations and giving some really interesting behaviour. As with the rest of these pieces, you should so totes omg go check it out here.
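Since the system really is just three short equations, they can be written out and integrated in a few lines of Python (forward Euler is a rough scheme, but good enough for making pictures like the one above):

```python
# The Lorenz system with the classic parameter values:
#   dx/dt = sigma * (y - x)
#   dy/dt = x * (rho - z) - y
#   dz/dt = x * y - beta * z

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def lorenz_step(x, y, z, dt=0.005):
    """Advance the state one forward-Euler step."""
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return x + dt * dx, y + dt * dy, z + dt * dz

# Integrate one trajectory from a point near the attractor;
# plotting the (x, z) pairs traces out the famous butterfly shape.
state = (1.0, 1.0, 1.0)
path = [state]
for _ in range(10_000):
    state = lorenz_step(*state)
    path.append(state)

print(len(path), path[-1])
```

Two trajectories started a hair apart diverge completely after a few thousand steps, which is exactly the sensitivity to initial conditions that makes the system chaotic.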
Fractal produced from the regular expression “.*[13|24].*”. Have a play with the randomly generated ones over here!
You can read about these and some of my other projects on my personal blog at jamiejquinn.com!
Hello World.f95, my name is Dimitra Anevlavi and I am from Greece. I am 22 very proud years old 🙂 and I am currently in my 4th year of studies in Naval Architecture and Marine Engineering at the National Technical University of Athens. Since I was a little girl, I was inspired by science and mathematics. I strongly believe that my grandfather, who was also an Engineer, had an impact on my decision to follow these studies with his intelligence and love for creation.
In my first steps at the university I discovered my passion for Computational Engineering while working on semester projects. Many of my professors played an important role in inspiring me in this field and continue to do so to this day. I am quite fond of Computational Fluid Dynamics, and I am currently working on my diploma thesis on the modeling of deformable hydrofoils that operate under the sea surface and harvest energy from waves. I guess that body-fluid interaction for modeling oscillating hydrofoils and biomimetics will be my thing in the next few years. For that reason, I take numerical methods by the hand and jump right into the water. The PRACE Summer of HPC program is a unique opportunity for me, not only because it is actually my first project abroad, but also because it is full of new and exciting parallel programming techniques, which I will definitely implement in my thesis. During the summer internship I will be working on the development and validation of real-time earthquake hazard models at the Edinburgh Parallel Computing Centre (EPCC).
550 km of backpacking to go until we reach Santiago de Compostela 🙂
In the future, I look forward to becoming a young researcher and contributing to society by improving people’s lives. It is what you pursue in life that keeps you motivated, so I guess I really am “the girl who is smiling and computing what it takes to keep her ship sailing”. One of the first tasks of an engineer is to identify the physical problem that needs to be solved and then, after evaluating many parameters, to come up with a plan to produce the desired results with a cost-effective amount of time and resources. For that reason, I look forward to using the knowledge I have gained to produce results that can benefit society in general. But you have to broaden your horizons for that and think outside the box, right?
For this reason, traveling is also one of my favorite things to do, since every time you discover a new place you also discover important things about yourself. In 2015, I followed my best friends on a pilgrimage in Spain, the Camino de Santiago, which had Pamplona as its starting point and the famous city in La Coruña, Santiago de Compostela, as its final destination. Walking 300 km with backpacks sounded extremely difficult to me, but support from loved ones and determination were all I needed to keep going. Since then, I grasp every opportunity I get to travel around the world and learn more about cultures, music, cuisine, history and people.
I will keep you all up to date about my new adventures in Scotland! Lots of Python code awaits me 🙂
Hint : Check out my introduction video!
and a cool collaboration with my friends Sam, Jamie and Andreas 🙂 Cheers
So, I’m supposed to write a blog about myself and my project during the summer. I have no idea what to type and I suddenly feel a real kinship with this dog.
Truthfully, I’d much rather hear about you, what are your interests? Tell me in the comments, I’m a great listener (well…… I’ve perfected the smile and the nod).
Ok, where to start? I’m told your name is usually a good way to begin these things, mine’s Andreas. I’m a 22 year old chemistry graduate hailing from tropical Birmingham, England.
Insert interesting fact about myself.
One thing that people always tell me is that I’m very photogenic, demonstrated perfectly by the picture below….
Why? I’m the guy smiling with only half his face in case you were wondering
If you were to ask what inspired my passion for all things science I wouldn’t be able to point towards a specific person or moment (although both of my parents hate science, so maybe I subconsciously followed this path as an act of rebellion?). I don’t think the why really matters, just that I’m here now and I get to work with really, really, (one more really? I think so) really big computers.
Computers, chemistry and caffeine
When I tell people I’m a chemist, they usually assume I make drugs or blow stuff up (maybe it’s just the vibe I give off?), so I wouldn’t be surprised if right now you’re asking yourself what a chemist is doing in a program about high performance computing. I’m actually a computational chemist. That just means I do experiments in silico rather than in a laboratory (you could say that I actually do make drugs and blow stuff up…… just from the safety of my computer). Computational chemists concern themselves with the fundamental properties of atoms, molecules and chemical reactions. Essentially we develop code to simulate molecules and solids (for example, we might simulate ice melting).
Like computer scientists, we develop code and ingest copious amounts of caffeine, however, we can also tell you the molecular structure of caffeine (aka 1,3,7-trimethylxanthine, shown below), how to make it and how it works (take that computer science!).
As a chemist it’s obligatory to post a chemical structure
Over the summer I’ll be working at the Computing Centre of the Slovak Academy of Sciences on the parallelization of software that models nanotubes using the Hartree-Fock method. My first task is to use MPI to parallelize the code. If successful, the next step would be an MPI + OpenMP hybrid implementation. My plan for this blog is to post updates on my project, and any excursions I may take over the summer (if I have time I might also write a series of posts describing what the Hartree-Fock method is, but I haven’t fully committed to that yet).
So, come back for that (or don’t, I can’t force you to keep reading this blog); in the meantime, here’s a video that some of us made to celebrate all the parallel programming we’re going to be doing over the summer.
Hi, my name is Sam Green and I’m from a small city called Waterford in the South of Ireland.
A photo of me while at the cliffs on the island of Malta. A large portion of my childhood was spent exploring the local beaches in the south of Ireland so I think this is a fitting photo to show I have a life outside of academia.
I grew up in the countryside surrounding the city (technically Kilkenny countryside, but let’s not get into those sorts of details) and until a few years ago didn’t see life outside of there. I finished secondary school when I was 17 and got the grades I needed to enter the college of my choosing.
This led me to the big city of Dublin, Ireland’s capital if you’ve never heard of us, to do a Bachelor’s Degree in Physics and Astrophysics at Trinity College Dublin. During my 4-year course there I discovered my passion for astronomy, computers and, more importantly, the desire to be a researcher. This was mainly due to research I carried out during my final year, where I worked on code to model the differential emission measure (the amount of material emitted at different temperatures) of delta-spots (highly magnetized sunspots) during a solar flare event on the Sun. I decided a Master’s degree would be important in achieving my goal of entering academic research, and so I entered an MSc program called Space Science and Technology at University College Dublin. This course (eventually) proved to be an important decision for my academic career – mainly due to the 3-month internship that was required to complete the degree.
This leads me to the Dublin Institute for Advanced Studies (DIAS, to see more please visit https://www.dias.ie). I began an internship at DIAS working on a small project with a guy named Jonathan Mackey (soon to be my PhD supervisor). His research, in a nutshell, involves massive stars and their explosions as supernovae at the end of their lives (if you’d like to know more please visit https://homepages.dias.ie/jmackey/research.html). I began working with a code he created called pion, which is a grid-based fluid dynamics code for simulating the circumstellar medium around massive stars, and used it to create simulations of stellar wind bubbles around massive stars. This is where I was first introduced to high performance computing and began using the computing facilities at ICHEC. Over the next 3 months I learned more and more about HPC, Python, C++ and parallel computing (oh, and of course physics), and felt I was in the right area of research. So I applied for a PhD position within DIAS, and by the end of the internship I had secured a 4-year scholarship to continue my research as a PhD student.
Photo of Dunsink Observatory in Dublin, Ireland. The image shows the main observatory building where the famous mathematician William Rowan Hamilton used to live while he was a professor in Trinity College Dublin.
Since I started my PhD about 7 months ago, I have continued to work on 2-dimensional simulations of massive stars and how they interact with their surroundings. My current work involves modeling the Bubble Nebula (NGC 7635) and trying to understand its formation and hence create a model to describe it. DIAS also runs and maintains an observatory on the outskirts of Dublin called Dunsink Observatory. Now if I didn’t mention anything about this place I think Hilary (Hilary O’Donnell organizes all the public outreach that goes on within Dunsink) would kick me out. Since September, I have been voluntarily helping out at public events and workshops held here. Since then, I have developed talks that I now give on a regular basis to the general public, schools and college societies (ask David Bourke). I also work on a meteor camera system that monitors meteors entering the Earth’s atmosphere (see Nemetode.org for more information).
During the beginning of my research, my supervisor sent me an email about the advert
Working hard during a hands-on session at the Summer of HPC training week in Ostrava, Czech Republic.
for PRACE’s Summer of HPC program and how it could be a good way to further my knowledge of HPC and get the chance to do a project outside of Ireland. Of course I jumped at the idea and submitted my application. Within a few months I was picked to do a project at EPCC in the University of Edinburgh entitled “Interactive weather forecasting on supercomputers as a tool for education”. This project caught my interest because every aspect of it involves things I am interested in. I have always been curious about how weather forecasts are modeled and how supercomputers are used to do this, and I also look forward to learning how this type of computing can be used for education. My time at Dunsink Observatory has taught me the importance of outreach in astronomy and science as a whole. I know the next 2 months will be both exciting and educational!
My name is Alessandro Marzo and I’m from Pesaro, Italy. If you have never heard of it, just know that it is famous for two things: 1) it is the birthplace of the famous Italian composer Gioacchino Rossini, and 2) it’s the only place in Italy where you will find mayo and boiled eggs on a pizza, named after the composer himself.
I’m currently pursuing a Master’s Degree in Applied Physics at the University of Bologna, focusing on Medical Physics and High Performance Computing. I’ve always loved Physics, so much so that during my studies I struggled to choose one topic to focus on and switched fields a couple of times. That is, until I discovered High Performance Computing! To me HPC is the perfect balance between theory and experiments. Carrying out a computational simulation of some complex physical process (one that can match experimental results) gives you a higher form of understanding of the phenomenon, while you acquire practical knowledge in the meantime. In hindsight, I realized this is the reason I first decided to study Physics, and the reason why after my Master’s studies I intend to pursue a PhD in Computational Physics.
But that’s not everything about me! I also love hiking and enjoy long walks in the city or by the beach. While at home, I like to spend my free time reading books and watching movies and TV shows. I am a big fan of Stanley Kubrick, but I also like Lars von Trier, so I am pretty excited to visit the country where he comes from this summer!
Me hiking in the Dolomites
I am going to spend the summer in Denmark – the happiest country in the world – at the Niels Bohr Institute, University of Copenhagen, working on the project Tracing in 4D data. My task is to implement a parallel version of tracing algorithms for muscle tissue in 3D over time, to study and visualize the fast, micrometer-scale internal movements of small animals (in my case blowflies) while they try to escape the lethal doses of radiation that we need in order to take nice pictures of them!
Here’s me presenting in parallel with my friend Konstantinos Koukas:
My name is Konstantinos Koukas and I am a 22-year-old student from Athens, Greece. I am currently pursuing a Bachelor’s degree in Computer Science at the Department of Informatics and Telecommunications of the University of Athens. I plan to graduate in the summer of 2018 and intend to continue my studies towards a Master’s degree. My research interests include Database Systems, Data Mining, Big Data, Distributed and Parallel Computing, as well as Machine Learning. I have been working as a teaching assistant in the undergraduate course ‘Introduction to Programming’ at my university for the past two years.
During my studies, I attended a parallel computing class where I was fascinated by the ability to unlock the potential processing power of modern computers using parallel programming techniques. This is why I decided to apply to the PRACE Summer of HPC programme, and I had the opportunity to participate in a hands-on training week in Ostrava, Czech Republic, making new friends and getting introduced to a number of HPC technologies.
I am excited to continue my experience, spending two months this summer in Denmark working on accelerating climate kernels, a project hosted by the Niels Bohr Institute, University of Copenhagen. My task will be to improve the performance of ocean numerical solvers in the Versatile Ocean Simulator (Veros) project by porting them to run on different accelerators, particularly GPGPUs and Xeon Phis. I am eager not only to enable more efficient climate simulations and gain more experience in supercomputing, but also to meet interesting people and discover the culture of the happiest country in the world.
I am an enthusiastic programmer and enjoy learning about new technologies, which is why I like participating in local hackathons – as they are a great opportunity to build small projects and expand your skills. I am also an avid supporter of free and open-source software. Apart from programming, I love traveling to foreign countries as well as exploring the numerous treasures of my homeland, Greece.
My friend Antti described me as
A cool guy
Here is me presenting in parallel with my friend Alessandro Marzo:
Hello there, my name is Adrián Rodríguez Bazaga and I’m a 21-year-old guy from Valencia (the home of the paella!), but I come from the beautiful island of Tenerife in the Canary Islands, Spain. This is where I just finished my Bachelor’s Degree in Informatics Engineering at the University of La Laguna. In September of this year I will be travelling to Barcelona to pursue my Master’s degree in Innovation and Research in Informatics, with a specialization in Data Science and Machine Learning, offered by UPC (Polytechnic University of Catalonia).
Me at the Presidential Garden in Bratislava
Throughout my degree I learned many things about the world of technology, which aroused my curiosity about certain fields of Computer Science. Specifically, I am very interested in Data Science (Data Mining, Knowledge Discovery), artificial intelligence (Machine Learning, heuristics), Bioinformatics (Genomics, etc.), parallel algorithms and High Performance Computing (HPC). Drawing conclusions from data sets that initially seem to have no value is something that fascinates me, as is the surrounding work of data analysis, mathematical statistics and Deep Learning, since together they provide powerful tools for making important decisions.
In 2016, I had the opportunity to work as a research intern at a renowned research institution, where I worked on a Big Data project: RDF processing solutions through Big Data for discovering relationships between concepts in DBpedia. From December 2016 until June 2017 I then worked on a research project with a group at the University of La Laguna, thanks to a grant from the Spanish Government. The project, ‘Exploiting Open Data sources through Data Mining, classification and regression techniques with Spark to analyze traffic flow’, proposed the use of Data Science and Machine Learning techniques with Apache Spark: decision trees and multilayer perceptrons were used to predict the level of road traffic congestion on the road network connecting the container terminal of Tenerife’s port to the highway access. This matters because that network is the main traffic bottleneck when delivering products into and out of the port.
In June 2017, I finished my final degree project, ‘Heuristics and Big Data in mathematical optimization problems: extension to the Tourist Trip Design Problem’, where my objective was to solve a problem that is NP-hard by definition using an artificial intelligence approach; specifically, approximate algorithms (heuristics) such as GRASP with LRC, among others. Furthermore, I had to work with data from different datasets and link them (Linked Data); together, they were used to gather information about every point of interest available on Earth, a volume of more than 100 million instances (hence the Big Data label).
My interests go further still: I want to learn, research and develop tools that improve medical applications using Data Science, which brings me to my current interest, Bioinformatics. This interest was motivated, among other reasons, by my visit to the ITER Supercomputing Center, home of the TEIDE HPC supercomputer (the second most powerful in Spain), where IonGAP (an integrated genome assembly platform for Ion Torrent data) is used as part of a chain of tools for genomics research and Data Mining for medical diagnosis. Medical data mining has great potential for exploring hidden patterns in data sets from the medical domain, patterns that can be used for clinical diagnosis. Given that available raw medical data are widely distributed, heterogeneous and voluminous, my interest is in collecting those data in an organized form to build medical information systems, which could help reduce the large number of deaths that could be prevented by a timely diagnosis. To make this possible, we need High Performance Computing (HPC): the practice of aggregating computing power to deliver much higher performance than a typical desktop computer or workstation, in order to solve large problems in science, engineering or business.
During this summer, thanks to the PRACE Summer of HPC programme, I will be working at the Computing Centre of the Slovak Academy of Sciences on the project “Apache Spark: Are Big Data tools applicable in HPC?”, where my objective is to implement and optimize the routinely used quantum chemistry Hartree-Fock method in Scala with Spark and in C++ with MPI, and to benchmark the two. Other quantum chemistry algorithms, such as Density Functional Theory and second-order Møller-Plesset perturbation theory, will also be looked into, so I’ll need to deal with quantum many-body theory, trying to bridge HPC with the Big Data world and producing visually appealing outputs such as molecules and orbitals.
Greetings from the far north! My name is Antti and I like all sorts of things mathematical. There’s just something very savoury about a well-structured, intuitive and dumbfounding mathematical proof. Basically, I like to look for problems and solve them. I especially like it when something is incomprehensible; that thing is then a mystery waiting to be unraveled!
Mathematics is all very fascinating, but sometimes you also need to do some calculating. That’s hard work and I’m quite lazy, so most of the time I let the computer do that stuff. Programming computers is difficult and time-consuming, especially if one doesn’t have the know-how. That’s why I decided to participate in the PRACE Summer of HPC program, where 21 students from around Europe will learn about data visualization and high-performance computing from top experts. My home institute, the University of Eastern Finland, where I study applied physics, was more than happy to send me off to this prestigious international summer study program.
Here’s me on top of an old steel mill. It’s perfectly safe, you see: I have a helmet. Click on the image to go to a blog post about Ostrava, written by my friend, Dimitra.
The first week of this program went whizzing by in Ostrava, Czech Republic, where all of us learned about parallel programming on all sorts of compute clusters, vectorization in modern processor chips, and data visualization using the ParaView software. Days were spent on lectures and hands-on exercises, and nights were spent enjoying each other’s company in local restaurants. Without a doubt, I can say that the last week was the most fun I’ve had in a long time! It also turned out that I’m a pretty good dancer, at least on the tables of an Irish pub.
The next seven weeks I’m spending in Jülich, Germany, at the Jülich Research Centre. It was originally the nuclear research centre of West Germany, but the nuclear physics research gradually diminished while the presence of other disciplines grew. Nowadays, it is one of the biggest research centres in Europe, carrying out top research in many different fields. My project concerns programming GPUs to compute Coulombic forces, i.e. the attractive and repulsive electric forces between charged particles, and to carry out these computations blazingly fast using the Fast Multipole Method. Roughly, the method clusters the particles into larger cells and computes forces between the cells instead of between every pair of particles. To make the code accessible for more users and devices, I am using an open-source framework called the Heterogeneous-Compute Interface for Portability, or HIP, which enables programming for multiple types of GPUs simultaneously. I still don’t know the specifics of my project, but you can read the extent of it here.
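To make the contrast concrete, here is a small Python sketch (my own illustration, not code from the project) of the direct pairwise Coulomb computation whose O(N²) cost the Fast Multipole Method is designed to avoid; constants are simplified to k = 1.

```python
# Direct O(N^2) Coulomb forces: the baseline that the Fast Multipole
# Method accelerates by clustering distant particles into cells.
# Illustrative sketch only; units and constants are simplified (k = 1).

def coulomb_forces(positions, charges):
    """Return the net force vector on each particle (pairwise sum)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(d * d for d in dx)
            r = r2 ** 0.5
            # |F| = q_i q_j / r^2, directed along the separation vector
            f = charges[i] * charges[j] / r2
            for k in range(3):
                forces[i][k] += f * dx[k] / r
                forces[j][k] -= f * dx[k] / r
    return forces

# Two unit charges one unit apart repel with unit force:
f = coulomb_forces([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
```

The double loop touches every pair once, which is exactly the N² growth that makes clustering approaches like FMM necessary for large particle counts.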
Lastly, since this is an introductory post, to get to know me even better, the following is a small subset of my favourite things:
Favourite class of linear partial differential equations: Elliptic equations
Favourite numerical approach in problem solving and simulation: Monte Carlo method
Hi everyone! I am Shukai Wang, a Chinese girl from a very very very beautiful city called Hangzhou (you should visit!). I have spent the past 7 years in the UK, but I am now following the two-month PRACE Summer of HPC program in the Czech Republic. I completed my undergraduate degree in Chemistry at Imperial College London, with a placement year at the Schlumberger Gould Research Centre in Cambridge. My project there was to develop hydraulic fracturing fluids, the fluids pumped into the ground to produce shale gas. During the placement, I realised the importance of interdisciplinary research skills, so I decided to pursue my PhD at the same institution, simulating HIV proteins (Vpu) in membranes using Molecular Dynamics. This is where my HPC journey began.
Thanks to the full support of the PRACE Summer of HPC program with my visa and travel arrangements, I will join the IT4Innovations Centre, the best High Performance Computing centre in the Czech Republic. The campus is nice, big and quiet, with a big forest just a 3-minute walk from the hotel. My project is Performance Visualization for Bioinformatics Pipelines. The idea is to visualise the results of performance analysis of machine-learning pipelines and to understand the bioactivity trends of the molecules involved. I hope to become a Python master by the end of my project, particularly in data visualisation and machine learning.
During my spare time, I enjoy travelling. I have been to 17 countries, experienced the fascinating northern lights near the North Pole and the extraordinary Milky Way in the Sahara Desert, helped people in Bosnia rebuild their homes after the war, and volunteered at the once-in-a-lifetime London 2012 Olympics. I am also interested in business and believe that a combination of business and technology will lead our future. I have completed five real consulting projects with the consulting society at Imperial College and participated in the Accenture Hackathon, where I was nominated for the rising star award. I am also a big fan of swimming and painting.
My name is David Bourke, and I am a Computer Engineering student studying for my integrated Masters degree at Trinity College Dublin (TCD). Before coming to TCD, I completed a Bachelors degree in Electronics and Communications Engineering at the Dublin Institute of Technology (DIT). I became interested in High Performance Computing while developing a lattice-model protein folding simulation as a student at DIT. This simulation was deployed on the Fionn supercomputer, which is owned by the Irish Centre for High-End Computing (ICHEC).
Last summer, I worked with a security startup and developed software for digital camera source identification and fingerprinting based on Photo Response Non-Uniformity (PRNU). I have experience with Python, Golang, C, C++, CUDA, Matlab, BASH and Julia. I have also worked with ARMv7, x86 / x86_64, MC6800, and MC6809 assembly. I have worked with VHDL and Verilog, and believe that FPGAs will be increasingly utilized in High Performance Computing as it becomes more difficult to develop faster processors due to the end of Moore’s law.
When I am not debugging code, I enjoy rock climbing, hiking and fencing, with sabre being my preferred weapon. Aside from sports, I also play the guitar and enjoy live music.
I will be working on the project “Visualization of real motion of human body based on motion capture technology”. This project will involve generating a 3D model of a human skeleton from images captured by computed tomography scanning, and creating a visualization of motion using motion capture technology and inverse kinematics. The visualization will be developed using Blender and Unity3D. I chose this project because I am interested in computer vision, image processing and visualization. In particular, I am interested in learning about biomedical image processing. I am enjoying my stay in the Czech Republic so far, and look forward to learning more about the country during my time here.
Hello everyone! My name is Arnau Miró. I am a 27 year old PhD student from Universitat Politècnica de Catalunya. I am currently located in a very nice city in Spain called Terrassa, which is about 30 km away from Barcelona. Do you want to know more about me? Keep reading and watch my short introduction video!
I am currently working in the interesting world of Aerodynamics and Computational Fluid Dynamics (CFD) as part of my PhD program. The thing about CFD is that the more complex the problem is, the more computational power is needed to solve it. This is how I got introduced to the world of High Performance Computing. Currently, I am also a collaborator in the Computer Applications in Science and Engineering (CASE) department at the Barcelona Supercomputing Center (BSC).
This summer, I will be staying in this beautiful city, Bologna, working on the project “Web visualization of the Mediterranean Sea” supervised by Dr. Paolo Lazzari in the OGS – National Institute of Oceanography and Experimental Geophysics. I believe it will be a great opportunity for me to further dive into the world of HPC as well as enhance my knowledge about visualization, which is crucial for today’s researchers. I also expect to spend a great amount of time this summer visiting Bologna and the beautiful cities nearby as well as tasting the excellent Italian cuisine.
What else can I say about me? When I’m not programming or running CFD simulations on the supercomputer, you can find me watching series, cooking or grabbing a beer with my friends. If you still cannot find me, it is because I am probably lost in the mountains, enjoying one of my favorite hobbies: landscape astrophotography, that is, taking nice pictures of the stars, like the one below.
The Milky Way and the Dark Horse Nebula behind the oak tree.
I’m really excited about the opportunity that the PRACE Summer of HPC program offers. It’s always nice to meet different people from different backgrounds and cultures.
Ahoj! My name is Petr and I come from the magical land of beer – the Czech Republic. I live and study in Brno, the country’s second largest city, sometimes called the Silicon Valley of central Europe.
Last spring I finished my bachelor’s degree at Brno University of Technology (BUT), Faculty of Information Technology and I continue to study there for a Master’s degree in the Information Technology Security program.
Since my earliest days in IT, I have been interested in data visualisation. In recent years this has been one of the main issues in big data (which is my point of interest as well). That is why my bachelor thesis topic was in a similar field: “Visualization of Network Security Events”, which focused on visualising large datasets of network events.
With big data comes the need for high-performance computing, and that is how I got interested in the PRACE Summer of HPC program, where a couple of projects were right in my field of interest. I was lucky enough to be chosen for such a project, and during the summer I will develop a web visualization of an HPC system’s energy load with the guys at the CINECA facility in Bologna, Italy. With this project we are going the other way around, because the HPC system itself generates really big data (e.g. monitoring and statistics information), and the tricky part is picking the correct data to collect, analyze, store and eventually visualize.
Every now and then I try to pick up a new skill (in IT, of course), and most recently I dived into IoT (the Internet of Things). How? Simple: I bought a Raspberry Pi Zero W and a couple of sensors to tinker with. Now I know the temperature of my flat when I am away. It might sound silly, but the next project I will undertake is to build a quadrocopter from scratch, controlled by a Steam Controller. Cool, huh?
That’s what you call “stack overflow”
When I power off my PC (which I usually don’t anyway), I enjoy a good book or movie. One of my favorite writers is Douglas Adams, with his Hitchhiker’s Guide to the Galaxy book series. Another passion in my life is food: I love to cook and I love to eat. That is another reason I am really looking forward to being in Bologna. Bye bye, diet!
To finish off with my favorite quote (with a small addition):
Happily hiking in France, minutes before getting lost in the mountains.
My name is Ana María Montero and I am a 23-year-old Physics student. I am from Spain, from a nice town close to the border with Portugal called Badajoz, where I also study. I am about to finish my Bachelor’s studies in Physics at the University of Extremadura, where I am preparing my thesis on the behavior of classical one-dimensional liquids (which means I put a lot of marbles in a line and check their properties).
Next year I will be travelling to Germany to pursue my Master’s degree in Simulation Sciences, offered by RWTH Aachen and Forschungszentrum Jülich, as I would like to specialize in computer simulations. This is the main reason why I applied several months ago to be here today, participating in the Summer of HPC program and working in some of the best computing centres in Europe.
Apart from this, and though it seems like debugging stuff is my favorite activity (if it is not my favorite one, it definitely is the one I dedicate most time to) I also enjoy doing a lot of other activities: Regarding the musical world, I play the piano and the violin whenever I can and I also sing in my university choir. Regarding outdoors activities, I love hiking (even when I get lost in the middle of a mountain in France) and traveling (even when that means that I have to face the heartless Polish winter).
This is what a Spanish girl looks like in Poland in January.
This year’s vacation, however, consists of spending a couple of months in Dublin, the city of Guinness, live music and Temple Bar. But Dublin is also the city of ICHEC, where I will be working for two months studying and predicting “El Niño” and its worldwide consequences (who knows, maybe this will help Irish people predict their unpredictable weather). So, in general terms, I have already set up the algorithm for my summer plans and, speaking in Fortran, it looks like this:
-----------------------------------------------------
! Ana’s summer program 2017 for FORTRAN speakers:
Do day=1st of July, 8th of July, 1
Amazing training week in Ostrava
End do
Do day=9th of July, 31st of August, 1
If (day==weekday) then
Work at the ICHEC in Dublin
Else
Visit and enjoy Ireland!
End if
End do
-----------------------------------------------------
Enjoying the training week in Ostrava
Nevertheless, as amazing as the plans look right now, by the end of the summer I will be able to show the results of those plans to the world: my visualization work at ICHEC will help me bring “El Niño” and its consequences to life, and my camera will help me capture the rest of Ireland’s wonders.
After an incredible week at IT4Innovations in Ostrava, I cannot wait to start my project in Dublin and to put into practice everything I have learnt.
I spent 8 weeks in Bologna, Italy, during my SoHPC summer. It was a great experience for multiple reasons:
First and foremost, I discovered a new country: Italy! It’s always an awesome experience to uncover new traditions and a new culture.
Then, I learned a completely new computer science field: High Performance Computing and the architecture of supercomputers. That was particularly fun, as I was helped by some super talented people who could guide me.
A visit to the Galileo100 supercomputer
However, it was also a stressful experience: looking for an apartment in a foreign country online, taking a leap into the unknown… I hope I won’t get grey hair!
I would recommend SoHPC to everyone with a minimum knowledge of computer science; it’s a rewarding experience!
During my Summer of HPC, I combined a low-level acquisition program with a database to create a beautiful dashboard for monitoring jobs at runtime. In other words: visualizing the execution of a parallelized program in real time (the project “A better view for saving energy in HPC using a low level acquisition program”).
Introduction
Programs (called jobs) run on supercomputers consume a lot of energy, which is a limiting factor for the scalability of these machines. These programs parallelize their computations across multiple nodes and multiple CPU cores, and the processes need to synchronize during the computation to coordinate. However, such parallelized applications often waste CPU power during this synchronization.
This is where COUNTDOWN comes in! It is a tool which reduces the CPU frequency (and therefore the energy consumption) during idle/synchronisation time, while trying to have as little impact on performance as possible.
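To see why down-clocking during waits pays off, here is a toy back-of-the-envelope model (my own simplification, not COUNTDOWN’s actual policy), assuming dynamic CPU power scales roughly as frequency cubed:

```python
# Toy model of energy saved by down-clocking during MPI wait time.
# Assumes dynamic power ~ f**3 (voltage scaling with frequency);
# real savings depend on hardware, static power, and DVFS latency.

def job_energy(runtime_s, busy_fraction, f_busy_ghz, f_idle_ghz, k=10.0):
    """Energy (arbitrary units) with idle phases run at f_idle_ghz."""
    busy = runtime_s * busy_fraction
    idle = runtime_s * (1.0 - busy_fraction)
    return k * (busy * f_busy_ghz ** 3 + idle * f_idle_ghz ** 3)

# A job that waits in MPI calls 30% of the time:
baseline = job_energy(3600, 0.7, 2.4, 2.4)   # always full clock
scaled   = job_energy(3600, 0.7, 2.4, 1.2)   # down-clock while waiting
saving = 1.0 - scaled / baseline             # about 26% in this toy model
```

Even with this crude cubic assumption, halving the clock during a 30% wait fraction removes most of the energy spent doing nothing useful.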
To help users understand how their programs run with COUNTDOWN and how much energy they use, my goal was to design a dashboard for monitoring programs that use COUNTDOWN, and to extend COUNTDOWN so that it sends all the necessary information about the job.
Overview of existing solutions
MPI profilers (programs used to collect debug data from MPI programs) already exist, from IBM and Intel for example. However, they come at a cost:
They perturb the performance of the jobs they measure
They report data only at the end of the job, not during runtime
COUNTDOWN is an MPI profiler that tries to have as little impact as possible and can also reduce the energy consumption of a given job. It is developed at CINECA (+link) and can produce timeseries reports, which is especially useful for runtime profiling.
What is ExamonDB
Examon is a distributed and scalable monitoring infrastructure backed by a database. By using COUNTDOWN with Examon, it is possible to send MPI profiling information to the Examon database during runtime. Then it becomes only a matter of design to create a beautiful dashboard using Grafana, an open-source, state-of-the-art, interactive data-visualization web application.
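For flavour, here is a hedged Python sketch of what shipping one metric sample to an MQTT-based collector like Examon might look like. The topic layout and field names below are hypothetical, invented purely for illustration; the actual publish step would go through an MQTT client such as paho-mqtt.

```python
# Sketch of shipping a runtime metric sample to a collector.
# Examon is MQTT-based, but this topic layout and these field names
# are hypothetical, for illustration only.
import json
import time

def build_sample(cluster, node, job_id, metric, value, ts=None):
    # Hypothetical topic hierarchy: cluster / node / plugin / metric
    topic = f"org/{cluster}/node/{node}/plugin/countdown/chnl/data/{metric}"
    payload = json.dumps({
        "job_id": job_id,
        "value": value,
        "timestamp": ts if ts is not None else time.time(),
    })
    return topic, payload

topic, payload = build_sample("galileo100", "r033c02s01", "1234567",
                              "cpu_energy_j", 42.5, ts=1660000000)
# client.publish(topic, payload)   # with a real MQTT connection
```

The dashboard side then only needs to query these timeseries back out of the database and plot them per job ID.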
The Grafana dashboard
The dashboard automatically sets the time range according to the specified job ID!
The design of the dashboard is split into 3 different parts:
The Summary: all the general information about the job, plus some global figures such as the execution time and the total energy used.
The Timeseries: the part of the dashboard with all the runtime information.
The GPU: if enabled, COUNTDOWN can also profile GPU data.
From left to right: My supervisor Mary Kate, me and my girlfriend Eva Maria.
Actually, I thought my previous blog post would be my last, but after a fortunate encounter in September, I have to publish this short one.
Eventually, we met in person.
After two months of intense simulation runs and learning a lot about HPC and molecular dynamics, I rewarded myself with a train journey through the south of Spain: starting in Seville, proceeding to Cordoba and then Granada, and, after a one-day visit to Saragossa, I finally made it to Barcelona.
Barcelona was where my project took place during the summer, although fully remotely, due to the still-present pandemic. So even though I had a very engaging exchange with my supervisors during the summer, I never had the chance to meet them in real life, shake their hands or chat face to face. However, I made up for this in September, almost a month after SoHPC 2022 officially ended, when I got to visit the Barcelona Supercomputing Center (BSC): the MareNostrum 4 supercomputer in a former church, as well as its successor, the in-progress MareNostrum 5, in the basement of BSC’s new office building.
I was warmly welcomed by Mary Kate and Julio, my former mentors, in the lobby of the BSC office building. We had a lively conversation on the balcony, and then Mary Kate and Julio gave me a personal tour of the BSC site and, of course, the MareNostrum supercomputers. And as if that weren’t enough, I got to meet the rest of the Fusion Group members over a delicious lunch, to which Mervi Mantsinen, the head of the Fusion Group, invited us.
Down below you can see the currently operating supercomputer, MareNostrum 4, which I used to run my jobs, built inside a church. A very surreal scene.
This is where my jobs were running this summer: the MareNostrum 4 supercomputer at the Barcelona Supercomputing Center in Spain.
The IT department at the Vienna Technical University, from the first Turkish coffee ceremonies
The Schnitzel Lady writes: I miss the Turkish coffee ceremonies at the Vienna Technical University IT department. And of course,
‘Ich möchte eine Melange, bitte’. 🙂 First of all, I am very happy to have been selected for this project and to have spent the summer in Vienna. It was an experience I will never forget. First I want to thank Siggi, because he always helped me and taught me a lot. Er ist einfach perfekt!
Then, Markus was very helpful. Markus always said to me: ‘Ezgi, get one more Slurm, please…’ because I really had a hard time understanding what they were doing at first. And I learned to say ‘alles kaputt’ when things went wrong. But they always helped me, and we completed the project on time. I just couldn’t visit the Albertina Museum; it was a little sad not being able to go when I was so close to it.
And then we met Dominik (@dominikf), who lives in Vienna, within the scope of SoHPC, in the garden of the Vienna Technical University; we went to the Palm Garden together, sat on the grass and talked about projects and life. After the pandemic, I realized that we all have similar concerns. Dominik is smart and has great clean energy 🙂 It was good to meet Dominik. We’ll meet again, I know. You can find him here;
Here I saw a place I was very curious about: we visited the VSC-3 (Vienna Scientific Cluster) centre with my mentors, Siegfried and Markus. In an area slightly outside the city, there were many VSC systems within the research centres affiliated with the Vienna Technical University. So very, very many GPUs and CPUs… I couldn’t have imagined it was like this, so many processors coming together. Just like a science fiction movie, but actually just HPC.
And in my project (project number 2222), a drug design study was carried out using HPC to block viral ds-DNA associated with many viruses, especially HIV/AIDS and HPV. GPUs are used to speed up the MD simulations, and MM-GBSA/MM-PBSA calculations are performed on the HPC system. With the calculations we made, we obtained different and more exciting results than we expected. You can watch my final video here. It was a great program; recommended to anyone who is interested!
And when I was in Vienna, I felt very motivated and happy. While riding the subway there, I always listened to this song: ‘KADEBOSTANY – Another Sunrise’.
I wish a world full of health and peace for all of us.
I have already told you what I have been doing these two months, but I want to say goodbye to this experience by adding some final thoughts that are not so obvious.
Most of us students have always worked for ourselves. Sometimes we have had group work, or we have had to present our work, but always in a totally controlled environment, and always with rather small and simple things. What happens when you are part of a gigantic project? When you have to touch something that has been touched by dozens of people before you, and that will still be used by many others after you are gone?
If there is one thing we have all done in these two months, myself included, it is programming. And when the code you write belongs to other people and will be used by others, that comes with a certain responsibility. You can’t just go in and do things any way you want, leave a mess because it works anyway, or use whatever programming style you like without thinking about how the rest of the code is written.
These two months I have been working on modifying the iPIC3D code, a code, as I said, for plasma simulations. In my previous post I explained what I programmed, but I didn’t tell you how I did it. And that is just as important.
Imagine if I had arrived, got into one of the many modules of the code, and started to define variables and write routines. I might have ended up with similar results. Why not? It’s not so different from doing it on your own. My post-processing routines need little more than knowing what shape your inputs are in and where to get the information you need. I could have done it all fast and furious and it would have worked.
But then those routines would have been inefficient, which would have been a problem in the future. And even worse: they would have been almost impossible for other people to modify and improve. How many of you, if you have ever programmed, have written code that only you could understand and use? We have all sinned in this way, especially in personal programs that will never be used by other people. But when you work on something beyond yourself, this becomes important.
My post-processing routines are written in the same style as the rest of the code, commented according to the same rules, and make use of everything possible to avoid unnecessary calculations that might have resulted from not studying the code properly before starting work. If now someone, who neither knows me nor has contacted me nor knows what I have done in the code, wants to modify my routines, they will be able to do so without any problems. They have all the information and facilities they need in the code.
And part of working on collaborative code is to blend into it: leave everything as tidy and clean as possible so that, when you leave, nobody needs you for anything, because the code you have written is not yours; it belongs to the programme.
If you ever work on a programming project, never forget this. Your colleagues, the ones that are there now and the ones who will come when you are no longer there, will thank you for it.
Hello everyone, and welcome to my final post! In this post, I will present the two test cases I used to verify the 3-D laminar incompressible flow solver I developed during my project using the FEniCSx computational platform. The test cases concern the 3-D laminar flow around a cylinder confined inside a channel, with a parabolic inlet profile, in both steady and transient states.
Test case geometry.
In order to validate the results, the drag and lift force coefficients were calculated and compared to available experimental data. Different meshes with various numbers of elements were investigated, and different basis functions were used. For the steady-state case, the figures below show how the solver accurately predicts the coefficient values, and how this accuracy increases with the number of elements.
Relative error for the drag (left) and lift (right) coefficients for different numbers of mesh elements and basis function orders.
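For reference, the non-dimensionalisation behind these coefficients can be sketched as follows. The formula matches the usual cylinder-in-channel benchmark definition, C = 2F / (ρ Ū² D H), with mean inlet velocity Ū, cylinder diameter D and channel height H, but treat the helper below as my hedged reconstruction rather than the solver’s literal code.

```python
# Drag/lift coefficients for the 3-D cylinder-in-channel benchmark.
# C = 2*F / (rho * u_mean**2 * D * H), where F is the force integrated
# over the cylinder surface, D its diameter and H the channel height.
# Hedged sketch: the standard benchmark definition, not solver code.

def force_coefficient(force, rho, u_mean, diameter, height):
    return 2.0 * force / (rho * u_mean ** 2 * diameter * height)

# With convenient unit reference values, a force of 0.5 gives C = 1.0:
cd = force_coefficient(0.5, rho=1.0, u_mean=1.0, diameter=1.0, height=1.0)
```

The same helper applies to both drag and lift; only the force component fed in changes.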
In the following figure we can see the velocity contours on two symmetry planes of the computational domain.
Velocity magnitude contours for the steady-state case.
Next, a transient case was simulated; again, the accuracy of the solver was verified using the drag and lift coefficient data. Three different setups were investigated, with varying mesh sizes and basis function orders. In the following videos we can see the time evolution of the velocity magnitude contours for the transient case.
Time evolution of velocity magnitude at the XY middle plane.
Time evolution of velocity magnitude at the XZ middle plane.
You can check my final presentation for more info!
My final video presentation.
Before this post ends, I would like to thank my mentor Dr. Ezhilmathi Krishnasamy for helping me throughout this project. I would also like to thank PRACE for providing me with this fantastic opportunity. I learned many new things about HPC systems and parallel programming, and I highly recommend the programme to everyone applying next year. I leave you with a picture taken with Mathi and my co-workers in the programme, Filippo and Alexander, during our visit to the AION cluster room.
From left to right: me, my mentor Mathi, Filippo, Alexander and the AION cluster.
Every phase has to come to an end, and this moment is no exception. Speaking of my gains during this programme: until now, my exposure in Europe had been strictly academic, but henceforth I believe I could contribute useful ideas if invited to speak on professionalism. Working as an intern at Capgemini Engineering during this programme exposed me to life outside the walls of the lecture rooms. I got to learn not only about the project but also the etiquette of working in a multinational company. It granted me the opportunity to learn, wine and dine with a highly professional group of people. Most especially, working with my mentor Yassine El Assami, I learned a great deal about how to nurture a mentee. Thus, even though the Summer of HPC 2022 has finally come to an end, these experiences are priceless and will forever be a part of me.
Coming to the overview of my final report: I believe that my previous posts have highlighted the aim and objectives of this study, as well as the difficulties we expected on the way to our goal. Now, I am going to walk you through a summary of the project, so stay tuned …
To refresh our minds: neural networks have been proposed in place of conventional approaches for solving problems related to physical models, due to their capability to approximate any continuous function. The project focused on building neural networks to predict the maximum resistance, or reserve factor, of mechanical components, and on improving their precision down to the lowest possible error thresholds.
Here, the accuracy is defined as the ratio of the number of predictions with an error under a given threshold to the total number of samples, while the precision of a model corresponds to the smallest threshold at which the accuracy is perfect.
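These two definitions can be sketched in a few lines of code (a minimal NumPy illustration; the candidate thresholds and error values in the usage example are made up, not the project's):

```python
import numpy as np

def accuracy(errors, threshold):
    """Fraction of samples whose prediction error falls under the threshold."""
    return np.mean(np.asarray(errors) < threshold)

def precision(errors, candidate_thresholds):
    """Smallest candidate threshold at which the accuracy is perfect (1.0),
    or None if no candidate achieves it."""
    perfect = [t for t in sorted(candidate_thresholds)
               if accuracy(errors, t) == 1.0]
    return perfect[0] if perfect else None
```

For example, with errors `[1e-4, 1e-3, 1e-2]` and candidate thresholds `[1e-3, 1e-1, 1.0]`, the precision would be `1e-1`, the smallest threshold under which every sample falls.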
The project considered two mechanical cases, a linear and a nonlinear component, and used synthetic datasets for each case, which means that both their size and their quality are controlled. The datasets contain geometry parameters, material properties and applied forces as features, while the target is the maximum resistance.
The models were built with the Keras module of TensorFlow. A mixed-precision policy was used to train the densely layered models. The KerasTuner module was quite useful for searching for the most suitable hyperparameters, while the callbacks module helped in saving the best model configurations during training.
Discussing the results.
The linear case is a set of parallel beams whose widths are the only variables; the other parameters are constant. Numerical precision has a significant influence on model training in this case. The obtained results show that the single-perceptron and multi-layer perceptron models produced similar performance, yielding perfect accuracy down to an error threshold of 10⁻⁶ in single numerical precision. An improvement was obtained using double precision, while half precision exhibited very poor performance compared to the other two (see Figure 1).
Figure 1: Comparison of different numerical precisions.
The nonlinear case is a beam under tension with more variable parameters, covering both geometrical and material properties. Here, the precision achieved is good but bounded around a 1% error threshold for both the single-layer and multi-layer models. Numerical precision seems less significant compared to training and capacity errors (see Figures 2 and 3).
Figure 2: Comparison of the best-performing models. Figure 3: Comparison of the performance of models of fixed width and different depths.
The conclusion
This project supports the use of neural networks as universal approximators for solving mechanical models. In general, other studies in which the components share similar parameters can leverage the concept too. It tends to be most useful when the components to be studied require various levels of complexity, as no significant modifications are needed. Although the precision of a prediction appears to be bounded, it is still good enough for many applications. Finally, high-performance computing was used to meet the computational demands of this study, and parallelization with MPI was implemented to better manage processing resources and enable the statistical approaches.
Working on this project has enhanced my understanding of practical machine-learning applications and, most importantly, of working on high-performance computing clusters. Hence, I shall be on the lookout for more opportunities within this domain to further my understanding. I think the next opportunity I am looking forward to is the High Performance Parallel IO and post-processing project @ MdIS.
Finally, my profound gratitude goes to the organizers of PRACE SoHPC for such an awesome opportunity to be part of this project. I would like to thank Mr. Karim Hasnaoui from CNRS, IDRIS laboratory. Also, I extend my gratitude to Capgemini Engineering Company. Lastly, my special appreciation goes to my mentors for their relentless guidance towards the success of this project. I thank you all.
It has been an amazing summer, and participation in the SoHPC programme was an exceptional experience. I am really happy with the results of our work, as we put a lot of effort into it. We tested permaFoam on different test cases and evaluated its performance with different parameters, running everything on the Olympe supercomputer.
The performance analysis is split into three parts: execution-time analysis, linear-solver analysis and source-code profiling. The results obtained are really interesting and can be viewed in the presentation below:
Presentation of project 2207.
Finally, I cannot express in words how grateful I am for this experience. I gained great insight into how HPC infrastructures work and more in-depth knowledge of scientific applications. I would like to thank my mentor for being supportive and encouraging and for spending so much time on this project. Apart from that, thank you to PRACE and the organising team for making this SoHPC amazing!
Here we come to the end of the Summer of HPC. I would like to dedicate this post to describing my experience and delivering our final product.
First, business
I talked previously about our project's aim and what we would be doing. Now, let me present our results. I am proud to say that we have successfully built a tool that decomposes a mesh and its fields for parallel execution faster than the existing tool, decomposePar: it performs twice as fast and gives the correct output. There is still a memory problem in the tool, but due to time limitations we were unable to solve it. Hopefully, that issue will be resolved in a later version and the tool will be ready!
The name of the program is parallelDecompose. It is open source, and anyone can read the source code and make adjustments to it. We have written a report and recorded a video to present our findings. Although the report is not published yet, you can watch the presentation below:
Summer of HPC 2022, Presentation of Project 2207
A perfect summer
I cannot describe in words how grateful I am to have been accepted into this programme. Even though it was short and remote, I learned so many things that I could not have learned by myself. The project was challenging, but we overcame it.
I cannot finish this blog without acknowledging my mentor, Thibault Xavier. He started preparing us even before the official start of the Summer of HPC. He allocated time from his personal life and helped us whenever we had questions. He was encouraging and supportive, and I am truly grateful to have had Mr. Xavier as my mentor.
Thank you so much to the PRACE Summer of HPC Team for organizing this programme and providing us students with this experience. I hope to meet you again sometime, adios!
Well, I’ve already left Nice and I have finished my experience with PRACE, so let’s review a bit of what has come out of this work I’ve dedicated my time to.
As I said in my previous post, I have focused on implementing two new routines within the same code, so that the computations are performed efficiently and in parallel, without the need to store much heavier data for later post-processing: one routine calculates the temperature tensor in the whole space, in a coordinate system relative to the magnetic field, and the other calculates the energy spectrum of the particles in the simulation.
The Temperature Tensor
One of the relevant quantities when analyzing a plasma system is the temperature at each point in space. Under reasonable assumptions, this temperature is related in a straightforward manner to the pressure, which, in general, does not have to be isotropic: that is, it is not, as we are used to, a unique value associated with each spatial point, but a mathematical object representing the forces exerted on a differential element of area of the fluid. This is the pressure tensor.
This tensor is what the code produced before this work, creating six different files (the tensor is symmetric) with the value of each component in all the cells of the plasma simulation box. Now, there is a catch here. You want to calculate the temperature tensor from this, but not in the code's Cartesian coordinates: you want it in coordinates relative to the magnetic field at each point in space. This means that you have to take the six components of the tensor and perform a change of basis, so that the tensor is defined in a coordinate system where one principal axis is parallel to the magnetic field and the other two are perpendicular to it. Note that these coordinates depend on the point in space where they are calculated, because the magnetic field is, in general, not uniform.
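To illustrate the change of basis (this is a hedged NumPy sketch, not the actual routine in the simulation code): given the six independent components of the symmetric tensor and the local magnetic field, one builds a field-aligned orthonormal basis and rotates the tensor into it.

```python
import numpy as np

def field_aligned_tensor(p6, b):
    """Rotate a symmetric pressure tensor, given as its 6 independent
    components (pxx, pxy, pxz, pyy, pyz, pzz), into a basis whose first
    axis is parallel to the local magnetic field b."""
    pxx, pxy, pxz, pyy, pyz, pzz = p6
    p = np.array([[pxx, pxy, pxz],
                  [pxy, pyy, pyz],
                  [pxz, pyz, pzz]])
    b = np.asarray(b, dtype=float)
    e_par = b / np.linalg.norm(b)
    # Build one perpendicular axis from any helper vector not parallel to b.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, e_par)) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    e_perp1 = np.cross(e_par, helper)
    e_perp1 /= np.linalg.norm(e_perp1)
    e_perp2 = np.cross(e_par, e_perp1)
    r = np.vstack([e_par, e_perp1, e_perp2])  # rows = new basis vectors
    return r @ p @ r.T                        # change of basis: R P R^T
```

For instance, with the field along z and a diagonal tensor diag(1, 2, 3), the parallel-parallel component of the result is the original zz component, 3.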
The code works iteratively from given initial conditions, and you can tell it which information to write to file at which cycles. So, I have implemented a new routine that calculates the temperature tensor in the cycles you ask for. In addition, this routine doesn't create six different files, one per component, but a single file with all the information of the tensor for each point, in a format suitable for visualization. Now it is super easy to extract temperature information from the simulations.
The Energy Spectra
Observationally, if we want to study plasma particles in space, what we do is count the number of particles that are in different energy ranges, moving in a particular direction. This means that we are not interested in knowing the positions and velocities of all the particles, but only in knowing the energy spectrum at each point: that is, the probability that a particle has one energy or another.
This information is much less burdensome than the complete information of all particles in a simulation. In the cycles of interest, we could simply calculate how many particles are in each cell of the simulation box and place them in different energy bins. In that way, what used to be hundreds of particles per cell, each with its velocity vector, is now simply a series of numbers indicating the number of particles that fall between different energy ranges chosen by the user. A substantial improvement worth implementing directly in the code for parallel calculation.
This is more complicated than the previous example. Let’s think about how the code is designed: the different processes running in parallel have an associated region of space, in which they perform all their computations and which contains certain specific cells. Those cells, in turn, have a certain number of particles inside them.
Each process thus has its own set of particles, and I can make each of them perform a cycle by going through all its particles and putting them in “their place”. Each particle is in a certain cell (its spatial location) and has a certain energy (its corresponding energy bin to construct the energy spectra).
You do this with all particles and you end up with a complete 3-D matrix, with each space element holding an array of N integer values, where N is the number of energy bins, and the values correspond to the number of particles in those bins. With this, you can construct plots such as the one you see above.
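A serial sketch of this binning step may make it concrete (the real routine runs in parallel inside the simulation code; the flat cell indexing and bin edges here are purely illustrative):

```python
import numpy as np

def energy_spectra(cell_ids, energies, n_cells, bin_edges):
    """Count particles per cell and per energy bin.
    Returns an (n_cells, n_bins) integer array: row i is the energy
    spectrum of cell i."""
    n_bins = len(bin_edges) - 1
    spectra = np.zeros((n_cells, n_bins), dtype=np.int64)
    # Assign each particle to its energy bin ...
    bins = np.digitize(energies, bin_edges) - 1
    # ... discard particles outside the chosen energy range ...
    valid = (bins >= 0) & (bins < n_bins)
    # ... and accumulate counts per (cell, bin) pair.
    np.add.at(spectra, (np.asarray(cell_ids)[valid], bins[valid]), 1)
    return spectra
```

For example, three particles in two cells with bin edges `[0, 1, 2]` reduce to a 2x2 table of counts, replacing the full velocity vectors of every particle.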
These two new routines will allow anyone using the code for the study of plasma in space to access relevant information that can be compared almost directly with observational data, without having to deal with inefficient post-processing and storage problems. The code itself directly provides the relevant information and calculates it by making full use of its parallel design.
I hope that these two new routines will be useful for the study of space plasma in the future!
The summer is over, but my project is not! For the past 4 weeks, I’ve been working mostly on two things:
translation to CUDA code of the most common matrix allocation function
translation to CUDA code of the SpMV algorithm (sparse matrix-dense vector multiplication)
The first is a function which translates a given COO-formatted matrix into an RSB-formatted one; I made heavy use of CUDA Unified Memory to ease my job, and it seemed to work. Working on the SpMV algorithm, however, turned out to be a lot harder than I thought, since the implementation was contained (and replicated) inside a C file of 50'000+ lines. I translated a simpler version into a CUDA kernel, and even that took quite a bit of time, as I discovered a memory issue in it and there was not enough time to fix it. For this reason, I am not able to show you results, but I can assure you that it will be fixed in the future.
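To give an idea of the algorithm that was being ported (a plain-Python sketch of SpMV over COO triplets, not the actual library or CUDA code): each non-zero contributes `vals[k] * x[cols[k]]` to output row `rows[k]`.

```python
def spmv_coo(rows, cols, vals, x, n_rows):
    """y = A @ x for a sparse matrix stored as COO triplets."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]  # in a CUDA kernel this += would need an atomic add,
                          # since many threads can target the same output row
    return y
```

The comment hints at exactly the kind of concurrency issue that makes the GPU version harder than this serial loop.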
This being said, I would still like to thank all the people who have helped me throughout this experience. First of all, Dr. Ezhilmathi Krishnasamy of the University of Luxembourg, my mentor: really helpful, really patient and always with a clever suggestion to improve my work. Kostantinos and Alexander, my Summer-of-HPC-mates, who always had a nice joke to cheer me up while working. Also, all the guys at the PCOG department of the University of Luxembourg. A really big hug and a big thank you for the wonderful experience. I hope to see you again in the future.
I’m assuming that you, the reader, already know what a matrix is, since it’s a really common concept in mathematics. What you may not know is the definition of a sparse matrix. Surprisingly enough, there’s none! No formal one, at least, only rules of thumb. This is because the sparsity of a matrix is not a “strong property” like diagonality. So, the definition that I like the most is this one: “A matrix is defined as sparse when its number of non-zero elements is comparable to its number of rows or columns”. For example, a matrix with 1000 rows and 1000 columns could contain up to 1’000’000 non-zero elements but, in order to be called sparse, it should only contain around 10’000 of them, about 1%.
When performing matrix calculations with computers, it’s really common to find much bigger matrices, so big in fact that they can only be stored in their entirety on disk and not in RAM. For this reason, the sparsity of these matrices can be exploited in order to save space. Several special formats have been developed over the past decades, and RSB, the one my project is about, is the best one in terms of cache efficiency. But let’s go in order.
The first format, and also the simplest one to be invented, is the COO format. It’s a bad acronym for COOrdinate list. As the name suggests, a sparse matrix is represented as a list, or array, of triplets, one for each non-zero element, each containing the row index, the column index and the value of the element. This is already pretty effective, since the amount of memory required by this format does not depend on the actual size of the matrix but only on the number of non-zero elements inside it, which is the actual information we want to keep.
After COO, somebody invented the CSR and CSC formats, which stand for Compressed Sparse Rows/Columns. These two are an improved version of the first, since they try, as the name suggests, to compress the row/column indices. In fact, they replace runs of adjacent row (or column) indices with offsets, removing even more redundancy from the matrix representation.
The RSB format, which stands for Recursive Sparse Blocks, is a bit more complicated, to say the least. It is a hybrid sparse-matrix storage format which combines three data structures at three different levels to obtain both very good memory usage and really high cache efficiency. At the highest level, the root level, submatrices are stored in Z-Morton order. This is a space-filling curve (you can see an example in this post’s image), similar to the Hilbert curve, that helps preserve the spatial locality of two-dimensional data points, meaning that each block is stored close to its neighbours in memory. At the intermediate level, submatrices/blocks are organized in a quad-tree data structure. This part is crucial because it allows matrix operations to be parallelized better, since each block is subdivided based on an estimate of its size in memory. This way, the RSB format ensures that each block can be stored entirely in the CPU cache, thus improving the overall performance. Finally, at the lowest level, the leaf level, each submatrix/block is stored using the common COO, CSR or CSC formats, based on how many non-zero elements are present and how they are distributed across the matrix.
As you can probably tell by the length of this post, the RSB format is pretty complicated and the algorithms that use it are, too. Surely, it will be a challenge to port this code into CUDA and we’ll see how it turns out in the next post.
After all these technical posts I would like to wrap up with a personal one, highlighting my experience and shining a light on the people I met along the way.
The Summer of HPC 2022 was a really fun and, above all, valuable experience to me. I was able to get more confident in using and working with supercomputers as well as getting familiar with the HPC workflow in general. It was a great opportunity to refresh my interest and knowledge in nuclear fusion and to put my theoretical knowledge of molecular dynamics into practice by actively setting up and running simulations. Furthermore, I also really enjoyed the outreach aspect of the SoHPC programme. I tried to maintain an active blog with a coherent sequence of posts that build on each other and are interrelated. Sharing my posts on social media was always rewarded with positive responses, which further encouraged me. Collecting our results and creating the presentation slides and final video together with my colleague Arda Erbasan was also a great experience. The video can be watched below:
Speaking of my colleague Arda Erbasan, he was a really nice partner while working on the project and I enjoyed working with him. Arda brought a ton of experience in material science, especially DFT, which was of great value while working on the project. We had sophisticated discussions and it was easy to find solutions to problems arising during the work on this project. Go check out his blog!
Besides the scientific activities, I especially enjoyed the exchange with my mentors Mary Kate Chessey and Julio Gutiérrez Moreno. They found a good balance between setting us targets and goals on the one hand and letting us experiment and explore things on our own on the other. In addition, they were always available and helpful, and the working atmosphere was very pleasant from the beginning. I also felt very welcome in the BSC Fusion Group led by Mervi Johanna Mantsinen, and I enjoyed the regular meetings where I had the chance to get a sneak peek into other people's work in the field of computational methods in nuclear fusion.
The Materials for Fusion Power group. It was a pleasure working with you! From left to right: Jose Julio Gutiérrez Moreno, Mary Kate Chessey, me and Arda Erbasan.
Since my project was one of the online projects, I was based at home in Vienna most of the time. Nevertheless, I had the opportunity to meet another fellow SoHPC student and very talented up-and-coming researcher, Ezgi Sambur. She was assigned to project 2222 on HPC-Derived Affinity Enhancement of Antiviral Drugs in Vienna. You will find her blog here!
Ezgi is working in the field of computational chemistry, and her project dealt with computational drug design, where she achieved excellent results. I met Ezgi in a park near the University of Technology in Vienna, where we had a lively discussion about our experiences at the Summer of HPC programme, but also chatted about our academic careers, our goals and ambitions, and other casual topics.
With all that being said, you can see that my participation in the Summer of HPC 2022 was a very enriching experience, both from an intellectual and a social point of view. To anyone out there interested in computational science and technology and/or high-performance computing, and in meeting new people from different backgrounds, I can only recommend applying!
Thank you very much to the PRACE Summer of HPC Team for the organisation and providing us students with this opportunity. Hope to see you again sometime, goodbye!