So, before beginning, I want to remind you that my project involves the convergence of HPC and Big Data, and more details can be found here. The project’s main aim is to explore the benefits of each system (i.e., Big Data and HPC) in such a manner that both benefit from each other. Several frameworks have been developed to meet this goal, and the two most popular are Hadoop and Spark; for my use case, I have worked with Spark.
Case study (IMERG precipitation datasets)
At the same time, I have a case study from my field of interest, i.e., precipitation. The use case deals with analysing the Integrated Multi-Satellite Retrievals for GPM (IMERG), a satellite precipitation product from the Global Precipitation Measurement (GPM) mission by the National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA). The dataset comes in multiple temporal resolutions such as half-hourly, daily, and monthly. Here, I have used the daily dataset from 2000–2020, which is approximately 230 GB (each daily file is around 30 MB). The main reason I am using this dataset is that I have faced several problems working with such large amounts of data, especially in R.
Although HPC systems are often equipped with enough memory to load and analyse these datasets, using R on HPC for big data analysis is not often the first choice. But, as my scripts are in R and I am very comfortable with R, I wanted to stick with R to analyse the datasets. This is where Spark becomes useful. Although Spark initially supported Scala, Java, and Python, it later introduced an R interface. So, to work with Spark from R, we have two main libraries: SparkR (developed by the Spark core team) and sparklyr (developed by the R user community). Both libraries share many similar functions and are expected to converge into one soon. So, we are using both libraries.
The bottleneck and result
The main bottleneck in dealing with NetCDF datasets is that the data is stored in multiple dimensions. Therefore, before starting any analysis, the datasets should be converted into a more user-friendly format such as an R data frame or CSV (I prefer the .RDS format). The R script to convert NetCDF files into an R data frame was slow, taking approximately 235 seconds for 30 files of 30 MB each on the Little Big Data Cluster (LBD), which has 18 nodes with 48 cores each. However, as the R script was designed for a single machine, it does not really use the benefits of a multi-core cluster.
Therefore, the same R script was applied through the spark.lapply() function, which parallelises and distributes the computations among the nodes and their cores. Initially, we started with a small sample (30 files); the benchmark results (run 5 times) are shown in Table 1. SparkR was significantly faster than R: on average, R takes 235 seconds to complete the process (i.e., write an .RDS file), whereas SparkR (writing to Parquet format) took just 5.58 seconds, approximately 42 times faster than R.
Method | Minimum | Mean | Maximum
R | 231.17 | 235.34 | 239.57
SparkR | 5.39 | 5.58 | 5.70
Table 1: Benchmarking R versus SparkR in reading and extracting one month of IMERG daily datasets (units are in seconds).
So, that’s it for blog #2!
I hope you enjoyed reading it, and if you have any questions, please leave a comment and I will be happy to answer.
I think we can all say that we have had an amazing summer, and I can’t thank PRACE and the Vienna Scientific Cluster enough for having hosted us through it. We also hope you enjoyed our video “2133 Pedro Hernandez Gelado and Rajani Kumar Pradhan, Convergence HPC Big Data”, but as recent newbies in the field, we thought there might have been some confusing terms for first-timers in Big Data and HPC. Hence we decided to make a very brief glossary of the terms we used in the video, in case you felt lost or curious to investigate some of them further, with a large references section pointing to standard texts in the field.
From Pedro and Rajani, over and out! Thank you for an amazing summer!
As of yesterday, we presented our final results and submitted the report on the project we have been working on during this summer. It was a very fun day, as we also got to see what other groups have done, and it feels nice to be able to relax a bit before studies start again in the autumn. Let’s talk about how the final weeks went and pick up the loose ends from the last blog post!
This project has been a lesson in how to readjust your project in line with new results, and since the last blog post we moved the goalposts quite a bit. As it turned out, we could not extrapolate much between parameter values for different matrices, as the correlations simply were not present in the small data set we analysed. Because of this, we turned our focus instead to finding optimal parameters for single matrices.
What we then found were three problems with the Bayesian sampling technique we had chosen:
The sampling of the real results was sometimes uneven in workload, which gave rise to outliers with inaccurate reward values and, as a result, a skewed model.
Because the algorithm always sampled the theoretical maximum number of points, it tended to get stuck early in a local maximum, pointing toward a need for more exploration in the sampling strategy.
The function surface of the statistical model was sometimes poorly suited to our needs, as fitting one good reward value was given less priority than fitting many clustered samples.
As the project had already undergone a couple of changes when this was discovered, and since gathering data was such a long process, we could not look further into these issues beyond pointing them out to give further research something to start with. While a more thoroughly positive result would have been a fun way to end the summer, we are still happy with what we managed to do given the time frame. For a more detailed summary of the final algorithm and its issues, please see the video linked above with our final presentation.
If you have followed the SoHPC blogs during this summer, I hope you have had as interesting a journey as we have had and that you have learned a lot!
Hello people! I wanted to talk about my experience with the Summer of HPC programme in my last blog post. This summer, I worked on a topic completely outside my expertise, which was very valuable since I had the chance to explore new territories and learn different methods than I am used to.
To catch up, I did a fair share of studying to understand gear geometries and documentation reading to learn how to use the modelling software with Python scripting. I also worked on parallel processing for the first time, which was great considering how important it is for scientific computing. This was my first experience using HPC systems, but I am sure it won’t be my last.
Finally, I would like to thank PRACE, the coordinating team and the mentors for their efforts in organizing this programme. I hope everyone had a great time and learned a lot of new things! I am sure this programme will continue to improve the skills of aspiring students all over the world in the upcoming years.
After two months, I have come to the end of the Summer of HPC. I would like to share with you the optimization methods I followed in this great two-month adventure. In my previous blog post, I explained how the function compute_density_at_grid_points() works. Before I clarify the optimization methods, I strongly recommend that you take a look at my previous blog post so that you can better understand the effects of the optimization metrics.
OPTIMIZATIONS
I would like to share with you the optimizations I applied.
Effects of optimization method on PDB file 1RTQ
Here, version 5 refers to last year’s implementation. All individual optimization steps showed continuous progress towards improved runtimes.
Version 6
In this version, device function pointers are used to eliminate if statements. The achieved performance improvement was on the order of 10% (see the chart, blue versus green bar).
Version 7
Slater densities are calculated according to different atom types. While calculating densities, different constant values were being created according to orbitals and atomic numbers every time the density functions were called. These are now pre-calculated and assigned fixed values instead of being re-computed in every density function call. An acceleration of approximately 300% was observed (see the purple bar in the chart).
Version 8
Density calculations include as many exponential operations as the number of shells of the atom. In addition, after calculating the distance at each grid point, a power operation is performed. These operations are very costly compared to their fast-math equivalents, so the fast-math library is used. On top of the seventh version, an acceleration of about 33% was achieved (see the dark-red bar in the chart).
Version 9
In the profiling results, the instructions for creating and deleting local variables in each device function accounted for most of the total instruction count. For this reason, following the approach of minimizing function calls, all functions were gathered into a single function, namely compute_density_at_grid_points(). An acceleration of approximately 50% was observed on top of the eighth version (compare the cyan to the dark-red bar in the chart).
Version 10
The if statements were reordered from the most frequent atom type to the least frequent. This gives an additional 20% speedup compared to the ninth version (see the red bar in the chart).
Version 11
Data types were converted from double to float, since 7 significant decimal digits are sufficient for our numerical accuracy. This gives an additional 20% speedup compared to the tenth version (see the dark-green bar in the chart).
The obtained results show accelerations between 8x for larger molecules and 13x for smaller molecular structures.
The End
During these two months, I improved my skills by addressing different topics with different approaches every day. I had a very enjoyable two months, and I am very happy to have participated in the Summer of HPC 2021. Thank you to everyone who gave me this opportunity.
Hi for the last time! Welcome to my third and last post. If you haven’t read my other blog posts, go check them out. In my last blog post, I gave details about the project while we were in the middle of Project 2114. Today, I will talk about the results of the project and my experiences during the Summer of HPC Programme 2021.
As I am writing this, I am thinking about how fast the Summer of HPC Programme has passed. After two months of working on the project, I learned a lot of things, and I also met amazing people. Before saying goodbye and thanking everyone, let’s look at the results of our project.
Results
Optimisation results for one of the routines.
After all of the routines were optimised, and we were satisfied with the efficiency of our code, we began comparing the original version and the optimised version.
As you can see from the screenshot above, the original code is on the left and the optimised version (_v2) is on the right. In _v2, we used multiple optimisation techniques. The spread in the final results comes from the fact that the microbenchmark is run multiple times, which makes the comparison between the original version and version two clearer.
The project aim was to increase the size of the data sets that the genomicper package can handle and to process existing data sets faster. Optimizations that worked faster with larger data sets were preferred over those that showed performance improvements only for smaller data sets. As a result of the improvements we have made, the genomicper package should be able to handle larger data sets faster. More importantly, we have introduced a methodology that should allow the performance of the package to keep improving in the near future. I am proud of what we did!
To wrap up, the genomicper package is an open-source project that researchers can benefit from. Anyone who wants to take a look at the package can check it out here!
Saying Goodbye..
As I am at the beginning of my career, this project gave me a lot of opportunities. It allowed me to learn new optimization techniques and helped me see how to solve real-life problems.
I would like to express my feelings. Thank you, PRACE, for this amazing opportunity. Thank you to my fellow PRACE Summer of HPC teammate Aybüke Özçelik. Thank you to my project mentor Dr. Mario Antonioletti and my co-mentors Dr. Pau Navarro and Dr. Claudia Cabrera for their patience and interest. I learned a lot under their guidance. They helped all along the way, even though it was online. Unfortunately, because of the pandemic, I didn’t get to see beautiful Edinburgh and work with this team in person. I hope that maybe one day we will meet.
Last but not least, thank you for reading this. I would like to say goodbye to this amazing programme, which gave me the possibility to meet such amazing people and to improve myself.
After these two months as part of the PRACE Summer of HPC, contributing to MPAS (the Model for Prediction Across Scales) together with Jonas Eschenfelder, I not only learned how to run the MPAS atmosphere model on the national supercomputer ARCHER2, how to investigate the performance of simulation runs depending on the number of cores and on set-ups such as simulation time and physical parameters, and how to process the output and visualise it with Python. I also experienced how well remote collaboration across Europe can work, with our project mentors Evgenij Belikov and Mario Antonioletti from EPCC in Edinburgh and with my project partner Jonas Eschenfelder from Imperial College London.
At the beginning, the excitement about having the possibility to work on ARCHER2 and to run a new atmosphere model on this supercomputer was over the top. Even when installation and set-up problems came up, I was always optimistic about overcoming them, hoping that after breaking down these barriers the results would be worth the persistence. And it proved true: we were not just able to run performance tests, but also to write a visualisation script in Python. The latter was my favourite part of our project, and therefore I will shortly describe the steps behind this visualisation script.
As output from the MPAS simulation runs we get compressed netCDF files, and I plotted the physical parameters they contain directly onto a sphere (see also my second blog post for that). In that script it was necessary to distinguish between the different locations (cell, vertex, edge). However, it is possible to transform these three different location types to longitude and latitude, which gives a consistent connection between the output parameters and the locations. It also makes it easier to use common Python mapping libraries, like Cartopy, to plot a map in the background for better orientation, not only for global simulations but especially for regional ones (see the figures below).
Temperature plot over 8 hours as a GIF. The stronger temperature drop in the Alpine and Pyrenees regions can be seen as validation that the grid is correctly centred and also indicates that the simulated data are realistic.
Meridional wind plot over 8 hours as a GIF. The dark spot in the middle of the figure is a well-known wind pattern, the so-called Mistral. That this is visible in our output data again reflects the validity of our MPAS run.
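To give an idea of what such a script can look like, here is a minimal sketch (not our exact code) that reads one MPAS history file, converts the cell locations from radians to degrees, and plots a cell-based field on a Cartopy map. The file name and the variable names (latCell, lonCell, t2m) are assumptions and may differ depending on the run configuration.

```python
# Minimal sketch: plot one MPAS cell-based field on a lat/lon map with Cartopy.
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from netCDF4 import Dataset

nc = Dataset("history.2010-01-01_18.00.00.nc")   # hypothetical output file name
lat = np.degrees(nc.variables["latCell"][:])     # MPAS stores cell locations in radians
lon = np.degrees(nc.variables["lonCell"][:])
field = nc.variables["t2m"][0, :]                # first time step, all cells (assumed name)

ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()                                  # background map for orientation
sc = ax.scatter(lon, lat, c=field, s=1, transform=ccrs.PlateCarree())
plt.colorbar(sc, label="2 m temperature (K)")
plt.savefig("t2m_map.png", dpi=200)
```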
During the PRACE Summer of HPC I not only learned a lot of new skills; it was also enriching on a personal level and made me more self-confident. Experiencing success when you work on things that are completely new to you, but that you stick with, will, I am sure, also encourage me in future projects.
We have come to an end now, and we have all gained a lot in terms of both hard and soft skills. We have done so many things, and I will try to cover them all in this post.
First, as I said in my last post, I want to tell you about the Precision Based Differential Checkpointing (PBDCP) technique, which is the method I implemented during the project.
PBDCP is implemented for float and double data sets. Floating-point numbers map decimal numbers to a unique bit representation and have three parts: a sign bit, a mantissa, and an exponent. The idea of Precision Based Differential Checkpointing is to cap the last bits of the mantissa, i.e. the least significant bits, according to the given precision value. As an example, in the figure below, if the precision value given by the user is sixteen, the last seven bits of the mantissa are truncated (i.e. set to 0).
Example of the PBDCP operation for the IEEE 754 floating-point representation.
Therefore, PBDCP allows us to benefit from the dCP share (the fraction of a checkpoint that differs from the previous one and therefore has to be stored) even when the underlying changes are small. The advantage is that less data is transferred and no continuous traffic is created for such minor changes.
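To illustrate the capping idea (this is a toy version for IEEE 754 single precision, not the actual FTI implementation), the mantissa truncation can be sketched in a few lines of Python:

```python
# Toy sketch of mantissa capping: keep the first `precision` mantissa bits of a
# float32 value and zero the remaining (23 - precision) bits.
import struct

def cap_mantissa(value: float, precision: int) -> float:
    bits_to_drop = 23 - precision                    # float32 mantissa has 23 bits
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    mask = ~((1 << bits_to_drop) - 1) & 0xFFFFFFFF   # zero the least significant bits
    (capped,) = struct.unpack("<f", struct.pack("<I", as_int & mask))
    return capped

print(cap_mantissa(3.14159265, 16))   # ~3.1415787: the last 7 mantissa bits are zeroed
```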
Usage
To use this mechanism, there are some small additions to the configuration file. First, the enable_pbdcp value should be set to 1 and a pbdcp_precision value should be entered. Then, in the application, when calling the FTI_Checkpoint function, the user should specify the PBDCP level, which is FTI_L4_PBDCP.
Experiment Results
As expected, the smaller the precision, the more benefit we get in terms of dCP share, since more values remain unchanged over a given interval, but the RMSE (root mean square error) between the uncapped and capped values also grows. This can be seen in the charts at the top left of the page (Precision against DCP share and RMSE), for which dCP block size = 1024, iterations = 200 and checkpoint interval = 5 (which means there are 40 checkpoints) were used.
When comparing pure dCP and precision-based dCP, again as expected, the dCP share is lower with PBDCP; an example result of an execution with block size = 1024, precision = 4, iterations = 200 and checkpoint interval = 5 is shown in the graph below.
So this is the work I did during this internship. My project mate Thanos and I prepared a video presentation for SoHPC, which you can watch here…
Ending
I had a great two months. I was a bit nervous in the beginning, but only until I met the mentors, coordinators, the BSC group, the other students, and so on. They are all so helpful, warm-hearted and sincere that there was no need to hesitate at all. I am grateful to everyone for giving us the chance to be part of this awesome adventure together.
Today I want to tell you what I have been working on over the last few weeks. As I mentioned in my previous blog post (https://summerofhpc.prace-ri.eu/how-to-distinguish-different-surfaces-of-molecules/), one goal of this year’s project is to distinguish different disconnected surfaces automatically. It can happen that the computation yields several very small surface components which are physically irrelevant. For example, in the image above we see one large surface (grey) and several tiny surface components inside the large surface. The goal is to sort out these small surfaces automatically. For that we need a criterion for when a surface should be sorted out and when not. The approach we used in our implementation is to compute the volume of each single surface component. If the volume falls below a certain threshold value, the corresponding surface can be sorted out.
In this blog post, I want to describe how we computed the volume of a molecule. The divergence theorem, also known as the Gauss-Ostrogradsky theorem (https://mathworld.wolfram.com/DivergenceTheorem.html), gives us a formula for doing this.
Let S be the surface of a molecule. Then, the theorem leads to the following formula:
Volume of the molecule $= \frac{1}{3}\oint_S (x,y,z)\cdot\vec{N}\,dS$
Here, for each point (x,y,z) on S, \vec{N} is the unit normal pointing outwards of the surface S at (x,y,z). To use this formula for our computations, we approximate the integral by a finite sum:
Volume of the molecule $\approx \frac{1}{3}\sum_{i} (x_i, y_i, z_i)\cdot\vec{N}_i\,A_i$
Our program computes a triangulation that approximates the surface of a molecule. Each $(x_i, y_i, z_i)$ is a triangle vertex of this triangulation, $\vec{N}_i$ is the average normal of all the triangles containing it, and $A_i$ is one-third of the sum of the areas of the triangles containing it (see the image below).
The average normal (yellow) is the average of the triangle normals (pink) of those triangles that contain the vertex
In our program, for each disconnected surface, we compute its volume. As a threshold value, we chose the volume of a molecule, which is approximately 14 Å³. We tested the volume computation for several molecules. For example, the grey outer surface of the molecule in the title image has a volume of 4642 Å³, while the volumes of the small surfaces inside lie between 0 and 0.7 Å³. So, with our computation, we can sort out these small surfaces automatically.
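For readers who like code, here is a rough Python sketch of how the finite sum above can be evaluated on a triangulated surface. The data layout (a vertex array plus index triples), the unweighted averaging of the triangle normals, and the assumption of consistently outward-oriented triangles are illustrative choices, not our actual implementation.

```python
# Sketch: V ≈ (1/3) * sum_i (x_i, y_i, z_i) · N_i * A_i over the vertices of a
# closed, consistently outward-oriented triangulated surface.
import numpy as np

def surface_volume(vertices, triangles):
    """vertices: (n, 3) float array; triangles: (m, 3) array of vertex indices."""
    normals = np.zeros_like(vertices)          # accumulated triangle unit normals
    areas = np.zeros(len(vertices))            # accumulated one-third triangle areas
    for i0, i1, i2 in triangles:
        v0, v1, v2 = vertices[i0], vertices[i1], vertices[i2]
        cross = np.cross(v1 - v0, v2 - v0)     # |cross| = 2 * triangle area
        area = 0.5 * np.linalg.norm(cross)
        n_hat = cross / (2.0 * area)           # unit normal of this triangle
        for idx in (i0, i1, i2):
            normals[idx] += n_hat
            areas[idx] += area / 3.0
    # average the accumulated normals per vertex (skip unused vertices)
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    normals = np.divide(normals, lengths, out=np.zeros_like(normals), where=lengths > 0)
    return np.sum(np.einsum("ij,ij->i", vertices, normals) * areas) / 3.0
```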
This was a rough overview of the computation of the volumes. If you are interested in more details, feel free to ask your questions in the comments below.
So it actually happened. I’m sitting here on a rainy day, wearing a sweater and thinking that summer is over. The Summer of HPC is over at least; whether the sun ever comes out again is questionable as well (thank you, London weather). But what have we been up to in the last few weeks of the project?
Performance of MPAS atmosphere for a regional cut. Notable is the superlinear speedup and efficiency above 60% even after the run time plateaus.
We ran our final experimental set-up of MPAS. This time, instead of running full global simulations, we only used a regional cut; this is a fairly common method in atmospheric modelling and allows very detailed set-ups to run quickly. We chose a cut over the western Mediterranean, with cell sizes varying between 60 and 3 km, running for between 1 and 8 hours. The results were very promising for MPAS: we saw both good strong scaling (see the figure to the right) and good weak scaling behaviour, which allowed us to simulate 8 hours of weather in under 20 minutes! We go into more detail in our final presentation, which you can watch here.
Animation of changing wind speeds for an 8-hour simulation starting at 6 pm on January 1st, 2010. Note: negative wind speeds mean a southward direction.
This run was also our chance to test the new visualisation script we developed in our project. MPAS used to rely on NCL, a plotting language created by NCAR (who also developed MPAS), for any visualisation. However, they recently announced that development of NCL will stop in favour of moving all plotting towards Python. Since a model is of limited use without the possibility of visualising the outputs in a way that humans understand, we decided to write visualisation scripts that can be used for MPAS. This worked very well and allowed us to create beautiful animations of the regional experiment. My favourite one is shown here: the wind speeds of an 8-hour run. The main feature is a persistently strong wind in the Gulf of Lion. We believe this is the Mistral, a persistent wind pattern bringing cold, strong winds to the south of France. Moreover, when looking at where the Alps should be, the wind almost seems blocked by them. As an Earth scientist, I really enjoy staring at this, finding patterns and trying to explain what they show. It also shows very well just how powerful these kinds of weather models are.
Overall, I really enjoyed my time working on this project. I had never imagined myself working on a supercomputer, especially not on one of the most powerful machines in Europe. As I want to go into research later in my career, this helped me discover a whole new side of research, and I’m very thankful for getting this opportunity. Having had only very little knowledge about computing, it was at times quite challenging, but I learned so much about programming. Also, whenever I truly didn’t know how to solve an issue, our amazing mentors Evgenij Belikov and Mario Antonioletti helped guide us through it all. Even though we sadly weren’t able to work on this project in Edinburgh and had to work remotely, it was a lot of fun and a great demonstration of how important cooperation across Europe is for research. So this is it from me for the Summer of HPC. Thank you to all of you who have read through my articles and came with me on this journey!
Welcome to my third blog post, where I will be talking about my experience with the Summer of High Performance Computing (SoHPC). Once again, my name is Ioannis Savvidis, and if you would like to get to know me a little better and see what we have done up until now, check my first post, which is an introduction to who Ioannis is, and my second post, where I write a bit more about the project I work on.
So, in my last post I explained how we managed to find a working implementation of a sparse BLAS library and successfully integrated it into our existing code. After some testing on a methotrexate molecule, the results that came back were not looking good: the new code with the sparse BLAS implementation was slower than the old code.
For our code to shine, we knew we had to go for bigger molecules, where the matrices are even more sparse. That is how we started testing six different alkane molecules, running the old and new code and checking whether the run times improved. The alkanes that we used, as I said in my second post, were C6H14, C12H26, C18H38, C24H50, C30H62 and C36H74. A schematic, together with methotrexate, can be seen on the left.
Normalized run times of the new code for different alkanes on different number of cores
At first, a single node of a supercomputer was used exclusively for our testing. Each node of the supercomputer has 4 CPUs with 8 cores each. We tested the timings of the codes for different numbers of cores, specifically 1, 2, 4, 8, 16 and 32. Almost all of our test alkane molecules had their fastest run times at either 4 or 8 cores, with the majority at 8 cores, while the slowest timings came from the 32-core runs.
From all the data we collected, we were able to determine that for C36H74 the new code gave an average runtime improvement of 4.2%.
After that point, and with more nodes at our disposal, we tried a further level of parallelism by testing how the run times scale from 1 to 4 nodes, each set to 4 and 8 cores. Unfortunately, the timings were not consistent and further troubleshooting is needed. Finally, in our last week on the project we increased the number of basis set functions for the C30H62 and C36H74 molecules and tested the new code. In the end, we ran into some numerical errors and did not have time to run more tests.
That concludes our journey with SoHPC. I want to say that I am really happy with the outcome of our project, and I feel really grateful for the opportunity to participate in this year’s programme. I learned a lot of new things under the guidance of our mentor Dr. Jan Simunek and with the help of my colleague Eduard Zurka. Also, a big shout-out to Dr. Leon Kos and PRACE for organizing this programme, and to everyone who helped us from the training week until the end. Finally, I want to say to everyone reading this post who has an interest in HPC, wants to learn more about it, or is interested in computational materials science/chemistry: apply to next year’s SoHPC.
As stated previously, after searching for possibilities to accelerate our Python programs, we have now summarised the results for you: first in serial form, then in parallel. Come and check it out! It could provide the information you need to combine HPC and Python.
Serial programming and the time in seconds per iteration
Scale factor 256! Huge problem size with a Reynolds Number of 2.0 for the viscosity
Look how slow the naive Python version is; we even had to adjust the graph. The NumPy version is nearly 800 times faster than naive Python, and the NumPy version improves greatly after NumExpr optimization, even compared with the C and Fortran versions. This indicates that NumExpr optimizes NumPy array calculations well at this problem size. Compared with the serial CPU versions of the CFD program, the GPU-based Python Numba implementation is 6 times faster than the NumPy version and even slightly faster than the C and Fortran versions. It is important to ensure that the copying of data from CPU to GPU and back is minimised.
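To give a flavour of what the NumExpr optimization looks like in practice, here is a minimal sketch of a Jacobi-style update of the kind used in the CFD exercise. The array names, sizes and boundary condition are illustrative, not our actual code.

```python
# Sketch: the same stencil update written with plain NumPy and with NumExpr.
import numpy as np
import numexpr as ne

def jacobi_numpy(psi):
    return 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1] + psi[1:-1, :-2] + psi[1:-1, 2:])

def jacobi_numexpr(psi):
    a, b, c, d = psi[:-2, 1:-1], psi[2:, 1:-1], psi[1:-1, :-2], psi[1:-1, 2:]
    # one fused, multi-threaded evaluation instead of several NumPy temporaries
    return ne.evaluate("0.25 * (a + b + c + d)")

psi = np.zeros((1026, 1026))
psi[0, 100:200] = 1.0                      # some arbitrary boundary condition
new_inner = jacobi_numexpr(psi)            # same result as jacobi_numpy(psi)
```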
The limits of serial programming are now clearer. Since we work on a computer cluster, we should use its potential and spread our code out over multiple cores and maybe also nodes!
Split the work and run in parallel!
Speed-up on one ARCHER2 node (up to 128 processors)
Overall, the speed-up of all parallel programs increases with the number of ranks. For the Python programs, the performance improvement is small with a small number of ranks (1 to 16), as it is for C and Fortran. The NumExpr-optimized version is close to C and Fortran at 16 ranks. For large numbers of ranks (32 to 128), the performance of all programs improves significantly. Python NumPy MPI consistently lags behind the other three, while the performance of the NumExpr-optimized version stays close to C and Fortran. We see that beyond 16 ranks, the scalability of the NumExpr version is better than that of the non-NumExpr version.
.. going even further with unreleased information (only here):
Running MPI CFD versions on ARCHER2 nodes; SF=512, Re=0
The results are given relative to the time per iteration with Python MPI on one full node (128 processors). Interestingly, the speed-up is super-linear for all the versions. Using 64 nodes on ARCHER2, we accelerate Python MPI by a factor of 356 relative to using one node; Fortran and C are around 750 times faster! We still think the Python MPI results are quite convincing. If you are interested in more results, you are welcome to leave a comment!
After all the reading, check out our short summary video:
Thanks for reading the blog. See you next time at more HPC events?!
Thanks to David Henty and Stephen Farr for the huge support and for making the project so interesting! And of course to Jiahua, my partner during the Summer of HPC!
As summer turns into autumn, we bring to a close the Summer of HPC. I have had a great time learning new things, solving new problems, exploring new areas and working with new people. And here at the end, I suppose it’s time to provide closing thoughts.
Results of the project
As previously discussed, over the course of my project I developed from scratch a program that calculated view factors between surfaces. It took in a surface mesh (triangles describing the shape of the objects) and tested whether pairs of triangles had a line of sight to one another. If they did, the view factor was calculated by computing the double area integral via Gaussian quadrature. A simple version of parallelization was also implemented, where each processor was assigned a set of triangle pairs to test and calculate.
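To make the idea concrete, here is an illustrative Python sketch of the double-area integral between two triangles using the simplest one-point (centroid) Gaussian rule. My actual program is written differently, uses higher-order quadrature and performs the line-of-sight test first; the geometry below is made up.

```python
# Sketch of F_{1->2} = (1/A1) * ∫∫ cosθ1 * cosθ2 / (π r²) dA2 dA1 with one
# quadrature point (the centroid) per triangle.
import numpy as np

def triangle_area_normal(tri):
    cross = np.cross(tri[1] - tri[0], tri[2] - tri[0])
    area = 0.5 * np.linalg.norm(cross)
    return area, cross / (2.0 * area)

def view_factor(tri1, tri2):
    a1, n1 = triangle_area_normal(tri1)
    a2, n2 = triangle_area_normal(tri2)
    c1, c2 = tri1.mean(axis=0), tri2.mean(axis=0)   # centroid quadrature points
    r = c2 - c1
    dist = np.linalg.norm(r)
    cos1 = np.dot(n1, r) / dist                     # angle at surface 1
    cos2 = -np.dot(n2, r) / dist                    # angle at surface 2
    return max(cos1 * cos2, 0.0) / (np.pi * dist**2) * a2   # A1 cancels out

tri1 = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
tri2 = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.]])  # faces back toward tri1
print(view_factor(tri1, tri2))   # about 0.159 for these two facing unit triangles
```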
What I wish I’d had the chance to do
While I’m proud of what I’ve made, there are many things I would have loved to do if there had been more time. One of the first would have been a better implementation of the parallelization. Currently, some processors could be assigned a set of pairs in which none of the triangles see each other, meaning that processor needs to do little-to-no calculation, while another processor will have many. There are better ways to parallelize, ways that would distribute the workload more evenly across processors.
Additionally, while integration via Gaussian quadrature is an acceptably fast and accurate means of calculating the view factor, it is by no means the only or necessarily the best one. With more time, I would have liked to test multiple methods, analyse which works best for different distances, angles and sizes of the triangles, and change the program to choose the best calculation method for each triangle pair.
One other thing I would have liked to do more of is proper testing on HPC systems. While the project was designed with them in mind, I was running it just on my own laptop the whole time. With more time, I would have liked to grow my knowledge and experience of working on HPC systems.
Working Together from Afar
At this point in September 2021, most people will know the troubles of remote working. But I had to learn a lot working with a partner in Germany and mentors in France, not least remembering time zones. Keeping up communication, cooperating on solving problems and knowing what others are working on are new skills I’ve had to learn working on this project. As things slowly get better, the next piece of work I do will probably be in person, but I’m glad to have learnt these skills in an increasingly global, remote world of science and technology.
Closing thoughts
So, as I bring my Summer of HPC to a close and begin my next venture (studying a Masters in Quantum Technology), I am very thankful for the opportunities PRACE’s Summer of HPC has given me and for what I have learnt. I would like to thank PRACE for giving me this opportunity. I also must say thank you to Daniel Pino Muños and Modesar Shakoor for their mentorship over this project, and to Mukund Kashyap for working with me. Thank you all; I hope one day I have the chance to thank you in person.
And with that, I close with thank you, and goodbye.
Hello! This is the final follow up on our project and luckily, we can now show off fancy looking plots from the results we’ve got.
We’ve settled on Gaussian processes (GP) as the method for performing optimisation. Additionally, we had to change the paradigm from considering matrix features (as described by Adrian here) to building our model for every individual matrix, due to the added complexity of the first approach.
In essence, the algorithm is based on sampling the model, that is evaluating the system with a preconditioner computed with MCMCMI from a given set of parameters. As you can imagine, sampling is the most time-consuming and computationally intensive part of the algorithm and we do need to take numerous samples to meaningfully explore the parameter space. The expense is not so critical to us because we have headed towards the “parameter search” direction.
Samples are associated with a set of MCMCMI parameters (stochastic and truncation tolerances) and a reward value, defined as the ratio between the runtime of solving without a preconditioner and with one. Given a number of samples, the code iteratively performs regression on them, computes an optimal set of parameters by moving up the expected gradient towards a maximum of the GP distribution, and takes a new sample guided by the optimal parameters and by strategies that avoid biased behaviour (e.g. getting stuck in a corner).
Figure 1: Visualisation of the sampling process.
Figure 2: Behaviour of the algorithm with an increasing number of samples. The left plot demonstrates convergence to certain reward values; the right plot shows the effect of clustering.
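As a condensed sketch of the sample-fit-propose loop described above, here is what it can look like with scikit-learn’s Gaussian process regressor. The upper-confidence-bound rule over random candidates is an illustrative simplification of our gradient-based proposal, and the reward function shown is a stand-in for timing the MCMCMI-preconditioned solve.

```python
# Sketch of a GP-guided parameter search (illustrative acquisition strategy).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def sample_reward(params):
    # placeholder for: run the solve with/without the preconditioner built
    # from `params` and return the runtime ratio
    return 1.0 - np.sum((params - 0.3) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 2))            # initial (stochastic, truncation) tolerances
y = np.array([sample_reward(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X, y)                              # regression on the samples so far
    candidates = rng.uniform(0, 1, size=(256, 2))
    mean, std = gp.predict(candidates, return_std=True)
    best = candidates[np.argmax(mean + std)]  # explore as well as exploit
    X = np.vstack([X, best])
    y = np.append(y, sample_reward(best))     # take the next guided sample

print("best parameters found:", X[np.argmax(y)], "reward:", y.max())
```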
In the end, we’ve been able to match the data reasonably well; however, we discovered that our optimisation doesn’t lead to an observable improvement but to oscillating behaviour around a reward value of 1 (see Fig. 2).
Three issues have been highlighted:
Outliers
Overexploiting the model built from initial samples
As sampling progresses and points gather in clusters, the model can get too smooth
If you’d like to read about this in more detail, see our report! And while you’re at it don’t miss the other reports from all the amazing projects this year!
To conclude, I’d like to thank the mentors of this project, Anton Lebedev and Vassil Alexandrov. This summer has been very interesting and really valuable for me in terms of gaining research experience and working with HPC systems. I gained more confidence in my coding skills and learned a lot. I would highly recommend this programme to anyone interested in HPC (even if you don’t know much about it and simply want to learn), and I wish the best of luck to those who decide to apply next year! Bye!
Hello again. This is a little retrospective blog post, as my project (2102) winds up and we finalize our report. The Covid-19 pandemic has obviously dictated a lot of how the project played out, but despite this I’ve had a brilliant time. I’ve spent it working with the fusion group at the Barcelona Supercomputing Center (BSC) on a project modelling defect cascades in tungsten during nuclear fusion (see my previous blog here).
The fusion group were really supportive and let us join their weekly meetings, which really helped us feel part of something beyond reports and presentations! It also gave me insights into the different directions the group pursues. The project supervisors, Julio Gutiérrez Moreno and the group organizer Mervi Mantsinen, were both fantastic: endlessly patient with our issues and quick with clever solutions.
From the get-go we were introduced to the HPC system at the BSC, MareNostrum, and the LAMMPS setup Julio had prepared to get us started. I had never used LAMMPS before, so it was fun to get to grips with it as I began to plot out my project. Over the weeks we were exposed to the full gamut of research work, from testing code and debugging our simulations to literature review and diving into current research to situate our work. It gave me a really great insight into the roles individuals play in a large research group.
The other project student Paolo and I have gotten on well which really helped to make the project a happy one. We approached the project from different ends, him working on the calculations of thermal conductivity while I produced defect structures. We were able to support each other quite well when issues arose and the cooperation made the presentations and video we produced very enjoyable.
In terms of results, I was able to establish a successful procedure for the cascade. A picture of it in process is shown below (Figure 1). Initially, many atoms are displaced in a wave rippling across the structure, but most of these settle back into usually occupied positions after having time to relax, leaving a smaller number of permanent defects. In line with the literature, the number of defects formed was proportional to the energy of the cascade up to 200 keV. Various potentials for tungsten were tested and showed slight differences in the number of defects formed, but as the differences are small, more repeats are being performed to give statistically validated results. This allows future work to tie the projects together, calculate the thermal conductivity for these cascades, and move on to various alloys of tungsten which are also of interest, e.g. W-Ta.
Figure 1 – Rounded picture of a 60,000-atom cascade simulation in the middle of a cascade (left) and after being allowed to settle (right). The larger atom is the initial cascade atom, which is given a high velocity.
A side of things I really enjoyed was getting to grips with the literature on fusion, such as learning about the state of the art in fusion technology and the position of various large experimental reactors in their long-running timetables. The overlap of these massive engineering projects with our atomic-level theoretical chemistry is a fascinating area to study.
Wrapping things up has left me a bit surprised; the months went by so quickly! Overall, it was a great way to spend the summer!
by Mario Gaimann & Raska Soemantoro (joint blog post)
Hi everybody! It’s Raska and Mario here, back for a joint blog post. As we’re nearing the end of Summer of HPC 2021, we’ll be talking about how the project has gone since the beginning.
If you’ve been following our previous updates, you’ll know that Raska’s given a quick overview of our project and how our novel seabed classification system works (if you haven’t, be sure to check it out here). Since then, we’ve come up with newer features and developments to our software.
Previously, we explained that we use a convolutional neural network (CNN) to perform classification tasks on a set of labelled training data. To do this, we use Marconi100’s GPUs (Graphics Processing Units), as explained by Mario in his last post (check it out here). As their name suggests, GPUs were designed to process graphics such as videos and images, two types of data that can be quite heavy to operate on. For this reason, GPUs perform extremely well on large volumes of data, such as our training data.
Typical output for training a model using the Tensorflow framework.
Our system would not have been possible without the power of such a computing device. Our training data initially consisted of a database of arbitrary image ‘cuts’ from the provided map, each containing a seabed relief. In our latest development, we decided to go even further with these cuts: we employed a method called Selective Search, which scans for proposed areas where seabed reliefs may exist. This gives us much more data with much higher accuracy. Because these areas equate to regions of interest, this technique is also known as Region-Based Convolutional Neural Networks (R-CNN).
Furthermore, we also decided to train two models: one that learns whether a relief exists, and one that learns the type of relief (should it exist). This helps because relief recognition on a seabed map actually consists of two tasks: relief detection and relief classification. The classification task only applies to training data that has been detected as a relief in the previous task. In programming terms, we effectively nest the second model within the first, as sketched below. For both of these tasks, we managed to achieve accuracies of over 90%, which enabled us to make geohazard predictions with high confidence.
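As a rough illustration of this nesting (not our actual architecture; the input size, layer sizes and the number of relief classes here are made up), the two models could be wired together like this:

```python
# Sketch: a binary "is there a relief?" detector and a relief-type classifier,
# where the classifier is only applied to cuts the detector fires on.
import numpy as np
import tensorflow as tf

def small_cnn(n_outputs, activation):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_outputs, activation=activation),
    ])

detector = small_cnn(1, "sigmoid")      # relief / no relief
classifier = small_cnn(5, "softmax")    # 5 relief types is an assumed number

def predict_relief(cuts):
    """cuts: batch of image cuts proposed by Selective Search, shape (k, 64, 64, 1)."""
    has_relief = detector.predict(cuts)[:, 0] > 0.5
    labels = np.full(len(cuts), -1)     # -1 means "no relief detected"
    if has_relief.any():
        labels[has_relief] = classifier.predict(cuts[has_relief]).argmax(axis=1)
    return labels
```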
Lessons (machine) learned from our AI & HPC adventure
Coming up with our seabed relief recognition architecture has been a tough but exciting challenge. We’re very glad that our physicist-engineer partnership worked very well; we could always depend on each other’s support. For most of the project, everyone involved was spread all across Europe: two of our mentors in Italy, one mentor in Southampton (Veerle subsequently went on a marine science expedition to Cabo Verde; check it out here), and us mentees split between Manchester and Munich. Communication via Slack and Zoom is just not the same as talking in person; sometimes there were misunderstandings about the scope of what needed to be implemented by whom. Even so, we were able to tackle this with weekly catch-ups with our mentors and daily updates between the two of us.
Zoom call with our mentors Veerle Huvenne (NOC) and Silvia Ceramicola (OGS); Massimiliano Guarrasi (CINECA) is missing here.
Defining the scope of our project (devising software that is able to automatically recognize subsea structures, based on the MaGIC dataset for the Calabrian coast) helped us to stay focused on our goal. With our work plan, we knew how much time we had for each phase of the project and what realistic progress would look like. This was particularly useful for assessing how much time we could spend on exploring architectures and tuning hyperparameters, for example. When there were any issues, we met up and managed to resolve them quickly. In the end, our project management enabled us to deliver on the target set out in the project description: to build an automated tool that recognises many different types of seabed structures from oceanographic data (click here for the full project description).
The tight time frame of the Summer of HPC (only eight weeks!) influenced how we approached our project. Once we had researched the relevant technologies to use, we were quick to get our hands on the implementation, knowing that composing the complete machine learning pipeline from scratch would consume quite some time. Our working mode was quite dynamic: we discussed our strategies and subsequently implemented our code independently, which was followed by a phase of integrating all parts into one unified software.
With more time and people involved, doing a more fundamental literature review, as well as defining the software design before implementing it, would certainly be useful.
Sightseeing in Manchester: Manchester Gay Village and the Old Quadrangle at Manchester University.
In Manchester, United — Our Real-life Meetup!
In the end, our Summer of HPC was a summer full of coding, learning and, of course, fun. We even got to meet in real life in Manchester at the end of August! (Seeing what each other looked like in full, after all these Zoom sessions where only our heads were displayed, was in fact really interesting.) During this time, we worked hard on our final video in and around Manchester University Library. For this, we even play-acted as marine geographers quarrelling over where the geohazards in a subsea map are located. We also explain step by step why you should care about locating geohazards, so be sure to check it out here!
In Manchester Museum we marveled at giant fossils and tiny frogs.
Besides work, we spent some time exploring Manchester. We visited the University of Manchester’s historic Old Quadrangle with its beautiful, ivy-covered buildings, and checked out some other faculties and departments. During one of our lunch breaks, we visited Manchester Museum, a university museum dedicated to natural history. Apart from dinosaur skeletons and minerals we also visited the Vivarium, a section focused on conserving reptiles and amphibians, where we enjoyed watching little, colorful poison-dart frogs, iguanas, and other species. Not to forget, we lived with our very own reptile friend, the lizard Dana. Carrying out a lizard-sitting mission for one of Raska’s friends, we can say that Dana became the mascot of our project, and we had lots of fun playing with her (check our video to see more of Dana).
Our project mascot, the lizard Dana.
More to come!
With a finalized tool for the automated recognition of seabed structures, lots of HPC impressions and hands-on experience, our Summer of HPC came to a happy end. But is this really the end of our submarine geology adventure? Well, maybe not! There is still a lot to do: we would really like to explore the full potential of our tool, fine-tune the parameters of our models, try out different architectures, and improve plotting our map of geohazard predictions, just to name some of our ideas. Both of us are ready and motivated to write the next chapter of our “AI geologist” story, together with our mentors Silvia, Veerle and Massimiliano.
At this point we would like to thank you for following our HPC journey! We hope that you enjoyed it and that we passed some of our enthusiasm for AI and HPC on to you. Stay curious!
The summer of HPC is coming to an end and this will be my last post. Therefore, I think this is the perfect occasion to present the results of my work on the Boltzmann-Nordheim equation and to summarize my experiences of the last two months. As mentioned in my first blog post the goal of my project was to improve the computation of the collision term in the simulation of the equation.
How to measure improvement
“Improving the computation” is very unspecific, so let us have a closer look at what we were aiming for in our project. When running the computation of the collision term, we can measure the time the computer needs to execute the code. This is one thing we wanted to improve. Additionally, we can also look at the scaling of the code. In my second post, I explained that we want to use multiple processes to speed up the computation. When investigating the scaling of a code, we investigate how effective it is to use more processes (or “a bigger computer”) to execute the code faster.
Good vs. bad scaling
If we write a program that behaves like building Lego Grogu in the video, we have achieved good scaling: doubling the number of builders halves the building time, and likewise we expect half the execution time of a program when we double the number of processes used. We try to avoid problems whose execution time does not shrink in proportion to the resources used. An example of such behaviour is playing Beethoven’s Symphony No. 1: no matter how many musicians are playing it, it will always take the same amount of time.
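In numbers, with $T(p)$ the execution time on $p$ processes, this is usually expressed through the speed-up and the parallel efficiency (standard definitions, not specific to my code):

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}.$$

The Lego-Grogu case corresponds to $S(p) = p$ and $E(p) = 1$, while the orchestra corresponds to $S(p) = 1$ no matter how large $p$ becomes.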
During the Summer of HPC, I tried to improve the scaling of a part of a program, such that it scales more like the gardeners’ problem.
About communicating processes
In my project, I focussed on the term Q1q, which currently takes the longest to compute. The key idea for improving it is to use a new pattern for communication between processes and to distribute the data among the processes in a new way. We assume that we have MxN processes and, similar to an MxN matrix, we can split them along rows and columns, as sketched below. Moving data along the rows and columns leads to an even and quick distribution of the data. In the following, we analyse the results in more detail.
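As a small illustration of this splitting (the real code lives inside the Boltzmann-Nordheim solver; M, N and the reduction shown here are only examples), row and column communicators can be created with mpi4py like this:

```python
# Sketch: split MPI_COMM_WORLD into row and column communicators of an MxN grid.
from mpi4py import MPI

M, N = 2, 4                                   # assumes the job runs on exactly M*N processes
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
row, col = divmod(rank, N)                    # position of this process in the MxN grid

row_comm = comm.Split(color=row, key=col)     # all processes in the same row
col_comm = comm.Split(color=col, key=row)     # all processes in the same column

# data can now be exchanged along a row or a column only, e.g.:
row_sum = row_comm.allreduce(rank, op=MPI.SUM)
col_sum = col_comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}: row sum {row_sum}, column sum {col_sum}")
```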
To analyse the results of two months of coding, I ran two experiments. For the first experiment, I ran a simulation on a grid with 64x64 grid points, initially using 2 processes. For the following tests, I doubled the number of processes, and each time I measured the time needed to compute the term Q1q with the new (hybrid) and the old (classical) computation.
In the second test, I fixed the number of processes to 8 but increased the grid size, starting from a 16x16 grid and going up to 128x128. That way, we can analyse two influences: first, how effective the scaling of the computation is, i.e. how using more processes helps to speed up the computation; and second, how increasing the grid size slows the computation down.
The results of the experiments. On the left, the results from a simulation on a 64x64 grid with different numbers of processes. On the right, the results of the test with 8 processes but varying grid size.
In both graphs, we can see that the new hybrid method is an improvement over the initial (classical) version. We see that the hybrid method benefits more from using additional processes than the initial version does. For two processes, the runtimes are very similar, but by increasing the number of processes we can compute the Q1q term faster.
Looking at the second graph, we see that the grid size still has a huge impact on the computation time, but we still manage to be faster than the classical computation.
A résumé
As this will probably be my last post, I also want to draw a personal conclusion about the Summer of HPC. For me, it was two months of intensive coding, which I enjoyed very much. I want to thank my two supervisors, Alexandre Mouton and Thomas Rey, who supported me during this project. It was a great opportunity for me to improve my skills in MPI programming and coding in general. Apart from the coding part of the Summer of HPC, I had a lot of fun writing these blog posts and getting creative with the videos. I hope you enjoyed them as much as I did. If you are thinking about taking part in the next Summer of HPC, I can recommend it to you! It was a great experience for me.
If you want to read more about the Boltzmann-Nordheim equation, have a look at the Blog of my project Partner Artem!
Goodbye (and good luck on your project, if you are applying for Summer of HPC 2022).
In this post we want to give a notion of what quantum technology is. We will discuss its most revealing aspects, its importance, and the barriers it will have to overcome to be useful. We will deal with the four aspects that, from our point of view, are the most interesting: quantum computation, metrology, quantum internet, and quantum simulation.
Quantum computation
Quantum computers can speed up certain types of calculations to an unimaginable degree. Some problems that used to be solved in exponential time can now be solved in polynomial time, and all thanks to the qubit. Qubits differ from classical bits in that they can be in any superposition of 0 and 1; that is, we can express a qubit as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $|\alpha|^2 + |\beta|^2 = 1$. This makes it possible to create physical states such as entanglement, where the quantum state of different qubits cannot be described independently of the state of the others, even when the qubits are separated by a large distance, and this is the basis of most quantum algorithms.
We use quantum computers for tasks like optimization problems, database searching, machine learning and many more. We also normally use them in combination with classical computers to achieve the best performance.
So what’s the problem with quantum computers? Estimates say that we would need (depending on the quality of the qubits and the problem we are solving) over a million qubits for quantum computers to be more efficient than today’s classical computers. The theory behind them is clear and it works; the problem is the hardware and the difficulties that arise when we want the qubits to be in states like entanglement. Consider that IBM’s biggest quantum computer has a bit over 50 qubits, although the number of qubits is scaling up rapidly over the years.
If I had to give my opinion on this, I would say that quantum technology is something worth spending time and research resources on. Just like nuclear fusion, it is something that can give us great benefits in the long term, even changing the course of humanity.
Metrology
Quantum metrology is nothing more than a way of making more precise measurements by using quantum effects such as entanglement and squeezing, which make it possible to encode a lot of information within a few particles. What we want to achieve in quantum metrology is a precision that classical theories cannot offer. For example, a classical approach based on the Central Limit Theorem can reduce the error by an amount proportional to $1/\sqrt{n}$, but with quantum effects we can do much better, limiting the error by an amount proportional to $1/n$.
In contrast to quantum computing, quantum metrology doesn’t require a large number of qubits to be effective, and although it may seem counter-intuitive, quantum uncertainty doesn’t make the measurements less precise.
Quantum internet
The quantum internet will be a network that allows quantum devices to exchange information through qubits within an environment that takes advantage of the laws of quantum mechanics such as, again, quantum entanglement and superposition.
As we said, a quantum internet could bring interesting improvements to the current Internet. One of the most important would be achieving a much lower, practically non-existent ping. This would significantly improve communications, something that may not be so noticeable for the home user but would be for industry in certain sectors.
With a quantum internet we will be able to shorten distances, interconnecting equipment that is kilometres apart.
It will also provide security; this is based on the fact that a measurement in quantum mechanics changes the state of the particle. So if you encode a message with quantum particles and your message is intercepted by a hacker, the hacker’s measurement will change the behaviour of the particles.
Encryption systems used currently, like RSA, will become obsolete as they will be easily breakable with quantum computers (in the case of RSA, using Shor’s algorithm). That said, there already exist encryption systems that are quantum-proof.
Quantum simulations
Quantum simulation means that instead of having to model a system mathematically in order to study it, we can model the system directly with another, controllable quantum system.
Quantum simulation has so far been applied mainly to problems in solid state physics, drawing analogies between a qubit lattice and other lattices (of atoms, spins, etc.) such as those studied in this branch of physics. In particular, work has been done on simulations of the Hubbard model, spin Hamiltonians, quantum phase transitions, disordered or frustrated systems (including spin glasses, superconductors, metamaterials and systems exhibiting topological order).
Lucía Absalom Bautista and Spyridanus Andreas Siskos.
The Deutsch-Jozsa algorithm stands out as one of the first quantum algorithms that performs better than the best classical algorithm. It is a simple and elegant algorithm that originated from the Deutsch algorithm (for a single qubit) and was generalised to several qubits.
So what’s the problem we want to solve?
We are given an unknown Boolean function, i.e. given a string of bits it returns either 0 or 1, and this function is either balanced or constant: a constant function returns all 0’s or all 1’s, while a balanced function returns 0’s for half of the inputs and 1’s for the other half. Is our function f balanced or not?
Solution:
On a classical computer, in the best-case scenario we would need at least 2 queries to the oracle to determine whether the function is balanced or not; in the worst case we need half the number of inputs plus one. The power of quantum computation lies in the fact that with a single query to the oracle we are able to know whether our function is balanced or not.
Our function is implemented in the oracle as $U_f|x\rangle|y\rangle = |x\rangle|y \oplus f(x)\rangle$, where $\oplus$ is addition modulo 2.
So let’s describe the algorithm!
We take two quantum registers, one with n qubits initialized to $|0\rangle^{\otimes n}$ and a single qubit initialized to $|1\rangle$: $|\psi_0\rangle = |0\rangle^{\otimes n}|1\rangle$.
We apply a Hadamard gate to each qubit to create a superposition state: $|\psi_1\rangle = \frac{1}{\sqrt{2^{n+1}}}\sum_{x=0}^{2^n-1}|x\rangle\,(|0\rangle - |1\rangle)$.
We apply the quantum oracle, which maps $|x\rangle|y\rangle \mapsto |x\rangle|y \oplus f(x)\rangle$ and, by phase kickback, leaves the state $\frac{1}{\sqrt{2^{n+1}}}\sum_{x}(-1)^{f(x)}|x\rangle\,(|0\rangle - |1\rangle)$.
We then apply a Hadamard gate to each qubit in the first register.
The last step is to measure the first register. The probability of measuring $|0\rangle^{\otimes n}$ is $\left|\frac{1}{2^n}\sum_{x}(-1)^{f(x)}\right|^2$, which evaluates to 1 if $f$ is constant and to 0 if $f$ is balanced.
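As a quick sanity check of that final step, here is a small plain-NumPy calculation (not a circuit implementation) of the probability of measuring $|0\rangle^{\otimes n}$ in the first register; the example functions are made up.

```python
# Check: probability of measuring |0...0> is 1 for a constant f, 0 for a balanced f.
import numpy as np

def probability_all_zero(f_values):
    """f_values: list of f(x) for x = 0 .. 2^n - 1."""
    n_states = len(f_values)
    # after phase kickback the first register carries amplitudes (-1)^f(x) / sqrt(2^n)
    amplitudes = np.array([(-1) ** f for f in f_values]) / np.sqrt(n_states)
    # amplitude of |0...0> after the final Hadamards is the sum divided by sqrt(2^n)
    return abs(amplitudes.sum() / np.sqrt(n_states)) ** 2

n = 3
constant = [0] * 2**n
balanced = [x & 1 for x in range(2**n)]        # f(x) = last bit of x
print(probability_all_zero(constant))          # 1.0 -> f is constant
print(probability_all_zero(balanced))          # 0.0 -> f is balanced
```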
Example circuit
Choosing a small n and an f(x) that is balanced, the implementation of the algorithm as a circuit would look like this:
Hello again, and welcome to my last blog post. Today as I write this post, the beautiful journey “Summer of HPC 2021” is about to end. The video presentations of all the teams were fantastic, and it seemed that we all loved what we did and had a good time. Through all this experience, I met new people, I learned new things, and above all, I had a creative summer that will be unforgettable for the rest of my life!
Before closing this article, I would first like to give a brief overview of the timeline of our project. Despite the difficulties we faced, we managed to cope and complete the task that had been assigned to us. We both worked hard and we hope our result will benefit the scientific community.
Profiling
The first thing to do when optimizing an application is to see where the time is spent. Using the VTune tool, we found that the most time-consuming function was MatSOR, which is called from KSPSolve, the function responsible for solving a linear system of equations. So what we wanted to do was transfer these time-consuming operations to the GPU.
From CPU to GPU using PETSc
By reading the documentation of the PETSc library and looking at various slides from the library’s training events, we were able to identify how to transfer the required vector and matrix operations to the GPU. The library does not require separate GPU versions of these functions; all that needs to be done is to declare the vectors and matrices with a special CUDA type.
For example, if we have a vector V and we want it to live on the GPU rather than on the CPU, we simply declare it by calling VecSetType(V, VECCUDA). For a matrix M, the corresponding call is MatSetType(M, MATAIJCUSPARSE). Everything that has to do with the transfer of data between different GPUs, or between CPU and GPU, is handled by the library, so we did not have to deal with this part. The difficulty we encountered at this point is that the CPU version of the code used a multigrid preconditioner. This preconditioner, however, did not have direct GPU support, so we could not use it. After testing all the preconditioners with GPU support, we settled on the default, incomplete LU, which had the best performance.
Yes, but is it correct?
When making a code change, the first thing to do before taking measurements or continuing with further optimizations is to make sure the result is correct. What we wanted to verify at this stage of the implementation was that we were not altering the original CPU output. We visualized the outputs of both implementations, the initial one and our own, using ParaView, and checked the accuracy of our result. The two outputs were identical, so we were ready to continue to the next stage.
Performance comparison
So, through a series of experiments, we then measured the performance of our implementation and compared it while changing various parameters such as the number of nodes, the number of graphics cards, and the number of MPI processes. One observation we made is that application performance on both the CPU and the GPU depends mainly on the number of MPI processes. Something that does not affect performance is the resource topology, i.e. whether the GPUs are all in one node or spread across many nodes.
Our version seems to outperform the CPU implementation with 1, 2, 3, and 4 MPI processes, which is satisfactory and encouraging. If you liked our project and are interested in our work, you can watch our video presentation, in which we present the results of our research in detail.
Our Video Presentation
I would like to thank my partner for our cooperation, our mentor for the directions and instructions he gave us, and PRACE for giving us the opportunity to have such an experience.