Project reference: 2106

The NMMB/MONARCH model (formerly NMMB/BSC-Dust) (Pérez et al., 2011; Haustein et al., 2012) is an online multi-scale atmospheric dust model designed and developed at the Barcelona Supercomputing Center (BSC-CNS) in collaboration with NOAA's National Centers for Environmental Prediction (NCEP), NASA's Goddard Institute for Space Studies and the International Research Institute for Climate and Society (IRI). The dust model is fully embedded into the Non-hydrostatic Multiscale Model NMMB developed at NCEP (Janjic, 2005; Janjic and Black, 2007; Janjic et al., 2011) and is intended to provide short- to medium-range dust forecasts for both regional and global domains.

This model is used at the Earth Sciences department as a research tool and as a forecasting model (https://dust.aemet.es). The model uses different datasets as inputs, which must first be interpolated to a common grid. Furthermore, once the model finishes, the user can retrieve the output in different grid configurations and vertical distributions. To complete all these tasks, the Computational Earth Sciences department has been developing the Interpolation Tool (IT) in Python. This tool uses a wide range of interpolation methods to provide flexibility to the atmospheric scientists.

The goal of the project is to extend the flexibility of the current tool by providing an API that can be called from other tools or services, not only stand-alone by the MONARCH model, and to decouple the current implementation from external tools such as CDO or ESMPy. The candidate will work on the mathematical aspects of the tool and on its computational performance, post-processing big files in an optimal way and improving parallelization, especially for the horizontal interpolation, working on domain decomposition techniques.
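As a hedged illustration of the domain decomposition idea (this is not the IT tool's actual API; the grids, libraries and variable names are assumptions for the sketch), the target grid of a horizontal interpolation can be split into latitude bands, with each MPI rank regridding only its own band:

from mpi4py import MPI
from scipy.interpolate import RegularGridInterpolator
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Source field on a regular lat/lon grid (stand-in for a model output slice).
src_lat = np.linspace(-90, 90, 181)
src_lon = np.linspace(0, 359, 360)
field = np.random.rand(src_lat.size, src_lon.size)
interp = RegularGridInterpolator((src_lat, src_lon), field, method="linear")

# Target grid, decomposed by latitude bands: each rank handles one chunk.
tgt_lat = np.linspace(-89.5, 89.5, 360)
tgt_lon = np.linspace(0, 359, 720)
my_lat = np.array_split(tgt_lat, size)[rank]

# Interpolate only this rank's band, then gather the bands on rank 0.
pts_lat, pts_lon = np.meshgrid(my_lat, tgt_lon, indexing="ij")
my_band = interp(np.column_stack([pts_lat.ravel(), pts_lon.ravel()]))
bands = comm.gather(my_band.reshape(my_lat.size, tgt_lon.size), root=0)
if rank == 0:
    regridded = np.vstack(bands)  # the full field on the target grid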

Project Mentor: Kim Serradell Maronda

Project Co-mentor: Francesco Benincasa

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Daniel Cortild, Brian O’Sullivan

Learning Outcomes:
The candidate will work on an operational project within the Computational Earth Sciences team and will learn how to deal with Earth system model outputs in HPC environments such as MareNostrum and other clusters, supporting different job schedulers and software stacks. In addition, the student will improve their mathematical knowledge and will learn how to profile and evaluate the computational performance of a given tool.

Student Prerequisites (compulsory):
Python development skills.
Mathematical knowledge about interpolation methods.

Student Prerequisites (desirable):
Experience with HPC job submission.
Experience with Earth Sciences model formats (netCDF).
Experience as a software engineer developing custom APIs.

Training Materials:
Python in High Performance Computing (https://www.futurelearn.com/courses/python-in-hpc) is a great course and will give the student a very valuable background.

Workplan:

Week 1-2: discover the BSC Earth infrastructure, learn to run the tool and explore different use cases. Start looking at the code to get familiar with it.
Week 3-4: Gather requirements for the API and start modifying the code to get rid of external dependencies.
Week 5-6: Develop API access and custom interpolation methods.
Week 7-8: Validate results and computational performance.

Final Product Description:
An improved version of the IT tool. This new version will have better performance and improved capabilities, will remove external dependencies, and will be easier to use in other department tools with interpolation needs.

Adapting the Project: Increasing the Difficulty:
Start coupling the improved IT tool to other department tools (e.g. Providentia).

Adapting the Project: Decreasing the Difficulty:
Get more support from the research engineers working on the development of the tool and focus only on the first part of the project (modifying the code to get rid of external dependencies and implementing custom methods).

Resources:
The student will have accounts to access the Earth Sciences environment (including access to GitLab, wiki and storage) and an HPC account to MareNostrum and Nord3. Both accounts will be provided by the center.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center

Project reference: 2105

Biomedical and Life Sciences are two of the main areas that have seen considerable growth in literature over the last decade, demonstrated by the increase in articles indexed in PubMed (a database of biomedical articles).

An example of a BioNLP task that has received increasing attention is the BioASQ challenge, where participants have to index abstracts with multiple labels (i.e., multilabel text classification); the performance of the proposed systems has increased considerably over the baselines and over the current system used by the National Library of Medicine (NLM).

However, this shared task considers only the abstracts of English articles, not covering other languages with a considerable academic writing volume, such as Portuguese, Spanish, and French.

The goal of this research project is to provide a classification prototype, in the form of a BioASQ submission, that can take the Spanish language as input. Given the extensive use of deep learning architectures, the student will benefit from learning how to perform experiments and how to design code suitable for running on one or multiple GPUs.
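To make the task concrete, here is a minimal, hypothetical sketch of the multilabel setup in PyTorch (the encoder dimension, label count and classifier head are placeholders; a real BioASQ system would sit on top of a pretrained language model):

import torch
import torch.nn as nn

n_features, n_labels = 768, 100  # e.g. encoder output size, number of MeSH labels

model = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(), nn.Linear(256, n_labels))
criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per label

x = torch.randn(32, n_features)                  # a batch of encoded abstracts
y = torch.randint(0, 2, (32, n_labels)).float()  # multi-hot label vectors

loss = criterion(model(x), y)
loss.backward()

# At inference time, each label is assigned independently by thresholding.
preds = (torch.sigmoid(model(x)) > 0.5).int()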

An example of extracting mentions from PubMed literature and assigning a MeSH concept identifier to each mention.

Project Mentor: Marta Villegas

Project Co-mentor: Maite Melero Nogues

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Aslihan Uysal, Lazaros Zervos

Learning Outcomes:
The student will benefit from learning about deep learning architectures with the BSC text mining group. The student will also learn how to code and perform experiments with GPUs and how to coordinate jobs across multiple GPUs (parallel GPU training).

Student Prerequisites (compulsory):
Knowledge about Python
Knowledge about text processing

Student Prerequisites (desirable):
Knowledge of any deep learning framework
Previous experience with data science projects

Training Materials:
https://www.coursera.org/learn/text-mining
https://www.coursera.org/learn/python-text-mining
https://pytorch.org/tutorials/

Workplan:

Week 2 – Research bibliography (student 1 monolingual, student 2 multilingual)
Week 3 – Plan/Schedule for the project with a conceptual model sketch for both students
Week 4 – Monolingual proof-of-concept (student 1) and Multilingual (student 2)
Week 5 – Benchmarking with state-of-the-art (students 1 and 2)
Week 6 – Improvement on the proof-of-concepts and large scale experimentation (students 1 and 2)
Week 7 – Benchmarking with state-of-the-art (students 1 and 2)
Week 8 – Final Report (students 1 and 2)

Final Product Description:
The expected outcome is a proof-of-concept classifier for scientific articles written in English and preferably in another language. The second outcome is the tailoring of current solutions to HPC environments.

Adapting the Project: Increasing the Difficulty:
The project could be adapted by expecting the student to identify sections in the scientific articles that are more relevant for the classification, thus providing explainability capabilities.

Adapting the Project: Decreasing the Difficulty:
The student could perform the experiments only on the English language, which already has a vast number of baselines and resources available.

Resources:
GPUs, PyTorch, TensorFlow and MKL libraries on HPC clusters.
Training data, which can be supplied by the hosting group.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center

Project reference: 2104

Machine learning applications will dominate edge and mobile applications in the future. While most training takes place on HPC clusters and many models are deployed in the cloud, some of these applications still have to run on mobile and edge devices due to security and other software requirements.

To run these applications on low-energy devices such as edge devices, the models are compressed significantly to reduce both their energy and memory footprint, through processes called pruning and quantization. While many of these applications tolerate such compression, a certain level of resilience is required depending on the application. To achieve this, a careful trade-off is required between model size and accuracy.
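As an illustration of the two compression steps named above, here is a hedged sketch using PyTorch's pruning and dynamic quantization utilities (the model is a toy placeholder; the actual project may use different tooling):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 50% smallest-magnitude weights of each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantization: store weights as 8-bit integers instead of 32-bit floats.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)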

In this work, we shall train and deploy a resilient mobile application. To do this, we shall first familiarize ourselves with training machine learning models on HPC. We shall rely on MPI for data-parallel training to accelerate the training process.
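A minimal sketch of that data-parallel pattern, assuming mpi4py and a toy PyTorch model (run with mpirun): each rank computes gradients on its own data shard, and an allreduce averages them so that every replica takes an identical optimizer step.

import numpy as np
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 10)            # this rank's local mini-batch shard
y = torch.randint(0, 2, (16,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

# Average the gradients across all ranks before stepping.
for p in model.parameters():
    grad = p.grad.numpy()
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)
    p.grad = torch.from_numpy(avg / size)

opt.step()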

After training, we shall build a small Android application and convert the model for mobile deployment. We shall then follow best practices to optimise the mobile application for both energy efficiency and resilience.

Project Mentor: Leonardo Bautista Gomez

Project Co-mentor: Albert Njoroge Kahira

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Mehmet Enes Erciyes, Jakub Raczyński 

Learning Outcomes:
The students will learn the foundations of Deep Learning and training Deep Learning Models in HPC clusters. They will also learn how to create Machine Learning Mobile Applications.

Student Prerequisites (compulsory):
Proficient with Python
Proficient with Linux

Student Prerequisites (desirable):
Familiarity with TensorFlow or PyTorch is an added advantage.
Experience or familiarity with building Android applications.

Training Materials:
https://mpitutorial.com/tutorials/
https://pytorch.org/tutorials/
https://arxiv.org/pdf/2012.00825.pdf

Workplan:
1st Week: Training Week
2nd Week: Getting familiar with MareNostrum
3rd Week: Fundamentals of Deep Learning (CNN training and inference)
4th Week: Distributed Machine Learning Training
5th Week: Pruning trained models
6th Week: Quantisation of trained models
7th Week: Android Application for Deep Learning
8th Week: Final Report and wrap up

Final Product Description:
1) A tool for training Deep Learning applications in HPC systems
2) A mobile DL application that is resilient to errors

Adapting the Project: Increasing the Difficulty:
A web app can be built and hosted in the cloud to supplement the mobile application.

Adapting the Project: Decreasing the Difficulty:
We can remove the mobile application part and focus solely on training Machine Learning models in HPC clusters.

Resources:
Students will have access to the MareNostrum supercomputer and, specifically, to the GPU-based cluster called Power9.
Python will primarily be used as the programming language, and all other software required for the project will be installed on MareNostrum.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center

Project reference: 2103

High Performance Computing (HPC) clusters have evolved to complex systems with a tremendous number of components. An HPC cluster is much more than the mere sum of its parts. A fast CPU is nothing without a fast memory. Without intelligent network topologies and high bandwidth, the data cannot be communicated between processes running on different nodes. The performance of one component depends on the interoperability with others. Besides, the architecture is only a piece of a whole. In order to address the synergies of the system, perfectly aligned drivers and countless software packages are necessary.

As supercomputers grow ever larger, the number of components grows in equal measure. The largest clusters comprise hundreds of thousands of nodes, that is, hundreds of thousands of CPUs, GPUs, local storage devices, network cards, etc. With that, the likelihood of failures becomes a pressing issue. The mean time between failures of a cluster at exascale is expected to lie within hours. As computing hours are expensive, it becomes prohibitive to run applications at that scale without any protection against failures. A common technique to protect an application is checkpoint-and-restart. In essence, this means writing snapshots of the application state to disk and, upon a failure, recovering the state from the snapshot and continuing execution. This sounds fairly easy; however, I/O is typically the bottleneck of HPC applications, and creative solutions are necessary to deliver good performance.

There are techniques that do not depend on I/O, for instance replication, redundancy, and preemptive process migration. Such methods, however, are often application specific or too expensive in terms of resources. In fact, checkpoint-and-restart is often the only viable solution. Modern checkpoint libraries leverage all available storage hierarchies in sophisticated checkpoint schemes. Some advanced techniques are checkpoint encoding, checkpoint staging, partner-node checkpointing, differential checkpointing, and incremental checkpointing. In some cases, checkpoint creation can be overlapped with the application execution, for instance when part of the computation takes place on accelerators.
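To illustrate the differential idea (a conceptual Python sketch of the principle only, not FTI's actual C implementation): split the protected data into blocks, hash each block, and rewrite only the blocks whose hash has changed since the last checkpoint.

import hashlib
import numpy as np

BLOCK = 4096  # bytes per block (illustrative)

def checkpoint_diff(data, prev_hashes, store):
    """Write only changed blocks to the store; return the new hash table."""
    hashes = {}
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        h = hashlib.md5(block).hexdigest()
        hashes[i] = h
        if prev_hashes.get(i) != h:  # block changed since last checkpoint
            store[i] = block
    return hashes

state = np.zeros(100_000)
store = {}
hashes = checkpoint_diff(state.tobytes(), {}, store)      # first ckpt: all blocks
state[42] = 1.0                                           # small state update
hashes = checkpoint_diff(state.tobytes(), hashes, store)  # only one block rewritten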

We would like to provide insight into fault tolerance techniques, including all the techniques above, and to supervise the development of a novel differential checkpointing method based on precision boundaries. Our objective is to implement the developed mechanism in the multi-level checkpointing library FTI (Fault Tolerance Interface). The library is maintained at BSC. FTI is already ported to numerous applications, such as ALYA, LULESH, GYSELA5D, Melissa-DA, HACC, and many others. The mechanism can be tested with those applications on several clusters, for instance MareNostrum4 and CTE-Power. The clusters are equipped with SSD and NVMe local storage devices. The student will learn to submit applications at scale and will be able to test her/his implementation on the clusters, leveraging the available storage technologies.

Project Mentor: Leonardo Bautista Gomez

Project Co-mentor: Kai Keller

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Kevser İLDEŞ, Athanasios Kastoras

Learning Outcomes:

  • Get familiar with cluster scheduling
  • Improve parallel programming skills
  • Gain insight into resilience in HPC

Student Prerequisites (compulsory):
The student should be familiar with Linux environments and should have basic knowledge of C.

Student Prerequisites (desirable):
Knowledge of shell programming, Python, Fortran and C++ would be a plus.

Training Materials:
https://fault-tolerance-interface.readthedocs.io

Workplan:

Week 1) Learn how to access the clusters and how to submit jobs.
Week 2) Complete the FTI tutorial and get familiar with the library.
Week 3) Implementation of the differential checkpointing.
Week 4) Implementation of precision calculations.
Week 5) Integrate precision calculations into differential checkpoint.
Week 6) Run experiments with HPC applications from different fields.
Week 7) Run larger experiments with more complex applications.
Week 8) Wrapping up and generating a report of the work done.

Final Product Description:

  • A novel differential checkpointing implementation in FTI
  • New insights about the scope for precision-based differential checkpointing

Adapting the Project: Increasing the Difficulty:
FTI provides checkpoint encoding using the Reed-Solomon algorithm. The encoding is not yet ported to GPU. This would be an interesting work to consider in the project.

Adapting the Project: Decreasing the Difficulty:
We can focus merely on experiments to investigate the dependency of certain kinds of applications on data precision.

Resources:
The students will need their own laptops and an appropriate IDE for software development in C, Fortran, Python and C++.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center

Project reference: 2102

Mitigating climate change is one of humankind's most important challenges at present. Since the 19th century, the global use of fossil fuels has increased and dominated world energy consumption and supply. The rapid rise of global energy consumption and the cumulative global CO2 emissions from burning fossil fuels play a critical role in climate change. Nuclear fusion is one of the most promising alternatives for generating large-scale, sustainable and carbon-free energy. Fusion is the process that takes place in the sun and generates enormous quantities of heat and light.

Since the 1950s, researchers have tried to replicate nuclear fusion on Earth, a process historically coined as “building the sun in a box”. However, as one can imagine, the conditions in a fusion reactor are quite harsh. Hydrogen gas in the reactor is heated to a very high temperature (over 10^8 °C) until it forms a plasma; controlled by powerful magnets, the atoms fuse and release energy. Unfortunately, due to the extreme complexity and conditions required to achieve a controlled fusion reaction, no design has achieved a positive net energy gain.

The development of new materials is key to meeting and overcoming nearly all the world’s challenges. The study of suitable materials with the desirable chemical and physical properties to be used on the reactor walls is one of the challenges that still need to be overcome. In examples like this one, where experimental data are not available or difficult to obtain, computer simulations are critically important. Furthermore, the continual development of increasingly powerful computers makes discoveries in materials science more achievable than ever before. In the present project, we will use classical molecular dynamics simulations to study the fundamental properties of metals at the atomic scale. We will study materials that can be potentially used as protective layers in fusion reactors. The outcomes from this study will be key for the further development of more resistant materials to be used in fusion technologies.
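As a flavour of the kind of calculation involved, here is a hedged sketch using the LAMMPS Python interface to estimate the vacancy formation energy in bcc tungsten (the potential file name and lattice constant are assumptions; any suitable tungsten potential would do):

from lammps import lammps

def minimized_energy(delete_one_atom):
    lmp = lammps()
    for cmd in [
        "units metal",
        "boundary p p p",
        "lattice bcc 3.165",             # approximate W lattice constant (Angstrom)
        "region box block 0 5 0 5 0 5",
        "create_box 1 box",
        "create_atoms 1 box",
        "pair_style eam/alloy",
        "pair_coeff * * W.eam.alloy W",  # assumed EAM potential file
    ]:
        lmp.command(cmd)
    if delete_one_atom:                  # removing one atom creates a vacancy
        lmp.command("group vac id 1")
        lmp.command("delete_atoms group vac")
    lmp.command("minimize 1.0e-8 1.0e-10 1000 10000")
    return lmp.get_thermo("pe"), lmp.get_natoms()

e_bulk, n = minimized_energy(False)
e_vac, _ = minimized_energy(True)
# E_f = E(defective cell) - (N-1)/N * E(perfect cell)
print("Vacancy formation energy:", e_vac - (n - 1) / n * e_bulk, "eV")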

Schematics of a plasma in a tokamak (left), polycrystalline tungsten metal structure simulated with molecular dynamics (right) and MareNostrum-4 supercomputer (background).

Project Mentor: Julio Gutiérrez Moreno

Project Co-mentor: Mervi Johanna Mantsinen

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Eoin Kearney, Paolo Settembri

Learning Outcomes:
Training and experience in a nuclear fusion research project involving numerical modeling. The student will learn how to simulate materials at the atomic scale under realistic conditions using one of Europe's largest HPC platforms.

Student Prerequisites (compulsory):
Background in Materials, Physics or Chemistry.

Student Prerequisites (desirable):
Experience in scientific programming.
Familiarity with Linux

Training Materials:
http://fusion.bsc.es/
https://fusionforenergy.europa.eu/
https://lammps.sandia.gov/tutorials.html

Workplan:
The candidate will be part of the BSC Fusion group where she/he will work in close contact with the group members and the supervisor. Regular monitoring (daily / weekly) of the work is planned according to the student’s progress and the tasks available.

Work packages (weekly schedule):

  1. Introduction to materials for fusion (W1)
  2. Training and introduction to molecular dynamics simulations and visualization tools (W2)
  3. Introductory problems with LAMMPS code (W3)
  4. Simulation of fusion materials with LAMMPS, post-processing, and results analyses (W4-W7)
  5. Report writing (W8)

Final Product Description:
The outcomes of the proposed project are key to improving our understanding of the fundamental properties of metals in fusion reactors and will demonstrate the possibility of simulating realistic large-scale metallic systems with molecular dynamics.

Adapting the Project: Increasing the Difficulty:
The range of materials and structures is quite broad, so that the project can be easily adapted depending on the student’s interests and capabilities. The project will evolve from relatively easy tasks which are already of high interest and that can be completed within a few weeks, e.g. formation of single and multiple vacancies in a single crystal bulk metal, towards surface models or polycrystalline structures, which require larger atomic structures with more complex geometries.

Adapting the Project: Decreasing the Difficulty:
In the unlikely scenario that the analyses of dynamic models (time- and temperature-dependent) get too complicated for the student or require too much time to converge, we can limit the study to static models, from which we can already extract some valuable information, such as the equilibrium structure or the mechanical properties of the system. Working with static (frozen) models, we can make use of more complex and accurate interatomic potentials and define real-scale structural models at a limited computing cost. Still, the study of the evolution of mechanical properties upon vacancy formation is an interesting topic, and many structures can be analysed within the period allocated to this internship.

Resources:
The molecular dynamics simulations will be run with LAMMPS, which is an open-source code. The student will make use of BSC's in-house resources such as the MareNostrum supercomputer.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center

Project reference: 2101

Outstanding computational capabilities are useless if data is not where it is needed, when it is needed. This is common knowledge in computer architecture, even more so now that systems are extremely parallel and memory bandwidth gets scarcer by the day. For this reason, correctly moving data throughout the memory hierarchy of the system is essential to achieving the performance and, more importantly, the efficiency required to reach exascale.

To account for this, the MEEP project, as part of the ACME HPC accelerator architecture, proposes smart data movement and placement policies, which autonomously transfer data among the different levels of the memory hierarchy. These will enable seamless access to data and avoid unnecessary stalls in the processor pipeline. This will lead to faster and more efficient computing, in which neither time nor energy is wasted waiting for the data that is needed, the access patterns of the applications are better leveraged, and the available bandwidth is used more efficiently overall.

The selected candidate(s) will work with Coyote, a new in-house RISC-V architecture simulator, on the research, analysis, implementation and evaluation of the new data management policies proposed in MEEP. These policies will be shaped by the behaviour of selected workloads widely used in traditional HPC, such as DGEMM, SpMV or FFT among others, as well as future HPC applications.
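As a toy illustration of how such policies can be compared (this is a self-contained Python sketch, not Coyote), consider measuring the hit rate of two replacement policies on a synthetic access trace with some locality:

from collections import OrderedDict
import random

def hit_rate(trace, capacity, policy="lru"):
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(addr)    # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict LRU entry (or first-in for FIFO)
            cache[addr] = True
    return hits / len(trace)

random.seed(0)
# Mostly a small hot set, with occasional accesses to a large cold region.
trace = [random.randrange(64) if random.random() < 0.9 else random.randrange(10_000)
         for _ in range(100_000)]
print("LRU :", hit_rate(trace, 32, "lru"))
print("FIFO:", hit_rate(trace, 32, "fifo"))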

MEEP project logo, where Coyote and this Summer of HPC project are contextualized.

The ACME architecture to accelerate HPC applications.

Coyote simulator environment: a) ACME model to be simulated; b) behavioural visualizations for the L2 memory cache.

Project Mentor: Borja Pérez

Project Co-mentor: Teresa Cervero

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Regina Mumbi Gachomba, Aneta Ivaničová

Learning Outcomes:
– A deep understanding of how different HPC workloads behave and how they stress the resources of the system.
– Expertise with simulators and their visualization tools, which are the first step in proposing new designs for HPC architectures.

Student Prerequisites (compulsory):
– Object oriented programming
– Moderate understanding of computer architecture and how to program for its efficient use.

Student Prerequisites (desirable):
– C++

Training Materials:
– Paraver trace visualization tool: https://tools.bsc.es/paraver
– MEEP web Page: https://meep-project.eu/
– Coyote: https://github.com/borja-perez/Coyote

Workplan:

  • Week 1: Training.
  • Week 2: Familiarization with the proposed architecture and target workloads.
  • Week 3: Familiarization with the simulation tools.
  • Weeks 4-7: Analysis of data management policies and potential proposal of new ones.
  • Week 8: Conclusions.

Final Product Description:
A comparative analysis of different data management policies for the evaluated HPC workloads, and the extraction of relevant conclusions on their adequacy for the HPC context.

Adapting the Project: Increasing the Difficulty:
It is possible to increase the complexity of the project by considering more features and low-level details in the definition of the architectural blocks that have to comply with the data management policies. The results of the analysis and the input of the candidates will be used to develop and implement new policies.

Adapting the Project: Decreasing the Difficulty:
Focus may be shifted to the visualization side of the analysis. This would also provide valuable learning for the candidate, such as the definition of metrics to evaluate HPC architectures, correct visualization to obtain insightful knowledge and the associated tools.

Resources:
The Coyote simulator is developed at BSC and is open source. The required computing equipment and facilities will be provided by BSC.

*Online only in any case

Organisation:
BSC – Computer Science- European Exascale Accelerator

Applications are open from 15th of January 2021 to 12th of April 2021. See the Timeline for more details.

PRACE Summer of HPC programme is announcing projects for 2021 for preview and comments by students. Please send questions to coordinators directly by the 11th of January. Clarifications will be posted near the projects in question or in FAQ.

About the Summer of HPC programme:

Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 66 top applicants from across Europe will be selected to participate in pairs on 33 projects supported and mentored online from 14 PRACE hosting sites. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. PRACE is looking to financially support selected participants during the programme, which will run from 31st June to 31st August 2021.

For more information, check out our About page and the FAQ!

Ready to apply? Click here! (Note, not available until January 10th, 2020)

Have some questions not found in the About section or the FAQ? Email us at sohpc16-coord@fz-juelich.de.

Programme coordinator: Dr. Leon Kos, University of Ljubljana

The past two months have flown past and boy has it been eventful!!!! I started out this summer having close to zero experience in coding for HPC and now I can proudly say that I have successfully parallelized a research-standard Plasma Kinetic simulation code on my own. That too with wonderful results!!! First, I would like to summarise the results of our work.

Wrap-up of Project 2019

The cool output from the OOPD1 plasma code. You can see the movement of the plasma particles (Argon ions) here

We were tasked with using Graphics Processing Units (GPUs) to speed up particle-in-cell (PIC) codes, which are used for plasma kinetics simulations. For a sweet, technical summary of the code, I would suggest you read my colleague Victor's blog entry, and for a very basic idea of what GPUs are and how GPU coding works, you can go through my previous blog entry. We worked with a simple PIC prototype called SIMPIC and a more full-fledged PIC code called OOPD1. Our task was to achieve speedup of these codes using GPUs, and I can confidently state that this objective was achieved. Without going into too much detail, we concluded the following:

Offloading the intense parts of the PIC code to the GPU really worked wonders!!!!
  • Our GPU version of the code is much faster than the original version.
  • The more plasma particles there are, the better the speedup. This means we get even better performance when simulating larger and more relevant plasma systems.
  • Some parts of the code show best performance in CPU. Hence, a hybrid CPU-GPU version shows best performance overall.
  • We used StarPU as a task scheduler which assigns certain parts of the code to the CPU/GPU as tasks.
  • Now, our code can be run on any architecture without needing to change code.

We explored the performance of our code even further by using a lot of code visualisation tools like NVIDIA Visual Profiler and the ViTE trace visualizer. We also ran and tested our code on different architectures like the VIZ cluster at our own site (obviously!) and the MARCONI supercomputer in Bologna (the 2nd fastest supercomputer in Europe!) and got very good results.

Visualising a PIC code using ViTE. You can see the arrows indicating data transfers between CPU and GPU, and the non-red parts indicating the computational parts of the code!!

At the end of the program we had to showcase our outcomes from the project through a video presentation which is available on the SummerofHPC channel on YouTube. You can see our video below where we have presented our results in a more concise, technical manner.

Our final video presentation

Final Thoughts and Moving Ahead

As I look back now over the past couple of months, I really cannot believe that we could achieve this much without physically being at the project site and in contact with my colleagues and mentor. Of course, I have learnt so much from this program and I will forever be grateful for being selected for this year’s Summer of HPC program. This was my first experience of a research internship and I think I will take forward a lot of positives from this to my future endeavors. Even though this is the official end of my internship, I will however continue to work on this research effort in the LECAD Laboratory where we will try to implement the same things on even bigger, more complex plasma codes. Also, I got to meet a lot of new, wonderful people from various countries through this program. Working together on a common research topic/goal despite our varied cultures will be something that I’ll always cherish. Finally, I would like to thank PRACE for organizing this wonderful program in spite of the various setbacks due to the pandemic situation. Thank you all for also reading my blog posts and hope you enjoyed and learned something from them!!! This is Shyam Mohan signing off!!!

It has been a few weeks since the Summer of HPC (SoHPC) finished, and I wanted to wait a bit before giving my final thoughts on the work done during these past two months.

In the first place, I want to say thanks to the organization that made this experience possible. Especially this year, with the COVID-19 situation, I am glad that the organization decided to carry on with SoHPC, adapting it to the current situation and remote work.

Of course, there are some parts of the experience that we missed out on, for example meeting everyone during the training week, being at the HPC centre and meeting the team there, or having the experience of living abroad for two months.

But, on the other hand, due to the remote-working situation, we had the opportunity to have more than one student per project, and in my case that provided the opportunity to do teamwork. This skill is crucial for almost any job nowadays, and here I was given a chance to improve it. Regarding this, I would like to thank my project partner, Benedict, for making collaborative work easier despite working remotely.

In relation to the project I participated in, taking part has opened the doors to another field of knowledge, quantum physics, and especially quantum computing. After these two months, I can say that I am aware of what this field can provide to other fields of science, such as genetics. Moreover, I now have some basic knowledge that was acquired through practice, which in my opinion is the best way to learn anything, if possible. And for this, I would like to thank our mentor Myles for his guidance.

To sum up, as a student I have enjoyed these last two months and I'd recommend the experience. I would like to encourage any science student (especially women) who is enthusiastic about HPC to enroll in upcoming editions of SoHPC.

Hi again! In my last post I mentioned how we parallelized the decision tree algorithm with MPI and how much performance gain we obtained with an increasing number of cores. In this post I would like to briefly discuss how the performance can be further improved.

MPI vs GASPI

As shown in the performance plots in my previous post, time spent on synchronization is an important issue for a parallel program. With MPI's standard send/receive commands, processor cores block each other during communication, which means that processors must wait for each other until the message has successfully left the sender and been delivered to the receiver. This kind of communication strategy is called two-sided communication.

Two-sided communication

We can’t completely get rid of communication overhead, but we can overlap communication and computation task by using a different parallelization paradigm.

GASPI (Global Address Space Programming Interface) is considered an alternative to the MPI standard, aiming to provide a flexible environment for developing scalable, asynchronous and fault-tolerant parallel applications. It provides one-sided remote direct memory access (RDMA) driven communication in a partitioned global address space. This definition may be hard to grasp, but what we should focus on is GASPI's capability of providing a one-sided communication environment without tough synchronization requirements. One-sided communication allows a process to perform read and write operations on another processor's allocated memory space without the participation of the other processor. Unlike two-sided communication, the processor whose memory is being accessed continues its job without any interruption. This means that the processors continue their computations alongside their communication.
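GASPI implementations are C libraries, but the one-sided pattern itself can be illustrated with MPI's own RMA interface through mpi4py (an analogy, not GASPI: here rank 0 writes directly into rank 1's exposed memory window without rank 1 posting a receive; run with at least two ranks):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype="d")      # memory this rank exposes to remote access
win = MPI.Win.Create(buf, comm=comm)

win.Fence()                       # open the access epoch (simplest sync mode)
if rank == 0:
    data = np.arange(4, dtype="d")
    win.Put(data, 1)              # one-sided write into rank 1's buffer
win.Fence()                       # close the epoch; the data is now visible

if rank == 1:
    print("rank 1 buffer:", buf)  # -> [0. 1. 2. 3.]
win.Free()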

One-sided communication with GASPI

GPI-2 is an open-source programming interface implementing the GASPI standard. It can be used from the C and C++ languages to develop scalable parallel applications. The working principle of GPI-2 is demonstrated in the figure below.

Working principle of GPI-2

Unfortunatelly, we didn’t have enough time to finilaze the implementation of GPI-2 for our algorithm, so I can’t show you the performance results right now. But performance plot of GPI-2 (below) taken from its official tutorial document (http://gpi-site.com.www488.your-server.de/tutorial/) shows that it really works.

Farewell

I hope that you learned from my posts. I guess now it is time to say goodbye. I would like to thank all the people who made this programme happen.

SoHPC started about 2 months ago… If you tell me this I think you are lying: how can so many events happen in just two months? I feel like a year passed since the start: we dealt with so many things that I can hardly remember the first ones. By the way, it’s time now to summarize what has been done and how to continue our effort in this project.

I’ve already talked about persistent memory in my last post. I promised to bring you some numbers, but it was harder than I thought: the cluster we worked on used memory caching to reduce the impact of the disk operations. We had to circumvent caching to correctly measure the speed-ups, and this took a little bit longer than expected. We managed to get the results we were looking for, and my mate already made a good post on it (check it out here). To cut the story short, persistent memory showed a big improvement over the uncached disk.

Our project was complete, but we could go further: the next thing on our TODO list was transactional checkpointing. The term seemed a little odd to me: how can checkpoints be transactional? For those of you who don't know it, transactionality is usually a property of databases. It is based on the ACID properties, where the name is an acronym for Atomicity, Consistency, Isolation, Durability. These create strong limits on the effects of what can be done in a database: only complete and correct changes will affect the actual data.

This brings us back to the above question: how can a concept so closely related to another field be relevant to checkpoints? It is possible to find the answer if we associate data changes with tasks: only the ones that complete correctly can influence the others. So we started analyzing how processes can influence each other and we came to an important conclusion: since there is no shared memory in Charm++, the only way two tasks can influence each other is via message passing.

This conclusion showed us the path to achieving transactional checkpointing: no task should be able to communicate with the others until the end of its execution, when it is possible to decide about its correctness. All outgoing messages should be stored and sent only at the end. We managed to postpone the sending of the messages, but then we faced a bigger problem: transactionality forces us to either send all messages or none. To be sure to send all and only all the messages, we had to rely on checkpointing, and in particular asynchronous checkpointing, a feature not supported by Charm++. Introducing asynchronous checkpointing in such a complex framework proved to be non-trivial; we tried for a long time but got no remarkable result. This was a bit frustrating: we were so close to the goal, yet so distant.

Now that everything is finished, I miss all those everlasting calls where Petar and I would surf through thousands of lines of code just to change little things. I liked how we worked, and I think we couldn't have achieved the same result by ourselves. We managed to bring out the best in each other and complete this project that would have been impossible for many other groups. An additional thank-you goes to Oliver, always ready to resolve all our doubts and issues: being able to discuss and clarify all the problems we encountered helped us a lot.

So this is the end of this last blog post. I'm going to leave you with the video presentation we prepared: it's short but very comprehensive, and it also contains all the data obtained from the testing phases. If you have any questions, feel free to contact me on LinkedIn; I'm always happy to answer questions about HPC.

Well… what a summer! If you had told me a year ago that I'd be using a supercomputer to run simulations on submarines, I'd have thought you were crazy – and then probably asked where I can sign up! It's been a fantastic journey the whole way through; now all that's left is to share my final results with you.

Since my last post we’ve had one or two issues to overcome. Our initial goal of comparing our results to experimental ones fell through, as we only had access to graphical data and not exact numerical values. This meant that we needed to improvise.

In the end, we settled on comparing the effects of different turbulence models based on their drag coefficient. We tested several models, including k-epsilon, k-omega, SST, BSL, and SSG. The drag coefficient, as you may be able to guess, gives us an idea of how streamlined an object is (with a lower value implying a more streamlined object). Along with this, we compared how the drag coefficient changed with drift angle and for different configurations of the submarine.

A visual example of what a drift angle is (only for an aeroplane!)
http://www.aviation-history.com/theory/wind_drift.htm

We also took a look at the distribution of the turbulent kinetic energy around the submarine when we considered some other turbulence models. In short, we ran a lot of simulations!

The first 5 seconds of one of the simulations from three separate angles!

Depending on who you are, you’ll either find the next sets of results much more interesting or infinitely more boring. As much as the above video does look cool, the following graphs show us exactly what is happening with the drag coefficient! If you’re in the latter camp, I’m hoping I might be able to sway you slightly.

We were very pleased to see the behaviour that we expected for the different configurations of the submarine, as well as the change with drift angle. As expected, the submarine configuration with all parts attached (configuration 2) had the highest drag coefficient by far for all drift angles, the configuration containing just the hull (configuration 3) had the lowest drag coefficient for all drift angles, and the rest were somewhere in between. This did show us something interesting: for lower drift angles (less than ~6 degrees), configuration 6 has a higher drag coefficient than configuration 5. However, this relation swaps around at angles higher than this.

Somewhat unsurprisingly, we found that the only turbulence model to differ significantly from the rest was the k-epsilon model. Although only two lines are visible in the above graph, the red, yellow, and green lines all lie underneath the orange line. In fact, the results used to plot the overlapping lines were identical until 8 degrees, and even there only SST began to differ, very slowly. We believe that the reason the k-epsilon model gave such different results is the way it handles no-slip walls (walls where the fluid against them feels a frictional force). In our model, the entire submarine is modelled as a no-slip wall. Thus, since k-epsilon is not accurate for no-slip walls, it gave a vastly different result for every angle. Still, it can be seen that the shape of the curve as the drift angle increases is very similar for all models.

So there it is! My online internship with PRACE has come to an end. I have loved learning all about CFD and HPC, alongside meeting a bunch of great new people! I sincerely hope you’ve enjoyed reading my posts along the way, and maybe sparked an interest in CFD. Any questions, let me know in the comments below. Thanks for reading!

As you can see from the title, this will be my last blog post. SoHPC 2020, each day of which was an adventure, is coming to an end. Let's take a look at what happened in the last week. Fasten your seatbelts!

Project Presentations

Our meeting, held every Tuesday, was different in the last week. Normally, each group explains what they did and what they planned that week; this week, each group shared their video presentations. After each presentation, the questions asked were answered, and we saw what everyone had been doing all summer. There were very interesting presentations and topics. If you want to watch these presentations, you can access them from the link below. If you comment on the 21st video, I can give you information about the topics you are curious about 🙂

https://www.youtube.com/playlist?list=PLhpKvYInDmFXs3uOSAVgWbflVLNHQW9YM

Online Travel is ending…

It was fun to meet and work with people from around the world, even online. That's why I'm already excited for SoHPC 2021. If you want to join this adventure, I recommend that you do not miss the application deadlines; also, if you want to learn more about our projects, you can take a look at the 2020 reports!

If you want to reach me, you can write to my LinkedIn profile. I wish you all healthy and successful days!

SoHPC 2020

Supercomputers are able to perform a huge number of operations per second. But having a powerful machine is not the only ingredient for short execution times – the software has to be optimized and has to exploit all available resources.

Our project focuses on the optimization of the computation of the exponential of a general complex matrix in an HPC environment, using the algorithms that have been discussed previously. We are working on Jean Zay, a French national supercomputer based at IDRIS. During this summer several extensions have been installed, and its accumulated peak performance now stands at 28 petaflops.
That is 28 000 000 000 000 000 operations in one second! Pretty amazing, isn't it?

Jean Zay has CPU as well as GPU nodes – in the case of linear algebra for dense matrices, the performance on GPUs is expected to be better than on CPUs. As a consequence, we started with an implementation of the matrix exponentiation on a CPU and benchmarked it on one core and on 40 cores. Both algorithms – Taylor series and diagonalization – showed a speedup when comparing execution on one core and on several cores, for different matrix dimensions. The next step was moving to the GPU.

As expected, the GPU version showed a significant speedup compared to the CPU implementation. The implementation using the Taylor series performed especially well, but for large matrices diagonalization is almost as fast.

Conclusion

Matrix exponentiation is very well suited to be performed on a GPU and we obtained efficient programs to compute the exponential of a general complex matrix. The speedup of our different versions is very satisfying and the final result can be used in the future.

As I mentioned previously, the main goal of our project is to improve the performance of a 'decision tree' algorithm using High Performance Computing (HPC) parallelization tools such as MPI and GASPI. In this post I will try to explain how MPI can improve the learning speed of a decision tree.

Parallelization of Decision Tree

The decision tree is a well-known supervised ML algorithm. I'm sure that many of you have at least heard about it, but if I try to explain it in one sentence, I can say that it is a binary tree in which predictions are made from the root to the leaves. Even though it can be used for both regression and classification, in this study we focused on a classification problem. Our decision tree is relatively simple; it is a recursive algorithm and contains only one hyper-parameter, the maximum depth, controlling the stopping criterion (there are others, but we didn't cover them for simplicity).

The algorithm splits the data set into nodes according to 'selected' threshold values of 'selected' features in the data set. Each node is assigned a Gini index, which is a simple mathematical expression describing the purity of the node. The aim is to select the optimal feature and threshold couple minimizing the Gini index for each split. The algorithm stops when the maximum depth is reached or when all the inputs are classified.

Our parallelization strategy is to distribute features among different computation units. In this approach, each processor core is responsible for one or more feature sets, surveying all possible splits and computing the corresponding Gini indexes to evaluate them. Therefore, in every recursion, each core finds the best feature and threshold couple minimizing the Gini index for its portion of the data. After this, all the best feature and threshold couples are gathered at the master core 0. The master core compares them according to their Gini indexes and broadcasts the best result to all the other processors.
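Here is a hedged mpi4py sketch of that exchange (the data and the split search are simplified placeholders): each rank scans its feature subset for its best (feature, threshold) couple, the local winners are gathered at core 0, and the global best split is broadcast back to everyone.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

X = np.random.rand(200, 8)
y = (np.random.rand(200) > 0.5).astype(int)
my_features = np.array_split(np.arange(X.shape[1]), size)[rank]

best = (np.inf, None, None)  # (weighted Gini, feature, threshold)
for f in my_features:
    for t in np.unique(X[:, f]):
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[0]:
            best = (score, f, t)

candidates = comm.gather(best, root=0)              # collect local winners
winner = min(candidates, key=lambda b: b[0]) if rank == 0 else None
winner = comm.bcast(winner, root=0)                 # everyone learns the split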

Results

As expected, the results show that the higher the number of cores, the higher the speed-up rates. Total computation time decreases with an increasing number of cores. Here, the computation time corresponds to the run time of the learning (data fitting) process of the algorithm.

It is also seen that the rate of change decreases with an increasing number of cores. This is the result of inevitable communication overhead. In an ideal world, workers could do their jobs independently, but in the real world, synchronization is usually a must. There is no escape from the communication overhead required by synchronization in MPI parallelism. So in reality, the speed-up is always less than the increase in the number of workers.

The figure above shows the portions of the total computation time dedicated to calculation and communication. As can be seen, the time spent on communication increases with an increasing number of cores. A higher number of cores means a higher number of send and receive commands, and for each command, cores must wait for each other. On the other hand, the chunks of data being transferred get smaller, which makes each transfer faster. Therefore, the communication time does not increase linearly.

As I told you, MPI makes machines learn faster!

If you are interested in knowing the whole story, I recommend checking out my previous entries, if you haven't already; they cover the project I was working on, the progress I made during the summer, and the opportunities you can get if you opt to apply for the programme.

We’ve reached the end of the road, or at least this one. It was quite a summer for me and I would like to tell you about the impressions I’ve gathered during the summer and wrap up the story about the project in which I’ve participated.

Benchmarks

After we managed to finish the first phase and performed initial testing, which was discussed in the previous post, we put our work to a real challenge – scaling!

Evaluation was done using an existing Jacobi application, whose task is to determine the solutions of a strictly diagonally dominant system of linear equations. This method is well known and broadly used in the field of high performance computing. The program itself was not what was interesting; rather, the fact that it used an online Checkpoint and Restart (C/R) system for fault tolerance while performing the calculations is why it was a perfect fit for our needs.

We’ve performed test multiple times each time varying the run configuration. It consists of two main parameters:

  • Number of Chares
  • Number of Cores and Nodes

By changing the input array size and block size ratio, the application automatically adjusts by dividing the calculation into the respective Chare objects, which directly impacts the size of the checkpoint that is saved each time. Also, varying the number of processing elements and distributing them over different nodes affects the fault tolerance system, because it fundamentally saves the application's state by saving the state of every Chare, and with more cores come more of them.

To fully evaluate the performance gained, we’ve performed checkpointing in different modes, both previously present and newly added, for each configuration. Modes used for evaluation are:

  • Memory mode (built-in) – uses DRAM for storing checkpoints
  • FSDAX – uses Persistent Memory by accessing it via regular file I/O API
  • PMDK – uses Persistent Memory by accessing it directly via Persistent Memory Developer Kit (libpmem library)
  • Disk mode (built-in) – uses disk for storing checkpoints; it exploits memory caching by default (which doesn’t provide full persistency)
  • Disk mode w/ O_DIRECT – similar to previous, but explicitly requires operations to be performed directly to disk and not using memory as a cache

The results obtained from our evaluation can be best understood by looking at the charts below:

Jacobi benchmarks for every configuration

As you can see, every mode behaves pretty much the same across configuration changes. The Memory mode is always the fastest (as expected), followed by the FSDAX and PMDK approaches with comparable times, then the Disk mode, which profits from memory caching and shows times close to persistent memory, and finally Disk w/ O_DIRECT, which lags hugely behind all the previous modes.

We can conclude that:

  1. Both persistent memory approaches give a great gain over the regular disk and are similar in terms of performance
  2. Disk w/ memory caching also shows good performance, but is not fully reliable because intermediate cached data that hasn't yet been flushed to disk can be lost during a failure, which makes the content on disk obsolete
  3. Disk w/ O_DIRECT solves the previous data consistency problem, but proves to be really slow and thus makes the overall application inefficient (especially when checkpointing frequently)

Future

The speed-ups obtained with persistent memory confirmed our initial aim – to achieve persistent checkpointing with a major performance improvement. With that, my teammate and I tried to tackle our ultimate goal – making fault tolerance transactional.

To make the system completely transactional, it needs to comply with a set of constraints named ACID, which stands for Atomicity, Consistency, Isolation and Durability.

These strict constraints aren’t usually used outside of database field, but with today’s trend where HPC systems are becoming larger and larger, probability of failure occurring during the execution must now be accounted when designing large parallel applications.

Fundamentally, to make Charm++ checkpointing ACID compliant, it is required to do two things:

  1. Send messages only after the entry method is successfully executed – which we were able to do for basic message sending (excluding array broadcasts)
  2. Make checkpointing asynchronous – this point was hard to implement in a short period of time because Charm++'s checkpointing system is designed to be used synchronously

Final impressions

Although we haven’t reached our ideal goal, my team and I are really content with what we’ve managed to achieve during these last two months. We’ve proven the expectations promised by the storage innovation we had a chance to work with and hopefully showed you that this memory has a real potential and can be useful in the field of HPC especially for enhancing current fault tolerance systems.

We’ve also summarized the whole project in a short video which you can check out bellow:

I’m grateful for having an opportunity to work on this cool research project and I must say that being part of the team and not working alone exceptionally this year turned out to be probably the best part of the whole experience.

If this kind of research internship interests you, do not hesitate to apply for next year's Summer of HPC and be a part of this amazing programme! Of course, if you have any other questions, I'm more than happy to help; just contact me via LinkedIn or leave a comment below.

Thanks again for being with me this summer and I hope we’ll be seeing each other soon on some new adventure!

Image adapted by Alexander J. Pfleger.
Original image by Pk0001, CC BY-SA 4.0, via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Iceberg_in_the_Arctic_with_its_underside_exposed001.svg.

In the last blog post we had a look at quite superficial performance improvements for Python programs. The limits were set by the basic performance of Python and the number of existing modules. This time we try to surpass those limits by creating our own modules from scratch – in C++. Again we will start with a simple square function, but the concepts stay the same for more advanced functions.

To call C code from Python we need a Python binding. There are several different libraries that achieve this. One of the most popular is pybind11. It is rather compact and requires only a few additions to existing C code. Also, many marvellous C++ features can be used, since pybind11 uses a C++ compiler.

To begin, we will only allow single integer values for our square function A^2. In C++, the function can be written like this:

int square(int A){
    return A*A;
}

In order to use this function in Python, it needs to be converted into a Python module. This can be done by altering the code as follows:

#include <pybind11/pybind11.h>

int square(int A){
    return A*A;
}

PYBIND11_MODULE(bind_sq,m){
    m.def("squareCPP", &square, "NOTE: squares integers");
}

In the first line, the pybind11 library is included in C++. Next comes the already implemented square function. The last lines generate the actual module; short documentation can also be added. After compiling the C++ code, the module can be loaded and used in Python:

from bind_sq import squareCPP

squareCPP(A)
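A quick way to sanity-check the module and compare it against pure Python (assuming the compiled bind_sq module is on the Python path; absolute timings will of course vary with the machine):

import timeit
from bind_sq import squareCPP

def square_py(A):
    return A * A

assert squareCPP(12) == square_py(12) == 144  # both versions agree

print("C++ module :", timeit.timeit("squareCPP(12)", globals=globals(), number=1_000_000))
print("pure Python:", timeit.timeit("square_py(12)", globals=globals(), number=1_000_000))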

To use more advanced functions, the same concepts need to be applied. To use NumPy arrays like in the previous example, some further additions need to be made in the C++ code. About three lines need to be added for each array. Those cases are well explained in the pybind11 documentation.

The performance boost of pybind11 using arrays is shown here:

For this figure, the Jacobi function is used, since it has more impact on the project. The new module is ten times faster than the already optimised Python code. The performance is similar to that of a stand-alone C++ program. The third line in this plot is generated by a parallelised module and provides a second boost by a factor of ten. We will have a look at this method in the next paragraphs.

To further increase the performance of the function, parallelisation techniques like OpenMP can be used. The C++ code has to be slightly altered, but no changes are made in the Python project. This helps to keep the code clean while parallelisation is done in the background. In the previous figure, the performance of an OpenMP module with 18 threads is compared to the performance of the simple pybind11 module. Depending on the problem size, a drastic speed-up can be noticed.

If you are thinking about building your own modules right now, I would highly recommend using a Linux system, since it makes the setup a lot easier. If you are new to Linux, you can install a distribution like Ubuntu in a virtual machine, without making major changes to your computer.

Most people have heard of the exponential function that maps an arbitrary real (or even complex) number x to e^x, but what happens if x is not a number but a matrix? Does the expression e^A, with a square matrix A, even make sense?

The answer is: Yes, it does!

In order to understand what the expression e^A means, we take a step back to the exponential function for scalars. When we have a look at the power series of the exponential function,

\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!} with x \in \mathbb{C},

we can see that there are only multiplications, additions and divisions by a scalar involved. These operations can be generalized to matrices easily. Hence, we can define the exponential of a matrix A as

\exp(A) = \sum_{n=0}^{\infty} \frac{A^n}{n!} .

The next question is: How can we compute the matrix exponential for a general complex matrix?

There exist several different algorithms; our project focuses on two of them: the Taylor series and diagonalization.

The most intuitive one uses the representation above and replaces the infinite sum by a finite one, yielding a truncated Taylor series. The number of summands that have to be computed depends on the accuracy that is needed – although it is only an approximation, it serves its purpose in many applications.
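To make this concrete, here is a minimal C++ sketch of the truncated series (my own illustration, not the project code; it uses real doubles for brevity, while a version for complex matrices would use std::complex<double>, and the truncation order N would be chosen according to the required accuracy):

#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n matrix stored in row-major order

// Naive n x n matrix product C = A * B.
Matrix matmul(const Matrix &A, const Matrix &B, std::size_t n){
    Matrix C(n * n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t k = 0; k < n; k++)
            for (std::size_t j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
    return C;
}

// Truncated Taylor series: exp(A) is approximated by the sum of A^k / k!
// for k = 0 .. N. The term A^k / k! is built up step by step as term = term * A / k.
Matrix expm_taylor(const Matrix &A, std::size_t n, std::size_t N){
    Matrix term(n * n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        term[i*n + i] = 1.0;                 // term = identity (the k = 0 summand)
    Matrix result = term;
    for (std::size_t k = 1; k <= N; k++){
        term = matmul(term, A, n);
        for (std::size_t i = 0; i < n * n; i++){
            term[i] /= static_cast<double>(k);
            result[i] += term[i];
        }
    }
    return result;
}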

The second approach to computing the exponential of a matrix in our project is diagonalization. First, the matrix A is decomposed into a product of three matrices Q, D and Q^{-1},

A = Q D Q^{-1},

where the columns of Q contain the eigenvectors of A, D is a diagonal matrix with the corresponding eigenvalues on its diagonal, and Q^{-1} is the inverse of Q. With this decomposition, the computation of the matrix exponential becomes very easy, because the following equality holds:

\exp(A)=Q \exp(D) Q^{-1}.
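Why does this equality hold? In the powers of A, all inner Q^{-1}Q pairs cancel, so A^n = (Q D Q^{-1})^n = Q D^n Q^{-1}, and every term of the power series can be sandwiched between Q and Q^{-1}:

\exp(A) = \sum_{n=0}^{\infty} \frac{Q D^n Q^{-1}}{n!} = Q \left( \sum_{n=0}^{\infty} \frac{D^n}{n!} \right) Q^{-1} = Q \exp(D) Q^{-1}.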

The only expression that has not been computed yet is \exp(D), and this is again a diagonal matrix, containing the exponentials of the diagonal entries of D. If we multiply the matrices Q, \exp(D) and Q^{-1}, we obtain the matrix exponential of A.
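As a tiny example of this last step (my own numbers, just for illustration):

\exp \begin{pmatrix} 0 & 0 \\ 0 & \ln 2 \end{pmatrix} = \begin{pmatrix} e^{0} & 0 \\ 0 & e^{\ln 2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.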

But this is not HPC, right?

Wait and see…

Hello everyone! Unfortunately, this is my last blog post, but I am very happy to have spent two months in the Summer of HPC. I don’t remember a summer in which I learned so much and improved myself so much.

In my previous post, I tried to explain the HPC system, my work, and some tools to measure an HPC system’s file-system performance. If you have not read that post yet, please read it first and then continue with this one – it will make it much easier to follow what happened.

In the last part of the internship, I worked on ORCA, a quantum chemistry application used to study molecules.

Figure 1: Workflow of test execution and analysis for ORCA

I worked on it because my job was to learn how ORCA works, to carry out a performance analysis on the Cartesius supercomputer – an HPC cluster at SURFsara – and to prepare a userinfo page for SURFsara’s users. Figure 1 represents this workflow. I also want to show a plot of my work in Figure 2. Below, you can find the figure with the measured execution times of a basic ORCA test case, which performs a geometry optimization of a [Fe(H2O)6]3+ molecule. To measure performance, I used the total execution time in milliseconds (msec). The figure shows the execution time on one type of node using different numbers of MPI tasks per node. This test shows good scalability up to 6 tasks per node, and then degraded scalability for higher core counts: the test case is too small to scale to a full node, and the overhead of the parallelization becomes too high compared to the compute work as the number of tasks increases.

Figure 2: Performance results depending on the number of MPI tasks for [Fe(H2O)6]3+ Molecule Test Case, ORCA 4.2.1

More details about the results of our analysis on Cartesius will soon be available on the ORCA userinfo page of SURFsara.

Presentation Video

At the end of the internship, my teammate Jesus and I prepared a video presentation about our project. In this video, you can find more information about our work.

Bye Bye Summer Of HPC 2020

I can sincerely say that I will miss the programme. I highly recommend joining it in the coming years if you want to spend the most effective summer of your life. Finally, thank you to the whole Summer of HPC 2020 family, the programme’s coordinators, and my mentors at SURFsara.

Bye-bye Summer Of HPC!!!

You can contact me on LinkedIn. Feel free to send a message!

In this last blog post, I will review the experience of my associate Sara and myself during the Summer of HPC in a quantum computing analogy. Since we did not know whether the Summer of HPC would be good or not, we were like Schrödinger’s cat, in a state where it is both at the same time.

In the first two weeks, the foundation of our quantum computing journey was laid out. While the training week in the first week gave us some of the necessary skills to get started, Myles (our mentor from ICHEC) put us into a superposition, entangling the basis we needed to start coding and solving problems in a quantum computing way.

In the third week, we had to go through our first Pauli-X gate and have our world flipped upside down, while experiencing the struggles of not being able to program the way we were used to. In quantum computing, it is not possible to simply save intermediate results to an array or list and fall back on them later. All actions performed need to be unitary and reversible: if we tried to save our intermediate results, we would need to measure our quantum states, thereby obtaining only part of the information stored in our superposition and losing the rest of it.

In weeks four and five, we reached our first milestone. We managed to encode our strings into a representative quantum state and to make the algorithm flexible regarding the size and content of the input string. You could say that we managed to use a controlled-NOT gate (CNOT) to flip our first auxiliary qubit to 1.

In week six and the first half of week seven, we flipped the second auxiliary qubit by improving and implementing our string-comparison algorithm. With those two auxiliary qubits at 1, we were in a position to apply a controlled-controlled-NOT gate (CCNOT) and flip our target qubit by finishing a running and scaling application.

In the end, all that is left is to measure our experience to determine whether the probability of having a good time was high. Here I can only speak for myself, and I really had a great time. The support we students received from our mentors and the PRACE staff was overwhelmingly good. We had multiple meetings a week to ask questions and discuss our progress. Whenever we got stuck, Myles gave us enough hints to solve the problems ourselves, which made it a great learning experience. I am also very thankful to PRACE for organising this programme and giving us students the opportunity to work on current problems at high-end facilities, and to link up with each other as well.

Training week in Zoomanio with SoHPC 2020.

Thank you, Myles and Sara, for a great time, and thank you, PRACE and all organisers, for giving me this opportunity. Until next time 🙂
