Project reference: 2010

Molecular simulations yield large amounts of data which must be processed to obtain the desired results. Although automated data verification keeps improving, some properties still have to be checked by the naked eye. For those situations, reliable lightweight visualization software is needed, often tailored directly to the specific purpose. The main aim of this project is to create such a tool, which could then be used for visualizing the outputs of the molecular simulations performed at our supercomputing center.

The student will get to know the most common formats used in molecular simulations and will learn how to visualize them in a clear and physically relevant manner.

The tool will be implemented in Python 3, a language which is nowadays widely used in the scientific community, especially for data analysis, due to its high-level functions and easy-to-use syntax. Moreover, Python provides strong support for the object-oriented programming paradigm and its interpreter is platform independent. Thus, the student will be able to implement the tool in a clearly structured way, so the software will be easy to maintain and extend.
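As an illustration of the kind of parsing involved, a minimal reader for the XYZ format (one common plain-text format for molecular geometries) might look as follows; the water coordinates and the Bondi van der Waals radii are only illustrative values, not part of the project specification:

```python
# Bondi van der Waals radii in Angstrom for a few common elements.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}

def parse_xyz(text):
    """Parse an XYZ block into a list of (element, x, y, z) tuples."""
    lines = text.strip().splitlines()
    natoms = int(lines[0])            # line 1: number of atoms
    # line 2 is a free-form comment; atom records start on line 3
    atoms = []
    for line in lines[2:2 + natoms]:
        element, x, y, z = line.split()
        atoms.append((element, float(x), float(y), float(z)))
    return atoms

water = """3
water molecule
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""
atoms = parse_xyz(water)
```

A real visualization tool would then feed these coordinates and radii into a scatter plot or an OpenGL scene, drawing each atom as a sphere scaled by its van der Waals radius.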

Project Mentor: Martin Beseda

Project Co-mentor: Rajko Ćosić

Site Co-ordinator: Zuzana Červenková

Participants: Marco Mattia, Denizhan Tutar

Learning Outcomes:
The student will gain experience with molecular simulation techniques and the data they produce, which they will learn to process and present in a clear and understandable way. The latter goal will be achieved through a software tool implemented by the student.

Student Prerequisites (compulsory):

  • Knowledge of Python
  • Knowledge of Linux

Student Prerequisites (desirable):

  • Knowledge of C++
  • Knowledge of Fortran 90 or later
  • Familiarity with OpenGL, PyQT, …
  • Experience with parallel programming

Training Materials:

Workplan:

  • 1st week: training week
  • 2nd week: installation and introduction to the necessary Python modules and data formats
  • 3rd week: implementation of the data parser
  • 4th week: basic GUI for the visualizing tool
  • 5th week: implementation of basic functionalities
  • 6th week: enhanced functionalities (rotation, zoom, VdW radii, etc.)
  • 7th week: animation
  • 8th week: finalization

Final Product Description:
A visualizing tool for the data outputs of various molecular simulation techniques.

Adapting the Project: Increasing the Difficulty:
Multiplatform support of the tool.

Adapting the Project: Decreasing the Difficulty:
Simple visualization without taking the physical properties (such as charge delocalization, Van der Waals radius, etc.) into consideration.

Resources:
The student will just need a Linux-operated machine with a Python 3 interpreter and an internet connection.

Organisation:
IT4Innovations National Supercomputing Center at VSB – Technical University of Ostrava

Project reference: 2009

Deep neural networks (DNNs) are evaluated and used in many areas today, to replace or complement traditional technologies. In this project the student will develop a DNN to detect and localize selected objects in images. As object detection can be used in a wide range of applications, the student decides on the kind of objects to detect and their context at the beginning of the project. The student is free to propose any application that uses object detection, as long as it requires training a DNN with HPC technologies and tuning it for the edge.

The work comprises the selection of the right network topology, the training and validation strategy, as well as collecting and pre-processing training and validation data. The project does not stop with the training of the model in the data centre but considers the complete lifecycle, including deployment and optimization. An application must demonstrate the correct object detection of the trained DNN on an edge device with limited compute resources (e.g. Intel’s Movidius Neural Compute Stick).
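For instance, validating a localization model typically relies on the intersection-over-union (IoU) of predicted and ground-truth bounding boxes; a minimal, self-contained sketch (the box convention is illustrative, not taken from the project):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.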

Training of the DNN will take place on NVIDIA V100 GPUs (e.g. an NVIDIA DGX-2 system) with TensorFlow as the deep learning framework. Depending on the selected edge device, deployment and optimization of the DNN will be done with either the NVIDIA TensorRT or Intel OpenVINO toolkits. As a stretch goal, for better portability, the final DNN should use the ONNX model interchange format.

Last year’s award-winning participant, Pablo Llunch Romero, used the detection of action units in facial expressions to identify emotions.

Project Mentor: Georg Zitzlsberger, M.Sc.

Project Co-mentor: Martin Golasowski

Site Co-ordinator: Zuzana Červenková

Participants: Andres Vicente Arevalo

Learning Outcomes:
The student will be guided through the principles of DNNs, state-of-the-art networks and Google’s TensorFlow. With the project, the student also gets accustomed to the latest NVIDIA technologies around the V100 that will be used for training, and to different deployment and optimization toolkits.

Student Prerequisites (compulsory):

Knowledge in Python is needed to create the DNN.
Knowledge of Linux as the development environment.

The project includes training for the student on mandatory DNN basics (training material from NVIDIA and Intel is used).

Student Prerequisites (desirable):
Tensorflow, Nvidia TensorRT, Intel OpenVino

Training Materials:
Introduction to:
https://www.tensorflow.org/api_docs/
https://developer.nvidia.com/tensorrt
https://software.intel.com/en-us/openvino-toolkit
https://software.intel.com/en-us/movidius-ncs

 Workplan:

  • 1st  week: training week.
  • 2nd week: internal student training on DNN technologies and networks incl. first exercises on TensorFlow
  • 3rd week: Selection of network, training strategy and identification of training data
  • 4th week: Training and validation of DNN on NVIDIA V100 (finding right hyper-parameters)
  • 5th week: Deployment to edge device
  • 6th week: Demoing final application of DNN on edge device (e.g. via video stream)
  • 7th week: Optimization for edge device (e.g. TensorRT/OpenVino) to improve the DNN performance for the edge device
  • 8th week: Discussing full processing pipeline (HPC to edge) with quality analysis

Final Product Description:
An environment (processing pipeline) will be created to train a DNN for detecting and localizing a specific object on the HPC side, and to optimize and deploy it to an edge device for use. 

Adapting the Project: Increasing the Difficulty:
Retrain the DNN to detect a wider variety of object types.

Adapting the Project: Decreasing the Difficulty:
The exact type and number of objects to detect will be selected by the student depending on his/her skills. The deployment to the edge is also variable in terms of which device to use. Using ONNX is a stretch goal that might be sacrificed in favour of faster solutions over standards.

Resources:

Introduction to:
https://www.tensorflow.org/api_docs/
https://developer.nvidia.com/tensorrt
https://software.intel.com/en-us/openvino-toolkit
https://software.intel.com/en-us/movidius-ncs

Organisation:
IT4Innovations National Supercomputing Center at VSB – Technical University of Ostrava

Project reference: 2008

A Breadth-First Search (BFS) is one of the core graph searching algorithms. It can run in O(N + E), where N is the number of vertices and E the number of edges of the graph. Due to the irregular memory access patterns and the unstructured nature of large graphs, its parallel implementation is very challenging. Several parallel programming approaches have been proposed in the literature. In particular, the use of GPUs has been an interesting area of research in terms of exploiting architectural features such as high throughput and shared memory.

In this project, we first aim to implement BFS in CUDA using a vertex-centric approach, i.e., one thread per vertex in the graph. We will then look into an edge-centric CUDA implementation. Following on, we will analyse the code in the NVIDIA Visual Profiler and investigate how we can improve performance by using novel approaches such as CUDA dynamic parallelism, Hyper-Q or NVSHMEM to benefit from the architecture.

Upon achieving sufficient usefulness, functionality, and performance, the code can be applied to social networks, where each vertex corresponds to a person and each edge represents a friendship between two people. The question of interest would then be finding people’s degree of separation from each other.
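The serial reference version planned for the early weeks can be sketched in a few lines; the friendship graph below is a made-up example, and the queue-based loop is what the vertex-centric CUDA version later replaces with level-synchronous frontiers:

```python
from collections import deque

def bfs_distances(graph, source):
    """Distances (degrees of separation) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Toy social network: vertices are people, edges are friendships.
friends = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}
degrees = bfs_distances(friends, "alice")
```

In the GPU version, each level of this traversal becomes one kernel launch in which every thread inspects one vertex (or one edge) of the current frontier.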

Caption 1: Breadth First Search discovers distances to nodes*
Caption 2: A Social Network Visualisation
* D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning about a Highly Connected World, Cambridge University Press, 2010.

Project Mentor: Buket Benek Gursoy

Project Co-mentor: /

Site Co-ordinator: Simon Wong

Participants: Busenur Aktilav, Berker Demirel

Learning Outcomes:

  • Use of an HPC system and profiling of an application.
  • Writing clean and reusable C for use in other HPC projects.
  • Solving problems using GPU computing and Code profiling.
  • Exposure to Social Networking Graph Search Algorithms.

Student Prerequisites (compulsory)
Intermediate C, Basic familiarity with programming in CUDA.

Student Prerequisites (desirable):
Familiarity with GPU technology.
Familiarity with NVIDIA Profiling Tools.
Familiarity with Search algorithms.
Familiarity with Social Networking Graphs.

Training Materials:

BFS Code Examples:
https://github.com/maxdan94/BFS
https://github.com/bamfbamf/bfs-cuda
https://github.com/onesuper/bfs_in_parallel

Related Publications:

  1. P. Harish and P. J. Narayanan, Accelerating large graph algorithms on the gpu using cuda, In Proceedings of the 14th International Conference on High Performance Computing, HiPC’07, pp. 197-208, Berlin, Heidelberg, 2007, Springer-Verlag.
  2. C. D. Pise and S. W. Shende, Parallelization of BFS Graph Algorithm using CUDA, IJCAT International Journal of Computing and Technology, Volume 1, Issue 3, April 2014.
  3. M. Springer, Breadth-First Search in CUDA, Tokyo Institute of Technology

Workplan:

  •  Week 1: Training week
  • Week 2: Overview of the project; Familiarising with the HPC system at ICHEC; Introduction to social networking graph algorithms and C and CUDA concepts; Learn about BFS; Write project plan
  • Week 3: Serial implementation of the BFS algorithm; Learn about its parallelisation approaches on GPUs.
  • Week 4: CUDA implementation of the BFS algorithm.
  • Week 5: Profiling and Improvements to CUDA Code.
  • Week 6: Applications to Social Networks.
  • Week 7-8: Write Report, Prepare Video Presentation for PRACE and Presentation at ICHEC

Final Product Description:
A clean parallelised CUDA implementation of the BFS algorithm; profiling of the CUDA code; and a potential application to social networks.

Adapting the Project: Increasing the Difficulty:
The student could attempt to implement advanced optimisation of the CUDA code using approaches such as dynamic parallelism or NVSHMEM.

Adapting the Project: Decreasing the Difficulty:
The student could stop after implementing the CUDA-accelerated BFS, or anywhere along that path, and finalise with the profiling.

Resources:
A computer (The student should bring their own.)
Access to HPC system (ICHEC’s HPC system, Kay, will be made available.)

Organisation:
Irish Centre for High-End Computing

Project reference: 2007

Python is widely used in scientific research for tasks such as data processing, analysis and visualisation. However, it is not yet widely used for large-scale modelling and simulation on high performance computers due to its poor performance – Python is primarily designed for ease of use and flexibility, not for speed. However, there are many techniques that can be used to dramatically increase the speed of Python programs, such as parallelisation using MPI, high-performance scientific libraries, and fast array processing using numpy. Although there have been many studies of Python performance on Intel processors, there have been few investigations on other architectures such as AMD EPYC, ARM64 and GPUs. In 2020, EPCC will have access to all three of these architectures via the new UK National Tier-1 supercomputer ARCHER2, our own Catalyst machine Fulhame, and the Tier-2 system Cirrus. A Summer of HPC project in 2019 developed a parallel Python version of an existing C program which performs a Computational Fluid Dynamics (CFD) simulation of fluid flow in a cavity. This project involves extending that work to investigate and optimise its performance on a range of novel HPC architectures, and extending it to make use of GPUs.
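To illustrate the kind of speed-up at stake, compare a Jacobi-style stencil sweep (typical of such CFD codes) written with explicit Python loops against the same sweep expressed with numpy slicing; this is an illustrative sketch, not the project code:

```python
import numpy as np

def jacobi_loops(grid):
    """One Jacobi sweep with explicit Python loops (slow)."""
    n, m = grid.shape
    new = grid.copy()
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] +
                                grid[i, j - 1] + grid[i, j + 1])
    return new

def jacobi_numpy(grid):
    """Same sweep via whole-array slicing, typically orders of magnitude faster."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return new
```

Both functions produce identical results; the numpy version delegates the inner loops to compiled code, which is exactly the kind of optimisation this project benchmarks across architectures.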

Sample output of existing program showing turbulent flow in a cavity

Project Mentor: Dr David Henty

Project Co-mentor: Dr Oliver Brown

Site Co-ordinator: Juan Herrera

Participants: Alexander Julian Pfleger, Antonios-Kyrillos Chatzimichail

Learning Outcomes:
The student will develop their knowledge of Python programming and learn how to compile and run programs on a range of leading HPC systems. They will also learn how to use GPUs for real scientific calculations.

Student Prerequisites (compulsory)
Ability to program in one of these languages: Python, C, C++ or Fortran. A willingness to learn new languages.

Student Prerequisites (desirable):
Ability to program in Python

Training Materials:
Material from EPCC’s Python for HPC course or the PRACE Python MOOC.

Workplan:

  • Task 1: (1 week) – SoHPC training week
  • Task 2: (2 weeks) – Understand the functionality of the existing parallel C and Python codes and make an initial port to new HPC platforms.
  • Task 3: (3 weeks) – Measure baseline performance on new HPC platforms and develop a new GPU-enabled version
  • Task 4: (2 weeks) Investigate performance optimisations and write final report

Final Product Description:

Benchmarking results for Python performance on a range of parallel machines.
Recommendations for how to improve Python performance on AMD EPYC and ARM64 processors.
Development of a GPU-enabled parallel Python application.

Adapting the Project: Increasing the Difficulty:
The project can be made harder by investigating advanced optimisation techniques such as cross-calling from Python to other compiled languages such as C, C++ or Fortran.

Adapting the Project: Decreasing the Difficulty:
The project can be made simpler by considering only one of the target platforms, or by considering CPU-only versions and omitting the GPU work.

Resources:
Access to all HPC systems can be given free of charge by EPCC.

Organisation:
EPCC

Project reference: 2006

The aim of this project is to explore the limits of our fully ARM-based HPC cluster, Fulhame. This is a relatively new cluster and, whilst some codes and libraries have already been ported and optimised, there is a lot of work to do in porting, optimising and understanding the strong and weak points of the system compared to others.

This project is well suited to a student looking for experience in practical HPC work: building, installing and configuring codes and systems. Any subject is available, from power-efficiency optimisation to I/O optimisation with Lustre on ARM, MPI scaling, etc.

Open cabinet showing several nodes of Fulhame.

Project Mentor: Nick Johnson

Project Co-mentor: /

Site Co-ordinator: Juan Herrera

Participants: Jerónimo Sánchez García, Irem Kaya

Learning Outcomes:
Student will learn how to build, install and optimise HPC libraries & codes for ARM architecture.
Student will learn something about how to operate a production HPC cluster.

Student Prerequisites (compulsory)
Student must have experience in building and compiling code and libraries for a cluster system.

Student Prerequisites (desirable):
Some experience of package management, job and code profiling and optimisation and system level parameters (DVFS etc.) would be desirable but certainly not essential.

Training Materials:
No specific materials at present. Some will be provided closer to the placement and in conjunction with the student depending on which area of the project they want to focus on.

Workplan:

  • Week 1+2: Concentrate on running basic tests on the system, getting access to all codes/libraries etc, begin to identify (if not done a priori) which specific areas to focus on during the rest of the project
  • Week 3+4: Start analysis of codes/libraries/parameters which require work and produce plan, then begin work.
  • Week 5+6: Main bulk of work, branching into secondary problems/code if necessary.
  • Week 7+8: Finalise work, pushing code to upstream repositories, write report and give EPCC talk on work done/results achieved.

Final Product Description:
A technical paper or report describing activities done. Code commits to upstream repositories proposing fixes for any bugs found.

Adapting the Project: Increasing the Difficulty:
Work on more codes or libraries; look at system parameters.

Adapting the Project: Decreasing the Difficulty:
Concentrate more on the applications level tuning using standard profilers/debuggers etc.

Resources:
Access to ARM cluster will be required and provided by EPCC.

Organisation:
EPCC

Project reference: 2005

Charm++ is an open source, parallel programming framework in C++, supported by an adaptive runtime system. It uses objects called chares, which hold data and methods to act on that data; these methods can be invoked remotely by sending messages to the chare. A message may also contain data, which allows data dependencies to be met. Once a method has been invoked remotely, the chare is scheduled to run it by an adaptive runtime scheduler. This allows programs to be broken down into a sequence of asynchronous parallel tasks.

One feature of Charm++ is fault tolerance – if a node goes down, chares can be rescheduled to allow the program to continue. This requires regular checkpointing to disk, which can lead to poor performance. Persistent memory promises performance closer to RAM, but with the persistence of writing to disk! With hardware like that, it may even be feasible to make a Charm++ application completely transactional, so that almost no progress is ever lost in the event of a checkpoint restart.
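The trade-off can be pictured with a toy checkpoint/restore cycle; this is only a conceptual sketch (pickling to a file stands in for a disk checkpoint), not the Charm++ API – with persistent memory the same bytes could live in a memory-mapped region instead:

```python
import os
import pickle
import tempfile

class Chare:
    """Toy stand-in for a Charm++ chare: some state plus a method acting on it."""
    def __init__(self):
        self.state = {"step": 0, "data": [0.0] * 4}

    def advance(self):
        self.state["step"] += 1
        self.state["data"] = [x + 1.0 for x in self.state["data"]]

def checkpoint(chare, path):
    """Serialise the chare's state; writing a file models a disk checkpoint."""
    with open(path, "wb") as f:
        pickle.dump(chare.state, f)

def restore(path):
    """Rebuild a chare from the most recent checkpoint after a failure."""
    chare = Chare()
    with open(path, "rb") as f:
        chare.state = pickle.load(f)
    return chare
```

Any work done after the last checkpoint is lost on restore, which is why cheaper (persistent-memory) checkpoints allow more frequent saves and less lost progress.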

In this project you will try to rewrite Charm++’s fault tolerance to take advantage of Intel Optane DC persistent memory. The required hardware will be provided by EPCC, courtesy of the NEXTGenIO project [http://www.nextgenio.eu/].

Messages are sent between processes. When a message arrives on a process, it invokes a method on a chare. Chares are checkpointed regularly to disk, but if the disk is replaced by persistent memory, it might be possible to maintain a live copy of the chare’s state without major loss of performance.

Project Mentor: Dr. Oliver Thomson Brown

Project Co-mentor: Dr. Nick Brown

Site Co-ordinator: Juan Herrera

Participants: Roberto Rocco, Petar Đekanović

Learning Outcomes:
Student will have learnt about the importance of asynchronous parallelism, and fault tolerance to the future of HPC. They will have learnt how to program distributed task-based parallel code using Charm++, and how to program Intel Optane persistent memory.

Student Prerequisites (compulsory)

  • Experience with UNIX-based operating systems.
  • Strong programming skills.
  • Experience with object-oriented programming.

Student Prerequisites (desirable):

  • Experience with C/C++.
  • Experience with message-passing parallelism.
  • Experience with task-based parallelism.

Training Materials:

Workplan:

  • Weeks 1-3: Student familiarises themselves with Charm++ and Intel Optane persistent memory, and develops detailed plan.
  • Weeks 3-6: Student does technical work to achieve their planned goal – replacing Charm++ checkpointing to disk with persistent memory, and/or making it ACID compliant.
  • Weeks 7-8: Student writes up project.

Final Product Description:
Student will have produced a small example code using Charm++ which checkpoints to Intel Optane DC persistent memory.

Adapting the Project: Increasing the Difficulty:
Student can try to make Charm++ transactional and ACID compliant using persistent memory.

Adapting the Project: Decreasing the Difficulty:
Student can experiment with the performance benefits of persistent memory versus writing to disk, and write simple test programs in Charm++.

Resources:

Organisation:
EPCC

Project reference: 2004

Anomaly detection is one of the most timely problems in managing HPC facilities. Clearly, this problem involves many technological issues, including big data analysis, machine learning, virtual machine manipulation and authentication protocols.

Our research group has already prepared a tool to perform real-time and historical analysis of the most important observables, in order to maximize the energy efficiency and maintainability of an HPC cluster.

Currently, the tool is about to be ported to CINECA’s new GPU-based petascale cluster and, once it is in production (probably by Q2 2020), we will have access to a new dataset. The project focuses on the data-driven automation of the new cluster. Big-data analytics will be performed on the new dataset.

In the framework of the SoHPC program, we aim to perform some preliminary tests of deep learning methodologies and databases to extract knowledge from node- and system-level metrics and to correlate it with node/job failures.
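As a baseline before any deep model, node anomalies can already be flagged with simple statistics; the sketch below (with made-up temperature readings) marks values that deviate strongly from the mean:

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Indices whose value lies more than `threshold` standard deviations
    from the mean -- a minimal baseline before any deep learning model."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []          # all readings identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical node temperatures; one node is overheating.
temps = [61, 62, 60, 63, 61, 95, 62, 60]
suspects = zscore_anomalies(temps, threshold=2.0)
```

A trained neural network would replace this fixed threshold with a learned model of normal behaviour across many correlated metrics.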

Project Mentor: Andrea Bartolini

Project Co-mentor: Andrea Borghesi

Site Co-ordinator: Massimiliano Guarrasi

Participants: Aisling Paterson, Stefan Popov, Nathan Byford

Learning Outcomes:
Increase student’s skills about:

  • Big Data Analysis
  • Elasticsearch/Cassandra
  • Deep learning
  • TensorFlow
  • Open Stack VM
  • Python
  • Blender
  • HPC schedulers (particularly Slurm)
  • Internet of Things (MQTT)

HPC infrastructures

Student Prerequisites (compulsory):

  • Python or C/C++ (Python preferred)
  • Numpy

Student Prerequisites (desirable):

  • Numpy
  • Matplotlib
  • Pandas
  • Elastic Search
  • Tensor Flow
  • Cassandra
  • Spark
  • Blender
  • MQTT

Training Materials:
None.

Workplan:

  • Week 1: Common Training session
  • Week 2: Introduction to CINECA systems, small tutorials on big data  and tensorflow and detailed work planning.
  • Week 3: Problem analysis and deliver final Workplan at the end of week.
  • Week 4, 5: Production phase (log mining and training of the neural network).
  • Week 6, 7: Final stage of production phase (Depending on the results and timeframe, the set of observables will be increased). Preparation of the final movie.
  • Week 8: Finishing the final movie. Write the final Report.

Final Product Description
A full big data pipeline to process heterogeneous information, as well as a trained deep neural network to predict node anomalies.

Adapting the Project: Increasing the Difficulty:
A simple tool to perform live anomaly detection will be prepared and installed on a virtual machine.

Adapting the Project: Decreasing the Difficulty:
If necessary, we could reduce the effort by creating only a log mining application using Elasticsearch.

Resources:
The student will have access to our facility, our HPC systems, and the databases containing all the measurements, system logs and node status information. They could also manage a dedicated virtual machine.

Organisation:
CINECA

Project reference: 2003

Supernova (SN) explosions are among the most energetic phenomena in the universe. They represent the final fate of massive stars. SNe can briefly outshine entire galaxies and radiate more energy than our sun will in its entire lifetime. They are also the primary source of heavy elements in the universe. Supernova remnants (SNRs), the outcome of SN explosions, are diffuse extended sources with a rather complex morphology and a highly non-uniform distribution of ejecta. The remnant morphology reflects, on the one hand, the pristine structures developed in the ejecta soon after the SN and, on the other, the imprint of the early interaction of the SN blast with the magnetized, inhomogeneous circumstellar medium.

In the framework of the SoHPC program, as a first step, we aim to perform some SN explosion simulations using the widely used PLUTO MHD code for astrophysical plasmas. PLUTO is one of the most popular Godunov-type MHD codes developed in Europe and can be used to solve many astrophysical problems, from relatively small scales (e.g. coronal loop formation) up to galactic scales.

Clearly only the use of HPC resources can give us the possibility to simulate the evolution of complex systems like SNe and SNRs.

The second step of our project will be to analyse the data obtained from the SN-SNR simulations using the Interactive Data Language (IDL), or possibly Python, and the tool Paraview for data visualization. In particular, thanks to Paraview, we will produce some 3D models and movies of the SN explosion and the subsequent full-fledged SNR. Finally, the 3D models will also be converted and uploaded to SKETCHFAB, one of the largest platforms for sharing 3D models and virtual reality content, to share the student’s results with the 3D creators community.
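For orientation, the expected remnant size can be estimated analytically: in the adiabatic (Sedov-Taylor) phase the shock radius grows as R(t) = ξ (E t² / ρ)^(1/5), with ξ ≈ 1.15 for a γ = 5/3 gas. A quick back-of-the-envelope check (the energy and ambient density are typical textbook values, not the project setup):

```python
XI = 1.15              # dimensionless Sedov constant for gamma = 5/3
E_SN = 1e51            # explosion energy [erg], canonical SN value
RHO_ISM = 2e-24        # ambient density [g/cm^3], ~1 hydrogen atom per cm^3
YEAR = 3.156e7         # seconds per year
PARSEC = 3.086e18      # centimetres per parsec

def sedov_radius_pc(t_years):
    """Shock radius in parsec after t_years of adiabatic expansion."""
    t = t_years * YEAR
    return XI * (E_SN * t**2 / RHO_ISM) ** 0.2 / PARSEC
```

With these numbers a remnant reaches roughly 5 pc after a thousand years, which sets the scale of the simulation domain the PLUTO runs must cover.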

Copyright NASA

Copyright S. Orlando
You can find further materials on: https://sketchfab.com/sorlando/collections/universe-in-hands

Project Mentor: Salvatore Orlando

Project Co-mentor: /

Site Co-ordinator: Massimiliano Guarrasi

Participants: Seán McEntee, Cathal Maguire

Learning Outcomes:
At the end of the program the student will be able to:

  • use the PLUTO code to perform an astrophysical simulation;
  • use the Paraview software to extract 3D models from it;
  • use Blender to prepare a movie;
  • analyse scientific data using IDL or python.

He/she will also increase his/her skills in parallel programming models (especially MPI) and in scientific visualization.

Student Prerequisites (compulsory):

  • C (first option) or Fortran (intermediate knowledge)
  • Some concept about parallel computing and MPI (beginner)
  • Python (intermediate knowledge)

Student Prerequisites (desirable):

  • Paraview
  • Blender
  • IDL
  • Numpy
  • Matplotlib
  • Astrophysics
  • MHD modeling

Training Materials:
None.

Workplan:

  • Week 1: Common Training session
  • Week 2: Introduction to CINECA systems, PLUTO and Paraview.
  • Week 3: Problem analysis, initial PLUTO setup preparation and deliver final Workplan at the end of week.
  • Week 4, 5: Production phase (PLUTO simulations will be performed and results analysed with Paraview and IDL). (Depending on the results and timeframe, the number of simulations could be increased or decreased)
  • Week 6, 7: Final stage of production phase. Preparation of the final movie. Creation of the model to be uploaded on sketchfab.com
  • Week 8: Finishing the final movie. Write the final Report.

Final Product Description:
The final results of the project will be:

  • A simplified model of SN-SNR will be prepared using PLUTO;
  • Some vtk files containing the results of the simulations will be prepared;
  • Some movies of the SN-SNR simulations will be published on the web;

A webpage on sketchfab.com containing the 3D models of SN-SNR.

Adapting the Project: Increasing the Difficulty:
If time permits, we will increase the number and complexity of the SN-SNR models.

Adapting the Project: Decreasing the Difficulty:
Depending on the available time, we can reduce the number of models to one and omit the publication of the work on sketchfab.com.

Resources:
The student will have access to our facility, our HPC systems, and all the Software and hardware required to complete the planned work.

Organisation:
CINECA

Project reference: 2002

Neural networks (NNs) and deep machine learning are two success stories of modern artificial intelligence. They have led to major advances in image recognition, automatic text generation, and even self-driving cars. NNs are designed to model the way in which the brain performs a task or function of interest, and they can perform complex computations with ease.

Quantum chemistry is a powerful tool to study properties of molecules and their reactions. The rapid development of HPC has greatly encouraged chemists to use quantum chemistry to understand, model, and predict molecular properties and their reactions, properties of nanometer materials, and reactions and processes taking place in biological systems.

An essential paradigm of chemistry is that the molecular structure defines chemical properties. Inverse chemical design turns this paradigm on its head by enabling property-driven chemical structure exploration [1].

The main goal of this project is to investigate NN frameworks which can emulate the electronic wavefunction in a local atomic orbital representation as a function of molecular composition and atom positions, or of other molecular descriptors and representations. Another objective is to apply NN frameworks as predictors of molecular properties (HOMO-LUMO gap, atomic charges or evidence of hydrogen bonds) based on structural properties of these molecules. Next to the aforementioned application part of the project, we also plan to (in)validate the widely accepted claim that GPGPUs are a superior execution platform for NNs compared to CPUs. To do so, we will compare a GASPI/GPI-2 (http://www.gpi-site.com/gpi2) CPU asynchronous parallel implementation with CUDA.
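The descriptor-to-property mapping can be pictured as a small feed-forward pass; the sketch below uses numpy with hypothetical layer sizes and random weights, purely to show the data flow from molecular descriptors to a predicted property:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(descriptors, w1, b1, w2, b2):
    """One feed-forward pass: molecular descriptors -> hidden layer -> property."""
    hidden = np.tanh(descriptors @ w1 + b1)   # nonlinearity in the hidden layer
    return hidden @ w2 + b2                   # e.g. a predicted HOMO-LUMO gap

n_desc, n_hidden = 8, 16                  # hypothetical layer sizes
w1 = rng.normal(0, 0.1, (n_desc, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0, 0.1, (n_hidden, 1))
b2 = np.zeros(1)

batch = rng.normal(size=(4, n_desc))      # 4 molecules, 8 descriptors each
gaps = forward(batch, w1, b1, w2, b2)     # shape (4, 1)
```

The matrix multiplications dominating this pass are exactly the operations that the GASPI (CPU) and CUDA (GPU) implementations would parallelise and benchmark against each other.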

[1] Schütt, K.T., Gastegger, M., Tkatchenko, A. et al. Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions. Nat Commun 10, 5024 (2019)

Schema of GASPI matrix-matrix multiplication and memory layout on a rank (parallel process).

Schematic diagram of descriptors (left) as inputs to the neural networks (right), along with hidden layers, and output.

Project Mentor: Ing. Marián Gall, PhD.

Project Co-mentor: Doc. Mgr. Michal Pitoňák, PhD.

Site Co-ordinator: Mgr. Lukáš Demovič, PhD.

Participants: Neli Sedej, George Katsikas

Learning Outcomes:
Student will learn a lot about Neural Networks, GASPI (C/C++ or Fortran) and CUDA.

Student Prerequisites (compulsory)
Basic knowledge of C/C++ or Fortran, MPI, basic chemistry/physics background.

Student Prerequisites (desirable):
Advanced knowledge of C/C++ or Fortran and MPI, BLAS libraries and other HPC tools. Basic knowledge of GASPI, neural networks, quantum chemistry background.

Training Materials:

Workplan:

  • Week 1: training;
  • Weeks 2-3: introduction to neural network, GASPI, quantum chemistry and efficient implementation of algorithms,
  • Weeks 4-7: implementation, optimization and extensive testing/benchmarking of the codes,
  • Week 8: report completion and presentation preparation

Final Product Description:
Expected project result is (C/C++ or Fortran) GASPI implementation of (selected) neural network algorithm, applied to quantum chemistry problem. Code will be benchmarked and compared to CUDA implementation.

Adapting the Project: Increasing the Difficulty:
Writing own NN algorithm using CUDA.

Adapting the Project: Decreasing the Difficulty:
Applying existent NN implementation to quantum chemistry problems.

Resources:
Student will have access to the necessary learning material, as well as to our local IBM P775 supercomputer and x86 infiniband clusters. The software stack we plan to use is open source.

Organisation:
Computing Center, Centre of Operations of the Slovak Academy of Sciences

Project reference: 2001

The goal of the project is to demonstrate that HPC tools are (at least) as good as, or better than, popular JVM-based technologies such as Hadoop MapReduce or Apache Spark for big data processing. Performance (in terms of floating-point operations per second) is clearly not the only metric to judge by. We must also address other aspects for which the traditional big data frameworks shine, i.e. runtime resilience and parallel processing of distributed data.

The runtime resilience of parallel HPC applications has been a vibrant research field for quite a few years, and several approaches are now application-ready. We plan to use the GPI-2 API (http://www.gpi-site.com/gpi2) implementing the GASPI (Global Address Space Programming Interface) specification, which offers, among other appealing features (asynchronous data flow, etc.), mechanisms to react to failures.

The parallel, distributed data processing in the traditional big data world is made possible by special filesystems, such as the Hadoop Distributed File System (HDFS) or its analogues. HDFS enables data processing that exploits data locality, i.e. processing data that is “physically” located on the compute node, without the need for data transfer over the network. Despite its many advantages, HDFS is not particularly suitable for deployment on HPC facilities / supercomputers and for use with C/C++ or Fortran MPI codes, for several reasons. Within the project we plan to explore other possibilities (in-memory storage / NVRAM, and/or multi-level architectures, etc.) and search for the most suitable alternative to HDFS.

Having a powerful set of tools for big data processing and high-performance data analytics (HPDA), built using HPC tools and compatible with HPC environments, is highly desirable because of the growing demand for such tasks on supercomputing facilities.

Schema of GASPI matrix-matrix multiplication and memory layout on a rank (parallel process).

Project Mentor: Doc. Mgr. Michal Pitoňák, PhD.

Project Co-mentor: Ing. Marian Gall, PhD.

Site Co-ordinator: Mgr. Lukáš Demovič, PhD.

Participants: Muhammad Omer, Cem Oran

Learning Outcomes:
The student will learn about MPI and GASPI (C/C++ or Fortran), parallel filesystems (Lustre, BeeGFS/BeeOND) and the basics of Apache Spark. He/she will also get familiar with the efficient use of tensor contractions and parallel I/O in machine learning algorithms.

Student Prerequisites (compulsory):
Basic knowledge of C/C++ or Fortran and MPI.

Student Prerequisites (desirable):
Advanced knowledge of C/C++ or Fortran and MPI. Basic knowledge of GASPI, Scala, Apache Spark, big data concepts, machine learning algorithms, BLAS libraries and other HPC tools.

Training Materials:

Workplan:

  • Week 1: training;
  • Weeks 2-3: introduction to GASPI, Scala, Apache Spark (and MLlib) and efficient implementation of algorithms,
  • Weeks 4-7: implementation, optimization and extensive testing/benchmarking of the codes,
  • Week 8: report completion and presentation preparation

Final Product Description:
The expected project result is a (C/C++ or Fortran) MPI and (runtime-resilient) GASPI implementation of a selected, popular machine learning algorithm. The codes will be benchmarked and compared with state-of-the-art implementations of the same algorithm in Apache Spark MLlib or another “traditional” big data / HPDA technology.

Adapting the Project: Increasing the Difficulty:
The choice of machine learning algorithm(s) to implement depends on the student’s skills and preferences. Making an ML algorithm implementation both efficient and runtime resilient is challenging enough.

Adapting the Project: Decreasing the Difficulty:
Similar to “increasing the difficulty”: we can choose one of the simpler machine learning algorithms and/or sacrifice the requirement of runtime resilience.

Resources:
The student will have access to the necessary learning material, as well as to our local IBM P775 supercomputer and x86 InfiniBand clusters. The software stack we plan to use is open source.

Organisation:
Computing Center, Centre of Operations of the Slovak Academy of Sciences

Applications are open from 11th of January 2020 to 26th of February 2020. See the Timeline for more details.

The PRACE Summer of HPC programme is announcing projects for 2020 for preview and comments by students. Please send questions directly to the coordinators by the 11th of January. Clarifications will be posted near the projects in question or in the FAQ.

About the Summer of HPC program:

Summer of HPC is a PRACE programme that offers summer placements at HPC centres across Europe. Up to 24 top applicants from across Europe will be selected to participate. Participants will spend two months working on projects related to PRACE technical or industrial work to produce a visualisation or video. The programme will run from June 31st to August 31st.

For more information, check out our About page and the FAQ!

Ready to apply? Click here! (Note, not available until January 10th, 2020)

Have some questions not found in the About section or the FAQ? Email us at sohpc16-coord@fz-juelich.de.

Programme coordinator: Dr. Leon Kos, University of Ljubljana

Hello everyone! I hope you are doing well!

I bet you are wondering now if you are going to apply (or if you should talk about this to your potentially interested friends) and are not really sure how to do this.

Well, I thought that it could help if I write a little bit about how the application process went for me and how I am feeling now about the experience.

As you may guess, since I’m still writing in this blog, my experience went pretty well overall! Here’s the story:

Disclaimer: This is my own personal experience. It does not commit PRACE to anything, nor does it claim to be the “best pattern” to get selected. Everyone is different and has a different background, so trust your own guts and go for it!

I was a penultimate-year student in Applied Mathematics and Computer Science. Last year, I had a High Performance Computing course (in the second semester). Luckily, our HPC lecturer forwarded us an email he had received saying that applications were open (he sent it in February; applications opened around mid-January, so stay tuned this winter) and that, if we were interested, we could apply and ask him to fill in the recommendation form for us. (You need to ask a teacher of yours to do that, but it has to be filled in directly on the application website; you won’t know what they said about you.)

I was really enjoying the HPC classes, so I immediately thought I should apply! But then I started overthinking the selection process: maybe I didn’t have a good enough profile, etc.

Luckily, a friend of mine, Jordy, was also interested and encouraged me to go for it too! We supported each other, but we knew the chances of both of us being selected were slim (two people from the same city, the same specialty, the same class… but interested in very different projects!). Yet we both applied (and we were not the only ones in our class, actually) and waited. And you know what? We were both selected and spent the first week in Bologna together!

Should you be recommended by an HPC lecturer?

No! I thought a lot about it but ended up asking my teacher of Data Analysis and Statistics to do it, because she had known me for more than a year and she is a teacher I really admire (as I do my HPC lecturer too, of course).
So, don’t worry about this! I don’t know whether you have to ask a teacher somehow related to the subjects you are interested in (I applied for Machine Learning/Deep Learning projects, so a statistics teacher made sense for me).

Do you have to be experienced in HPC to get selected?

I believe that you don’t have to. Even though I was already familiar with HPC, some people we met were not! The projects were really different and focused on many fields. People came from different backgrounds and were not all programming on a daily basis before the internship!

What did we have to do?

There were different “steps”:

  • A page about you: your information, where you study, etc.
  • Some exercises to do (three, each one related to the previous one); actually it was about doing one exercise and then improving it.
  • Choosing at least two projects (three at most) and explaining our motivations for each project.
  • A motivation section about the programme in general. I didn’t write a “standard cover letter”; I wrote every word as sincerely as possible, to make it reflect me.
  • A résumé to attach.

As far as I remember, that was everything. I took enough time to write it all, and I asked some friends to review my motivation paragraphs, which was very helpful.

I had the answer by the beginning of April, and my coordinator contacted me a few days/weeks after that to organize my three trips (to Bologna, from Bologna to Bratislava, then back to Paris). Everything was settled quickly and it wasn’t stressful at all for me!

Jordy and I arrived in Bologna, had lots of fun and learned a lot. I met so many people and enjoyed talking and sharing moments with all the other participants. It was really a great week!

Once I arrived in Bratislava, I started working on my project (High Performance Machine Learning), and you can learn more about that by checking the posts I published this summer!

If I could change something…?

Only one thing. I would have made it possible to meet all the other participants again for one week after the end of the summer. This would have been the perfect way to say goodbye! But I was still lucky enough to meet some of the people from Bologna again. We went to Vienna and had lots of fun there together!

Now?

I have now started my semester abroad in Pisa (I’m back in Italy again; I can’t resist pizza and gelato) with some classes on AI, machine learning and data. My experience with HPC is definitely not stopping here, though; I will find a way to keep working on it!

I am very curious to see who will be the next participants! So if you are thinking about applying and have some questions you think I can help answer, just send me a message on LinkedIn! Good luck and welcome to the thrilling HPC world!

Hello ladies and gentlemen, as we start our descent, please make sure your seatbelt is securely fastened.

It was a great month for me, with awesome stories. SoHPC 2019 introduced me to new topics, nice friends and beautiful places.

Have you ever noticed how the waiting time for machines decreases day by day in our daily lives? Of course, nobody wants to wait for slow automated machines or put up with slow smartphones, but do you remember how long you had to wait for old devices to start up?

How to express the real-world problems to the machines?

We can describe problems with the help of linear algebra and solve them by expressing them as systems of linear algebraic equations. Machines understand things as numbers in a matrix, and any manipulation means matrix operations to them. The speed of the calculations can be increased; however, speed can be the enemy of crucial calculations that must remain precise at the same time.

Let’s get back to my project

In my project, the aim is to improve the non-recovered output of “Markov Chain Monte Carlo matrix inversion” (MCMCMI) using a stochastic gradient descent method.

What is Stochastic Gradient Descent (SGD)?

Gradient descent is an iterative method used to minimize an objective function. If a randomly selected subset of the samples is used in each iteration of gradient descent, the method is called stochastic gradient descent.

With the help of the mSGD method proposed in the paper, the error of the inverse obtained from the MCMCMI method is decreased. After successful results in Python, the algorithm was implemented in C++.
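The refinement idea can be illustrated with a small sketch. This is my own toy illustration, not the project's code; the project's mSGD variant would replace the full gradient below with one computed from a randomly selected subset of rows.

```python
import numpy as np

# Toy illustration only (not the project's actual code): refine a rough
# inverse X of A by gradient descent on f(X) = ||A X - I||_F^2, whose
# gradient is 2 A^T (A X - I). The stochastic variant (mSGD) would use
# only a randomly selected subset of the rows of A at each step.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
I = np.eye(2)

X = 0.1 * I          # crude starting guess for A^{-1}
lr = 0.02            # step size, tuned by hand for this toy matrix

err_before = np.linalg.norm(A @ X - I)
for _ in range(500):
    X -= lr * (2 * A.T @ (A @ X - I))
err_after = np.linalg.norm(A @ X - I)

print(err_before, err_after)   # the residual shrinks by orders of magnitude
```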

”The Cherry on the cake”

“Life is not fair. Why should random selection be fair?”

In mSGD, the rows are selected with uniform probability. The results are slightly better if the probability of selecting a row is proportional to the norm of the row.
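To make the sampling idea concrete, here is a small sketch (my own illustration, not the project's code) of drawing rows with norm-proportional probabilities:

```python
import numpy as np

# Sketch of the sampling idea only: select rows of A with probability
# proportional to their Euclidean norms, so "heavier" rows are picked
# more often than under uniform sampling.
rng = np.random.default_rng(42)

A = np.array([[1.0, 0.0],
              [0.0, 10.0],
              [2.0, 2.0]])

row_norms = np.linalg.norm(A, axis=1)   # norms of the three rows
probs = row_norms / row_norms.sum()     # normalise into a distribution

samples = rng.choice(len(A), size=10_000, p=probs)
counts = np.bincount(samples, minlength=len(A))
print(probs, counts)   # row 1 dominates, roughly in proportion to its norm
```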

The results of the implementation.

“Last touch: Adding Batches to Parallelization”

Stochastic gradient descent and batching go together like two peas in a pod. Instead of using the whole matrix A for the method, the rows are divided into subsets, one for each process, in a hybrid MPI/OpenMP parallelization. When the rows cannot be divided equally, a good trick is to give a smaller number of rows to the master process, since it has more work to do than the others.
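The row-splitting trick above can be sketched as follows; the function and its behaviour are my own illustration, not the project's actual MPI code:

```python
# Hypothetical sketch (names are mine) of the partitioning trick described
# above: split n rows over p ranks, assigning the leftover rows to the
# highest-numbered ranks so that rank 0 (the master) keeps the smallest batch.
def partition_rows(n_rows, n_procs):
    base, extra = divmod(n_rows, n_procs)
    counts = [base] * n_procs
    for r in range(n_procs - extra, n_procs):
        counts[r] += 1          # remainder goes to the non-master ranks
    offsets = [sum(counts[:r]) for r in range(n_procs)]
    return counts, offsets

counts, offsets = partition_rows(10, 3)
print(counts, offsets)   # [3, 3, 4] [0, 3, 6]: the master (rank 0) gets 3 rows
```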

Batching by dividing into processes.

That’s all, folks!

Thank you for joining my adventure!

The interior of the chapel where MareNostrum is hosted.

I know I haven’t kept my promise of giving regular heads-ups on what is going on with my project, but these last days have been so extremely busy and overwhelming that they haven’t allowed me to do so. Overwhelming, in the sense that a beautiful chapter has just finished, while a new, interesting one is ready to be written.

Here I am now, many feet above the earth’s surface on a flight back to Greece, where I will complete my studies, wondering how I should summarize two months of work in just a few words. So… let us start from where we left off last time.

Diving in the details…


In order to monitor memory interactions we opted for Valgrind, as it offers the additional possibility of simulating cache behaviour. That is, when we run our application under Valgrind, all data accesses can be monitored and collected, giving us the chance to determine all those accesses that missed the cache and had to deploy memory subsystems deeper in the hierarchy. Considering each accessed piece of data in isolation would be neither intuitive nor helpful for further post-processing. This is where EVOP intervenes, grouping data into memory objects. An object can be an array of integers, a struct or any other structure whose semantics suggest that it has to be considered as an entity. Such objects, regardless of whether they are statically or dynamically allocated, are labeled, and every time an access is issued, it is tied to the related object.

Setting the aforementioned comparison of software and hardware approaches as our ultimate scope, we enhanced EVOP with the option of performing sampled data collection. In detail, we extended the Valgrind source code by integrating a global counter which increments on every memory access. While the counter’s value is below a predetermined threshold, no access is accounted for. In contrast, the memory access that makes the counter reach the threshold is monitored and therefore added to the final access count. The threshold is simply specified when EVOP is invoked.

What is more, we familiarized ourselves with the internal Valgrind representation of memory accesses in order to distinguish between “load” and “store” accesses. Similarly to before, we added a counter that increments every time a load or store is detected. The only accesses monitored are those that bring the counter up to the predetermined threshold. This time, the user has to define a combination of flags in order to specify the type of accesses to be sampled as well as the respective sampling period.
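The counter-and-threshold logic above can be modelled in a few lines. This is a toy Python model for clarity only; the actual change was made inside Valgrind's C sources:

```python
# Toy model of the sampling logic: a global counter increments on every
# access, and only the access that makes the counter reach the threshold
# is recorded, after which the counter resets.
class SampledMonitor:
    def __init__(self, period):
        self.period = period     # the predetermined threshold
        self.counter = 0
        self.recorded = []

    def on_access(self, addr):
        self.counter += 1
        if self.counter == self.period:
            self.recorded.append(addr)   # this access is accounted for
            self.counter = 0             # ...and the count starts over

mon = SampledMonitor(period=3)
for addr in range(1, 11):   # simulate 10 accesses at addresses 1..10
    mon.on_access(addr)
print(mon.recorded)         # [3, 6, 9]: every third access is kept
```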

Having enabled these two features, our methodology can be summarized in the following procedure: we initially performed a simulation of our benchmarks without sampling in order to get the total number of memory objects. These were then processed by a BSC-developed tool in order to be optimally distributed to the available memory subsystems. The distribution takes into consideration the last-level cache misses of each object as well as the number of loads that refer to each one of them, and sets as its goal the minimization of the total CPU stall cycles. The total “saved” cycles are calculated along with the object distribution, and thus the final speedup can be determined. This speedup is the maximum achievable given the application and the memory subsystem mosaic.
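The post does not show the BSC tool's internals, so the following is only a plausible greedy sketch of the distribution idea: objects with the highest stall-cycle cost are placed into the fastest tier until it fills up.

```python
# The BSC tool's exact algorithm is not shown here; this is a plausible
# greedy sketch of the idea only: place the objects that cost the most
# stall cycles into the fastest (and smallest) memory tier first.
def distribute(objects, tiers):
    """objects: (name, size, stall_cycles_saved_if_placed_fast) tuples;
    tiers: (tier_name, capacity) tuples, fastest tier first."""
    placement = {}
    remaining = [capacity for _, capacity in tiers]
    for name, size, cost in sorted(objects, key=lambda o: -o[2]):
        for i, (tier_name, _) in enumerate(tiers):
            if size <= remaining[i]:
                placement[name] = tier_name
                remaining[i] -= size
                break
    return placement

objs = [("A", 40, 900), ("B", 30, 500), ("C", 50, 100)]
tiers = [("HBM", 64), ("DRAM", 1024)]
print(distribute(objs, tiers))   # {'A': 'HBM', 'B': 'DRAM', 'C': 'DRAM'}
```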

What follows is a trial-and-error experimentation process of obtaining results using various sampling periods to extract the referenced objects. It can be intuitively assumed that the longer the sampling period, the fewer total memory accesses will be recorded. Given that each access is related to one memory object, the fewer the recorded accesses, the higher the chance that not all existing memory objects are discovered. This presumably results in a worse distribution; worse in the sense that the final speedup is lower than the initial, achievable one. Nevertheless, depending on the access pattern of each benchmark, there is a specific sampling period, or a restricted range of neighboring sampling periods, that identifies all the objects responsible for the initial speedup.

“Iterative” memory data access allows sparse sampling.
“Sequential” memory data access requires more intensive sampling.

In the first two figures I present a qualitative representation of the memory access behavior of each application, as I understood it. Sparse sampling, in the first case, is not enough to account for every “important” object, while in the second case, due to the different access pattern, it is. Regarding the exact correlation between the final speedup and the sampling period used to get the objects: in the first case there is a strict threshold, after which the final results are strongly degraded; in the second case this threshold relaxes to a larger neighborhood of periods.

Closure


Since this experience has reached its end, it would only be fair to pay due tribute to the SoHPC programme. Beyond the undoubtedly deep learning outcomes of the programme, one can benefit from it by expanding one’s personality and experiences. The fact that we are given the chance to live first-hand in a foreign country poses a challenge, yet presents an opportunity to discover and develop one’s self. All in all, SoHPC provides an important alternative that enriches both technical skills and personal qualities.

If you want to keep up with me please check my personal page at LinkedIn!

We have come to the end of my project! I cannot believe how quickly this internship has gone and I am so grateful for the opportunity. I hope you enjoy this final blog post of mine as I explain the last tasks undertaken in my final – and potentially the busiest – two weeks.

Video 1: I filmed this short video on the morning of the last day of my internship and during my final walk to work.

With my project now finished, I am pleased to be able to provide details of the final steps taken. To summarise, I worked with four variants of the K-Ras4b protein: K-Ras4b wild-types each bound to GDP and GTP molecules, and K-Ras4b G12D mutants each bound to GDP and GTP molecules. As a recap from a previous post, wild-type proteins refer to the unmutated, normal protein. Contrastingly, G12D refers to a specific mutation that has occurred in the protein: the glycine amino acid in position 12 along the amino acid chain has mutated, becoming aspartic acid. GTP and GDP are known to bind to K-Ras4b. When bound to GTP, K-Ras4b is active, meaning that cell proliferation can occur. This refers to an increase in the number of cells as a result of cell growth and division. When K-Ras4b is bound to GDP, however, it is inactive so cell proliferation terminates. The G12D mutation in K-Ras4b means that when bound to GTP, it cannot undergo hydrolysis. This is the process in which GTP becomes GDP and effectively acts as a switch to turn off the cell growth and division. As a result, since the growth and division of the cells cannot stop, this can lead to tumour growth! My project involved studying this mutation in particular and running molecular dynamics simulations in order to investigate the changes within the protein throughout its trajectory. The resulting trajectory for the mutated G12D K-Ras4b protein, bound to the activating GTP, is shown in the animation below. This was visualised using VMD (a visual molecular dynamics program). The part of the structure that is shown using a different representation (refer to previous posts for an explanation of the various drawing methods in VMD) is the aspartic acid molecule in position 12 along the amino acid chain. This is present since the G12D mutated K-Ras4b is displayed, so the glycine in position 12 has changed to aspartic acid, as explained previously.

Video 2: This video shows the trajectory of the G12D mutated K-Ras4b structure bound to the GTP molecule.

For the molecular dynamic simulation processes that were carried out, ‘jobs’ were to be submitted to the ARIS supercomputer. These are files that essentially tell the supercomputer how to process the information it has been given. Unfortunately, during my project, I sent some incorrect jobs to the supercomputer. This did not affect the project as a whole but just delayed the work schedule that was in place by a couple of days. One of the very important lessons that this internship has taught me is that projects like this are a long process that will sometimes include slip ups but if you keep working hard, you will get there. It has taught me that I can often learn lessons very deeply after having made a mistake since this requires me to investigate what went wrong. This then teaches me more about the systems used than if it had all gone correctly the first time! I suppose that this is what an internship is all about: learning new information and methods, making a few mistakes, learning from them and then developing your knowledge base as you progress. This internship has also, importantly, shown me how a working lab runs and how to work both individually and also as a member of the lab, offering and asking for guidance from peers.

Drugs are small molecules that can block proteins, which have cavities where drugs can bind. Within a protein’s structure, there may be a buried binding site, which is where a molecule can bind to the protein. A small drug molecule can be found or specifically designed in order to fit this binding site. Therefore, a crucial step in the project’s process was identifying binding sites on the four K-Ras4b proteins studied. After running the molecular dynamics simulations, this binding site identification was performed and binding sites were found. These are shown in the image below for the mutated K-Ras4b protein bound to the GTP molecule. This process was repeated for all four protein variants.

Figure 1: This image shows the binding sites identified on the mutated K-Ras4b protein that is bound to the GTP molecule. The binding sites, which are the red, purple and yellow surfaces, are highlighted using the coloured circles.

The very last tasks of my internship included creating a video that explained and summarised my project! This can be seen on YouTube by following the link here. I would love to hear what you thought of it and receive your feedback! Moreover, on the last day of my internship, the 28th of August, I gave a 20-minute presentation to my colleagues here at the Biomedical Research Foundation of Athens. We can all be seen in the image below.

Figure 2: This image shows myself (far right) with my colleagues from the Biomedical Research Foundation of Athens which is where I have been working for this project.

My final thoughts on the internship are that I have learnt so much and am so grateful for being accepted onto this programme, as I have gained invaluable experience. I will take the skills learnt with me into the future, as I know that they will be helpful in the upcoming projects I undertake; hence the title: the end of the beginning. Even though my internship is now over, this is still the start of my journey and I look forward to my future endeavours. Moreover, I have met so many wonderful people through this internship and I wish them all the best success in the future. Thank you so much for reading my blog posts, and I can now say thank you for reading my last blog post!

Hi everyone!

As I elaborated in my last two blog posts on what the Intel Neural Compute Stick is and how to use it in combination with the OpenVino toolkit, in this blog post I will describe what the “dynamic” part of my project’s title stands for. As explained in my first blog post, the Neural Compute Stick is meant to be used in combination with lightweight computers such as the Raspberry Pi to speed up computations for visual applications. Often enough, devices such as the Raspberry Pi are found at the edge. This could be a device attached to a satellite in space running earth observation missions, or an underwater robot conducting maintenance on critical structures such as underwater pipelines. One problem that remains though, even with the Neural Compute Stick accelerating computations, is that once a model is deployed on an edge device, it is hard to repurpose the device to run a different kind of model. This is where the “dynamic” part of the title of my project comes into play. The Neural Compute Stick is a highly parallelized piece of hardware, with twelve processing units independently running computations. This way, it is possible to have not only one, but several models loaded into its memory.

Even more so, it is possible to switch these models and load new models into memory. This makes it possible to adapt to new situations in the field, for example when the feature space changes or the things that are supposed to be detected change. The simplest case of such an occurrence might be the sun setting or bad weather conditions coming up. Another motivation to switch models might also be to save power, as edge devices tend to have limited capacities when it comes to energy sources. Instead of deploying one big model that is supposed to cover all cases that could occur in a production environment, it would be possible to have many small models that could be loaded in and out of memory at runtime.

With this in mind, I went ahead and investigated the feasibility of doing so and implemented a small prototype that switches models at runtime. For this prototype I used two models, detecting human bodies and faces, and had the prototype switch between them. These models are both so-called single-shot detector MobileNets, networks that are well suited to deployment on lightweight devices such as the Raspberry Pi. These networks localize and classify objects in a single pass through the network and draw bounding boxes around the objects they detect.

I used OpenCV for this task, which is a library featuring all sorts of algorithms for image processing and is best described as a “Swiss army knife” when it comes to visual applications. Next to OpenCV, I had OpenVino running as a backend to utilize the Neural Compute Stick in my application.

I eventually tested this model switching prototype by loading and offloading models in and out of memory of the Neural Compute Stick. I did this with a very high frequency of one switch per frame to determine what the latency of such a model switch would be in a worst-case scenario. The switching process includes reading the input and output dimensions of a model by using the XML representation of its architecture and then loading it into the memory of the Neural Compute Stick. On average this switch caused an extra overhead of about 14 percent of the overall runtime. To put this into perspective, on average it took my application half a second to capture and generate an output for an image, whereas a model switch in between would add a little less than a tenth of a second to this time. Of course, there is a lot of room for improvement given these numbers. One such improvement would be concerned with the parsing of the model dimensions. I used a simple XML parser to do so and had to read in the input and output dimensions of a model on every switch. Doing this once for all models that potentially will be used on the Neural Compute Stick when the application starts running, and saving the dimensions into a lookup table, could cut the switch time almost in half. Further speedup of this switch could be achieved by conducting it asynchronously, as while the model is loaded onto the Neural Compute Stick, the next frame can already be captured instead of waiting for the switching process to finish.
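The lookup-table improvement can be sketched as follows. The XML layout and model names here are simplified stand-ins of my own (not real OpenVino IR files), but the pattern is the same: parse once at startup, then look up on every switch.

```python
import xml.etree.ElementTree as ET

# Sketch of the lookup-table improvement: parse each model's input
# dimensions ONCE when the application starts and cache them, instead of
# re-parsing the XML on every model switch. The XML layout below is a
# simplified, hypothetical stand-in for a real model description file.
FACE_MODEL_XML = """
<net name="face-detector">
  <input><dim>1</dim><dim>3</dim><dim>300</dim><dim>300</dim></input>
</net>
"""
BODY_MODEL_XML = """
<net name="body-detector">
  <input><dim>1</dim><dim>3</dim><dim>544</dim><dim>320</dim></input>
</net>
"""

def input_dims(xml_text):
    root = ET.fromstring(xml_text)
    return tuple(int(d.text) for d in root.find("input"))

# Build the cache once, when the application starts...
dim_cache = {name: input_dims(xml)
             for name, xml in [("face", FACE_MODEL_XML),
                               ("body", BODY_MODEL_XML)]}

# ...so that every later model switch is a dictionary lookup, not a parse.
print(dim_cache["face"], dim_cache["body"])   # (1, 3, 300, 300) (1, 3, 544, 320)
```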

A performance breakdown of my application with the pink part depicting the part of my code responsible for the model switch.

All in all, I found that although in its current state this prototype is not yet applicable to real-time applications, given the potential for improvement it could get there. And if no hard real-time constraints are imposed, as is the case for many applications, it is already deployable.

As the number of frames between switches increases, the performance of the application starts to drastically improve.

With this I would like to sum up my findings on this project, if you would like to learn more about this project feel free to have a look at my blog on the website of the PRACE Summer of HPC 2019. Lastly, I would like to thank my supervisors for their amazing support throughout this whole project and in general the staff at ICHEC for welcoming me and making this stay such a great experience!   

Today I will tell you how to speed up a programme running on a GPU. Do you remember the accumulation tree example from my last post?
I was provided with a working version of it, with a global queue that stores tasks ready to be launched.

The protagonists of our story are the following. The threads are our Roman soldiers. They work on floating-point processing units (FPUs) and they do the computation. They are grouped in units of 32 threads called warps. Threads within the same warp should execute the same instruction at the same time, but on different data. Warps are themselves grouped into blocks. Threads within the same block can communicate quickly through shared memory and can synchronize together. However, no inter-block communication or synchronization is allowed except through the global memory, which is much slower.

The tree is another player. It lives in global memory, since all threads should be able to access it. Lastly, there is the queue, also living in global memory. It is implemented as a linked list, meaning that its size can adapt to the number of tasks it contains. To avoid race conditions, accesses to the tree and the queue are protected by a mutual exclusion system (mutex). We rely on a specific implementation of a mutex for GPUs. It allows only one thread at a time to access the data protected by the mutex. To avoid deadlocks (a situation where two threads each wait for an event that will never occur), only one thread per warp tries to enter the mutex. We call it the warp master, and in this early version, only the warp masters work, as follows:

  1. Fetch a task from the queue
  2. Synchronisation of the block
  3. Execute the task
  4. Synchronisation of the block
  5. Push new tasks (if any)

Step 5 is done only if the task done at step 3 resolves a dependency and creates a new task.

Cutting allocations

The first improvement was to replace the queue implementation with my own, relying on a fixed-size queue. The reason is that memory allocation is very expensive on a GPU, and with an adaptive-size queue you allocate and free memory every time a thread pushes or pops a task.
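The fixed-size queue idea can be sketched in a few lines. This is a conceptual Python model of my own, not the actual GPU code:

```python
# Conceptual sketch (in Python; the actual implementation is GPU code): a
# fixed-size queue backed by one buffer allocated up front, with indices
# wrapping around it, so pushes and pops never allocate or free memory.
class FixedQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity   # the single up-front allocation
        self.capacity = capacity
        self.head = 0                  # index of the next task to pop
        self.count = 0                 # number of tasks currently stored

    def push(self, task):
        if self.count == self.capacity:
            return False               # full: caller falls back to another queue
        self.buf[(self.head + self.count) % self.capacity] = task
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None                # empty
        task = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return task

q = FixedQueue(capacity=2)
print(q.push("t1"), q.push("t2"), q.push("t3"))   # True True False
print(q.pop(), q.pop(), q.pop())                  # t1 t2 None
```

On the GPU, the same idea maps onto a pre-allocated array in global (or shared) memory with atomically updated head and count indices.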

Reestablishing private property

The second idea was to reduce the contention on the global queue since working threads are always trying to access it.
I added small private queues for each block, that can be stored either in the cache or in the share memory for fast access.
The threads within a block use the block’s private queue in priority, and the global queue as a fallback if the private one is full (push) or empty (pop).

Solving the unemployment crisis

For now, only a few threads (the warp masters) are working. It’s time to put an end to that. First, since threads within a block are synchronised, accesses to the private queue happen at the same time and are thus performed sequentially because of the mutex. I decided that only one thread per block would access the queue, and be in charge of fetching the work (step 1) and resolving dependencies (step 5) for all the threads.

Breaking down the wall

Now that all threads are working, they are all waiting to enter the mutex protecting the tree, even if they are trying to access different parts of it. So I removed the mutex and ensured that all operations on the tree are atomic. That’s a bit as if there were a mutex on each node of the tree, making it possible for several threads to access the tree concurrently.

Fighting inequality

Removing the mutex from the tree resulted in a huge gain in time, so I tried to also get rid of it for the shared queue access. First, I split the queue so that there is one shared queue for each block.
Because the queues are fixed in size, the push and pop operations are independent. A block master can pop only from its own queues (private and shared) and can push to its own private queue, its own shared queue and also other blocks' shared queues.
Thus, we do not need mutex protection for the pop operation.

Last but not least, the work must be shared equally between blocks (this is called load balancing). I provided a function that tells a calling thread into which shared queue a newly created task should be pushed.
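As an illustration, such a function could pick the least-loaded shared queue. This heuristic is my assumption for the sketch; the post does not specify the actual rule used.

```python
# Hypothetical load-balancing rule: push each newly created task to
# the shared queue that currently holds the fewest pending tasks,
# so no block sits idle while others drown in work.
def target_queue(queue_sizes):
    # index of the least-loaded shared queue
    return min(range(len(queue_sizes)), key=queue_sizes.__getitem__)
```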

How big is the speedup?

Quite big, actually. The optimized version I wrote runs 450 times faster than the version I was provided. For an octree of depth 5, the execution time drops from 700 ms to 1.5 ms, making it usable in real applications.

Is your graphics card able to run N-body simulations in a smart way? A complex tree algorithm, a sophisticated tasking system: is all that a task for a GPU? No, some will say, a graphics card can do only basic linear algebra operations. Well, maybe the hardware is capable of much more…

It is now time to give you some insight into my project. The main goal is to make progress towards the implementation of a smart algorithm to solve the N-body problem. To learn the main idea behind this algorithm (without going into all the dirty details), you can check the video I made for PRACE there.

To put it in a nutshell, the Fast Multipole Method (FMM) is an algorithm to compute long-range forces within a set of particles. It makes it possible to solve the N-body problem numerically with linear complexity, where it would otherwise be quadratic (when computing the interactions between each pair of particles). Doubling the number of particles then only doubles the computation time instead of quadrupling it. However, the FMM algorithm is hard to parallelize because of data dependencies. Tasking, meaning splitting the work into tasks and putting them into a queue, helps a lot to give work to all threads. A working tasking framework for the FMM has been implemented on Central Processing Units (CPUs) by D. Haensel (2018). Will such a tasking framework run efficiently on General-Purpose Graphics Processing Units (GPGPUs, or more simply GPUs)?

The answer is not obvious at all, because CPUs and GPUs have very different architectures. My goal this summer was to shed some light on that topic.
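The quadratic cost of the direct approach is easy to see by counting pairs. Here is a toy Python sketch (a 1-D Coulomb-like kernel, purely illustrative and unrelated to the project's code):

```python
# The quadratic baseline that the FMM avoids: direct summation over
# every pair of particles. The pair count grows as N*(N-1)/2, so
# doubling the number of particles roughly quadruples the work,
# while the FMM keeps the cost linear in N.
def direct_sum(positions):
    energy, pairs = 0.0, 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += 1.0 / abs(positions[i] - positions[j])
            pairs += 1
    return energy, pairs

_, pairs_n = direct_sum([float(x) for x in range(1, 101)])   # N = 100
_, pairs_2n = direct_sum([float(x) for x in range(1, 201)])  # N = 200
# 4950 pairs for N = 100, 19900 for N = 200: 2x particles, ~4x work
```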

A smooth start

Let me first introduce the problem that we will use to test the tasking framework. It is a simplified version of one of the operators of the FMM, and we named it the accumulation tree.

The principle is very simple: the content of each cell is added to its parent. So one task is exactly "adding the content of a cell to its parent". You can already see that dependencies will appear, since a node needs the results from all its children before its own task can be launched. Imagine we have 10 processing units (GPU or CPU threads); the computation of the accumulation will proceed as follows.

Initialisation

All leaves are initialized with 1. All tasks that are ready (that is, all leaf tasks) are pushed into the queue.

Round 1

Ten blue tasks are done. Two new green tasks are ready, so they are pushed into the queue.

Round 2

The six remaining blue tasks are done, as well as two green tasks. The last two green tasks become ready and are pushed into the queue. Here we can see that tasking maximizes thread usage, since all tasks that are ready can be executed.

Round 3

The last two green tasks are executed. We get the correct result, hip hip hooray!
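The rounds above can be simulated in a few lines of Python (a sketch with an assumed tree shape and function names, not the exact tree from the figures): each round, at most `workers` ready tasks execute, and a parent's task becomes ready once every child has been added into it.

```python
# Round-by-round simulation of the accumulation tree with a limited
# number of workers.
def accumulate(children, workers):
    parent = {c: n for n, cs in children.items() for c in cs}
    values = {n: (0 if children[n] else 1) for n in children}  # leaves hold 1
    pending = {n: len(cs) for n, cs in children.items() if cs}
    ready = [n for n in children if not children[n]]  # all leaf tasks
    rounds = 0
    while ready:
        batch, ready = ready[:workers], ready[workers:]
        rounds += 1
        for node in batch:
            p = parent[node]
            values[p] += values[node]        # "add my content to my parent"
            pending[p] -= 1
            if pending[p] == 0 and p in parent:
                ready.append(p)              # parent's task is now ready
    return values, rounds

# a complete binary tree with 4 leaves: root 0, inner nodes 1-2, leaves 3-6
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
totals, rounds = accumulate(tree, workers=10)
# totals[0] == 4: the root has gathered all four leaves, in 2 rounds
```

With fewer workers the same tree simply takes more rounds, which is exactly the trade-off the figures illustrate.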

And on GPUs ?

Such a tasking system works well on a CPU. Why can't we just copy-paste the code, translate some instructions and add some annotations to make it work on a GPU?

Because many assumptions we can rely on with CPUs no longer hold on GPUs. The biggest of them is thread independence. You can compare the CPU to a barbarian army: only a few strong soldiers, each of them able to act individually.

credits: Wikimedia CC-BY-SA

A graphics card, however, is more like the Roman army, with a lot of men divided into units. All soldiers within the same unit are bound to do exactly the same thing (but on different targets).

credits: https://patricklarkinthrillers.files.wordpress.com

Even if I'm sure you are looking forward to knowing whether it is possible to get this powerful army to implement a tasking system, you will have to wait for my next blog post. Be well in the meantime!

Here we are at the end of August, and it is time to sum up all the work done during these two months of SoHPC.

Remember to check the previous posts if you are not up to date with the latest proceedings of my HPC journey.

https://summerofhpc.prace-ri.eu/author/davideg/

Here we can see the results of the strong scaling experiment; if you don't remember what we are talking about, just check https://summerofhpc.prace-ri.eu/beads-beads-beads/ again.

I must say I am quite satisfied, as I eventually managed to modify the DL Meso DPD code in order to run simulations with 3 billion particles. This was quite nasty, as there were several hidden variables that needed to be modified to store very big numbers: we are talking about long integers, for those who are fond of computer science.
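The arithmetic behind that change is simple: the particle count alone already overflows a signed 32-bit integer, as this quick check shows (illustrative Python, not the code's actual variables):

```python
# Why 3 billion particles forces the switch to long (64-bit) integers:
# the particle count no longer fits in a signed 32-bit variable.
INT32_MAX = 2**31 - 1            # 2,147,483,647
INT64_MAX = 2**63 - 1
particles = 3_000_000_000

fits_in_32 = particles <= INT32_MAX   # a 32-bit counter would overflow
fits_in_64 = particles <= INT64_MAX   # a long integer is plenty
```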

In conclusion, the DL Meso DPD code scales pretty well on very large GPU architectures. This result is amazing: by running these simulations, based on a mesoscale approach, we can jump across several length scales and approximate quite well the continuous nature of fluids.

To understand this better, check the following video about similar work based on a molecular dynamics perspective:

Moreover, this allowed me to run some very large jobs (up to 2048 nodes) on the Piz Daint supercomputer, which is kind of a story to talk about.

So here we are: the SoHPC is over and I couldn't be any sadder. Not only has it been a great opportunity to work in an innovative environment on cutting-edge research, but also the occasion to meet amazing people and live in the amazing city of Liverpool.

So, thanks for having followed this blog and, for the last time, in the immortal words of the fab four: “Hello, Goodbye”

Or: a sum-up of a summer on HPCs.

Index

In my one + three blog posts, I tried to present a different aspect every 20 days: me + three relatively different topics that, combined, provide the basis of my unique project in the Summer of HPC.

arsenios and marconi

In my 1st post in this blog, I introduced myself.
https://summerofhpc.prace-ri.eu/arsenios-chatzigeorgiou/

In my 2nd post, I introduced you to some basic notions of HPC.
https://summerofhpc.prace-ri.eu/among-high-performance-persons/

In my 3rd post, to the theory behind it: plasma modeling.
https://summerofhpc.prace-ri.eu/sun-milk-and-forests-making-the-party-go-on/

In my 4th post, to Python GUI with PyQt5.
https://summerofhpc.prace-ri.eu/recipe-for-a-delicious-dish/

I also made a video presentation of my project, explaining nuclear physics with stupid jokes, simple animations, puns and TV series references. https://www.youtube.com/watch?v=P12FqpXB7Yg

My final report will also be uploaded on this site; you can find it by clicking Final Reports -> Final Reports 2019 in the top navigation bar.
I guess the link will be https://summerofhpc.prace-ri.eu/wp-content/uploads/2019/08/SoHPC2019_final_editorial.pdf but I might be wrong, so you had better look for it yourself in the top bar.


So, in this 5th and final post, apart from summing up things for you, I am also taking some time to tell you more about my summer.

Responsibility ‘s cool…

I got the opportunity to work at the Faculty of Mechanical Engineering, specifically in the LECAD lab.

I improved my coding skills a lot, and with the help of the guys in the lab I learned much more than I thought I would. I also managed to create a GUI without any prior experience. I learned about nuclear fusion and plasma. I also learned about and used HPCs.

…but there ‘re more things in life

There are more things in life, like getting a song played all night, or enjoying Ljubljana, meeting people and traveling around, and receiving Slovenian hospitality.

1 Ljubljana

Ljubljana is very peaceful and small, yet there are a lot of opportunities for young students: big music festivals, open-air cinema, parties, music shows. I loved cycling here (you can get access to a fully equipped bike-renting system with an annual subscription that costs 3€, covering trips that last less than an hour), and I also enjoyed just walking around the city, looking for parks and recreation. The river Ljubljanica, besides having the cutest river name, can be very crowded by the main bridges, but it also passes through spots where you can enjoy serenity. The castle also provides an awesome view; in some cases I guess you can see the whole Slovenian countryside from up there (yes, Slovenia is very small).

2 Meet & Travel

Since this was the first international program I participated in (I never managed to take part in an Erasmus program), it was the first time I got the opportunity to meet so many people from all over the world, and I was fascinated by it. There were 25 participants in 12 cities this year, and we managed to find time to meet again with some of the soHPC participants living in nearby countries. So, almost every second weekend, I had the opportunity to be in a different country with people I had a good time with.

In two months I got the opportunity to meet most of the soHPC participants in Bologna, to learn about and appreciate Indian culture in Ljubljana, to get Balkan and Pasok vibes in Zagreb, to constantly hum "Fly Me to the Moon" in Vienna, to learn about Amaziɣ culture in Budapest, and to talk about feminism in Bratislava (and to get back with a 12-hour delay through Munich). All very exciting experiences, and I am quite thankful for having had them.

3 Slovenian hospitality

My roommate, fellow soHPC participant Khyati, and I lived for two months in a very comfortable university dormitory apartment. We also received great hospitality from our mentors and site coordinators. In particular, we had an adventure in nature with the soHPC coordinator Leon Kos, who proved to be the most athletic of us all, succeeding in shallow-river walking, climbing and hiking in Rakov Skocjan, and canoeing on the river at the Slovenian-Croatian border.

Conclusion

It was an awesome experience; I got much more than I expected. Everything was quite well organized, and if you are here thinking about it, don't think about it. Apply for the soHPC.

Hvala!
