Project reference: 1915

A Radix Sort is an array sorting algorithm that can run in O(N) time, compared with typical sorting algorithms like QuickSort which run in O(N log N) time. Typical sorting algorithms sort elements using pairwise comparisons to determine ordering, meaning they can be easily adapted to any weakly ordered data type. The Radix Sort, however, uses the bitwise representation of the data and some bookkeeping to determine ordering, which limits its scope. In the case of numerical simulation, it is typically a combination of integer and floating point data being sorted, so a bitwise representation is accessible. Another potential issue is stack usage, although a sufficiently clever implementation may be able to avoid it.
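To make the idea concrete, here is a minimal serial sketch of a least-significant-digit radix sort on 32-bit unsigned integers, processing one byte per pass. It is written in Python for readability (the project itself targets C and MPI), and the function name and digit width are illustrative choices only.

```python
# Minimal serial LSD radix sort sketch for 32-bit unsigned integers.
# Illustrative only: the project targets a C/MPI implementation.
def radix_sort_u32(keys, digit_bits=8):
    mask = (1 << digit_bits) - 1
    n_buckets = 1 << digit_bits
    for shift in range(0, 32, digit_bits):              # one pass per digit
        buckets = [[] for _ in range(n_buckets)]
        for k in keys:                                   # scatter by current digit
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]  # stable gather
    return keys


if __name__ == "__main__":
    import random
    data = [random.getrandbits(32) for _ in range(1000)]
    assert radix_sort_u32(data) == sorted(data)
```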

This project aims to implement a Radix Sort in MPI for an array distributed across a number of processes. Initial implementations will look at sorting fixed-width integers, then move on to arbitrary bitwise-representable data. Following that, support for various array distribution and chunking approaches can be investigated. Upon achieving sufficient usefulness, functionality, and performance, the project can be wrapped up into a library (finding immediate use in the ExSeisDat project).
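As a rough illustration of the bookkeeping a distributed pass involves, the hypothetical fragment below uses mpi4py and numpy to compute, for a single 8-bit digit pass, where each rank's keys would start writing in the globally sorted output; an Alltoallv using these offsets would then redistribute the keys. The real project would implement this in C with MPI, so the digit width, array sizes and variable names here are assumptions.

```python
# Hypothetical mpi4py sketch: global bucket offsets for one digit pass of a
# distributed radix sort. The actual project would implement this in C/MPI.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(seed=rank)
local_keys = rng.integers(0, 2**32, size=1000, dtype=np.uint64)

shift = 0                                     # least-significant byte first
digits = ((local_keys >> shift) & 0xFF).astype(np.intp)

# Histogram of the 256 possible digit values on this rank.
local_counts = np.bincount(digits, minlength=256).astype(np.int64)

# Total count of each digit over all ranks.
global_counts = np.empty_like(local_counts)
comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

# Inclusive prefix sum over ranks, then subtract to get counts on lower ranks.
incl_counts = np.empty_like(local_counts)
comm.Scan(local_counts, incl_counts, op=MPI.SUM)
prev_counts = incl_counts - local_counts

# Where each digit's bucket starts in the globally sorted output,
# and where this rank starts writing within each bucket.
bucket_starts = np.concatenate(([0], np.cumsum(global_counts)[:-1]))
my_offsets = bucket_starts + prev_counts
# An MPI_Alltoallv using these offsets would move every key to its new owner.
```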

Project Mentor:  Paddy Ó Conbhuí

Project Co-mentor: /

Site Co-ordinator: Simon Wong

Participant: Jordy Innocentius Ajanohoun

Learning Outcomes:
Solving problems using parallel and recursive decomposition in MPI. Writing clean and reusable C for use in other HPC projects.

Student Prerequisites (compulsory):
Intermediate C, Basic familiarity with MPI.

Student Prerequisites (desirable):

  • Familiarity with recursive algorithms.
  • Familiarity with sorting algorithms.
  • Familiarity with Scan, Reduction, and AllToAll algorithms in MPI.
  • Familiarity with creating sub-communicators in MPI.
  • Experience with Linux.

Training Materials:
C++Now 2017: M. Skarupke “Sorting in less than O(n log n): Generalizing and optimizing radix sort” https://www.youtube.com/watch?v=zqs87a_7zxw

Workplan:

  • Week 1: Training
  • Week 2: Learn about Radix Sort, serial implementations, distributed memory approaches. Set up an initial project, attempt initial implementation.
  • Week 3: Write Plan
  • Week 4: Implement Radix Sort for int8, int16, int32, and int64
  • Week 5: Implement Radix Sort for float, double, and user-defined functions returning a bitwise representation of their data.
  • Week 6: Profile + Optimize implementations.
  • Week 7: Write Report
  • Week 8: Write Report

Final Product Description:
A reusable library with a clean interface that uses Radix Sort to sort a distributed array using MPI.

Adapting the Project: Increasing the Difficulty:
The student could attempt to incorporate it into the ExSeisDat project, or go quite deep into the optimization of the sorting and communications.

Adapting the Project: Decreasing the Difficulty:
The student could stop at implementing int8, or int16, or anywhere along that path, and move on to profiling.

Resources:

  • A desk. There’s one available next to Dr. Ó Conbhuí.
  • A computer. The student should bring their own.
  • Access to a distributed memory machine for profiling the performance of the sorting routine. (Can access to / time on Kay be arranged?)

Organisation:
Irish Centre for High-End Computing

Project reference: 1925

With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). ML is, in a nutshell, “the science of getting computers to act without being explicitly programmed” and is permeating an ever-expanding range of human activities. Deep Learning (DL) is an ML technique used in cars with self-driving features such as the Tesla Autopilot and Mercedes Drive Pilot. DL is at the core of state-of-the-art voice recognition systems, which enable easy control over e.g. Internet-of-Things (IoT) smart home appliances, such as the Google Home and Amazon Echo assistants. Online recommendation engines that suggest which products, movies, etc. you may be interested in are also based on DL.

DL uses multi-layered (deep) artificial neural networks (ANNs) that automatically learn features and representations from (raw) data, and can deliver higher predictive accuracy when trained on larger datasets. The ecosystem of DL frameworks is fast evolving, as are the DL architectures shown to perform well on specialized tasks; for image classification, several recent architectures (models) are in common use.

Training these models is a computationally intensive task, especially on well-known large data sets such as ImageNet and OpenImages v4. To accelerate the training process, many high-profile DL frameworks use GPU accelerators, with distributed training also becoming common.

This project will focus on evaluating and comparing the following reference frameworks currently available:

  • TensorFlow, an open source software library from Google for numerical computation using data flow graphs, thus close to the Deep Learning book way of thinking about neural networks. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization.
  • PyTorch, a Python version of Torch, which currently implements distributed learning through the torch.distributed library, with support for Gloo, NCCL, and MPI.
  • MXNet, Apache library with a large set of interfaces (C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl) and focus on speed and scalability.
  • Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow.
  • Horovod, a distributed training framework for TensorFlow, Keras, and PyTorch from Uber Engineering.

The purpose of the project is to compare the efficiency of different implementations across several hardware architectures (CPU and GPU), taking into account a distributed configuration. Special care will need to be taken to perform the experiments in close to ideal conditions so that the results are not influenced by outside factors (other users’ I/O activity or GPU utilization).
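For orientation, the sketch below shows the general shape of a data-parallel training run with Horovod on top of Keras/TensorFlow, one of the configurations to be benchmarked. The model, dataset and learning-rate scaling are toy placeholders rather than the ImageNet-scale workloads of the project, and exact API details may differ between framework versions.

```python
# Hypothetical minimal sketch of data-parallel training with Horovod + Keras.
# Toy MNIST model and data; launched with one process per GPU (e.g. mpirun).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so that gradients are averaged across processes (allreduce).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

(x, y), _ = tf.keras.datasets.mnist.load_data()
model.fit(x / 255.0, y, batch_size=64, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```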

As part of the project, a facility wrapper will need to be designed to facilitate the integration and usage of these Deep Learning frameworks through the Slurm Workload Manager used on HPC facilities featuring hybrid architectures with NVIDIA V100 GPU accelerators.

Brain-shaped representation of an artificial neural network, showing interlinked neurons depicting distributed computing elements. Visual material license: free for use, no attribution required, source: https://pixabay.com/en/a-i-ai-anatomy-2729794/

Project Mentor: Sebastien Varrette, PhD

Project Co-mentor: Valentin Plugaru

Site Co-ordinator: Prof. Pascal Bouvry

Participant: Sean Mahon

Learning Outcomes:
The student will learn to design and analyse the scalability of parallel and distributed Deep Learning methods using a set of reference frameworks.

The student will learn how to implement these methods on modern computer architectures with latest NVIDIA V100 GPU accelerators.

Student Prerequisites (compulsory):
Knowledge of parallel algorithm design, CUDA, and Python.

Student Prerequisites (desirable):
Advanced knowledge of Python. Basic knowledge of Deep Learning frameworks such as Keras, and more generally of HPC tools. Experience with a Slurm-based scheduler.

Training Materials:

Workplan:

  • Week 1: Training week
  • Week 2: Literature Review, Preliminary Report (plan writing)
  • Week 3–7: Project Development
  • Week 8: Final Report write-up

Final Product Description:
The final product will be a performance analysis report as well as an implementation wrapper facilitating the integration and usage of the different frameworks on top of Slurm-based HPC facilities featuring hybrid architectures with NVIDIA V100 GPU accelerators. Ideally, we would like to publish the results in a conference or workshop paper.

Adapting the Project: Increasing the Difficulty:
The student could contribute to the new Deep500 effort for HPC Deep Learning benchmarking.

Adapting the Project: Decreasing the Difficulty:
We could restrict the number of analysed frameworks; the project as proposed covers a large and representative set of deep learning frameworks.

Resources:
The student will need access to a machine with NVIDIA V100 GPU accelerators, standard computing resources (laptop, internet connection) as well as, if needed, an account on the Iris supercomputer of the University of Luxembourg.

Organisation:
University of Luxembourg

Project reference: 1924

Energy consumption is one of the largest problems faced by modern supercomputing in the race to build Exaflop/s-capable systems. In the past decade, the hardware design focus has shifted from obtaining the best possible performance to improving performance-per-watt, with each new hardware generation bringing better energy efficiency while delivering moderately increased performance. Hardware/software co-design of next-generation supercomputers is now seen as a necessary step in order to reach the Exaflop/s milestone, with systems or modules being purpose-built for specific applications. To enable co-design, the behavioral patterns of existing and under-development applications need to be monitored and understood, with key metrics reported both for human review and for profiling by automated systems.

This project will focus on developing a plugin for the Slurm Workload Management system commonly used to schedule user jobs in HPC centers. The new plugin will generate reports containing energy usage, memory, I/O, and other metrics for the user jobs that request it. Slurm natively provides a generic interface for stackable plugins which may be used to dynamically modify the job launch code in Slurm and interact with the job in the context of its prolog, epilog or task launch [1].

The plugin will extend Slurm’s inbuilt facilities for metrics collection and reporting [2], with a special focus on interaction and integration with external energy consumption measurement frameworks and performance tools/interfaces such as LIKWID [3] and PAPI [4]. Energy usage metrics will be extracted at the node level using IPMI [5], and at a more fine-grained level for Intel CPUs with RAPL [5] and NVIDIA GPUs using NVML [6].
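As a rough sketch of such fine-grained readings, the hypothetical Python snippet below samples the Intel RAPL package energy counter through the Linux powercap sysfs interface and queries GPU power through NVML (via pynvml). The sysfs path and device index are assumptions, reading the counter may require elevated permissions, and the actual plugin would of course use Slurm's C plugin interfaces rather than Python.

```python
# Hypothetical sketch: node-level energy readings. The RAPL sysfs path and the
# GPU index are assumptions; reading the counter may require extra permissions.
import time
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetPowerUsage

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"   # CPU package 0

def read_rapl_uj():
    """Return the cumulative package energy counter in microjoules."""
    with open(RAPL_ENERGY) as f:
        return int(f.read())

nvmlInit()
gpu = nvmlDeviceGetHandleByIndex(0)

e0, t0 = read_rapl_uj(), time.time()
time.sleep(1.0)                                    # sampling interval
e1, t1 = read_rapl_uj(), time.time()

cpu_watts = (e1 - e0) / 1e6 / (t1 - t0)            # microjoules -> watts
gpu_watts = nvmlDeviceGetPowerUsage(gpu) / 1000.0  # NVML reports milliwatts
print(f"CPU package: {cpu_watts:.1f} W, GPU 0: {gpu_watts:.1f} W")
```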

Metrics reporting will be generated for both human and automated systems consumption, with an emphasis on ease of use and readability.

[1] Slurm Plug-in Architecture for Node and job (K)control: https://slurm.schedmd.com/spank.html

[2] Slurm accounting data: https://slurm.schedmd.com/sacct.html

[3] Likwid Performance Tools: http://hpc.fau.de/research/tools/likwid

[4] Performance Application Programming Interface: http://icl.utk.edu/papi

[5] Slurm Energy Accounting Plugin API: https://slurm.schedmd.com/acct_gather_energy_plugins.html

[6] NVIDIA Management Library: https://developer.nvidia.com/nvidia-management-library-nvml

Macro shot of a computer chip silicon wafer. CPU and GPU chips are among a supercomputer’s top energy consumers.
Visual material license: free for use, no attribution required, source: https://unsplash.com/photos/nIEHqGSymRU

Project Mentor: Valentin Plugaru

Project Co-mentor: Sebastien Varrette

Site Co-ordinator: Prof. Pascal Bouvry

Participant: Matteo Stringher

Learning Outcomes:
The student will interact deeply with the resource management system powering HPC systems. He or she will understand serial and parallel job executions, key performance indicators, metrics collection and efficiency reporting. He or she will have access to historic job traces and will be able to build and test code in an HPC environment.

Student Prerequisites (compulsory):

  • Background in software development and algorithms/data structures fundamentals.
  • C/C++ programming experience.

Student Prerequisites (desirable):

  • Advanced knowledge of C/C++.
  • Experience with Slurm scheduler.

Training Materials:

Workplan:

  • Week 1: Training week
  • Week 2: Literature Review, Preliminary Report (plan writing)
  • Week 3–7: Project Development
  • Week 8: Final Report write-up

Final Product Description:
The final product will be a Slurm metrics collection and reporting plugin together with an implementation report including the output of test executions of real scientific codes run with the plugin activated.

High-quality work will be released as an open-source contribution to the community and possibly as a conference or workshop article.

Adapting the Project: Increasing the Difficulty:
The developed plugin could be extended to interact with the task (Slurm step) launch and wrap around (serial/parallel) user applications for deep inspection and reporting on performance counters, and to write time-series data in HDF5 format.

Adapting the Project: Decreasing the Difficulty:
The developed plugin could be limited to using only the existing metrics collected by Slurm and generating an end-of-job report in a text format.

Resources:
The student will need access to standard computing resources (laptop able to run virtualization systems, internet connection) as well as an account on the Iris supercomputer of the University of Luxembourg.

Organisation:
University of Luxembourg

Project reference: 1923

The project will consist of:

  • Getting to know Hadoop and RHadoop;
  • Defining a big data source related to Industry 4.0;
  • Creating and storing big data (BD) files;
  • Preparing the BD for basic analysis;
  • Defining a predictive model and writing RHadoop code for building this model;
  • Evaluating and applying the developed model on new data.

The student will create a big data file, store it in a distributed file system (DFS), perform basic analysis, and build a predictive model for new data using RHadoop.

Project Mentor: Prof. Janez Povh, PhD

Project Co-mentor: MSc. Timotej Hrga

Site Co-ordinator: Doc. Dr. Leon Kos

Participant: Khyati Sethia

Learning Outcomes:

Student Prerequisites (compulsory):

  • Basics of data management;
  • R language;
  • Basics of regression and classification.

Student Prerequisites (desirable):
Basics of Hadoop.

Training Materials:
The candidate should go through PRACE MOOC:

https://www.futurelearn.com/admin/courses/big-data-r-hadoop/4

Workplan:

  • W1: Introductory week;
  • W2: Creating and storing an industrial big data file;
  • W3: Coding and evaluating scripts for analysis;
  • W4-W5: Coding and evaluating scripts for analysis;
  • W6: Preparing materials for the MOOC entitled MANAGING BIG DATA WITH R AND HADOOP;
  • W7: Final report;
  • W8: Wrap up.

Final Product Description:

  • Industrial big data files retrieved and stored in Hadoop;
  • RHadoop scripts created for analysis and new prediction models;
  • A report on this example, to be used in the PRACE MOOC entitled MANAGING BIG DATA WITH R AND HADOOP.

Adapting the Project: Increasing the Difficulty:
We can increase the size of the data or add a more demanding visualization task.

Adapting the Project: Decreasing the Difficulty:
We can decrease the size of the data or simplify the prediction model.

Resources:
RHadoop installation at the University of Ljubljana, Faculty of Mechanical Engineering.

Organisation:
University of Ljubljana

Project reference: 1922

Gyrokinetic simulations are essential for a better understanding of plasma turbulence. For that purpose, a variety of non-linear gyrokinetic codes are being used and further developed, such as GENE, GYRO, ELMFIRE etc. These codes differ in numerics, physics, parallel scalability, and public availability. The emphasis in this project is on the GENE code.

GENE (Gyrokinetic Electromagnetic Numerical Experiment) is an open-source code designed for numerical investigations of plasma microturbulence. It is freely available at http://www.genecode.org and is being further developed by an international collaboration. GENE allows efficient computation of gyroradius-scale fluctuations and the resulting transport coefficients in magnetized fusion/astrophysical plasmas. GENE has been used, among other things, to address fundamental issues in plasma turbulence research and to perform comparisons with tokamak and stellarator experiments.

The code can be run on a large number of different computer architectures, including Linux HPC clusters and various massively parallel systems, using anything from a single processor to tens of thousands of processors. It is written in Fortran 2008.

For easier user access to the GENE code, a graphical user interface, written in Python 2.6 using the Tkinter library, is available. It allows convenient reading, editing and writing of ’parameter’ files, while providing consistency checks to prevent common errors.

The code package also includes custom IDL-based plotting utilities for data visualization and analysis, and a Python user interface for setting up simulations. The IDL tool for data visualization and analysis is commercial software (https://www.harrisgeospatial.com/Software-Technology/IDL) and thus not publicly available. No other tools for visualising GENE results are known. This is inconvenient for GENE users who do not have access to IDL software and are in need of open-source solutions.

The aim of this project is to reproduce GENE benchmark cases and provide a set of visualization utilities for the output data using Python (version 3.6 or higher), an open-source programming language. This is to be achieved by developing interactive Graphical User Interfaces (GUIs) with the Python library PyQt5 (https://pypi.org/project/PyQt5/), with the accompanying plotting utilities developed using pyqtgraph (http://www.pyqtgraph.org/), a graphics and user interface library for Python that provides functionality commonly required in engineering and science applications. Its primary goals are a) to provide fast, interactive graphics for displaying data (plots, video, etc.) and b) to provide tools to aid in rapid application development.
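As a minimal sketch of the kind of GUI envisaged, the following PyQt5/pyqtgraph example opens a window with a plot area and a button that draws placeholder data; the window title, widget layout and the demo data stand in for a real GENE output parser.

```python
# Minimal sketch of a PyQt5 + pyqtgraph GUI. The data shown is a placeholder;
# real GENE output would be read with a dedicated parser.
import sys
import numpy as np
from PyQt5 import QtWidgets
import pyqtgraph as pg

class GeneViewer(QtWidgets.QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("GENE output viewer (sketch)")
        self.plot = pg.PlotWidget(title="Placeholder time trace")
        self.setCentralWidget(self.plot)

        button = QtWidgets.QPushButton("Load demo data", self)
        button.clicked.connect(self.load_data)
        toolbar = self.addToolBar("Data")
        toolbar.addWidget(button)

    def load_data(self):
        # Placeholder for reading a GENE diagnostic file.
        t = np.linspace(0.0, 10.0, 500)
        self.plot.plot(t, np.exp(-t / 5.0) * np.sin(5 * t), clear=True, pen="y")

if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    viewer = GeneViewer()
    viewer.show()
    sys.exit(app.exec_())
```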

Example of GENE: a) data visualization GUI using IDL (bottom left),  b) simulation results analysis (bottom right) and c) simulation of turbulent fluctuations in an actual tokamak discharge (top right) (Note: Source of the top right image: http://www.genecode.org/).

Project Mentor: MSc. Dejan Penko

Project Co-mentor: Doc. Dr. Leon Kos

Site Co-ordinator: Doc. Dr. Leon Kos

Participant: Arsenios Chatzigeorgiou

Learning Outcomes:
The student will obtain/improve their skills and knowledge in the use of:

  • Python 3.x (and possibly Fortran90 – 2008) programming language
  • Linux OS
  • GIT version control system
  • Makefiles
  • visualization utilities
  • graphical user interface (GUI) development
  • HPC
  • basics of fusion physics
  • IDL data visualization tool
  • document generation utilities (Sphinx, ReStructured Text)

Student Prerequisites (compulsory):

  • Intermediate programming skills in Python (version 3.x)
  • OOP (Object-oriented Programming)

Student Prerequisites (desirable):

Familiar with:

  • Linux OS
  • GIT version control system
  • Basic programming skills in Fortran
  • Python PyQt5 library

Training Materials:

PyQt5:

pyqtgraph:

GENE:

  • Videos:
  • Articles:
  • The global version of the gyrokinetic turbulence code GENE: https://www.sciencedirect.com/science/article/pii/S0021999111003457

Workplan:

  • W1: Introductory week
  • W2: Learning GENE (creating and running a case, analysing results using IDL, etc.)
  • W3-5: GUI and plotting utility development
  • W6: Project finalization
  • W7: Final report and video recording
  • W8: Wrap up

Final Product Description:

  • An interactive graphical user interface for GENE data visualization and analysis
  • Full technical report, including tutorial (including in video form), manual etc.
  • Visualization and brief analysis of a GENE computation case (should be included in the tutorial too)

Adapting the Project: Increasing the Difficulty:
Involve hierarchical tree data-structures for archiving and retrieval of GENE related data (parameters, calculation results etc.)

Adapting the Project: Decreasing the Difficulty:
Skip GUI development. Make simple plotting scripts in Python3.

Resources:

  • It is recommended for the student to bring his own laptop.
  • HPC cluster at the University of Ljubljana, Faculty of Mechanical Engineering, and other available HPCs.

Organisation:
University of Ljubljana

 

Project reference: 1919

Simulations of classical or quantum field theories often rely on a lattice discretized version of the underlying theory. For example, simulations of Lattice Quantum Chromodynamics (QCD, the theory of quarks and gluons) are used to study properties of strongly interacting matter and can, e.g., be used to calculate properties of the quark-gluon plasma, a phase of matter that existed a few milliseconds after the Big Bang (at temperatures larger than a trillion degrees Celsius). Such simulations take up a large fraction of the available supercomputing resources worldwide.

Other theories have a lattice structure already “built in”, as is the case for graphene with its famous honeycomb structure. Simulations studying this material can build on the experience gathered in Lattice QCD. These simulations require, e.g., the repeated solution of extremely sparse linear systems and the updating of their degrees of freedom using symplectic integrators.
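As a toy illustration of the "sparse linear system" part of such workloads, the sketch below assembles a simple sparse lattice operator (a 2D Laplacian-like matrix with a mass term, open boundaries) and solves it with conjugate gradients via scipy; it is a stand-in only, since the actual fermion matrices and solvers in Lattice QCD or graphene codes are far more involved.

```python
# Illustrative only: a sparse lattice operator solved with conjugate gradients,
# standing in for the far more complicated fermion matrices of real simulations.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

L = 64                           # lattice extent per dimension
n = L * L

# Nearest-neighbour hopping on an L x L lattice (open boundaries) plus a mass
# term, giving a sparse, symmetric, diagonally dominant matrix.
hop1 = sp.diags([1.0, 1.0], [1, -1], shape=(n, n))
hopL = sp.diags([1.0, 1.0], [L, -L], shape=(n, n))
A = (4.1 * sp.eye(n) - hop1 - hopL).tocsr()

b = np.random.default_rng(0).standard_normal(n)    # source vector
x, info = spla.cg(A, b)                            # iterative Krylov solve
print("cg info:", info, " residual:", np.linalg.norm(A @ x - b))
```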

Depending on personal preference, the student can decide to work on graphene or on Lattice QCD. He/she will be involved in tuning and scaling the most critical parts of a specific method, or in attempting to optimize for a specific architecture in the algorithm space.

In the former case, the student can select among different target architectures, ranging from Intel Xeon Phi (KNL) and Intel Xeon (Haswell/Skylake) to GPUs (OpenPOWER), which are available in different installations at the institute. To that end, he/she will benchmark the method and identify the relevant kernels. He/she will analyse the performance of the kernels, identify performance bottlenecks, and develop strategies to solve these – if possible taking similarities between the target architectures (such as SIMD vectors) into account. He/she will optimize the kernels and document the steps taken in the optimization as well as the performance results achieved.

In the latter case, the student will, after getting familiar with the architectures, explore different methods by either implementing them or using those that have already been implemented. He/she will explore how the algorithmic properties match the hardware capabilities. He/she will test the achieved total performance, and study bottlenecks e.g. using profiling tools. He/she will then test the method at different scales and document the findings.

In any case, the student is embedded in an extended infrastructure of hardware, computing, and benchmarking experts at the institute.

QCD & HPC

Project Mentor:  Dr. Stefan Krieg

Project Co-mentor: Dr. Eric Gregory

Site Co-ordinator: Ivo Kabadshow

Participant: Andreas Nikolaidis

Learning Outcomes:
The student will familiarize himself with important new HPC architectures, such as Intel Xeon, OpenPOWER or other accelerated architectures. He/she will learn how the hardware functions on a low level and use this knowledge to devise optimal software and algorithms. He/she will use state-of-the-art benchmarking tools to achieve optimal performance.

Student Prerequisites (compulsory):
Programming experience in C/C++

Student Prerequisites (desirable):

  • Knowledge of computer architectures
  • Basic knowledge on numerical methods
  • Basic knowledge on benchmarking

Training Materials:

Workplan:
Week – Work package

  1. Training and introduction
  2. Introduction to architectures
  3. Introductory problems
  4. Introduction to methods
  5. Optimization and benchmarking, documentation
  6. Optimization and benchmarking, documentation
  7. Optimization and benchmarking, documentation

  8. Generation of final performance results. Preparation of plots/figures. Submission of results.

Final Product Description:
The end product will be a student educated in the basics of HPC, optimized methods/algorithms or HPC software.

Adapting the Project: Increasing the Difficulty:
The student can choose to work on a more complicated algorithm or aim to optimize a kernel using more low level (“down to the metal”) techniques.

Adapting the Project: Decreasing the Difficulty:
A student who finds the task of optimizing a complex kernel too challenging could restrict himself/herself to simple or toy kernels, in order to have a learning experience. Alternatively, if the student finds a particular method too complex for the time available, a less involved algorithm can be selected.

Resources:
The student will have his own desk in an open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access (and computation time) on the required HPC hardware for the project and have his own workplace with fully equipped workstation for the time of the program. A range of performance and benchmarking tools are available on site and can be used within the project. No further resources are required.

Organisation:
Jülich Supercomputing Centre


Project reference: 1918

Today’s supercomputing hardware provides a tremendous amount of floating point operations (FLOPs). While CPUs are designed to minimize the latency of a stream of individual operations, GPUs try to maximize the throughput. However, GPU FLOPs can only be harvested easily if the algorithm exhibits lots of independent data parallelism. Hierarchical algorithms like the Fast Multipole Method (FMM) inhibit the utilization of all available FLOPs on GPUs due to their inherent data dependencies and limited independent data parallelism.

Is it possible to circumvent these problems?

In this project, we turn our efforts towards a fully taskified FMM for GPUs. Depending on your interest, we will pursue different goals. First, the already available and taskified CPU version of the FMM can be adapted to support basic tasking on the GPU and speed up the execution. Second, a special hierarchical extension of our GPU tasking queue can be implemented and tested to achieve even higher performance.

The challenge of both assignments is to execute our small compute kernels without large overheads within the tasking framework. This also ensures portability between different generations/designs of modern GPUs.

What is the Fast Multipole Method? The FMM is a Coulomb solver that allows the computation of long-range forces for molecular dynamics codes, e.g. GROMACS. A straightforward approach is limited to small particle numbers N due to its O(N^2) scaling. Fast summation methods such as PME, multigrid or the FMM are capable of reducing the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method has auxiliary parameters, data structures and memory requirements which need to be provided. The layout and implementation of such algorithms on modern hardware strongly depend on the available features of the underlying architecture.
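For context, the sketch below shows the O(N^2) baseline that fast summation methods replace: a direct pairwise Coulomb sum over point charges (open boundaries, unit constants). It is a plain numpy illustration, not the tasked C++/CUDA implementation the project works on.

```python
# Illustration of the O(N^2) baseline the FMM replaces: a direct pairwise
# Coulomb sum (open boundaries, unit constants).
import numpy as np

def direct_coulomb_potential(positions, charges):
    """Potential energy of a point-charge system by direct summation."""
    n = len(charges)
    energy = 0.0
    for i in range(n):                        # every pair is visited once
        r = np.linalg.norm(positions[i + 1:] - positions[i], axis=1)
        energy += np.sum(charges[i] * charges[i + 1:] / r)
    return energy

rng = np.random.default_rng(1)
pos = rng.random((2000, 3))
q = rng.choice([-1.0, 1.0], size=2000)
print(direct_coulomb_potential(pos, q))       # cost grows as N^2 with particle count
```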

The assumed workplace of a 2019 PRACE student at JSC.

Project Mentor:  Andreas Beckmann

Project Co-mentor: Ivo Kabadshow

Site Co-ordinator: Ivo Kabadshow

Participant: Noë Brucy-Ciaramella

Learning Outcomes:
The student will familiarize himself with current state-of-the-art GPUs (Nvidia V100/AMD Vega 64). He/she will learn how the GPU/accelerator functions on a low level and use this knowledge to utilize/extend a tasking framework for GPUs in a modern C++ code-base. He/she will use state-of-the-art benchmarking/profiling tools to test and improve performance for the tasking framework and its compute kernels which are time-critical in the application.

Student Prerequisites (compulsory):
Prerequisites

  • At least 5 years of programming experience in C++
  • Basic understanding of template metaprogramming
  • “Extra-mile” mentality

Student Prerequisites (desirable):

  • CUDA or general GPU knowledge desirable, but not required
  • C++ template metaprogramming
  • Interest in C++11/14/17 features
  • Interest in low-level performance optimizations
  • Ideally, a student of computer science or mathematics, but this is not required
  • Basic knowledge of benchmarking, numerical methods
  • Mild coffee addiction
  • Basic knowledge of git, LaTeX, TikZ

Training Materials:
Just send an email … training material strongly depends on your personal level of knowledge. We can provide early access to the GPU cluster as well as technical reports from former students on the topic. If you feel unsure about the requirements, but do like the project, send an email to the mentor and ask for a small programming exercise.

Workplan:
Week – Work package

  1. Training and introduction to FMMs and GPU hardware
  2. Benchmarking of current tasking variants on the CPU
  3. Adding basic queues in the tasking framework
  4. Extending basic queues to support different compute kernels
  5. Adding hierarchical queues in the tasking framework
  6. Performance tuning of the GPU tasking
  7. Optimization and benchmarking, documentation
  8. Generation of final performance results. Preparation of plots/figures. Submission of results.

Final Product Description:
The final result will be a taskified FMM code with CUDA or SYCL to support GPUs. The benchmarking results, especially the gain in performance, can be easily illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.

Adapting the Project: Increasing the Difficulty:
The tasking framework uses different compute kernels. For example, it may or may not be required to provide support for a certain FMM operator. A particularly able student may also apply the GPU tasking to multiple compute kernels. Depending on the knowledge level, a larger number of access/storage strategies can be ported/extended or performance optimization within CUDA/SYCL can be intensified.

Adapting the Project: Decreasing the Difficulty:
As explained above, a student that finds the task of adapting/optimizing the tasking to all compute kernels too challenging could very well restrict himself to a simpler model or partial tasking.

Resources:
The student will have his own desk in an air-conditioned open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access (and computation time) on the required HPC resources for the project and have his own workplace with a fully equipped workstation for the time of the program. A range of performance and benchmarking tools are available on site and can be used within the project. No further resources are required. Hint: We do have experts on all advanced topics, e.g. C++11/14/17, CUDA in house. Hence, the student will be supported when battling with ‘bleeding-edge’ technology.

Organisation:
Jülich Supercomputing Centre


Project reference: 1917

Deep neural networks (DNNs) are evaluated and used in many areas today, to replace or complement traditional technologies. In this project, the student will develop a DNN to detect and localize selected objects in images. The work comprises the selection of the right network topology and the training and validation strategy, as well as collecting and pre-processing training and validation data. The project does not stop with the training of the model in the data center but considers the complete lifecycle, including deployment and optimization. An application should demonstrate the correct object detection of the trained DNN on an edge device with limited computing resources (e.g. Intel’s Movidius Neural Compute Stick). The final outcome should be the detection of a single type of object in a 720p video stream at ca. 24 fps (real time).

Training of the DNN will take place on NVIDIA V100 GPUs (e.g. an NVIDIA DGX-2 system) with TensorFlow as the deep learning framework. Depending on the selected edge device, deployment and optimization of the DNN will be done with either the NVIDIA TensorRT or Intel OpenVINO toolkits. As a stretch goal, for better portability, the final DNN should use the ONNX model interchange format.
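As a hypothetical sketch of the final throughput check, the snippet below loads an exported TensorFlow SavedModel and measures frames per second on a 720p video file using OpenCV; the model path, input signature name and video file are placeholders, and the deployed version would instead run through TensorRT or OpenVINO.

```python
# Hypothetical throughput check: run a trained detector over a 720p stream and
# report frames per second. Model path, signature input name and video file
# are placeholders; deployment would go through TensorRT or OpenVINO instead.
import time
import cv2
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("exported_detector/saved_model")   # placeholder path
infer = model.signatures["serving_default"]

cap = cv2.VideoCapture("test_720p.mp4")            # placeholder 1280x720 stream
frames, start = 0, time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    batch = tf.convert_to_tensor(frame[np.newaxis, ...], dtype=tf.uint8)
    _ = infer(input_tensor=batch)                  # input name depends on the export
    frames += 1
cap.release()
print(f"{frames / (time.time() - start):.1f} frames per second")
```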

Training of a Deep Neural Network over multiple epochs to decrease loss and increase accuracy [illustration]

Project Mentor: Georg Zitzlsberger, M.Sc.

Project Co-mentor: Martin Golasowski

Site Co-ordinator: Karina Pešatová, Martina Kovářová

Participant: Pablo Lluch Romero

Learning Outcomes:
The student will be guided through the principles of DNNs, state-of-the-art networks and Google’s TensorFlow. With the project, the student also gets accustomed to the latest NVIDIA technologies around the V100 that will be used for training, and to different deployment and optimization toolkits.

Student Prerequisites (compulsory):

  • Knowledge in Python is needed to create the DNN.
  • Knowledge of Linux as the development environment.
  • The project includes training for the student on mandatory DNN basics (training material from NVIDIA and Intel is used).

Student Prerequisites (desirable):
TensorFlow, NVIDIA TensorRT, Intel OpenVINO

Training Materials:
Introduction to:

Workplan:

  • 1st  week: training week.
  • 2nd week: internal student training on DNN technologies and networks incl. first exercises on TensorFlow
  • 3rd week: Selection of network, training strategy and identification of training data
  • 4th week: Training and validation of DNN on NVIDIA V100 (finding right hyper-parameters)
  • 5th week: Deployment to the edge device
  • 6th week: Demoing final application of DNN on edge device (e.g. via video stream)
  • 7th week: Optimization for edge device (e.g. TensorRT/OpenVino) to improve the DNN performance for the edge device
  • 8th week: Discussing full processing pipeline (HPC to edge) with quality analysis

Final Product Description:
An environment (processing pipeline) will be created to train a DNN for detecting and localizing a specific object on the HPC side and to optimize and deploy it to an edge device for use.

Adapting the Project: Increasing the Difficulty:
Retrain the DNN to detect more different types of objects.

Adapting the Project: Decreasing the Difficulty:
The exact type and number of objects to detect will be selected by the student depending on his/her skills. Also, the deployment to the edge is variable in terms of which device to use. Using ONNX is a stretch-goal that might be sacrificed to favor faster solutions vs. using standards.

Resources:
Access to all hardware mentioned in the proposal (IT4I’s clusters, DGX-2 system and Movidius Neural Compute Stick)

All software mentioned in the proposal will be made available to the student.

Organisation:
IT4Innovations National Supercomputing Center at VSB – Technical University of Ostrava

Project reference: 1916

The objective of this project is to present the scalability of CFD simulations on IT4Innovations parallel platforms and to calculate the external aerodynamics around the VSB-TUO Formula Student car (lift and drag coefficient calculation). Formula Student is a student engineering competition. Student teams from around the world design, build, test, and race a small-scale formula-style racing car. The cars are judged on a number of criteria, among them the engineering design and the quality of the computational simulations.

This project will be focused on the application of the Open Source Field Operation and Manipulation (OpenFOAM) C++ libraries for solving engineering problems on parallel platforms. OpenFOAM is a free, open-source CFD software package developed by OpenCFD Ltd. at the ESI Group and distributed by the OpenFOAM Foundation. It has a large user base across most areas of engineering and science, from both commercial and academic organisations. The parallel scalability of OpenFOAM with various setups of the linear solvers implemented in OpenFOAM (PCG, GAMG…) will be tested to find the optimal solver settings for the IT4I clusters. For the evaluation of the lift and drag coefficients, a variety of turbulence models will be tested.
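One way the scalability results could be collected is sketched below: a small Python helper that extracts the final "ExecutionTime" line that OpenFOAM solvers print in their logs for runs on different core counts and reports the speedup. The log file names and core counts are placeholders.

```python
# Hypothetical helper for the scalability study: extract the last
# "ExecutionTime = ... s" line from OpenFOAM solver logs for runs on
# different core counts and report the speedup. File names are placeholders.
import re

def final_execution_time(log_path):
    """Return the last reported ExecutionTime (seconds) from a solver log."""
    last = None
    pattern = re.compile(r"ExecutionTime\s*=\s*([0-9.]+)\s*s")
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                last = float(m.group(1))
    return last

runs = {24: "log.simpleFoam.24", 48: "log.simpleFoam.48", 96: "log.simpleFoam.96"}
times = {cores: final_execution_time(path) for cores, path in runs.items()}
base_cores = min(times)
for cores in sorted(times):
    speedup = times[base_cores] / times[cores]
    print(f"{cores:4d} cores: {times[cores]:8.1f} s  speedup {speedup:5.2f}")
```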

Visualization of the CFD results, pressure and velocity streamlines

Project Mentor: Ing. Tomáš Brzobohatý, Ph.D.

Project Co-mentor: Ing. Marek Gebauer, Ph.D.

Site Co-ordinator: Karina Pešatová, Martina Kovářová

Participant: David Izquierdo

Learning Outcomes:
Parallel CFD simulations of car external aerodynamics based on an open-source solution.

Student Prerequisites (compulsory):
Basic knowledge of Linux OS

Student Prerequisites (desirable):

  • Basic knowledge of any CAD-type software
  • Basic knowledge of CFD simulation in any open source or commercial codes
  • Basic awareness of Linux shell scripting
  • Basic knowledge of C++

Training Materials:

Workplan:

  • Create a computational mesh in OpenFOAM using the snappyHexMesh utility
  • CFD steady-state simulation of the external aerodynamics in parallel using OpenFOAM
  • Comparison of different turbulence models

Final Product Description:
Formula student external aerodynamics CFD simulation with parallel scalability report.

Adapting the Project: Increasing the Difficulty:
Programming a new boundary condition in OpenFOAM, and creating an automated meshing and solver utility according to the Formula Student team’s needs.

Adapting the Project: Decreasing the Difficulty:
Work with one turbulence model only.

Resources:

  • CAD software, Autodesk Inventor, etc. – student version
  • Linux – free
  • OpenFOAM – open source
  • ParaView – open source
  • Salomon supercomputer – the student will be given a login to IT4I facilities

Organisation:
IT4Innovations National Supercomputing Center at VSB – Technical University of Ostrava

Project reference: 1914

Deep learning is an algorithmic technique that has found rapid adoption and application in a wide range of domains. In the deep learning workflow, inference is one of the final steps, where the trained models are deployed in the real world. For many applications, deep learning models are often deployed on the edge, i.e. on platforms that are remotely located. Furthermore, the inference devices may operate in low power envelopes and may not be readily accessible for updating the models deployed on them. Examples of such remote inference devices may be found in industrial installations, solar/wind farms, weather monitoring stations or even satellites (in the future). Therefore, it is essential that a method is developed to dynamically switch the deep learning models on the inference devices remotely.

The Intel Movidius Neural Compute Stick (NCS) is a popular and powerful inference platform. The NCS is composed of a high-performance, energy-efficient inference processor (Myriad) packaged in a USB stick form factor. The Myriad processor is capable of real-time inference with a power envelope of less than 1 W. The NCS can be programmed using the Intel Movidius NCS Development Kit (NCSDK) or Intel OpenVINO, primarily using Python.

In this project, we will develop a workflow and the core modules of an application that will offer the flexibility to dynamically switch deep learning models that are deployed on the Intel Movidius NCS. We will use pre-trained models that are available off-the-shelf. Models that are available include those for gender prediction from hand images, facial expression recognition and gesture recognition from hand images. The NCS will be interfaced with a Raspberry Pi kit, and the solution will be packaged into a publicly demonstrable end-user application.
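A hypothetical sketch of the model-switching idea is shown below, using OpenVINO's Python inference engine API to load one pre-trained model onto the Myriad device and later replace it with another. The model file names are placeholders, and method names such as read_network vary between OpenVINO releases.

```python
# Hypothetical model switching on the NCS via OpenVINO's Python API.
# Model file names are placeholders; API details differ between releases.
from openvino.inference_engine import IECore

MODELS = {
    "gesture": ("gesture.xml", "gesture.bin"),            # placeholder IR files
    "expression": ("expression.xml", "expression.bin"),
}

ie = IECore()

def load_on_ncs(name):
    """Read a model's IR files and load it onto the Myriad (NCS) device."""
    xml, weights = MODELS[name]
    net = ie.read_network(model=xml, weights=weights)
    return ie.load_network(network=net, device_name="MYRIAD")

# Dynamic switching: drop the reference to the old ExecutableNetwork and
# load the newly requested model.
active = load_on_ncs("gesture")
# ... run active.infer(...) on incoming frames ...
active = load_on_ncs("expression")     # switch models on request
```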

A complete graph (publicly available image labeled for reuse).

Project Mentor: Venkatesh Kannan

Project Co-mentor: Myles Doyle

Site Co-ordinator: Simon Wong

Participant: Igor Kunjavskij

Learning Outcomes:

  • Performing deep learning inference on the Intel Movidius Neural Computer Stick platform.
  • Creating a publicly demonstrable end-user application with existing deep learning models.
  • Exposure to larger projects in which the techniques and results of this project are integral.

Student Prerequisites (compulsory):

  • Experience with Linux.
  • Programming with Python.

Student Prerequisites (desirable):

  • Basic knowledge of deep learning.
  • Knowledge of working with Raspberry Pi.
  • Creation of Docker images.

Training Materials:

  1. Intel Movidius Neural Compute Stick          https://software.intel.com/en-us/movidius-ncs
  2. Intel OpenVINO and Movidius Neural Compute SDK (NCSDK) https://software.intel.com/en-us/articles/transitioning-from-intel-movidius-neural-compute-sdk-to-openvino-toolkit

Workplan:

  • Week 1: Induction, an overview of the project, introduction to relevant deep learning concepts, hardware, and software.
  • Week 2: Example exercises for deep learning inference on the Intel Movidius Neural Compute Stick (NCS).
  • Week 3: Design implementation methodology and write a project plan.
  • Week 4: Implement an application to use a single deep learning model for inference on the NCS.
  • Week 5: Extend the application to dynamically switch deep learning models on the NCS.
  • Week 6: Package the application for portability (for example, Docker image) with publicly demonstrable user-interface.
  • Week 7: Documentation of application and report writing.
  • Week 8: Report writing and demonstration to ICHEC researchers.

Final Product Description:
A workflow and an end-user application that performs deep learning inference on the Intel Movidius Neural Compute Stick to dynamically switch the deep learning models.

Adapting the Project: Increasing the Difficulty:
The student could update/transmit the model to be loaded on the Intel Movidius NCS using a WiFi module on the Raspberry Pi kit instead of wired connectivity.

Adapting the Project: Decreasing the Difficulty:
The student could use a laptop/desktop to interface with the Intel Movidius NCS instead of the Raspberry Pi kit.

Resources:

  1. Intel Movidius Neural Compute Stick (available at ICHEC)
  2. Laptop (to be brought by the student)
  3. Deep learning models and data sets (publicly available for use)
  4. Optional Raspberry Pi kit (available at ICHEC)

Organisation:
Irish Centre for High-End Computing

Project reference: 1913

Dissipative Particle Dynamics (DPD) is a stochastic particle method for mesoscale simulations of complex fluids. It is based on a coarse approximation of the molecular structure of soft materials, with beads that can represent large agglomerates or unions of simpler molecules such as water. This approach avoids the extremely small time and length scales of classical Molecular Dynamics solvers while retaining the intrinsic discrete nature of matter. However, realistic applications often require a very large number of beads to correctly simulate the physics involved, hence the need to scale to very large systems using the latest hybrid CPU-GPU architectures.

The focus of this work will be on benchmarking and optimizing, on novel supercomputers, an existing multi-GPU version of the DL_MESO (DPD) code, a discrete particle solver for the mesoscale simulation of complex fluids. Currently, it has been tested on up to 2048 GPUs and needs further development for good scaling on larger systems as well as for improving its performance per single GPU. Moreover, the code has been tested only on simple cases, like binary fluid mixture separation, and needs a robust evaluation with realistic applications, like phase separation, solute diffusion, and interactions between polymers. These often require extra features, like Fast Fourier Transform algorithms, which are currently not implemented and usually represent a main challenge for scalability on novel architectures.

The student’s minimum task will be to benchmark the current version, address the current factors limiting scaling on large supercomputers, and run performance analyses to identify possible bottlenecks and corresponding solutions to improve speedup. Depending on his/her experience, further improvements on the HPC side as well as new features for complex physics could be added. In particular, a plasma of electrically charged particles will be used as a benchmark, for which Ewald-summation-based methods, like the Smooth Particle Mesh Ewald, have to be implemented.

Project Mentor: Jony Castagna

Project Co-mentor: Vassil Alexandrov

Site Co-ordinator: Luke Mason

Participant: Davide Di Giusto

Learning Outcomes:
The student will learn to benchmark, profile and modify multi-GPU code, written mainly in Fortran and CUDA, which follows a typical domain decomposition implemented using MPI libraries. S/he will also gain a basic understanding of the DPD methodology and its impact on mesoscale simulations. The student will also gain familiarity with proper software development procedures, using version control software, an IDE, and tools for parallel profiling on GPUs.

Student Prerequisites (compulsory):
Good knowledge of Fortran, MPI and CUDA programming is required, as well as of parallel programming for distributed memory.

Student Prerequisites (desirable):
Some skill in developing mixed code such as Fortran/CUDA will be an advantage, as will experience in multi-GPU programming using CUDA/MPI.

Training Materials:
These can be tailored to the student once he/she is selected.

Workplan:

  • Week 1: Training week
  • Week 2: Literature Review, Preliminary Report (plan writing)
  • Week 3–7: Project Development
  • Week 8: Final Report write-up

Final Product Description:
The final product will be an internal report, convertible to a conference or journal paper, with benchmarks and a comparison of the improved multi-GPU version of DL_MESO.

Adapting the Project: Increasing the Difficulty:
The project is at the appropriate cognitive level, taking into account the timeframe and the need to submit a final working product and one report.

Adapting the Project: Decreasing the Difficulty:
The topic will be researched and the final product will be designed in full but some of the features may not be developed to ensure a working product with some limited features at the end of the project.

Resources:
The student will need access to multi GPUs machines, standard computing resources (laptop, internet connection).

Organisation:
Hartree Centre – STFC


Project reference: 1912

The focus of this project will be on further enhancing hybrid (e.g. stochastic/deterministic) methods for Linear Algebra by adding Deep Learning techniques. The focus is on hybrid Monte Carlo methods and algorithms for matrix inversion and for solving systems of linear algebraic equations. Recent developments have led to effective approaches based on building an efficient stochastic preconditioner and then solving the corresponding System of Linear Algebraic Equations (SLAE) by applying an iterative method. The preconditioner is a Monte Carlo preconditioner based on Markov Chain Monte Carlo (MCMC) methods that computes a rough approximate matrix inverse first. This Monte Carlo preconditioner is then used to solve systems of linear algebraic equations, thus delivering hybrid stochastic/deterministic algorithms. The advantage of the proposed approach is that the sparse Monte Carlo matrix inversion has a computational complexity linear in the size of the matrix. Current implementations are either pure MPI or mixed MPI/OpenMP ones. There is also a version running on GPUs. The efficiency of the approach is usually tested on a set of different test matrices from several matrix market collections.
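The overall hybrid structure can be sketched as follows: a rough approximate inverse is built first and then passed as a preconditioner to an iterative Krylov solver. In the illustrative Python snippet below a truncated Neumann series stands in for the Markov Chain Monte Carlo estimator of the actual method, and the test matrix is a random diagonally dominant one rather than a matrix market case.

```python
# Sketch of the hybrid structure only: a rough approximate inverse used as a
# preconditioner for an iterative solver. A truncated Neumann series stands in
# for the Markov Chain Monte Carlo estimator of the actual method.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
A = sp.random(n, n, density=0.002, format="csr", random_state=0)
A = A + sp.diags(abs(A).sum(axis=1).A1 + 1.0)      # make it diagonally dominant

# Rough approximate inverse from a truncated Neumann series:
# A = D (I - C) with C = I - inv(D) A, so inv(A) ~ (I + C + C^2) inv(D).
D_inv = sp.diags(1.0 / A.diagonal())
C = sp.eye(n) - D_inv @ A
M_approx = (sp.eye(n) + C + C @ C) @ D_inv

M = spla.LinearOperator((n, n), matvec=lambda v: M_approx @ v)
b = np.random.default_rng(0).standard_normal(n)

x_plain, info_plain = spla.gmres(A, b, maxiter=200)
x_prec, info_prec = spla.gmres(A, b, M=M, maxiter=200)
print("no preconditioner:", info_plain, " rough-inverse preconditioner:", info_prec)
```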

The intern will take the existing codes, integrate Deep Learning methods such as Stochastic Gradient Descent into the hybrid method, and test the efficiency of the new method on a variety of matrices as well as on Systems of Linear Equations with the same matrix but different right-hand sides.


Project Mentor: Vassil Alexandrov

Project Co-mentor: Jony Castagna

Site Co-ordinator: Luke Mason

Participant: Mustafa Emre Şahin

Learning Outcomes:
The student will learn to design parallel hybrid  Monte Carlo methods as well as advanced Deep Learning techniques.

The student will learn how to implement these methods on modern computer architectures with latest GPU accelerators as well as how to design and develop mixed MPI/CUDA and/or MPI/OpenMP code.

Student Prerequisites (compulsory):
Introductory-level Linear Algebra, some parallel algorithm design and implementation concepts, parallel programming using MPI and CUDA.

Student Prerequisites (desirable):
Some skills in being able to develop mixed code such as MPI/OpenMP or MPI/CUDA will be an advantage.

Training Materials:
These can be tailored to the student once he/she is selected.

Workplan:

  • Week 1: Training week
  • Week 2: Literature Review, Preliminary Report (plan writing)
  • Week 3–7: Project Development
  • Week 8: Final Report write-up

Final Product Description:
The final product will be a parallel application that can be executed on hybrid architectures with GPU accelerators or on a multicore one. Ideally, we would like to publish the results in a conference or workshop paper.

Adapting the Project: Increasing the Difficulty:
The project is at the appropriate cognitive level, taking into account the timeframe and the need to submit a final working product and two reports.

Adapting the Project: Decreasing the Difficulty:
The topic will be researched and the final product will be designed in full but some of the features may not be developed to ensure a working product with some limited features at the end of the project.

Resources:
The student will need access to a GPU and/or multicore-based machines, standard computing resources (laptop, internet connection).

Organisation:
Hartree Centre – STFC


Project reference: 1908

Python is widely used in scientific research for tasks such as data processing, analysis, and visualisation. However, it is not yet widely used for large-scale modelling and simulation on high-performance computers due to its poor performance – Python is primarily designed for ease of use and flexibility, not for speed. However, there are many techniques that can dramatically increase the speed of Python programs, such as parallelisation using MPI, high-performance scientific libraries, and fast array processing using the numpy library. Although there have been many studies of Python performance on Intel processors, there have been few investigations on other architectures such as ARM64. EPCC has recently installed a parallel computer entirely constructed from ARM64 processors, a Catalyst system from HPE. This project will involve converting a simple parallel program (already written in C and Fortran) to Python, measuring its performance on the Catalyst system and comparing it to standard Intel machines. The aim will then be to try to improve the performance using standard techniques – although the approaches may be well known, Python performance on the new ARM64 processors is not well understood, so the results will be very interesting.
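A small example of the kind of optimisation involved is sketched below: the same reduction written as an interpreted Python loop and as a numpy call, with timings. How large the gap is on the ARM64 Catalyst system versus Intel machines is precisely the sort of question the project will measure.

```python
# Illustration: replacing an interpreted Python loop with a numpy array
# operation. Array size and the speedup observed are illustrative only.
import time
import numpy as np

n = 10_000_000
a = np.random.default_rng(0).random(n)

t0 = time.perf_counter()
total = 0.0
for x in a:                                    # interpreted, element by element
    total += x * x
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_np = float(np.dot(a, a))                 # vectorised, runs in compiled code
t_numpy = time.perf_counter() - t0

print(f"loop: {t_loop:.2f} s, numpy: {t_numpy:.4f} s, speedup {t_loop / t_numpy:.0f}x")
assert abs(total - total_np) < 1e-3 * abs(total_np)
```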

Sample output of existing program showing turbulent flow in a cavity

Project Mentor: Dr. David Henty

Project Co-mentor: Dr. Oliver Brown and Dr. Magnus Morton

Site Co-ordinator: Ben Morse

Participant: Ebru Diler

Learning Outcomes:
The student will develop their knowledge of Python programming and learn how to compile and run programs on a range of leading HPC systems.

Student Prerequisites (compulsory):
Ability to program in one of these languages: Python, C, C++ or Fortran. A willingness to learn new languages.

Student Prerequisites (desirable):
Ability to program in Python.

Training Materials:
Material from EPCC’s Python for HPC course.

Workplan:

  • Task 1: (1 week) – SoHPC training week
  • Task 2: (2 weeks) – Understand the functionality of the existing C/Fortran code and make an initial port to Python
  • Task 3: (1 week) – Measure baseline performance on Intel and ARM systems
  • Task 4: (2 weeks) – Investigate performance optimisations and write a final report

Final Product Description:

  • Development of a parallel python application;
  • Benchmarking results for Python performance on a range of parallel machines;
  • Recommendations for how to improve Python performance on ARM processors.

Adapting the Project: Increasing the Difficulty:
The project can be made harder by investigating advanced optimisation techniques such as cross-calling from Python to compiled languages. Alternative languages such as Julia could also be considered.

Adapting the Project: Decreasing the Difficulty:
The project can be made simpler by considering a serial rather than a parallel code.

Resources:
Access to all HPC systems can be given free of charge by EPCC.

Organisation:
EPCC


Project reference: 1907

EPCC has developed a small, portable Raspberry-pi based cluster which is taken to schools, science festivals etc. to illustrate how parallel computers work. It is called “Wee ARCHIE” (in fact there are two versions in existence) because it is a smaller version of the UK national supercomputer ARCHER. It is actually a standard Linux cluster and it is straightforward to port C, C++ and Fortran codes to it. We already have a number of demonstrators which run on Wee ARCHIE that demonstrate the usefulness of running numerical simulations on a computer, but they do not specifically demonstrate how parallel computing works.

This project aims to develop and enhance existing and in-development demonstrators that show more explicitly how a parallel program runs and its real-world applications. This is often done by showing a real-time visualisation on the front-end of Wee ARCHIE, or by programming the LED light arrays attached to each of the 16 Wee ARCHIE nodes to indicate when communication is taking place and where it is going (e.g. by displaying arrows). The aim is to make it clear what is happening on the computer to a general audience, for example, a teenager who is studying computing at school or an adult with a general interest in IT.

The demonstration considered for development is focused on coastline management. This looks at possible flood barriers and how they influence wave height and spatial location.

The project will involve understanding how these programs work, porting them to Wee ARCHIE, designing ways of visualising how they are parallelised and implementing these visualisations. If successful, these new demonstrators will be used at future outreach events.

The Wee ARCHIE parallel computer, with someone from our target audience for scale!

Project Mentor: Dr. Gordon Gibb

Project Co-mentor: Dr. Lorna Smith and Jane Kennedy

Site Co-ordinator: Ben Morse

Participant: Caelen Feller

Learning Outcomes:

  • Working with real parallel programs.
  • Learning how to communicate technical concepts to a general audience.

Student Prerequisites (compulsory):
Ability to program in at least Python.

Student Prerequisites (desirable):
Previous experience in visualisation and/or animation would be a bonus.
Programming experience/desire to learn at least one of C, C++, Fortran.

Training Materials:
The Supercomputing MOOC would also be useful – if this does not run in a suitable timeframe then we can give the student direct access to the material.

Access can be given to MSc in HPC training materials.

Workplan:
The project will start with the student familiarizing themselves with existing parallel programs including a traffic model and a fluid dynamics simulation. The second phase will be porting these to Wee ARCHIE. After that, the student will explore ways of making the parallel aspects more obvious to a general audience, e.g. via real-time visualisations or programming the LED light arrays on WeeARCHIE. The final phase will be implementing these methods and packaging the software up for future development.

Final Product Description:
One or more demonstrator applications developed for Wee ARCHIE that can be used at events such as science festivals and made available more generally for others interested in the public understanding of science.

Adapting the Project: Increasing the Difficulty:
There are many additional programs that could be looked at in addition to the simulation mentioned above.

Adapting the Project: Decreasing the Difficulty:
Various other simulations exist for the Wee ARCHIE platform which could instead be enhanced, or existing datasets of results could be further analysed rather than involving the student in active development of the more recent simulations.

Resources:
The student will need access to Wee ARCHIE at some points; although we have two systems, this cannot be guaranteed as they are often offsite at different events. However, we have smaller test and development systems which will be available at all times.

Organisation:
EPCC

Project reference: 1906

Anomaly detection is one of the most timely problems in managing HPC facilities. Clearly, this problem involves many technological issues, including big data analysis, machine learning, virtual machine management and authentication protocols.

Our research group has already prepared a tool to perform real-time and historical analysis of the most important observables in order to maximize the energy efficiency of an HPC cluster.

We want to merge the data obtained from this tool with the system logs, searching for observables that could anticipate system failures.

In the framework of the SoHPC programme, we aim to perform some preliminary tests of deep learning methodologies and log-mining databases to extract knowledge from node logs and to correlate it with node failures.

The first step will be to implement a system to collect data from system logs using Elasticsearch. These data will then be used to train a neural network to find correlations between system faults and system observables.
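
A minimal sketch of this two-step idea is given below. The index name "node-logs", the field names and the network shape are placeholders rather than the project's actual schema, and the Python Elasticsearch client and tf.keras are only assumed to be available.

    # Sketch: pull observables from Elasticsearch, then train a small classifier
    # to correlate them with node failures. All names/fields are hypothetical.
    import numpy as np
    import tensorflow as tf
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")      # assumed local Elasticsearch instance
    resp = es.search(index="node-logs", query={"match_all": {}}, size=10000)
    hits = resp["hits"]["hits"]

    # observables per log entry, plus a binary "failed" label (hypothetical fields)
    X = np.array([[h["_source"][k] for k in ("power", "temperature", "load")] for h in hits])
    y = np.array([h["_source"]["failed"] for h in hits], dtype=np.float32)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of node failure
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, validation_split=0.2)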

Project Mentor: Andrea Bartolini

Project Co-mentor: Andrea Borghesi

Site Co-ordinator: Massimiliano Guarrasi

Participant: Martin Molan

Learning Outcomes:

Student Prerequisites (compulsory):

  • Python or C/C++ (Python will be preferred)

Student Prerequisites (desirable):

  • Elasticsearch
  • TensorFlow
  • Cassandra
  • Spark
  • Blender
  • MQTT

Training Materials:
None.

Workplan:

  • Week 1: Common Training session
  • Week 2: Introduction to CINECA systems, small tutorials on Elasticsearch and TensorFlow, and detailed work planning.
  • Week 3: Problem analysis and deliver final Workplan at the end of the week.
  • Week 4, 5: Production phase (log mining and training of the neural network).
  • Week 6, 7: Final stage of production phase (Depending on the results and timeframe, the set of observables will be increased). Preparation of the final movie.
  • Week 8: Finishing the final movie. Write the final report.

Final Product Description:
A full big data pipeline to process log information, as well as a trained deep neural network to predict node anomalies.

Adapting the Project: Increasing the Difficulty:
A simple tool to perform live anomaly detection will be prepared and installed on virtual machines.

Adapting the Project: Decreasing the Difficulty:
If necessary, we could reduce the effort by creating only a log-mining application using Elasticsearch.

Resources:
The student will have access to our facility, our HPC systems, the databases containing all the measurements, system logs and node status information. They could also manage a dedicated virtual machine.

Organisation:
CINECA

Project reference: 1905

I/O is recognized as the main bottleneck on the way to exascale computing, which is up to 1000x faster than current petascale systems. The main cause is the growing disparity between the rates of improvement of CPUs on the one hand and memory and external storage on the other. With such limited I/O bandwidth and capacity, it is acknowledged that the traditional three-step workflow (pre-processing, simulation, post-processing) is not sustainable in the exascale era. One approach to overcome the data-transfer bottleneck is the in situ approach, which moves some of the post-processing tasks in line with the simulation code. Using the recent OpenFOAM plugin co-developed by CINECA with ESI-OpenCFD and Kitware to perform in situ visualization with OpenFOAM and Catalyst, a series of scaling tests will be carried out on classical visualization pipelines running concurrently with the simulation.
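
As a rough sketch of the kind of "classical" visualization pipeline involved, the script below builds a slice and an isosurface from an OpenFOAM case using the ParaView Python interface; it is written as a stand-alone pvpython script rather than a Catalyst co-processing script, and the case path and field name are placeholders.

    # Sketch of a slice + isosurface pipeline in pvpython; in situ, an equivalent
    # pipeline would be attached to the running solver through Catalyst.
    from paraview.simple import *

    case = OpenDataFile("cavity.foam")            # OpenFOAM case (placeholder path)
    view = GetActiveViewOrCreate("RenderView")

    slc = Slice(Input=case)                       # cutting plane through the domain
    slc.SliceType.Origin = [0.05, 0.05, 0.005]
    slc.SliceType.Normal = [0.0, 0.0, 1.0]

    iso = Contour(Input=case)                     # isosurface of the pressure field
    iso.ContourBy = ["POINTS", "p"]               # "p" is a placeholder field name
    iso.Isosurfaces = [0.0]

    Show(slc, view)
    Show(iso, view)
    Render()
    SaveScreenshot("pipeline.png", view)          # with Catalyst this would happen per time step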

Project Mentor: Prof. Federico Piscaglia

Project Co-mentor: /

Site Co-ordinator: Simone Bnà

Participant: Li Juan Chan

Learning Outcomes:
Increase the student’s skills in:

  • Paraview (Catalyst)
  • OpenFOAM
  • Python
  • Managing resources on Tier-0 and Tier-1 systems
  • Batch scheduler (Slurm)
  • Profiling
  • Remote visualization (RCM, TurboVNC)
  • Video and 3D editing (Blender)

Student Prerequisites (compulsory):
Knowledge of:

  • OpenFOAM: basic
  • Python: strong
  • ParaView: good
  • Linux/Unix: strong

Student Prerequisites (desirable):
Knowledge of:

  • Remote Visualization
  • Blender
  • MPI, Parallel Rendering

Training Materials:
None

Workplan:

  • Week 1: Common Training session
  • Week 2: Introduction to CINECA HPC systems, small tutorials on remote visualization and detailed work planning.
  • Week 3: Problem analysis and deliver final work-plan at the end of the week.
  • Week 4, 5: Production phase (A series of visualization workflows will be implemented using the Paraview GUI).
  • Week 6, 7: Final stage of production phase (A series of scalability tests are performed for each visualization workflow). Preparation of the final movie.
  • Week 8: Completing the final movie. Write the final report.

Final Product Description:
The purpose of this project is to perform a series of runs to show the scaling performance of the Catalyst library coupled with OpenFOAM, using the recent plugin co-developed by CINECA with ESI-OpenCFD and Kitware. Classical visualization workflows (isosurfaces, slicing, clipping, parallel rendering, …) will be applied to the outputs of OpenFOAM combustion simulations. The final product will be a report with plots showing the scalability performance, together with a short account of the work done.

Adapting the Project: Increasing the Difficulty:
If we have enough time, we could expand the project in the following direction:

  • Several visualization pipelines for analysis purposes can be generated using the ParaView GUI

Adapting the Project: Decreasing the Difficulty:
An easy benchmark test case can be generated using the ParaView GUI.
A very simple movie showing the web interface will be prepared.

Resources:
The OpenFOAM software, released by the OpenFOAM Foundation and by ESI-OpenCFD, will be provided together with all the open-source software needed for the work (e.g. ParaView and Blender). The software is already available on the CINECA clusters, which the students will use with their own provided accounts.

Organisation:
CINECA

Project reference: 1904

Calculations of nanotubes are dominated by methods based on one-dimensional translational symmetry using a huge unit cell. A pseudo two-dimensional approach, in which the inherent helical symmetry of general-chirality nanotubes is exploited, has so far been limited to simple approximate model Hamiltonians. Currently, we are developing a new, unique code for fully ab initio calculations of nanotubes that explicitly uses the helical symmetry properties, even for systems that are aperiodic in one dimension. The implementation is based on a formulation in two-dimensional reciprocal space where one dimension is continuous and the second is discrete. Independent-particle quantum chemistry methods, such as Hartree-Fock and/or DFT, or simple post-Hartree-Fock perturbation techniques, such as Møller-Plesset 2nd order theory, are used to calculate the band structures.

The student is expected to cooperate on the optimization of this newly developed implementation and/or on performance testing for simple general nanotube model cases. The Message Passing Interface (MPI) is used as the primary tool in the code parallelization.

The aim of this work is to improve the MPI parallelization in the recently developed parts of the code that include electron correlation effects. A second level of parallelization of the inner loops, over the processors of individual nodes, would certainly enhance the performance, together with the use of threaded BLAS routines.
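
The sketch below illustrates the two-level idea in schematic Python/mpi4py rather than the project's own code: the discrete reciprocal-space points are distributed over MPI ranks, while the per-point matrix work is delegated to threaded BLAS (here via numpy). build_fock() and the matrix sizes are hypothetical stand-ins.

    # Sketch: coarse MPI parallelism over discrete k-points, fine-grained
    # parallelism inside each k-point via threaded BLAS (numpy's @ operator).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    n_k, n_ao = 64, 200                   # number of discrete k-points, basis size (made up)

    def build_fock(k, n):
        """Placeholder for the per-k-point work; the @ call runs in threaded BLAS."""
        h = np.random.rand(n, n)
        return h @ h.T                    # symbolic stand-in for the real tensor contraction

    # outer (coarse) level: each rank handles a strided subset of k-points
    local_energy = sum(np.trace(build_fock(k, n_ao)) for k in range(rank, n_k, nprocs))

    # combine partial results across ranks
    total_energy = comm.allreduce(local_energy, op=MPI.SUM)
    if rank == 0:
        print("total (schematic) energy:", total_energy)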

By improving the performance of our new software, we will open up new possibilities for tractable, highly accurate calculations of energies and band structures for nanotubes with predictive power, and with a band topology whose interpretation is much more transparent than in the conventionally used one-dimensional approach. We believe that this methodology will soon become a standard tool for in silico design and investigation in both the academic and commercial sectors.

General case nanotube with helical translational symmetry.

Project Mentor: Prof. Dr. Jozef Noga, DrSc.

Project Co-mentor: Ing. Marian Gall, PhD.

Site Co-ordinator: Mgr. Lukáš Demovič, PhD.

Participant: Irén Simkó

Learning Outcomes:
The student will become familiar with MPI programming and testing, as well as with ideas for the efficient implementation of complex tensor-contraction based HPC applications. A basic understanding of the treatment of helically periodic systems in physics will be gained, along with detailed knowledge of profiling tools and parallelization techniques.

Student Prerequisites (compulsory):
Basic knowledge of Fortran and MPI.

Student Prerequisites (desirable):
Advanced knowledge of C/C++ or Fortran and MPI, BLAS libraries and other HPC tools.

Training Materials:
Articles and test examples to be provided according to an actual student’s profile and skills.

Workplan:

  • Weeks 1-3: training; profiling of the software and design of a simple MPI implementation; deliver the plan at the end of week 3
  • Weeks 4-7: optimization and extensive testing/benchmarking of the code
  • Week 8: report completion and presentation preparation

Final Product Description:
The resulting code will be capable of completing the electronic structure calculations of nanotubes more efficiently, with a much simplified and transparent topology of the band structure.

Adapting the Project: Increasing the Difficulty:
A more efficient implementation using a hybrid model with both MPI and Open Multi-Processing (OpenMP).

Adapting the Project: Decreasing the Difficulty:
Profiling of the code to provide key information on the bottlenecks and a simple MPI parallelization of the main loops.

Resources:
The student will have access to the necessary learning material, as well as to our local IBM P775 supercomputer and x86 InfiniBand clusters. The software stack we plan to use is open source.

Organisation:
Computing Centre of the Slovak Academy of Sciences


Project reference: 1903

The goal of the project is to demonstrate that HPC tools are (at least) as good as, or even better than, the popular JVM-based technologies such as Hadoop MapReduce or Apache Spark for big data processing. Performance (in terms of floating-point operations per second) is clearly not the only metric to judge by. We must also address other aspects in which the traditional big data frameworks shine, i.e. runtime resilience and parallel processing of distributed data.

The runtime resilience of parallel HPC applications has been an active research field for quite a few years, and several approaches are now application-ready. We plan to use the GPI-2 API (http://www.gpi-site.com/gpi2) implementing the GASPI (Global Address Space Programming Interface) specification, which offers, among other appealing features (asynchronous data flow, etc.), mechanisms to react to failures.

The parallel, distributed data processing in the traditional big data world is made possible by special file systems, such as the Hadoop file system (HDFS) or its analogues. HDFS enables data processing that exploits data locality, i.e. processing data that is “physically” located on the compute node, without the need for data transfer over the network. Despite its many advantages, HDFS is not particularly suitable for deployment on HPC facilities/supercomputers or for use with C/C++ or Fortran MPI codes, for several reasons. Within the project, we plan to explore other possibilities (in-memory storage / NVRAM, and/or multi-level architectures, etc.) and search for the most suitable alternative to HDFS.

Having a powerful set of tools for big data processing and high-performance data analytics (HPDA), built using HPC tools and compatible with HPC environments, is highly desirable because of the growing demand for such tasks on supercomputing facilities.

Simplified scheme of asynchronous parallel matrix multiplication C = A * B. Blocks of matrices A, B, and C (boxes) are distributed across parallel processes (“ranks”), each represented by one of four colors.
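
A minimal mpi4py sketch of the scheme in the figure is shown below, assuming A, B and C are distributed by block-rows over P ranks and the blocks of B circulate around a ring while communication is overlapped with computation. It illustrates the asynchronous pattern only; it is not the project's GASPI implementation.

    # Sketch: ring-based block matrix multiplication with non-blocking MPI,
    # overlapping the transfer of the next B block with the local product.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()

    n = 512                              # global matrix dimension (assumed divisible by P)
    nb = n // P                          # block size per rank

    A_row = np.random.rand(nb, n)        # this rank's block-row of A
    B_row = np.random.rand(nb, n)        # this rank's block-row of B
    C_row = np.zeros((nb, n))            # this rank's block-row of C = A @ B

    buf = B_row.copy()                   # B block currently held
    recv = np.empty_like(buf)
    owner = rank                         # index of the B block-row currently held
    left, right = (rank - 1) % P, (rank + 1) % P

    for step in range(P):
        # post non-blocking transfer of the held B block to the right neighbour
        reqs = [comm.Isend(buf, dest=right), comm.Irecv(recv, source=left)]
        # meanwhile, multiply the matching column-block of A with the held B block
        C_row += A_row[:, owner*nb:(owner+1)*nb] @ buf
        MPI.Request.Waitall(reqs)
        buf, recv = recv, buf
        owner = (owner - 1) % P          # the received block came from the left neighbour

    # each rank now holds its block-row of C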

Project Mentor:  Doc. Mgr. Michal Pitoňák, PhD.

Project Co-mentor: Ing. Marian Gall, PhD.

Site Co-ordinator: Mgr. Lukáš Demovič, PhD.

Participant: Thizirie Ould Amer

Learning Outcomes:
The student will learn a lot about MPI and GASPI (C/C++ or Fortran), parallel filesystems (Lustre, BeeGFS/BeeOND) and the basics of Apache Spark. He/she will also become familiar with ideas for the efficient use of tensor contractions and parallel I/O in machine learning algorithms.

Student Prerequisites (compulsory):
Basic knowledge of C/C++ or Fortran and MPI.

Student Prerequisites (desirable):
Advanced knowledge of C/C++ or Fortran and MPI. Basic knowledge of GASPI, Scala, Apache Spark, big data concepts, machine learning algorithms, BLAS libraries, and other HPC tools.

Training Materials:

Workplan:

  • Week 1: training
  • Weeks 2-3: an introduction to GASPI, Scala, Apache Spark (and MLlib) and efficient implementation of algorithms
  • Weeks 4-7: implementation, optimization and extensive testing/benchmarking of the codes
  • Week 8: report completion and presentation preparation

Final Product Description:
The expected project result is a (C/C++ or Fortran) MPI and (runtime-resilient) GASPI implementation of a selected, popular machine learning algorithm. The codes will be benchmarked and compared with state-of-the-art implementations of the same algorithm in Apache Spark MLlib or another “traditional” big data / HPDA technology.

Adapting the Project: Increasing the Difficulty:
The choice of the machine learning algorithm(s) to implement depends on the student’s skills and preferences. Making an ML algorithm implementation both efficient and runtime-resilient is challenging enough.

Adapting the Project: Decreasing the Difficulty:
Similar to “increasing difficulty”: we can choose one of the simpler machine learning algorithms and/or sacrifice the requirement of runtime resilience.

Resources:
The student will have access to the necessary learning material, as well as to our local IBM P775 supercomputer and x86 InfiniBand clusters. The software stack we plan to use is open source.

Organisation:
Computing Centre of the Slovak Academy of Sciences



Project reference: 1902

Supercomputers are a key tool for professionals from many disciplines to address societal challenges, enabling them to perform, e.g., climate change simulations or genome analysis. The European Commission’s high-performance computing (HPC) Strategy, implemented in the Horizon 2020 Programme, sets out the need to bring Europe’s high-performance computing technology to the exascale era, with energy efficiency being one of the major challenges. Since providing the required amount of memory for upcoming exascale applications is not viable by means of top-performance technology alone, due to energy consumption and dissipation constraints, vendors are incorporating a variety of additional memory subsystems built upon different technologies, which provide diverse features and limitations (e.g., Intel’s 3D XPoint technology). Deciding which data to host in each memory subsystem is far from trivial and has notable performance and energy implications. Recent research has proposed different methodologies to address this problem, e.g., based on object- or page-level placement, on system emulation or sampling of real hardware counters, on profiling or at run time, etc. The aim of this project is to reproduce some of the results of previous works and improve upon the state of the art in this field, based on technology already developed at the hosting institution. The selected student will have access to state-of-the-art hardware and will be mentored by experts in the field, in close collaboration with the Intel-BSC Exascale Laboratory. A co-authored journal paper is expected to be submitted based on the knowledge developed during the internship.

Example of data object placement from last-level cache misses.
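
As an illustration only (not the methodology developed at the hosting institution), the sketch below shows a naive greedy object-level placement: objects with the highest profiled miss density are assigned to the fast memory tier until its capacity is exhausted. All names and numbers are made up.

    # Sketch: greedy object-level placement from profiled last-level-cache misses.
    def place_objects(objects, fast_capacity):
        """objects: list of (name, size_in_bytes, llc_misses); returns tier per object."""
        placement = {}
        remaining = fast_capacity
        # rank objects by misses per byte: the hottest data benefits most from the fast tier
        for name, size, misses in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
            if size <= remaining:
                placement[name] = "fast"   # e.g. on-package high-bandwidth memory
                remaining -= size
            else:
                placement[name] = "slow"   # e.g. larger but slower tier such as NVRAM
        return placement

    # toy profile: (object, size in bytes, LLC misses) -- illustrative values only
    profile = [("grid", 6 << 20, 9_000_000),
               ("halo", 1 << 20, 4_000_000),
               ("log",  8 << 20,    10_000)]
    print(place_objects(profile, fast_capacity=8 << 20))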

Project Mentor: Antonio J. Peña

Project Co-mentor: Muhammad Owais

Site Co-ordinator: Maria Ribera Sancho

Participant: Perry Gibson

Learning Outcomes:                                                                       

  • Heterogeneous memory data placement strategies.
  • Latest trends in heterogeneous memory systems for HPC.

Student Prerequisites (compulsory):                                                                 

  • C
  • Computer Architecture

Student Prerequisites (desirable):

  • MPI
  • OpenMP

Training Materials:

Workplan:

  • Week 1: Training
  • Week 2: Literature review and plan report development
  • Week 3-7: Project development
  • Week 8: Development of final report

Final Product Description:

  • Reproduced previous paper results.
  • A scientific paper and/or technical report.

Adapting the Project: Increasing the Difficulty:
Improving upon the state of the art can be added on top of the reproducibility work.

Adapting the Project: Decreasing the Difficulty:
We can keep the project within the reproducibility part. The student will receive technical assistance with the actual coding if needed.

Resources:
A laptop brought by the student. We will provide access to the required hardware.

Organisation:
Barcelona Supercomputing Centre


Project reference: 1901

Supercomputers are a key tool for professionals from many disciplines to address societal challenges, enabling them to perform, e.g., climate change simulations or genome analysis. The European Commission’s high-performance computing (HPC) Strategy, implemented in the Horizon 2020 Programme, sets out the need to bring Europe’s high-performance computing technology to the exascale era, with energy efficiency being one of the major challenges. Since providing the required amount of memory for upcoming exascale applications is not viable by means of top-performance technology alone, due to energy consumption and dissipation constraints, vendors are incorporating a variety of additional memory subsystems built upon different technologies, which provide diverse features and limitations (e.g., Intel’s 3D XPoint technology). Deciding which data to host in each memory subsystem is far from trivial and has notable performance and energy implications. Recent research has proposed different methodologies to address this problem, e.g., based on object- or page-level placement, on system emulation or sampling of real hardware counters, on profiling or at run time, etc. The aim of this project is to analyse the effects of different profiling strategies for heterogeneous memory data placement. The selected student will have access to state-of-the-art hardware and will be mentored by experts in the field, in close collaboration with the Intel-BSC Exascale Laboratory. A co-authored journal paper is expected to be submitted based on the knowledge developed during the internship.

Analysis of data object placements from last-level cache misses with less simulation overhead by sampling.
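
To illustrate the trade-off being analysed (again, not the actual methodology), the sketch below compares the set of "hot" pages identified from a fully profiled synthetic access trace with the sets obtained from progressively sparser samples of the same trace; numbers and sizes are made up.

    # Sketch: how sampling rate affects the hot-page set recovered from a profile.
    import random
    from collections import Counter

    def hot_pages(trace, budget):
        """Return the `budget` most frequently accessed pages in an access trace."""
        return {p for p, _ in Counter(trace).most_common(budget)}

    random.seed(0)
    # synthetic skewed trace: a few hot pages dominate the accesses
    pages = list(range(1000))
    trace = random.choices(pages, weights=[1.0 / (i + 1) for i in pages], k=200_000)

    full = hot_pages(trace, budget=50)
    for rate in (0.1, 0.01, 0.001):
        sample = [p for p in trace if random.random() < rate]   # sampled profile
        sampled = hot_pages(sample, budget=50)
        overlap = len(full & sampled) / len(full)
        print(f"sampling rate {rate:>6}: {overlap:.0%} of hot pages identified")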

Project Mentor: Marc Jorda

Project Co-mentor: Antonio J. Peña

Site Co-ordinator: Maria Ribera Sancho

Participant: Dimitrios Voulgaris

Learning Outcomes:                                                                       

  • Heterogeneous memory data placement strategies.
  • Latest trends in heterogeneous memory systems for HPC.

Student Prerequisites (compulsory):                                                                 

  • C
  • Computer Architecture

Student Prerequisites (desirable):

  • MPI
  • OpenMP

Training Materials:

Workplan:

  • Week 1: Training
  • Week 2: Literature review and plan report development
  • Week 3-7: Project development
  • Week 8: Development of final report

Final Product Description:

  • Analysis of profiling strategies for heterogeneous data placement.
  • A scientific paper and/or technical report.

Adapting the Project: Increasing the Difficulty:
We can add improving upon the state of the art in reducing the runtime overhead of simulation-based profiling.

Adapting the Project: Decreasing the Difficulty:
We can keep the project within the profiling-effect analysis part. The student will receive technical assistance with the actual coding if needed.

Resources:
A laptop brought by the student. We will provide access to the required hardware.

Organisation:
Barcelona Supercomputing Centre

