High Throughput HEP Data Processing at HPC
Project reference: 2110
The High Energy Physics (HEP) community has traditionally employed High Throughput Computing (HTC) facilities for LHC data processing and the various types of physics analyses. Furthermore, with the recent convergence of AI and HPC, it is becoming apparent that a single, but modular and flexible, type of facility could in the future replace single-purpose environments. The goal of this project is to take several types of workloads, compute-bound LHC event reconstruction and much more I/O-driven physics analyses that may employ some form of ML or DL, and evaluate their effectiveness at HPC scale.

LHC workloads are mostly data-driven, meaning that a lot of data has to be ingested and a substantial amount of output produced. Doing this at the scale of a single node, or even tens of nodes, is not a challenge; when moving to thousands of nodes, however, handling the input and output becomes a huge bottleneck. The HEP community as a whole is currently evaluating various proof-of-concept designs for how this data flow should function. The idea is to scale these two very different, but both data-driven, workloads, understand the limitations of the existing model, and observe the peculiarities of HPC systems under heavy dataflow load.
Project Mentor: Maria Girone
Project Co-mentors: Viktor Khristenko (CERN) and Sadaf Roohi Alam (CSCS)
Site Co-ordinators: Maria Girone (CERN) and Joost VandeVondele (CSCS)
Participants: Carlos Eduardo Cocha Toapaxi, Andraž Filipčič
Learning Outcomes:
- Learn about data-driven sciences, which are increasingly prevalent due to the growing volumes of collected/recorded information.
- Learn to scale out analytics applications using HPC facilities; this is very similar to what people do in industry with Hadoop, Apache Spark, and other Big Data tools (see the sketch after this list).
- Learn about available system monitoring at HPC facilities.
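To make the scale-out idea in the second learning outcome more concrete, here is a minimal map-reduce sketch in Python. Each worker processes one chunk of synthetic "events" and fills a partial histogram, and the partial results are merged at the end; a real analysis would read experiment data (e.g. with ROOT or uproot) instead of generating random numbers, and the same pattern would be driven across nodes by the batch system rather than a local process pool.

```python
# Minimal map-reduce sketch of scaling out an analytics task.
# The per-chunk work (histogramming synthetic "events") is only a stand-in
# for a real HEP analysis step; nothing here is tied to actual LHC data.
from concurrent.futures import ProcessPoolExecutor
import random

N_BINS = 50
X_MIN, X_MAX = 0.0, 200.0


def analyse_chunk(seed, n_events=100_000):
    """Map step: produce a partial histogram from one chunk of events."""
    rng = random.Random(seed)
    hist = [0] * N_BINS
    for _ in range(n_events):
        x = rng.gauss(100.0, 20.0)  # fake observable, e.g. an invariant mass
        if X_MIN <= x < X_MAX:
            hist[int((x - X_MIN) / (X_MAX - X_MIN) * N_BINS)] += 1
    return hist


def merge(h1, h2):
    """Reduce step: add two partial histograms bin by bin."""
    return [a + b for a, b in zip(h1, h2)]


if __name__ == "__main__":
    total = [0] * N_BINS
    with ProcessPoolExecutor(max_workers=8) as pool:
        for partial in pool.map(analyse_chunk, range(32)):
            total = merge(total, partial)
    print("total events in histogram:", sum(total))
```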
Student Prerequisites (compulsory):
- Working knowledge of at least one programming language
- Bash, Linux
- Familiarity with/some knowledge of networking (TCP/IP)
Student Prerequisites (desirable):
- Working knowledge of Linux
- Working knowledge of TCP/IP stack
- Slurm
Training Materials:
https://ieeexplore.ieee.org/document/5171374 (paper describing the existing job submission scheme for Grid computing)
Workplan:
- Weeks 1-2: Learn about the existing workloads: how to run them, what they do, etc. Learn to write batch system submission scripts. In short, get started by running things at a smaller scale.
- Weeks 3-4: Take LHC event reconstruction (most likely the CMS experiment's event reconstruction) and scale it out to more nodes. Observe how throughput changes as a function of node count; a sketch of such a node-count scan is given after this list. Learn how to monitor and understand the generated traffic. Test out various ways of doing I/O, e.g. reading directly from the shared file system versus having the job submission copy the data to the nodes. Determine the existing limitations of doing this on an HPC system.
- Weeks 5-6: Take an LHC physics analysis workload. Similar to Weeks 3-4, but here the I/O requirements are at least 10x higher.
- Weeks 7-8: Finalize.
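As a concrete illustration of the node-count scan in Weeks 1-4, the sketch below generates and submits one Slurm batch script per node count with `sbatch`. The wrapper name `run_reco.sh`, the resource settings, and the node counts are placeholders rather than the project's actual setup; a real run would launch the experiment's reconstruction application (e.g. `cmsRun` with its configuration and input data) and record the wall time and events processed for each job.

```python
# Hedged sketch of a node-count scan driver: one Slurm job per node count.
# All names and resource settings below are placeholders.
import subprocess
from pathlib import Path

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=hep-scan-{nodes}n
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00
#SBATCH --output=scan_{nodes}n_%j.out

# One application instance per node; each processes its own share of events.
srun ./run_reco.sh  # placeholder wrapper around the real reconstruction job
"""


def submit_scan(node_counts=(1, 2, 4, 8, 16)):
    for nodes in node_counts:
        script = Path(f"scan_{nodes}n.sbatch")
        script.write_text(SBATCH_TEMPLATE.format(nodes=nodes))
        # sbatch prints "Submitted batch job <id>" on success.
        result = subprocess.run(
            ["sbatch", str(script)], capture_output=True, text=True, check=True
        )
        print(f"{nodes:>3} nodes -> {result.stdout.strip()}")


if __name__ == "__main__":
    submit_scan()
```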
Final Product Description:
- Scale out two representative workflows that differ in input/output and compute requirements, and observe how the figure of merit, throughput, changes when going to higher node counts (a small throughput/efficiency sketch follows this list).
- Understand the impact and limitations of doing high-throughput data analysis at an HPC facility: using the shared filesystem vs. pre-placing data on the nodes, and using the HPC centre's outgoing/incoming connections.
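One simple way to express the throughput figure of merit is sketched below: events processed per second at each node count, plus a scaling efficiency relative to the smallest run. The numbers in the example call are placeholders only; they would be replaced by the event counts and wall times actually measured on the scanned jobs.

```python
# Sketch of the throughput figure of merit and a scaling-efficiency report.
# The measurement values used in the example are placeholders, not results.

def throughput(events, wall_seconds):
    return events / wall_seconds


def report(measurements):
    """measurements: {node_count: (events_processed, wall_seconds)}"""
    base_nodes = min(measurements)
    base_tp = throughput(*measurements[base_nodes])
    for nodes in sorted(measurements):
        tp = throughput(*measurements[nodes])
        # Ideal scaling: throughput grows linearly with node count.
        efficiency = tp / (base_tp * nodes / base_nodes)
        print(f"{nodes:>4} nodes: {tp:10.1f} events/s, efficiency {efficiency:5.1%}")


if __name__ == "__main__":
    report({1: (100_000, 2000.0), 8: (800_000, 2100.0), 64: (6_400_000, 2600.0)})
```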
Adapting the Project: Increasing the Difficulty:
Ideally, we would not want an HPC site to host/store any of our data for a long period, as that essentially means the HEP community loses full control of its data. On the other hand, reading/writing data over the HPC centre's external links could be costly. The idea is to move away from a 2-tier architecture to an N-tier one by employing smart, dynamic data caching at the HPC site; a minimal caching sketch is given below. If the workplan turns out to be too simple, we want to think about possible designs for such an architecture. In other words, after walking through the four workplan phases above, we would like to understand patterns and possibilities for improvement.
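As a rough illustration of what such a cache tier could look like at the node level, the sketch below checks node-local scratch before falling back to the shared filesystem; in a full N-tier design the fallback chain would continue to a site-level cache and ultimately to remote HEP storage. All paths are placeholders under /tmp so the example stays self-contained.

```python
# Minimal node-local caching sketch for the N-tier idea; all paths are
# placeholders. On a real system LOCAL_CACHE would be node-local scratch
# (e.g. an NVMe mount) and SHARED_STORE the shared parallel filesystem.
import shutil
from pathlib import Path

LOCAL_CACHE = Path("/tmp/hep-demo/cache")
SHARED_STORE = Path("/tmp/hep-demo/shared")


def fetch(relative_path: str) -> Path:
    """Return a local path for the requested input, caching it on first use."""
    cached = LOCAL_CACHE / relative_path
    if cached.exists():
        return cached  # cache hit: no shared-filesystem traffic
    source = SHARED_STORE / relative_path
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, cached)  # cache miss: stage the file in once
    return cached


if __name__ == "__main__":
    # Tiny self-contained demo: create a fake "shared" file, then fetch it twice.
    SHARED_STORE.mkdir(parents=True, exist_ok=True)
    (SHARED_STORE / "events_0001.dat").write_text("fake event data\n")
    print(fetch("events_0001.dat"))  # first call: copied into the cache
    print(fetch("events_0001.dat"))  # second call: served from the cache
```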
Adapting the Project: Decreasing the Difficulty:
To decrease the difficulty of the project, we would use only LHC event reconstruction and scale out that single application.
Resources:
- Access to an HPC site and support with co-mentoring a student and ensuring expertise from an HPC site.
- Laptop
*Online only in any case
Organisation:
CERN