The convergence of HPC and Big Data/HPDA
Project reference: 2133
High-Performance Computing (HPC) traditionally focuses on compute-intensive tasks, whereas Big Data frameworks focus on data-intensive tasks. Our project investigates the intersection of HPC and Big Data on the basis of case studies arising from the two fields. How can Big Data frameworks be ported to a traditional HPC architecture? What are the challenges and limitations? How can HPC enhance Big Data analysis? What tools and projects are already available to facilitate the interoperability of Big Data and HPC platforms, enabling what is also known as HPDA (High Performance Data Analysis)?
On the one hand, existing Big Data solutions will be run on the local supercomputer and should make efficient use of its resources. Performance bottlenecks for Big Data users on HPC include the queuing system, waiting times, and access restrictions to data and compute resources. On the other hand, we will use methods taken from Big Data software stacks to solve pre- and post-processing tasks of High-Performance applications. An important goal is to identify and classify tasks that can profit from new implementation ideas, where the interplay of HPC and Big Data may lead to easily implementable solutions to advanced problems.
Bringing together HPC and Big Data solutions poses a challenge in terms of programming paradigms for computation (“data parallel” versus “automatic parallelization”) and even in terms of common programming languages. This may be overcome by focusing on Python as a common denominator, without excluding other languages. Our project will develop prototypical applications to illustrate the integration of Hadoop and HPC tasks, leveraging the Vienna Scientific Cluster (VSC) as well as the Apache Hadoop Little Big Data (LBD) cluster of the Technical University of Vienna.
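As a rough illustration of the “data parallel” paradigm with Python as the common language, the following minimal sketch mimics the map, shuffle and reduce phases of a Hadoop-style word count using only the Python standard library. It is not taken from the project and does not use any Hadoop or VSC API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit one (word, 1) pair per word in a single input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values to obtain one count per word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases sequentially on a list of text lines."""
    # Each line is mapped independently, which is what makes this step data parallel.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data meets HPC", "HPC meets big data analysis"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, a framework is free to distribute the lines across many workers; this sketch simply runs the phases one after another in a single process.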
Project Mentor: Giovanna Roda
Project Co-mentor: Dieter Kvasnicka
Site Co-ordinator: Claudia Blaas-Schenner
The student will get to know both the HPC and the Big Data ecosystems and will be able to leverage these technologies for their computational needs.
Student Prerequisites (compulsory):
Familiarity with the Linux shell.
Student Prerequisites (desirable):
Experience with Hadoop and/or HPC.
Will be provided at a later time
- Week 1: Training
- Weeks 2-3: Introduction to Big Data
- Weeks 3-4: Introduction to HPC
- Weeks 5-6: Running Big Data applications on HPC
- Weeks 7-8: Further experiments and report
Final Product Description:
The expected project result consists of a report and software prototypes that illustrate the work done.
Adapting the Project: Increasing the Difficulty:
Package the applications in containers for portability.
Adapting the Project: Decreasing the Difficulty:
Become familiar with Big Data and HPC, but run applications only on Hadoop.
A client machine for connecting to the clusters and a fast Internet connection.
VSC Research Center, TU Wien