Apache Spark: Are Big Data tools applicable in HPC?
Project reference: 1703
This project started as a SoHPC 2016 project, in which we implemented the routinely used quantum chemistry Hartree-Fock (HF) method in the Apache Spark framework. Due to a lack of time, several questions remained unanswered, such as: how efficient can a parallel Apache Spark code be compared to MPI if all affordable optimizations are applied, and how resilient is the parallel calculation, especially in comparison with MPI?
The student is expected to cooperate further in optimizing the existing HF Apache Spark code written in Scala, and to implement and optimize other popular quantum chemistry algorithms, such as Density Functional Theory (DFT) and/or second-order Møller-Plesset perturbation theory (MP2), using the same framework.
Although Apache Spark runs on top of the JVM (Java Virtual Machine) and thus can hardly match the floating-point performance of Fortran/C(++) MPI programs compiled to machine code, it has many desirable features for a (distributed) parallel application: fault tolerance, node-aware distributed storage, caching, and automated memory management. We are nevertheless curious about the performance limits of an Apache Spark application, for example when critical parts are replaced with compiled native code or when efficient BLAS-like libraries are used.
We do not expect the resulting code to be truly competitive with MPI, performance-wise, in production applications. Still, such an experiment may be valuable for programmers from the Big Data world who implement (often computationally demanding) algorithms, e.g. for machine learning.
The choice of implementation target, i.e. the aforementioned quantum chemistry methods, reflects the professional background of the project mentor and is certainly open to negotiation. Any HPC application with non-negligible data flow is acceptable as well.
Directed Acyclic Graph of transformations of RDDs (Resilient Distributed Datasets) during the execution of an Apache Spark program.
Project Mentor: Doc. Mgr. Michal Pitoňák, PhD.
Site Co-ordinator: Mgr. Lukáš Demovič, PhD.
The student will learn a lot about MPI, the Scala programming language, and Apache Spark, as well as ideas for the efficient implementation of tensor-contraction-based HPC applications, particularly in quantum chemistry.
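To make the phrase "tensor-contraction based" concrete, here is a minimal sketch in plain Scala (no Spark dependency) of the contraction at the heart of a closed-shell Hartree-Fock step: building the Fock matrix F from a core Hamiltonian H, a density matrix D and two-electron integrals (mn|ls). It is not the project's code. The names `FockSketch`, `fockElement`, `buildFock` and `chunkSize` are hypothetical, the constant toy integrals in `main` are placeholders rather than real integrals, and D is assumed to follow the closed-shell convention that already includes the factor of 2. The index pairs are split into chunks only to suggest how the work might be distributed; in a Spark version each chunk could plausibly correspond to a partition of an RDD.

```scala
object FockSketch {
  type Matrix = Vector[Vector[Double]]
  // Hypothetical integral provider: returns (m n | l s) for four basis indices.
  type Eri = (Int, Int, Int, Int) => Double

  // One Fock matrix element:
  // F(m,n) = H(m,n) + sum over l,s of D(l,s) * [ (mn|ls) - 0.5 * (ml|ns) ]
  def fockElement(m: Int, n: Int, h: Matrix, d: Matrix, eri: Eri): Double = {
    val size = h.length
    val twoElectron = (for {
      l <- 0 until size
      s <- 0 until size
    } yield d(l)(s) * (eri(m, n, l, s) - 0.5 * eri(m, l, n, s))).sum
    h(m)(n) + twoElectron
  }

  // Build the whole matrix by mapping over chunks of (m, n) index pairs.
  // Each chunk is independent of the others, so the map step is the part
  // that a distributed framework could run on different workers.
  def buildFock(h: Matrix, d: Matrix, eri: Eri, chunkSize: Int = 2): Matrix = {
    val size = h.length
    val tasks = for (m <- 0 until size; n <- 0 until size) yield (m, n)
    val partial = tasks.grouped(chunkSize).toVector.map { chunk =>
      chunk.map { case (m, n) => ((m, n), fockElement(m, n, h, d, eri)) }
    }
    // Collect the partial results and reassemble them into a matrix.
    val elements = partial.flatten.toMap
    Vector.tabulate(size, size)((m, n) => elements((m, n)))
  }

  def main(args: Array[String]): Unit = {
    // Toy 2x2 input data, chosen only for illustration.
    val h: Matrix = Vector(Vector(-1.0, -0.2), Vector(-0.2, -0.5))
    val d: Matrix = Vector(Vector(1.0, 0.0), Vector(0.0, 1.0))
    val toyEri: Eri = (_, _, _, _) => 1.0
    buildFock(h, d, toyEri).foreach(row => println(row.mkString(" ")))
  }
}
```

The integrals are passed in as a function so that the contraction logic stays separate from how the integrals are computed or stored. A real implementation would also exploit the permutational symmetry of the integrals and a BLAS-like library, neither of which is attempted here.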
Student Prerequisites (compulsory):
Basic knowledge of Fortran/C/C++, MPI, Scala or advanced knowledge of Java. Background in quantum chemistry or physics.
Student Prerequisites (desirable):
Advanced knowledge of Scala, basic knowledge of Apache Spark, BLAS libraries and other HPC tools.
Week 1: training; weeks 2-3: introduction to Scala, Apache Spark, and the theory and efficient implementation of quantum chemistry methods; weeks 4-7: implementation, optimization and extensive testing/benchmarking of the code; week 8: report completion and presentation preparation.
Final Product Description:
The resulting Apache Spark program will be resilient enough to successfully complete calculations (e.g. in quantum chemistry) even when compute nodes crash or fail on the fly. From the outreach/dissemination perspective, the interesting aspect of this project is that it bridges HPC and the much more “popular” Big Data world. We do not expect it to lead directly to visually appealing outputs, but we will try to produce some (molecules, orbitals, execution graphs, etc.).
Adapting the Project: Increasing the Difficulty:
The goal is to push the efficiency of the Apache Spark code to the maximum, which is, in itself, “infinitely difficult”.
The student will obtain access to a (multinode) Apache Spark cluster. Apache Spark is an open-source project, accessible and easy to install on any commodity hardware cluster. Moreover, there are several free virtual machine images with preinstalled software available from companies like Cloudera, MapR or Hortonworks, ideal for learning and pilot development.
Computing Centre of the Slovak Academy of Sciences