Apache Spark: Bridge between HPC and Big Data?

Directed Acyclic Graph of transformations of RDDs (Resilient Distributed Dataset) in Spark program execution.

Project reference: 1606

The student is expected to cooperate on the implementation and performance testing of simple quantum chemistry methods, such as Hartree-Fock/DFT or second-order Møller-Plesset perturbation theory, using Apache Spark, a general-purpose engine for large-scale data processing (an alternative to Hadoop/MapReduce). Because Spark runs on top of the JVM (Java Virtual Machine), it can hardly match the floating-point performance of Fortran/C(++) programs compiled to native machine code. It nevertheless has many desirable features for a (distributed) parallel application: fault tolerance, node-aware distributed storage, caching and automated memory management. We are curious how far the performance of a Spark application can be pushed, e.g. by substituting critical parts with compiled native code or by using efficient BLAS-like libraries.

We do not expect the code to be truly competitive with MPI (especially its latest versions) in production applications. Still, such an experiment may be valuable for people from the Big Data world who use or write (often computationally demanding) code, e.g. machine-learning algorithms, without experience with HPC and its established concepts such as efficient tensor contractions.

The choice of implementation target, the aforementioned “basic” quantum chemistry methods, reflects the professional background of the project mentor and is certainly open to negotiation. Any HPC application with non-negligible data flow is acceptable as well.

The plan is to write the Spark code in Scala (a functional/object-oriented language, similar to Java).
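
As a rough illustration of what a tensor contraction looks like in the map/reduce style, here is a minimal sketch in plain Scala collections. It is not the project's code: the object `ContractionSketch`, the function `contract`, the sparse record layout and the toy numbers are all hypothetical. It computes J(i,j) = sum over k,l of eri(i,j,k,l) * d(k)(l). Spark RDDs offer analogous `map` and `reduceByKey` operations, so the same pattern could in principle be distributed.

```scala
// Hypothetical sketch, not the project's implementation.
object ContractionSketch {
  type Key = (Int, Int)

  // The four-index tensor is stored sparsely as ((i, j, k, l), value) records.
  def contract(eri: Seq[((Int, Int, Int, Int), Double)],
               d: Array[Array[Double]]): Map[Key, Double] =
    eri
      .map { case ((i, j, k, l), v) => ((i, j), v * d(k)(l)) } // "map": one partial term per record
      .groupBy(_._1)                                            // group partial terms by output index (i, j)
      .map { case (ij, terms) => ij -> terms.map(_._2).sum }    // "reduce": sum the terms of each group

  def main(args: Array[String]): Unit = {
    // Toy 2x2 matrix and three tensor entries (made-up numbers).
    val d = Array(Array(1.0, 0.5), Array(0.5, 2.0))
    val eri = Seq(
      ((0, 0, 0, 0), 1.0),
      ((0, 0, 1, 1), 0.5),
      ((0, 1, 0, 1), 0.25)
    )
    println(contract(eri, d))
  }
}
```

Keying each partial product by its output index means the summation is an independent reduction per key, which is what makes this kind of contraction straightforward to spread over many nodes. A dense BLAS call would normally be far faster for each local block.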

Project Mentor: Doc. Mgr. Michal Pitoňák, PhD.

Site Co-ordinator: Mgr. Lukáš Demovič, PhD.

Student: Oisín Benson

Learning Outcomes:
The student will become familiar with the Scala programming language and Apache Spark, as well as with ideas for the efficient implementation of tensor-contraction-based HPC applications, particularly in quantum chemistry.

Student Prerequisites (compulsory): 
Basic knowledge of Scala or advanced knowledge of Java. Background in quantum chemistry or physics.

Student Prerequisites (desirable): 
Advanced knowledge of Scala, basic knowledge of Apache Spark, BLAS libraries and other HPC tools, knowledge of C/C++.

Training Materials:
http://www.scala-lang.org, http://spark.apache.org, http://nd4j.org/scala.html

Week 1: training
Weeks 2-3: introduction to Scala, Spark and the theory of the implemented quantum chemistry methods
Weeks 4-7: implementation, optimization and extensive testing/benchmarking of the code
Week 8: report completion and presentation preparation

Final Product Description: 
We believe the resulting codes will be able to complete calculations (e.g. in quantum chemistry) successfully even when compute nodes crash or fail during the run. From the outreach/dissemination perspective, the interesting aspect of this project is bridging the HPC and Big Data worlds. We do not expect it to lead directly to visually appealing outputs, but we will try to produce some (molecules, orbitals, execution graphs, etc.).

Adapting the Project: Increasing the Difficulty:
The goal is to push the efficiency of the Spark code(s) to the maximum, which is in itself “infinitely difficult”. We plan to start with basic quantum chemistry methods, but we can move on to more advanced ones (e.g. coupled cluster).

The student must have access to a (multinode) Apache Spark cluster. At CC SAS we have the IBM BigInsights software package, which offers this functionality, installed on an IBM P755 InfiniBand cluster with a GPFS parallel filesystem. Apache Spark is an open-source project and is accessible and easy to install on any commodity hardware cluster.

Computing Centre of the Slovak Academy of Sciences
