From a Desolate Düren Platform to Spark in Slovakia
It's been one hell of a first couple of weeks here, despite the worrying last leg of the trip to Juelich. Everything was smooth sailing until Düren station, where the train to Juelich was leaving from an apparently non-existent platform 23. After doing a loop of the station I discovered an overgrown, seemingly abandoned set of tracks beside a potholed, empty platform matted with grass and a small sign marked 23. While I was checking the timetable for the fifth time, some other passengers emerged onto this desolate platform and confirmed that the train to Juelich did indeed depart from there. I still found this difficult to believe until, an hour later, a rickety old tram rolled up to the platform and I was on the way to Juelich.
Despite this ominous start to SoHPC, the training week as a whole was a fantastic experience. We kicked it off on Monday with a tour of the supercomputing facilities at the Forschungszentrum. Their JUQUEEN packs one hell of a punch, with 5 petaflops peak performance and fans that can climb to a deafening 90 dB. From Tuesday onward we were given a crash course in core HPC concepts, racing through MPI, OpenMP, CUDA and visualisation tools at breakneck speed from 9am to 7pm. Thursday gave us all some respite with an afternoon of go-karting. I started the race well but my time slipped towards the end with one too many barrier-assisted turns. My teammate Shaun raced like a machine, overtaking people with brutal efficiency to hand us a respectable mid-table finish. All in all it was a brilliant week, and it was bittersweet as we said our goodbyes on Friday to head off to our projects.
And now to my project at the Slovak Academy of Sciences in Bratislava. My project is unusual in that it is the only one to use modern tools such as Scala (a multi-paradigm language that combines functional and object-oriented programming) and Apache Spark (a general-purpose engine for cluster data processing). Started by Matei Zaharia at UC Berkeley’s AMPLab in 2009 and open sourced in 2010, Spark has seen relatively rapid adoption in the area of big data due to its speed compared to Hadoop MapReduce and its relative ease of use, with support for applications in multiple languages. Its use in scientific computing has been relatively scant so far, though it has attracted interest from NASA for large-scale data processing. We hope that its fault tolerance, node-aware distributed storage, caching and automated memory management will compensate for the loss of performance versus a pure HPC approach such as Fortran with MPI.
Over the course of the project we aim to implement and performance-test a simple quantum chemistry method, such as the Hartree-Fock method, using these tools. Hartree-Fock theory is fundamental to much of electronic structure theory, providing an approximate method for determining the wave function and the energy of a quantum many-body system in a stationary state. It is the basis of molecular orbital (MO) theory, which posits that each electron’s motion can be described by a single-particle function (orbital) which does not depend explicitly on the instantaneous motions of the other electrons.
Hartree-Fock theory can only provide an exact solution in the case of the hydrogen atom, where the orbitals are exact eigenfunctions of the full electronic Hamiltonian. However, Hartree-Fock theory often provides a good starting point for more elaborate theoretical methods which are better approximations to the Schrödinger equation (e.g., many-body perturbation theory).
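For readers who like to see the equations, here is a rough sketch of the closed-shell (restricted) Hartree-Fock problem in a finite basis, written in standard textbook notation. The symbols are assumptions chosen for illustration and are not taken from the project code: F is the Fock matrix, S the overlap matrix, C the orbital coefficients, P the density matrix, H^core the one-electron Hamiltonian, (μν|λσ) the two-electron integrals and N the number of electrons.

```latex
% Restricted Hartree-Fock in a finite basis (Roothaan-Hall form), real orbitals assumed
\begin{align}
  \mathbf{F}\mathbf{C} &= \mathbf{S}\mathbf{C}\boldsymbol{\varepsilon} \\
  F_{\mu\nu} &= H^{\mathrm{core}}_{\mu\nu}
     + \sum_{\lambda\sigma} P_{\lambda\sigma}
       \left[ (\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma) \right] \\
  P_{\lambda\sigma} &= 2 \sum_{a=1}^{N/2} C_{\lambda a} C_{\sigma a} \\
  E_{\mathrm{elec}} &= \tfrac{1}{2} \sum_{\mu\nu} P_{\mu\nu}
       \left( H^{\mathrm{core}}_{\mu\nu} + F_{\mu\nu} \right)
\end{align}
```

Because F depends on P, which in turn depends on C, the equations are solved iteratively until the density stops changing, which is why the method is also called the self-consistent field (SCF) procedure.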
The first week or so of the project was spent mainly on setting up the tools, reading up on Spark and refreshing my Scala knowledge. As our Spark setup relies on Hadoop and HDFS (the Hadoop Distributed File System), we opted for a virtual machine running locally on my laptop to test code before scaling up to the Spark cluster at the Slovak Academy of Sciences. After the MapR installation drove the system administrator up the wall, we successfully got the Cloudera image working instead. This allowed me to play around with Spark a bit and familiarise myself with it.
Once this was done I began focusing on the HF algorithm, reading up on previous C/C++ implementations and prototyping it in Scala. Rewriting C code for this proved challenging, but Scala’s Breeze library for numerical computing did help ease the process, providing support for most of the required linear algebra operations. However, the tensor operations required for the final integrals meant I had to implement my own functions. Once the bugs in these final operators are fixed, it should be reasonably straightforward to get it running in Spark, but that presents new challenges, such as correctly dealing with the data for the integral calculations, which is stored in binary files. From there I can begin considering optimisation of the program and running tests on the Spark cluster. Full steam ahead in Slovakia.
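To give a flavour of the kind of tensor operation mentioned above, here is a minimal sketch in plain Scala (no Breeze, no Spark) of contracting a four-index array of two-electron integrals with a density matrix to build a closed-shell Fock matrix. It is not the project's actual code: the object name `FockSketch`, the function names, the dense nested-array layout and the closed-shell convention F = H + Σ P (J − ½K) are all assumptions made for illustration.

```scala
// Illustrative sketch only: names and data layout are hypothetical.
object FockSketch {
  type Matrix  = Array[Array[Double]]
  type Tensor4 = Array[Array[Array[Array[Double]]]]

  // Two-electron part of the Fock matrix (closed-shell convention assumed):
  //   G(i)(j) = sum over k,l of P(k)(l) * ( eri(i)(j)(k)(l) - 0.5 * eri(i)(k)(j)(l) )
  // where eri(i)(j)(k)(l) holds the two-electron integral (ij|kl).
  def twoElectronPart(p: Matrix, eri: Tensor4): Matrix = {
    val n = p.length
    Array.tabulate(n, n) { (i, j) =>
      var sum = 0.0
      for (k <- 0 until n; l <- 0 until n)
        sum += p(k)(l) * (eri(i)(j)(k)(l) - 0.5 * eri(i)(k)(j)(l))
      sum
    }
  }

  // Full Fock matrix: one-electron (core) Hamiltonian plus the two-electron part.
  def fock(hCore: Matrix, p: Matrix, eri: Tensor4): Matrix = {
    val n = hCore.length
    val g = twoElectronPart(p, eri)
    Array.tabulate(n, n)((i, j) => hCore(i)(j) + g(i)(j))
  }
}
```

A dense four-index array grows as the fourth power of the number of basis functions, so a real implementation would most likely exploit the symmetry of the integrals and, on a cluster, split the work across nodes; the nested loops here are only meant to show the contraction itself.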