Huge task meets huge machine. The convergence of HPC and Big Data.

I have to say that this is my unconditional favourite “meme” of the year so far because of its relatability. Even though we can probably all admit to feeling a bit guilty about laughing at a disaster, sometimes its sheer magnitude and our absolute impotence make laughter the only logical reaction, and we feel like that worker in the digger.

Image source: BBC & EPA, https://ichef.bbci.co.uk/news/976/cpsprodpb/182D6/production/_117703099_5a98eeaa-a480-4cfa-a2c2-88d08336718b.jpg
This is, more or less, also the challenge HPC systems face when dealing with huge chunks of “big” data. They get to work on it with one or multiple diggers (likely CPUs), bit by bit, until it is done. Legacy and mainstream HPC systems are not well suited to dealing with Big Data, a memory-bandwidth-bound problem, because they have historically been tailored to relieve a different bottleneck in science: how to deal with large, expensive computations, or processor-bound tasks.
For example, when simulating an evolving system governed by differential equations, such as fluid flow or electromagnetic particle interactions, the initial conditions are generally well understood and not very “large” compared with the amount of data that gets generated and periodically saved as the simulation evolves. In a fluid simulation, every time step you are interested in produces roughly the same amount of data as your initial conditions; you save it, and then you “restart” the simulation with this new state as the “initial” conditions for the next time step.
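To make that a bit more concrete, here is a minimal Python sketch of that pattern. It is a toy example of my own, not code from the project, and the `advance` update rule is a made-up placeholder for a real solver: the state stays the same modest size, each step grinds through arithmetic on it, and each saved snapshot is simply the starting point for the steps that follow.

```python
import numpy as np

# Toy time-stepping loop (illustrative only, not a real fluid solver).
# The state is fixed in size, but every step does a lot of arithmetic on it,
# so the run is dominated by computation rather than by reading input data.

def advance(state, dt):
    """Hypothetical update rule standing in for a real PDE solver step."""
    # Placeholder "physics": a cheap diffusion-like smoothing along one axis.
    return state + dt * (np.roll(state, 1, axis=0) - 2 * state + np.roll(state, -1, axis=0))

state = np.random.rand(256, 256, 256)      # "initial conditions": modest in size (~128 MB)
dt, n_steps, save_every = 1e-3, 1000, 100

for step in range(n_steps):
    state = advance(state, dt)             # heavy compute on a fixed-size state
    if step % save_every == 0:
        # Each snapshot is about the same size as the initial conditions,
        # and becomes the "initial" state for the next stretch of steps.
        np.save(f"snapshot_{step:05d}.npy", state)
```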
The task of processing credit card records from millions of customers while hunting for fraud is, instead, very different. There is a lot of data that needs to be read into memory and processed continuously, sometimes in real time, and usually the output “conclusions” of the analysis are much simpler than the input dataset: we are looking for fraud or no fraud, that simple. Mainstream HPC systems are designed to produce fine wine from fine grapes, not to distill vodka from a stinking mix of crushed potatoes, corn and seeds.
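The data-hungry side of the story looks more like the toy scan below. Again, this is just an illustration I made up, with an invented file name, column names and a deliberately naive “rule”: the machine mostly streams bytes from storage through memory, does almost no arithmetic per record, and produces a result that is tiny compared with the input.

```python
import csv

# Toy streaming scan over transaction records (illustrative only).
# Gigabytes flow through memory once, very little compute happens per byte,
# and the output is a short list of flagged IDs.

SUSPICIOUS_AMOUNT = 10_000.0   # made-up threshold for the sake of the example

def scan_for_fraud(path):
    flagged = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):          # read the data once, record by record
            amount = float(row["amount"])
            if amount > SUSPICIOUS_AMOUNT:     # trivial compute per record
                flagged.append(row["transaction_id"])
    return flagged                             # tiny output compared with the input

if __name__ == "__main__":
    print(f"{len(scan_for_fraud('transactions.csv'))} transactions flagged")
```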
This type of Big Data challenge is becoming ever more common as the user base of HPC systems expands. The main characters in the “who gets to use the supercomputer” arguments are no longer just quantum physicists and aerodynamicists, but also epidemiologists studying human interactions to fight COVID-19 and social scientists unwrapping the spread of digger memes on Twitter. Nor are all HPC systems going to be CPU machines with low memory bandwidth; we are likely to see many more GPU-based supercomputers, better suited to machine/deep learning workflows.
This untapped growth, and the chance to learn a bit more about what promises to be a new paradigm in HPC use, is what makes me so excited to join “Project Reference: 2133 The convergence of HPC and Big Data/HPDA”, led by Giovanna Roda, Dieter Kvasnicka and Claudia Blaas-Schenner, collaborating with Liana Akobian, Petra Krasnic and my fellow PRACE Summer of HPC student Rajani Kumar at the Vienna Scientific Cluster – TU Wien in Austria.
I think I was also supposed to chat a bit about myself! I love espresso, I like to “pay it forward”, I want to leave my grain of sand in the fight against climate change, and you can always tempt me to try new things, like working with HPC systems! I recently graduated with a master's in Aeronautical Engineering from the University of Glasgow, supervised by a fantastic lecturer and researcher, Dr Kiran Ramesh, who kindly pointed me to the PRACE Summer of HPC webpage, and, well, here we are.
I am really excited to start my own project, but also to get to know my fellow PRACE Summer of HPC students and hear about your amazing projects. Feel free to send me a request on GitHub or LinkedIn so I can follow you and keep in touch!