Removing the data processing ‘bird’en with Spark
Greetings, dear audience. Sorry that I’ve been a little radio-silent over the last week. I guess you could say I dropped off the radar. Read on for another riveting instalment of ‘Allison’s summer of HPC’.
It’s hard to believe, but the Summer of HPC 2019 program is officially at the half-way mark! So far, I’ve kept my posts pretty light and airy, but today I will ask you to come with me on a journey along a more technical path. The destination of this journey? An understanding of how I am helping the UvA researchers process so much data in so little time. The stopovers on this journey? 1) How radars work, and 2) Apache Spark.
Radar, an introduction
Radar is an acronym for RAdio Detection And Ranging. Perhaps I’ve been living under a rock, but I did not know that until very recently. The key purpose of meteorological radars is to measure the position and intensity of precipitation. They accomplish this by transmitting radio signals and receiving back the echoes from objects within their range.
A meteorological radar scans using a series of elevation angles: after each full scanning rotation, the elevation angle is changed. This allows for a three-dimensional view of the surroundings, and means that both horizontal and vertical cross-sections can be analysed. One complete scanning cycle takes 5–10 minutes, and collects data covering many kilometers in altitude (up to 20) and even more kilometers in horizontal range. As you can imagine, this is a huge number of data points.
The datasets that I am working with are made up of 96 sweeps per day (one every 15 minutes). Each of these sweeps has 10 elevation angles, and at each elevation angle there is data for 16 different metrics. There are 360 x 180 data points for each metric at each elevation. This means that, per day, the pipeline needs to process:
96 x 10 x 16 x 360 x 180 = 995,328,000 data points
And this is for just one radar station! Already I’m working with three radars, so multiply the above number by 3. How on earth do meteorological hubs and researchers manage this kind of data!?! Well, many different technologies and infrastructures are used for this kind of problem. Be glad that you clicked onto this link, dear audience, because I’m going to explain all about one of these tools in the coming paragraphs.
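For the sceptics in the audience, here’s the arithmetic above as a quick back-of-the-envelope check (the variable names are mine, not from the pipeline):

```python
# Daily data volume per radar station, from the figures above.
sweeps_per_day = 96        # one sweep every 15 minutes
elevation_angles = 10
metrics = 16
azimuth_bins = 360
range_bins = 180

points_per_station = (sweeps_per_day * elevation_angles * metrics
                      * azimuth_bins * range_bins)
print(points_per_station)      # 995328000 — just under a billion per day
print(points_per_station * 3)  # 2985984000 — nearly 3 billion across three radars
```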
Introducing Apache Spark
Ah, Spark. Where would I be without you.
Spark is a brilliant tool for handling large amounts of data by distributing computational efforts across a cluster of many computers. At its core, distributed computing subscribes to the idea that ‘many hands make light work’. Spark is built on this principle and makes it easy for users to distribute computational efforts for a huge variety of tasks.
Some of the biggest companies in the world use this tool (think Amazon, Alibaba, IBM, Huawei, etc.). In fact, it is the most actively developed open-source engine for parallel computing. It’s compatible with multiple programming languages, and has libraries for a huge range of tasks. It can be run on your laptop at home, or on a supercomputer like the one at the SURFsara site where I am working.
Spark involves one driver and multiple worker nodes. It’s sometimes referred to as a ‘master/slave’ architecture. I find it easier to think of it in terms of a work environment. There’s one boss, and she has a number of workers for whom she is responsible. The boss allocates tasks to all of her workers. When a worker is done with their task, they share the results with the boss. When the boss has received everyone’s work, she does something with it: maybe she sends it to her client, or publishes a document. Now, reimagine all of the above but with computers. And instead of sending to a client, the driver node saves to disk, or produces some other output. Spark is the tool that facilitates all of this distribution.
In my project, it’s really important to design a solution where the end-users can quickly filter through the data available (years’ worth of radar files), get the specific values that they want from those files, and then generate some kind of visualisation from them (gifs being a particularly popular viz). By combining the wonder of Spark with a file format called Parquet, I’ve been able to generate some huge efficiencies for the researchers already. It now takes a matter of seconds to load and visualise a whole day of radar data into a 96-image gif! Now that’s pretty fly…
Another important feature of Spark is that it is optimised for computational efficiency, not storage. This means that it is compatible with a huge variety of storage systems. Being storage-system-agnostic makes Spark much more versatile than its competitor/predecessor: Hadoop. Hadoop includes both storage (HDFS) and compute (MapReduce). The close integration between these two systems made it hard to use one without the other. The versatility of Spark has been a really important feature in my project so far, since we’ve already had to change storage systems once (and may need to again). Constant change is all part of the adventure in the world of ornithology/HPC/data science, and Spark is the perfect tool to facilitate this!
Well, congratulations on making it to the end of this considerably more verbose and technical update. Until next time, over and out!