The secret of Big Data programmers
In my last blog post, I described the problem of linkage prediction in large-scale graphs. I also mentioned that this is a big data processing problem. Today I will present you one of the biggest secrets of big data programmers: the MapReduce paradigm.
First, I should mention why processing a big dataset – let say having 1 Petabyte (PB) of data – creates difficulties to modern computer systems. Well, let’s start with the most basic thing: to process the data we need to read it first. The read rate of a typical hard drive is 200 MB per second (one can argue that we can use much faster SSD discs, but they are still far too expensive to use for storage in data warehouses). So, how many megabytes are in 1 PB? It’s 1 073 741 824 MB! This means that we need 1491 hours i.e. 478 days just to read a dataset like this! And remember, the speed of writing to a disk is even lower. So how it is possible that modern big data systems need only 234 mins to sort 1PT of data (so to read, write and make some real processing with the data)? You knew it – HPC comes to the stage now 😉
If it is impossible to efficiently read such an amount of data from one disc, then you divide this data across many computers. Having 200 computers with the distributed data we are able to read this data in less than 8 minutes. That’s great, but now we have another problem: our program needs to read this data separately, communicating results between computers, and what’s more the result of our processing probably will be huge so we need to save our results also in distributed way. To make it simpler for programmers the MapReduce paradigm was proposed. I will explain this paradigm on the famous word-count example. Our task is to read a huge amount of text and return the information of how many times every single word appears in it.
MapReduce consists of three steps: Map, Shuffle (which is done automatically, we don’t need to program it) and Reduce. During a map stage, each line of the input file is transformed into one or many pairs of objects. The first object in such a pair is called “key”, and the second is called “value”. Note, that the map operation for a particular line does not depend on other lines, hence it can be executed at any time and at any node (a computer in the cluster). Returning to our example: our map will split the line into words, and then it will emit a pair (word, 1) for each word.
Later, during the shuffle, each node is sending all produced pairs in the map stage to other nodes. However, using HPC magic (called hash functions) all pairs with the same key are transported to the same node making further processing much simpler.
Finally, having a node all key-value pairs related to a particular word we can reduce them into one key-value pair with the sum of values. Note that, reduce operation on a key does not depend on the other reduce operations on different keys making it possible to execute it in parallel.
And this concludes our example, because we already created the result which we wanted to get: a distributed map containing information how many times each word occurs in the file!
To sum up, MapReduce is a simple developer abstraction which allows us to write distributed & parallel applications in a simple way and it has come to be a standard in the implementation of big data processing applications. This paradigm was further extended and implemented in the Apache Spark framework (see more info in the blog post here), which was able e.g. to sort 1 PB of data obtaining earlier mentioned, awesome result: 234 minutes.
Source of featured image: http://www.thebluediamondgallery.com/tablet/b/big-data.html