Learn to Write MapReduce in R Step-by-Step

First, you need a local virtual instance of Hadoop with R. Download a virtualization tool such as Oracle VM VirtualBox (or VMware) and import the Mint-Hadoop.ova image. This .ova file already contains a working installation of Hadoop and RStudio.
Starting Hadoop services in the terminal:
start-dfs.sh
start-yarn.sh
hadoop fs -ls
Starting RStudio from the terminal:
rstudio
Setting Environment Variables:
Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
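Before loading the RHadoop packages, it can be worth confirming that the paths set above actually exist in your VM (these are the paths assumed by the Mint-Hadoop image; yours may differ):

```r
# Confirm the environment variables point at real files
# before rmr2 tries to shell out to them
stopifnot(file.exists(Sys.getenv("HADOOP_CMD")))
stopifnot(file.exists(Sys.getenv("HADOOP_STREAMING")))
```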
Installing RHadoop Libraries
Note that rhdfs and rmr2 are not available on CRAN; download their source tarballs from the RevolutionAnalytics RHadoop repository on GitHub and install them from source (the file names depend on the version you download):
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")
library(rhdfs)
library(rmr2)
hdfs.init()
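Once hdfs.init() succeeds, a quick sanity check is to list the HDFS root from R (hdfs.ls is part of rhdfs):

```r
# Returns a data frame describing the files and
# directories in the HDFS root
hdfs.ls("/")
```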
Now we are ready to test a small Hadoop MapReduce job in RStudio. Here we count the number of observations of each species in the iris dataset.
hdfs_input = to.dfs(iris$Species)
mapreduce_job = from.dfs(
  mapreduce(
    input = hdfs_input,
    # emit each species as a key with the value 1
    map = function(k, v) keyval(v, 1),
    # count how many 1s were emitted for each species
    reduce = function(k, vv) keyval(k, length(vv))))
result = data.frame(species = mapreduce_job$key, count = mapreduce_job$val)
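While developing, rmr2 jobs can also be debugged without touching the cluster by switching to the local backend, which runs the same map and reduce functions as ordinary in-process R code. A minimal sketch:

```r
library(rmr2)

# Run MapReduce jobs in-process instead of on Hadoop -- useful for debugging
rmr.options(backend = "local")

out = from.dfs(
  mapreduce(
    input = to.dfs(iris$Species),
    map = function(k, v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, length(vv))))

# Switch back to the Hadoop backend for real runs
rmr.options(backend = "hadoop")
```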
Stop Hadoop Services:
stop-yarn.sh
stop-dfs.sh
You can also set up your own Hadoop cluster, if you have several PCs on a domain, by following the instructions in this link.

The FutureLearn course “Managing Big Data with R and Hadoop” by PRACE and the University of Ljubljana is a step-by-step course on managing and analyzing big data with the R programming language and the Hadoop framework. It is one of the top courses for learning statistical learning with RHadoop.

I have adapted some of the algorithms developed in my project “Predicting Electricity Consumption” to RHadoop, and the example has been made available in this course. The electricity consumption data is provided in the Hadoop Distributed File System (HDFS) for you to practice on. The example explains how to format and aggregate the data, how to calculate the mean and standard deviation using MapReduce jobs, and then how to plot the results. Try the exercises at the end of the course and leave a comment there if you have any questions.
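As an illustration of the kind of aggregation the course covers, a per-group mean and standard deviation can be computed in a single rmr2 job. The consumption data frame and its column names below (household, kwh) are hypothetical stand-ins, not the actual course data:

```r
library(rmr2)

# Hypothetical consumption data: one meter reading per row
consumption = data.frame(household = c("a", "a", "b", "b", "b"),
                         kwh = c(1.2, 1.5, 0.8, 0.9, 1.1))

stats = from.dfs(
  mapreduce(
    input = to.dfs(consumption),
    # key each reading by its household
    map = function(k, v) keyval(v$household, v$kwh),
    # compute mean and standard deviation of each household's readings
    reduce = function(k, vv) keyval(k, c(mean = mean(vv), sd = sd(vv)))))
```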
Additional RHadoop Materials: