Learn to Write MapReduce in R Step-by-Step

First, you need a local virtual instance of Hadoop with R. Install Oracle VirtualBox (or VMware) and import the Mint-Hadoop.ova image; this .ova file already contains a working installation of Hadoop and RStudio.

Starting Hadoop services in the terminal:


start-dfs.sh
start-yarn.sh

# list your HDFS home directory to confirm that the services are up
hadoop fs -ls

Starting RStudio from the terminal:


rstudio

Setting Environment Variables:


Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
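
You can quickly confirm that the R session sees these settings before loading the RHadoop packages:

# print the Hadoop-related variables as seen by this R session
Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING", "JAVA_HOME"))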

Installing RHadoop Libraries:


install.packages("rhdfs")
install.packages("rmr2")

library(rhdfs)
library(rmr2)
hdfs.init()
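
If everything is wired up correctly, rhdfs can now talk to HDFS directly from R. As a quick check (hdfs.ls is one of the rhdfs helpers), list the root of the file system:

# list the contents of the HDFS root directory to verify the connection
hdfs.ls("/")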

Now we are ready to test a small Hadoop MapReduce job in RStudio. Here, we count the number of observations of each species in the iris dataset.


# write the Species column to HDFS as the job input
hdfs_input = to.dfs(iris$Species)

mapreduce_job = from.dfs(
  mapreduce(
    input = hdfs_input,
    # map: emit each species as a key with a count of 1
    map = function(., v) keyval(v, 1),
    # reduce: count how many values were collected for each species
    reduce = function(k, vv) keyval(k, length(vv))))

# collect the keys (species) and values (counts) into a data frame
result = as.data.frame(cbind(mapreduce_job$key, mapreduce_job$val))
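
If you want to debug a job without touching the cluster, rmr2 also offers a local backend that runs the same map and reduce functions inside the R session. A minimal sketch, assuming the job above is otherwise unchanged:

# run the same job in-process, without Hadoop, for quick debugging
rmr.options(backend = "local")

local_job = from.dfs(
  mapreduce(
    input = to.dfs(iris$Species),
    map = function(., v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, length(vv))))

# switch back to the Hadoop backend once the logic looks right
rmr.options(backend = "hadoop")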

Stop Hadoop Services:


stop-yarn.sh
stop-dfs.sh

You can also set up your own Hadoop cluster if you have several PCs on a domain by following the instructions in this link.

FutureLearn: Digital Education Platform

The FutureLearn course “Managing Big Data with R and Hadoop”, by PRACE and the University of Ljubljana, walks step by step through managing and analyzing big data with the R programming language and the Hadoop framework. It is one of the top courses for statistical learning with RHadoop.

Managing Big Data with R and Hadoop course on the FutureLearn platform

I have adapted some of the algorithms developed in my project “Predicting Electricity Consumption” to RHadoop, and the example has been made available in this course. The electricity consumption data is available in the Hadoop Distributed File System (HDFS) for you to practice on. The example explains how to format and aggregate the data, calculate the mean and standard deviation using MapReduce jobs, and then plot the results. Try the exercises at the end of the course and leave a comment there if you have any questions.
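
To give a flavour of what such exercises look like, here is a small sketch (not the course's own code) of computing a per-group mean with an rmr2 job; the readings data frame and its day/kwh columns are made up purely for illustration:

# illustrative toy data: a few electricity readings with a grouping column
readings = data.frame(day = rep(c("Mon", "Tue"), each = 3),
                      kwh = c(1.2, 1.5, 1.1, 2.0, 1.8, 2.2))

group_means = from.dfs(
  mapreduce(
    input = to.dfs(readings),
    # map: emit each consumption value keyed by its day
    map = function(., v) keyval(v$day, v$kwh),
    # reduce: average all the values collected for one day
    reduce = function(k, vv) keyval(k, mean(vv))))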

Additional RHadoop Materials:

My name is Khyati and I come from Jaipur (India), the Pink City. I am studying for a Master’s in Data Science at the University of Salford. I am passionate about continuous learning and finding new, exciting technologies that can prove useful for my data projects. I like working on data mining (prediction/forecasting), big data, advanced databases, and text mining problems that involve processing, manipulation, statistical analysis, visualization, and modeling.
