Parallel big data analysis within R for better electricity consumption prediction

[Figure: The workflow of the project]

Project reference: 2218

The main objective of this SoHPC project is to test how our existing code for energy consumption prediction scales from a local server to a supercomputer. We have developed Python and R scripts to retrieve data, store it in MongoDB, and load it back when needed. Additionally, we have developed scripts that build prediction models based on the historical data. Using deep neural networks, building each data model takes approx. 2 minutes and approx. 8 MB of memory. We have also tested parallelization within R using the parallel, doParallel and foreach libraries. The main goal of this project is therefore to improve the existing R and Python scripts so that we can build 10,000 models and predictions within a time limit of approx. 1 hour, using a supercomputer with state-of-the-art computing nodes and storage, and R libraries for MPI.
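
As a rough illustration of the fan-out pattern described above (not the project's actual scripts), each worker process builds one per-meter model independently. Everything here is a hypothetical stand-in: the meter IDs, the synthetic history, and the trivial mean-based "model" that replaces the real deep neural network so the parallel structure stays visible.

```python
# Sketch of building many independent per-meter models in parallel,
# using Python's standard multiprocessing library. In the real workflow
# each model is a deep neural network (~2 minutes, ~8 MB apiece); here a
# mean predictor over synthetic data stands in for it.
from multiprocessing import Pool
import random

def build_model(meter_id):
    # Placeholder for loading one meter's history and fitting a model.
    random.seed(meter_id)                       # reproducible synthetic data
    history = [random.uniform(0.5, 2.0) for _ in range(24)]
    prediction = sum(history) / len(history)    # stand-in for a DNN forecast
    return meter_id, prediction

if __name__ == "__main__":
    meter_ids = range(100)                      # the real target is 10,000 meters
    with Pool(processes=4) as pool:
        results = dict(pool.map(build_model, meter_ids))
    print(len(results))                         # prints 100: one prediction per meter
```

Because the models are independent, the same pattern maps directly onto MPI-style distribution across nodes: the pool of 4 local processes is simply replaced by ranks on the cluster.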

Project Mentor: Prof. Janez Povh, PhD

Project Co-mentor: Matic Rogar

Site Co-ordinator: Leon Kos

Learning Outcomes:

  • Student will master R and parallelization in R, using RStudio to create computing jobs, together with libraries for efficient parallel data management and analysis;
  • Student will learn MongoDB for big data management.
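
As a hedged sketch of the MongoDB side of these outcomes, a reading can be shaped into a document and inserted with pymongo. The database and collection names (`energy`, `readings`) and the field layout are illustrative assumptions, not the project's actual schema.

```python
# Minimal, hypothetical sketch of storing meter readings in MongoDB.
# The document shape below is an assumption for illustration only.
from datetime import datetime, timezone

def to_document(meter_id, timestamp, kwh):
    """Shape one consumption reading as a MongoDB document."""
    return {
        "meter_id": meter_id,
        "ts": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "kwh": float(kwh),
    }

if __name__ == "__main__":
    from pymongo import MongoClient  # requires a running MongoDB instance
    client = MongoClient("mongodb://localhost:27017")
    coll = client["energy"]["readings"]
    coll.insert_one(to_document("M-001", 1_600_000_000, 1.25))
    print(coll.count_documents({"meter_id": "M-001"}))
```

Keeping the document-shaping logic separate from the database calls makes it easy to test the former without a live server.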

Student Prerequisites (compulsory):
R, Python;
Basics of regression and classification

Student Prerequisites (desirable):
Basics of data management (NoSQL databases – MongoDB)

Training Materials:
The candidate should go through PRACE MOOC:
https://www.futurelearn.com/admin/courses/big-data-r-hadoop/7

Workplan:

W1: introductory week;
W2: efficient I/O management of industrial big data files on a local HPC;
W3–4: studying existing scripts for data management and prediction, and parallelizing them;
W5: testing the new scripts on real data;
W6: final report;
W7: wrap-up.
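
The efficient I/O step in W2 could, for example, stream large measurement files instead of loading them whole. A stdlib-only sketch under assumed column names (`meter_id`, `kwh` — not necessarily the project's file format):

```python
# Hypothetical sketch of chunk-free streaming I/O for large measurement
# files: aggregate per-meter consumption line by line, so memory use stays
# constant regardless of file size.
import csv
import io
from collections import defaultdict

def total_kwh_per_meter(csv_file):
    """Aggregate total consumption per meter from a streamed CSV file object."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["meter_id"]] += float(row["kwh"])
    return dict(totals)

if __name__ == "__main__":
    sample = io.StringIO("meter_id,kwh\nM1,1.5\nM2,2.0\nM1,0.5\n")
    print(total_kwh_per_meter(sample))  # prints {'M1': 2.0, 'M2': 2.0}
```

The same streaming pattern applies when reading from a parallel filesystem on the HPC side, where each rank can process its own shard of the input.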

Final Product Description:

  • Developed scripts to retrieve industrial big data files and store them in Hadoop;
  • Created R scripts for parallel analysis and computing new prediction models.

Adapting the Project: Increasing the Difficulty:
We can increase the size of the data or add a more demanding visualization task.

Adapting the Project: Decreasing the Difficulty:
We can decrease the size of the data or simplify the prediction models.

Resources:
R, RStudio, and MongoDB installations at the University of Ljubljana, Faculty of Mechanical Engineering

Organisation:
UL-University of Ljubljana

ULFME

