Parallel big data analysis within R for better electricity consumption prediction

Project reference: 2218
The main objective of this SoHPC project is to test how our existing code for energy consumption prediction scales from a local server to a supercomputer. We have developed Python and R scripts that retrieve data, store it in MongoDB and load it back when needed. Based on the historical data, we have also developed scripts that build prediction models using deep neural networks; building each model takes approx. 2 minutes and approx. 8 MB of memory. We have also tested parallelization within R using the libraries parallel, doParallel and foreach. The main goal of this project is therefore to improve the existing R and Python scripts so that we can build 10,000 models and predictions within a time limit of approx. 1 hour, using a supercomputer with state-of-the-art computing nodes and storage, together with R libraries for MPI.
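A minimal sketch of the foreach/doParallel pattern mentioned above is given below. The data frame meters and the helper build_model() are hypothetical placeholders (the real scripts fit deep neural networks per metering point and load data from MongoDB), so this only illustrates the parallelization pattern, not the project code itself.

    library(parallel)
    library(doParallel)
    library(foreach)

    # Toy stand-in for the historical consumption data
    # (in the project this is loaded from MongoDB).
    meters <- data.frame(
      meter_id    = rep(1:4, each = 48),
      hour        = rep(0:23, times = 8),
      temperature = rnorm(192, mean = 15),
      consumption = rnorm(192, mean = 2)
    )

    # Placeholder model: the real scripts fit a deep neural network
    # per metering point.
    build_model <- function(meter_data) {
      lm(consumption ~ temperature + hour, data = meter_data)
    }

    # One worker per available core, one model per metering point,
    # all models built in parallel.
    cl <- makeCluster(max(1, detectCores() - 1))
    registerDoParallel(cl)
    models <- foreach(d = split(meters, meters$meter_id),
                      .errorhandling = "pass") %dopar% build_model(d)
    stopCluster(cl)

On a supercomputer the same foreach loop can be pointed at a larger backend (e.g. an MPI-based cluster object) instead of a single node's cores, which is what the scaling test in this project is about.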

The workflow of the project
Project Mentor: Prof. Janez Povh, PhD
Project Co-mentor: Matic Rogar
Site Co-ordinator: Leon Kos
Learning Outcomes:
- Student will master R and parallelization in R, using RStudio to create computing jobs and R libraries for efficient parallel data management and analysis;
- Student will learn MongoDB for big data management (see the sketch below)
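A minimal sketch of working with MongoDB from R is shown here, using the mongolite package (one common way to access MongoDB from R). The connection string, database, collection and field names are assumptions for illustration only.

    library(mongolite)

    # Connection details, collection and field names are placeholders.
    con <- mongo(collection = "consumption",
                 db         = "energy",
                 url        = "mongodb://localhost:27017")

    # Store a data frame of historical measurements ...
    con$insert(data.frame(meter_id = 1, ts = Sys.time(), kwh = 1.7))

    # ... and load back only the fields needed for model building.
    history <- con$find(query  = '{"meter_id": 1}',
                        fields = '{"ts": true, "kwh": true, "_id": false}')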
Student Prerequisites (compulsory):
R, Python
Basics of regression and classification
Student Prerequisites (desirable):
Basics of data management (NoSQL databases such as MongoDB)
Training Materials:
The candidate should go through the PRACE MOOC:
https://www.futurelearn.com/admin/courses/big-data-r-hadoop/7
Workplan:
W1: introductory week;
W2: efficient I/O management of industrial big data files on the local HPC system (see the I/O sketch after this plan);
W3-4: studying the existing scripts for data management and prediction, and parallelizing them;
W5: testing the new scripts on real data;
W6: final report;
W7: wrap-up.
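As a rough illustration of the I/O work in W2, the sketch below uses data.table's fread/fwrite, which are commonly used for fast reading and writing of large flat files from R. The file and column names are hypothetical and only stand in for the industrial data files handled in the project.

    library(data.table)

    # Toy stand-in for one industrial measurement file.
    fwrite(data.table(meter_id = 1:3, ts = Sys.time(),
                      kwh = c(1.2, 0.8, 2.1), status = "ok"),
           "consumption_sample.csv")

    # fread is typically much faster than read.csv on large files;
    # `select` avoids loading columns the prediction models never use.
    consumption <- fread("consumption_sample.csv",
                         select = c("meter_id", "ts", "kwh"))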
Final Product Description:
- Developed scripts to retrieve industrial big data files and store them in Hadoop;
- Created R scripts for parallel analysis and for computing new prediction models;
Adapting the Project: Increasing the Difficulty:
We can increase the size of the data or add a more demanding visualization task.
Adapting the Project: Decreasing the Difficulty:
We can decrease the size of the data or simplify the prediction models.
Resources:
R, RStudio and MongoDB installations at the University of Ljubljana, Faculty of Mechanical Engineering
Organisation:
UL-University of Ljubljana