Project reference: 1811
The objective of the project is to improve a data-hub framework (using Apache Kafka, Apache Spark, Elasticsearch and Kibana) for collecting and monitoring real-time environmental data (air composition, groundwater quality and seismicity data; see links below) using distributed resources (e.g. JASMIN-CEDA), and to create an alert system for interpreting sensor data in the field. The alert system will classify 'outlier' values, distinguishing, for example, a sensor failure from a natural event (e.g. seismic movement). It will also detect gaps in the sensor data (e.g. when a specific sensor stops reporting values) and submit notifications.
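The gap-detection side of such an alert system could be sketched as follows. This is a minimal, pure-Python illustration, not the project's implementation; the sensor IDs, timestamps and the ten-minute silence threshold are all hypothetical:

```python
from datetime import datetime, timedelta

def find_gaps(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the IDs of sensors that have been silent too long.

    last_seen maps sensor ID -> timestamp of its most recent reading;
    any sensor whose last reading is older than max_silence is flagged
    so that a notification can be submitted.
    """
    return sorted(
        sensor_id
        for sensor_id, ts in last_seen.items()
        if now - ts > max_silence
    )

# Hypothetical example: sensor "gw-03" stopped reporting half an hour ago.
now = datetime(2019, 7, 1, 12, 0)
last_seen = {
    "air-01": now - timedelta(minutes=2),
    "gw-03": now - timedelta(minutes=30),
}
print(find_gaps(last_seen, now))  # ['gw-03']
```

In a streaming deployment the `last_seen` table would be updated from the Kafka topic as readings arrive, with the check run on a timer.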
Project Mentor: Amy Krause
Project Co-mentor: Rosa Filgueira
Site Co-ordinator: Ben Morse
The student will analyse data from environmental sensors using cloud analytics frameworks and build models to identify outliers. They will learn about streaming and analytics tools with Python interfaces, for example Apache Kafka, Apache Spark and Elasticsearch.
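As an illustration of the kind of outlier model involved, a simple rolling z-score rule can flag readings far from the recent mean. This is a sketch only, not the model the student would build; the window values and the threshold of three standard deviations are assumptions:

```python
import statistics

def classify(window, new_value, threshold=3.0):
    """Classify a new sensor reading against a recent window of readings.

    Returns 'outlier' if the value lies more than `threshold` standard
    deviations from the window mean, otherwise 'normal'. A fuller model
    would go on to distinguish sensor failures from natural events such
    as seismic movement.
    """
    mean = statistics.mean(window)
    stdev = statistics.stdev(window)
    if stdev == 0:
        return "normal"
    z = abs(new_value - mean) / stdev
    return "outlier" if z > threshold else "normal"

# Hypothetical recent readings from one sensor.
window = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 19.8, 20.4]
print(classify(window, 20.2))  # normal
print(classify(window, 35.0))  # outlier
```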
Student Prerequisites (compulsory):
A strong programming background, with an interest in HPC, parallel programming and real-time data processing and data streaming.
Student Prerequisites (desirable):
Experience in Python, parallel programming techniques, big data engineering and/or the willingness to learn these technologies.
Training Materials:
These will be provided to the successful student once they accept the placement.
Workplan:
- Weeks 1 & 2: Familiarise with the existing framework and the streaming tools, and download the data for the examples;
- Weeks 3 & 4: Understand, clean and prepare the sensor data to feed into a model for interpreting the data and classifying potential alerts;
- Weeks 5-7: Validate and improve the model, and implement an alert system based on the model;
- Week 8: Final report and demo at the EPCC seminar.
Final Product Description:
The final project result will be a feasibility study and/or a prototype showing how such an alert system could be implemented.
Adapting the Project: Increasing the Difficulty:
The alert system can be extended into a pluggable framework supporting different models built by the student. The student can also create scripts to build a Docker infrastructure for a distributed system and test it in real time.
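One way such a pluggable framework could be sketched is with a common model interface that the alert system iterates over. The interface and the toy range-check model below are hypothetical; a real framework would wrap the streaming tools:

```python
from abc import ABC, abstractmethod

class AlertModel(ABC):
    """Interface that every pluggable model must implement."""

    @abstractmethod
    def check(self, sensor_id, value):
        """Return an alert string, or None if the reading is normal."""

class RangeModel(AlertModel):
    """Toy model: alert when a reading leaves a fixed valid range."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def check(self, sensor_id, value):
        if not (self.low <= value <= self.high):
            return f"{sensor_id}: {value} outside [{self.low}, {self.high}]"
        return None

class AlertSystem:
    """Runs each incoming reading through every registered model."""

    def __init__(self):
        self.models = []

    def register(self, model):
        self.models.append(model)

    def process(self, sensor_id, value):
        alerts = []
        for model in self.models:
            alert = model.check(sensor_id, value)
            if alert is not None:
                alerts.append(alert)
        return alerts

system = AlertSystem()
system.register(RangeModel(0.0, 50.0))
print(system.process("air-01", 120.0))  # one alert
print(system.process("air-01", 21.0))   # []
```

New models (e.g. the z-score or seismic-event classifiers) would then plug in simply by subclassing `AlertModel` and registering an instance.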
Adapting the Project: Decreasing the Difficulty:
The model can be provided for the student to include in the framework. A subset of the environmental data can be used to build a non-distributed version that can be run on a laptop.
A desktop/laptop capable of running the development tools, and possibly access to cloud infrastructure (for example Azure or JASMIN), which we will apply for.