Large scale data techniques for research in ecology
Project reference: 1921
Ecologists studying animal movement need to integrate multiple kinds of data from different sources. In particular, at SURFsara we are helping ornithologists work with a new stream of radar data.
While the researchers are used to working with relational databases, we will explore the suitability of replacing these with static storage, along with an accompanying working environment and practices that together form a virtual lab. Ideally, querying this static storage should still feel natural (i.e. you describe which data you want, not how to access it), and ecologists should still be able to apply their existing IT knowledge.
Interns will be free to explore existing tools and to work with the radar datasets (see section 9) that ecologists are gathering, guided mainly by our Scalable Data Analytics team, to deliver a few characteristic workflows and a guide with best practices. A typical constraint in this field is ensuring that workflows remain easily scalable as data volumes grow.
Project Mentor: Ander Astudillo
Project Co-mentor: Haukurpall Jonsson
Site Co-ordinator: Lykle Voort
Participant: Allison Walker
Learning Outcomes:
- Interns will learn to work with state-of-the-art, large-scale data analysis techniques and tools.
- Interns will work in a real research environment.
Student Prerequisites (compulsory):
Linux, Data analytics, Programming in R or Python
Student Prerequisites (desirable):
Apache Parquet, Apache Spark, Jupyter, Docker, SQL, good communication skills
Project Timeline:
- Week 1: learn about the problem, our initial ideas for tackling it, and the tools and methodologies involved
- Weeks 2-3: work on a first approach, coarsely implementing a simple first workflow
- Weeks 4-6: evolve the basic concept into a scalable set-up, implementing a couple of possibly more complex workflows
- Week 7-8: write a guide with best practices for setting up and working in the developed environment
Final Product Description:
We expect as a deliverable:
- A set of representative scientific workflows
- A guide of best practices for storing and working with the data stream, using the previous workflows as examples to generalise upon
Adapting the Project: Increasing the Difficulty:
Distributing computation outside of the dedicated environment and integrating additional data streams can serve to increase the difficulty.
Adapting the Project: Decreasing the Difficulty:
Relaxing the functionality and scalability requirements can serve to decrease the difficulty.
We host our own environment for working on this assignment. Interns are free to use their laptops with any hardware and software they see fit to develop their ideas. Technologies we are considering include (but are not limited to): Apache Parquet, Apache Spark, Jupyter, RStudio, R, and Python. These tools are readily available to download and install on most common operating systems, and interns are free to choose and explore alternatives once they have delved deeper into the subject.
Deployment as a service in the form of Docker containers managed by Kubernetes will also be addressed, but it is not the focus of the assignment.
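As a rough sketch of what such a containerised environment could look like before any Kubernetes orchestration is involved (the image, port, and mount path below are illustrative choices, not a prescribed setup):

```shell
# Hypothetical single-container setup: the jupyter/datascience-notebook
# image bundles Python, R, and Jupyter; a local data directory is
# mounted into the container's home so notebooks can reach the datasets.
docker run --rm -p 8888:8888 \
    -v "$PWD/data:/home/jovyan/data" \
    jupyter/datascience-notebook
```

Under Kubernetes, the same image would instead be declared in a Deployment manifest and exposed through a Service.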
SURFsara will arrange any internal rights to data and the platform.