Maximising data processing efficiency in the cloud, with a twist for Research Data Management
Project reference: 2127
In times of ever-growing data volumes, cross-site collaboration, and rapidly evolving data processing techniques, moving scientific research to the cloud is becoming commonplace. With that shift, however, keeping track of where your data lives becomes key, because exploiting data locality can yield enormous performance gains. At the same time, tracking those data locations can be tricky, so assistance with this task is valuable.
SURF offers Research Data Management (RDM) services based on iRODS. iRODS provides advanced tooling for data management, scalable data storage, high-performance data transfers, and a policy engine to enforce RDM policies. Moreover, within iRODS you can annotate data objects with metadata, which can be used for data discovery or for maintaining provenance.
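As a rough illustration of the metadata workflow described above, the sketch below uses the python-irodsclient package to annotate a data object and then rediscover it by metadata rather than by path. The host, credentials, zone, and paths are placeholders, not real SURF endpoints; this is an assumption-laden sketch, not a reference setup.

```python
# Sketch: annotate an iRODS data object with metadata, then query by it.
# Connection details and paths below are placeholders.
from irods.session import iRODSSession
from irods.models import DataObject, DataObjectMeta
from irods.column import Criterion

with iRODSSession(host='irods.example.org', port=1247,
                  user='alice', password='secret', zone='exampleZone') as session:
    # Attach a key/value metadata pair to an existing data object.
    obj = session.data_objects.get('/exampleZone/home/alice/results.csv')
    obj.metadata.add('experiment', 'run-42')

    # Data discovery: find objects by metadata instead of by path.
    query = (session.query(DataObject.name)
             .filter(Criterion('=', DataObjectMeta.name, 'experiment'))
             .filter(Criterion('=', DataObjectMeta.value, 'run-42')))
    for row in query:
        print(row[DataObject.name])
```

The same annotate-and-query pattern underpins both data discovery when composing a Workspace and provenance tracking afterwards.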
We are also developing SURF Research Cloud, where making things easy for the scientist is key. In our vision, you, as a scientist, select your application along with the data you want to work on, and a virtual lab is deployed in the cloud for you and your colleagues to work in. We call this a Workspace.
Research Cloud should benefit from iRODS functionality in two ways: by querying for data based on metadata when you design your Workspace, and by storing Workspace metadata as provenance, enabling reproducibility of scientific results.
Making data properly available for processing in a Workspace is ongoing work. This project is about studying and implementing real-life cases across different services, and compiling advice and best practices along the way.
Project Mentor: Ander Astudillo
Project Co-mentor: Arthur Newton
Site Co-ordinator: Carlos Teijeiro Barjas
Interns will learn how to handle storage for data processing in scientific research, how to apply these techniques in cloud computing, and how to use Research Data Management principles in practice.
All work will be done in real research environments.
Student Prerequisites (compulsory):
Linux administration, Data processing, Programming in Python
Student Prerequisites (desirable):
Good analytical skills, good communication skills, basic understanding of measuring performance, a feeling for architecture design, experience with exercises of the form “compare and contrast”, basic understanding of input/output and using different storage solutions
- Anything related to storage usage and performance
- SURF Research Cloud: https://servicedesk.surfsara.nl/wiki/pages/viewpage.action?pageId=9798172
- SURF Research Drive: https://wiki.surfnet.nl/display/RDRIVE/Research+Drive
- OwnCloud: https://owncloud.com/
- OpenStack Swift: https://docs.openstack.org/swift/latest/
- CVMFS: https://cernvm.cern.ch/fs/
- dCache: https://www.dcache.org/
- Ceph: https://ceph.io/
- OpenStack: https://www.openstack.org/
- iRODS: https://irods.org/
- Yoda: https://www.uu.nl/en/research/yoda
Week 1: learning about the problem, our initial ideas for tackling it, and the tools and methodologies
Week 2-3: work on a first approach, coarsely implementing a simple first workflow
Week 4-6: evolve the basic concept into a scalable set-up, implementing a couple of (possibly) more complex workflows
Week 7-8: write a guide with best practices for setting up and working in the developed environment. Possibly, and depending on the intern’s interests, provide a first integration with the rest of the platform.
Final Product Description:
The project delivers:
- Collection of reference implementations of pipelines for real-life cases, using different data services
- Catalog of best practices for real-life cases
Depending on time and intern’s interests, we can also include:
- Implementation of some level of integration within SURF Research Cloud
Adapting the Project: Increasing the Difficulty:
Difficulty can be increased by tackling more complex cases, looking at more services, and combining multiple data sources.
Adapting the Project: Decreasing the Difficulty:
Difficulty can be decreased by tackling simpler cases, looking at fewer services, and focusing on single data sources.
We host our own environment for working on this assignment. The interns will get laptops from us with access to our network. They can then develop their ideas.
The technologies we are thinking about revolve around data transfer. They include (but are not limited to): WebDAV, Ceph, CVMFS, S3, Swift, iRODS, and Yoda. These technologies are used by several of the services we can test with; they are rather commonplace, with plenty of information available on the Internet, and readily available to download and install on most common operating systems. Interns are free to choose and explore alternatives once they dig deeper into the matter.
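Since comparing these transfer technologies means measuring them, a small, backend-agnostic harness can help. The sketch below is a minimal example of such a harness: the actual backends (a WebDAV GET, an S3 get_object, an iRODS download, ...) would be plugged in as callables, and the dummy function merely stands in for a real download. Function names here are our own invention, not part of any of the listed tools.

```python
# Sketch of a backend-agnostic throughput measurement harness.
import time
from typing import Callable

def measure_throughput(fetch: Callable[[], bytes], repeats: int = 3) -> float:
    """Return the best observed throughput in bytes/second over `repeats` runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        data = fetch()
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        best = max(best, len(data) / elapsed)
    return best

# Stand-in for a real transfer, e.g. a WebDAV GET or an S3 get_object call.
def dummy_fetch() -> bytes:
    return b"x" * 1_000_000

print(f"{measure_throughput(dummy_fetch):.0f} B/s")
```

Taking the best of several runs reduces the impact of caching and network jitter; interns may of course prefer medians or full distributions when comparing services seriously.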
Familiarity with OpenStack, Ansible, and Terraform can be helpful for understanding the context we operate in.
SURF will arrange any internal rights to data and the platform.