Shared But Hard to Use: Helping Climate Researchers Share, Discover, and Use Data

1. The Problem
Sharing data among researchers is often an afterthought. In our case, the data is already shared publicly through a repository called DSpace, an open-access repository for scholarly data published in various scientific fields. Our focus here is on climate data.
Through DSpace, climate researchers and institutions can easily share their datasets. However, a shared file is effectively a “black box” that has to be opened before anyone knows what is inside. Climate simulation models generate vast amounts of data, stored in the standard NetCDF format, and a typical NetCDF file contains many dimensions and variables. With so many files, researchers can waste a lot of time trying to find the appropriate file (if any).
The goal of our project is to produce intelligible metadata about the NetCDF-based data. The metadata is to be stored and indexed in a queryable format, so that search and query tasks can be conducted effectively. In this manner, climate researchers can easily discover and use NetCDF data.

2. Data Source
As already mentioned, the DSpace repository is the main data source of NetCDF files. DSpace is a digital service that collects, preserves, and distributes digital material. Our particular focus is on climate datasets provided by Dr Eleni Katragkou from the Department of Meteorology and Climatology, Aristotle University of Thessaloniki. The datasets are available through the URL below:
https://goo.gl/3pkW9n

3. Background: NetCDF Files (Rew, & Davis, 1990) and (Unidata, 2017)
As the NetCDF format is usually used within particular scientific fields such as life sciences and climatology, it is expected that the reader may not be familiar with it. This section serves as a brief background to the NetCDF format, and its underlying structure.
NetCDF stands for “Network Common Data Form”. It emerged as an extension to NASA’s Common Data Format (CDF). NetCDF is defined as a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It was developed and is maintained at Unidata.
Unidata is a diverse community of education and research institutions with the common goal of sharing geoscience data and the tools to access and visualise that data. Unidata aims to provide data, software tools, and support to enhance Earth-system education and research. Funded primarily by the National Science Foundation (NSF), Unidata is one of the University Corporation for Atmospheric Research (UCAR)’s Community Programs (UCP).
The NetCDF data abstraction models a scientific data set as a collection of named multidimensional variables along with their coordinate systems and some of their named auxiliary properties. The NetCDF abstraction is illustrated in Figure 1 with an example of the kinds of data that may be contained in a NetCDF file. Each NetCDF file has three components: dimensions, variables, and attributes.
A variable has a name, a data type, and a shape described by a list of dimensions. Scalar variables have an empty list of dimensions. The variables in the example in Figure 1 represent a 2D array of surface temperatures on a latitude/longitude grid and a 3D array of relative humidities defined on the same grid, but with a dimension representing atmospheric level. Any NetCDF variable may also have an associated list of attributes to represent information about the variable.

Figure 1. An example of the NetCDF abstraction.
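
To make the abstraction concrete, here is a minimal sketch in plain Python of the three components described above. It is not the NetCDF library API: the class names, the variable names (temp, rh), the units, and the dimension sizes are all assumptions chosen to mirror the example in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Variable:
    """A named variable: data type, shape (as dimension names), attributes."""
    name: str
    dtype: str
    dimensions: Tuple[str, ...]  # empty tuple for a scalar variable
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """A toy stand-in for a NetCDF file: dimensions, variables, attributes."""
    dimensions: Dict[str, int]  # dimension name -> length
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, str] = field(default_factory=dict)  # global attributes

    def add_variable(self, var: Variable) -> None:
        # A variable may only use dimensions that the dataset defines.
        for dim in var.dimensions:
            if dim not in self.dimensions:
                raise ValueError(f"unknown dimension: {dim}")
        self.variables[var.name] = var

    def shape(self, name: str) -> Tuple[int, ...]:
        # Resolve a variable's dimension names to their lengths.
        return tuple(self.dimensions[d] for d in self.variables[name].dimensions)


# Hypothetical sizes and names, for illustration only.
ds = Dataset(dimensions={"lat": 5, "lon": 10, "level": 4})
ds.add_variable(Variable("temp", "float", ("lat", "lon"), {"units": "K"}))
ds.add_variable(Variable("rh", "float", ("level", "lat", "lon"), {"units": "%"}))
ds.add_variable(Variable("run_id", "int", ()))  # scalar: empty dimension list
```

Here temp is a 2D variable on the latitude/longitude grid, rh adds the level dimension to the same grid, and run_id shows a scalar variable with an empty dimension list.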

4. Project Objectives
Our aim is to make it easier to search and query the NetCDF files uploaded to the DSpace repository. We defined the following objectives:

  1. Define the relevant metadata structure to be extracted from NetCDF files.
  2. Extract the metadata from the NetCDF files.
  3. Store and index the extracted metadata.
  4. Support search and query functionality.
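
The four objectives can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the record layout, the function extract_metadata, the class MetadataIndex, and the file names are hypothetical, the header is a plain dict standing in for what a NetCDF library would return, and an in-memory inverted index stands in for whatever storage backend is eventually chosen.

```python
from collections import defaultdict


def extract_metadata(file_id, header):
    """Objectives 1 and 2: flatten a parsed header into one metadata record."""
    return {
        "file_id": file_id,
        "dimensions": sorted(header["dimensions"]),
        "variables": sorted(header["variables"]),
        "units": {name: var.get("units", "")
                  for name, var in header["variables"].items()},
    }


class MetadataIndex:
    """Objectives 3 and 4: store records and answer keyword queries."""

    def __init__(self):
        self.records = {}                # file_id -> metadata record
        self.by_term = defaultdict(set)  # term -> set of file_ids

    def add(self, record):
        self.records[record["file_id"]] = record
        # Index every dimension and variable name, case-insensitively.
        for term in record["dimensions"] + record["variables"]:
            self.by_term[term.lower()].add(record["file_id"])

    def search(self, *terms):
        """Return the sorted ids of files that match all given terms."""
        if not terms:
            return []
        matches = [self.by_term.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*matches))


# Hypothetical headers, as a NetCDF reader might report them.
headers = {
    "run_a.nc": {"dimensions": {"lat": 5, "lon": 10},
                 "variables": {"temp": {"units": "K"}}},
    "run_b.nc": {"dimensions": {"lat": 5, "lon": 10, "level": 4},
                 "variables": {"temp": {"units": "K"}, "rh": {"units": "%"}}},
}

index = MetadataIndex()
for file_id, header in headers.items():
    index.add(extract_metadata(file_id, header))

both = index.search("temp")                 # files containing a temp variable
with_levels = index.search("temp", "level")  # temp on a grid with levels
```

With such an index, a researcher could ask which files contain a given variable on a given grid without opening any of them, which is the "black box" problem the project sets out to address.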

The project is being developed in collaboration between GrNet and the Aristotle University of Thessaloniki. GrNet provides us with access to the ARIS supercomputing facility in Greece and also manages the DSpace repository.

References
Rew, R., & Davis, G. (1990). NetCDF: An Interface for Scientific Data Access. IEEE Computer Graphics and Applications, 10(4), 76-82.
Unidata. (2017). Retrieved on 10/08/2017 from http://www.unidata.ucar.edu/about/tour/index.html
