OR/13/042 Appendix 2 - Summary of current approaches

Hughes, A G, Harpham, Q K, Riddick, A T, Royse, K R, and Singh, A. 2013. Meta-model: ensuring the widespread access to metadata and data for environmental models - scoping report. British Geological Survey External Report, OR/13/042.

Metadata tools

NASA's Earth Observing System (EOS) Clearinghouse (ECHO) is a metadata registry and order broker that allows query of, and access to, data from a large number of repositories. These are primarily NASA repositories, though any repository can request to have its metadata included in the ECHO database. ECHO covers data from a variety of science disciplines.

There are also several tools to assist the capture, cataloguing and retrieval of metadata in XML format, including the open source data management system eXist; the metadata authoring tool MATT; Mercury, a web based system to retrieve metadata and associated datasets; and the open source metadata catalogue METACAT. The latter system is in use throughout the world to manage environmental data.

Another widely used geospatial metadata catalogue system is GeoNetwork OpenSource, an open source geospatial data catalogue service host, metadata creation and management system, and basic web mapping platform. Also widely used is the THREDDS Data Server (TDS), a web server that provides metadata and data access for scientific datasets using OPeNDAP, OGC WMS and WCS, HTTP, and other remote data access protocols.

Repository technologies

Fedora (Flexible Extensible Digital Object Repository Architecture) is a modular architecture built on the principle that interoperability and extensibility are best achieved by integrating data, interfaces, and mechanisms (i.e., executable programs) as clearly defined modules. It is often used in the digital library community.

EPrints is a free and open source software package for building open access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It shares many of the features commonly seen in Document Management systems, but is primarily used for institutional repositories and scientific journals. EPrints is a Web and command-line application based on the LAMP architecture (but is written in Perl rather than PHP). It has been successfully run under Linux, Solaris and Mac OS X. A version for Microsoft Windows was released in May 2010.
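To illustrate what compliance with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) means in practice, the following is a minimal sketch of how a harvester might construct a request. The endpoint URL is hypothetical and the helper function is introduced here for illustration only; the verb names and the `metadataPrefix` and `from` arguments are those defined by the protocol.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/cgi/oai2"

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in arguments.items():
        # 'from' is a Python keyword, so callers pass it as 'from_'.
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# Ask for all Dublin Core records added or changed since 1 January 2013.
url = oai_request_url(BASE_URL, "ListRecords",
                      metadataPrefix="oai_dc", from_="2013-01-01")
print(url)
```

A harvester would fetch this URL over HTTP and parse the XML response; that step is omitted here because it depends on the repository being queried.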

DSpace is an open source tool aimed at organisations with minimal resources. It has a straightforward three-layer architecture, comprising storage, business, and application layers, each with a documented API to allow for future customization and enhancement. The storage layer is implemented using the file system, as managed by PostgreSQL database tables.

Of relevance to the earth science community is the National Geospatial Digital Archive (NGDA), which aims to create a new national federated network for archiving geospatial imagery and data, as well as to collect and archive important digital geospatial data and images.

Storage technologies

The JASMIN & CEMS cluster includes 4.6 Petabytes of usable fast access Panasas® parallel file storage (www.stfc.ac.uk/eScience/news+and+events/38663.aspx). The important aspects of the data storage design are the 1 Tb/s aggregate bandwidth from data to processors, which supports the processing of very large data volumes, and a lower total cost of ownership than competing solutions, because operators need to intervene manually less often to manage and expand the system. The 1133 data blades constitute the second largest configuration that Panasas® has provided to a single installation.
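To give a sense of scale for these figures, the short calculation below estimates how long it would take to read the entire store once at the quoted aggregate bandwidth. It assumes that "Tb/s" means terabits per second, that decimal units apply (1 Petabyte = 10^15 bytes), and that the full bandwidth is sustained throughout; none of these is stated in the source, so the result is indicative only.

```python
def full_read_time_hours(capacity_petabytes, bandwidth_terabits_per_s):
    """Hours needed to read the whole store once at the given bandwidth."""
    # Assumption: decimal units, 1 PB = 1e15 bytes and 1 Tb = 1e12 bits.
    capacity_bits = capacity_petabytes * 1e15 * 8
    bandwidth_bits_per_s = bandwidth_terabits_per_s * 1e12
    return capacity_bits / bandwidth_bits_per_s / 3600


hours = full_read_time_hours(4.6, 1.0)
print(f"{hours:.1f} hours")
```

Under these assumptions a complete pass over 4.6 Petabytes takes roughly ten hours. If "Tb/s" were instead read as terabytes per second, the figure would be eight times smaller.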

Hierarchical storage management (HSM) is a data storage technique which automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the data on slower devices and then copy data to faster disk drives when needed. Details of an STFC based example are available at www.stfc.ac.uk/e-Science/services/atlas-petabyte-storage/22459.aspx.
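The behaviour described above can be sketched as a simple two-tier model. This is an illustrative toy rather than a description of any particular HSM product: the class, its method names and the 30-day idle threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Toy HSM: everything lives on the slow tier, hot files are cached on fast."""
    max_idle_days: int
    fast: dict = field(default_factory=dict)  # file name -> day of last access
    slow: set = field(default_factory=set)    # every archived file name

    def archive(self, name, today):
        # New data is written to both tiers; the slow tier is the bulk store.
        self.slow.add(name)
        self.fast[name] = today

    def read(self, name, today):
        if name not in self.slow:
            raise KeyError(name)
        recalled = name not in self.fast
        self.fast[name] = today  # copy to (or refresh on) the fast tier
        return "recalled from slow tier" if recalled else "served from fast tier"

    def migrate(self, today):
        # Release fast-tier copies that have not been touched recently.
        stale = [n for n, last in self.fast.items()
                 if today - last > self.max_idle_days]
        for n in stale:
            del self.fast[n]
        return stale


hsm = TieredStore(max_idle_days=30)
hsm.archive("run_001.nc", today=0)
hsm.archive("run_002.nc", today=0)
hsm.read("run_002.nc", today=25)           # keeps run_002 warm
evicted = hsm.migrate(today=40)            # run_001 has been idle for 40 days
status = hsm.read("run_001.nc", today=41)  # triggers a recall
print(evicted, status)
```

The point of the design is that users always address a file by the same name; only the latency of the read reveals which tier it came from.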

Data preservation technologies — summary and main trends

Many of the software tools which are directly applicable to digital preservation are relevant to a wide variety of science (and sometimes also non-science) disciplines. Few are specific to the earth sciences, but a number of these technologies are concerned with the basic elements of files and their representation in computer systems. Hence they should be applicable to the types of file format commonly found in earth science archives. For example, the EAST and DFDL data description languages could potentially describe a wide variety of data formats. Given the aim of increasing interoperability between different earth science disciplines, data dictionary languages (e.g. the Data Entity Dictionary Specification Language) and semantic languages such as OWL and SKOS will be important in documenting data dictionaries and establishing the new ontologies needed to ensure this interoperability.

The availability of emulators both for software and operating systems will be important. The Dioscuri emulator was designed by the digital preservation community and, being Java based, can be ported to a number of platforms; it therefore seems a particularly useful tool. Important metadata tools (some of which are also referenced in the user surveys) include the open source metadata catalogue METACAT, which is widely used to manage environmental data, and the GeoNetwork metadata catalogue system, which is widely used within the earth science community.

In terms of software archiving, a number of the available tools are also those commonly used by software developers during the development phase (e.g. SourceForge and Subversion), since these provide mechanisms for documenting the code and keeping it under version control. Open source development communities (e.g. Tigris.org) also fulfil a useful function in digital preservation in that they provide a means for users to track and be informed about changes to their software, and often methods of upgrading open source applications as new versions of the underlying software become available.

Considering the technologies available for storage and archive repository development, FEDORA (Flexible Extensible Digital Object Repository Architecture) was mentioned in the survey responses, and is therefore clearly used by the earth science community to some extent. Products such as EPrints and DSpace are probably more applicable to the digital library and academic publishing worlds, but may have some relevance to SCIDIP-ES. Repository planning tools such as the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) did not come up in any of the user survey responses. However, given the importance of auditing repositories and establishing the criteria for including certain data (and the risks in not doing so), such tools would seem to have a potential application in the earth science domain.

Data discovery and access

Portals appear to fall into two main types: those which provide a federated search across multiple archives and those which provide a dedicated search of a specific archive system. Frequently the database behind a specific portal can be accessed by federated search systems using OGC compliant standards and metadata. There is a strong indication that the facilities for federated searches across multiple archives are generally well developed.

The relevant OGC compliant standards include the OGC Catalog Services (CSW) specification, the Web Map Service (WMS) Interface Implementation Specification, the Web Feature Service (WFS) Implementation Specification, and the Web Coverage Service (WCS). These standards have been widely implemented to provide access to potentially very detailed and rich sets of geospatial information.
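As an indication of how a client uses one of these standards, the sketch below builds a WMS 1.3.0 GetMap request using the key-value parameters defined by that specification. The server address, layer name and bounding box are hypothetical, and the helper function is introduced here for illustration only. The CRS:84 reference system is used so that the bounding box can be given in longitude, latitude order.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build a WMS 1.3.0 GetMap URL for a single layer rendered as PNG."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",        # empty value requests the server's default style
        "CRS": "CRS:84",     # WGS 84 with longitude, latitude axis order
        "BBOX": ",".join(str(v) for v in bbox),  # min_lon, min_lat, max_lon, max_lat
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer; the box roughly covers Great Britain.
getmap_url = wms_getmap_url("https://maps.example.org/wms", "geology",
                            (-8.0, 49.0, 2.0, 61.0), 500, 600)
print(getmap_url)
```

Because every compliant server accepts the same parameters, a federated portal can issue this request to many archives without knowing how each one stores its data.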

Of particular relevance to this project are the INSPIRE Geoportal and the GEO Portal. The INSPIRE Geoportal (www.inspire-geoportal.ec.europa.eu/discovery) is the central discovery portal for the European geospatial data infrastructure (EU-GDI), providing a front end to an OGC compliant data catalogue. The GEO Portal (www.geoportal.org/web/guest/geo_home) is the central portal and clearinghouse for the Global Earth Observation System of Systems (GEO-GEOSS), providing access to geospatial and earth observation (EO) data. The GEO Portal allows the user to discover, browse, edit, create and save geospatial information from GEO members around the globe. This data discovery portal accesses the OGC compliant catalogues, viewing and download services of various organizations worldwide through the use of standardized OGC-compliant protocols.

Another important project concerned with data access is GENESI-DEC (www.genesi-dec.eu). The project has established open data access services allowing European and worldwide Digital Earth Communities to seamlessly access, produce and share data, information, products and knowledge. This will create a multi-dimensional, multi-temporal, and multi-layer information facility of huge value in addressing global challenges such as biodiversity, climate change, pollution and economic development. GENESI-DEC evolves and enlarges the platform developed by the predecessor GENESI-DR project by federating to and interoperating with existing infrastructures.

GENESI-DEC involves key partners of ESFRI projects and collaborates with key participants of Digital Earth and Earth Science initiatives, including the International Society of Digital Earth and GEO-GEOSS to ensure the efficient use of already existing and planned developments.

The INSPIRE, GEO-GEOSS, and GENESI-DEC portals are front ends to large complex systems which allow data producers to upload data and metadata to the portal, and users to retrieve those data.

The NERC Data Grid (www.ndg.badc.rl.ac.uk/) provides a gateway to find data and explore what is known about the datasets. The data themselves remain located with the data providers, and this provides a multi-archive search for discovering data. In a similar manner the Earth System Grid (ESGF — www.earthsystemgrid.org) provides a gateway to scientific collections which may be hosted at sites around the world.

In some cases, in addition to the functionality to discover and access data, tools are also made available within the data discovery/access portal to enable visualisation of data, although it appears that this integration of visualisation and analysis tools is not currently a common feature.

The Heterogeneous Missions Accessibility (HMA) project aims to establish harmonised access to heterogeneous Earth Observation data from the ground segments of multiple missions, including national and ESA Sentinel missions. The project partners who already have a direct contractual relationship with ESA in the framework of HMA are: ASI (Italian Space Agency), CNES (French Space Agency), CSA (Canadian Space Agency), DLR (German Space Agency), and EUSC (European Union Satellite Centre).

Other web portals examined are aimed at the discovery and access of earth observation data, and in many cases it is clear that the domains which these portals support are quite diverse. For example, the Global Land Cover Facility (www.landcover.org) is commonly accessed by users from a diverse range of communities, including science (geography, earth science, ecology, climatology, conservation, education), environmental policy (global warming, sustainable development, risk management) and resource management (biodiversity assessment, forestry, protected area management). In other cases, e.g. the SPOT catalogue and maps store (www.catalog.spotimage.com) and the ‘GMES Land Monitoring Portal’ (www.land.eu/portal), the portal provides access to a specific dataset or range of datasets.

As would be expected, data is generally provided in formats (e.g. GIS files or images) which are appropriate to the predominant user community. There is not a great deal of evidence of users from one discipline being able to access and use relevant data from disparate domains. In fact the form based search facilities frequently provided allow searching on the basis of terms such as location, sensor, data type and time, some of which require a knowledge of earth observation data, and so may not encourage users of other disciplines to make use of it. This is clearly one area where the development of tools and services in the SCIDIP-ES project can contribute to making data more interoperable between disciplines.

Technologies and frameworks for processing data

These include the Web Processing Service (WPS) interface standard, which provides rules for standardising inputs and outputs (requests and responses) for geospatial processing services. Through WPS a generic user gains access to geospatial data processing tools provided by third parties. WPS can be seen as a way to perform standardized geospatial computations in a distributed environment. In the context of long-term data preservation (LTDP) it can be used as a tool to preserve data processing algorithms and procedures in the geospatial domain, as long as adequate data preservation policies are implemented on the infrastructure providing the service itself.
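The standardised request side of WPS can be illustrated with a minimal sketch of a WPS 1.0.0 Execute request in key-value form, in which the process inputs are packed into a single `DataInputs` parameter as semicolon-separated `name=value` pairs. The server address, the `buffer` process and its input names are hypothetical, and servers differ in which encodings they accept, so this should be read as an indication of the request shape rather than a tested client.

```python
from urllib.parse import urlencode


def wps_execute_url(base_url, process_id, inputs):
    """Build a WPS 1.0.0 key-value Execute URL for a named process."""
    # Inputs are joined as name=value pairs separated by semicolons.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "DataInputs": data_inputs,
    }
    # urlencode percent-encodes the '=' and ';' inside DataInputs.
    return base_url + "?" + urlencode(params)


# Hypothetical server, process and inputs.
execute_url = wps_execute_url("https://wps.example.org/wps", "buffer",
                              {"distance": 10, "units": "m"})
print(execute_url)
```

The same pattern applies to any process the server advertises, which is what allows third-party processing tools to be invoked without custom client code.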

The OpenGIS® Web Coverage Processing Service (WCPS) Interface Standard (www.opengeospatial.org/standards/wcps) defines a protocol-independent language for the extraction, processing, and analysis of multi-dimensional gridded coverages representing sensor, image, or statistics data. Services implementing this language provide access to original or derived sets of geospatial coverage information, in forms that are useful for client-side rendering, input into scientific models, and other client applications.

Open virtualisation format (OVF) represents a standard vendor independent representation of virtual machines which, in turn, are a common component of data preservation strategies. A virtual machine containing all the processing chain components of a given dataset can be used to reproduce and analyse the procedures and algorithms used in data processing.

The Earth System Modelling Framework (ESMF) defines an architecture for composing complex, coupled modelling systems and includes data structures and utilities for developing individual models. ESMF is emerging as a standard among modellers in the earth science domain. The standards and software tools defined by ESMF might be useful to support LTDP of model related data. Moreover, its components can be used as standardized data processing tools. ESMF is supported mainly by US organizations, universities and research centres.

The Open Modelling Interface (OpenMI) was developed within the EU funded projects HarmonIT and OpenMI-Life. OpenMI evolved to become a generic solution for building software components that can be applied to linking any combination of models, databases and analytical/visualisation tools. As an emerging standard in the domain of earth science, it will play a major role in the preservation of data processing capabilities. OpenMI has a similar role to the Earth System Modelling Framework (ESMF) described above, although a key feature is that it is able to pass variables between models at run-time. A framework of open source components is used to ‘wrap’ components of models, and to this extent OpenMI may represent a useful means of preserving linked environmental models.
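The idea of wrapped models exchanging variables at run-time can be sketched as a chain of components in which a downstream model requests the values it needs from an upstream model at each time step. The sketch below is loosely inspired by that request-reply pattern; the class, its method names and the two toy models are assumptions made for illustration and are not the OpenMI interfaces themselves.

```python
class Component:
    """A model wrapped behind a uniform 'get values at time t' interface."""

    def __init__(self, name, compute, upstream=None):
        self.name = name
        self.compute = compute    # function(time, upstream_value) -> value
        self.upstream = upstream  # another Component, or None

    def get_values(self, time):
        # Pull the input from the linked upstream model before computing.
        upstream_value = None
        if self.upstream is not None:
            upstream_value = self.upstream.get_values(time)
        return self.compute(time, upstream_value)


# Two toy models: rainfall grows linearly with time, and runoff is half of it.
rainfall = Component("rainfall", lambda t, _: 2.0 * t)
runoff = Component("runoff", lambda t, rain: 0.5 * rain, upstream=rainfall)

# Requesting runoff at each step triggers the rainfall model on demand.
series = [runoff.get_values(t) for t in range(4)]
print(series)  # [0.0, 1.0, 2.0, 3.0]
```

Neither model needs to know how the other is implemented; each only sees the wrapper, which is why the approach lends itself to preserving linked models as a set of interchangeable components.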