OR/13/042 Findings

Hughes, A G, Harpham, Q K, Riddick, A T, Royse, K R, and Singh, A. 2013. Meta-model: ensuring the widespread access to metadata and data for environmental models - scoping report. British Geological Survey External Report, OR/13/042.

Summary of current activities

Adoption of metadata standards

Respondents to the on-line questionnaire were asked both how they managed metadata for the datasets used in modelling and about the use of metadata for environmental models themselves. In terms of metadata standards for datasets used in modelling, about 30% of respondents (see Figure 5) indicated that they adhered to INSPIRE data specifications. However, the number of people who use the ISO metadata standards (ISO19110, ISO19115[1], and ISO19119) is relatively small, generally 10% or less of those contributing to the survey for each standard. This suggests that these standards (which are adopted by the NERC data centres) are not so commonly used within the wider environmental modelling community. However, there is a significant overlap between INSPIRE and ISO19115[1], and this may mask the use of the ISO standard. A further c.30% of respondents preferred to use a variety of other, more domain specific standards, including the metadata components within WaterML (Part 2), the GEMINI 2.1 standard, the climate and forecast metadata convention, and the MEDIN discovery metadata standard. In some cases, particularly for larger organisations (such as the UK Environment Agency), internal metadata schemes are used.

Figure 5    Metadata Standards Applied to Data.

Considering metadata standards for models there is a general consensus from the questionnaire confirming our initial view that there is a lack of formal standards for model metadata. Some organisations (e.g. the Environmental Protection Agency in the United States) use their own internal standard. Organisations such as CSDMS in the United States have also proposed a system of describing model metadata.

Using metadata to find and locate data and models

The questionnaire results (Figure 6) indicate that whilst a fair proportion of respondents tend to use metadata catalogues to locate and identify data the most used method of finding data to use in modelling is through organisations which people already collaborate with (36% of respondents) or by approaching known suppliers of specific datasets (c.20% of respondents).

Figure 6    Mechanisms Used to Locate and Identify Data.

This clearly implies a lack of take up of on-line metadata catalogues within environmental modelling and a reliance on personal contact or recommendation. There is also a perceived lack of metadata to help people find and locate the datasets they need (Figure 7), with only c.20% of respondents indicating that the level of metadata supplied is sufficient. This suggests a number of gaps in provision which are discussed further in Gaps in metadata provision.

Figure 7    Sufficient metadata supplied with data — is this the case?

When asked whether it was easy to find models produced within other environmental disciplines c.55% of respondents reported that this was a difficult process, with c.28% unsure. However several people highlighted the dangers in using a model developed in another discipline without fully understanding the model. There was also some perception that the atmospheric science community may be better at identifying appropriate models within other disciplines.

Respondents were also asked which metadata attributes are viewed as most important in finding data and models. The results (Figures 8 and 9) indicate that both for datasets and models the most important attributes are descriptive information, the parameters or phenomena involved, and the spatial extent of the dataset or model, followed by quality assessments. For models the type of model (e.g. whether deterministic, probabilistic etc.) and the technical platform are viewed as fairly important metadata attributes (over 30 respondents rate these as of high importance). The programming languages used and the typical runtime are seen as relatively less important for models. Otherwise the trends for finding data and models are relatively similar, although recording the original purpose of the activity is viewed as more important for models than for datasets. An interesting feature of both datasets and models is that items such as reference dates and provenance (the history of the model or dataset) fall lower in relative importance compared to other attributes. These are ‘typical’ metadata attributes which tend to be advocated by data management specialists, and there is a general view expressed that there should be more emphasis on inclusion of metadata items of interest to the end user.

Figure 8    Searching for Data — Relative Importance of Metadata Attributes.
Figure 9    Searching for Models — Relative Importance of Metadata Attributes.

Other metadata attributes which respondents wanted to record included the temporal resolution, including the date and time of measurements within the dataset, and the time period to which measurements relate (e.g. month, week, day, minute etc.). There is also interest in recording any associated and derived datasets. For models, additional metadata attributes desired included an indication of ease of use, to avoid spending an inordinate amount of time configuring an unfamiliar model. Although the programming languages used were ranked fairly low in relative importance overall, several respondents indicated that knowing whether the source code for the model was available was an important factor, particularly for developing compositions of linked models. The minimal data requirements were also regarded as an important element to include in metadata for models.

The role of metadata in making use of data and models

The majority of questionnaire respondents who supply metadata (c.40%) indicate that their primary reason for providing metadata is to assist others in using the dataset or model (Figure 10), and this seems to be a more important driver than providing access.

Figure 10    Primary Reason for Providing Metadata.

For users aiming to make use of data (using descriptive and technical metadata), the parameters represented and units of measurement, together with spatial details and quality assessments, are viewed as the most important metadata attributes (Figure 11), with about 60 respondents ranking these as important. Data and file formats are also considered to be reasonably important. As with the metadata attributes for finding and locating data, reference dates (date created etc.) and provenance are regarded as relatively less important in the ranking. The inclusion of a digital object identifier (DOI) within the metadata schema is regarded as relatively unimportant, and this is an interesting trend considering the increasing interest in using DOIs to uniquely identify datasets within data management generally.

Figure 11    Making use of Data — Relative Importance of Metadata Attributes.

In the case of models (Figure 12) the metadata attributes perceived as most important were:

  • information on the datasets used as inputs
  • details of the parameters represented in the data
  • the assumptions made in building the model, and
  • information on the models used as inputs

Approximately 60 respondents ranked these four attributes as important. A total of 35 respondents ranked information on the details of the software or model code as of high importance. There was also an indication from the free text comments that this is important information to have, particularly for customising code when linking models together. Most of the remaining attributes on the right hand side of Figure 12, including file formats available for input and output and compatible model coupling technologies, fall further down the order of relative importance for models, with only 20% of respondents regarding these as of high importance to record in metadata (though generally a good proportion of respondents do regard these attributes as of at least moderate importance). Again, information on input and output file formats and coupling technologies would seem to be quite important to know about when selecting models for coupling together in a composition, and the relative importance of these attributes would be expected to increase in linked modelling scenarios.

Figure 12    Making use of Models — Relative Importance of Metadata Attributes.

Other attributes recommended for inclusion in metadata for data include temporal descriptors and resolution, and also an improved means of describing units. Although provenance was ranked as relatively less important overall there was some interest in being able to access information on the history of use (e.g. what the model had been used for and whether it had met previous requirements). Availability of documentation on the model (for example possibly a link to documentation) was also mentioned several times as being a required metadata attribute.

Best practice

The questionnaire results highlight a number of current trends in best practice concerning how metadata is used within environmental modelling.

Metadata standards in environmental modelling

The questionnaire results suggest relatively low levels of adoption of the ISO metadata standards (e.g. ISO19115[1]) which are in general use by NERC data centres for discovery metadata. However, domain specific standards tend to be more commonly adopted, for example the metadata elements within Water ML 2.0 Part 1, and the climate and forecast metadata convention applicable to climate modelling.

At the same time when asked about the most important attributes to assist discovering and using data and models, many of the attributes commonly found in for example the ISO19115[1] schema are regarded as important elements of schemes for discovery and descriptive or technical metadata. This suggests that the ISO19115[1] schema for discovery metadata (possibly with appropriate extensions) may provide a good basis for developing a readily adoptable discovery metadata scheme to support environmental modelling.

A number of researchers suggest that there is an over emphasis on spatial metadata attributes in the ISO metadata schemes and that more information, particularly on attributes such as temporal resolution, and the units in which parameters are expressed, should be included.

Best practice issues relating to discovering and accessing data and models

The overall impression is that a metadata schema to support environmental modelling must be easy both to populate and to obtain access to for search and discovery purposes. Such a scheme should easily support the minimum requirements of various environmental disciplines. It is evident that such a schema is not available at the moment but that there are strong drivers within the modelling and IT community to create such a schema (see further discussion in Gaps in metadata provision).

There is clearly a strong interest in users being able to access the data they need in a format which is useful for them, even if they have to convert from one format to another. There is therefore a need for metadata profiles to include file format information.

The need to be able to capture metadata retrospectively from legacy projects has been mentioned by a number of respondents. There is an indication that this may be less of a problem with NERC funded projects over recent years because of NERC’s metadata requirements for submitting data.

Issues relating to model usage

The metadata provided for each dataset should include some documentation on how to use the model. This could, for example, be in the form of a URL link to appropriate documentation. Related to this, the possibility of recording a dataset owner or expert user was also highlighted in the questionnaire; this would provide a means of obtaining advice on the appropriateness of the dataset for various purposes. Technical information on how to configure the model for use is particularly desirable. As one respondent remarked:

“there is little point in being able to access a model and then have to spend several days configuring it to run on your own system.”

Other specific information should include whether the source code for a model is available and how to access this, particularly for users wanting to develop compositions.

A number of respondents were interested in indications of data quality being present in the discovery metadata, and also indications of uncertainty. The quality information is particularly valuable when using data or models of course and should include some estimate of accuracy (for example for data items) and also an indication of any limitations with the model.

Where ‘real’ measured data has been mixed with modelled estimates, for example in an input dataset or model this information should be included in the metadata accompanying the dataset or model.

IPR and policy matters

Although IPR and policy matters relating to environmental metadata were not specifically examined in the questionnaire, a number of comments on this area were offered, and have a bearing on the development of metadata systems. It was a widely held view among respondents that data provided by academics or public bodies should be available without cost; the view was expressed that tax payers have already paid for the capture or production of that data and therefore should not have to pay again. There is also a need to encourage more data to be made available in the public domain. Some public sector organisations noted that, although they were aware of high quality commercial data, they sometimes nevertheless had to use alternative public sector data which were considered inferior, because they could not obtain access to the higher quality commercial data.

Gaps in metadata provision

Discovering data and models

The survey results confirm our initial supposition that there are conspicuously few widely used metadata schemes for models. However many respondents do regard a number of the metadata attributes included in ISO19115[1] for example as being important and useful both for discovering data and models even though they may not currently use this standard formally. Metadata elements already contained within the ISO schemes included the spatial extent and spatial reference system, which were viewed as critical for determining the spatial resolution of models when for example linking regional or global models with lower resolution models. The responses overall indicate that the definition of a minimum set of required metadata attributes which are applicable across discipline boundaries is a key requirement. It may be that this could be based for example on the ISO19115[1] schema.

Metadata attributes which were of particular interest to environmental modellers and which go beyond the level of detail provided in the current ISO schemas are described in Table 1. Additional metadata elements suggested within the questionnaire are further summarised in Figure 13.

Table 1    Additional metadata items to assist discovery of data and models

Attribute: Data/Model Quality Assessments
Information required:
  • For datasets: estimates of accuracy and also measurements of uncertainty
  • For models: limitations and assumptions
  • For models: scientific pedigree (e.g. peer reviewed publications)
  • For models: does the model answer the questions it was designed to address

Attribute: Additional description of temporal parameters
Information required:
  • Temporal resolution and scale (e.g. period of time over which measurements have been made, years, months, weeks etc.)
  • What statistical information (if any) is available over a given time period
  • More information on dates and times when measurements were made; this is considered more useful than dates when the metadata record was submitted
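
As an illustration only, the additional discovery items in Table 1 could be carried as an extension to a core discovery record. The sketch below is a hypothetical Python structure: the class and field names (`QualityInfo`, `TemporalInfo`, `DiscoveryRecord`) and the example values are assumptions made for this example, and are not part of ISO19115 or of any scheme reported by respondents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QualityInfo:
    # For datasets: accuracy estimate and uncertainty measure (free text here).
    accuracy: Optional[str] = None
    uncertainty: Optional[str] = None
    # For models: limitations, assumptions and scientific pedigree.
    limitations: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


@dataclass
class TemporalInfo:
    # Period over which measurements were made (ISO 8601 date strings).
    start: str
    end: str
    # Temporal resolution of the measurements, e.g. "day" or "month".
    resolution: str
    # Statistical summaries available over the period, if any.
    statistics: List[str] = field(default_factory=list)


@dataclass
class DiscoveryRecord:
    title: str
    kind: str  # "dataset" or "model"
    quality: QualityInfo
    temporal: Optional[TemporalInfo] = None
    source_code_available: Optional[bool] = None  # relevant to models only

    def missing_items(self) -> List[str]:
        """List the Table 1 items that have not been filled in."""
        missing = []
        if self.temporal is None:
            missing.append("temporal")
        if self.kind == "dataset" and self.quality.accuracy is None:
            missing.append("accuracy")
        if self.kind == "model" and not self.quality.limitations:
            missing.append("limitations")
        return missing


# Invented example record: temporal details supplied, accuracy not yet stated.
record = DiscoveryRecord(
    title="Example groundwater level series",
    kind="dataset",
    quality=QualityInfo(uncertainty="not assessed"),
    temporal=TemporalInfo(start="2001-01-01", end="2010-12-31", resolution="day"),
)
print(record.missing_items())
```

A check such as `missing_items` is one possible way of prompting providers for the quality and temporal details that respondents found lacking; which items should be mandatory is a design decision that the questionnaire does not settle.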

There was a strong interest in having more metadata about the computing environment and model code when using models (see Supporting the use of data and models) but at the discovery level there was an interest in simply recording whether the model code was available, in order to assist modellers seeking to build linked model compositions.

Figure 13    Some additional metadata elements recommended.

Supporting the use of data and models

As described above there is a lack of established metadata schemes for models. Some discipline specific schemes are available (as described in Adoption of metadata standards), and some organisations use their own internally developed schemes. However, as with discovery metadata, there is a clear recognition within the overall environmental modelling community that a usable scheme supporting dataset and model usage that is not constrained by discipline boundaries is required.

The availability of better descriptions of temporal and quality information within the metadata is seen as particularly important when using data for environmental modelling as well as when discovering data. Improved information on the units used is also desirable (e.g. for molecular ratios it is important to state whether the units are Mol/Mol or g/g, or ‘%’).

Another major area which requires metadata development to facilitate effective model use is details of the computing and modelling environment, including:

- Information about the code used to create the model
- Information on the computing environment used
- Which sub models were used in a linked ensemble
- Documentation on how to use the model (e.g. what assumptions were made, and any limitations on its intended usage)
- Information on the required input and output data
- Information for input and output data should cover all data types (e.g. constants, parameters and variables) and how their variation over time and space is recorded
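
The items listed above could be gathered into a single technical metadata record. The following is a minimal sketch under stated assumptions: the names `DataItem` and `ModelUsageMetadata`, their fields, and the example model and URL are all hypothetical and do not correspond to any existing standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataItem:
    name: str
    kind: str   # "constant", "parameter" or "variable"
    units: str  # stated explicitly, e.g. "mol/mol" rather than just "ratio"
    varies_in_time: bool = False
    varies_in_space: bool = False


@dataclass
class ModelUsageMetadata:
    model_name: str
    code_language: str          # information about the model code
    computing_environment: str  # platform the model was run on
    documentation: str          # link or citation for how-to-use material
    sub_models: List[str] = field(default_factory=list)  # members of a linked ensemble
    inputs: List[DataItem] = field(default_factory=list)
    outputs: List[DataItem] = field(default_factory=list)

    def undeclared_units(self) -> List[str]:
        """Names of input or output items whose units were left blank."""
        return [d.name for d in self.inputs + self.outputs if not d.units.strip()]


# Invented example: the output item has no units declared.
meta = ModelUsageMetadata(
    model_name="ExampleRechargeModel",
    code_language="Fortran",
    computing_environment="Linux workstation",
    documentation="https://example.org/recharge-model/manual",
    inputs=[DataItem("rainfall", "variable", "mm/day", True, True)],
    outputs=[DataItem("recharge", "variable", "", True, True)],
)
print(meta.undeclared_units())
```

Recording units as a required string per item is one simple way to address the request for clearer unit descriptions; a controlled vocabulary of units would be stricter but is beyond this sketch.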

Additional metadata elements desired in a metadata scheme to support environmental modelling are further summarised in Figure 13.

An additional recommendation from the questionnaire was that each dataset is assigned an owner or expert user who can be contacted for further information on the dataset if required. This is actually already a component of NERC’s own data management policy and could be extended to a wider metadata scheme.

Additional requirements arising from environmental modelling workflows

There is a common trend in the development of e-infrastructures for environmental sciences within Europe and beyond for users to rationalise the number of web portals for access to models and data, for example to create portals that federate other existing catalogues. This aspiration is reflected in a number of our questionnaire responses.

In addition to providing better access to metadata to enable other researchers to locate and use them, there is also a perceived need for better systems for model developers and dataset providers to supply the metadata in the first place. These could include for example improved methods for automatically extracting certain metadata, or integrating metadata collection more with the modelling process, to reduce the time/resource impact on the modeller. Some of this information is recorded as part of the modelling workflow, but it often resides in reports and is not systematically made available for model discovery and access, and so mechanisms to make this information more widely available are needed.

The questionnaire results also imply a general lack of availability of software tools to create or access metadata. Tools that are used include ArcGIS, which has its own tools for managing spatial metadata. NERC research centres (particularly BGS and CEH) provide research centre catalogues and contribute to the NERC data catalogue. One solution could be the provision of easily available open source tools to access metadata. There is also an interest in improved tools to readily select data on geographic criteria and in time slices, which can export the selected data ready for use. The NERC Centre for Ecology and Hydrology (CEH) has developed internal systems for this, and further development of such technologies will rely on the availability of suitable metadata, particularly including appropriate temporal information.
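
To make the dependence on spatial and temporal metadata concrete, the sketch below selects catalogue entries by bounding box and time slice. It is an assumption-laden illustration and not a description of the CEH systems: the `Entry` class, the `select` function and the sample entries are invented, and the bounding-box test ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north) in degrees


@dataclass
class Entry:
    title: str
    bbox: BBox
    start: date  # first date covered by the data
    end: date    # last date covered by the data


def overlaps_bbox(a: BBox, b: BBox) -> bool:
    """True if two boxes intersect (no antimeridian handling)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select(entries: List[Entry], bbox: BBox, start: date, end: date) -> List[Entry]:
    """Keep entries whose extent and period both overlap the request."""
    return [
        e for e in entries
        if overlaps_bbox(e.bbox, bbox) and e.start <= end and start <= e.end
    ]


# Invented catalogue entries.
entries = [
    Entry("A", (-3.0, 51.0, -1.0, 53.0), date(2000, 1, 1), date(2005, 12, 31)),
    Entry("B", (10.0, 40.0, 12.0, 42.0), date(2000, 1, 1), date(2005, 12, 31)),
    Entry("C", (-2.0, 52.0, 0.0, 54.0), date(2010, 1, 1), date(2012, 12, 31)),
]
hits = select(entries, (-2.5, 51.5, -1.5, 52.5), date(2004, 1, 1), date(2004, 12, 31))
print([e.title for e in hits])
```

The selection only works if every entry carries both a spatial extent and a measurement period, which is why the temporal items discussed earlier matter for this kind of tool.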

There is also a view expressed that as a long term aim a metadata standard for modelling should contain the information to build, either manually or automatically, a composition or series of linked models, and should contain sufficient information to detect errors in such a composition.
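
One way to picture this long term aim is a check that the outputs declared by one model match the inputs declared by the next. The sketch below is a simplified illustration: `ModelPorts`, `check_composition`, the model names and the units are all hypothetical, and the comparison is a plain string match on units rather than a true unit conversion.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ModelPorts:
    name: str
    inputs: Dict[str, str] = field(default_factory=dict)   # variable -> units
    outputs: Dict[str, str] = field(default_factory=dict)  # variable -> units


Link = Tuple[str, str, str, str]  # (source model, output, target model, input)


def check_composition(models: List[ModelPorts], links: List[Link]) -> List[str]:
    """Return a list of problems found in a proposed chain of linked models."""
    by_name = {m.name: m for m in models}
    errors = []
    for src, out, dst, inp in links:
        if src not in by_name or dst not in by_name:
            errors.append(f"{src}->{dst}: unknown model")
            continue
        out_units = by_name[src].outputs.get(out)
        in_units = by_name[dst].inputs.get(inp)
        if out_units is None:
            errors.append(f"{src}.{out}: no such output")
        elif in_units is None:
            errors.append(f"{dst}.{inp}: no such input")
        elif out_units != in_units:
            errors.append(
                f"{src}.{out} [{out_units}] does not match {dst}.{inp} [{in_units}]"
            )
    return errors


# Invented example: a unit mismatch and a link to an output that does not exist.
models = [
    ModelPorts("recharge", outputs={"recharge": "mm/day"}),
    ModelPorts("groundwater", inputs={"recharge": "m/day", "abstraction": "m3/day"}),
]
links = [
    ("recharge", "recharge", "groundwater", "recharge"),
    ("recharge", "runoff", "groundwater", "abstraction"),
]
for problem in check_composition(models, links):
    print(problem)
```

Even this crude check relies on the metadata stating the variables and units of every input and output, which is consistent with the emphasis respondents placed on units and on input and output descriptions.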

The requirement for various types of semantic support within a metadata system, for example across different disciplines, or between different countries and languages was also highlighted by a number of people.

Breaking down barriers to more integrated cross discipline modelling

Over 50% of respondents reported that it was not easy to locate models produced in other environmental disciplines, with a further 28% being unsure how easy this was, demonstrating a clear need for better systems to locate models. Lack of searchable catalogues and common ways to describe models were also viewed as important barriers to making models more widely available (Figure 14).

Figure 14    Barriers to the wider availability of models.

Summary of essential gaps to be addressed

The key gaps in provision identified are summarised in Figure 13 and include:

  • Metadata elements describing the temporal information available in datasets and models both for discovery and use of data and models
  • More information needs to be provided on data and (particularly) on model quality issues, to assist users in selecting models which are suitable for their purposes
  • With regard to metadata for researchers using models there is a definite need for more ‘technical metadata’ information. A number of required elements have been suggested in the questionnaire and clearly to some extent reflect individual preference. But the key emphasis is on information on how to configure and use the model. A requirement for the metadata simply to contain a link to existing information about the model, whether in a user manual or research paper etc. was a fairly common requirement
  • In terms of being able to find out what other models are available there is a clear lack of suitable metadata catalogues (presumably because the metadata itself is not available)
  • Users are not aware of suitable software tools to capture metadata within their domains, and there is an indication that such tools need to be developed. Clearly, for a metadata scheme supporting environmental modelling to work, users need to be able to enter and supply their metadata easily

References

  1. ISO19115, 2003. Geographic information - Metadata. International Standards Organisation, ref: ISO19115:2003(E).