OR/13/042 Methodology

Hughes, A G, Harpham, Q K, Riddick, A T, Royse, K R, and Singh, A. 2013. Meta-model: ensuring the widespread access to metadata and data for environmental models - scoping report. British Geological Survey External Report, OR/13/042.

On-line questionnaire

In order to capture the views of a wide spectrum of stakeholders on how they are currently managing metadata for integrated modelling, and what gaps exist, an on-line survey was constructed using the ‘Survey Monkey’ tool. The questionnaire was structured to understand both how users approach metadata for the datasets used in modelling, and to explore issues relating to metadata for the models themselves. Accordingly the survey was circulated to over 320 stakeholders in universities, commercial organisations and other research organisations, in addition to the NERC data centres.

A total of 108 responses were collected over a four-week period. The majority of respondents held senior positions in their organisations, giving weight to the findings of the study. To encourage good take-up of the questionnaire the number of ‘mandatory’ questions was kept to a minimum, so respondents were free to ‘skip’ questions as appropriate. Nevertheless most respondents completed most of the questions, providing a useful set of data on which to base conclusions. As a further aid to maximising the level of response, most questions used a multiple-choice format in which respondents simply selected an option on screen, but scope was also provided for ‘free text’ responses (for example, additional comments or opinions on gaps in provision), and very useful additional information was captured in this way.

The survey was sent out to the extensive contact networks of BGS and HR Wallingford, mainly within the UK but also further afield, and links to the survey were also enabled from relevant websites to maximise take-up. The graph in Figure 2 shows that a number of responses were also received from other parts of Europe, as well as the United States and Australia.

Figure 2    Number of Respondents by Country.

In order to better understand differences in metadata requirements between different environmental disciplines, respondents were also asked to indicate their primary science discipline, selected from a predefined list. Overall the results indicate that a variety of disciplines are represented, including climate change, earth system modelling, groundwater and land use modelling (Figure 3). An option was also provided to record disciplines not listed; these responses also show a very wide variation, including a number of individuals involved in IT and systems development to support environmental modelling, CO2 storage and reservoir modelling, and a small number of people involved in biodiversity and catastrophe modelling.

Figure 3    Scientific Disciplines Represented.

Respondents were also asked to indicate their organisational roles, e.g. data supplier, end user of models, model developer (i.e. involved in creating model code and systems to support modelling) and modeller (i.e. actively involved in the process of integrated environmental modelling). The respondents included a small proportion of data suppliers, with the remaining c.90% split fairly equally between end users, model developers and modellers (see Figure 4).

Figure 4    Respondent Roles.

Visits and phone meetings

Environment Agency

A visit to the Environment Agency HQ in Bristol was undertaken on 16 July. The meeting was held between BGS staff (Stephanie Bricker, Geraldine Wildman, Andrew Kingdon and Andrew Hughes) and the Environment Agency staff responsible for models (Helen James), data (Brian Wilson), data licensing (Paul Hyatt) and data sharing (Chris Jarvis). The management of data, and the drivers for and use of metadata within the Environment Agency, were explained.

The main issues presented by the Environment Agency staff were:

  • Legislative drivers are very important — both UK Government and European, e.g. Water Framework Directive and INSPIRE
  • Freedom of Information (FOI) enquiries: the Agency receives a huge number of these (some 47 000 in all) at considerable cost in staff time, and needs to reduce them
  • A significant number of datasets are held (1500 in all), with data flow mapping undertaken on them all

In terms of metadata and data use within the Environment Agency:

  • A small proportion of Environment Agency metadata is made available via data.gov.uk; the vast majority is held in an internal repository.
  • Linked data: Bathing Water Quality data are collected, analysed and checked before being made available as linked data (i.e. a method of publishing data in a defined structure so that it can be interlinked and used to provide extra services). These data then serve all internal and external requirements.
  • All data are managed by a service provider, with spatial data held in Oracle and distributed as 50 copies to the Environment Agency regions
  • Standards are widely used, in line with the Defra open data strategy, with metadata managed using ESRI spatial data tools. Ways of dealing with both discovery and technical metadata are currently being investigated
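The linked data approach mentioned above can be illustrated with a minimal sketch: data are expressed as subject-predicate-object triples whose identifiers can be shared between records, so that datasets interlink. The URIs and property names below are invented for illustration only and do not represent the Environment Agency's actual vocabulary or identifiers.

```python
# Hypothetical triples describing a bathing-water site and its latest sample.
# Interlinking happens through shared URIs (the sample URI appears as an
# object in one triple and as a subject in another).
triples = [
    ("http://example.org/bathing-water/site1", "rdfs:label", '"Example site"'),
    ("http://example.org/bathing-water/site1", "ex:latestSample", "http://example.org/sample/s42"),
    ("http://example.org/sample/s42", "ex:compliance", '"pass"'),
]

def to_ntriples(triples):
    """Serialise triples in a simplified N-Triples style: one statement per line."""
    lines = []
    for s, p, o in triples:
        o_term = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {o_term} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

Because every statement carries its own identifiers, a consumer can follow the sample URI from the site record to the compliance result without any prior knowledge of the publisher's database layout, which is what lets one published dataset serve both internal and external requirements.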

NERC data centres

The NERC website defines the role of its Data Centres as ‘It is essential that data generated through NERC supported activities are properly managed to ensure their long-term availability. Our network of data centres provide support and guidance in data management to those funded by NERC, are responsible for the long-term curation of data and provide access to NERC's data holdings. The NERC Data Policy details our commitment to support the long-term management of data and also outlines the roles and responsibilities of all those involved in the collection and management of data.’

There are seven NERC data centres, relating to the following subject areas:

  1. Atmospheric science
  2. Earth sciences
  3. Earth observation
  4. Marine science
  5. Polar science
  6. Science-based archaeology
  7. Terrestrial & freshwater science, hydrology and bioinformatics

Representatives of the NERC data centres contributed to the project, either via the Survey Monkey questionnaire or by direct contact. Of particular interest for this project is the NERC-funded ‘Model Core’ project, which aims to extend the storage of data to models themselves. The project, reporting to the NERC Science Information Strategy (SIS), is currently investigating the feasibility of a ‘gold standard’ which will:

  • Build on the current NERC policy on archiving simulations (BADC Model Data Policy)
  • Ensure that rich metadata are available for the model (both discovery and technical)
  • Ensure that input and output files are in standard formats and have associated Digital Object Identifiers (DOIs)
  • Define how to store models, i.e. using a model code repository such as SourceForge, GitHub, etc.
  • Provide a way of recording where the models are stored (Register of Code Repositories or RCR)
  • Have adequate documentation with which to understand all the elements of the modelling process

Follow-up to the Survey Monkey questionnaire

Interviews were conducted with:

Dr Deborah Hemming (Met Office Hadley Centre) and Professor Andrew Wade (University of Reading).

The purpose of the interviews was to clarify some of the responses made to the questionnaire and potentially gather further useful information from selected individuals who were clearly engaged with the topic.

Dr Hemming mainly works with global and regional scale climate models, whilst Professor Wade specialises in biogeochemical and fluid flow modelling. Despite the difference in disciplines, both researchers were interested in the representation of temporal and spatial information in metadata, further highlighting the interest in these areas reflected in the questionnaire results. In both cases an interest was expressed in temporal resolution, so that a modeller has sufficient metadata to, for example, select data containing the minimum or maximum temperature parameter for a given period (e.g. month, week or day). It seems that some of this capability may already be incorporated within metadata for climate models, providing a basis for developing a scheme suitable for other environmental disciplines.

The other common theme concerned information on spatial extent. It was clear from the interviews that users need information recorded which allows them to understand the spatial resolution of a dataset before they proceed to download it. This is a particular issue when linking large-scale climate models with data at a more local, geological scale (for example, soil moisture datasets), and is clearly viewed as a key issue to address in developing a metadata scheme.
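The screening capability the interviewees described could work along the following lines: a catalogue records each dataset's temporal resolution and spatial bounding box, and a modeller filters on both before downloading anything. This is an illustrative sketch only, not an existing catalogue API; the record fields and dataset names are invented.

```python
# Hypothetical catalogue records with temporal resolution and a
# (west, south, east, north) bounding box in degrees.
records = [
    {"name": "global_tmax", "temporal_resolution": "month", "bbox": (-180.0, -90.0, 180.0, 90.0)},
    {"name": "soil_moisture_uk", "temporal_resolution": "day", "bbox": (-8.0, 49.5, 2.0, 61.0)},
]

def covers(bbox, area):
    """True if the dataset bounding box fully contains the query area."""
    w, s, e, n = bbox
    qw, qs, qe, qn = area
    return w <= qw and s <= qs and e >= qe and n >= qn

def select(records, resolution, area):
    """Names of datasets matching the temporal resolution and covering the area."""
    return [r["name"] for r in records
            if r["temporal_resolution"] == resolution and covers(r["bbox"], area)]

# Daily data covering a small, hypothetical UK catchment extent:
print(select(records, "day", (-2.5, 51.0, -1.5, 52.0)))  # ['soil_moisture_uk']
```

Even this simple pre-download check addresses the mismatch the interviewees raised: a coarse monthly global product and a fine daily national product are distinguished by metadata alone, before any data transfer takes place.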

Both interviews also highlighted the additional ‘technical metadata’ that others had outlined in the survey, including information on the model code and on how to actually run the model (including time steps and assumptions made).