Conference Paper

Released

panMetaDocs: a tool for collecting and managing the long tail of 'small science data'

Klump, J., Ulbricht, D. (2011): panMetaDocs: a tool for collecting and managing the long tail of 'small science data', AGU 2011 Fall Meeting (San Francisco, USA 2011).



http://gfzpublic.gfz-potsdam.de/pubman/item/escidoc:245363
Resources

PanMetaDocs-AGU.pdf
(Publisher version), 444KB

Authors
http://gfzpublic.gfz-potsdam.de/cone/persons/resource/jklump

Klump, Jens
CeGIT Centre for GeoInformation Technology, Geoengineering Centres, GFZ Publication Database, Deutsches GeoForschungsZentrum;

http://gfzpublic.gfz-potsdam.de/cone/persons/resource/ulbricht

Ulbricht, Damian
CeGIT Centre for GeoInformation Technology, Geoengineering Centres, GFZ Publication Database, Deutsches GeoForschungsZentrum;

Abstract
In the early days of thinking about cyberinfrastructure the focus was on "big science data". Today, the challenge is no longer to store several terabytes of data, but to manage data objects in a way that facilitates their re-use. Key to re-use by a user as a data consumer is proper documentation of the data. Data consumers also need discovery metadata to find the data they need, and descriptive metadata to be able to use the data they have retrieved. Thus, data documentation faces the challenge of describing these objects extensively and completely while keeping the items easily accessible at a sustainable cost level. However, data curation and documentation do not rank high in the everyday work of a scientist as a data producer. Data producers are often frustrated by being asked to provide metadata on their data over and over again, information that seemed very obvious from the context of their work. A challenge to data archives is the wide variety of metadata schemata in use, which creates a number of maintenance and design challenges of its own. PanMetaDocs addresses these issues by allowing an uploaded file to be described by more than one metadata object. PanMetaDocs, which was developed from PanMetaWorks, is a PHP-based web application that allows data to be described with any XML-based metadata schema. Its user interface is browser-based and was developed to collect metadata and data in collaborative scientific projects situated at one or more institutions. The metadata fields can be filled with static or dynamic content to reduce the number of fields that require manual entry to a minimum and to make use of contextual information in a project setting. PanMetaDocs reuses the business logic of PanMetaWorks, except for its authentication and data management functions, which are delegated to the eSciDoc framework.
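The idea of describing one uploaded file with several metadata records, and of pre-filling fields from project context, can be illustrated with a minimal sketch. It is written in Python for brevity, not in the PHP of the actual application, and every name in it (`DataItem`, `fill_fields`, `to_xml`, the schema labels and field names) is a hypothetical chosen for illustration; none of it is taken from the PanMetaDocs code base.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, Union
import xml.etree.ElementTree as ET

# A template value is either static text or a callable evaluated at fill time.
FieldValue = Union[str, Callable[[], str]]


@dataclass
class DataItem:
    """One uploaded file plus any number of metadata records (hypothetical model)."""
    filename: str
    # Maps a schema label to the serialized XML record for that schema.
    metadata_records: Dict[str, str] = field(default_factory=dict)


def fill_fields(template: Dict[str, FieldValue], manual: Dict[str, str]) -> Dict[str, str]:
    """Resolve static and dynamic defaults, then apply the manual entries on top."""
    resolved = {name: (value() if callable(value) else value)
                for name, value in template.items()}
    resolved.update(manual)
    return resolved


def to_xml(root_tag: str, fields: Dict[str, str]) -> str:
    """Serialize a flat field mapping as a simple XML record."""
    root = ET.Element(root_tag)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Project-wide defaults: one static field, one dynamic field (assumed names).
project_defaults: Dict[str, FieldValue] = {
    "publisher": "Example Project",
    "date": lambda: date.today().isoformat(),
}

item = DataItem("profile_01.csv")
# The same file receives two records in two different (made-up) schemata.
item.metadata_records["simple-dc"] = to_xml(
    "dc", fill_fields(project_defaults, {"title": "Temperature profile"}))
item.metadata_records["project-schema"] = to_xml(
    "record", fill_fields(project_defaults, {"instrument": "CTD"}))
```

In this sketch the data producer types only the title or the instrument name; the publisher and date come from the project context, which is the kind of reduction in manual entry the abstract describes.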
The eSciDoc repository framework is designed as a service-oriented architecture that can be controlled through a REST interface to create version-controlled items with metadata records in XML format. PanMetaDocs utilizes the eSciDoc item model to add multiple metadata records that describe uploaded files in different metadata schemata. As datasets are collected and described, shared with collaborating scientists, and finally published, data objects are transferred from a shared data curation domain into a persistent data curation domain. Through an RSS interface for recent datasets PanMetaWorks allows project members to be informed about data uploaded by other project members. The implementation of the OAI-PMH interface can be used to syndicate data catalogs to research data portals, such as the panFMP data portal framework. Once data objects are uploaded to the eSciDoc infrastructure, the software instance that was used for collecting the data can be dropped, while the compiled data and metadata remain accessible to other authorized applications through the institution's eSciDoc middleware. This approach of "expendable data curation tools" allows for a significant reduction in software maintenance costs, as expensive data capture applications do not need to be maintained indefinitely to ensure long-term access to the stored data.
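How a data portal might harvest a catalog over OAI-PMH can be sketched as follows. The `ListRecords` verb, the `metadataPrefix` and `from` arguments, and the XML namespace are defined by the OAI-PMH 2.0 specification; the base URL, the helper names, and the sample response are hypothetical, hand-written for illustration, and are not output from a real PanMetaDocs or eSciDoc installation.

```python
from typing import List, Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build a ListRecords request URL, optionally limited to recent records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml: str) -> List[str]:
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written sample response (illustrative only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:item-1</identifier></header></record>
    <record><header><identifier>oai:example.org:item-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical endpoint; a real installation would publish its own base URL.
url = list_records_url("https://repository.example.org/oai", from_date="2011-12-01")
identifiers = record_identifiers(SAMPLE_RESPONSE)
```

A harvester would fetch `url`, pass the response body to `record_identifiers`, and follow any resumption token the server returns; that paging step is omitted here to keep the sketch short.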