BHL Metadata Requirements
Biodiversity Heritage Library (BHL)
Metadata Schema - Draft 2
Introduction
The metadata schema to be adopted by the BHL must strike a balance between being practical and comprehensive. It must at minimum serve to identify and locate the digitized literature but it should also make provision for additional functions that may become possible in the future, which may mean that sections of the schema remain unused in the early stages of BHL development.
Note: In developing a "BHL Application Profile" it is not necessary to adopt comprehensive external schemas in their entirety. By using namespace declarations it is possible to select just those elements that will be of value to BHL.
This document outlines the functional areas for metadata coverage and the options for existing open data standards that may be adopted, at least in part. Comments and additions are encouraged and should be sent to n.thomson@nhm.ac.uk for incorporation.
Purpose
The purpose of the schema is to provide a data standard for the pooling of data between the partners and to aid the process of creating and managing the BHL. A subset will be used as the entrance point for users.
At this stage, it is not helpful to define the exact schema that the BHL will use - too much evolution has yet to take place based on the assessed needs of users and funders and the level of funding received. However, the functional areas that will require metadata can be defined and an appropriate standard identified from which data elements will be drawn to make up the BHL schema.
Functional areas to be covered
Some level of metadata will be required in several functional areas, outlined below. It is strongly recommended that BHL make use of open standards where these exist, rather than inventing its own to serve the same purpose.
As noted above, not all of every standard needs to be adopted, but through the use of namespace identification, those elements that are of direct use may be imported into a "BHL Application Profile". Even then, not all the elements need to be filled straight away, but by making provision for a rich structure, future services may be developed more easily, as and when time and finance allow.
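As an illustration of how namespace identification lets a profile borrow individual elements, the following Python sketch builds one record that mixes a MODS title with a Dublin Core rights statement. The MODS and Dublin Core namespace URIs are the published ones; the "bhl" namespace URI, the "record" and "priority" elements and the build_record function are hypothetical placeholders and not part of any agreed BHL schema.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for MODS and Dublin Core. The "bhl" namespace
# is a hypothetical placeholder for project-specific elements.
NS = {
    "mods": "http://www.loc.gov/mods/v3",
    "dc": "http://purl.org/dc/elements/1.1/",
    "bhl": "http://example.org/bhl/profile/",  # assumption, not a real URI
}

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1...
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return a qualified tag name in ElementTree's {uri}local notation."""
    return "{%s}%s" % (NS[prefix], local)


def build_record(title, rights, priority=None):
    """Build a record that borrows only the elements the profile needs."""
    record = ET.Element(q("bhl", "record"))  # hypothetical wrapper element

    # Borrowed from MODS: the title structure only.
    title_info = ET.SubElement(record, q("mods", "titleInfo"))
    ET.SubElement(title_info, q("mods", "title")).text = title

    # Borrowed from Dublin Core: a single rights statement.
    ET.SubElement(record, q("dc", "rights")).text = rights

    # Optional project-specific element; left out until it is needed.
    if priority is not None:
        ET.SubElement(record, q("bhl", "priority")).text = priority
    return record


record = build_record("An example title", "Public domain")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The optional priority element reflects the point made above: the profile can make provision for an element without requiring it to be filled straight away.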
Given that the BHL is a collaborative project which should have a very long lifespan, all the candidate schemas are XML-based to aid data exchange, aggregation and sustainability.
- Packaging: Each volume that is digitised will be made up of many page images, the associated OCR data and some metadata. All of this needs to be kept together in a package as a navigable "complex digital object". METS is a metadata schema specifically designed to meet this need and makes use of already existing metadata, such as the bibliographic metadata that will be exported from participants' OPACs (see Bibliographic below). Yet to be agreed is the level of granularity that will form a package. Ideally, this would be at article level, but the cost of doing so may be prohibitive in the early stages.
- Bibliographic: For monographs and serial titles, bibliographic metadata may be exported from the participants' OPACs. None of the current participants have article-level metadata, but much of this could be sourced by agreement from the indexing services, such as Zoological Record and Kew Record. There are two candidate schemas, MODS and MARCXML. The former is sometimes characterized as "MARC Lite", with some additional elements that cater for digital material, whilst the latter is a more exact and full mapping of the MARC standard. This metadata will form the basis of the BHL catalog and will allow users to locate the digitized material that they seek.
- Technical: Much of the technical metadata may be derived either directly from the equipment or created once for an entire session. It covers a variety of aspects such as the file format, color information, file sizes and so forth that help to determine what the storage and viewing requirements will be. Digital preservation aspects need to be covered from the start of the project - the principle of benign neglect does not work with digital material. The PREMIS standard (which won the 2005 Conservation Award) is the best example of a standard for this purpose. For the technical aspects of images, MIX (Metadata for Images in XML) appears to be the current example of best practice and, like METS and MODS, is maintained by the Library of Congress. It is, however, much more comprehensive than BHL is likely to require and is an example of a standard from which only selected elements will be used.
- Administrative: This area will primarily be concerned with IPR (Intellectual Property Rights). Although most items will be in the public domain or be made available under the terms of the Science Commons, this needs to be stated. As the project progresses and agreements are reached with publishers for the use of copyright material, the terms of the agreement need to be made available, at least for administrative purposes if not to the end user. The Dublin Core Admin Core (DC:AC) is a candidate schema.
- Identifiers: Each item digitised must have a Globally Unique Identifier (GUID) so that it may be unambiguously referenced. There are several schemas available, both commercial and non-commercial. The Life Science Identifier (LSID) schema was developed in the genomics world and identified at the Library and Laboratory Conference (London, 2005) as a schema that could unite the domains of genetic sequencing, the specimens from which the DNA was sourced and the literature. The Global Biodiversity Information Facility (GBIF) is also considering its adoption. The National Library of Australia has developed an alternative schema. It is important that the identifier may be easily synthesized, rather than being allocated by an authority.
- Workflow support: Additional metadata will be required purely to manage the workflow and the elements required will become obvious during the detailed planning of the project.
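To make the packaging and identifier areas more concrete, the following Python sketch synthesizes an LSID from local parts and uses it as the OBJID of a minimal METS package that wraps a MODS title and lists page images with their OCR files. The METS, MODS and XLink namespace URIs and the urn:lsid:authority:namespace:object pattern are the published ones; the function names, the "image" and "ocr" USE values, the example.org URLs and the file ID pattern are assumptions made for illustration, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("mods", MODS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def m(local):
    """Qualified METS tag name."""
    return "{%s}%s" % (METS, local)


def make_lsid(authority, namespace, object_id):
    """Synthesize an LSID locally from three parts (no central allocation).

    Simplification: parts containing ':' are rejected rather than escaped.
    """
    parts = [authority, namespace, object_id]
    if any(not p or ":" in p for p in parts):
        raise ValueError("LSID parts must be non-empty and contain no ':'")
    return "urn:lsid:" + ":".join(parts)


def build_package(lsid, title, pages):
    """Build a METS element; pages is a list of (image_url, ocr_url) pairs."""
    mets = ET.Element(m("mets"), {"OBJID": lsid})

    # Descriptive metadata: a MODS fragment wrapped inside the package.
    dmd = ET.SubElement(mets, m("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "MODS"})
    mods = ET.SubElement(ET.SubElement(wrap, m("xmlData")), "{%s}mods" % MODS)
    title_info = ET.SubElement(mods, "{%s}titleInfo" % MODS)
    ET.SubElement(title_info, "{%s}title" % MODS).text = title

    # File inventory: one group for page images, one for OCR text.
    file_sec = ET.SubElement(mets, m("fileSec"))
    groups = {
        "IMG": ET.SubElement(file_sec, m("fileGrp"), {"USE": "image"}),
        "OCR": ET.SubElement(file_sec, m("fileGrp"), {"USE": "ocr"}),
    }

    # Structural map: volume level here; article level would add nested divs.
    struct = ET.SubElement(mets, m("structMap"), {"TYPE": "physical"})
    volume = ET.SubElement(struct, m("div"), {"TYPE": "volume", "DMDID": "DMD1"})
    for order, (image_url, ocr_url) in enumerate(pages, start=1):
        page = ET.SubElement(volume, m("div"), {"TYPE": "page", "ORDER": str(order)})
        for kind, url in (("IMG", image_url), ("OCR", ocr_url)):
            file_id = "%s%04d" % (kind, order)
            f = ET.SubElement(groups[kind], m("file"), {"ID": file_id})
            ET.SubElement(f, m("FLocat"), {"LOCTYPE": "URL", "{%s}href" % XLINK: url})
            ET.SubElement(page, m("fptr"), {"FILEID": file_id})
    return mets


pages = [
    ("http://example.org/vol1/0001.tif", "http://example.org/vol1/0001.txt"),
    ("http://example.org/vol1/0002.tif", "http://example.org/vol1/0002.txt"),
]
lsid = make_lsid("example.org", "bhl-volume", "000123")
package = build_package(lsid, "An example volume", pages)
print(ET.tostring(package, encoding="unicode"))
```

Because the identifier is assembled from parts each partner already holds, no round trip to an allocating authority is needed, which is the property asked for under Identifiers.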
Functional areas to be covered - summary table
| Functional area | Comment | Options selection | Current preferred option |
| --- | --- | --- | --- |
| Packaging | Level of granularity to be agreed | METS | METS |
| Bibliographic or Descriptive | Export from partners' OPACs and third-party indexes e.g. IK | MODS / MARCXML | MODS |
| Technical | Includes requirements for interpreting the files and digital sustainability data | VRA Core / MIX for images; PREMIS for digital sustainability | MIX; PREMIS |
| Administrative | Includes rights and publisher agreements | Science Commons / ODRL / XrML / RoMEO / DC:AC | DC:AC |
| Identifiers | To enable links with other domains, such as specimen data, sequences and the original documents | LSID or National Library of Australia Persistent Identifier Scheme | LSID |
| Workflow support | Includes register of intent flags for "Done", "Priority", "Exception" | BHL-specific / JSTOR | BHL |
Further information