BHL Metadata Repository Proposal
Biodiversity Heritage Library Metadata Repository (BHL MR)
For all the remarkable development of language and the written word, primates remain a visual species par excellence. In support of this document, we therefore offer an accompanying PowerPoint that visualizes the concepts presented below.
Purpose and Scope of the BHL MR
The initial and primary purpose of the BHL MR is to meet the short-term need for a metadata system that supports the large-scale digitizing of materials from contributing institutions. A basic assumption of this document is that the scanning operation will be run by the Internet Archive. The BHL MR must therefore focus on design parameters and criteria that work with the current Internet Archive workflow, while remaining flexible enough to expand beyond that workflow.
For logistical reasons, it is envisioned that the BHL MR will need to support only two of the eight BHL member library collections during the initial start-up of the scanning operation. For this reason, extensive collection analysis is not a mandatory deliverable. For the purposes of BHL, the immediate collection analysis can be done at a gross level, with the criteria for what will be scanned first, and where, set by human selection (e.g. library A will scan literature related to vertebrates, library B will scan literature related to invertebrates). This gross level of pre-selection will enable a large amount of scanning to take place, while more extensive collection analysis tools, duplicate detection, and gap filling follow at a later date.
The BHL MR as discussed here is not necessarily a stand-alone data system; it may be a virtual system that resides alongside, or as part of, existing systems (RLG Union Catalog, Internet Archive data repository, WorldCat, etc.). It is not intended to be the final solution with a more robust “bells and whistles” public interface. Everything done in this phase should be “migratable” to, or able to be built upon for, the future large-scale BHL.
BHL MR: Phase I: Defining the Bucket
As detailed in the accompanying presentation, the metadata contained in the BHL MR will consist of two types: the bibliographic description (contained in a MARC or MARC-flavored format) and item level information. The bibliographic description is straightforward and needs little explanation. The item level information is the key component for handling serials, monographic series, and multivolume monographs.
Elements of the Bibliographic Description
The bibliographic description can consist of a standard MARC record, possibly in native MARC, MARC XML, or even a translation to MODS. We anticipate that, for the first phase, the data will at least be collected as MARC21. Each bibliographic description record must contain a unique identifier carried over from the source data catalog (e.g. a Horizon bibliographic number, a Unicorn bibliographic identification number, or a WorldCat identifier). Additionally, each bibliographic description record will be assigned a unique BHL MR id upon ingest into the BHL MR.
Elements of the Item Level Information
None of the BHL member institutions, nor possibly any library, have the complete information necessary to create the perfect item level record. Working in this realm of imperfect data, the BHL MR item level information will be constructed in the following manner.
• Each bibliographic description record will have one or more unique ids
• Each physical piece to be scanned will have a unique physical id attached to it (this could be a barcode, an RFID tag, or similar)
• Metadata associated with the physical id will be minimal; when possible, this information will be pulled from current item level information in the contributing members' online catalogs. When this data is not available, it will be created during the physical workflow process (see below). This information becomes the primary sub-element of the item level information. Each BHL MR record requires at least one sub-element, and additional levels of sub-elements should be supported.
• Upon ingest into the BHL MR, the bibliographic description record and the item level information will be linked in an appropriate manner for subsequent ingestion into the Internet Archive.
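The record structure described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the field names, and the "bhlmr-" id format are hypothetical and do not come from the proposal, and the MARC payload is treated as an opaque string.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

# Hypothetical counter standing in for however the BHL MR assigns its ids.
_bhl_ids = count(1)


@dataclass
class SubElement:
    """Minimal item level metadata, e.g. a volume or issue label."""
    label: str
    # Additional levels of sub-elements are optional.
    children: List["SubElement"] = field(default_factory=list)


@dataclass
class Item:
    """One physical piece to be scanned."""
    physical_id: str      # barcode, RFID tag, or similar
    primary: SubElement   # the one mandatory sub-element


@dataclass
class BibRecord:
    """Bibliographic description plus its linked items."""
    source_id: str        # identifier carried over from the source catalog
    marc: str             # MARC21 payload, treated as opaque here
    bhl_mr_id: Optional[str] = None
    items: List[Item] = field(default_factory=list)


def ingest(record: BibRecord, items: List[Item]) -> BibRecord:
    """Assign a BHL MR id (assumed format) and link the item level information."""
    record.bhl_mr_id = f"bhlmr-{next(_bhl_ids):06d}"
    record.items.extend(items)
    return record
```

A serial volume would then be ingested as `ingest(BibRecord("horizon:12345", marc_data), [Item("39088000000001", SubElement("v. 12"))])`, where the catalog number and barcode are made-up values.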
Elements of the Page/Image Structure Map
The Internet Archive currently has a workflow and methodology for creating a page/image level structure map at the point of scanning. The BHL MR builds upon the page/image structure map to allow a richer level of information for bibliographic entities that consist of multiple parts. The page/image structure map will consist of the following elements:
• Unique id
• Sequential page/image number (e.g. 0001, 0002, 0003)
• Explicit page information (title page, iv, p. 6, etc.). NB: the explicit page information is a desired but not mandatory element. It is assumed that some level of explicit page information would be captured in an automated manner (e.g. micro capture of a page area with OCR).
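As a rough illustration of the three elements listed above, a structure map entry might be modeled as below. The names `PageEntry` and `build_structure_map`, and the way the unique id is derived from an item id plus the sequence number, are assumptions made for this sketch rather than anything specified by the Internet Archive or the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageEntry:
    unique_id: str                        # assumed form: "<item id>-<sequence>"
    sequence: str                         # zero-padded, e.g. "0001"
    explicit_page: Optional[str] = None   # desired but not mandatory


def build_structure_map(item_id: str,
                        explicit_pages: List[Optional[str]]) -> List[PageEntry]:
    """Create one entry per scanned image, numbered in scan order.

    explicit_pages holds the captured label for each image ("title page",
    "iv", "p. 6", ...) or None where no label was captured.
    """
    entries = []
    for n, label in enumerate(explicit_pages, start=1):
        seq = f"{n:04d}"
        entries.append(PageEntry(unique_id=f"{item_id}-{seq}",
                                 sequence=seq,
                                 explicit_page=label))
    return entries
```

Images without a captured label simply carry `None`, so the sequential numbering stays complete even when the explicit page information is missing.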
Elements of the Physical Workflow
A specific workflow of materials, the "paging and staging" element, is presupposed and is key to the successful functioning of the BHL MR as a tool for large-scale scanning. As outlined in the accompanying presentation, the physical workflow will be as follows:
At the shelves:
- Library Team Leader completes a preliminary review of a range of materials to be scanned; general rules for the area are prepared for team members, along with more detailed instructions for anomalies
- Library Team members begin pulling items from the shelf for transport to scanning center
- Monographs are wanded to update their status in the BHL MR to "in transit"; this could even be tracked in the library's ILS circulation module if needed.
- Multi-part items with physical identifiers (e.g. barcodes or RFID tags) are wanded to update their status in the BHL MR to "in transit"; this could even be tracked in the library's ILS circulation module if needed.
- Multi-part items lacking physical identifiers are tagged and minimal records are entered into the BHL MR
- All items are physically transferred to the scanning center
At the project management station:
- All items pulled from shelves and marked "in transit" are batch loaded (via a Z39.50 fetch) from the BHL MR to the Internet Archive system
At the scanning station:
- Items are wanded at the scanning station, calling up the record ingested by the bulk fetch from the BHL MR
or, alternatively, with the batch load done after scanning:
At the scanning station:
- Items are wanded at the station to mark and “label” the data.
At the project management station:
- The day's worth of identifiers (barcodes or RFID tags) is scanned and a batch load is done (via a Z39.50 fetch) from the BHL MR to the Internet Archive system.
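The status tracking behind this workflow can be sketched with a toy in-memory model. Everything here is hypothetical: the `Repository` class and its method names are invented for illustration, and the Z39.50 fetch itself is not modeled; `batch_in_transit` only selects the identifiers that such a fetch would request.

```python
from typing import Dict, List

ON_SHELF = "on shelf"
IN_TRANSIT = "in transit"


class Repository:
    """Toy stand-in for the BHL MR status table (not a real system)."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def register(self, physical_id: str) -> None:
        """Enter a minimal record, e.g. for a newly tagged multi-part item."""
        self.status.setdefault(physical_id, ON_SHELF)

    def wand(self, physical_id: str) -> None:
        """Mark an item as pulled from the shelf."""
        if physical_id not in self.status:
            raise KeyError(f"no record for {physical_id}; tag and register it first")
        self.status[physical_id] = IN_TRANSIT

    def batch_in_transit(self) -> List[str]:
        """Identifiers to include in the next batch load, in sorted order."""
        return sorted(pid for pid, state in self.status.items()
                      if state == IN_TRANSIT)
```

In the first variant of the workflow the batch would be selected before scanning; in the second it would be selected at the end of the day from the identifiers wanded at the scanning stations. The selection logic is the same in both cases.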
BHL MR: Phase II
Phase II of the BHL MR implementation will add public and administrative interfaces. The administrative interface will allow for enhancements and modifications to the records. The public interface will allow for querying and retrieval of the full text from the Internet Archive through the BHL MR.
The most important feature of the second phase will be the opening up of the BHL MR to allow its data to be harvested for reuse by others. Examples would include the Coalition for the Barcode of Life (CBOL), the Global Biodiversity Information Facility (GBIF), the Integrated Taxonomic Information System (ITIS), etc.
GUID Creation
For optimal use of the contents of the BHL MR, page level addressing of images is necessary. Automated page level GUID creation is inherent in the hierarchical structure of the bibliographic record, item, sub-element, and image/page structure map. Page level GUIDs will be built "bottom up" by combining the bibliographic description ID, the item/sub-element ID, and the page/image ID. This page level GUID would be used as the citation level "handle" by other biodiversity research tools such as GBIF, CBOL, ITIS, GenBank, etc. Additionally, commercial indexing and abstracting services (e.g. Zoological Record) would use the GUID to link to full text sources.
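A minimal sketch of such a composite identifier follows. The proposal does not specify a GUID syntax, so the "/" separator, the four-digit page padding, and the function names `page_guid` and `parse_page_guid` are all assumptions made for illustration.

```python
from typing import Tuple


def page_guid(bib_id: str, item_id: str, page_seq: int, sep: str = "/") -> str:
    """Combine the three ids into one page level identifier (assumed syntax)."""
    if page_seq < 1:
        raise ValueError("page sequence numbers start at 1")
    for part in (bib_id, item_id):
        # Reject empty ids and ids containing the separator, which would
        # make the GUID impossible to split back into its parts.
        if not part or sep in part:
            raise ValueError(f"invalid id component: {part!r}")
    return sep.join([bib_id, item_id, f"{page_seq:04d}"])


def parse_page_guid(guid: str, sep: str = "/") -> Tuple[str, str, int]:
    """Split a GUID back into (bib id, item id, page sequence)."""
    bib_id, item_id, seq = guid.split(sep)
    return bib_id, item_id, int(seq)
```

Because each component is already unique at its own level, the combined string is unique without any central registry, which is what makes automated creation possible. An external tool holding a GUID can also recover the parent item or bibliographic record by dropping trailing components.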
2006-02-03