Deep Thoughts
Biodiversity Heritage Library
Metadata Repository Preliminary Requirements
Draft
January 2006
This document is a draft of the minimum requirements for the first stage of a metadata repository for the Biodiversity Heritage Library. The first stage is not intended to be a full-scale, publicly accessible union catalog of all the materials, but it is anticipated that from this root metadata repository a fuller, more robust public interface to the BHL can be built. This first stage has a few basic goals it is attempting to achieve:
• Prioritize scanning efforts
• Detect duplicates
• Guide workflow
• Track materials through the scanning phases
• Connect a basic resource description with the digital images
• Provide the searching mechanism to retrieve digital scans
• Connect the parts of digital scans to the whole intellectual unit
• Report on the quantity of items submitted and the submitting institutions
This document is laid out by what the metadata repository is considered to “Must Do” to be successful. Under each “Must Do” is a list of what the repository “Must Have” to accomplish the task. Descriptions, examples, and scenarios are given in an attempt to provide further explanation. The appendices include a “Must Have” list, issues that need to be resolved, and a detailed breakdown of potential data fields.
I. Inventory of items to be scanned
a. The metadata repository should be able to take the metadata from the depositing institutions and do a basic comparison to find duplicate titles and runs.
This will not be a complete list of all materials in each institution but will be a benchmark of what materials at each institution can be flagged to be part of the BHL.
b. This requires the repository to be able to ingest data from the repository institutions. The data for this step could be the basic MARC records and serve as the seed for the entire metadata repository.
c. Should this inventory include holdings for each title at this point? Should holdings somehow be generated from digitization? Should holdings be collected for the deduping?
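The basic title comparison described in (a) could be sketched as follows. This is only an illustration: the field names ("title", "institution") and the normalization rules are assumptions, not a fixed schema, and real MARC-based matching would need far more sophistication.

```python
# Sketch of title-level duplicate detection across depositing institutions.
# Field names and normalization rules are illustrative assumptions.
import re
from collections import defaultdict

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace for rough matching."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def find_duplicates(records):
    """Group records whose normalized titles match; keep only groups held
    by more than one institution (candidate duplicates for review)."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_title(rec["title"])].append(rec)
    return {t: recs for t, recs in groups.items()
            if len({r["institution"] for r in recs}) > 1}

records = [
    {"title": "The Auk: A Quarterly Journal of Ornithology", "institution": "SIL"},
    {"title": "The Auk; a quarterly journal of ornithology.", "institution": "MCZ"},
    {"title": "Flora Brasiliensis", "institution": "MBG"},
]
dupes = find_duplicates(records)  # flags the two "Auk" records as one group
```

In practice the match key would likely combine title with other MARC fields (dates, ISSN) to avoid false positives.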
II. From the inventory, a workflow of what to scan first and where
a. Highlights from each institution can be chosen based on the inventory to determine which topics will be scanned at which location
b. Duplications of holdings can be used to provide backup for filling in missing volumes, issues, etc.
c. Duplications of holdings need to be detected within each repository library. (The Smithsonian has numerous copies of the same titles; some are rare and others circulate, etc.)
III. Track material through the scanning workflow
a. Unique identifier attached (permanently or temporarily) to the physical item
An example of a unique identifier would be a barcode attached to the physically bound volume of the item to be scanned. This identification number can then be used for multiple purposes, including tracking where the physical piece is located.
Each piece that goes on the scanner must have a unique identifier. It is impossible to maintain 100% accuracy in connecting the digitized piece to the proper metadata in a large-scale scanning operation, but we should come extremely close if the link is a scannable identification number and not a keyed-in, human-created one.
The Internet Archive has had some issues linking the scanned object to its bibliographic record when the scanning operator inputs tags of data that they think might be good enough to later search via Z39.50 for the proper record.
A quick test of using barcodes attached at the time of scanning to search the bibliographic records for the exact matching barcode number seems to be successful.
N.B. The barcode is for the physical piece and NOT the intellectual unit of a digitized product. Multivolume works bound indiscriminately do not help in identifying the intellectual units of a title.
If no barcode is attached to a specific piece, one can be assigned at the scanning station and metadata assignments can be made afterward.
b. Assigned locations and/or status coding in field(s)
The system will need a field that can hold the locations used in the physical workflow of the scanning process. Ideally this field would be auto-filled with each scan of the unique identifier (barcode). The location can be used to track down where in the process the scanned item is at any given time. This can include information such as “Transit From Library to Scanning”, “Scanner 1”, “Quality Review”, “Transit Returning To Library Location”, etc. Entries could also be time stamped for further detail.
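The auto-filled, time-stamped status field could work along the lines of this sketch; the station names and the in-memory store are illustrative only (a real system would write to the repository database).

```python
# Sketch of a workflow-status log keyed by the barcode scanned at each
# station; station names and the in-memory dict are illustrative only.
from datetime import datetime, timezone

STATUS_LOG = {}  # barcode -> list of (timestamp, location) entries

def record_scan(barcode, location):
    """Append a time-stamped location entry each time a barcode is scanned."""
    entry = (datetime.now(timezone.utc).isoformat(), location)
    STATUS_LOG.setdefault(barcode, []).append(entry)
    return entry

def current_location(barcode):
    """Return the most recent location for a piece, or None if never scanned."""
    entries = STATUS_LOG.get(barcode)
    return entries[-1][1] if entries else None

record_scan("39088001234567", "Transit From Library to Scanning")
record_scan("39088001234567", "Scanner 1")
```

Keeping the full scan history, rather than overwriting a single status field, also gives the timing detail mentioned above for free.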
IV. Connect separate scanned parts of a bibliographic entity together
a. Intellectual identification number for the overall item being scanned.
b. This unique number can be based on information related to the intellectual bibliographic whole, with related granular parts built off the base identification number.
1. This could be a GUID or some other unique identifier with some discernible information embedded.
2. This unique identifier can be used to bring all parts of a multipart piece together. The prefix of the identifier could be the “parent” record, with a suffix added on for the granularity:
• multivolume
• multibound
• chapters
• articles
• down to pages
3. This could be a unique identifier that can be used to link between related works
• earlier titles
• later titles
• general relationships
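One way the parent/suffix scheme above might look is sketched below. The “bhl” prefix and the segment labels (v for volume, c for chapter, p for page) are assumptions for illustration, not a proposed standard.

```python
# Sketch of a parent/suffix identifier scheme: the prefix identifies the
# intellectual whole, and dot-separated suffixes add granularity.
# The "bhl" prefix and segment labels are illustrative assumptions.

def make_id(parent, volume=None, chapter=None, page=None):
    """Build a granular identifier off the parent (intellectual) ID."""
    parts = [parent]
    if volume is not None:
        parts.append(f"v{volume}")
    if chapter is not None:
        parts.append(f"c{chapter}")
    if page is not None:
        parts.append(f"p{page}")
    return ".".join(parts)

def parent_of(identifier):
    """Recover the parent record ID from any granular identifier."""
    return identifier.split(".")[0]

title_id = "bhl-000123"
page_id = make_id(title_id, volume=2, chapter=5, page=17)
```

The key design property is that any granular identifier can be mechanically resolved back to its parent record, which is what brings the parts of a multipart piece together.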
V. Basic bibliographic description of the scanned object
a. The metadata repository should be able to accept traditional metadata already created for the various objects being submitted to be scanned. This includes MARC records.
b. The metadata repository should also be able to accept metadata in other forms (tab delimited, comma delimited, MS Access, ?)
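Ingesting tab- or comma-delimited metadata could be as simple as the following sketch; the sample column names are assumptions, not a required schema.

```python
# Sketch of ingesting tab- or comma-delimited metadata files;
# the column names in the sample are illustrative assumptions.
import csv
import io

def ingest_delimited(text, delimiter=","):
    """Parse delimited metadata into a list of field dictionaries,
    using the first row as field names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

sample = "title\tauthor\tinstitution\nFlora Brasiliensis\tMartius\tMBG\n"
records = ingest_delimited(sample, delimiter="\t")
```

The harder problem is not parsing but mapping these ad hoc columns onto whatever internal structure (MARC-derived or otherwise) the repository settles on.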
VI. Connection between the bibliographic description and the scanned object
a. File naming structure?
b. URL/URI/Handle/GUID
c. What level do we go to in this round? Eventually we will need page level for page-turning purposes, etc.
VII. Basic search and retrieval of scanned object and metadata description
a. The repository should have basic indexes built to facilitate searching and retrieving the proper metadata.
b. Retrieval should include:
• Title
• Author
• Repository Institution
• File name
• Unique identifier (barcode)
• Unique object identifier (GUID)
• Holdings data (Volume, Issue, etc. How granular?)
• General biological subject heading (?) – This would be incredibly useful at the time of deduping and for reporting contents at various stages of the project.
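A minimal sketch of field-level indexes supporting the retrieval points above; the record fields shown are assumptions for illustration, and a production system would use a proper indexing engine rather than in-memory lookups.

```python
# Sketch of field-level retrieval indexes over ingested records;
# the fields and sample records are illustrative assumptions.
from collections import defaultdict

def build_index(records, fields):
    """Build field -> lowercased value -> record-position lookups."""
    index = {f: defaultdict(list) for f in fields}
    for pos, rec in enumerate(records):
        for f in fields:
            value = rec.get(f)
            if value:
                index[f][value.lower()].append(pos)
    return index

records = [
    {"title": "Flora Brasiliensis", "author": "Martius", "barcode": "39088000000001"},
    {"title": "The Auk", "author": "AOU", "barcode": "39088000000002"},
]
index = build_index(records, ["title", "author", "barcode"])
hits = index["barcode"]["39088000000002"]  # exact-match barcode lookup
```

Exact-match lookups like the barcode case are the critical ones for the scanning workflow; title and author searching would additionally need tokenization and partial matching.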
VIII. Inventory of the overall BHL
a. The repository must record the holdings for what we have. This could be accomplished by ingesting the holdings from each institution and having the system match them appropriately, or by having the scanning system assign holdings as items are scanned, which can then be compared against the institution's records.
b. Missing items from runs should be able to be generated from the holdings of the scanned items.
c. There will need to be some clear distinction between what has been scanned and what has not been scanned.
d. Reporting mechanism for basic statistical data: number of pieces, unique volumes, classification, institutional counts, inventory of what is scanned, inventory of what is left to be scanned, etc.
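The missing-run report in (b) could be generated along these lines, assuming volumes are tracked as simple integers; real serial enumeration (combined volumes, supplements, new series) would be messier.

```python
# Sketch of generating a missing-volume report from scanned holdings:
# compare the volumes actually scanned against the expected run.
# Assumes volumes are plain integers, which real serials often are not.

def missing_volumes(scanned, first, last):
    """Return volumes in the expected run [first, last] with no scanned copy."""
    return sorted(set(range(first, last + 1)) - set(scanned))

scanned = [1, 2, 3, 5, 6, 9]
gaps = missing_volumes(scanned, first=1, last=10)  # -> [4, 7, 8, 10]
```

Combined with the duplicate-holdings data from section II, each gap could then be matched against other institutions' holdings to find a fill-in copy.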
1. Appendix:
Must Have list
a. As robust a bibliographic description as possible
Reuse as much as possible from the depositing institution
b. Connection between bibliographic metadata to the scanned images.
c. Identification number on each physical piece
d. Global identification number for the intellectual piece that can then be appended with granularity
e. Searchable indexing on basic fields
f. Workflow tracking system
g. Statistical reporting on volumes, completed scanning, classification, etc.
2. Appendix:
Issues to be resolved
a. Do we care if the structure is MARC? Do we need to be able to edit the records in the repository system? Do we need to be able to delete? Can records be submitted in batch and individually?
b. What format is most of the data currently in? If we ingest only MARC, what do we do with the various standards within MARC (are there any? Is NH London different than USMARC?)? Will this box us in to accepting only MARC in the future?
c. Other format – or just a database structure?
d. Do we need to hold all information in one place or do we have it be a federated type search interface that then applies extra information to the bibliographic records as needed?
e. How do we capture the holdings of each institution, and how do we figure out the duplication within our membership?
f. How do we store the holdings information as we digitize?
g. Utility or Not a Utility:
o Ownership
o RLG and OCLC bring some very interesting expertise to the table, but probably also some goals of their own. My concern is the fear of our data being locked up for members of the utility only.
o An alternative might be for this prototype to actually be “housed” elsewhere until something like permanent residency can be worked out.
o This is not to exclude, in the long run, a utility from running the larger, more robust architecture of the BHL system.
h. Linking
external linking
discovery of the title and its parts from external sources
openurl
future links
going to outside sources
outside sources coming in
internal to other “modules” e.g. administrative data, preservation data, etc.
i. OAI compliant
j. PREMIS – preservation metadata? More administrative information? Scanning data?
k. Rare book inventory of taxonomic literature and availability of accompanying metadata. (Smithsonian definitely needs to review this.)
l. Subject groups? Useful for deduping and for post reporting. Potential use for looking for more funds, etc.
Botany vs. Entomology
Linnaean classification
LC Classification
LC Subject headings?
BCA breakdown as a model
3. Appendix:
Data fields that should be collected
Standard Bibliographic records that already exist.
Try to use all legacy data first before creating new stuff.
MODS structure for related items embedded in parent record.
Dublin Core for simplification?
Example of Oxford Digital Library
http://www2.odl.ox.ac.uk/guidelines/