This is a read-only archive of the BHL Staff Wiki as it appeared on September 21, 2018.

Executive Summary (Draft)
BHL Meeting
June 12, 2006
Smithsonian Institution
Washington, D.C.

The feasibility of the initial work required to create the BHL has already been proven by the Internet Archive: documents can be scanned at large scale, and some title-level metadata can be captured (fetched through Z39.50 and stored in files associated with each title).
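
The "stored in associated files at the title level" step above can be sketched very simply. The record shape and field names below are illustrative assumptions, not BHL's actual schema, and the Z39.50 fetch itself is out of scope; this only shows writing one fetched title-level record to its own file:

```python
import json
import tempfile
from pathlib import Path

def store_title_record(record: dict, base_dir: str) -> Path:
    """Store one title-level metadata record (e.g. fetched via Z39.50)
    as a JSON file named after the title's identifier."""
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['id']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Hypothetical record shaped like the title-level fields a Z39.50
# fetch might return; the field names are placeholders.
record = {
    "id": "mobot-0001",
    "title": "Annals of the Missouri Botanical Garden",
    "publisher": "Missouri Botanical Garden",
    "start_year": 1914,
}
saved = store_title_record(record, base_dir=tempfile.mkdtemp())
```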

But it is clear that these functions are not the only requirements for a functional 1st level BHL. It was proposed to create a prototype, now known as the Skinny End to End Pipe (SEEP), that incorporates more of the requirements essential for a functional and fundable BHL. This would take a single journal title and test the processes from shelf to Web.

Marcia Adams (SIL) and Bill Carney (OCLC) discussed the work going on to collect in one place the metadata identifying BHL materials. OCLC has offered to do an analysis of the data. Three libraries sent material to OCLC: Missouri Botanical, Natural History London, and Natural History Smithsonian. Initial results of these three data pulls pointed out the need to clarify the Collection Development policy for BHL (what material to consider, what topics, what date ranges, etc.), and it was stressed that all participating libraries need to send their data to OCLC as soon as possible.

The “Data and Metadata” table by Martin Kalfatovic and the outline of possible functions provided by Neil Thomson show some of the unique needs of the BHL. IA may store some of these data and metadata sets, but additional services will be necessary to serve up this information in a functional way. Prototyping some material in IA has shown that some necessary metadata functions are currently missing. For example, there is no place to apply the more granular (and required) metadata needed for the description, and therefore the discovery, of serial (and serial-like) titles and parts; nor is there a place for the Taxonomic Intelligence tools to reside, work, and deliver the mandatory first level of need.
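
The granular serial metadata mentioned above amounts to a title-to-part hierarchy. A minimal sketch of such a structure, under assumed field names (labels, page ranges) that are illustrative rather than any agreed BHL schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Part:
    """One intellectual unit of a serial, e.g. an issue or article."""
    label: str        # e.g. "Vol. 1, No. 2"
    start_page: int
    end_page: int

@dataclass
class SerialTitle:
    """Title-level record carrying the part-level detail that a
    book-oriented record has no place for."""
    title: str
    issn: Optional[str] = None
    parts: list[Part] = field(default_factory=list)

# Hypothetical journal used only to exercise the structure.
journal = SerialTitle(title="Hypothetical Journal of Botany")
journal.parts.append(Part("Vol. 1, No. 1", 1, 48))
journal.parts.append(Part("Vol. 1, No. 2", 49, 96))
```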

IA material is not searchable or findable in the way the BHL desires, with the important text “locked” on the virtual page. We need better title-level and part-level metadata for discovery and delivery. We need some Taxonomic Intelligence associated with the text to bring the right data to the right users. And we need some sort of interface to make the information connected, searchable, and deliverable – even if not for public use. This includes more powerful PDF tools (possibly from a vendor such as CCS), better compression of the images (possibly from a vendor such as LuraTech), XML markup of the combined efforts (as seen in the demonstration of LuraTech and CCS output combined into METS documents), and Taxonomic Name Intelligence (as demonstrated by MBL/WHOI, and possibly other taxon identification projects).
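
To make the idea of Taxonomic Name Intelligence concrete: at its simplest, it means spotting candidate Latin binomials in OCR text. The toy regex below is only an illustration of the concept; the real tools (such as those from MBL/WHOI mentioned above) rely on name dictionaries and fuzzy matching, not a two-word pattern:

```python
import re

# Naive pattern for a Latin binomial: a capitalized genus word followed
# by a lowercase species epithet of three or more letters. This will
# both miss real names and flag false positives; it is a sketch only.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")

def find_candidate_names(text: str) -> list[str]:
    """Return candidate species names found in a page of OCR text."""
    return [f"{genus} {species}" for genus, species in BINOMIAL.findall(text)]

page = "Specimens of Quercus alba were collected near the river."
names = find_candidate_names(page)
```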

Currently there is a gap between what we have accomplished, what we want, and what we need to do first to get the BHL started. The Internet Archive needs to develop a workflow for serials. Taxonomic Intelligence tools need to be incorporated, and the resulting files need to be stored and made accessible.

Chris Freeland demonstrated an approach to an interface that melds together all the needs foreseen by the BHL community. His Strawman version incorporated the image production, the OCR text files, the navigation system needed for serials, and the taxonomically intelligent identification of species names and resolvers. Neil Thomson outlined the requirements for a fully functional BHL (version X), which included the need for GUIDs; interaction with other biodiversity communities (researchers, organizations, publishers, etc.); a register of intent to scan; and rights metadata.

The SEEP prototype will test a suitable serial run to begin to see how all these tools and tool sets can be used to form the full-scale BHL. A group has formed to try to test SEEP by the second week of July. What might be tested by the SEEP project is whether the BHL should be “stuff” driven or “metadata” driven. In reality, there need to be items to scan to initiate the whole process for a title; but without a reliable metadata connection, the “stuff” is lost. The coordination and timing of the collection of the various required pieces need to be analyzed.

*Robin Chandler and Robert Miller need to be given a heads-up that we would like to contact them about the practicalities of the scanning workflow and the storing of associated data/metadata.

*We need to determine the data and metadata that need to be acquired, and at what stages of the workflow. This includes the following:

A double stream will need to be developed so that there is proper metadata to harvest and generate for the serial needs, while a ‘stuff’-generated metadata stream comes out of the scanning process. Combining the diagrams initially drawn of the Metadata Repository (MR) with the one outlined by Bill Carney of OCLC, both a physical workflow and an information workflow need to be developed. A “dirty” MR will be made up of the extracts from each participating library's catalog. The scanning process can then call upon this MR to pull data for each object at the time of scanning. The workflow process needs to ensure there is some tracking of each “piece”. After scanning, the digital object will flow out of the scanner with its associated metadata. It is at this purgatory stage that the further processes will be applied (OCRing, intelligent division into intellectual units (CCS), Taxonomic Intelligence (MBL/WHOI), etc.). From there the combined product will be placed into the “clean” MR. This final MR will hold the completed, scanned material with all the 1st level BHL functionality. (See attached diagram.)
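
The dirty-MR → post-scan processing → clean-MR flow described above can be sketched as a tiny pipeline. The stage names, identifiers, and fields below are illustrative stand-ins (the real steps are OCR, CCS division, and the MBL/WHOI Taxonomic Intelligence tools), not an agreed BHL design:

```python
# "Dirty" MR: raw catalog extracts keyed by a piece-level identifier,
# which also gives the workflow its tracking handle.
dirty_mr = {
    "barcode-123": {"title": "Flora Exemplaris", "scanned": False},
}

def process(item: dict) -> dict:
    """Apply the post-scan 'purgatory' steps (stand-ins for OCR,
    intellectual-unit division, Taxonomic Intelligence) and mark
    the record complete."""
    enriched = dict(item)
    enriched["ocr_text"] = "placeholder recognized text"
    enriched["taxa"] = ["Flora exemplaris"]   # names a TI tool might find
    enriched["scanned"] = True
    return enriched

# "Clean" MR: the combined product after all post-scan processes.
clean_mr = {piece_id: process(rec) for piece_id, rec in dirty_mr.items()}
```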

*We need to continue to get more libraries' collections to OCLC for analysis. The “Best Practices” document needs to be completed and shared. Assistance needs to be offered to those libraries that are having difficulty getting their data to OCLC.

*After that is done, the data and findings need to be examined to determine whether the analysis can help guide decision making (which library handles what material, what material we are lacking, and whatever else becomes clear).

*Each library needs to begin thinking about how their material will reach a scanning center. Consortial agreements need to be sought to develop ways to keep the scanning centers “fed” with enough material to keep the costs as low as possible.

*Specification requirements need to be drafted for the functions determined to be mandatory for the 1st phase of BHL. These requirements will help people (Brewster (IA), Sebastian (IndexData), Mark (LuraTech/CCS), Cathy (MBL/WHOI)) develop a cost analysis of what it will take to get where we need to go. SEEP will initially test this as a prototype.

The following working groups were identified:

*Workflow: AMNH (Christie?), SIL (Ann J?, Lowell?), NH London (Bernard?), IA (Robert Miller?).

Digital workflow: IA (Brewster?)

Metadata workflow: OCLC (Bill?)

*Post OCLC Analysis/Data returns

*Extraction of metadata records to OCLC: Suzanne (SIL), Bill (OCLC), Brian (OCLC), Chris (Kew), Joe (Harvard both locations?), Susan (NY Botanical), Eric (AMNH), Bernard (NH London), Zoltan (MOBot)

*Skinny End to End Pipeline: Bernard or Ed Chamberlain (NH London), Martin (SIL), Christie (AMNH), Brewster (IA), Mark (LuraTech).
