17 Jan 2008 Internet Archive Meeting

Internet Archive / BHL Meeting

17 January 2008

Attending

Brewster Kahle (IA)
Steve (IA)
Tracey Jaquith (IA)
Marcus (IA)
Chris Freeland (MOBOT/BHL)
Martin Kalfatovic (Smithsonian/BHL)

WonderFetch Topics

Implementation of WonderFetch in Scriblio. Steve had already made significant progress in implementing the WonderFetch fields.
Action items:

SIL will modify the current WonderFetch interface to reflect the fields and locations that Steve has specified.
It was agreed that BHL should maintain a central, unified WonderFetch site for all BHL members to point their scanning operations at. Once the SIL test is deemed successful, migration to the central WonderFetch site will begin.
BHL will develop a tool for the ingest of locally generated spreadsheets into the central WonderFetch site

Timeline:

SIL will finalize the WonderFetch
Central site and spreadsheet ingest developed
BLC members begin using WonderFetch
NHM begins using WonderFetch
NYPL site begins using WonderFetch
UIUC and UNC contributing members begin using WonderFetch (UIUC may begin sooner after discussions with Betsy Kruger)

Scanning and downloading information reporting

Optimum methods for obtaining relevant information. After a discussion with Brewster and Steve, it was determined that deeper access to Metamanager would not be necessary for BHL (and other contributing libraries) to get more detailed information about the number of titles and pages scanned at the contributor, collection, and scanning center levels.
Working with Tracey (the IA staff person responsible for the IA search engine), it was decided that appropriately formulated search queries would be able to return the key data elements that are necessary for BHL (and other contributing libraries) to manage their IA scanning processes. The specific needs were for:

number of titles scanned during selected date periods (at the month level). This number would include all titles scanned and publicly available on the IA site (not just those that have been “curated” or otherwise “approved” by IA staff).
number of pages scanned (as per above).
these queries will also include the IA item id which will enable simplified access to scanned items for use in the Penn State structural analysis tool.
NOTE: BHL staff are aware that the IA may change or “darken” titles between the time items are scanned and the time that they are “curated” and approved for billing purposes. Contributing members need this information for internal and external reporting purposes and realize that there will be inconsistencies between the counts obtained via the search engine and those billed via Metamanager.
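
The reporting needs listed above could be met by a small post-processing step once the search queries return one row per scanned item. The following is only a sketch under stated assumptions: the field names `item_id`, `scan_date`, and `page_count` are hypothetical placeholders, not confirmed IA search fields, and the sample rows are invented for illustration.

```python
from collections import defaultdict


def monthly_counts(records):
    """Aggregate per-item search rows into per-month title and page counts.

    Each row is assumed (hypothetically) to carry an item id, a scan date
    in YYYY-MM-DD form, and a page count.
    """
    totals = defaultdict(lambda: {"titles": 0, "pages": 0, "item_ids": []})
    for row in records:
        month = row["scan_date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        bucket = totals[month]
        bucket["titles"] += 1
        bucket["pages"] += int(row["page_count"])
        # Keep the item ids so they can be handed to other tools later.
        bucket["item_ids"].append(row["item_id"])
    return dict(totals)


# Invented sample rows, for illustration only.
sample = [
    {"item_id": "exampleitem01", "scan_date": "2008-01-05", "page_count": 312},
    {"item_id": "exampleitem02", "scan_date": "2008-01-20", "page_count": 148},
    {"item_id": "exampleitem03", "scan_date": "2007-12-11", "page_count": 96},
]
report = monthly_counts(sample)
```

Because these counts come from the search engine rather than from Metamanager, they should be expected to differ from the billed figures, as the note above explains.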

BHL staff will work with Tracey (or other appropriate IA staff) to formulate the most effective queries. Tracey will implement selected changes in the IA database during the next build to make the queries more effective.

BHL will build and maintain a portal page for OCA members so that this enumerative data can be easily downloaded for post-processing (e.g. importing into spreadsheets)

A related discussion ensued about a BLC request for information/statistics on downloads of files from the IA site. In looking at the item/page counts question, it was determined that a similar search engine query/display would solve this problem. As per above, BHL will assist in the creation of a portal page that will allow for easily downloadable files for post-processing (e.g. importing into spreadsheets).
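
As a rough illustration of what an "easily downloadable" file for spreadsheet import might look like, the sketch below writes monthly counts as CSV. The column names (`month`, `titles`, `pages`) and the sample figures are assumptions made for this example; the actual portal output was not specified at the meeting.

```python
import csv
import io


def counts_to_csv(rows):
    """Serialize monthly count rows as CSV text, oldest month first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["month", "titles", "pages"],  # hypothetical columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["month"]):
        writer.writerow(row)
    return buffer.getvalue()


# Invented figures, for illustration only.
csv_text = counts_to_csv([
    {"month": "2008-01", "titles": 2, "pages": 460},
    {"month": "2007-12", "titles": 1, "pages": 96},
])
```

The same layout would serve the download statistics requested by the BLC, with a download count column in place of the page count.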

Metadata management

Revising/Updating BHL data within the Internet Archive. Editing of metadata within the IA is not recommended. Though possible, the recommended option will be for the BHL to edit data on the BHL portal. Periodically, the data from the BHL portal will be re-ingested (ContribSubmit) into the IA. This should probably be done in small batches (onesey/twoesey), though larger batch processing may be possible. BHL will evaluate the need for metadata updating and create a process to edit the data within the BHL portal. Synchronization of this metadata with that in the IA will be discussed further.
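
The small-batch approach could be sketched as follows. This is a minimal illustration, not the ContribSubmit interface: the `submit` callable is a hypothetical stand-in for whatever re-ingest mechanism is eventually agreed, and the item names are invented.

```python
def small_batches(records, size=2):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def resubmit(records, submit, size=2):
    """Send edited records for re-ingest one small batch at a time.

    `submit` is a hypothetical callable standing in for the real
    re-ingest step; it receives one batch (a list) per call.
    Returns the number of records submitted.
    """
    submitted = 0
    for batch in small_batches(records, size):
        submit(batch)
        submitted += len(batch)
    return submitted


# Example: collect the batches in a list instead of sending them anywhere.
sent = []
total = resubmit(["item-a", "item-b", "item-c"], sent.append)
```

Keeping the batch size as a parameter leaves room to move to larger batches if those prove workable.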

Front-end application layer on the Internet Archive content

Need for a more effective front end. There has been an expressed need for a more functional front end to the content created by OCA members that resides on the Internet Archive. The BHL has developed the BHL portal that meets the needs of the taxonomic community, but at the same time offers a number of robust digital library functionalities (e.g. a linear and random access page delivery system [PDS]). Additionally, the Library of Congress (LC) has received funding for the development of a PDS for OCA content. The BHL and LC will initiate discussion on how best to start this collaboration. Chris Freeland and Martin Kalfatovic will also meet with members of the Boston Library Consortium (BLC) in March to present on the topic of application layers on Internet Archive hosted content.
There was also a discussion of whether the Sloan Foundation could/should be approached to fund additional “next gen” PDS as a joint OCA/IA project (on top of the LC funding). Brewster wondered what it would take to create a JSTOR-like (or improved upon) interface to serial literature available via the IA/OCA.
Brewster noted that the Open Library project is primarily a metadata, title level presentation of OCA/IA material and that there were no current plans to have PDS functionality developed within the Open Library. He also noted that the Open Library has need of additional “bibliographic programmers” that would work onsite at the IA.

Data topics

PDF problems. BHL staff raised the question of problems with viewing some PDF files generated by the IA. Brewster noted that this was a known problem that was related to versioning of the Adobe Acrobat Reader. The currently generated PDF files are sometimes problematic in the current Adobe reader. This problem is not evident in other readers. Brewster assumes that the problem will be remedied in future Adobe Reader releases. Frustrating, but not really solvable at the IA end.

OCR for non-English texts. Language-specific OCR dictionaries are now being triggered by the MARC language tag in bibliographic records. Brewster confirmed that over 6,000 titles had been revisited by the OCR engine and reprocessed using the appropriate dictionary. Chris and Martin guesstimated that this number sounded about right. Martin may ask selected BHL members to spot check some known non-English titles to verify the re-processing. Brewster noted that in this process the old data was overwritten and the new files given the new creation date. A new ABBYY GZ file is created and that triggers a cascade of events that will re-derive all the related underlying files (e.g. PDF, DjVu XML, etc.).
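
For members doing the spot checks, the sketch below shows one way a language code might be read from a MARC record and mapped to an OCR dictionary. It assumes the "MARC language tag" means the three-letter code at positions 35-37 of the 008 control field; the IA may instead use another field, and the dictionary names in the mapping are hypothetical placeholders rather than actual OCR engine settings.

```python
# Hypothetical mapping from MARC language codes to OCR dictionary names.
OCR_DICTIONARIES = {
    "eng": "english",
    "fre": "french",
    "ger": "german",
    "lat": "latin",
}


def language_from_008(field_008):
    """Return the language code at 008 positions 35-37, or None."""
    if len(field_008) < 38:
        return None
    code = field_008[35:38].strip().lower()
    return code or None


def pick_dictionary(field_008, default="english"):
    """Choose an OCR dictionary name, falling back to a default."""
    return OCR_DICTIONARIES.get(language_from_008(field_008), default)


# A padded stand-in for a real 008 field with German at positions 35-37.
sample_008 = " " * 35 + "ger" + " d"
choice = pick_dictionary(sample_008)
```

A title whose record yields a non-English code but whose OCR text looks as if it was processed with the English dictionary would be a good candidate to report back to the IA.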

Washington Regional Scanning Center

Planning. Brewster and Martin briefly discussed issues related to the Washington Regional Scanning Center located at the Library of Congress. It was noted that Robert Miller and Eric O. were currently on site at the LC installing machines. A scanning center manager (Ron Peebles) had been hired, but the transportation of material from non-LC locations to the Adams Building had not yet been resolved. Martin noted that access to data about IA production was important because of the way the Smithsonian would be billed for IA scanning and the workings of Federal fiscal years.

Clarification of the Internet Archive Q&A Process

Questions related to Q&A. Marcus joined us for a discussion of how the Internet Archive Q&A process works. Marcus noted that he is the ultimate approver of materials and can reject items that have been approved at the regional level. Only once Marcus has approved a book will the pages be sent forward to accounting for billing to the appropriate library.

Completed

Structural analysis. All agreed that the reestablishment of the regularly scheduled conference calls on the Penn State project will assist in keeping the project on track.