
Functional Requirements 2



Functional Requirements Continued...

1:30pm - 2:15pm
Martin Kalfatovic, SIL - What data and metadata are required for the BHL
Sebastian Hammer, IndexData - Internet Archive workflow and database design issues

Presentations:

Martin's PPT (edited-for-size version): Martinedited.ppt

Notes:

Martin Kalfatovic handed out a list of the types of files that will be generated by, or expected from, the overall BHL project and indicated which of those files are currently supported by Internet Archive.

It was pointed out that the PDFs currently cover the entire volume, which will not work for BHL needs; unit-level PDF deliverables will be required.
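A minimal sketch of what producing unit-level PDFs from a whole-volume PDF could look like, assuming the pypdf library is available; the filenames and page ranges below are hypothetical placeholders, not actual BHL data:

from pypdf import PdfReader, PdfWriter

# Hypothetical units (articles/chapters) within one scanned volume,
# given as (start_page, end_page) using zero-based page indexes.
units = {
    "article_001": (0, 12),
    "article_002": (12, 30),
}

reader = PdfReader("volume.pdf")  # assumed whole-volume PDF from the scanning workflow
for name, (start, end) in units.items():
    writer = PdfWriter()
    for i in range(start, end):
        writer.add_page(reader.pages[i])
    with open(f"{name}.pdf", "wb") as out:
        writer.write(out)

The hard part, of course, is not the splitting itself but knowing where the unit boundaries fall, which is what the structural metadata discussion below is about.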

The embedded data and metadata might need to be editable, both for corrections and for the insertion of GUIDs.
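A minimal sketch of the kind of after-the-fact correction and GUID insertion described above; the field names are illustrative assumptions, not an actual BHL or Internet Archive metadata schema:

import uuid

# Hypothetical item-level metadata record; field names are illustrative only.
record = {
    "title": "  Example volume title ",
    "identifier_guid": None,
}

# Mint a GUID if the record does not have one yet, and correct a field in place.
if record["identifier_guid"] is None:
    record["identifier_guid"] = str(uuid.uuid4())
record["title"] = record["title"].strip()

print(record)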

Klaus discussed the new JPEG 2000 options, lossless vs. visually lossless, and the problems with de-skewing and lightening of documents. All of this is post-processing applied before the images are delivered to the browser.
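A minimal sketch of the de-skew step and the lossless vs. visually lossless JPEG 2000 choice, assuming Pillow built with OpenJPEG support; the filenames, skew angle, and quality target are placeholder assumptions, not measured values from the scanning workflow:

from PIL import Image

img = Image.open("page_0001.tif")  # hypothetical raw page scan

# De-skew: rotate by a measured skew angle (placeholder value here).
skew_degrees = 0.8
deskewed = img.rotate(skew_degrees, expand=True, fillcolor="white")

# Truly lossless JPEG 2000 uses the reversible wavelet transform;
# "visually lossless" uses the lossy transform with a high quality target.
deskewed.save("page_0001_lossless.jp2", irreversible=False)
deskewed.save("page_0001_visual.jp2", irreversible=True,
              quality_mode="dB", quality_layers=[42])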

The OCR text structure needs to be organized in levels; it can then be used for taxonomic name lists and citation resolving. Other parts of the text can also be identified and tagged, such as author names and geographic locations.
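A minimal sketch of pulling candidate taxonomic names out of leveled page-level XML; the XML structure and the regular expression are illustrative assumptions only (real name finding would rely on taxonomically intelligent tools such as the uBio services mentioned later, not a regex):

import re
import xml.etree.ElementTree as ET

# Hypothetical page-level OCR XML with nested structural levels (page > line).
page_xml = """
<page n="14">
  <line>Specimens of Rosa canina L. occur throughout Europe.</line>
  <line>See Smith (1882), Trans. Linn. Soc. Lond., for earlier records.</line>
</page>
"""

# Very rough pattern for candidate binomials: capitalized genus + lowercase epithet.
binomial = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

root = ET.fromstring(page_xml)
for line in root.iter("line"):
    for name in binomial.findall(line.text or ""):
        print(f'page {root.get("n")}: candidate name "{name}"')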

The current Internet Archive model delivers METS, collects and stores MARC, almost has unit-level structure, and does have page-level XML. The cataloging module of Internet Archive was developed by IndexData. Sebastian Hammer showed Z39.50 fetching at the title level. The critical missing piece is support for the serials information requirements.
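A minimal sketch of what a title-level Z39.50 fetch like the one demonstrated might look like, using the older PyZ3950 library; the host, port, database, and title are placeholders, not the actual Internet Archive or IndexData configuration:

from PyZ3950 import zoom

# Placeholder Z39.50 target; the real host, port, and database are assumptions.
conn = zoom.Connection('z3950.example.org', 210)
conn.databaseName = 'catalog'
conn.preferredRecordSyntax = 'USMARC'

# Title-level search expressed as a CCL query against the title index.
query = zoom.Query('CCL', 'ti="Transactions of the Linnean Society"')
results = conn.search(query)
for record in results:
    print(str(record))  # raw MARC record for the matching title
conn.close()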

Brewster Kahle agreed that the Internet Archive needs to move toward the article-level and author-level granularity needed for serials, and that this needs to be a high priority.

Currently, Internet Archive scans and collects some data "board to board" (front cover to back cover of a unit). CCS can work on breaking units down to the article level.

The gap between what we have, what we want, and what we need to do first was discussed:
1) Internet Archive needs to figure out a serials workflow.
2) Taxonomically intelligent tools built on uBio data need to reside somewhere.
3) Citation resolving needs to be addressed.



Action Items: