Functional Requirements
Functional requirements for the BHL Metadata Repository (MR)
11:00-12:30
Chris Freeland, MoBot - Proposed functionality of BHL MR [user interface]
Mark McKinney, Luratech/CCS
- Daniel Lanz, CCS - Creating PDF with OCR and XML struct maps including page # recognition
- Klaus Jung, Luratech - JPEG2000, Luratech compression
David Remsen, uBio/MBL/WHOI - demo of embedded taxonomic intelligence
Presentations:
Chris' ppt and more Chris, CCS_2006.pps (Daniel's ppt), Klaus' ppt, David's ppt
Notes:
Please correct any mistakes on this page, or add to the notes if you find your brilliant insight/question/response is not recorded!
User Interface
Chris Freeland (MoBot) has worked with scientists and other users on how they use information on the web. With this in mind, he began to develop a prototype interface for literature. He demonstrated a possible interface for the BHL, with the following functionality:
- zoom/pan page image
- extract page image
- bookmark (in the app, not in your browser)
- toggle/overlay OCR text and image
- wiki-like annotation/editing for both format & semantics - "multi-valent collaboration"
- "discover names" (to enable this, the OCR needs to have coordinates that relate to actual page-image of names)
- citation linking (sigh)
Other desiderata: relevance ranking of some sort for name search results; some sort of visual cue or color coding for context, e.g., whether the name found is in an illustration, a citation, or a description.
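The "discover names" requirement depends on each OCR word carrying its position on the page image. A minimal sketch of that idea, assuming the OCR is delivered as word records with pixel bounding boxes; the `Word` type, the sample coordinates, and the `find_name_boxes` helper are hypothetical and are not taken from Chris's prototype:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word with its bounding box on the page image (pixels)."""
    text: str
    x: int
    y: int
    w: int
    h: int


def find_name_boxes(words: List[Word], name: str) -> List[Tuple[int, int, int, int]]:
    """Return one (x, y, w, h) box per occurrence of a (multi-word) name."""
    target = [t.lower() for t in name.split()]
    if not target:
        return []
    n = len(target)
    boxes = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Compare case-insensitively, ignoring punctuation around each word.
        if [w.text.lower().strip(".,;:()") for w in window] == target:
            left = min(w.x for w in window)
            top = min(w.y for w in window)
            right = max(w.x + w.w for w in window)
            bottom = max(w.y + w.h for w in window)
            boxes.append((left, top, right - left, bottom - top))
    return boxes


# Made-up OCR output for a single line of text.
page = [
    Word("The", 100, 200, 60, 30),
    Word("genus", 170, 200, 90, 30),
    Word("Quercus", 270, 198, 130, 34),
    Word("alba,", 410, 200, 80, 30),
    Word("described", 500, 200, 150, 30),
]
boxes = find_name_boxes(page, "Quercus alba")
```

Each returned box could then be drawn as a highlight over the page image at the current zoom level.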
Check out Botanicus to see the first few functions in action.
Chris used Ajax, among other things, for the Google Maps-like navigation (zoom, pan, etc.).
The OCR and the image are kept as two separate files, so the OCR can be edited (semantic markup, format, corrections, etc.). He has some experience with the distributed proofreading model.
CCS/Luratech
Mark McKinney (Luratech) explained that Luratech works on tools for JPEG2000 and CCS works on content conversion and structure maps. The two companies have begun to collaborate, combining their strengths to produce some interesting results.
Daniel Lanz (CCS) gave an overview of docWorks software which can
- create a METS/ALTO XML digital object
- METS describes the entire digital object
- create automatic structure for object (e.g. pdf "bookmarks")
- the automatic structure is only an initial aid and needs human QA, e.g. the Danish newspaper project
- includes OCR within the pdf so object is searchable, and can view the OCR
- can use or incorporate external metadata, e.g. MARC, at either the "input" stage or near the "output" stage of object creation
- can do page sequencing using OCR of page #s, so you know if pages are out of order or missing (!!), and can fill in 'logical' page numbers where they aren't printed on the page
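The page-sequencing idea in the last bullet can be illustrated with a short sketch. This is not docWorks' algorithm, only a simple sequential check under the assumption that OCR yields either an integer page number or nothing for each scan; the function name `check_page_sequence` is hypothetical:

```python
from typing import List, Optional, Tuple


def check_page_sequence(
    printed: List[Optional[int]],
) -> Tuple[List[Optional[int]], List[str]]:
    """Fill in logical page numbers and report gaps or misordered scans.

    `printed` holds the page number recognized on each scan, in scan
    order, or None where no number was printed or recognized.
    """
    logical: List[Optional[int]] = []
    issues: List[str] = []
    previous: Optional[int] = None
    for scan_index, number in enumerate(printed):
        expected = None if previous is None else previous + 1
        if number is None:
            # No printed number: infer it from the preceding page, if any.
            number = expected
        elif expected is not None and number > expected:
            issues.append(
                f"scan {scan_index}: pages {expected}-{number - 1} appear to be missing"
            )
        elif expected is not None and number < expected:
            issues.append(
                f"scan {scan_index}: page {number} is out of order (expected {expected})"
            )
        logical.append(number)
        if number is not None:
            previous = number
    return logical, issues


logical, issues = check_page_sequence([None, 1, 2, None, 4, 7, 6])
```

Leading scans with no number stay unnumbered because nothing precedes them to count from. A production tool would also need to handle roman numerals, plates, and OCR misreads of the numbers themselves.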
Klaus Jung (Luratech) gave an overview of Luratech's PDF compression ability and use of JPEG2000 (Part 6)
- compression uses a "mixed raster" scheme - file is made up of layers, each layer does one thing very well (e.g., b&w text vs. color)
- work now is mostly on PDF/A (archival PDF, an ISO standard) and on developing the Image Content Server, which will do on-the-fly conversion from JPEG2000 to JPEG and deliver to the web
- also has cute page-turner with zoom/rotate
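The layered "mixed raster" idea can be shown with a toy example. This is only a sketch of the general three-layer model (a binary mask choosing between a foreground and a background layer), not Luratech's implementation. It assumes all layers have the same dimensions, whereas real mixed raster files typically store each layer at its own resolution and with its own codec; `compose_layers` is a hypothetical name:

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # RGB


def compose_layers(
    mask: List[List[int]],
    foreground: List[List[Pixel]],
    background: List[List[Pixel]],
) -> List[List[Pixel]]:
    """Rebuild the page: where the mask is 1 take the foreground pixel
    (e.g. text ink), otherwise take the background pixel (e.g. paper)."""
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


BLACK: Pixel = (0, 0, 0)
CREAM: Pixel = (250, 240, 220)

mask = [[0, 1, 0],
        [1, 1, 0]]
foreground = [[BLACK] * 3 for _ in range(2)]
background = [[CREAM] * 3 for _ in range(2)]
page = compose_layers(mask, foreground, background)
```

Because the mask is bi-level and the color layers are smooth, each can be compressed with a method suited to it, which is the point made in the notes above.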
Taxonomic Intelligence - "Names are what puts the 'B' in BHL"
David Remsen (Woods Hole) gave a demonstration of FindIt using an SIL title that had been scanned previously by Internet Archive (n.b. pretty dirty OCR), and made a case for name-level searching/tracing being an integral part of what the BHL should do.
The name is the metadata in biology, but there are some problems: lexical variation, taxon changes (1% a year), spelling errors, and reconciling common and Latin names.
uBio created a taxonomic name recognition algorithm that stores recognized 'names' in NameBank, including misspellings, synonyms, and vernacular names. NameBank is linked to ClassificationBank, which includes taxonomic hierarchies and synonyms. The tool is trainable. The current names db has more than 8 million names.
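To make the idea of a name store that keeps misspellings, synonyms, and vernacular names concrete, here is a minimal sketch. It is not uBio's algorithm or the NameBank schema: the `NAME_VARIANTS` table, its identifiers, and the `resolve_name` helper are invented for illustration, and the fuzzy match simply uses Python's standard `difflib`:

```python
import difflib
from typing import Dict, Optional

# Hypothetical variant table: every known string form points at one name id.
NAME_VARIANTS: Dict[str, str] = {
    "quercus alba": "name:1",
    "quercus alba l.": "name:1",  # with author abbreviation
    "white oak": "name:1",        # vernacular name
    "acer rubrum": "name:2",
    "red maple": "name:2",        # vernacular name
}


def resolve_name(text: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a string found in OCR text to a name id, tolerating small errors."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    if key in NAME_VARIANTS:
        return NAME_VARIANTS[key]
    # Fall back to the closest known variant, if it is similar enough.
    close = difflib.get_close_matches(key, list(NAME_VARIANTS), n=1, cutoff=cutoff)
    return NAME_VARIANTS[close[0]] if close else None


exact = resolve_name("White  Oak")
fuzzy = resolve_name("Quercus albo")   # OCR-style misspelling
unknown = resolve_name("Homo sapiens")
```

A store that learns would add a confirmed misspelling such as "quercus albo" to the table, so the next lookup becomes an exact hit.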
If we do taxonomic discovery of names in parallel with scanning for BHL, it will iteratively help the [BHL] OCR. It will also grow the NameBank which can help drive taxonomic initiatives elsewhere (e.g., GBIF & Species2000).
LinkIt demo using the SIL title (click on uBio LinkIt under the uncorrected OCR text): on-the-fly recognition of names, which puts synonyms and misspellings in an index and cross-indexes with other taxon lists (ITIS, Species2000). If you go to uBio and upload text, you can browse your scanned text by class or by an alphabetical index of all names.
Discussion:
Q: Where are the name indexes in relation to the texts? They are outside the text. You want to go back to the index each time you view your document: because NameBank works iteratively, the taxonomic intelligence keeps improving as the bank grows.
Tom G. agreed that we must have this [taxon. intell.] component early on in the BHL development so that the BHL has immediate value to the taxon. community.
Q: Does the NameBank use GUIDs? Yes, it generates its own GUIDs, since it's gathering irregular (misspelled, vernacular) names. Eventually all the "good" names can get an additional proper taxonomic id, and then all the different GUIDs (ITIS, GBIF, etc.) could be mapped -- that's how LinkIt works.
Names would also need our BHL GUID for page level linking out to other resources (GBIF, etc.)
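The GUID mapping discussed here amounts to a crosswalk between identifier systems. A minimal sketch, assuming each name record simply lists whatever identifiers the different systems have assigned; every identifier below is made up, and `build_index` and `translate` are hypothetical helpers, not LinkIt's interface:

```python
from typing import Dict, List, Optional, Tuple

# Each record groups the identifiers that different systems assign to one name.
CROSSWALK: List[Dict[str, str]] = [
    {"namebank": "nb-0001", "itis": "itis-1111", "gbif": "gbif-2222"},
    {"namebank": "nb-0002"},  # e.g. a misspelling: no external ids yet
]


def build_index(records: List[Dict[str, str]]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Index every (system, identifier) pair to its full record."""
    index: Dict[Tuple[str, str], Dict[str, str]] = {}
    for record in records:
        for system, identifier in record.items():
            index[(system, identifier)] = record
    return index


def translate(
    index: Dict[Tuple[str, str], Dict[str, str]],
    source: str,
    identifier: str,
    target: str,
) -> Optional[str]:
    """Return the target system's id for the same name, or None if unmapped."""
    record = index.get((source, identifier))
    return None if record is None else record.get(target)


INDEX = build_index(CROSSWALK)
gbif_id = translate(INDEX, "itis", "itis-1111", "gbif")
unmapped = translate(INDEX, "namebank", "nb-0002", "itis")
```

A BHL page-level identifier could be added to each record as one more key, which is how page-level linking out to other resources would fit the same structure.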
Decisions/Action Items: