This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

BIRDS OF A FEATHER OCR NOTES September 23, 2010

"Birds of a Feather" OCR/Text Enhancement/Articleization Breakout

OCR Text correction
ALA focused on text correction sooner rather than later. Want to have everything going on simultaneously. Volunteer portal for ALA. Ely will provide instructions to those running the volunteer portal for investigation and planning.

1. OCR
Henning: IMPACT: BHL-E discussed giving them text files for correction. On hold but can contact again to see if test files can be supplied. BHL is test bed (have contact) but do not have resources available to clean up a test set and develop algorithms.

Involve Noha and BHL-China. Bibliotheca Alexandrina has software learning process for OCR and it is part of the workflow. It is somewhat manual because each script type must go through iteration before automatic OCR can be used.

Have one solution for post-1850s material, but for the earlier materials and manuscript pages use the Bibliotheca Alexandrina technique (feed a sample and have the software learn).

Books with mixed fonts--think about Australian newspaper project. Also, italicizing of taxonomic names causes some problems with OCR.

Need some advice--will IMPACT give it? No deliverables on website.

Graham: Should we think about tagging each text issue in the metadata as it goes in, so that later we can run OCR again in a more specialized way? Date is one marker, but there may be other issues.
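
A minimal sketch of what Graham's suggestion could look like: each page record gets issue tags at ingest, and a later specialized OCR pass selects pages by tag. The tag names, the `PageRecord` fields, and the 1850 cutoff are illustrative assumptions, not an existing BHL schema.

```python
from dataclasses import dataclass, field

# Hypothetical issue tags; a real vocabulary would need to be agreed on.
EARLY_PRINT = "early-print"        # assumed cutoff: published before 1850
MIXED_LANGUAGE = "mixed-language"
CJK_SCRIPT = "cjk-script"


@dataclass
class PageRecord:
    page_id: str
    year: int
    languages: list[str]
    issues: set[str] = field(default_factory=set)


def tag_issues(page: PageRecord) -> PageRecord:
    """Attach issue tags at ingest so a later OCR pass can be targeted."""
    if page.year < 1850:
        page.issues.add(EARLY_PRINT)
    if len(page.languages) > 1:
        page.issues.add(MIXED_LANGUAGE)
    if any(lang in {"zh", "ja", "ko"} for lang in page.languages):
        page.issues.add(CJK_SCRIPT)
    return page


def pages_needing(pages: list[PageRecord], issue: str) -> list[str]:
    """Return the ids of pages tagged with the given issue."""
    return [p.page_id for p in pages if issue in p.issues]
```

With tags stored this way, date is just one marker among several, and new issue types can be added without touching the pages already ingested.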

The scanning process through IA is a single pass through OCR. China is not running material through the ABBYY CJK module.

ACTION: Write a clear problem statement and frame it around biodiversity text (scientific names, scripts): Graham noted that we need only 100 pages because our set is too big. Look for variety in language and publication date, make sure species names are included, and use taxonomic works such as Index Animalium, plates, plate explanations. Check with group for really nasty items. Check with Heimo and Francisco for difficult works.

ACTION: Identify 100 test pages and test for correction (see above)
ACTION: Influence Brewster/Robert to incorporate CJK modules (Chris)
ACTION: Henning to contact IMPACT
ACTION: Ask Bob Corrigan to talk to Google (Graham)
ACTION: Talk to Noha about adapting BA system (Chris)
ACTION: Talk with Australian National Library about their process (Ely)
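
One possible way to pick the 100 test pages for variety in language and publication date is a round-robin draw across (language, century) groups. This is only a sketch: the `id`, `language`, and `year` keys are assumed, and it does not check for species names, plates, or "really nasty" items, which would still need manual selection.

```python
import random
from collections import defaultdict


def pick_test_pages(pages, target=100, seed=0):
    """Pick up to `target` pages, cycling through (language, century) groups.

    `pages` is a list of dicts with hypothetical keys 'id', 'language', 'year'.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for page in pages:
        groups[(page["language"], page["year"] // 100)].append(page)
    for members in groups.values():
        rng.shuffle(members)

    chosen = []
    # Round-robin over groups so no single language or period dominates.
    while len(chosen) < target and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(chosen) < target:
                chosen.append(groups[key].pop())
    return chosen
```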

Can take 3 hours a book (Prime OCR with "voting" engine) to run OCR on one computer. Prime is licensed by the processor and is expensive.

Could run open source engines in parallel.

ABBYY handles blackletter typeface. One problem we have is that languages are often mixed in one publication. (MOBOT uses Prime).

Evernote has better OCR than ABBYY but is a private company. It worked on old German text, too. They had an interest, but we have too much material.

Other OCR engines: Aqua..., OCRopus, Tesseract

Could do the OCR ourselves. Run 3 or 4 different OCR engines in parallel and combine their outputs with stochastic evaluation to get higher accuracy. There is some manual review involved. There are complex techniques, but we have no resources and no skills.
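
To make the idea of combining several engines concrete, here is a rough sketch using plain per-token majority voting (a simpler stand-in for the stochastic evaluation mentioned above). It assumes the engines' outputs for a line are already aligned token by token; real output would need an alignment step first. Positions without a strict majority are flagged for the manual review step.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per token across OCR outputs of the same line.

    Assumes every engine returned the same number of whitespace-separated
    tokens. Returns (text, flagged), where `flagged` lists token positions
    with no strict majority, i.e. candidates for manual review.
    """
    token_lists = [o.split() for o in outputs]
    if len({len(t) for t in token_lists}) != 1:
        raise ValueError("outputs are not token-aligned")

    result, flagged = [], []
    for i, candidates in enumerate(zip(*token_lists)):
        token, count = Counter(candidates).most_common(1)[0]
        result.append(token)
        if count * 2 <= len(candidates):
            flagged.append(i)
    return " ".join(result), flagged
```

For example, `vote(["Quercus alba L.", "Quercus a1ba L.", "Qucrcus alba L."])` recovers the correct name because each engine errs on a different token.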

Have expertise within our ranks but need a research project to do an evaluation and then need applications built around the need.

Google would really like this. But hard to get them interested in this project. Are there funding opportunities?

Improving OCR is in a BHL-E work package, but there are no resources for research into the problem.

Google and JISC should be interested, but no one seems to care. Some open source solutions exist.

2. Techniques to enhance text once available.

Text correction, ongoing cleanup and markup. Volunteer correction--does it overwrite or add to? If text is corrected it has to be marked so that it doesn't get run through ocr again.

DjVu text correction environment: Wikisource. We can drop books into this environment. We should prioritize and be deliberate about what we put in this environment, but it should be plugged into the normal tool set. How do we keep track of what has been done? In the Australia project, you can see what has been corrected and what hasn't. Corrected text could be marked (EOL). Henning asked how text correction would feed into name finding. Chris noted that any time a text is corrected, the text would be reanalyzed for names.
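
A small sketch of the two rules discussed here: corrected text is marked so it is not run through OCR again, and every correction triggers name finding on the new text. The `Page` fields and the `find_names` and `ocr` callables are hypothetical placeholders, not existing BHL or IA interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Page:
    page_id: str
    text: str
    corrected: bool = False          # set once a human has edited the text
    names: list[str] = field(default_factory=list)


def apply_correction(page: Page, new_text: str,
                     find_names: Callable[[str], list[str]]) -> None:
    """Store corrected text, mark the page, and re-run name finding."""
    page.text = new_text
    page.corrected = True
    page.names = find_names(new_text)


def reocr(page: Page, ocr: Callable[[str], str]) -> bool:
    """Re-run OCR unless the page was manually corrected. Returns True if run."""
    if page.corrected:
        return False
    page.text = ocr(page.page_id)
    return True
```

Keeping the flag on the page record also answers the tracking question: a query for `corrected` pages shows what has been done and what has not.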

If we can use something like Wikisource: drop BHL books into a local version of Wikisource and have a button for "correct this text". Process the text and bring it back into IA for future use.

ACTION: Develop a prototype/test for "correct this text" feature. Investigate Wikisource, the Australian solution, and Chinese solutions (Chris et al, Ely) (limited development resources).

ACTION: Henning will investigate potential resources from BHL Europe to work on this problem. (Speak to Adrian and then Wolfgang).

Australian text correction is locally built using Google Code.

Investigate metadata filters "australia" "deserts" etc.
Have users add metadata tags.

Are we going to keep all the regional enhancements in sync? Tag wars!
Try looking at what people look for and can't find and then add tags for those items (Australia). Build tag-based system that sits on top (someone else can do it). Handles would make it very easy.
Sharing tags generates a large load.

3. Markup
GoldenGATE, etc. Encode text? Mark names? Graham noted that we are not resourced to do this, so let others do it. Open our data to tools that already do this.

Encourage others to come to BHL and use the data.

4. Articleization

We have the facility/capability for PDF.

ACTION: Make articleization feature easier to use.
ACTION: Make articleization feature more visible.

Not enough metadata. Need to gather data.
User behavior. Australia plant names index discussion. Don't try to improve the data.

Patrick Leary has written Nomenfinder looking for nov sp etc. to identify where new species, genera etc. are.
Tropicos tracks new species and then uses an OpenURL resolver to link in by citation.

ACTION: Promote use of OpenURL resolver to other communities.
ACTION: Revise landing page to have standard article-database search boxes and use OpenURL to deliver results. Then add a voting engine ("no, not what I was looking for" or "yes, this is what I was looking for") and store the votes in CiteBank.
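
As an illustration of linking in by citation, the sketch below turns the fields from a search box into an OpenURL 0.1-style query string. The resolver address is a placeholder, not a real endpoint, and the choice of fields is an assumption about what the landing page would collect.

```python
from urllib.parse import urlencode

# Placeholder resolver address, not a real endpoint.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def citation_to_openurl(citation, base=RESOLVER_BASE):
    """Build an OpenURL 0.1-style query from a citation dict.

    Only non-empty fields are sent; keys follow the OpenURL 0.1 names
    (genre, title, atitle, aulast, volume, issue, spage, date).
    """
    allowed = ("genre", "title", "atitle", "aulast",
               "volume", "issue", "spage", "date")
    params = [(k, str(citation[k])) for k in allowed if citation.get(k)]
    return f"{base}?{urlencode(params)}"
```

The yes/no vote from the landing page could then be stored alongside the generated query, so that citations which repeatedly fail to resolve are easy to find.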