IMPACTNov2010

Participants

BHL/BHLE: Chris Freeland, Henning Scholz
IMPACT: Lieke Ploeger (Manager IMPACT Project Office), Gunter Muhlberger (Sub Project leader Text Recognition), Clemens Neudecker (Interoperability Manager), Stefan Pletschacher (Evaluation Tools)

More information: http://www.impact-project.eu/

IMPACT Pilot-Release materials:
http://www.impact-project.eu/index.php?id=138

IMPACT Best Practice Guides:
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s1.pdf
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s2.pdf
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s3.pdf

Time
Nov 8, 2010
16:00 CET

Agenda

Introduction of both projects
Possible testing of IMPACT tools / BHL as a test case for IMPACT tools
- Specifically the tools for ground truthing
Possible use of BHL OCR test set
- https://spreadsheets.google.com/ccc?key=0AiGG0YxTuAn2dGN4R0ZyTTNTZUhfUlU1cHJ3T2JGdUE&hl=en
GUID/DOI

Meeting Notes (Chris & Henning) | Meeting Notes from Lieke: IMPACT_BHL Europe_telecon_8112010.doc

BHL-EU
· One main interest in extending BHL portal, make it multilingual, improve OCR technology/infrastructure
· Using OCR to search & extract scientific names
· Need high quality OCR
· After the IMPACT meeting in The Hague, Henning thought those tools were needed for BHL; want to collaborate with IMPACT
· BHL don’t have resources/capabilities to work on those tools
Gunter – overview of IMPACT
· Main objective: improve OCR
o Start from existing tools like ABBYY
o Image segmentation, enhancement, analysis
· Focus on collaboration
o Tools are developed so that you can outsource text correction to user groups & apply specific post-correction based on language models
o Some language partners have issues with named-entity extraction
BHL as test set for IMPACT tools
· BHL has 85,000 volumes to use as test set
· BHL try out ground truthing tools
o “ground truth” is the ideal outcome of some process
§ for OCR, it would be perfectly correct text
o segmentation
§ separating text from images
§ separating paragraph 1 from paragraph 2
o evaluation
§ need “ideal” result of some method
§ when we want to evaluate tools, we need ground truth sets
· hundreds or thousands of pages
§ have developed tools to aid ground truthing
§ complete infrastructure for doing this work
§ now producing large amounts of ground truth for documents of their interest
· running texts against their test
§ Our images & OCR
· Our existing text evaluation from Qin
§ Working with service providers in India & other places to farm out work
· For our evaluation set, we need to ground truth
o OCR is flat text file
o Our test set can be run against IMPACT tools
o Then can go to evaluation tools
o Some money available within the project to run 500-1000 pages through the IMPACT ground truth workflow
§ IMPACT will provide 75% of the costs, 25% come from BHL/BHLE
§ Process the dataset from BHL
§ Treat it like the other datasets coming into IMPACT
· Current plan
o Project funding runs out next year
o IMPACT converts to sustainable center of competence
o Organizations become a partner, membership models
§ Low entry barrier
§ Premium model
· Low-level access to basic tools
· Can license tools for use in other products
o BHL plan:
§ First find out if the tools help
§ We can determine if we would want to become members, use tools
· Room for negotiation
§ Would have to determine which tools to use
§ IBM is developing a crowdsourcing tool, offering it as a web service
· Could run it as a plug-in
· “Something in this direction will come”
§ IMPACT is developing a service for named-entity tagging
· Person names
· Geographical names
· We could share our scientific names with them
§ Sustainability comes from center of competence & memberships
o Way forward
§ IMPACT treat us like a project in their consortium
§ We provide files
§ Files go through tools
§ Our data live alongside other consortium data
§ Output made available for content review & evaluation
· They can make tools available for us
§ Real effort of process is production of ground truth in the first place
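The evaluation step noted above (comparing OCR output against a ground truth text) can be illustrated with a small sketch. It assumes character error rate as the metric, computed from the Levenshtein edit distance; the notes do not say which metric the IMPACT evaluation tools actually use, and the function names and sample strings here are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical example: a scientific name with two misread characters.
    truth = "Quercus robur"
    ocr = "Qnercus rohur"
    print(f"CER: {character_error_rate(truth, ocr):.3f}")
```

A score of 0 means the OCR text matches the ground truth exactly; higher values mean more characters had to be corrected. Scoring is the cheap part, which is consistent with the point above that the real effort lies in producing the ground truth in the first place.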

Next Steps

· Requirements – they specify
o IBM will need 100-200 pages of same book for crowdsourcing tool
o Need compound objects like
Mixed dataset: plates, text, etc.; complex documents (different languages, font types [except Chinese])
· We provide the files
· They make an estimate of what it will cost us
· Data available in Spring 2011
o Lots of new datasets coming in
· Meet at a workshop in Spring to determine way forward
o London & Frankfurt are options
o IMPACT have one in April
o BHL-EU have one in June