BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

IMPACTNov2010

Participants

BHL/BHLE: Chris Freeland, Henning Scholz
IMPACT: Lieke Ploeger (Manager IMPACT Project Office), Gunter Muhlberger (Sub Project leader Text Recognition), Clemens Neudecker (Interoperability Manager), Stefan Pletschacher (Evaluation Tools)

More information: http://www.impact-project.eu/

IMPACT Pilot-Release materials:
http://www.impact-project.eu/index.php?id=138

IMPACT Best Practice Guides:
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s1.pdf
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s2.pdf
http://www.impact-project.eu/uploads/media/IMPACT-ocr-bpg-pilot-s3.pdf


Time
Nov 8, 2010
16:00 CET

Agenda


Meeting Notes (Chris & Henning) | Meeting Notes from Lieke: IMPACT_BHL Europe_telecon_8112010.doc

BHL-EU
· Main interests: extending the BHL portal, making it multilingual, improving OCR technology/infrastructure
· Using OCR to search & extract scientific names
· Need high quality OCR
· After the IMPACT meeting in The Hague, Henning thought those tools were needed for BHL; want to collaborate with IMPACT
· BHL does not have the resources/capabilities to work on those tools
Gunter – overview of IMPACT
· Main objective: improve OCR
o Start from existing tools like ABBYY
o Image segmentation, enhancement, analysis
· Focus on collaboration
o Tools are developed so that text correction can be outsourced to user groups and to specific post-correction based on language models
o Some language partners have issues with named-entity extraction
BHL as test set for IMPACT tools
· BHL has 85,000 volumes to use as test set
· BHL to try out ground-truthing tools
o “ground truth” is the ideal outcome of a given process
§ for OCR, it would be perfectly correct text
o segmentation
§ text from images
§ paragraph 1 from 2
o evaluation
§ need “ideal” result of some method
§ when we want to evaluate tools, we need ground truth sets
· hundreds or thousands of pages
§ have developed tools to aid ground truthing
§ complete infrastructure for doing this work
§ now producing large amounts of ground truth for documents of their interest
· running texts against their test
§ Our images & OCR
· Our existing text evaluation from Qin
§ Working with service providers in India & other places to farm out work
· For our evaluation set, we need to ground truth
o OCR is flat text file
o Our test set can be run against IMPACT tools
o Then can go to evaluation tools
o Some money available within the project for 500-1000 pages to be run through the IMPACT ground truth workflow
§ IMPACT will provide 75% of the costs, 25% come from BHL/BHLE
§ Process the dataset from BHL
§ Treat it like the other datasets coming into IMPACT
· Current plan
o Project funding runs out next year
o IMPACT converts to sustainable center of competence
o Organizations become a partner, membership models
§ Low entry barrier
§ Premium model
· Low-level access to basic tools
· Can license tools for use in other products
o BHL plan:
§ First find out if the tools help
§ We can determine if we would want to become members, use tools
· Room for negotiation
§ Would have to determine which tools to use
§ IBM is developing a crowdsourcing tool, offering it as a web service
· Could run it as a plug-in
· “Something in this direction will come”
§ IMPACT is developing a service for named-entity tagging
· Person names
· Geographical names
· We could share our scientific names with them
§ Sustainability comes from center of competence & memberships
o Way forward
§ IMPACT treats us like a project in their consortium
§ We provide files
§ Files go through tools
§ Our data live alongside other consortium data
§ Output made available for content review & evaluation
· They can make tools available for us
§ The real effort in the process is producing the ground truth in the first place
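
Illustration (not from the call): the evaluation idea in the notes, comparing OCR output against a "perfectly correct" ground truth text, can be sketched as a character error rate. This is only a minimal sketch under stated assumptions: it is not IMPACT's evaluation tooling, the function names are hypothetical, the sample strings are made up, and it assumes both inputs are flat text, as our OCR files are.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    # Made-up example strings: two characters misread in a 13-character name.
    truth = "Quercus robur"
    ocr = "Qucrcus rohur"
    print(f"CER: {character_error_rate(ocr, truth):.3f}")
```

A lower rate means the OCR text is closer to the ground truth. A per-page score like this would only be meaningful once ground truth exists for the same pages, which is why the notes stress that producing it is the main effort.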

Next Steps

· Requirements – they specify
o IBM will need 100-200 pages of the same book for crowdsourcing tool
o Need compound objects, such as a mixed dataset: plates, text, etc.; complex documents (different languages, font types [except Chinese])
· We provide the files
· They make a cost estimate of what it will cost us
· Data available in Spring 2011
o Lots of new datasets coming in
· Meet at a workshop in Spring to determine way forward
o London & Frankfurt are an option
o IMPACT have one in April
o BHL-EU have one in June