This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

Nov 18 2014

Tech review meeting 11/17/14
Present: William, Trish, Mike
BHL
Contributor browse – changes have been pushed to production
Dima – NetiNeti turned off. Process is running smoothly now.
BHL Egypt report – distributed
Uniform titles – displaying in the MODS but not in summary or details. Why? Mike says we wanted to include everything we could in the MODS since it’s for exchange, but no one had necessarily requested it in the details. Bianca says it’s being indexed, which is how she found it. She would like to see it displayed, and it could go in details. Bianca will add a request to Gemini and specify where it should appear in details.

Art of Life
Flickr content from IA – need to decide on what metadata fields we want added by tomorrow. We’ll discuss at 11am Tuesday
Processing stopped at IA server last week - can’t connect to vm at IA. Jake is supposed to combine accounts.
Cost share report – William is working on it. William will send examples to Trish of how this was done for global BHL. E.g. PG and Chuck’s tasks – monthly meetings.
NEH Interim report due end of Nov - Trish will do
Zooniverse/ConSciCom – Trish will work on next tasks
Monday meeting with British Library - Need to run Query on accuracy ratings of algorithms – Mike will do

Purposeful Gaming
Architecture for Managing multiple OCR versions – Mike W hopes to get servers set up this week but may be next
What set of published documents do we want to use for PG? Our criteria will be: dates of publication, language, and taxa (William will investigate taxa some more to determine if this is feasible). We won’t filter on the version of the OCR engine, since the parameters that are set could vary widely.

Publication date
Pre-1800 publications do not do well with OCR, so we can eliminate those.
We will focus on 1800s-present day. We do not need to consider copyright status because the OCR is an interpretation of the publication, not the publication itself.
1800-1850
1850-1900
1900-1950
1950-present day
Language –
English only? We don’t have dictionaries as comprehensive for other languages, but we should also include German, French, Spanish, and Italian.
Taxa filter -
William says botany and zoology publications are structured differently, as are publications geared towards the general public vs. scientists. More scientific publications would present different challenges for OCR. How would we filter on taxa? Not so much on specific scientific names but on general subjects – we would have to rely on LCSH. William will consider this factor more and determine if it’s important.
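The date and language criteria above could be applied with something like the following. This is only a rough sketch, not an agreed implementation: the `Item` record and its field names are hypothetical, the first date range is assumed to be 1800-1850, boundary years are assumed to fall in the later range, and the taxa filter is left out because it is still undecided.

```python
from dataclasses import dataclass

# Date ranges from the notes. Assumption: the first range is 1800-1850, and a
# boundary year such as 1850 belongs to the later range (start <= year < end).
DATE_BUCKETS = [(1800, 1850), (1850, 1900), (1900, 1950), (1950, None)]

# Languages named in the notes as candidates.
LANGUAGES = {"English", "German", "French", "Spanish", "Italian"}


@dataclass
class Item:
    """Hypothetical record for one published document."""
    title: str
    year: int
    language: str


def date_bucket(year):
    """Return the label of the date range containing year, or None if pre-1800."""
    for start, end in DATE_BUCKETS:
        if year >= start and (end is None or year < end):
            return f"{start}-{end}" if end is not None else f"{start}-present"
    return None


def select_candidates(items):
    """Group items by (date range, language), dropping anything out of scope."""
    groups = {}
    for item in items:
        bucket = date_bucket(item.year)
        if bucket is None or item.language not in LANGUAGES:
            continue  # pre-1800 or a language we are not covering
        groups.setdefault((bucket, item.language), []).append(item)
    return groups
```

Grouping by range and language, rather than just filtering, would let us draw a comparable number of sample documents from each combination.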
Mike B. is almost done with his scanning – what can we push to him to do OCR on? We could have him prioritize the gold standard set that Europe did for IMPACT, though this set might be a challenge if it has lots of OCR problems.
Could we have Mike B. work on material that NYBG has uploaded to BHL for the project? Yes – Mike L. could generate a list from the Seed & Nursery catalogs. We need to know how easily he could download the JPEG 2000 (JP2) images from IA and then convert them for Tesseract. There may also be more JP2s than get displayed in IA. Mike could give him a list of URLs to download images, but this would be a page-by-page process and very slow.
We have the same issue with downloading images for ConSciCom – we need a script to automate this. Perhaps Mike L.’s script could be used for both. Maybe we don’t need to involve Mike B. in running Tesseract: it is probably more efficient to have Mike L. write a script that downloads each image from IA, processes it in Tesseract, and stores the result in MongoDB. We could then redirect Mike B. to other tasks on PG.
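The download, OCR, and store steps could be outlined roughly as below. This is a sketch under assumptions, not Mike L.’s script: all function names are made up for illustration, it assumes the `tesseract` command is on the PATH and that each image is already in a format Tesseract can read (JP2 files may need converting first), and it uses IA’s `https://archive.org/download/<identifier>/<filename>` URL pattern. The storage step is passed in as a function so the MongoDB details can be decided separately.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def ia_download_url(identifier, filename):
    """Build the download URL for one file within an IA item."""
    return f"https://archive.org/download/{identifier}/{filename}"


def fetch_bytes(url):
    """Download a URL and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def ocr_with_tesseract(image_bytes, suffix=".jpg"):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes the image is in a format Tesseract reads directly.
    """
    with tempfile.TemporaryDirectory() as tmp:
        image_path = Path(tmp) / f"page{suffix}"
        image_path.write_bytes(image_bytes)
        output_base = Path(tmp) / "page"
        # "tesseract <image> <outputbase>" writes its text to <outputbase>.txt
        subprocess.run(["tesseract", str(image_path), str(output_base)], check=True)
        return (Path(tmp) / "page.txt").read_text(encoding="utf-8")


def process_pages(pages, store, fetch=fetch_bytes, ocr=ocr_with_tesseract):
    """Download, OCR, and store each (identifier, filename) pair.

    store is any callable that accepts one dict per page.
    Returns the number of pages processed.
    """
    count = 0
    for identifier, filename in pages:
        url = ia_download_url(identifier, filename)
        text = ocr(fetch(url))
        store({
            "identifier": identifier,
            "filename": filename,
            "source_url": url,
            "text": text,
        })
        count += 1
    return count


# Possible MongoDB hookup (database and collection names are hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient()["pg"]["ocr_pages"]
#   process_pages(pages, store=collection.insert_one)
```

Because the fetch and OCR steps are also passed in as parameters, the same loop could serve the ConSciCom image downloads by swapping in a different `store` or skipping OCR.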
Text to image linking – Desmond’s latest email indicated he will do more tweaking on contrast to improve accuracy. The GitHub site has the latest version and Mike will try to download it from there.
Game input samples for Tiltfactor – processed 3 books from Mike’s OCR samples. One was very bad – almost every other word was wrong. The other 2 books were good. Noticed some weaknesses in IA’s OCR: images or multiple columns on a page caused problems. Noticed Tesseract put lines in the wrong place. Mike could send the 2 good sample books over to Tiltfactor.

Mining Biodiversity
Monthly meeting last week – ngram correction of OCR – still working on it but won’t be done in time for us to use the corrected OCR. BHL would like it set up as a web service, the same as we do for the name finder, for the life of the project; not sure what happens after that.
Tagging – testing tool ARGO for markup. Test tool with sample pages starting from top. Meet on Friday to compare and agree on what type of tagging rules we want to establish.