Nov 10 2014
Tech review meeting 11/10/14
Present: William, Trish, Mike
BHL
Redirect for Citebank – AMNH’s link to their search in CB wasn’t redirecting properly. Should be resolved now.
Contributor browse – emails going back and forth. Mike will update Bianca with our consensus.
Conabio – interested in joining BHL. William is flying there in mid-December (Dec 14-17). William will be out for 3 weeks at the end of the year.
Art of Life
Flickr content from IA – debrief from Martin on Thursday?
Processing stopped on the IA server last week, and Mike is waiting on them to tell us how Mike and Kyle can share processes. MongoDB is in Kyle’s home directory.
Cost share report – William is working on it
NEH interim report due end of Nov – Trish will do it
Zooniverse/ConSciCom – Trish sent examples to them. Everyone has filled out Doodle poll except for Jim. Trish will suggest a date from the poll. Jim’s mockup – very basic application.
Query on accuracy ratings of algorithms
Purposeful Gaming
Architecture for managing multiple OCR versions – Mike and William discussed this and have a better idea of what functionality is needed. We won’t need another tool but can manage with what we have. We need a MongoDB and wiki install for managing the versions. William talked to Mike W., who is out this week. Lucy may be able to help with this but is busy with Garden Glow. What is our timeline for when we want it set up? Soon, but Mike L. says he doesn’t need it immediately.
What set of published documents do we want to use for PG? Factors William thinks we should consider: date of publication, language, taxa, version of OCR engine. It would be good to have a subset that overlaps both PG and MB. Corrected OCR would happen sooner in MB than in PG.
PG only wants to push accurate OCR to the game, so it probably wouldn’t take publications with earlier dates.
Publication date
Pre-1800 publications do not do well with OCR, so we can eliminate those.
We will focus on the 1800s to the present day. We do not need to consider copyright status because the OCR is not the publication itself (it is an interpretation of the publication).
1800-1850
1850-1900
1900-1950
1950-present day
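A minimal sketch of how the date ranges above could be applied when picking the test set. It assumes the first range is meant to start at 1800, that each item has a numeric publication year, and that a boundary year such as 1850 belongs to the later range; the function name date_bucket is hypothetical.

```python
from typing import Optional

# Range boundaries taken from the notes; the first range is assumed to start at 1800.
DATE_RANGES = [
    (1800, 1850, "1800-1850"),
    (1850, 1900, "1850-1900"),
    (1900, 1950, "1900-1950"),
]


def date_bucket(year: int) -> Optional[str]:
    """Return the range label for a publication year, or None for pre-1800 items."""
    if year < 1800:
        return None  # pre-1800 publications are excluded from the set
    for start, end, label in DATE_RANGES:
        # Assumption: a boundary year (e.g. 1850) counts toward the later range.
        if start <= year < end:
            return label
    return "1950-present"
```

Items for which date_bucket returns None would be left out; the rest could be sampled per range so each period is represented.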
Language –
English only? We don’t have dictionaries that are as comprehensive in other languages. But we should also include German, French, Spanish and Italian.
Taxa filter -
William says botany and zoology publications are structured differently, as are publications geared towards the general public versus scientists. More scientific publications would present different challenges for OCR. How would we filter on taxa? Not so much on specific scientific names as on general subjects; we would have to rely on LCSH. William will consider this factor further and determine if it’s important.
Version of OCR engine used to generate OCR file
This will be difficult to assess: we know the version (the info is in the meta.xml file; look for the element “OCR”), but we don’t know how they tweaked the settings within that version, which can have an enormous impact on accuracy. We will not use this factor for Purposeful Gaming.
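For reference, a minimal sketch of reading the engine version out of a meta.xml file. The notes name the element “OCR”; its exact case and position in the file are assumptions here, so the lookup matches the tag name case-insensitively anywhere in the document. The function name and the sample XML are hypothetical, not taken from a real item.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def ocr_engine_from_meta(xml_text: str) -> Optional[str]:
    """Return the text of the first element named "OCR" (any case), or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.lower() == "ocr":
            # Empty or whitespace-only values are treated as missing.
            return (elem.text or "").strip() or None
    return None


# Hypothetical sample for illustration only.
sample = (
    "<metadata>"
    "<identifier>example-item</identifier>"
    "<ocr>ABBYY FineReader 8.0</ocr>"
    "</metadata>"
)
print(ocr_engine_from_meta(sample))
```

This only recovers the recorded version string; it says nothing about the engine settings, which is the reason the factor was dropped.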
Text to image linking – Desmond’s latest email indicated he will do more tweaking on contrast to improve accuracy. His test site has not been updated. He will be doing the editing tool. He has updates on his blog.
Mining Biodiversity
Altmetrics meeting to be scheduled to catch us up on how they are monitoring URIs
AddThis’s additional recommendations box – complaints from users. It was taken out of BHL but not the blog.
Explorer of altmetrics to monitor developments. We’ll measure impact and need to keep ($).
Focus group feedback – most users were noncommittal (i.e. low interest, for their current work, in the functionality we are creating) but see the potential. Started adding vocabularies and authority files for the automatic tagger to identify entities. Looking at synonyms (words used in the same way in context).
Dates for manual tagging? Not set yet.
Found work by pro-iBiosphere on tagging Biologia Centrali-Americana that we could use.
Next monthly meeting: Friday, 12 noon