This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

Feb 25 2014

Tech Review 2/24/14
Present: Trish, Mike, William

William
NESCent-EOL-BHL Research Sprint, Feb 3-7, in Durham, NC
Meeting in North Carolina. Scientists and IT staff posed questions to EOL and BHL folks. Our API worked fine. Some wanted all the data in one location. We tried to focus them on specific content depending on their interests, fields, and subject areas. 4 groups were interested in BHL content: 1) Hellenic Centre for Marine Research, Crete 2) Univ of North Carolina, Chapel Hill 3) Univ of Helsinki 4) CONABIO
Got results after 2.5 days. Harvesting OCR from IA, but can’t do it by page (only at item level).
John Mignault came as well
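
For reference, item-level OCR harvesting from IA could look roughly like the sketch below. It assumes the item's full-text file is published at the usual archive.org download path with a `_djvu.txt` suffix; the function names are made up for illustration and this is not necessarily what the sprint groups ran. It returns the text for the whole item, which matches the item-level limitation noted above.

```python
from urllib.parse import quote
from urllib.request import urlopen


def ocr_text_url(identifier: str) -> str:
    """Build the download URL for an item's full OCR text.

    Assumption: the item exposes a '<identifier>_djvu.txt' file, which is
    the common full-text derivative on archive.org.
    """
    safe_id = quote(identifier, safe="")
    return f"https://archive.org/download/{safe_id}/{safe_id}_djvu.txt"


def fetch_item_ocr(identifier: str, timeout: float = 30.0) -> str:
    """Download the OCR text for one whole item (not page by page)."""
    with urlopen(ocr_text_url(identifier), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (requires network access; 'some_item_id' is a placeholder):
# text = fetch_item_ocr("some_item_id")
```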

pro-iBiosphere – European museums, gardens, researchers
Working on prototypes and examples of what they want to do next year if money is available from the European Union. They can collaborate with non-Europeans, but there is probably no money for outside institutions.
Good topic for Tech meeting – pros/cons of providing open data for others to repurpose. What does it mean for future of BHL repository?

Mining Biodiversity (DID)
Partners
Univ of Manchester: developed tools to do data mining, general tools linked into workflows to provide
Annotating platform - ARGO
Also have Named Entity Recognizers & Normalizers
Taggers recognize species and habitats
Event extractors – associations between entities
Semantic searches/tools, e.g. KLEIO, ISHER, EvidenceFinder, TerMine
Have language translation but easiest in English. We will focus on English language materials from BHL.

Social Media Lab, Dalhousie University, Canada
Want to build a community and conversations around digital objects in BHL. Look at social media platforms where BHL is being discussed and visualize those conversations. Challenges - does BHL community discuss topics at the level of

2 hires: 1) text mining 2) community manager at SI who will work with the Canadians
What is BHL's role? How do we integrate what they create into the current BHL portal? Some services are related to names,
Will data mining be just for the life of the project, or will it be an ongoing service? To be determined.
Have to build a training set for data mining just as we did for Art of Life.
Other Canadian partner, computer scientist at Dalhousie, will turn Ngram algorithms (does automated OCR correction). BHL’s role is to determine how to manage the OCR versions within our architecture. Needs OCR dump now.
Crowdsourcing the training set – volunteers from BHL to help build the training set and manually identify the entities and relationships. What size does the training set need to be? TBD. What types of materials need to be represented? Articles, books, seed catalogs, field notebooks.
Timelines: TBD; one partner needs OK from their funder

Mike
Gemini issues and staff requests have taken up most of Mike’s time for the past week, e.g. image files from Australia, server issues, users not getting their PDFs. Working on seed catalog ingest into BHL, hung up at IA due to status
Things to work on: authors display, Gemini upgrade
Working on Art of Life ID list, dropping files on FTP site, tweaking Admin functionality for importing
Searching and hiding externally hosted content left hanging



Trish
Art of Life – are we able to verify how many pages we have processed and can upload to Macaw? Current stats: Mike has received 237k rows from Kyle, 45k rows where an algorithm has said it contains an illustration, only 23k
1st export – 400k pages but there are dupes (Mike confirms 200k unique pages)
Estimate: about 46k illustrations have either ABBYY or Contrast = yes
2nd export – 60,000 records from IA (Kyle says all new since last export) no dupes in this batch and does not duplicate
Some items were not processed (20-30k) and are being set aside. We need to determine how we will reprocess those through the algorithm.
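
The checks above (dropping duplicate pages, then counting rows where ABBYY or Contrast = yes) can be sketched as follows. This is only an illustration: the column names `page_id`, `abbyy` and `contrast` are assumptions, since the actual layout of Kyle's export is not recorded in these notes.

```python
def unique_pages(rows):
    """Keep the first row seen for each page ID, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        page_id = row["page_id"]  # hypothetical column name
        if page_id not in seen:
            seen.add(page_id)
            unique.append(row)
    return unique


def flagged_as_illustration(rows):
    """Return rows where either detector column says 'yes'."""
    def is_yes(value):
        return str(value).strip().lower() == "yes"

    return [
        row for row in rows
        if is_yes(row.get("abbyy", "")) or is_yes(row.get("contrast", ""))
    ]


# Tiny made-up sample, not real export data.
sample = [
    {"page_id": "p1", "abbyy": "yes", "contrast": "no"},
    {"page_id": "p1", "abbyy": "yes", "contrast": "no"},   # duplicate
    {"page_id": "p2", "abbyy": "no", "contrast": "no"},
    {"page_id": "p3", "abbyy": "no", "contrast": "Yes"},
]
pages = unique_pages(sample)
illustrated = flagged_as_illustration(pages)
print(len(pages), len(illustrated))
```

Running the same two steps on each export would give directly comparable unique and flagged counts.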
Trish would like to upload at least 10k pages this week – not sure of Macaw’s file upload size limit; asked Joel but he has not responded. Should we start with smaller files? Go ahead and upload a 10k-record file to test how well Macaw handles file sizes, but do not begin classifying until Joel returns and can clear everything in In Progress.
Mike will be sending ID assignments at end of day Tuesday.
Using FTP site at MOBOT – can this be part of an automated workflow? Yes, depending on the setup at each location. We can write a script to log in and upload to the FTP site.
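
A login-and-upload script of the kind mentioned above might look like this minimal sketch using Python's standard `ftplib`. The host, credentials, directory and file name are placeholders, not the real MOBOT settings, and the `ftp_factory` parameter is only there so the function can be exercised without a live server.

```python
from ftplib import FTP
from pathlib import Path


def upload_file(local_path, host, user, password, remote_dir=".", ftp_factory=FTP):
    """Log in to an FTP server and upload one file in binary mode.

    Returns the remote file name. All connection details are supplied
    by the caller; nothing here is specific to any one site.
    """
    path = Path(local_path)
    with ftp_factory(host) as ftp:           # connection is closed on exit
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)                  # move to the drop directory
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)
    return path.name


# Usage (placeholder values; requires a reachable FTP server):
# upload_file("id_assignments.csv", "ftp.example.org", "username", "secret", "incoming")
```

Each location could run this from a scheduled job, which is where the setup differences between sites would come in.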

Purposeful Gaming
WebWise conference was good – most useful content was the wisecamp session on effective leadership & collaboration that I and another attendee suggested, and the session on crowdsourcing descriptions and transcriptions, which covered the difficulty of getting metadata out of platforms (Flickr, Scripto) into a local system

Met with Mary Flanagan, Dartmouth Tiltfactor Lab – she recommended building the platform in HTML5, then wrappers could be used for putting content on mobile. It might still need to be connected to the web in order to work on a mobile device. Mike found a technology called PhoneGap: you build the game in an HTML5 framework, then PhoneGap can generate native mobile apps for multiple platforms

RFP – send out this week to reviewers – Litzinger staff, Jiri? Then try to send off to companies by early next week. William would like to see another version
Project meeting – clarified roles, gave updates on work done, reviewed transcription tools and looked a bit at the RFP