BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

TRACK & BREAKOUT BRIEF REPORTS

Reporting from Tracks & Breakouts
Objectives: digitizing biodiversity literature, interoperability internally and externally
Coordinating committee composed of 1 or 2 representatives from each node, with lateral work where appropriate. Each node is autonomous and self-funding (no pot of money in Global BHL). The coordinating committee doesn't replace the legal or other requirements for managing a node. Funding sources differ among the nodes. As new members come in, must pay attention to referrals. If there is agreement, will draft a set of principles and a chart showing the relationships and send them out for feedback and discussion. No decisions. Branding discussion: need to include more non-western language partners. Define some minimal level of participation to be a "node". Sustainability: sustainability plans should perhaps be a requirement for each certified node. BHL Europe will have a sustainability plan by April 2011. Tom is working on one. Henning and Tom will share information on this.

Technology Session (Chris): how to download and upload content to Internet Archive. (Need link to Mike's presentation.) Twitter feeds with presentations. Chris asked if he can send out presentations. Best practice for IA identifiers (later naming discussion). Reviewed how BHL enhances data. Names are put into XML and dropped into IA. Should do this with all. Martin, Suzanne and Chris will talk with Open Library.
GUIDs and identifiers. Everyone loves handles. Listed common identifiers (don't need ISSNs....) We are assigning OCLC numbers and ISBNs but don't need to put extra effort into this. The era of ISBNs has passed. To contemporize older materials we should use DOIs. Expensive. ACTION: Chris will continue discussions with CrossRef. BHL Europe is building a global resolver GUID minting system. We should use this heavily for global infrastructure. Will continue to monitor LSIDs--but no adoption yet.

Name finding discussion (Chris). Lakschmi has built an alternative to TaxonFinder (NetiNeti) and we want to swap it in to get better name finding. Use training data to also pull out people's names. Lakschmi to work with GoldenGATE, using BHL as a test bed. Encourage another workshop on name finding and name-finding tools with others involved in this topic (GBIF, for example).

OCR/Text correction (Chris): Lots of good discussion. Started with services but the root of the problem is OCR. If doing corrections, want to start with good OCR. The IMPACT project has possibilities but doesn't seem ready yet. Talk with Noha about the Bibliotheca Alexandrina process and also with Chinese colleagues. Write a clear problem statement and frame it around biodiversity text. Pull together a test set of about 100 (ugly) pages and turn it over as a test bed for others to work on (IMPACT, Google). Ensure that IA is using the most up-to-date OCR, including for character-based languages. Ely walked us through the Australian National Library text correction work. Once we have better OCR, we need better text enhancement opportunities. Investigate the Wikisource option and the Australian process. Will work with Europe since this is a deliverable for them. Markup is out of scope--offer BHL as a content test bed but we won't make the tools. Articleization came up and we agreed that this feature is currently just a first pass and we want to 1) make the feature easier to use and 2) make it more visible. Talked about accessing content and how people get into BHL content. Tropicos was the example. Promote the use of OpenURL. Revise the landing page to have a more article and bibliographic type of search that uses the OpenURL resolver code we already have, to improve user experience.
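As a rough illustration of the OpenURL point, the sketch below turns a citation into a resolver link of the kind a partner site such as Tropicos could generate. The resolver address and the helper name `build_openurl` are assumptions for illustration only; the query keys (`genre`, `title`, `volume`, `spage`, `date`) are standard OpenURL 0.1 fields, but which of them a given resolver honours would need to be checked.

```python
from urllib.parse import urlencode

# Assumption: address of the OpenURL resolver; replace with the real one.
RESOLVER_BASE = "https://www.biodiversitylibrary.org/openurl"


def build_openurl(title, volume=None, start_page=None, year=None, genre="article"):
    """Return a resolver link for a citation (hypothetical helper).

    Fields that are unknown (None) are left out of the query string.
    """
    fields = {
        "genre": genre,       # OpenURL 0.1: kind of item being cited
        "title": title,       # journal or book title
        "volume": volume,
        "spage": start_page,  # first page of the article
        "date": year,
    }
    query = {key: str(value) for key, value in fields.items() if value is not None}
    return RESOLVER_BASE + "?" + urlencode(query)


if __name__ == "__main__":
    print(build_openurl("Annals of the Missouri Botanical Garden",
                        volume=12, start_page=45, year=1925))
```

A bibliographic search form on the landing page could collect the same fields from the user and hand them to the resolver in the same way.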

Content (Bianca):
Coordinating collections requires 2 components:
#1 Communications
--There is a need for communication at the global level about scanning/workflow/IPR issues
--What have you scanned? and are you willing to share this content w/ BHL?
--What are you planning to scan?
--What is/are your various selection methodologies?
#2 Scanning workflow tools to inform these communications
--BHLUS/UK actively scanning using local tools for local needs
----in their current form, these tools are limited and are not necessarily appropriate for use on a global scale
--For Global Tools...
----There is a need for pre-digitization tools (such as the current BHL Scan List + Monographic deduper, and future tools such as GRIB)
----There is a need for post-digitization tools (like Gemini)
----A way must be found for pre- and post-digitization tools to communicate
#3 What is the level of user influence on collections coordination?
--Allowing user to bid for items to be digitized requires lots of help desk work
Need for a brief document that articulates collections issues and defines workflow tools for global partners

Data distribution and Naming (Anthony): too high level a discussion--need more meetings tomorrow. Different tiers of clusters--full and partial mirrors and a metadata-only mirror. Noha is using rsync over SSH but there may be other technologies--need testing. Open question: when new partners want to add their own data, how to get that data to everyone. If partners add, assume everyone wants it. Work out a naming convention to identify each data cluster. The role of IA as we move forward with clusters: make sure that new nodes know the benefits of IA so they use IA, too (data association). Not all metadata generated at MOBOT is uploaded to the cluster. Find a way to sync. Core metadata updates are in one system right now but that could change, so need version control.
Handles and identifiers (Martin): how to identify different file structures. BHL Europe handle minting project. When attaching: file structure, file type? Assign DOIs at institutional or file level? How to replicate data and metadata across the nodes. Handle/DOI in a global context requires discussion.
ACTIONS:
1. DOI/Handle and hosting
2. Mirroring/syncing: Phil, Mike, Anthony
3. Testing between Australia and US.
4. Formalize syncing schedule
5. Look at IA and Smithsonian naming structure
6. Noha, Adrian will look at naming conventions, directory structures
7. Replication and partners requires group discussion as to whether partners are replicating metadata and content as well as the code, or whether some partners may sync only data and files, not code. Plan ahead so structures/code are common to all. Need a clear view of what each node will do.
There may be multiple types of mirrors (code, metadata, data, content....various combinations).
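One way to get the "clear view of what each node will do" is to write down, per node, which of the four kinds of material it mirrors. The sketch below is only an illustration: the tier names and the helper `needs_update` are hypothetical, and the session did not settle on any of these definitions.

```python
from enum import Flag, auto


class Holds(Flag):
    """Kinds of material a mirror may replicate (taken from the list above)."""
    CODE = auto()
    METADATA = auto()
    DATA = auto()
    CONTENT = auto()


# Hypothetical tier definitions; real tiers would be agreed by the group.
FULL_MIRROR = Holds.CODE | Holds.METADATA | Holds.DATA | Holds.CONTENT
DATA_AND_FILES_ONLY = Holds.METADATA | Holds.DATA | Holds.CONTENT  # no code
METADATA_ONLY = Holds.METADATA


def needs_update(node_holds: Holds, changed: Holds) -> bool:
    """True if a change of the given kind(s) must be pushed to this node."""
    return bool(node_holds & changed)


if __name__ == "__main__":
    # A metadata-only mirror receives metadata updates but not new scans.
    print(needs_update(METADATA_ONLY, Holds.METADATA))
    print(needs_update(METADATA_ONLY, Holds.CONTENT))
```

Recording the combinations explicitly like this would also give a syncing schedule something concrete to refer to when deciding which nodes receive which updates.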

NEW TRACKS for Friday:

Determine synchronicity and then Business Continuity/Resilience
GRIB/Metadata/Scanning Workflow (Boris)