This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

NOTES SEPT 22

Global BHL Technical Meeting NOTES

Woods Hole, MA

22 - 24 September 2010




Venue, in Woods Hole village:
http://bit.ly/d0e1AB
Woods Hole Oceanographic Institution Smith Conference Room, and Grass Reading Room, MBLWHOI Library
Link to arrival, logistics, dinner notes:
Non-Technical+BHL+Global+Meeting+notes

Agenda

The goal of the meeting is to bring all signed and prospective BHL partners together to describe priorities and requirements for a Global BHL. The meeting is intended to be focused and result in:
1. Creation of a global timeline for milestones & deliverables
2. High-level description & prioritization of software and hardware components
3. Description of global governance & policies for collaboration

22 September 2010

8:30am - 9:00am: Arrival, setup & breakfast
9:00am - 9:45am: Welcome & Introductions

9:45am - 10:30am: A Brief History of BHL: Musical presentation of biodiversity/BHL
Martin Kalfatovic

Taking measure of the BHL
Chris Freeland: BHL-US role in the Global BHL
Referrers to BHL: EOL, Tropicos, IPNI, Internet Archive, BioStor.org, BiblioOdyssey

Clustered and distributed storage: Phil Cryer and Anthony Goddard (slides: 201009_BHLE_Arch_meeting_WoodsHole.ppt)
Problem: all data is in San Francisco (50-70 TB). First cluster in Woods Hole; selected open source software.
Copy all content from IA to WH: 2-3 months to sync the data over a T-1 line; mailing drives to London; syncing. Not all content can be shared by all countries.
Fedora Commons integration to keep track of data: persistent, stable digital archive; not appropriate for the front end but appropriate for preservation.
Excellent work.
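
As a rough back-of-the-envelope aid for the sync figures above, the sketch below computes the sustained throughput needed to copy a data set of a given size in a given number of days. It assumes decimal terabytes (10^12 bytes) and 30-day months, and it ignores protocol overhead and downtime. The function name is illustrative and does not come from the presentation.

```python
SECONDS_PER_DAY = 86_400


def required_mbit_per_s(terabytes: float, days: float) -> float:
    """Sustained rate in Mbit/s needed to move `terabytes` in `days`."""
    bits = terabytes * 1e12 * 8          # decimal TB -> bits (assumption)
    seconds = days * SECONDS_PER_DAY
    return bits / seconds / 1e6


# Sizes and durations taken from the notes: 50-70 TB, 2-3 months.
for tb in (50, 70):
    for months in (2, 3):
        rate = required_mbit_per_s(tb, months * 30)  # 30-day months (assumption)
        print(f"{tb} TB in {months} months: about {rate:.0f} Mbit/s sustained")
```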

Graham Higley: Politics. This meeting is about working together to deliver technical solutions: shared technical architecture view and who does what. In parallel is the need for a more political discussion: Global governance model. Local projects should have a say in the long term running of the BHL. There will be a proposal that is reviewed by all participants as we go forward.

10:30am - 11:00am: Comfort break

11:00am - 3:00pm: Getting To Know You (Partner Presentations)
Each regional node will be given an opportunity to make a presentation before the group. The purpose of these presentations is to let everyone know about your specific project and how it connects with BHL as a whole. Each presentation should cover the following points:
Human and other resources available


11:00am - 12:00pm: Partner Presentations - Europe

Henning Scholz: many projects digitizing all over Europe but no common standards/interfaces; BHL Europe wants to standardize and bring all projects to a common platform. About 15,000 volumes available; publicize to European users; not funded to digitize; not research and development, but build solutions with existing technologies and bring them to market (European Union funding).

Objectives
multilingual access point through BHL and Europeana
review/test approaches for management of digital libraries
improve interoperability of existing biodiv. libraries
best practices adoption/promotion
Facilitate open access
Raise awareness
strategies for long term preservation
facilitate scanning initiatives
negotiate with rights holders

Adding partners

3 year project: done April 2012; about 4 M Eur from EU; GRIB (web database for content management and collection analysis); scan list; Best Practices Guide for scanning operations with standards; prototype for German language BHL; business plan for long-term sustainability; key components documented

lots of meetings large and small

Key performance indicators like accessible content, what is in Europeana, interconnected repositories, number of content providers, number of portal languages, page views through BHL and Europeana

Many interconnected biodiversity and non-biod. projects, e.g. Key2Nature, EDIT, Europeana

Requirements: must be a service for the public (common names); European scientists; integration with Global BHL

For the purpose of the project, the languages are European, but in practice there are some other languages that, while not required, may be incorporated (Arabic and Indian)

Melita Birthamer: Content and GRIB
10,000 volumes now but increasing; will have more than 100,000 volumes by April 2012. TIFF, JPG and PDF are the most common formats.

GRIB (collaboration with EDIT) for content management and content deduplication to identify content for scanning

Objectives

Index of bib reference based on taxonomic lit catalogs
Link to full text or information on subscribed journals
web portal/services
allow users to suggest titles for digitization
deduplication

Import: OAI-PMH interface mapping to OCLC PICA data format
Export: SRU interface to MARC21, Dublin Core, PICA, PICA short, UNIMARC and others
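
For readers unfamiliar with OAI-PMH harvesting, here is a minimal sketch of what an import step of this kind might look like: it builds a standard ListRecords request URL and pulls Dublin Core titles out of a response. The base URL, the sample record and the function names are made up for illustration; this is not GRIB's actual endpoint or code, and the mapping to PICA is not shown.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (base_url is hypothetical)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(xml_text):
    """Return every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{DC_NS}}}title")]


# Hand-written sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(extract_titles(SAMPLE))
```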

German and English interface

By Spring 2011 data from Smithsonian will be in GRIB; Other BHL libraries do not know when this is coming

Adrian Smales: Technical Implementation
Overview
high level metadata views, infrastructure, preservation and architectural model, OAIS; GRIB Global? (source code not ours to use); BHL-Europe tech deliverables (roadmap)
Metadata: structural, descriptive, technical and IPR; DC and MARC21 are the most common types
Preservation/Storage: cheaper to store on tape; virtualized system; disks and tape, each of which can be expanded independently

Need a global understanding of commonality and common standards for architecture

Graham Higley

Europeana is really the link that gets the money for BHL Europe and funding limited to 3 years. 2 institutions will keep BHL E going--Henning's team in Berlin may provide long term support and NHM London will continue with the data storage. Other partners may continue as well. Russia may contribute to next round of funding. Project will not stop when funding runs out!

12:00pm - 1:00pm: Lunch

1:00pm - 1:30pm: Partner Presentations - China

1:30pm - 2:00pm: Partner Presentations - Australia
Ely Wallis
BHL linked with Atlas of Living Australia; ALA federally funded through June 2012
wide-ranging infrastructure and biodiversity knowledge (38 million Australian dollars); share biodiversity knowledge; funding for building infrastructure; Biodiversity Information (species pages) is central to ALA: credible information about species; link through to literature
Gathering specimen info as well--observational as well as vouchered museum specimens. 2 million from vouchered; 8 million from observations; also images and morphological info and expertise; Biosecurity is high profile; resource management, forensics, agriculture, public are some user bases;
Specimen data; spatial data; Australian national checklists;
data integration--rare and threatened species; citizen science portal; pest information portal; conservation portal
Digital literature is another component project.

BHL, morphbank, identify life, barcode of life: rich data sources

ALA Biodiversity Information Explorer (BIE) goes live at the end of October. BHL is already integrated into the BIE: uBio was used to find Australian species names, and the BIE displays the earliest BHL reference plus 9 others that have the most references to that species. Not a comprehensive bibliography. Eventually will mirror all BHL content but not ready yet; mirrored indexes serve as a way into all of BHL content
Australian content providers are ready to scan! Making decisions about workflow and content decisions; inventorying what is available. Much Australian published material is already available in BHL
Outcome: what content is available and what can be scanned in the future.
Developing new user interface for BHL; right now it is stripped down: BHL Australia test
Looking at bringing scanning operations to books particularly for rare books
Volunteers to work on content once there.
PLANS: New UI (Dec 2010), mirroring content, ingestion and upload process (March 2011), metadata and links, new scanning (mid 2011); can make a case for maintenance funding.
Other projects: annotation services (by users--shared or not); OCR correction by crowdsourcing (Australian newspaper project); adding subject term "Australia" to metadata - might be able to be done by building a "collection" in BHL
Scanning field notebooks and other grey literature is a popular idea.
Australian copyright allows Australia to scan up to 1955
Publishers of Australian publications who have scanned their journals are willing to add them to BHL without restrictions and often up to current publications.
State libraries have been doing mass scanning but not rich in science. Some may be able to offer services, others not.
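
The reference selection described for the ALA species pages (earliest BHL reference plus the 9 others with the most references to the species) can be sketched roughly as follows. The `Reference` fields and the use of a per-item mention count as the ranking key are assumptions made for illustration; the notes do not describe ALA's actual data model or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    title: str
    year: int           # publication year (assumed field)
    name_mentions: int  # times the species name occurs in the item (assumed field)


def select_references(refs, extra=9):
    """Return the earliest reference followed by the `extra` most-mentioning others."""
    if not refs:
        return []
    earliest = min(refs, key=lambda r: r.year)
    others = [r for r in refs if r is not earliest]
    others.sort(key=lambda r: r.name_mentions, reverse=True)
    return [earliest] + others[:extra]


refs = [
    Reference("A", 1900, 5),
    Reference("B", 1850, 1),
    Reference("C", 1950, 10),
]
print([r.title for r in select_references(refs, extra=1)])
```

The result is deliberately not a comprehensive bibliography: anything beyond the earliest item and the top-ranked others is dropped.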

2:00pm - 2:30pm: Partner Presentations - Brazil
Abel Packer
BHL Brazil via SciELO BHL Network
Motivation to participate: preservation of biodiversity, reduce taxonomic impediment, enrich biodiversity info space; SciELO network of national and thematic collection of quality journals

BHL Brazil/BHL SciELO; political and institutional support from federal government and state of Sao Paulo government
community, institutional and operational support from research community, libraries

Technical work: procedures/criteria for content to be digitized; portal development based on open technology to be compatible with BHL (to be operational by Dec 2010); metadata exchange via OAI; search engine from VHL; services and scanners in process; network governance formalization 2010/2011

2000? books scanned by 2013 and digitized and exposed by BHL

Funding: Ministry of Environment (infrastructure, operation and interoperability) $1 million, FAPESP-article repository and open access journals $400,000
Integration: SciELO, VHL (health), Biota, CRIA (reference center on environmental info), Ministry of Environment, BHL
Slow but sustainable. There is political and scientific support.

FUTURE ACTION: Tom noted that the deposit problem happens in many places and a small working group may be able to develop solutions.
FUTURE ACTION: Graham noted that the names management issue is a major thread of architecture.

2:30pm - 3:00pm: Partner Presentations - Egypt
Noha Adly
Bibliotheca Alexandrina: have been involved in digital libraries (and EOL) but not BHL so far. BA .... "center of excellence in the production and dissemination of knowledge". Fits within the BHL. BA: "provide universal access to human knowledge" "provide all information to all people at all times"
participating in BHL makes sense; about to create Arabic version of EOL and a mirror site, translate some of EOL and add Arabic content.
EOL in Arabic: technical infrastructure and mirroring; selection of species for translation and publishing Arabic translation of 55,000 pages; add new Arabic content
BA has a long-established mass digitization/OCR workflow; have 10 scanners (started in 2003 with 1); 120 trained specialists working 7 days a week on two shifts: books, photos, negatives, slides, maps
Digital Lab workflow for books: check-in to digital lab, scanning, processing, QA processing, OCR (corrections by software "learning"), encoding (PDF/DjVu), QA PDF, backup and archiving, publishing to repository, check-out of digital lab
Digital Assets Repository
Digital Assets Factory, Digital Assets Metadata (Fedora used to manage only metadata), Digital Assets Keeper, Digital Assets Publishers (book viewers, image viewers, search engines)
167,000 Arabic books freely available (some copyrighted materials only available inside the library)
Description de L'Egypte 1798
Partner with World Digital Library
Hosting a mirror of Internet Archive.
Science Supercourse: Powerpoint repository for health, agriculture, environment, computer engineering
Interested in being a BHL mirror site and work on infrastructure.
ADMIN ACTION: Bibliotheca Alexandrina has offered to host a BHL meeting.
ADMIN ACTION: Need strategy for Russian collaboration
ADMIN ACTION: Need strategy for African Collaboration
3:00pm - 3:30pm: Comfort break

3:30pm - 4:00pm: Review of BHL User Survey - Bianca Crowley, BHL Collections Manager
Attachments: BHLsurvey2010.pptx, BHLSurvey2010_questions.pdf, BHL-E_5pt8_100805.pdf

4:00pm - 4:30pm: Names
Chris Freeland

Name finding: all of the OCR text is sent to the TaxonFinder algorithm at MBL. 31 million pages scanned, with 79.5 million name strings. Unique names = 1.6 million. 5.1 million unique name strings have not yet been verified as names. When we have a name, we link to EOL and NameBank. There is no mechanism to validate the 5.1 million names. Validation is important, but we don't have the resources. Role of uBio: name recognition (finding strings) and name reconciliation, the two sides of the problem. uBio does both: it takes a name string and gives back identifiers; EOL takes a NameBank ID and gives an EOL ID. Need to separate the 2 services and work separately on each component. Name exists... then it is up to specialists to validate.
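
To illustrate the proposed split between name recognition and name reconciliation, here is a minimal sketch with the two steps as separate functions. The regular expression is a naive stand-in and is not the TaxonFinder algorithm; the dictionary stands in for a name service such as NameBank, and the identifiers are invented.

```python
import re

# Naive candidate pattern: capitalised word followed by a lowercase word of 3+ letters.
# This is an illustrative assumption, not how TaxonFinder works.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


def find_name_strings(text):
    """Recognition: return candidate name strings found in OCR text."""
    return CANDIDATE.findall(text)


def reconcile(name_string, known_names):
    """Reconciliation: map a name string to an identifier, or None if unknown."""
    return known_names.get(name_string)


# Hypothetical lookup table with made-up identifiers.
KNOWN = {"Homo sapiens": "id-1", "Quercus alba": "id-2"}

text = "Specimens of Homo sapiens and Quercus alba were noted. The authors agree."
for candidate in find_name_strings(text):
    print(candidate, "->", reconcile(candidate, KNOWN))
```

Note that the recognition step also returns "The authors", which reconciliation cannot match. Strings like this are the kind of unverified candidates that would still need validation by specialists.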


35% error rate in OCR of species names.


ACTION: Make the data available. Make random sample available for nomenclators to provide feedback on ratio of good names.
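
One way the random sample in this action might be drawn is sketched below. The function names, the optional seed and the idea of recording reviewer verdicts as booleans are assumptions for illustration only; no sampling method was agreed at the meeting.

```python
import random


def sample_name_strings(name_strings, k, seed=None):
    """Draw up to k distinct name strings uniformly at random."""
    rng = random.Random(seed)             # seed makes the sample reproducible
    pool = sorted(set(name_strings))      # de-duplicate, fix the order
    return rng.sample(pool, min(k, len(pool)))


def good_name_ratio(verdicts):
    """Share of sampled strings that reviewers marked as real names."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


strings = ["Homo sapiens", "Quercus alba", "The authors", "Homo sapiens"]
print(sample_name_strings(strings, k=2, seed=1))
print(good_name_ratio([True, False, True, True]))
```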


What about common names? EOL is a logical partner. Common name solution starts with tools in process - Partners are EOL, ubio,


ACTION: work on OCR correction


Abel asked if we could get the number of occurrences of each name--yes, but we have not done it.
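
Counting occurrences per name is straightforward once the name strings are available as a list. A minimal sketch, assuming the strings have already been extracted (the sample data is invented):

```python
from collections import Counter


def count_occurrences(name_strings):
    """Return a Counter mapping each name string to its number of occurrences."""
    return Counter(name_strings)


found = ["Homo sapiens", "Quercus alba", "Homo sapiens"]
counts = count_occurrences(found)
for name, n in counts.most_common():
    print(name, n)
```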


ACTION: Park the names problem

ACTION: Write out the problem and pass along to potential partners: Chris, Henning, Bianca, Connie

4:30pm - 5:00pm: Global Open Access
What copyright issues and distribution limitations will we encounter in sharing materials globally?
Open Access: what does it mean? The Darwin Library project has some things that are open and some that are not; this was discovered 12 months into the project. The global world of copyright is very complex. Should the BHL develop a policy that it won't take something that can't be distributed widely? There is language on the public wiki, although it is unenforceable. BHL is not assuming any copyright responsibility. BHL doesn't own any copyright. The question is: are we going to institute any software to require certain actions of a user? Tom says no; the system would have to be completely rebuilt.
Current IP statement is not a legal declaration.
SciELO has been open access for many years.
Users will get confused and frustrated if we have different categories of access.
Are we willing to eventually lose the brand?
Open access means open access.
European copyright laws are complex; public domain has different requirements for declaration. Need to let content providers know that material can be reused.

"no known copyright restrictions" Smithsonian statement

In the case of the BNF, can only use low resolution images because any commercial use must go through BNF.

ACTION: Graham will work with Nancy to develop a suitably vague statement.