June 7 2012 BHL Global Notes
Welcome and Introduction
Brief History:
Global BHL:
BHL US/UK (1st)
BHL-Europe
Brazil
Egypt
China
Australia
2 other global meetings. 1st in Woods Hole in Sep 2010. 2nd in Chicago, IL, 2011.
Progress Updates
Australia (Ely Wallis)
Australia BHL runs under funding from Atlas of Living Australia. Federal government funded project. Funding ends June, 2012. No further funding so future of Atlas after this year (plus 18 months of additional funding left) uncertain. BHL-Au not funded by Atlas after June. Atlas is run out of Museum Victoria.
Staff on Au project:
Ely Wallis - project manager
Simon Sherrin - technical lead
There is also:
Digitization project leader
Designer - Simone
Scanning Assistant
6 volunteers (digitization done by volunteers)
Work Recently:
Au divided into 4 parts:
1) Website: Working since last year with Mike L. on synchronization of metadata catalogs between Missouri and Au. Don't have full replica of content from IA yet. However, catalogs are synchronized and if someone searches Au, it pulls content live from IA.
Decision to be made:
There is currently staff funding through October, but then Au needs to decide what to do with project. People like the design of Au site, so there is talk about taking this design and applying to BHL-US/UK site. There are things in US site that aren't in Au site, so designer in Au started coming up with options to add these functions to Au interface.
BHL in Atlas of Living Australia:
Created SOLR index from OCR generated by IA. Not keeping full mirror or generating new OCR, but harvesting OCR from IA and creating index from it. Now put up new literature page on ALA website. Species pages on ALA include images, sightings information, names, classification, and literature tab. This includes name references found in BHL. Taking names and all synonyms from national species list and querying OCR SOLR index and bringing back information and page images on the fly. If you follow the page link, you go through to same page in BHL-Au book viewer.
Simon working on how you could run the same query in BHL-Au website (as opposed to within ALA website). This functionality could be offered to the rest of the group.
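The name lookup described above (one query covering a name and all of its synonyms) could be sketched roughly as follows. This is only an illustration, not the ALA implementation: the field name `ocr_text` and the example names are assumptions, and the code only builds the query string without contacting a Solr server.

```python
def build_name_query(names, field="ocr_text"):
    """Build a Solr query string matching any of the given names as a phrase.

    `field` is a hypothetical name for the indexed OCR text field.
    """
    phrases = []
    for name in names:
        # Escape backslashes and double quotes so each name stays a valid quoted phrase.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        phrases.append(f'{field}:"{escaped}"')
    return " OR ".join(phrases)


# Accepted name plus a synonym (illustrative values only).
query = build_name_query(["Macropus rufus", "Osphranter rufus"])
print(query)
```

The resulting string would be passed as the `q` parameter of a Solr select request; page images would then be fetched for each hit.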
2) Bidlist:
Australian version of scanlist. Set up differently than BHL-US/UK scanlist. In Au already had federated library catalog for all Au libraries, run by National Library of Au. Didn't have to build a catalog. National Library handles all de-duping. Also have National Species Lists for animals and plants. Each has a list of citations attached to it. BHL-Au has taken these citations, put in database, and pulled out journal and monograph titles. Then ranked these titles to see which has most number of new species described and have ranked list of titles for what to scan so they first scan literature with most species described. List acting as "wish list," as they have small scanning operations in Au. Already had instance where some volumes in scanlist priority list are already in BHL and permissions already secured for them by MBL. Working with Bianca Crowley to coordinate obtaining permissions from publishers to scan.
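The ranking step described above can be sketched as a simple count of citations per title. This is a minimal sketch under assumptions: the shape of the citation records (species name, title) and the sample data are made up for illustration and are not taken from the BHL-Au database.

```python
from collections import Counter


def rank_titles(citations):
    """Rank publication titles by how many species descriptions cite them.

    `citations` is an iterable of (species_name, title) pairs; this record
    shape is an assumption for illustration.
    """
    counts = Counter(title for _species, title in citations)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


citations = [
    ("species A", "Journal X"),
    ("species B", "Journal Y"),
    ("species C", "Journal X"),
    ("species D", "Monograph Z"),
    ("species E", "Journal X"),
    ("species F", "Journal Y"),
]
for title, count in rank_titles(citations):
    print(count, title)
```

Titles at the top of the list would be the first candidates for scanning.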
Thing missing from bidlist - if you go to particular journal title, there is a function called holdings, including institutional holdings and non-institutional holdings, that currently both have no data because they're waiting for API from National Library catalog, which is not available publicly until April 2012. Soon they will add the holdings information from this API.
When you log-in, you can also add your own holdings information if it is not already available from National Library catalog. You can also "vote things up" in the bidlist (increase priority) when you log-in.
Question for Global Meeting: How do we align our bidlists?
Also doing new book digitization. Purchased scanning rig. Volunteers doing all scanning under Joe Coleman's supervision. 500 pages an hour, three days a week. Volunteers also doing all post-processing and just starting to learn Macaw so they can do the ingest process as well. 20,500 pages, 100 volumes scanned and uploaded. Started with Museum Victoria's in-house journal (Journal of Museum Victoria).
3) Macaw:
Very important for Au project. Many organizations in Au enthusiastic to put content into BHL that they've already scanned, but many already converted to PDFs themselves and haven't kept the images. They are offering these PDFs, so part of Macaw project is "PDF splitter," so you can take PDF versions of books and split back into images and upload to IA.
Have also incorporated new design interface into Macaw (including image of a macaw!) Since they will have volunteers using Macaw, wanted to make process as simple as possible for them.
4) Other activities:
Joe Coleman doing BHL User blog post on Au researcher Gary Poore, will post Tuesday, June 12 on BHL-US/UK blog.
Working with Bianca Crowley with copyright licensing for Au institutions that can be approached. Working on new forms and licensing guidelines.
Art Project: Museum Victoria asked to participate in Art Project that Google released in April. Aim is to get art online. First release had about 12 institutions in 2 countries, with about 1000 images. 2nd release has over 100 institutions in 40 countries. Museum Victoria asked to participate, and put in indigenous artworks. Had great images in rare books, so they contributed them to the project. Good way to get literature in to the project.
Brazil (Abel Packer)
Project different than other BHL nodes. BHL-SciELO network being implemented as BHL-SciELO Brazil. Hope to expand to other Latin American countries working in SciELO network.
BHL-SciELO network comprises:
SciELO biodiversity - Funded by Sao Paulo Research Foundation. Funds 90% of SciELO program. Dedicated to indexing and publication of scientific literature.
BHL-Brazil - Digitization of essential works of biodiversity related to Brazil. Financed by ministry of environment in Brazil.
Starting on EOL Brazil - dedicated to indexing list of Brazilian species to create a single catalog of species. Have flora cataloged and starting on fauna.
Have network development and operation. Open access is principle. Have common web portal to integrate all this information, as well as common governance committee.
SciELO started with publishing journals. Have today SciELO Brazil and 16 other countries in network. Now have SciELO books (Just launched with 220 books. There is a desire to start a network with university press. However, not all will be open access because University Press demands that some books be commercialized) and BHL-SciELO. This comprises SciELO program. Have 250 journals in SciELO Brazil (not all biodiversity-related), with 1.5 million downloads per day. 950 journal titles in SciELO network (Brazil plus 16 countries).
SciELO is multilingual with Spanish, English, and Portuguese.
BHL-SciELO portal not in English yet but will be by end of June, 2012.
Contents on BHL-SciELO:
Journal collection, curated set from the 950 journals from SciELO network, representing 42 titles in total.
Also have articles from these SciELO journals in BHL-Brazil.
FAPESP Biota Program - dedicated to studying biodiversity of Sao Paulo. Have repository of scientific production. Started to develop repository of publications Brazil publishes in commercial journals with restricted access provided to these articles.
SciELO has an index on blogs related to biodiversity, which is fed as an RSS feed on their website.
SciELO includes a thesaurus on biodiversity, which was just finished. The hope is that it can be adopted by gBHL. In Spanish now, working on translation to English. Have about 3000 terms.
BHL-SciELO imported contents from BHL related to Brazil to serve in their portal. Also digitizing in Brazil from Brazilian libraries - just starting this process. They also have an index to all BHL content.
Started digitization at Universidade de Sao Paulo - Museu de Zoologia. Instituto Butantan also started digitization. Then will move to Museu Paraense Emilio Goeldi (waiting on funds to digitize). Finally Instituto de Botanica. 5 scanners in total. 1 scanner is at the SciELO headquarters.
Highlights/Challenges/Conclusions:
BHL-SciELO becoming reference site for area. Has advantage of being neutral. Widely accepted as a quality program. Trying to use this to help advance the project.
Capacity building is advancing but still have much to learn in terms of methodology, solutions, and technologies. Recently BHL-SciELO staff spent a week in US to learn more about BHL technology and workflow to make digitization a reality in terms of moving data from scanning, to database, to IA.
There are a variety of information sources in SciELO network, so they need to figure out how to integrate these data elements. A group of researchers is testing different interface and integration options.
Network operation and expansion is a difficult operation because they need to combine agendas of each institution with that of overall network.
Political support is a challenge but have succeeded in general.
Egypt (Noha Adly)
Bibliotheca Alexandrina: Channeling BHL and the Middle East
Have been working on digitization of Arabic (with also a few Latin books). Have about 200,000 books digitized. Egypt is busy building a repository and workflow that governs digitization and ingestion process. BHL collection has been a test bed in the Alexandrina repository. The collection from BHL into the repository came in two batches through two processes. First batches received on hard disks (approx. 34,000 books received from UK, sent to China after Egypt was finished with them). Second batch being uploaded from IA. Have uploaded 70,000 books from IA. About 105,000 volumes uploaded in total.
Have 15,938 published on website currently. 3,700 books currently being indexed.
The 16,000 BHL books published are currently (and temporarily) published through main digital asset repository. BHL is a collection in the larger library repository. Functionality includes a book viewer and searching inside the book. The text is indexed on SOLR. Optimized for English, Arabic and French. Repository also has annotation functionality, including highlighting functionality and selecting span of pages to underline. Users can also incorporate notes into pages, rate books, leave comments, and arrange books on personal "bookshelves." Users can also embed books in other websites and share books through social media outlets. Can also have advanced search and search into content. Selecting a search result will show you where your within-text search words are located in the page viewer view.
Repository manages Arabic content, which will be provided to BHL. Can also search within books and collections in Arabic.
Same browser suited for reading the correct way for Arabic content (right to left instead of left to right) as well as English-language content.
Books received (approx. 104,000 volumes) are uploaded onto an online archive, designed around PetaBox, using same architecture as IA. Have 10 racks on 340 Linux machines. Total storage space 314 TB. All content received packaged and archived on machines. These machines are just for archiving, not serving the content. Includes checksums to check for errors.
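A checksum check of the kind mentioned above could look roughly like this. It is a sketch only: the notes do not say which algorithm or tooling the archive uses, so SHA-256 and the function names here are assumptions, and a temporary file stands in for an archived package.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's current checksum matches the one recorded at ingest."""
    return file_checksum(path) == expected_hex


# Demonstration with a temporary file standing in for an archived package.
with tempfile.TemporaryDirectory() as tmp:
    package = os.path.join(tmp, "package.bin")
    with open(package, "wb") as handle:
        handle.write(b"example archive contents")
    recorded = file_checksum(package)      # stored alongside the package
    intact = verify(package, recorded)     # check before any change
    with open(package, "ab") as handle:
        handle.write(b"!")                 # simulate corruption
    corruption_detected = not verify(package, recorded)
```

Recording the digest at ingest and re-checking it on a schedule is what lets an archive-only store notice silent errors.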
Question: What are the needs of the gBHL? Does it need nodes to store/archive content as received from IA? It has costs. Also, had to change some content for publishing, and these adjusted copies are archived, not the original.
Books downloaded from IA (about 70,000 books) - current list of books on IA updated weekly. Compared with list of books they have to know what new to download. Question: if books are updated in IA, what will happen for those books already downloaded? The books can be updated partially (OCR can be re-done), but to get the changes into your portal, you have to re-ingest all of the files from IA, not just the ones that are changed (IA does not let you download just portions of the files). Need to work out procedure for obtaining files only that have been changed, not whole packages of files.
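The weekly comparison described above can be sketched as a diff between two listings. This is an illustration under assumptions: the identifiers, the date strings, and the idea of a per-item "last updated" marker are hypothetical stand-ins for whatever the IA list actually provides.

```python
def plan_downloads(remote, local):
    """Compare remote and local listings and decide what to fetch.

    Both arguments map an item identifier to a last-updated marker (a date
    string here). The mapping shape is an assumption for illustration.
    """
    new_items = sorted(set(remote) - set(local))
    changed_items = sorted(
        item for item in remote
        if item in local and remote[item] != local[item]
    )
    return new_items, changed_items


remote = {"book-001": "2012-05-01", "book-002": "2012-06-01", "book-003": "2012-06-03"}
local = {"book-001": "2012-05-01", "book-002": "2012-04-15"}
new_items, changed_items = plan_downloads(remote, local)
print("new:", new_items)
print("changed:", changed_items)
```

This only identifies which items changed. As noted above, a changed item would still have to be re-ingested as a whole package until a per-file procedure is worked out.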
Once books downloaded from IA, have in JPEG2000. Need to change to JPEG to publish. Egypt viewer converts to PNG (from JPEG) on the fly, and conversion from JPEG2000 is time consuming and not good for publishing. First did conversion using ImageMagick. Threaded Java solution was implemented to convert JP2 to JPEG. Run on 10 machines, but rate was slow (120 books per day). Egypt then went to extracting JPEG from DJVU file rather than using ImageMagick.
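To put the reported rate in context, a quick back-of-the-envelope calculation follows. It assumes, for illustration only, that all of the roughly 70,000 books downloaded from IA would need converting at the stated 120 books per day.

```python
import math

BOOKS_TO_CONVERT = 70_000  # books downloaded from IA, per the notes
BOOKS_PER_DAY = 120        # reported rate across the 10 machines
MACHINES = 10

days_needed = math.ceil(BOOKS_TO_CONVERT / BOOKS_PER_DAY)
books_per_machine_per_day = BOOKS_PER_DAY / MACHINES

print(days_needed, "days, about", round(days_needed / 365, 1), "years")
print(books_per_machine_per_day, "books per machine per day")
```

Under that assumption the conversion would take about 584 days, which is why a faster route was sought.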
Some JP2 images have not been processed, which is another problem. There are also redundant files, like color palettes. Also repeated files (pages existing twice with different names). Other JP2 problems include zero and one based file names (not consistent standard for naming files). These problems all affect searching highlights (correctly highlighting where a search term is within a book).
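Two of the problems above (pages existing twice under different names, and zero- versus one-based file numbering) can be detected mechanically. The sketch below is an assumption-laden illustration: the file-name pattern (a numeric index just before the extension) and the sample names are hypothetical, and real packages may need different rules.

```python
import hashlib
import re


def find_duplicate_pages(pages):
    """Group page files whose bytes are identical.

    `pages` maps file name -> file content (bytes). Returns a list of name
    groups, each holding two or more names that share the same content.
    """
    by_digest = {}
    for name, content in pages.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def page_positions(names):
    """Map each file name to a zero-based position, whatever the first index is.

    Assumes each name ends in a numeric page index before the extension,
    e.g. `book_0001.jp2`; this pattern is hypothetical.
    """
    indexed = []
    for name in names:
        match = re.search(r"(\d+)\.[^.]+$", name)
        if match is None:
            raise ValueError(f"no page index in {name!r}")
        indexed.append((int(match.group(1)), name))
    if not indexed:
        return {}
    indexed.sort()
    first = indexed[0][0]
    return {name: index - first for index, name in indexed}


duplicates = find_duplicate_pages(
    {"a_0001.jp2": b"page one", "a_0002.jp2": b"page two", "b_0002.jp2": b"page two"}
)
one_based = page_positions(["book_0001.jp2", "book_0002.jp2"])
zero_based = page_positions(["book_0000.jp2", "book_0001.jp2"])
```

Normalizing both numbering schemes to the same positions is what keeps word coordinates aligned with the right page image when highlighting search hits.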
When getting files from DJVU files, data is consistent, and the download rate is high. However, the quality of image extracted from digital file not as good (JPEG from DJVU file image quality not as good as JP2 to JPEG image quality).
Thus, going to get images from JPEG2000 conversion, extract correct file names and order from DJVU file to eliminate redundancy, repetition, non-standard file names and XML word coordinates. Plan to redirect unprocessed images with black edges to humans to correct. Correcting the black edges problem must be a coordinated effort among global partners.
Intend to compile content in Arabic as well as other languages. Currently verifying copyright status and other BHL eligibility criteria.
Also working on associations with other institutions in Egypt that hold more biodiversity literature, to cooperate on digitizing content and publishing it on BHL. Intend to have customized website for BHL featuring Arab region content. Currently do not have integration with uBio, so will be working to add features to viewer in order to extract names information.
Black Edges (Martin) - more black edges coming from scanning operations as we do more work outside of IA scanning workflow. Will probably also see more microform coming in from IA. For files coming from non-BHL member institutions, we also will not have permission to change them and write back to IA to overwrite the existing files to remove black edges. Individual nodes could take the scans and clean them up themselves, but once each portal starts doing its own clean-up, replication problems arise. We will probably not be cleaning up files manually.
Parking Lot: Each node updating scans themselves from IA.
Europe (Henning Scholz)
Vision Statement: European biodiversity knowledge freely available to everyone. Done through community portal, global reference index to biodiversity, Europeana, and BLE.
BHL-E partners were each scanning, some had own portal, some not. Idea was to collect all this content and serve via one platform, which is hosted in London. Then all could be delivered to users through one access point.
Still waiting for connection to other global nodes.
GRIB has 800,000 deduplicated records from 7-8 libraries. GRIB will be kept running until 2020. Had some software, hardware, and system problems that slowed up the development process. Currently the digitization widget does not work. That is the next step, followed by importing European partner catalogs, after which international partner catalogs. At this point, other global partners can use the GRIB for workflow management. Timescale looks like 12 months at this point.
BHL-E Best Practices Guide - similar to US Cookbook. This is an important contribution from BHL-E. It summarizes the entire workflow process for BHL-E.
Preservation and archive system includes ingestion process and access site. Content will be served via Europeana (is currently being served from here) and eventually BHL-E portal.
Biggest challenge over year was pre-ingest process. Tool is now up and running. Each institution can upload its own content with this tool. Designed by colleagues in Vienna. TIFF files are created from process. Taxon finder and OCR process is included in tool.
In portal, can search for common names of species to find content. Can search for acronyms and suggested related terms.
BLE - A great educational tool. Adding humanity components to biodiversity literature. Want to develop it into an interactive tool so others can create collections, add facts, and collate books. Will come as functionality with "Europeana-Creative." Will know next week whether this project (Europeana-Creative) will be funded.
Content is available in Europeana. Once enough content is available in BHL-E portal, the website will be exposed to Europeana, allowing users to access BHL-E from Europeana.
OCR Technology Experimentation:
Processed US content with IMPACT workflow tools. Still some bugs to be fixed in the data set itself. Crowdsourcing options are being investigated. Project in Europe called SCAPE working on OCR tools, like page-type separation. Could be a useful application for gBHL (automatically tag page types). The experiments with these OCR tools have shown that high-quality scans, language information in the metadata, and font type information in the metadata need to be available. Tesseract 3.0x is the tool BHL-E has been experimenting with. It is a tool that is improving and a good alternative to other tools.
BHL-E project officially ended April, 2012. Waiting on Commission's review to close down project completely. Want to continue project, and there are a number of partners interested in continuing work. Vienna will continue to provide ingest management, technical support, and software maintenance. Berlin will continue coordination, network management, fundraising and dissemination. There is hope to get funding from new projects and proposals that are upcoming. NHM will continue to provide IT infrastructure management, including a technical director. Hoping that partner libraries will continue to help with in-kind contributions, like continuing to scan content to be added to the portal.
All code is open source and available for partners.
BHL-E is ingesting BHL-US books and European partner books.
BHL-E has their own metadata schema that they map everything to and archive that on BHL-E site. Serials coming from the US are still problematic for conversion to acceptable schema for use in BHL-E, thus they are only ingesting monographs from BHL-US currently.
US/UK (Martin Kalfatovic)
Changes:
New BHL Governance. Have executive committee (chair, vice-chair, secretary), Project Director (Martin), and Technical Director (Chris).
Steering Committee - operates on dues membership. 9 members have chosen this level. They all contribute 10,000USD to be part of this committee, which directs overall goals and objectives of project while also choosing executive committee members. Hope to increase members in future.
Institutional Council - any institution that shows interest in goals of BHL and commitment to participate in BHL (like through staff time and scanning).
Library of Congress - will become the 15th member of the BHL-US/UK. Hope to have agreement signed soon. Will be participating in Steering Committee level.
Connie leads the membership review group, which is looking to come up with guidelines for incorporating new members into BHL.
Nancy Gwinn (SIL) chair of executive group. Connie (MCZ) Vice-Chair. Susan Fraser (NYBG) Secretary. Keeps day-to-day operations of BHL going.
Tom Garnett retired end of March, 2012.
Grace Costantino became BHL program manager.
JJ Ford came to Smithsonian from MCZ to manage SIL-BHL scanning workflow.
Good News:
BHL Flickr: Very successful. 390,000 views in 10 months, over 30,000 images.
Funding Sources:
NEH Grant - 260,000 grant from NEH for MOBOT, for doing image recognition in BHL corpus. Can go through pages in BHL, identify images in text, segregate, and push into Flickr and other environments. Further goal to create linked data objects around these images and crowd-source captions for these images.
$200,000 federal appropriation to fund BHL activities
90,000 Steering Committee dues
Generated 3,900 USD in gifts from donation button
45,000USD from JRS to fund meeting in Cape Town to enable 6 US partners to go to Cape Town where SANBI will be hosting these US partners and 20 additional African colleagues to develop a BHL-Africa project
iTunes U:
Useful expansion of BHL audience beyond taxonomic community. Every quarter there will be new collections to iTunes U. Steady usage of collection over time, with spike in usage when new collections come online
Life and Literature meeting and JRS preliminary Africa meeting
ALA booth hosted with EOL - Grace Costantino and Breen Burnes hosted booth
BHL-SciELO staff came to Washington for meeting to discuss expanding Brazil workflow for digitization
Just short of 40 million pages, 105,000 volumes, 55,000 titles
Overall website statistics holding firm, with few dips during weekends.
Technical Developments:
New process allows staff to tag images during pagination process for incorporation into Flickr. Rapidly increased amount of content added to Flickr.
Macaw: Scanning workflow and ingest management tool. Based on Paginator tool, allowing for creating robust page-level metadata and pushing content to other repositories. USGS, Au, and Washington currently have installations. Other institutions are looking to add installations as well.
Interface Changes:
Donate Button
Flickr stream
Featured Collections rotating
Social Media:
Active Facebook, Twitter, and blog.
Continue to receive compliments for BHL. We track all of these praise statements on the wiki.
Governance
Global By-Laws (Nancy Gwinn)
At last global meeting, Nancy tasked with coming up with set of by-laws that gBHL could consider adopting to provide governance structure to gBHL group.
By-Laws are based on those used for BHL-US/UK. Document at meeting reflects the latest changes to BHL-US/UK by-laws
We should understand who has responsibility for organizing and convening gBHL group when needed.
Also should accept objectives and operational procedures for gBHL. This is a coordinating group, not a legal group, so we should have an agreement on how we will all participate.
Decision: it is a good idea to have by-laws.
This will be an internal document but it could be made available on our website, as long as there is context available around the document. MOBOT will be hosting a landing page for gBHL for hosting these documents and links to the global portals. Basically a "brochure-ware" site.
Below section indicates changes made to governance document at meeting. These changes already captured in document, so see updated draft for up-to-date version.
Page 1:
gBHL Coordinating Committee (gBHL-CC) name of group to organize gBHL.
Purposes and Principles section taken from Tom's original document.
Section 2, section 4 and 3 - add Heritage to Global Biodiversity Library.
Bianca: Add definition of Open Access to Section 4, Item A.
Is the current definition of open access sufficient, and is it important to make a statement about commercial use?
Graham: These are guidelines and probably not the appropriate place to address the subtleties of open access definition, especially since it will vary by node. Let these definitions be provided by each node, and each node has the opportunity to define open access as appropriate for them. This was also discussed at last meeting, but it was decided to leave this statement as it is. Could make this section linked to a definition of Open Access, in either a glossary or another web page, since this document will be available publicly. This statement could also be part of the contextual statement around the document that could link to definitions of open access.
Page 2:
Article 2, Section A: Refers to Minimal Requirements, but there is no reference to a list of these minimal requirements. Do we actually have minimal requirements for participation in gBHL we've agreed to and can we document them?
Abel: There is an issue with the idea of certified vs. participating institutions. Who decides when an institution moves from certified to participating status? There needs to be a committee responsible for drafting minimal requirements and creating a document to establish them.
Action Item - create Minimal Requirements Subcommittee. Abel Packer will lead group. Ely Wallis, Grace Costantino, Noha Adly, and Jane Smith will also join to come up with potential list of minimal requirements for gBHL membership to present to group. Will try to put these together during global meeting.
*
Minimum requirements for certified members to BHL (first draft discussion):
1) Content and methodology - must create content operational with BHL content. Must publish their content; use compatible methodology; working in coordination with other members; representing a
2) Participation in BHL activities and provide representative to group; must sign an MOU
3) Express publicly that they agree with gBHL principles
*
Ex-officio Members - should say program director and technical director staff from any node may participate as Ex-Officio members. Only official representatives can vote.
Article 4 (previously article 3, changed to article 4): Duties: Refer program objectives to Article 2, section 3.
Changed "Program Objectives" to "gBHL Objectives."
Element "C" on Objective Planning changed to first article "Article A"
Article 3 - change to gBHL Member, indicating that each node only gets 1 representative that can vote.
Article 3, section C: Just say that BHL node program directors and technical directors can serve as ex-officio participants.
Page 3:
Delete "acting through coordinating committee" in first sentence on page.
Article 5: Compensation - agreed to this statement as group
Article 6: Officers and Duties:
In Section 1, add Executive Committee to sentence so it defines Executive Committee as chair, vice-chair, and secretary
Ely: How does the gBHL-CC Executive Committee fit with the BHL-US/UK Executive Committee? gBHL-CC Executive Committee is only responsible for gBHL activities, but node executive committee members are responsible for their own node work. Executive Committees of nodes might not even be part of the gBHL Executive Committee.
If we agree to Executive Committee, we have to elect these people today. In the US/UK, the Executive Committee has a call every week. That might not be necessary for gBHL Executive Committee, but there should still be regular communication that is transparent to the group.
Chair, Vice-Chair and Secretary are elected from Certified Members. Certified Members have not yet been established. It's the representative from each of the group. gBHL-CC members can
CC composed of 1 representative from each BHL member. We don't state who that person is.
Secretary will keep official notes from the meeting, and we will have to decide if we will expose these notes in a public way.
Page 4:
Article 7: defines executive committee.
Article 8, section 3: Chair shall cast deciding vote in case of a tie.
Vote on Accepting By-laws: Voted into action by Committee Members allowed to vote. Will vote on members of the Executive Committee tomorrow.
Technical Topics
Synchronization/Global Replication/Backup/Redirection
Synchronization:
We sent disks of information from IA in Woods Hole to London for ingesting, then to Egypt, then to China. London got most of the files copied that were sent, but this was not the entire corpus of BHL material. They then started running a process to get the rest of the files from the cluster at MBL. NHM set up a process to do this, but the process failed several times. Some files of each item were incomplete. Chinese colleagues still copying disks.
Idea was that we would share the content that we have from different countries. One way to do this was for BHL-US to put content into IA so that it could be shared. Other nodes getting content from IA is a slow process.
Question: What is the way to get everything across nodes synchronized and if there are changes for one node, how is that reflected in all other nodes?
Abel: All members should commit to put their collections in IA in their own project collections and then from there you do the synchronization.
John: Some issues with using IA as a permanent synchronization platform. IA might go away, we have no control over IA, and retrieving information from IA can be very slow.
Martin: For the foreseeable future, we need to consider IA the hub for synchronization until we can get a viable other instance of content to share.
Question: How frequently and how should we synchronize different types of information? (e.g. content, copyright information, metadata, pagination)
Abel: It is the responsibility of each member to make sure their content is kept up to date within the synchronized archive.
John: Each member institution responsible for uploading most recent version of materials to IA. They then replicate parts of the collection from other institutions from IA into their own portal.
Noha: We need to have a site that will be the intermediary repository where everyone puts their material and everyone else gets their content from. IA already has the infrastructure for that. IA is something that could go away, so at least some members of BHL should commit to have a complete copy of BHL so we have one or more sites that have all information as a backup; 1) Most content in IA was digitized through their workflow. There are partners that are going to submit their content (which wasn't digitized by IA) into the IA. If the format of incoming content (file itself, like JPEG2000 vs. TIFF) is different than IA's expected format, will IA accept that? 2) would all partners agree to comply to the metadata structure of IA? 3) We need to find a way to synchronize the curation of the data that various nodes have been doing. Some changes, for instance, that BHL-US has done (like improving/correcting metadata within actual BHL portal), is not uploaded into IA, so other portals could not access this information if we go through IA. We need to find a way through IA or outside of it to synchronize the curated data.
Joel: We can upload different files that IA doesn't understand and it will ignore it, but leave it in IA. We could upload information to IA that it doesn't use but it will be kept and then other partners can access it. We do this already with names.xml and METS files.
John: We must make it explicit to IA that we'd be uploading this information so it isn't deleted from IA files.
Mike: Question of should we follow IA standards - portal ingests things from IA using their file formats; it's an argument for having people agree to conform to IA standards.
Abel: We don't worry about checking to see if data is showing correctly in other people's portals. We put our information into IA and then each portal just worries about making sure they're getting the files and displaying them how they want to. We should then create a second synchronized place for a copy of the content in IA.
John: We're not talking about a distributed network anymore but a master/slave situation where IA is the master and each node is a slave, and everything synchronizes through IA.
Noha: Centralized model will be difficult to achieve, but there should be a node that is acting as the hub where everyone places content and takes content from.
John: If there's one repository, why then don't we just have everyone run their applications against that single repository.
Noha: There are costs with replicating the whole corpus. We need to decide what we're downloading and storing from IA, because we might not need all of the files that are kept in IA. We should only store those files that are useful.
John: if you're going to have a hub, it must be replicated exactly in all of the various hubs, not just some files.
Mike: We can identify what we need and what we don't and only replicate those that we need.
Graham: Should IA be the hub, or should one of us pull everything down from IA and act as the hub for everyone?
BHL-US and Egypt are currently pulling new/updated information from IA once a week.
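A weekly pull of this kind can be sketched as a filter over a catalog listing. This is only an illustration: the record layout (`identifier`, `updated`) and the function name `items_to_pull` are assumptions made for the example, not the format IA or any BHL node actually uses.

```python
from datetime import datetime, timedelta, timezone


def items_to_pull(catalog, last_sync):
    """Return identifiers of items added or updated after last_sync, oldest first."""
    changed = [rec for rec in catalog if rec["updated"] > last_sync]
    changed.sort(key=lambda rec: rec["updated"])
    return [rec["identifier"] for rec in changed]


# Hypothetical catalog listing, as a node might receive it from the hub.
now = datetime(2012, 6, 7, tzinfo=timezone.utc)
catalog = [
    {"identifier": "book-a", "updated": now - timedelta(days=10)},
    {"identifier": "book-b", "updated": now - timedelta(days=3)},
    {"identifier": "book-c", "updated": now - timedelta(days=1)},
]

# The previous sync ran one week ago, so only book-b and book-c qualify.
last_sync = now - timedelta(days=7)
print(items_to_pull(catalog, last_sync))  # ['book-b', 'book-c']
```

Processing the oldest changes first means an interrupted run can record the timestamp of the last item it finished and resume from there.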
If you have special files associated with your content, you'll just upload those files to IA, it will ignore them, but when the other nodes go to grab the data, they will grab these extra files and use them on their portal.
For now, we're going to keep using IA, we'll upload what we have to IA and get our content from there, but we need at least two nodes to commit to copy IA. Smithsonian and Egypt can probably agree to keep complete copies of IA on their servers, but Egypt wants to be able to optimize what they keep to only keep the files that are relevant for BHL. The "Cluster" will also be another copy of the data.
We need to define a subset of the files that IA creates that we actually want to keep copies of. Define a minimum set of IA files that we want to keep. The volunteers (SIL/Egypt) are agreeing to store at least the minimum set of files from IA that we agree on.
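Once a minimum set is agreed, a mirroring node could apply it as a simple filename filter. The sketch below assumes the set is expressed as filename patterns; the patterns listed are placeholders for illustration only, since the actual list is what the group still has to decide.

```python
import fnmatch

# Placeholder patterns, NOT an agreed list: the real minimum set is still to be defined.
MINIMUM_PATTERNS = ["*_jp2.zip", "*_meta.xml", "*_names.xml", "*_mets.xml"]


def select_minimum(files, patterns=MINIMUM_PATTERNS):
    """Split an item's file list into files to mirror and files to skip."""
    keep, skip = [], []
    for name in files:
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            keep.append(name)
        else:
            skip.append(name)
    return keep, skip


# Hypothetical file listing for a single item.
item_files = ["item1_jp2.zip", "item1.pdf", "item1_meta.xml", "item1.gif", "item1_names.xml"]
keep, skip = select_minimum(item_files)
print(keep)  # ['item1_jp2.zip', 'item1_meta.xml', 'item1_names.xml']
print(skip)  # ['item1.pdf', 'item1.gif']
```

Keeping the patterns in one shared list would let every mirroring node apply the same rule and store the same subset.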
Each partner will also agree to upload their content into IA.
Wolfgang: Europe would have to create a reduced file of what we have to give to IA, plus an additional zip file with all of our extra information.
There are two parts to synchronization: the content and the catalog. Currently, MOBOT and Au replicate their catalogs between each other and then get files from IA.
Wolfgang: Can you put articles into IA?
Yes, you would just have each article as an individual item.
Abel: Each member is expected to run their own collection in addition to having IA as an additional content repository so other nodes can get that content. That way each node can also have their own special content associated with the scans for their own collection.
Noha: For each partner that is adding new information, maybe we can list those file types (like pagination files) and figure out a common format where this additional data is placed and find another place outside of IA where this information is synchronized.
John: If IA is the central repository, then we need to find a way to put all information into IA that we need.
Mike: IA can't handle any structure within the files.
John: If the relationships within the files are important, we need to find some way to define those relationships for IA, such as through METS.
Graham: The three institutions that agreed to keep master copies of files need to decide what the minimum requirements of file types are.
Martin: That group in consultation with the larger group can decide what type of pointer system we should use to define hierarchy.
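One way to record hierarchy outside IA's flat file list is a METS structural map, where nested `div` elements point at page files. The sketch below is a minimal illustration using Python's standard library; the `TYPE` values, labels, and file IDs are assumptions, and a complete METS document would also need a file section in which those file IDs are declared.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_struct_map(item_label, articles):
    """Build a METS tree whose structMap nests articles inside one item.

    articles is a list of (title, [page file ids]) tuples.
    """
    root = ET.Element(f"{{{METS_NS}}}mets")
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap", TYPE="logical")
    item_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", TYPE="volume", LABEL=item_label)
    for title, page_ids in articles:
        article_div = ET.SubElement(item_div, f"{{{METS_NS}}}div", TYPE="article", LABEL=title)
        for file_id in page_ids:
            # Each fptr points at a page file by its (hypothetical) ID.
            ET.SubElement(article_div, f"{{{METS_NS}}}fptr", FILEID=file_id)
    return root


# Hypothetical volume containing two articles.
root = build_struct_map("Volume 1", [("Article A", ["p1", "p2"]), ("Article B", ["p3"])])
print(ET.tostring(root, encoding="unicode"))
```

Because the result is just another file, it could be uploaded alongside the scans and ignored by IA, in the same way as the extra files discussed above.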
Patricia: If you work with groups that don't have good bandwidth, you need to make sure that the countries can download whole copies of the repositories for local access to content.
Summary:
1) IA is used right now. Each member is responsible for uploading to IA.
2) We will look into having a copy of the minimum files from IA at SIL, Egypt, and Europe (and the Cluster). This group will meet to decide on minimum standards for files to download from IA and pointer files into IA (Synchronization Group: Mike L., John Migneault, Chris Sleep, Noha, someone at SI).
3) Must still be aware that technology is changing so we might have to look for other solutions in the future.
Questions: Which files do we keep?
When and how do we synchronize?
How do we synchronize the curated metadata?
Global Replication:
How can a site know what book is on which site?
How can a site request a copy of a book from another site?
Could sites implement a web service that provides a query and access to their content in a certain standard to check if a site has a book?
Our networks will be synchronized, but there is a time delay in the synchronization. Also, the systems are not synchronized yet, so what do we do in the meantime until they are?
Joel: The system must be built to accommodate the fact that when looking for a book you might not find it, or you might not be able to retrieve it.
Do we need these replication services right now?
Ely: Yes. I'm constantly getting people asking me about why they can't find all content within all sites.
John: Who is this for? Is it for users or are we talking about this as an underlying mechanism for replication? At a machine level, not talking about user access, we need a system that asks another node if they have content and then tells it to get that content.
Mike: Each node will need a search API (can have a single API that each node is responsible for implementing) that allows sites to determine which titles are held.
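The idea of a single API that every node implements can be sketched as a shared interface plus a lookup across nodes. Everything named here (`NodeSearch`, `holds`, `InMemoryNode`, `locate`) is hypothetical; a real version would sit behind a web service, and its protocol was not defined at this meeting.

```python
from typing import Iterable, List, Protocol


class NodeSearch(Protocol):
    """The one operation every node would agree to implement."""

    name: str

    def holds(self, identifier: str) -> bool:
        ...


class InMemoryNode:
    """Stand-in for a node; a real one would answer over the network."""

    def __init__(self, name: str, identifiers: Iterable[str]) -> None:
        self.name = name
        self._identifiers = set(identifiers)

    def holds(self, identifier: str) -> bool:
        return identifier in self._identifiers


def locate(identifier: str, nodes: Iterable[NodeSearch]) -> List[str]:
    """Return the names of all nodes that report holding the title."""
    return [node.name for node in nodes if node.holds(identifier)]


nodes = [
    InMemoryNode("node-a", ["book-1", "book-2"]),
    InMemoryNode("node-b", ["book-2"]),
]
print(locate("book-2", nodes))  # ['node-a', 'node-b']
print(locate("book-3", nodes))  # []
```

An empty result is a normal outcome here, which matches the earlier point that the system has to tolerate a book not being found.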
Global replication is a second priority. Synchronization is our primary priority.
Synchronization group of SIL, MOBOT, Europe, Egypt will be the group responsible for determining how global replication will happen.
Redirection
Can we implement redirection to another node that holds the book based on different things like geolocation, current availability or current load on servers?
Noha: Are we all sharing the same code or will each institution have its own code?
Joel: We can all use our own code as long as we agree on the same protocol. To redirect, you'd have to know if the node holds the book. Load balancing and availability are various use cases. If three or four nodes have a complete copy of the files, you can check any of those four to find the material. This falls back to the synchronization question.
Graham: We are talking about the exception rather than norm. This is a fail-over approach. If a particular node can't help the user, the node can go to a different server to find the information.
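The fail-over behaviour described here can be sketched as a selection rule: use the preferred node when it is up and holds the book, otherwise fall back to the least-loaded node that does. The node fields (`holdings`, `available`, `load`) and the function `pick_node` are assumptions made for the example.

```python
def pick_node(identifier, nodes, preferred=None):
    """Choose a node to serve identifier, or return None if no node can."""
    candidates = [
        node for node in nodes
        if node["available"] and identifier in node["holdings"]
    ]
    if not candidates:
        return None
    for node in candidates:
        if node["name"] == preferred:
            return node["name"]
    # Fail over to the least-loaded node that holds the book.
    return min(candidates, key=lambda node: node["load"])["name"]


# Hypothetical node status, as a redirection service might track it.
nodes = [
    {"name": "node-a", "holdings": {"book-1"}, "available": True, "load": 0.8},
    {"name": "node-b", "holdings": {"book-1", "book-2"}, "available": True, "load": 0.2},
    {"name": "node-c", "holdings": {"book-1"}, "available": False, "load": 0.0},
]

print(pick_node("book-1", nodes, preferred="node-a"))  # node-a
print(pick_node("book-1", nodes, preferred="node-c"))  # node-b (node-c is down)
print(pick_node("book-3", nodes))                      # None
```

The rule depends on knowing each node's holdings, which is why redirection falls back to the synchronization question.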
We will not worry about redirection now but will keep going with node synchronization.
Consistency/Code Share/Shared Metadata Standard
Consistency
Maintaining the consistency of files. What procedures should be followed when a consistency error is found?
The Synchronization Group (SIL, Egypt, Europe, MOBOT) will look into this question of consistency.
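One common way to detect consistency errors is to compare checksum manifests between a reference copy and a node's copy. The sketch below assumes each side can produce a filename-to-checksum mapping; the choice of SHA-256 and the function names are illustrative, not something the group has decided.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Checksum of a file's contents (passed in as bytes for this example)."""
    return hashlib.sha256(data).hexdigest()


def compare_manifests(reference, local):
    """Compare two {filename: checksum} manifests and report the differences."""
    missing = sorted(set(reference) - set(local))
    extra = sorted(set(local) - set(reference))
    mismatched = sorted(
        name for name in reference.keys() & local.keys()
        if reference[name] != local[name]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Hypothetical manifests for one item.
reference = {
    "a.xml": sha256_of(b"alpha"),
    "b.xml": sha256_of(b"beta"),
    "c.xml": sha256_of(b"gamma"),
}
local = {
    "a.xml": sha256_of(b"alpha"),
    "b.xml": sha256_of(b"BETA"),   # content differs from the reference
    "d.xml": sha256_of(b"delta"),  # node-specific extra file
}

print(compare_manifests(reference, local))
# {'missing': ['c.xml'], 'extra': ['d.xml'], 'mismatched': ['b.xml']}
```

Files reported as extra are not necessarily errors, since nodes may keep their own additional files; what to do about missing and mismatched files is the procedural question the group has to answer.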
Code Share
How to share and benefit from local developments.
Could we consider co-development?
Fabio: We could share our code on GitHub. Each node will have issues with different types of code, different platforms, etc., but it's a good start to get our stuff out there.
We need to make all our code available first and then we can take a look at everyone's code and see what makes sense for each node to take.
Bianca: Can we create a description of the platform each node is built on?
Noha: Our platform will most likely be Java Ruby
Get one version control repository set up to which all members can contribute their code, so that we can all see what everyone has and share applications.
Noha: We are breaking our site up into different modules so that others can use our applications. Egypt is sharing their code via GitHub.
Action Item: Move all BHL code into GitHub, including Macaw. Wolfgang will look into this. GitHub link will be on the gBHL website.
Shared Metadata Standard
What metadata standard can we all agree to use for data exchange? How would each site maintain different formats?
A standard metadata format is a requirement.
We could say METS but we'll have to define a schema/profile for the METS.
Ely: Yes we need a standard but we need someone to tell us what our options could be.
Joel: There's already METS in use, so let's use this as a starting point and we'll move forward with standardizing from there.
METS does not come with a standard profile, which is why Europe didn't use it (they use OLAF).
William: Is it feasible to do a crosswalk from OLAF to METS?
Wolfgang: Yes, if we know the profile. BHL-Europe therefore needs to be involved in the discussion on the standard metadata profile.
Action Item: Group to develop the standard metadata profile: Simon from Au, SIL (person to be named), Wolfgang from BHL-Europe, Egypt (person to be named), Trish Rose-Sandler from MOBOT (requested). This group needs to work closely with the Synchronization Group.