BHL Cookbook
Or, Making Digital Libraries the BHL way
1. Collection decisions
- BHL partners
- Selection: ingest from open sources; criteria for ingesting non-BHL-member contributed content from the Internet Archive, also known as "Ingest" (see Ingest Criteria Revised)
- Requests from the public
- Deduplication (see DeDuplication)
- "Orange bag": Citebank providers; Bianca's Orange Bag
2. Scanning
- Specifications
- For content providers: BHL Digitization Specifications.doc (current version); superseded versions include BHL_Digitization_Specs.doc, BHL_Digitization_Specs v2.doc, BHL_Digitization_Specs_2008.doc, BHL_Digitization_Specs_2009.doc, BHL_Digitization_Specs.pdf, BHL_Digitization_Specs_20090520.pdf, and BHL-E_2pt1_20090805.pdf
- https://docs.google.com/Doc?docid=0ASGG0YxTuAn2ZGd2anZ2a3pfMmZzOXZybg&hl=en
- Documenting the scanning specifications for the content we already have in BHL (IA & MOBOT scanning specs)
- Formats/derivative files (does this relate to Digital Imaging Specifications?)
3. Metadata
- Serials
- For scanning
- Bound-with issues (Bound-withs)
4. Synchronization
- IA
- BHL sites (MoBot, Woods Hole, London, Australia?)
- Local systems
- Other projects
- OCLC Synchronization
5. Communication
- Email lists
- Wikis (public-facing and private)
- Face-to-face meetings
6. Delivery (Portal)
- Metadata: how to search; MODS; pagination; merging
- Identifiers: DOIs
- Deaccessioning: "go dark" (why, identifiers, notifications); removed (why, identifier, notifications)
7. Policies: BHL's open access and reuse (Tom, pointing to the public wiki documentation)
2.2 Data Needed Before Scanning Starts
In order to select titles to scan and to deliver scanned books that can be found, displayed, and navigated by end-users, you must have metadata, and lots of it. At a minimum you must be able to supply both a MARC bibliographic record and related item-level information, such as volume or issue dates, for the item you wish to scan. Below, we address data needs for two kinds of scanning: that done with/by the Internet Archive, and that done by the library with the intention of supplying the scans to Internet Archive.
In either case, you will need for this recipe:
-One standard Integrated Library System, or other source of MARC records*
-A way to get data out of the ILS, including Z39.50
-Unique identifiers for each item you wish to scan, preferably Barcodes
-One Systems Librarian (optional, but recommended)
The before-scanning IA process in brief: when a book is shipped to an IA scanning center, a person called a "book loader" will search your catalog for the book in hand. They then download the MARC record for the book to the IA servers, assign a "collection" to it, add item metadata, and put it in the queue to be scanned.
For scanning with Internet Archive:
To scan with Internet Archive, you will need to work with them to set up your library as a data source for retrieving MARC records. You must have an ILS that is capable of allowing Z39.50 searches, and a Z39.50 index for whatever unique identifier you are using for your items, preferably barcode, but any unique identifier for a title - such as OCLC number, LCCN, ISBN, ISSN, or local control number will work.
IA staff will need to have detailed information on your Z39.50 connection before your library can be set up as a data source. Hopefully you have a Systems Librarian, or other support staff that can supply this.
You will then need to supply the unique identifier with each book that will be scanned, so that the bookloader can retrieve the correct MARC record with a minimum of error. This is why we recommend using the barcode as the unique identifier when possible: the bookloader can simply scan your barcode to bring up the correct MARC record.
Good. Now they have your MARC record; all they need is the item-specific data.
In order to supply the correct item-level metadata for your book, BHL and IA devised a system to "push" metadata into the IA scanning software using a URL string containing the data. We call this "WonderFetch". WonderFetch eliminates the need for the IA bookloader to manually type in copyright, volume, or other item-level information. Just as your MARC record goes into a file at IA, the item data also goes into a file, called foo_meta.xml.
Please see the WonderFetch page for instructions on what to put in the URL that will push the item data to IA. Most BHL partners send their IA scanning center a spreadsheet or "Packinglist", much like an invoice, that lists all the books in a particular shipment and contains a WonderFetch link for each item.
How you get the item data, and how you construct the URL will vary according to your ILS and how much you can do with data from your ILS.
Possible solutions for creating WonderFetch links include:
-creating a spreadsheet (instructions on how to do this are provided on the WonderFetch-Instructions page) and manually entering the item data for each book you wish to scan
-programmatically exporting data from your ILS (say, from circulation records or canned searches) and using it to populate a spreadsheet which then creates the links
-pulling item-level data directly from your ILS via a database query
-pulling item-level data from a database and creating a web page that includes WonderFetch links created on the fly (SIL does this)
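As a rough illustration of the spreadsheet/script approaches above, here is a minimal Python sketch that builds a WonderFetch-style link from one row of item data. The base URL and parameter names below are placeholders, not the real ones: the actual URL format and required fields are documented on the WonderFetch-Instructions page.

```python
from urllib.parse import urlencode

# Placeholder base URL; the real endpoint is documented on the
# WonderFetch-Instructions wiki page.
WONDERFETCH_BASE = "https://example.archive.org/wonderfetch"

def wonderfetch_link(item):
    """Build a WonderFetch-style link from one row of a packinglist.
    The parameter names here are hypothetical; substitute the ones
    from the WonderFetch documentation."""
    params = {
        "identifier": item["barcode"],     # unique item identifier
        "volume": item.get("volume", ""),  # item-level enumeration
        "copyright": item.get("copyright", ""),
    }
    # Drop empty values so the link stays short.
    params = {k: v for k, v in params.items() if v}
    return WONDERFETCH_BASE + "?" + urlencode(params)

row = {"barcode": "39088001234567", "volume": "v. 3 (1897)"}
print(wonderfetch_link(row))
```

A script like this can run over every row of a packinglist spreadsheet to populate the WonderFetch column automatically.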
For Scanning at Home and sending data to Internet Archive:
If you are thinking of doing this, hopefully you have a programmer, developer, or extremely tech-y librarian on staff, because you can't possibly create all the necessary metadata by hand. Or you probably could, but it would take much too long.
Files you need to send IA for each book include:
-foo_orig_jp2.tar (all the images you have scanned, in JPEG2000 format)
-foo_MARC.xml (your MARC record, in MARCxml)
-foo_meta.xml (your item-level data)
-foo_scandata.xml (the page-level data)
-foo_files.xml (the list of image files included in your .tar of images)
Images
All images should be JPEG2000, compressed 85% (if you wish - this is how much IA compresses theirs, and yes, this is the copy of record for IA. Don't give me that look.)
Images must be named foo_0000.jp2, foo_0001.jp2, and so on, where foo is your unique IA identifier. How to decide and create the unique ID (it must be unique within IA) is up to you. All images must be named with a 4-digit sequential number to maintain the page order. (See "what do I do if I missed a page?" below.)
Images should be rolled into a .tar before uploading, and that .tar named foo_orig_jp2.tar.
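The naming and packaging steps above can be sketched in a few lines of Python, assuming your page images are already in scan order:

```python
import tarfile
from pathlib import Path

def package_images(identifier, image_paths, out_dir="."):
    """Rename scanned images identifier_0000.jp2, identifier_0001.jp2, ...
    and roll them into identifier_orig_jp2.tar, as IA expects.
    'identifier' is your unique IA identifier."""
    tar_path = Path(out_dir) / f"{identifier}_orig_jp2.tar"
    with tarfile.open(tar_path, "w") as tar:
        for seq, src in enumerate(image_paths):
            # 4-digit zero-padded sequence number preserves page order.
            arcname = f"{identifier}_{seq:04d}.jp2"
            tar.add(src, arcname=arcname)
    return tar_path
```

tarfile and pathlib are in the standard library, so this runs anywhere Python does; the images themselves must already be JPEG2000 files.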
MARC
This is pretty easy. Just send the MARCxml for the record. There are plenty of programs that will transform your MARC to MARCxml.
Item data
See the WonderFetch page for the mandatory item data. Essentially, though, everything in that _meta.xml list is mandatory, aside from IP/Copyright, title identifier, and volume/issue when they are not relevant.
Page data
Please see this example of a bare-bones scandata.xml file.
Mandatory elements on this page are:
-"bookData" must have the unique identifier and the count of total images (must match the # of image files)
-"pageData" must have leafNum, pageType, addToAccessFormats (this controls if the page displays or not), origWidth and origHeight, cropbox w and h, handSide (Right or Left for Recto and Verso), and bookStart=true for the title page
The first page's leafNum value must match the first image filename's enumeration: if you start your file naming with 0000, your first leafNum should be 0.
<page leafNum="1">
  <pageType>Cover</pageType>
  <addToAccessFormats>true</addToAccessFormats>
  <origWidth>4698</origWidth>
  <origHeight>6270</origHeight>
  <cropBox>
    <x>0</x>
    <y>0</y>
    <w>4698</w>
    <h>6270</h>
  </cropBox>
  <bookStart>true</bookStart>
  <handSide>RIGHT</handSide>
</page>
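A generator along these lines can produce the page-level records programmatically. The page-level element names follow the example above; the bookData child names (bookId, totalLeafs) are guesses for illustration, since only the identifier and image-count requirements are stated, so verify them against a real scandata.xml before use:

```python
import xml.etree.ElementTree as ET

def build_scandata(identifier, pages, title_leaf=0):
    """Sketch of a minimal scandata.xml. 'pages' is a list of
    (width, height, page_type) tuples in scan order."""
    book = ET.Element("book")
    book_data = ET.SubElement(book, "bookData")
    # NOTE: "bookId" and "totalLeafs" are hypothetical element names.
    ET.SubElement(book_data, "bookId").text = identifier
    ET.SubElement(book_data, "totalLeafs").text = str(len(pages))  # must match image count
    page_data = ET.SubElement(book, "pageData")
    for leaf, (w, h, page_type) in enumerate(pages):
        page = ET.SubElement(page_data, "page", leafNum=str(leaf))
        ET.SubElement(page, "pageType").text = page_type
        ET.SubElement(page, "addToAccessFormats").text = "true"
        ET.SubElement(page, "origWidth").text = str(w)
        ET.SubElement(page, "origHeight").text = str(h)
        crop = ET.SubElement(page, "cropBox")
        for tag, val in (("x", 0), ("y", 0), ("w", w), ("h", h)):
            ET.SubElement(crop, tag).text = str(val)
        if leaf == title_leaf:
            ET.SubElement(page, "bookStart").text = "true"
        # Recto/verso alternate; this assumes the first leaf is a right-hand page.
        ET.SubElement(page, "handSide").text = "RIGHT" if leaf % 2 == 0 else "LEFT"
    return ET.tostring(book, encoding="unicode")
```

The leafNum values here start at 0, matching a file-naming scheme that starts at 0000.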
2.3 Quality:
To assure that the digital manifestations of the items scanned and added to the Biodiversity Heritage Library match the quality of their physical counterparts.
Steps:
In order to ensure that scans of items sent for scanning by the Biodiversity Heritage Library match the quality of the physical items, BHL staff developed quality assurance policy and procedures. This involves QA'ing 100% of a sample of books from a given shipment cart.
The QA process begins once a cart of books is returned from scanning. Upon receipt of book trucks, the invoice sent along with the returned cart is matched against the cart itself. If items are missing, the scanning center is contacted and provided with a list of missing (or excess) items. Staff then document missing/excess items (documentation of missing or excess items is maintained by each individual institution; there is no BHL-wide documentation of this). All invoices/manifests of items sent and returned are stored by each individual institution for future audit purposes.
QA is done on a statistical sampling of books in the returned shipment, using either the PDF or the flip book on the IA site. The number of books chosen for QA is based on the number of books in the returned cart; BHL staff use a NISO Standards Chart that details how many books in a given shipment should be QA'd. This sampling procedure is also used by IA during their QA process. There is no documented method for choosing which books are part of the sample; staff often attempt to identify and prioritize for QA those books on the returned cart that look like they may be problematic (foreign language, bound-with monograph, etc.).
In general, a shipment will fail QA if an error rate of more than 2% is found. For QA purposes, BHL staff differentiate between "major" and "minor" fails of items QA'd (see QA Policy documentation). If a shipment exceeds the 2% error rate, the entire shipment (minus any books that passed QA) is sent back to the scanning center, after which the scanning center can decide either to QA the entire cart or to rescan it.
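The pass/fail rule above reduces to a one-line check; the sample sizes themselves come from the NISO chart, which is not reproduced here:

```python
def shipment_passes_qa(books_checked, books_failed, threshold=0.02):
    """Apply the 2% rule to a QA'd sample: a shipment fails QA if more
    than 2% of the sampled books have errors."""
    if books_checked == 0:
        raise ValueError("no books were QA'd")
    return (books_failed / books_checked) <= threshold

# 2 failures in a 100-book sample is exactly 2%, so the shipment passes;
# a third failure pushes it over the threshold and the cart goes back.
```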
The procedure for checking a physical item in hand against the digital manifestation of the item on IA involves both a metadata check and a scan-images check.
For metadata checks, staff make sure that the metadata for the item on IA correctly reflects the item in hand. If the metadata is attached to the wrong item, staff notify the scanning center, who should in turn be able to fix the problem. If QA is being done on material that was scanned more than a month ago, metadata should be corrected by hand in the portal via the editing tool.
For scan images checks, staff simply check each page of the physical book while clicking along with the flip-book or PDF on IA (i.e. turn a page, click the scan. repeat for entire volume). Staff are particularly looking for missing pages, unreadable text, text that has been cut off, the scanner's hand turning the page in the shot, anything that the camera might have caught that obscures the text, etc. If any problems such as these are discovered, the books are put aside and sent back to IA on the next shipment for correction. If the problem involves only a few pages, IA will simply rescan the pages in question and insert them in place of the offending pages in the scan file. If the problem is with the entire scan file (all pages are too light/too blurry), the entire book is rescanned. The corrected books are then returned to the BHL partner institution.
Documentation:
Wiki:
QA+Procedures (QA Procedures outlined)
QA+Policy (QA Policy Outlined)
QA+sampling+chart (NISO chart)
Results:
By performing consistent QA on returned carts, BHL staff have seen a decrease in the number of QA errors. The policy of returning entire carts with more than a 2% error rate prompted our scanning partner to perform extensive QA on materials as they are scanned, a step that had previously been lacking. As a result, many errors are found before books are returned to BHL partners, and the number of carts returned for QA failures has significantly decreased.
Lessons learned:
When performing QA, it is important to have the physical item in-hand. Page numbers often lie or hide plates within otherwise numerically correct spans.
For any pages that seem too light, off-color, etc., download and check those pages in the PDF. Occasionally, staff have found that pages that appear missing (or too light to read) in the flip-book are in fact present and readable in the PDF. Furthermore, ensure that the OCR'd text in the PDF for suspect pages is there and more or less correct (it is OCR, after all).
Bottom line: if there seems to be something wrong with the flip-book, double-check the PDF against the book in hand, page by page.
It is important to have a methodology in place that allows for certain QA corrections to be made without having to rescan the entire item. Often, the problem with a book involves only a few pages (a few pages missing, text obscured on a few select pages, etc.). Rather than having to rescan an entire book for only a few pages, develop a procedure that allows a few pages to be rescanned and inserted into the existing scan file in place of the "bad" page scans.
It is important to rely on user feedback for QA. It is impossible for staff to find all QA problems with all materials in BHL, but users will find these problems. However, it is important that digital libraries have a means of dealing with QA problems that may be discovered by staff or users long after the item has been scanned. This involves working out an agreement with scanning partners that allows books for which errors are found long after scanning is complete to be sent back for correction.
3.2. DeDuplication:
To minimize the scanning of the same material by more than one BHL partner.
Tools and Specifics:
DedupingTools
Method:
We determined that monograph and serial workflows are different, and developed two different tools for the two processes.
Overall Lessons Learned: Deduping before scanning is extremely difficult. There will always be duplication in partial or complete title runs. Acceptance of duplication needs to be discussed and agreed upon by the group, and the cost of trying to reach a given level of deduplication needs to be a factor: it might be easier and cheaper to deal with duplication after the fact than to slow down the process of finding material to scan. Methods need to be in place for dealing with duplication, both intentional and unintentional. The results of merging material from various scanning venues, and of making unneeded duplicates dark, need to be thought through fully. Issues include the establishment of persistent URLs/URIs, credit and branding for the library that did the scanning, and the discovery of titles from more than one major access point (material cataloged as a monographic separate, as part of a series, and as a volume of a serial), etc.
Google outlines issues they have in deduplication in the blog post about counting the number of books that exist.
http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html
- Does this mean that there are 600 million unique books in the world? Hardly. There is still a lot of duplication within a single provider (e.g. libraries holding multiple distinct copies of a book) and among providers -- for example, we have 96 records from 46 providers for “Programming Perl, 3rd Edition”. Twice every week we group all those records into “tome” clusters, taking into account nearly all attributes of each record.
- When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them.
- We put even less trust in the “free-form” attributes such as titles, author names and publisher names. For example, are “Lecture Notes in Computer Science, Volume 1234” and “Proceedings of the 4th international symposium on Logical Foundations of Computer Science” the same book? They are indeed, but there’s no way for a computer to know that from titles alone. We have to deal with these differences between cataloging practices all the time.
- We tend to rely on publisher names, as they are cataloged, even less. While publishers are very protective of their names, catalogers are much less so. Consider two records for “At the Mountains of Madness and Other Tales of Terror” by H.P. Lovecraft, published in 1971. One claims that the book it describes has been published by Ballantine Books, another that the publisher is Beagle Books. Is this one book or two? This is a mystery, since Beagle Books is not a known publisher. Only looking at the actual cover of the book will clear this up. The book is published by Ballantine as part of “A Beagle Horror Collection”, which appears to have been mistakenly cataloged as a publisher name by a harried librarian. We also use publication years, volume numbers, and other information.
- Our handling of serials is still imperfect. Serials cataloging practices vary widely across institutions. The volume descriptions are free-form and are often entered as an afterthought. For example, “volume 325, number 6”, “no. 325 sec. 6”, and “V325NO6” all describe the same bound volume. The same can be said for the vast holdings of the government documents in US libraries. At the moment we estimate that we know of 16 million bound serial and government document volumes. This number is likely to rise as our disambiguating algorithms become smarter.
3.2.1 Monographic DeDuplication
Monographs:
Before scanning, a list of titles for potential scanning is created by the BHL member. The monographic deduper maintains a master monographic dataset against which spreadsheet lists of titles to be scanned can be compared. A comparison is done, and the titles are then added to the master dataset. As more lists are added, the master dataset grows, indicating the titles that the institutions have chosen to scan.
Functions:
Upload a list of titles with appropriate fields to match against the growing database of materials already determined to be scanned.
Monograph+Dedup+Tool
De-Duper Instructions:
Step by Step Process of Using the DeDuper.doc
Requirements:
Designed to ingest packlists/picklists in Excel (.xls) format. This tool now requires picklists to contain the following column header names (not including punctuation marks): "Local Number", "OCLC", "Title", "Author", "Volume", "Chronology", "Call Number", "Publisher", "Publisher Place". These are the standard names that we agreed to. Note that your picklist can contain other information, but the tool will ignore it. If duplicates are found, you'll be able to see which institution scanned the item, and when, and picklists can be edited online and downloaded as a .csv file.
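The matching step behind such a tool can be sketched as follows. This is not the DeDuper's actual code, but a hedged illustration of matching first on OCLC number and then on normalized title/author, using the agreed column names:

```python
# Two of the standard picklist headers are used for the fallback match.
def normalize(value):
    """Crude normalization (lowercase, collapse whitespace); real matching
    needs more care, as the Lessons Learned below make clear."""
    return " ".join(str(value).lower().split())

def find_duplicates(master_rows, picklist_rows):
    """Compare a new picklist against the master dataset.
    Matches first on OCLC number, then on normalized Title+Author.
    Each row is a dict keyed by the standard column header names."""
    by_oclc = {normalize(r["OCLC"]): r for r in master_rows if r.get("OCLC")}
    by_title = {(normalize(r["Title"]), normalize(r["Author"])): r for r in master_rows}
    dupes = []
    for row in picklist_rows:
        if row.get("OCLC") and normalize(row["OCLC"]) in by_oclc:
            dupes.append((row, "OCLC"))
        elif (normalize(row["Title"]), normalize(row["Author"])) in by_title:
            dupes.append((row, "title/author"))
    return dupes
```

As the Lessons Learned note, OCLC numbers are not a guaranteed match, so even a cascade like this leaves residual duplication to handle after scanning.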
Lessons Learned:
The method in place turned out to be very cumbersome: it lacks sophisticated filtering and requires a lot of manual work. The current method does not handle titles that are cataloged as monographic separates but carry a serial/series title that also needs to be compared. It became evident that metadata is inconsistent across the institutions, and OCLC numbers are not a guaranteed match. The tool is not integrated into the after-scanning process for updating and linking to the portal for material scanned. Each institution had different workflows and had to adapt them to make this work.
3.2.2 Serials Deduplication
Serials – Serials are a complex challenge to keep track of in a mass scanning project. Theoretically you could spend a month gathering together tens of volumes of a serial title, only to find, as you go into the scanning process, that another library is ready to scan the same title. Deduplicating serials before any work is duplicated is therefore essential.
Functions: Originally developed, at the Natural History Museum London using CakePHP, now maintained by the Natural History Museum Vienna, the BHL Scanlist (former names: Bidlist, Union List of Serials, Mashup) contains all of the serials bibliographic and holding records of the participating and associated libraries.
http://bhl.nhm-wien.ac.at/scanlist/
In 2007, partner libraries provided MARC dumps of their serials records and holding statements. From 119,377 records, matching was performed using OCLC numbers, ISSNs, and title (245), out of which 70,764 unique titles were identified.
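That matching cascade (OCLC number, then ISSN, then 245 title) can be sketched as a simple clustering pass. The field names here are illustrative, and note that a sketch this naive will not join a record carrying only an ISSN with one matched by OCLC number; the real load needed more sophistication:

```python
def match_serials(records):
    """Cluster serial records on OCLC number, then ISSN, then normalized
    245 title, in that order of trust. Each record is a dict with
    hypothetical keys 'oclc', 'issn', and 'title_245'."""
    clusters = {}
    for rec in records:
        key = (
            ("oclc", rec["oclc"]) if rec.get("oclc")
            else ("issn", rec["issn"]) if rec.get("issn")
            else ("title", " ".join(rec["title_245"].lower().split()))
        )
        clusters.setdefault(key, []).append(rec)
    return clusters
```

Run over the combined MARC dumps, a pass like this is what collapses 119,377 records toward a smaller set of unique titles, with manual review for the records that only match on free-form titles.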
Each library can access the Scanlist through a username and password. Once logged in, you can search for individual titles. You can merge records of titles held by multiple institutions into a master record, and you can place an “intention to scan” statement on a title which you intend to scan, with the ability to add free text notes to an intention to scan “bid”. A bid can be marked “started, complete, or hold” as the scanning workflow progresses.
Requirements: It is necessary to be able to comfortably make a statement that you intend to scan volumes of a serial title in a public (within the project) manner, so participating libraries will not duplicate your efforts. Future needs are for the Scanlist or GRIB to be fully conversant with the BHL admin and Gemini functionality, as such connections will enhance the overall efficiency of the BHL serials scanning workflow.
Lessons learned: The Scanlist has been a successful, functional tool since 2007, and it accomplishes much of what it was designed to do. Serials are complex and unpredictable, with titles and enumeration methods not always constant; this requires a constant level of hands-on care in the successful operation of the tool.
4.4.1 Title: OCLC BHL Synchronization
GOAL: To represent holdings from BHL (OCLC symbol=BHLMR; MARC code=DcWaBHL) in OCLC as digital manifestations, with potential reuse of the metadata from the OCLC digital manifestations to take advantage of a gift from Bowker of ISBNs for monographs in English.
Steps:
BHL exported all monographs and serials from BHL member institutions and sent the titles, OCLC numbers and a suggested mapping to OCLC.
Titles that do not have OCLC numbers were not included.
Duplicates detected by OCLC were skipped.
Documentation:
Wiki:
BHL+OCLC+Synchronization and connected pages.
Results:
Not completely inclusive of all BHL material in OCLC: only titles that already had OCLC numbers were loaded. There are still no ISBNs, nor a workflow for submitting titles to Bowker for ISBNs, and the gift covered only a limited number of English-language titles.
Lessons learned:
OCLC numbers are the only identifier workable for matching and synching with OCLC.
If we have titles that do not have OCLC numbers, the mapping to OCLC for batch loading will need to be reviewed; metadata for non-OCLC-numbered records needs to include the minimum data elements for creating the digital manifestation record in OCLC.
OCLC still has not struck a deal with Bowker to submit titles for ISBNs. ISBNs for digital manifestations of original print works are also still under discussion in the standards community; they have been needed for a potential print-on-demand workflow and for money tracking. As yet, no decision has been made and no discussion has taken place.
5.1 Communication: Email
BHL Members and Non members can join the overall discussion list hosted by American Museum of Natural History:
biodivheritagelibrary@lists.amnh.org
BHL Europe has a technology group at
BHLe-tech@googlegroups.com
BHL Staff (worker bees) have a discussion listserv hosted by Smithsonian:
BHL-STAFF@si-listserv.si.edu
Locally: Smithsonian staff have an internal list to reach Library staff involved with BHL: SILBHLTaskforce
Discussions of interest that members of BHL monitor include:
tdwg@lists.tdwg.org
A list with very low traffic for official TDWG announcements from the executive committee
http://lists.tdwg.org/mailman/listinfo/tdwg
tdwg-content@lists.tdwg.org
A list for all non technical discussions covering all TDWG working groups / standards
http://lists.tdwg.org/mailman/listinfo/tdwg-content
tdwg-tag@lists.tdwg.org
The technical architecture group list that will host discussions about the technical details of standards
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
5.2 Feedback:
To allow users to submit feedback to BHL and allow BHL staff to collect, organize, and resolve user feedback as well as maintain communication with users submitting feedback.
Steps:
BHL implemented a "user feedback" button on the BHL website. This button is located in three places on our website: 1) in the website header under the title "Feedback;" 2) at the title level of content in BHL (if a user accesses a title in BHL, on the bibliographic screen for that title is a button labeled "Report an Error," which links to the feedback form); and 3) at the item level of content in BHL (if a user accesses the scan images of an item in BHL, there is an icon above the scan images that links to the feedback form). This feedback form has two divisions depending on the type of feedback a user wants to submit. Users can either submit feedback about questions/comments/problems they notice with content, or they can submit feedback for scan requests (i.e. items that users want BHL to scan). <For more information on Requests, please see related item in Cookbook>. The feedback form also includes a field in which users can enter their email address. All correspondence with the user will occur via that email address.
BHL implemented the use of an issue tracking system typically used in the software development industry called "Gemini." When users submit feedback on BHL, this feedback is entered into Gemini as a new "Issue" or "Ticket." Each "Issue" can be individually opened, and there are a variety of functions staff can perform on that "Issue" to respond to or resolve the issue. (Example: Staff can mark the status, resolution, and type of issue. Staff can also assign issues to other staff members, leave comments on the issues indicating the work they are doing in relation to the issue, and link related issues in Gemini). Gemini has an email notification feature that allows staff to be notified by email when they are assigned to an issue or when an issue they are assigned to is updated in some way.
In order to maintain communication with users submitting feedback, BHL implemented the use of a BHL gmail account. This account, with the address feedback@biodiversitylibrary.org, is used to send all correspondence to users submitting feedback. This correspondence is sent to the email address the user provides in the submitted feedback. The user will receive several emails from BHL regarding their feedback. The BHL website automatically sends an email to users when they initially submit their feedback. This is a standard email response form. Then, when staff begin dealing with the submitted feedback, they will send an email to the user letting him or her know what steps will be taken to resolve the issue. Should it be necessary throughout the life of the issue, staff will maintain communication with the user to keep him or her informed of the progress being made.
When staff have completed all necessary tasks associated with a user-submitted feedback, the issue is marked as "closed" and "complete" in Gemini, thus removing it from the list of active issues.
Documentation:
Wiki:
Gemini+Feedback+Tracking (For basic documentation on BHL and Gemini)
Boiler+plate+responses+for+responding+to+specific+issues (documentation of some standard responses sent to users regarding their feedback).
Results:
The use of Gemini for tracking and responding to user submitted feedback has been a great success. Users typically receive a response from a staff member regarding their feedback the day it is received. Gemini has proven to be an extremely useful tool for collaborating among BHL's many partner institutions. User feedback is rarely resolved with the involvement of one institution. It often requires many institutions working together to resolve these issues. Gemini serves as a platform by which BHL staff can communicate and collaborate to resolve an issue. Because Gemini allows for email notifications, it is very easy for staff to stay updated with any issues assigned to them.
Gemini has also proven very useful as we at BHL plunge into the rocky waters of gap fills. With only some institutions still scanning, there are often difficulties in scanning gap fills that are held only by institutions that are no longer scanning. Through Gemini, BHL staff have developed a workflow by which institutions that are no longer scanning can FedEx their needed volumes to institutions that are scanning, so that these items can be scanned and added to the collection. The communication and logistics for this process are handled via Gemini.
Lessons learned:
The use of an issue tracking system is very important for keeping user-submitted feedback organized and efficiently dealt with. It is imperative that any issue tracking system used has an email notification function that will inform staff when they are assigned to an issue or when issues they are assigned to are updated. Without this notification, staff are often unable to keep track of all their issues, or they are unaware when another staff member asks them to perform a new task.
Furthermore, the success of using an issue tracking system is completely dependent on the involvement of staff. If staff do not respond to their issues, or if they do not complete the tasks required of them to resolve an issue, the entire system will fail. Thus, it is imperative to obtain staff buy-in when using an issue tracking system for responding to user-submitted feedback.
Issue tracking seems to work best if there is an initial point person who reviews all submitted feedback and assigns it to the appropriate people. It is important for this person to also monitor the progress of all issues. BHL staff have often encountered situations where the people initially assigned to an issue cannot perform the necessary task associated with it. If a point person is not keeping an eye on these developments, the issue may never be assigned to a new person who perhaps can perform the necessary task. In other words, there is the danger that issues will become orphaned unless someone is watching to ensure that there is always a responsible party associated with each one.
It is also important to train staff in the workings of the issue tracking system. This training was somewhat neglected at BHL, and as a result it took more time to obtain staff buy-in because staff did not know how the system worked. Providing a clear demonstration of the navigation and features of the system at the beginning of the process can not only help to obtain staff buy-in, but can also diminish the initial learning curve, resulting in faster responses to feedback from the very beginning of the implementation of such a system.
5.3 Wikis:
To facilitate communication and collaboration with both staff and the public.
Steps:
BHL employs the use of both a private and a public wiki. The private wiki is used for communication and collaboration among staff. The public wiki is used as a means of communication with the public.
Private Wiki: (More should be added here!)
BHL Private Wiki
uses the Wikispaces software. We pay a fee to keep the wiki from being publicly available, which allows for communication that is seen only within the BHL community. The wiki administrators are very open about granting membership to anyone interested in participating.
The wiki pages are used for collaborative editing and for collecting data from all members. Community editing has been most successful for agenda building and tracking decisions.
Wikispaces allows uploading of files. Wiki pages link to uploaded files so that we can share static versions of documents and data.
Pages link to one another and out of Wikispaces when appropriate. Best practice is to avoid "orphan pages" that lack a link to either the table of contents or a master page. Tags are applied to pages to facilitate locating them.
Lessons Learned:
BHL found that the discussion tab has not been very successful, as few people monitor those areas of the wiki.
The wiki grew quickly and has now become confusing. It is difficult to know when a page contains data that is outdated or has been revised elsewhere. Archiving the pages has become too cumbersome. We would like to mark some pages as irrelevant and are investigating the deletion function as an archiving mechanism on a page-by-page basis.
Public Wiki:
BHL staff wanted a means by which information about the project, project members, the collection, technical developments, affiliated BHL projects, and help/tutorials could be communicated to the public. Staff wanted whatever method was employed to be easy to update and accessible to a variety of staff, so that it would not depend on the technical team to implement any required changes or updates. Much of this information had been posted on static pages associated with the BHL portal itself, and, because only the technical team could update these pages, much of it was outdated. Thus, a wiki was deemed an appropriate way to post this information for the public so that it could be easily updated when necessary.
BHL formed a Public-Facing Wiki Committee and, using Wikispaces, created a public wiki site. The committee decided to create pages on this public wiki that would include both the information then housed on the BHL site itself and any additional information, such as help/tutorials, that was desired. The wiki was thus composed of the following categories: "About," "BHL FAQ," "Help/Tutorials," "Developer Tools and API," "Documentation," "Licensing and Copyright," "Permissions," "BHL Social Networking Sites," "Events," and "Contact and Feedback." The members of the Public-Facing Wiki Committee and the technical development team were added as administrators of the public wiki, allowing these individuals to make any changes to it. The corresponding pages were then linked to the BHL portal, replacing the static pages that once housed this information.
Documentation:
Wiki:
public+facing+wiki (Public Facing Wiki Page on Private Wiki)
Results:
The use of the public wiki seems a good choice for posting information for the public that changes frequently. The wiki has proven useful for answering many of the questions staff receive as feedback from users, such as how to take advantage of certain BHL features like downloading a custom PDF. A single point of reference to which users can be directed is extremely helpful.
There have still been some minor problems with wiki "ownership." While the Public-Facing Wiki Committee members take responsibility for fixing or updating things brought to their attention, there is currently no real effort in place to continually add to the public wiki. Discussions have begun about placing responsibility for maintaining the wiki on a single individual, though no definite plans are in place.
Questions also arise about public users who request membership to the wiki. BHL has not yet decided whether public users can be granted membership, since membership would also allow them to edit pages.
Lessons learned:
It is helpful to post information for the public that frequently changes on a platform that is editable by many people, not just the technical development team. In this respect, wikis are a good choice.
If the use of a wiki for sharing information with the public is undertaken, it should be decided early-on whether staff will allow public users to become members of that wiki, thus giving them the ability to edit pages.
It is a good idea to form a committee responsible for the creation and maintenance of the wiki, but it is also important to appoint a single person as the "lead" of the wiki, who takes on greater responsibility for regular updates and for adding new information as needed. This increases the efficiency of the wiki.
6.3 Pagination:
To improve the pagination of items in BHL for both user interface and citation resolving involving machine-to-machine communication.
Steps:
Initial pagination of scanned items is done at the time of scanning. This pagination is very minimal and is partly manual and partly automated. The scanning operator can select a range of pages and choose a pagination style (e.g. Roman numerals) and increment amount, and the scanning software will then assert page numbers based on these specifications. Page types can also be asserted (e.g. text, illustration, foldout, title page). The default page type is text, and operators usually indicate only text, title page, and occasionally illustrations or foldouts.
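The range-plus-style mechanism described above can be sketched as follows. This is an illustrative example only, not the actual scanning software; the style names, function signature, and record layout are assumptions. Given a range of scanned page sequences, a starting number, a style, and an increment, labels are generated for the whole range.

```python
# Illustrative sketch of scan-time pagination: select a page range, a
# numbering style, and an increment, and generate labels for the range.
# Names and signatures are hypothetical, not the real scanning software's API.

def to_roman(n):
    """Convert a positive integer to a lowercase Roman numeral."""
    vals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
            (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
            (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for v, sym in vals:
        while n >= v:
            out.append(sym)
            n -= v
    return "".join(out)

def assert_page_numbers(first_seq, last_seq, start_number, style="arabic", increment=1):
    """Return (scan sequence, page label) pairs for pages first_seq..last_seq."""
    labels = []
    number = start_number
    for seq in range(first_seq, last_seq + 1):
        label = to_roman(number) if style == "roman" else str(number)
        labels.append((seq, label))
        number += increment
    return labels

# Front matter paginated as i-iv, then the body starting over at page 1:
print(assert_page_numbers(1, 4, 1, style="roman"))
print(assert_page_numbers(5, 8, 1))
```

The automation is exactly this shallow: it can count, but it cannot know that a given scan is a plate or that an issue starts there, which is why the portal-side corrections described below became necessary.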
BHL staff realized that the pagination of items coming into BHL was often incorrect (what is listed as page 5 is not page 5). Furthermore, even when it was correct, it was so minimal as to often be unhelpful. Illustrations or plates are often not indicated as such, and instead display to users simply as "text." When they are marked as "illustrations," plate or figure numbers are usually not given, so users have to scroll through the entire list of illustrations to find the particular plate they need. This presented particular problems with regard to PDF generation, during which users are completely dependent on the pagination provided for an item when selecting their pages. Additionally, many volumes in BHL are bound with many issues or contain many different articles. There is no indication of the start and stop of these issues or articles when pagination is performed at the time of scanning, which proved problematic for user navigation. BHL staff realized that in order to meet user needs, manual pagination on the portal side (i.e. after scanning and ingestion into BHL) must be performed, with particular attention to correcting or adding pagination where there was none, and to asserting plate numbers, issue starts and stops, and articles.
The technical development team at BHL created a pagination interface through which staff can manually update the pagination of an item. This interface allows staff to select page types (as many as needed), assert page or plate numbers, indicate where issues and/or articles start and stop, ascribe volume, year, and issue numbers to an entire range of pages, and add "Prefixes" to pages that display descriptions to users in the portal (e.g. a Prefix containing the abbreviation of an article title at the start of that article in the volume).
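The kind of per-page record such an interface edits might look like the following minimal sketch. All field names here are hypothetical, not BHL's actual data model; the point is that a single page can carry several types, a page number, a plate number, and display prefixes at once, and that attributes such as volume, year, and issue can be applied to a whole range.

```python
# A minimal, hypothetical sketch of a per-page metadata record like the one
# the BHL pagination interface edits. Field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class Page:
    sequence: int                                     # physical order in the scanned item
    page_types: list = field(default_factory=list)    # e.g. ["Text", "Illustration"]
    page_number: str = ""                             # printed page number, if any
    plate_number: str = ""                            # e.g. "Plate 7"
    prefixes: list = field(default_factory=list)      # extra display labels
    volume: str = ""
    year: str = ""
    issue: str = ""

def apply_to_range(pages, first, last, **attrs):
    """Set the given attributes on every page whose sequence is in [first, last]."""
    for page in pages:
        if first <= page.sequence <= last:
            for name, value in attrs.items():
                setattr(page, name, value)

pages = [Page(sequence=i) for i in range(1, 6)]
apply_to_range(pages, 1, 5, volume="12", year="1898", issue="3")  # whole-range attributes
pages[2].page_types = ["Illustration"]                            # per-page corrections
pages[2].plate_number = "Plate 7"
print(pages[2])
```

Being able to apply volume/year/issue to a range while correcting individual pages is what makes the portal-side workflow tolerable despite its labor intensity.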
BHL staff began manually correcting pagination on an as-needed basis using this portal interface. Because the process was extremely labor-intensive, a minimal amount of content was manually paginated, and often pagination was performed only to the level of articulating plates. During a face-to-face staff meeting in November 2009, SIL staff gave a demonstration of pagination processes to the rest of the BHL staff as a basic how-to. In initial discussions regarding pagination policy, it was remarked that using interns, and eventually crowdsourcing, would be ideal for manual pagination.
The importance of pagination became increasingly apparent as staff received more and more feedback about pagination problems. Furthermore, it was discovered that pagination procedures were inconsistent among BHL staffers, and many of the processes being performed were not actually compatible with the citation resolving and machine-to-machine communication that relied on the standard insertion of pagination information. Thus, many pagination efforts by staff were actually "breaking" the citation resolving component of BHL. In order to standardize pagination procedures across BHL and ensure that pagination processes did not conflict with other aspects of BHL, staff developed a "[[file/detail/Pagination How To.doc|Pagination How-To]]" guide that not only clearly outlines the steps needed to paginate, but also asserts the "rules" by which this information should be provided.
Documentation:
Wiki:
Pagination+ST+Plan (Initial Policy Discussions)
files/Pagination+How+To.doc (Pagination How-To Guide)
Pagination (Pagination Discussion Page)
Results:
The creation of the Pagination How-To guide has greatly improved pagination procedures at BHL. Not only has a concrete policy been established, but a complete document is available for staff who wish to learn how to paginate. This reduces the amount of manual training required for new staff who wish to begin paginating.
While some institutions have staff and interns devoted to pagination full-time, manual pagination at BHL is still only being done on a small scale, and often in response to user feedback. Large-scale intern projects involving pagination, or allowing crowd-sourcing for pagination, while still in discussion, have not been developed.
Lessons learned:
Pagination is an extremely labor-intensive process. The more pagination that can be performed at the time of scanning, the better.
It is essential that a clear, consistent pagination policy and process be developed early on in any digital library project. This policy must be developed with the help of technical staff who understand how any machine-to-machine communication might use the pagination provided for an item. Any staff who wish to begin paginating must have access to, and agree to comply with, the policies set forth in the pagination documentation.
6.4 Merging:
To ensure that there is only one entry for each unique title in BHL.
Steps:
Each institution that scans a title for BHL has a separate entry for that title ingested into BHL. This means that if three different institutions scan groups of volumes from the same title (e.g. one institution scans volumes 1-10, another 11-20, and the third 21-30), that title will have three separate entries in BHL. Thus, when users search for that title, they will get three search result hits. BHL staff want to ensure that users get only one search result hit when they search for a title in BHL. To accomplish this, staff must perform manual merges of titles. This involves choosing one of the entries to serve as the "parent" or "master" record into which all other entries of the title will be merged. BHL staff decided that all duplicate volumes resulting from a title merge would be kept and simply sequenced together in the list of volumes available for a title. This decision was reached because staff did not have the time to go through each available copy of a volume to determine the "best" one; thus, all copies would be retained.
BHL technical staff developed an interface on the administration side of the portal to allow staff to perform title merges. The title merge process was originally envisioned as a primary-secondary relationship, in which one title would be marked as the primary record and all other titles merged into it would be secondary titles, not searchable on the user side of BHL. However, staff soon found this to be problematic when considering another relationship in BHL – monographs and serials. Items scanned in BHL often need to be associated with two records – a serial record of which they are a part and a corresponding monographic record. However, if the relationship among titles was viewed as primary and secondary, where only one record could be primary for a given item and only primary titles were searchable in the general search on BHL, items could not be discovered by users under both the serial and monographic manifestation of the item. Thus, in order to allow items both to be associated with more than one bibliographic record in BHL and to be searchable under either, BHL technical staff changed the applied technique from primary-secondary to searchable-unsearchable.
In the case of simple title merging, one title entry in BHL would be chosen as the master entry into which all duplicate entries would be merged. BHL technical staff added an option to each title entry of “published” or “unpublished.” All duplicate entries were then marked as “unpublished” in BHL, thus ensuring that users would receive only one search result hit for that title.
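The published/unpublished mechanism for a simple merge can be sketched as follows. This is a hedged illustration, not BHL's actual schema or code: duplicate entries are flagged unpublished, and the public search filters on that flag, so only the master record comes back as a hit.

```python
# A minimal sketch (not BHL's actual schema) of the published/unpublished
# flag used in a simple title merge: duplicate entries are marked
# unpublished so the public search returns only the master record.

def unpublish_duplicates(titles, duplicate_ids):
    """Mark duplicate title entries unpublished after a merge."""
    for title in titles:
        if title["id"] in duplicate_ids:
            title["published"] = False

def search(titles, query):
    """Public search: only published titles are returned."""
    return [t for t in titles if t["published"] and query.lower() in t["name"].lower()]

titles = [
    {"id": 10, "name": "Journal of Botany", "published": True},  # chosen as master
    {"id": 11, "name": "Journal of Botany", "published": True},  # duplicate entry
    {"id": 12, "name": "Journal of Botany", "published": True},  # duplicate entry
]
unpublish_duplicates(titles, {11, 12})
print([t["id"] for t in search(titles, "journal of botany")])  # only the master remains
```

Note that the duplicate records are retained rather than deleted, which matters for the redirect and re-ingest problems described below.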
In the case of items that needed to be associated with more than one title record, the merge process would be performed, associating the item with multiple title records (see the how-to for detailed instructions), but all title records would then be left as "published" on BHL. This way, users could find the same item under either its serial or monographic title.
Staff soon discovered another problem with the merging process. Individuals or institutions that had linked directly to a title in BHL under an entry that was subsequently merged into another entry were not being taken to the updated, merged record containing all volumes of the title. Instead, they were taken to the old manifestation of the record, which did not contain all volumes available after the merge. To address this problem, BHL technical staff implemented a "replaced by" field. When a merge is performed, the title ID of the record chosen as the master is entered into the "replaced by" field of all duplicate records. Then, when users access the "old" link, BHL automatically re-routes them to the new, complete merged record.
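The "replaced by" re-routing might look like the following sketch. The records and function are hypothetical; the key detail is following the pointer chain, since a master record could itself be merged into another record later.

```python
# Hypothetical sketch of resolving an incoming title link through the
# "replaced by" field: a request for a merged title is re-routed to the
# master record, following chains in case the master was itself merged later.

def resolve_title(titles_by_id, title_id):
    """Follow "replaced by" pointers until reaching a record with no replacement."""
    seen = set()
    while title_id in titles_by_id:
        record = titles_by_id[title_id]
        replacement = record.get("replaced_by")
        if replacement is None or title_id in seen:  # live record, or a cycle guard
            return title_id
        seen.add(title_id)
        title_id = replacement
    return title_id

titles_by_id = {
    11: {"published": False, "replaced_by": 10},  # old, merged entry
    10: {"published": True, "replaced_by": None},  # master entry
}
print(resolve_title(titles_by_id, 11))  # an old link re-routes to the master, 10
```

Putting this lookup in front of the title page means externally bookmarked URLs keep working after any number of merges.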
Not surprisingly, staff soon discovered yet another problem created by the merge process. Since title merging might occur at any stage in the scanning process, institutions may continue scanning items under titles that have already been merged into other records and "unpublished." When this happens, any items scanned and ingested into the portal under an "unpublished" title after a merge never show up for public viewing in BHL. To rectify this, BHL technical staff implemented a script that takes the "replaced by" field of a title record into account and ensures that items subsequently added to unpublished, merged titles are added to the master, published title and are thus available for viewing on BHL.
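A cleanup script of the kind described might be sketched as follows. The data shapes are assumptions, not BHL's actual implementation: each item ingested under an unpublished, merged title is re-pointed at the master title named in the "replaced by" field, so it becomes visible on the public site.

```python
# Hedged sketch of the post-ingest cleanup: items attached to an unpublished,
# merged title are moved to the published master named in its "replaced by"
# field. Record shapes are illustrative assumptions.

def reparent_items(items, titles_by_id):
    """Point each item at the published master of its (possibly merged) title."""
    moved = 0
    for item in items:
        title = titles_by_id.get(item["title_id"])
        # Walk the chain in case the master was itself merged later.
        while title and not title["published"] and title.get("replaced_by"):
            item["title_id"] = title["replaced_by"]
            title = titles_by_id.get(item["title_id"])
            moved += 1
    return moved

titles_by_id = {
    11: {"published": False, "replaced_by": 10},  # merged away before ingest finished
    10: {"published": True, "replaced_by": None},  # master
}
items = [{"id": "v.21", "title_id": 11}]  # volume scanned after the merge
reparent_items(items, titles_by_id)
print(items[0]["title_id"])  # the item now belongs to the published master, 10
```

Run after each ingest, such a script keeps late-arriving volumes from vanishing into unpublished records.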
Documentation:
Wiki:
Merging How-To
Primary+and+Secondary+Title+Discussion (Primary/Secondary Discussion)
Standards+for+Bibliographic+Items (Bibliographic Description Policy)
Results:
The creation of a “Title Merging” how-to document has been extremely valuable for BHL staff. It provides a single point of reference that all staff can look to when performing merges and reduces the amount of in-person training that needs to be performed.
While there is a standardized title merging process in place that works, it is still often difficult to find the titles that need to be merged. Staff often stumble across title merge needs by accident or through feedback submitted by users. Additionally, title merging is often never "complete" because scanning and ingesting occur continuously. A title that has only one entry in BHL one day, or that had previously been merged to one entry, may have several entries the next when additional items are ingested into the portal. Thus, as long as scanning and ingesting are occurring, title merging is an ongoing process.
As described in the steps, many unforeseen problems related to title merging have required additional steps to be implemented in order to make title merging work "correctly." Staff will continue to monitor title merging work to catch any additional unforeseen problems that may still present themselves.
Lessons learned:
It is important to have documentation of how to perform merges, particularly as additional steps are added to the process to address newly discovered problems.
It is also important to keep in mind that other individuals or institutions may link to items in a digital library collection. If the title link that an institution uses is merged into another title, in a sense "deactivating" that link, the outside link will no longer work. Digital libraries must take measures to ensure that users following old or merged title links are correctly routed to the new ones.
Another important factor to consider is that title merging might occur at any stage in the scanning process. If an institution's title entry in BHL is merged into another title but the institution continues to scan items under its original entry, these items may never show up in the digital library. Steps must be taken to ensure that any volumes scanned after a merge occurs still become available to the public.
Finally, BHL staff rely a great deal on user feedback for notification of needed title merges. It is impossible for staff to find all needed title merges in BHL, but users will find them. Having a means of collecting feedback is important for many reasons, including notification of needed title merges.