
Gemini background emails

From: Mike Lichtenberg [mailto:mike.lichtenberg@mobot.org]

Sent: Thursday, March 10, 2016 3:44 PM

To: Adam L. Chandler <alc28@cornell.edu>; Suzanne Pilsk (PilskS@si.edu) <PilskS@si.edu>; Bianca Crowley (CrowleyB@si.edu) <CrowleyB@si.edu>; Matthew Person (mperson@mbl.edu) <mperson@mbl.edu>; Susan Lynch (slynch@nybg.org) <slynch@nybg.org>; 'dduncan@fieldmuseum.org' <dduncan@fieldmuseum.org>; 'm.loran@nhm.ac.uk' <m.loran@nhm.ac.uk>

Subject: Questions about Discovery Tools Gemini issues related to holdings info

All,

I am working on a much longer email concerning how I think the Gemini issues related to holdings information can be addressed, but in the meantime I have the following questions and comments…

First, a comment. Several of the Gemini issues mention needing dates in ISO 8601 format, and specify that format as YYYY-MM-DD. I just wanted to point out that ISO 8601 does not require YYYY-MM-DD; YYYY-MM and simply YYYY are also valid. This is important because many of the volumes in BHL do include only a year value.
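To make the point concrete, here is a rough Python sketch (not BHL code; the function name is mine) of a consumer accepting all three ISO 8601 granularities, so a bare year remains valid:

import re

# Accept the three ISO 8601 calendar-date granularities: YYYY, YYYY-MM, YYYY-MM-DD.
ISO_8601_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")

def parse_iso_date(value):
    """Return (year, month, day), with month/day as None when absent."""
    match = ISO_8601_DATE.match(value.strip())
    if not match:
        raise ValueError("Not an ISO 8601 calendar date: %r" % value)
    year, month, day = match.groups()
    return int(year), int(month) if month else None, int(day) if day else None

for example in ("1874", "1874-03", "1874-03-10"):
    print(example, "->", parse_iso_date(example))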

Next, a couple questions…

The last line of Gemini Issue 57281 reads “We need to investigate what to do with serial gaps”. I guess that means that “Determine how BHL will represent serial gaps in KBART exports” is an actionable item, and a separate Gemini issue… but who should it be assigned to? Who will make such a determination? Do we need to do that?

For the next question, let’s assume that we come up with a way to represent and identify holdings information for inclusion in KBART. Also assume that newly ingested materials may not always immediately have the necessary holdings metadata. As an example, let’s say that for a particular journal, BHL has volumes 3-10. Then, volume 2 is ingested but is missing necessary metadata, so its place in the holdings cannot be determined (unless it is manually updated). Is it OK to state that volumes like our hypothetical volume 2 will simply not be included in KBART holdings data?

Thanks,

MIKE




From: Mike Lichtenberg [mailto:mike.lichtenberg@mobot.org]

Sent: Friday, March 11, 2016 1:34 PM

To: Adam L. Chandler <alc28@cornell.edu>; Suzanne Pilsk (PilskS@si.edu) <PilskS@si.edu>; Bianca Crowley (CrowleyB@si.edu) <CrowleyB@si.edu>; Matthew Person (mperson@mbl.edu) <mperson@mbl.edu>; Susan Lynch (slynch@nybg.org) <slynch@nybg.org>; 'Diana Duncan' (dduncan@fieldmuseum.org) <dduncan@fieldmuseum.org>; 'm.loran@nhm.ac.uk' <m.loran@nhm.ac.uk>

Subject: Discovery Tools - KBART Holdings data

Many apologies in advance for the lengthy email. Unfortunately, this isn't a topic that is easily summarized.

I have been looking at the four Gemini issues (57280, 57281, 57282, 57284) related to holdings information... that is, dates and volumes of the first and last issues online.

BACKGROUND

This is a hard problem to solve. The accuracy of the holdings information is dependent on the parsing of the Volume values of 57,085 volumes from 3,981 serials in BHL. Those volume values can come in many forms. How well those values are parsed affects the ability of automated processes to keep serial volumes in the correct sequence. If they are not correctly sequenced, then the "first" and "last" volumes cannot be determined. Oh, and by the way, new data arrives weekly.

Some of the problems with parsing the volume values are due to how the data is formatted at the source (Internet Archive), and some are our own doing.

BHL has adopted a standard format for the volume value that works well in the user interface and for end users. On the other hand, it is not at all good for the auto-sequencing algorithm that attempts to put volumes in their proper order, or for the extraction of holdings information, and it is also very bad for the operation of BHL's OpenURL resolver. To be fair, many of the volume values at Internet Archive are not optimal for automated parsing either; it is unfortunate that BHL's standard format could have improved the "parsability" of the original values, but did not.

In a moment, I will use OpenURL to illustrate a point, so for anyone who is not familiar with it: OpenURL is a way to encode a description of a resource, such as a book or article, in a URL in order to help users access that resource. For example, http://www.biodiversitylibrary.org/openurl?&genre=book&title=Contributions+from+the+United+States+National+Herbarium&date=2008&volume=56&spage=10 will take users directly to page 10 of volume 56 of Contributions from the United States National Herbarium. In a way, it is a standardized way of submitting a search to BHL.
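As a rough illustration (plain Python, not the resolver code), the URL above is nothing more than key/value metadata encoded into a query string:

from urllib.parse import urlencode

BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"

params = {
    "genre": "book",
    "title": "Contributions from the United States National Herbarium",
    "date": "2008",
    "volume": "56",   # note: the resolver receives just "56", not "v. 56" or "v. 56 (2008)"
    "spage": "10",
}

print(BHL_OPENURL + "?" + urlencode(params))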

Back to the issues with Volume values... for an example of the problems, consider the following item:

https://archive.org/details/annualreport1st108geol
http://www.biodiversitylibrary.org/item/103543

In IA, the Volume value for this item is "v. 8", and in BHL this value has been standardized to "v. 8 (1874)".

To understand why the standardized value is bad for technologies that need to read and interpret the volume information, consider that a request for this volume via OpenURL will specify volume = "8". Not "v. 8", not "vol 8", not "v. 8 (1874)"... just "8". Because a year value from the 1800's has been added to almost EVERY volume for this title (see http://www.biodiversitylibrary.org/bibliography/15810), they all include an "8". Since that is true, the OpenURL resolver can no longer easily narrow down its search to just the "true" volume 8. It is tempting to argue that such a volume string is easily parsed (after all, the OpenURL resolver just needs to ignore the date in parentheses, right?), but consider this brief list of BHL volume values:

12
new ser.:no.49 (1904)
(1892)
1st 1881
29-Feb-80
no.918
Zoology v.11:pt.3=pt.33 (1884)
9th (1887)
vol. 1901
text
no.18
1945
Narrative v.1:pt.2 (1885)
v.89=no.609-622 (1915-1917)

Some of those reflect how the volume appears at Internet Archive, and some have been modified by BHL staff. Either way, it is not realistic to expect a parser to handle that variety of volume formats acceptably. And while I used OpenURL as the example here, the same volume parsing problems will affect the determination of holdings information.
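To illustrate just how badly a simple rule fares, here is a quick Python sketch (illustrative only) that tries to pull a volume number out of the values listed above with a naive pattern:

import re

# Naive "volume number" extractor: a leading "v." or "vol." followed by digits.
SIMPLE_VOLUME = re.compile(r"^v(?:ol)?\.?\s*(\d+)", re.IGNORECASE)

samples = [
    "12", "new ser.:no.49 (1904)", "(1892)", "1st 1881", "29-Feb-80",
    "no.918", "Zoology v.11:pt.3=pt.33 (1884)", "9th (1887)", "vol. 1901",
    "text", "no.18", "1945", "Narrative v.1:pt.2 (1885)",
    "v.89=no.609-622 (1915-1917)",
]

for value in samples:
    match = SIMPLE_VOLUME.match(value)
    print("%-40r -> %s" % (value, match.group(1) if match else "NO MATCH"))

# Most of the values do not match at all, and even the "matches" can be wrong:
# "vol. 1901" yields 1901, which is a year, not a volume.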

PROPOSED CHANGES

So, with that as a background, the question that is relevant to Discovery Tools is how do we get and maintain holdings information for existing and future BHL materials?

I think a number of changes are necessary to the data model, the data ingest procedures, the OpenURL resolver, the data exports, the OAI feeds, and the BHL APIs. In addition, there will necessarily be additional work required by BHL staff to maintain the data. A real concern is that this will be a significant amount of work, no matter what can be done automatically. New reports will need to be added to help identify and complete the maintenance work.

Following are the specifics of what I have in mind. I am not sure if each of these should be turned into Gemini issues immediately, or if further discussion (with the Tech Team or EC, maybe?) needs to happen first.

1) Add the following fields to the Item table. This will allow for more specific detail about each item to be recorded in a more useful format. Currently, this detail is embedded within the Volume values:

StartVolume - this should contain JUST a volume designation (e.g. "8")
EndVolume - this should contain JUST a volume designation (e.g. "8")
EndYear
StartIssue
EndIssue
StartNumber
EndNumber
StartSeries
EndSeries
StartPart
EndPart

With these fields in place, a volume of "v.89=no.609-622 (1915-1917)" will be represented by the following values in the Item table:

Volume (existing field) - v.89=no.609-622 (1915-1917)
StartVolume - 89
Year (existing field) - 1915
EndYear - 1917
StartNumber - 609
EndNumber - 622

As you can see, the existing Volume field in the Item table will continue to hold the value shown in the UI. However, that field will no longer be used for machine processes.
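To make item 1 concrete, here is a rough sketch of the proposed fields as a Python structure (the field names are from the list above; the types, and the class itself, are just my illustration):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemHoldings:
    Volume: Optional[str] = None       # existing display value, e.g. "v.89=no.609-622 (1915-1917)"
    Year: Optional[int] = None         # existing field
    StartVolume: Optional[str] = None
    EndVolume: Optional[str] = None
    EndYear: Optional[int] = None
    StartIssue: Optional[str] = None
    EndIssue: Optional[str] = None
    StartNumber: Optional[str] = None
    EndNumber: Optional[str] = None
    StartSeries: Optional[str] = None
    EndSeries: Optional[str] = None
    StartPart: Optional[str] = None
    EndPart: Optional[str] = None

# The worked example from above:
example = ItemHoldings(
    Volume="v.89=no.609-622 (1915-1917)",
    Year=1915, EndYear=1917,
    StartVolume="89",
    StartNumber="609", EndNumber="622",
)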

2) Update the OpenURL resolver, APIs, OAI feeds, and data exports to use StartVolume and the other new fields when appropriate.

3) Update the ingest process to attempt to parse elements of the Volume values into each of the new fields. [Note that this new parsing process will NOT be foolproof, and therefore manual work will be needed to validate/correct newly ingested data. This manual work will be similar to the work done now to normalize the Volume values... but now there will be more fields to update.]
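One possible shape for that best-effort parse (again, a sketch, not the actual ingest code) is a set of narrow patterns that only fill a field when they recognize it, leaving everything else blank for manual review:

import re

VOLUME_PART = re.compile(r"v\.?\s*(\d+)(?:\s*-\s*(\d+))?", re.IGNORECASE)
NUMBER_PART = re.compile(r"no\.?\s*(\d+)(?:\s*-\s*(\d+))?", re.IGNORECASE)
YEAR_PART = re.compile(r"\((\d{4})(?:\s*-\s*(\d{4}))?\)")

def parse_volume_string(raw):
    """Return only the new Item fields that can be extracted from the raw string."""
    fields = {}
    m = VOLUME_PART.search(raw)
    if m:
        fields["StartVolume"] = m.group(1)
        if m.group(2):
            fields["EndVolume"] = m.group(2)
    m = NUMBER_PART.search(raw)
    if m:
        fields["StartNumber"] = m.group(1)
        if m.group(2):
            fields["EndNumber"] = m.group(2)
    m = YEAR_PART.search(raw)
    if m:
        fields["Year"] = int(m.group(1))
        if m.group(2):
            fields["EndYear"] = int(m.group(2))
    return fields

print(parse_volume_string("v.89=no.609-622 (1915-1917)"))
# -> {'StartVolume': '89', 'StartNumber': '609', 'EndNumber': '622',
#     'Year': 1915, 'EndYear': 1917}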

4) Change how the ingest process sequences volumes in a serial. Instead of attempting to insert newly ingested volumes into the proper place in the sequence of existing volumes, assign all newly ingested items a sequence value of 10000, effectively putting them at the end of the list of volumes. [This means that new volumes will always need to be put into their proper place manually. On the other hand, it will no longer be possible for newly ingested items to mess up the ordering of all of the existing volumes.]

5) Create scripts to automatically clean up the existing data as well as possible. They should be created with the assumption that they may be used periodically (once a month/quarter/year), though they might end up being used only once (when the new fields are added to the Item table).

6) Create the following new reports to aid in identifying and cleaning up serial volumes:
A list of Items that have Volume values, but no values in the new Volume/Number/Issue/Series/Part/Date fields
A list of Titles (Serials) with "unsequenced" volumes
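As a sketch of what the first of those reports might look like (pure illustration; the data access is hand-waved and the field names follow item 1):

# Report (a): items whose Volume string has not been parsed into any of the new fields.
NEW_FIELDS = ("StartVolume", "EndVolume", "EndYear", "StartIssue", "EndIssue",
              "StartNumber", "EndNumber", "StartSeries", "EndSeries",
              "StartPart", "EndPart")

def needs_parsing(item):
    return bool(item.get("Volume")) and not any(item.get(f) for f in NEW_FIELDS)

# Hypothetical rows; in practice these would come from a query against the Item table.
items = [
    {"ItemID": 103543, "Volume": "v. 8 (1874)"},
    {"ItemID": 1, "Volume": "v.89=no.609-622 (1915-1917)", "StartVolume": "89"},
]
print([i["ItemID"] for i in items if needs_parsing(i)])   # -> [103543]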

7) Create a report to support "Gap Filling" activities. [Gemini issue 57284 mentions reports to support "Gap Filling". I think I need an explanation of what exactly "Gap Filling" is, as well as details about what information is needed on a report. I think I understand, and I think that the other changes I have described will support identifying gaps... but I need a little more detail to be sure about what is needed.]

8) Create a process to produce KBART data exports. [The new fields in the Item table will support the extraction of Holdings information that is needed to complete the KBART data.]
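For reference, a minimal sketch of what producing that export could look like (the column names follow the KBART recommended practice; the example row uses made-up coverage values, and the real data would come from the first and last sequenced volumes of each serial):

import csv, sys

KBART_COLUMNS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "num_first_vol_online", "num_first_issue_online",
    "date_last_issue_online", "num_last_vol_online", "num_last_issue_online",
    "title_url", "first_author", "title_id", "embargo_info",
    "coverage_depth", "notes", "publisher_name",
]

def write_kbart(serials, out=sys.stdout):
    """Write one tab-delimited KBART row per serial; missing columns are left empty."""
    writer = csv.DictWriter(out, fieldnames=KBART_COLUMNS, delimiter="\t",
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for serial in serials:
        writer.writerow(serial)

# Illustrative row only; note that bare years are acceptable (ISO 8601) KBART dates.
write_kbart([{
    "publication_title": "Example Journal of Natural History",
    "date_first_issue_online": "1874",
    "num_first_vol_online": "8",
    "date_last_issue_online": "1917",
    "num_last_vol_online": "89",
    "coverage_depth": "fulltext",
}])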

NEW WORKFLOW

With the proposed changes in place, I anticipate the new workflow for adding a new volume to a serial to be the following:

1. Volume ingested from Internet Archive
a. It is placed at the end of the list of volumes in the serial.
b. The volume string is parsed as well as possible, and the appropriate metadata fields in BHL are populated.
c. The volume is NOT included in holdings information.
2. A BHL staffer pulls a report listing items that need to be sequenced. (The new volume will be on the report.)
3. The BHL staffer updates the item.
a. The item is placed in the proper place in the list of volumes in the serial.
b. If necessary, the metadata fields are updated with the proper information from the volume string.
c. The volume string that is displayed in the public BHL UI is updated to the standard format.
d. The volume now appears in holdings information.

Hopefully this hasn't overwhelmed everyone too much. Let me know your thoughts, questions, comments.

Thanks,

MIKE