BHL Harvesting, IA Updating Dilemma
Problem Statement:
We have reason to suspect there may be a disconnect between BHL's harvesting methodology and IA's page insertion or file updating practice. Please see our specific concerns and examples outlined below.
We have encountered the following specific concerns regarding IA scan updating and BHL harvesting:
- When IA inserts missing pages into a previously scanned book, it does so under the same identifier as the original scan. BHL appears to harvest the new scans but not the updated pagination, so the page numbers no longer match the page images being displayed.
- When IA rescans and inserts the same pages several times, BHL does not always appear to harvest the most recent update to the scan, or does not link to the most recently updated file on IA. The images displayed on BHL are therefore not the correct images that are displayed on IA (please see Example 1 below).
- Suppose IA rescans a book under the same identifier as a previous scan because a page was missing in the first scan, and the rescan then misses a different page. The page count is unchanged, which in theory should not be a problem for BHL, but are there implications on BHL's side? Does this affect how the pagination displays, or cause other issues?
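To make the first concern concrete, here is a minimal sketch of how page labels can drift when they are stored by sequence position and a leaf is later inserted. We do not know how BHL actually stores or harvests pagination; the positional pairing, the function names, and all of the data below are assumptions made purely for illustration.

```python
from itertools import zip_longest

def label_by_position(leaves, labels):
    """Pair each page image with the label stored at the same position.

    Positions that have no stored label are paired with None.
    """
    return list(zip_longest(leaves, labels))

def mislabeled(leaves, labels, truth):
    """Return the leaves whose positional label disagrees with the true label."""
    return [leaf for leaf, label in label_by_position(leaves, labels)
            if leaf is not None and truth.get(leaf) != label]

# Hypothetical data: the first scan lacked page 34, so only three labels
# were recorded. The rescan under the same identifier now has four leaves.
rescanned_leaves = ["img_p32", "img_p33", "img_p34", "img_p35"]
stale_labels = ["32", "33", "35"]
truth = {"img_p32": "32", "img_p33": "33", "img_p34": "34", "img_p35": "35"}

pairs = label_by_position(rescanned_leaves, stale_labels)
wrong = mislabeled(rescanned_leaves, stale_labels, truth)
```

In this sketch the inserted leaf picks up the label "35" and the final leaf is left with no label at all, which is the kind of mismatch described in the examples below. Note that a check based only on page count would not catch the third concern, where the count stays the same but a different page is missing.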
Below is a list of concerns we have regarding any changes that might be made to the harvesting process in order to rectify the above issues:
- We want to ensure that any solutions that might be implemented will not eliminate or overwrite any manual edits (pagination, merging, volume enumeration, etc.) that have been made to items on BHL.
- We want to ensure that any solutions that might be implemented will not affect the validity of any persistent URLs that users may already have linked to.
Examples:
Example 1 (SIL):
- Die Crustaceen des südlichen Europa: Crustacea Podophthalmia. Mit einer Übersicht über die horizontale Verbreitung sämmtlicher europäischer Arten (1863)
This book was first scanned on April 21 and sent back on May 14, 2009 due to a missing page. The pages were inserted (or rescanned?) on June 13.
According to our notes, the original scan was missing plate VII (recto) and the explanation for plate VII (verso of plate t.p.). It was sent back for page insertion.
From the meta.xml file:
<updatedate>2009-04-21 17:46:10</updatedate>
<curation>[curator]dorothy@archive.org[/curator][date]20090502015434[/date][state]approved[/state]</curation>
HOWEVER, the scandata.xml, jp2.tar, abbyy, djvu, and pdf files are all dated from June. It seems that the meta.xml file was not correctly updated when the pages were inserted.
Here is the inserted page as seen in IA (turn the page - note that there is no t.p. for plate VIII):
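A check along the following lines could flag items whose files are newer than the date recorded in meta.xml. This is only a sketch: the <metadata> wrapper element, the function names, and the June file dates are assumptions for illustration (the dates stand in for whatever timestamps IA reports for each file), and we do not know whether BHL's harvester relies on updatedate at all.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_updatedate(meta_xml):
    """Read <updatedate> from a meta.xml string as a datetime."""
    root = ET.fromstring(meta_xml)
    text = root.findtext("updatedate")
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def files_newer_than_metadata(meta_xml, file_dates):
    """Return the names of files dated after the meta.xml updatedate."""
    updated = parse_updatedate(meta_xml)
    return sorted(name for name, stamp in file_dates.items() if stamp > updated)

# The updatedate value is taken from the example above; the wrapper
# element and the file dates are hypothetical placeholders.
meta_xml = "<metadata><updatedate>2009-04-21 17:46:10</updatedate></metadata>"
file_dates = {
    "scandata.xml": datetime(2009, 6, 13),
    "jp2.tar": datetime(2009, 6, 13),
}

newer = files_newer_than_metadata(meta_xml, file_dates)
```

Any file name returned here would indicate that the item changed after the metadata date, so a harvester that trusts updatedate alone might miss the update.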
http://www.archive.org/stream/diecrustaceendes00hell#page/n379/mode/2up
Here is where that page should be in the book in BHL, but isn’t (then keep going - note there is a t.p. for plate VIII):
http://biodiversitylibrary.org/page/13059124
Example 2 (SIL):
Essai de classification des lépidoptères producteurs de soie (1897)
This book was first scanned on March 30th, and was sent back because of a missing page. The page was inserted May 21.
In meta.xml we see:
<scandate>20090521024433</scandate>
You can see that Page 34 was properly inserted in IA:
http://www.archive.org/details/essaideclassific31901labo
However, in BHL, while the inserted pages do display, the pagination is now incorrectly associated with the scans:
http://biodiversitylibrary.org/page/12983647
Example 3 (same scenario as Example 2) (SIL):
Mollusques terrestres et fluviatiles
- Was missing page 34 and therefore sent back for page insertion.
- In BHL, while the inserted page does display, the pagination is now incorrectly associated with the scan:
- http://biodiversitylibrary.org/page/12983647
Example 4 (Same scenario as Examples 2 and 3) (SIL):
La vie et les mœurs des animaux : Zoophytes et mollusques
- Was missing pages 4-5 and therefore sent back for page insertion.
- In BHL, while the inserted pages do display, the pagination is now incorrectly associated with the scans:
- http://biodiversitylibrary.org/page/12169251
Example 5 (Same scenario as Examples 2, 3, and 4) (SIL):
Snake Venoms
- Was missing plate 27 and therefore sent back for page insertion.
- In BHL, while the inserted pages do display, the pagination is now incorrectly associated with the scans:
- http://biodiversitylibrary.org/page/12284320
Example 6 (Same scenario as Examples 2, 3, 4, and 5) (SIL):
Cave vertebrates of America
- Was missing plate 15 and therefore sent back for page insertion.
- In BHL, while the inserted pages do display, the pagination is now incorrectly associated with the scans:
- http://biodiversitylibrary.org/page/12300206
MBLWHOI Library Examples:
All of these examples had page insertions after the initial scan