IA Harvest

Steps to harvest IA scans into BHL Portal

The following explains how Internet Archive (http://www.archive.org) data is harvested into the database for the Biodiversity Heritage Library (http://www.biodiversitylibrary.org). To be harvested, an item must be part of the "biodiversity" collection within Internet Archive. The collection is set either by passing a parameter in the Wonderfetch URL, or by the IA scanner choosing "biodiversity" from a drop-down menu when using the Biblio scanning software.

1. To harvest scans added or modified during a specific date range, run an Internet Archive query to identify the books to harvest (by identifier). An example query is:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:[2007-10-14+TO+2007-10-28]&submit=submit

Notice the date range that is included in the query. By making use of this date range, it is possible to run a harvest process on a monthly, weekly, or daily basis, and pick up items as they are added or modified at the Internet Archive.
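As a minimal sketch of how such a recurring harvest might build its queries, the Python below assembles the search URL for a date window and splits a longer period into weekly windows. The function names and the seven-day step are illustrative assumptions, not part of the BHL process; only the URL pattern comes from the example query above.

```python
from datetime import date, timedelta

# URL pattern taken from the example query in this document.
SEARCH_URL = "http://www.archive.org/services/search.php"


def build_harvest_query(start, end, collection="biodiversity"):
    """Return the search URL for items whose updatedate falls in [start, end]."""
    query = (
        f"collection:({collection})"
        f"+AND+updatedate:[{start.isoformat()}+TO+{end.isoformat()}]"
    )
    return f"{SEARCH_URL}?query={query}&submit=submit"


def weekly_windows(first, last):
    """Yield consecutive (start, end) date pairs, at most seven days long.

    Adjacent windows share a boundary date, so an item updated on that
    date may be returned twice; a harvester should tolerate duplicates.
    """
    start = first
    while start < last:
        end = min(start + timedelta(days=7), last)
        yield start, end
        start = end
```

Calling `build_harvest_query(date(2007, 10, 14), date(2007, 10, 28))` reproduces the example query shown in step 1.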

This query will return a response similar to the example below. The key information to extract from the response is the set of Internet Archive identifiers. These IDs are located in the <str> elements that have a name attribute with a value of “identifier”.

The identifiers contained in this example response are “transactionsofli05linn” and “annalesacademici40rijk”.

<?xml version="1.0" encoding="UTF-8" ?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">220</int>
<lst name="params">
<str name="wt">xml</str>
<str name="rows">1000</str>
<str name="start">0</str>
<str name="q">collection:(biodiversity ) AND updatedate:[2007-10-14T00:00:00Z TO 2007-10-28T00:00:00Z]</str>
<str name="fl">identifier, mediatype, collection</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<arr name="collection">
<str>biodiversity</str>
</arr>
<str name="identifier">transactionsofli05linn</str>
<str name="mediatype">texts</str>
</doc>
<doc>
<arr name="collection">
<str>biodiversity</str>
</arr>
<str name="identifier">annalesacademici40rijk</str>
<str name="mediatype">texts</str>
</doc>
</result>
</response>
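
The identifier extraction can be sketched in Python with the standard library's ElementTree parser. This is an illustration only: the function name is hypothetical, and it assumes the response has the <result>/<doc>/<str> layout shown in the example above.

```python
import xml.etree.ElementTree as ET


def extract_identifiers(response_xml):
    """Return the Internet Archive identifiers found in a search response.

    response_xml is the raw response body (bytes). Only <str> elements
    directly inside a <doc> are considered, so the echoed query
    parameters in the response header are ignored.
    """
    root = ET.fromstring(response_xml)
    matches = root.findall("./result/doc/str[@name='identifier']")
    return [el.text.strip() for el in matches if el.text]
```

Applied to the example response above, this would return the two identifiers listed earlier, in document order.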

2. For each Internet Archive identifier, download the _FILES.XML file. This file is located at the following URL:

http://www.archive.org/download/<Internet Archive Identifier>/<Internet Archive Identifier>_files.xml

This XML file lists all of the files that are available for download for a particular Internet Archive identifier, along with a format for each file. If desired, the format value can be used to decide which files to download for each identifier (for example, only files with a format of “PDF”, or only files with a format of “Metadata”).
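
A minimal sketch of filtering the file list by format follows. The layout it relies on is an assumption, since this document does not show a sample _FILES.XML: each file is taken to be a <file> element with a name attribute and a <format> child element. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def select_files(files_xml, wanted_formats):
    """Return the names of files whose format is in wanted_formats.

    Assumed layout (not confirmed by this document):
        <files>
          <file name="..."><format>...</format></file>
        </files>
    """
    root = ET.fromstring(files_xml)
    selected = []
    for file_el in root.iter("file"):
        name = file_el.get("name")
        fmt = (file_el.findtext("format") or "").strip()
        if name and fmt in wanted_formats:
            selected.append(name)
    return selected
```

For example, `select_files(files_xml, {"Metadata"})` would keep only the metadata files and skip the larger image and PDF derivatives.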

3. Using the file list found in the _FILES.XML file for each Internet Archive identifier, download each desired file from the following URL:

http://www.archive.org/download/<Internet Archive Identifier>/<Filename>

4. Once you have downloaded all of the desired files for each Internet Archive identifier, parse the information from the metadata (XML) files into an appropriate data store.

The database for the Biodiversity Heritage Library (BHL) website existed prior to the creation of the Internet Archive harvesting process. Therefore, to get the data from the XML files into the BHL database, it was useful to extract the information into a set of database tables that map closely to the format of the XML files. From those tables the data is cleaned and transformed before being inserted into the “final” set of BHL tables that are used as the data source for the website.

Depending on how the Internet Archive data is to be used by a harvesting organization, it may be necessary to perform a similar extract-clean-and-store process, or it may be feasible to leave the data in the XML files downloaded from the Internet Archive.
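
To illustrate the idea of staging tables that mirror the XML, here is a small sketch using SQLite. It is not the BHL schema: the table name, its columns, and the assumption that the _META.XML file is a flat list of child elements under one root element are all hypothetical choices made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def stage_metadata(conn, identifier, meta_xml):
    """Copy each top-level element of a metadata file into a staging table.

    One row is stored per element (identifier, element name, text value),
    so the table stays close to the shape of the XML. Cleaning and
    transformation into final tables would happen in a later step.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ia_meta_staging ("
        "identifier TEXT NOT NULL, element TEXT NOT NULL, value TEXT)"
    )
    root = ET.fromstring(meta_xml)
    rows = [(identifier, child.tag, (child.text or "").strip()) for child in root]
    conn.executemany("INSERT INTO ia_meta_staging VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

A caller might open a connection with `sqlite3.connect("harvest.db")` and call `stage_metadata` once per downloaded metadata file.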

NOTE 1

Here is a list of the metadata files that may exist for each Internet Archive item.

<Internet Archive Identifier>_dc.xml – Dublin Core data
<Internet Archive Identifier>_djvu.xml – An XML representation of the OCR data
<Internet Archive Identifier>_files.xml – A list of the files that exist for the item
<Internet Archive Identifier>_marc.xml – MARC data
<Internet Archive Identifier>_meta.xml – Additional Dublin Core data, scanning information, item status
<Internet Archive Identifier>_metasource.xml – Item contributor
<Internet Archive Identifier>_scandata.xml – Scanning information for each page of the item

NOTE 2

When harvesting the information from Internet Archive, it is important to note that materials become available for download before an item has been fully “approved” for publication on the Internet Archive site. In practice, this means that the information is subject to change. Once an item is “approved”, it can then be assumed that it will no longer change.

To determine whether an item has been approved, examine the <Internet Archive Identifier>_META.XML file for a <curation> element. If that element exists, and if part of that element’s value is “[state]approved[/state]”, then the item has been approved.
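
The approval check can be sketched as a simple substring test on the <curation> element. The function name is hypothetical, and the sketch assumes only what is stated above: that an approved item carries the marker text somewhere in the element's value.

```python
import xml.etree.ElementTree as ET

APPROVED_MARKER = "[state]approved[/state]"


def is_approved(meta_xml):
    """Return True if any <curation> element contains the approved marker."""
    root = ET.fromstring(meta_xml)
    return any(
        APPROVED_MARKER in (el.text or "") for el in root.iter("curation")
    )
```

Items for which this returns False can be staged but held back from publication until a later harvest sees them approved.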

For harvesting into the BHL database, items are not published to the BHL website until they move to an “approved” state at Internet Archive. When the Internet Archive approves an item, the item’s “updatedate” changes. Because this date is part of the query submitted in the first step of the harvest process, approved items are picked up by later harvests without having to re-query the date range in which the item first appeared.

NOTE 3

The _SCANDATA.XML file contains details of what is on each page of a given book (cover, title page, table of contents, text, and so on). Unfortunately, this file does not always exist at the http://www.archive.org/download/ location.

For newer materials (late 2007 and following), it seems that this file does exist in nearly all cases. Materials added to the Internet Archive in mid 2007 or earlier, however, may or may not include this file.

If no _SCANDATA.XML file exists at the http://www.archive.org/download/ location for a given identifier, there is a second place to look for this file. There may be a file named SCANDATA.ZIP that exists in the download folder for an item. If so, this ZIP file may contain the _SCANDATA.XML file. The steps for extracting the XML file from this alternate location are:

a) First, run the following query to get the physical file location of the SCANDATA.ZIP file:

http://www.archive.org/services/find_file.php?file=<IA Identifier>&loconly=1

The results of this query will look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<results>
<location host="ia340904.us.archive.org" dir="/2/items/genussalpa00broo" />
<location host="ia340905.us.archive.org" dir="/2/items/genussalpa00broo" />
</results>

Extract the values of the “host” and “dir” attributes of the first <location> element.
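
Step a) might look like the following in Python. The function names are hypothetical; the URL pattern and the <results>/<location> layout are taken from the example above.

```python
import xml.etree.ElementTree as ET


def find_file_url(identifier):
    """Build the find_file.php query URL for an identifier."""
    return (
        "http://www.archive.org/services/find_file.php"
        f"?file={identifier}&loconly=1"
    )


def first_location(results_xml):
    """Return (host, dir) from the first <location> element, or None."""
    root = ET.fromstring(results_xml)
    location = root.find("location")
    if location is None:
        return None
    return location.get("host"), location.get("dir")
```

Returning None when no <location> element is present lets the caller treat the item as having no alternate location rather than failing.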

b) Use the “host” and “dir” values to submit the following request to the Internet Archive:

http://<host>/zipview.php?zip=<dir>/scandata.zip&file=scandata.xml

This request will download the SCANDATA.XML file (if it exists).
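
As a small illustration of step b), the request URL can be assembled from the “host” and “dir” values as shown below. The function name is hypothetical; the URL pattern is the one given above.

```python
def scandata_zip_url(host, directory):
    """Build the zipview.php URL that extracts scandata.xml from scandata.zip."""
    return (
        f"http://{host}/zipview.php"
        f"?zip={directory}/scandata.zip&file=scandata.xml"
    )
```

With the example values above, `scandata_zip_url("ia340904.us.archive.org", "/2/items/genussalpa00broo")` yields the request for that item's scandata file.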