IA OAI interface

From e-mail dated 10/25/2007

Brewster-

Following up on our discussions last week --

We've done some preliminary OAI harvesting of the IA/OCA site, and for us at
least, as you suggested, OAI-PMH makes more sense going forward than
continuing to try to use meta-manager to identify digitized books
to capture or recapture from the Archive. We can use your data provider
as it is currently run, but there are some issues and inefficiencies
involved. Specifically:

1) Use of OAI sets:

There appear to be OAI sets defined for the individual OCA contributors;
however, most of these OAI sets appear to be empty. Thus, from:

http://www.archive.org/services/oai.php?verb=ListSets

there is a set with setSpec =
collection:university_of_illinois_urbana-champaign; however,

http://www.archive.org/services/oai.php?verb=ListRecords&metadataPrefix=oai_dc&set=collection:university_of_illinois_urbana-champaign

yields an <error code="noRecordsMatch"> response.

From separate test harvesting, it appears that most of the digitized
books from UIUC are members only of the sets set=mediatype:texts and
set=collection:americana.
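
A minimal sketch of how a harvester might detect such an empty set, using
only the Python standard library. The sample response is hand-written for
illustration (it is not a captured response from the Archive), and the
function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_error_codes(response_xml):
    """Return the OAI-PMH error codes present in a response document."""
    root = ET.fromstring(response_xml)
    return [err.get("code") for err in root.iter(OAI_NS + "error")]


def set_is_empty(response_xml):
    """True if the response reports that no records match the request."""
    return "noRecordsMatch" in oai_error_codes(response_xml)


# Hand-written sample, shaped like an OAI-PMH error response (assumption).
SAMPLE_EMPTY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-10-25T00:00:00Z</responseDate>
  <request verb="ListRecords">http://www.archive.org/services/oai.php</request>
  <error code="noRecordsMatch">No records match</error>
</OAI-PMH>"""

print(set_is_empty(SAMPLE_EMPTY))
```

A harvester could run this check against each setSpec returned by ListSets
to see which sets are populated before scheduling a full harvest.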

For now we will harvest set=mediatype:texts, which currently contains over
280,000 records, and filter for matching <dc:contributor> values
(value=University of Illinois Urbana-Champaign). This is not a serious
problem, but in the long run, and to avoid overloading the
<dc:contributor> element, it would be nice if the digitized books could be
associated with the contributor-specific OAI sets.
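
The filtering step can be sketched roughly as follows, assuming the harvested
ListRecords response uses the standard oai_dc layout. The record identifiers
and the second contributor in the sample are invented for illustration, and
the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def identifiers_for_contributor(response_xml, contributor):
    """Return OAI identifiers of records with a matching <dc:contributor>."""
    root = ET.fromstring(response_xml)
    matches = []
    for record in root.iterfind(".//oai:record", NS):
        contributors = {
            (el.text or "").strip()
            for el in record.iterfind(".//dc:contributor", NS)
        }
        if contributor in contributors:
            matches.append(
                record.findtext(
                    "oai:header/oai:identifier", default="", namespaces=NS
                )
            )
    return matches


# Hand-written sample; identifiers and "Some Other Library" are made up.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example:item-a</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:contributor>University of Illinois Urbana-Champaign</dc:contributor>
    </oai_dc:dc>
   </metadata>
  </record>
  <record>
   <header><identifier>oai:example:item-b</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:contributor>Some Other Library</dc:contributor>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""

print(identifiers_for_contributor(
    SAMPLE, "University of Illinois Urbana-Champaign"))
```

Note that this is an exact string match, so any variation in how the
contributor name is recorded would cause records to be missed.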

2) Location of files for downloading:

As we discussed, the only URL in your OAI records is for the splash
screen at OCA. While this makes sense, it means that we have to assume
that "details" can safely be changed to "download" in the URL string to
get to the download directory. As long as that works, we're okay, but
you may want to think about using a second metadata format to deal with
this issue. See further discussion below.
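
For reference, the rewrite we are relying on amounts to something like the
sketch below. It assumes the splash-screen URL has "details" as its first
path segment, followed by the item identifier; that layout, the function
name, and the identifier "exampleitem" are our assumptions, not anything
documented in the OAI records.

```python
from urllib.parse import urlsplit, urlunsplit


def download_url(details_url):
    """Swap the leading 'details' path segment for 'download'."""
    parts = urlsplit(details_url)
    segments = parts.path.split("/")  # e.g. ['', 'details', '<identifier>']
    if len(segments) < 3 or segments[1] != "details":
        # Refuse to guess if the URL does not have the assumed shape.
        raise ValueError("not a details URL: %s" % details_url)
    segments[1] = "download"
    return urlunsplit(parts._replace(path="/".join(segments)))


# 'exampleitem' is a placeholder identifier, not a real item.
print(download_url("http://www.archive.org/details/exampleitem"))
```

Replacing only the first path segment, rather than every occurrence of the
string "details", avoids corrupting an identifier that happens to contain
that word.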

3) Use of <dc:format>:

According to the DCMI one-to-one principle, since the identifier in your
metadata record points to the splash screen, your metadata record should
describe only the splash screen. The other formats of the intellectual
object should appear in a <dc:relation> field, i.e., dcq:isFormatOf in
qualified DC. However, no one follows the DCMI one-to-one rule in
practice, and <dc:format> is arguably more meaningful here than
<dc:relation>, at least in the context of unqualified DC, so it makes
sense. Keep in mind, however, that some people may try to do machine
processing on the <dc:format> values -- e.g., to find out whether a b+w
PDF exists by checking for dc:format=Grayscale LuraTech PDF in addition
to dc:format=Standard LuraTech PDF. That may or may not be a good thing
from your standpoint. Again, a second OAI-PMH metadata format option as
described below might be the better way to go.
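
To make the concern concrete, the kind of machine processing we have in mind
looks roughly like this. It is only a sketch: the sample record is
hand-written, and the function names are hypothetical. The two format strings
are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name for <dc:format> in unqualified Dublin Core.
DC_FORMAT = "{http://purl.org/dc/elements/1.1/}format"


def format_values(record_xml):
    """Collect the distinct <dc:format> strings in one metadata record."""
    root = ET.fromstring(record_xml)
    return {(el.text or "").strip() for el in root.iter(DC_FORMAT)}


def has_grayscale_pdf(record_xml):
    """True if the record advertises a grayscale (b+w) PDF derivative."""
    return "Grayscale LuraTech PDF" in format_values(record_xml)


# Hand-written sample record (assumption about the overall shape).
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:format>Standard LuraTech PDF</dc:format>
  <dc:format>Grayscale LuraTech PDF</dc:format>
</oai_dc:dc>"""

print(has_grayscale_pdf(SAMPLE_RECORD))
```

Code like this breaks silently if the wording of a format string ever
changes, which is part of why a more structured second metadata format
may be preferable.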

4) <dc:date> and <oai:datestamp>:

The only <dc:date> value in the sample of harvested records we've looked
at so far gives the publication date of the work. This is fine, but it
means that there's nothing in the metadata record that would change when
one of the component files changes (e.g., is updated). In theory, this
means that <oai:datestamp> should not change just because you've redone
a derivative, e.g., re-OCR'd