Ingesting

Talk with Steve & Robert - 8/15/2007

We discussed how some libraries are pulling scanned files out of IA for local hosting, as is being done at UIUC and is being planned at Harvard.
I walked them through the process that Tim Cole at UIUC has devised, specifically how they are using MetaManager to manually select & export information because:

The server location for the files is obscured from RSS feeds & URLs
IA’s OAI provider doesn’t supply enough meaningful information

Steve recommended that they (and BHL) NOT use MetaManager to export, and NOT build IA server locations into any part of the workflow. IA operates on a cluster, and the server on which an object is located can, and will, change. He provided an API that performs this ‘locate’ function for any scanned item; it is stable and could be used as part of the workflow.

An example:
The Genus Salpa, from MBLWHOI Library, is available at:
http://www.archive.org/details/genussalpa00broo

As of this writing, the files are located at:
http://ia340904.us.archive.org/2/items/genussalpa00broo

That URL (including the server location ia340904…/2/) is what UIUC is exporting out of MetaManager. Rather than using that URL, IA recommends using this API:
http://www.archive.org/download/genussalpa00broo

Note the /download/ directive. This API does a locate for the item across the cluster and redirects to the current server location. This API should be used instead of hardcoding server locations.

A single item in IA has many files, such as JP2s, PDFs, TXT, DjVu, etc. To locate those files, use the same API:
http://www.archive.org/download/genussalpa00broo/genussalpa00broo.pdf
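
A minimal Python sketch of how a harvesting script might use the /download/ API instead of a stored server location. The function names (download_url, resolve_current_location) are hypothetical and only for illustration; the URL pattern is the one given above. The second function assumes the redirect is an ordinary HTTP redirect that urllib follows, and it needs network access.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# Base of the location-independent API described in the notes above.
DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier: str, filename: Optional[str] = None) -> str:
    """Build a /download/ URL for an item, or for one file within the item."""
    url = f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}"
    if filename:
        url += "/" + quote(filename)
    return url


def resolve_current_location(url: str, timeout: float = 30.0) -> str:
    """Request the URL, follow any redirects, and return the final URL.

    Assumption: the cluster 'locate' is exposed as a normal HTTP redirect.
    The result can differ between calls, so it should not be stored.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


if __name__ == "__main__":
    print(download_url("genussalpa00broo"))
    print(download_url("genussalpa00broo", "genussalpa00broo.pdf"))
```

The point of the design is that a workflow stores only the item identifier and file name, and asks for the current location each time it fetches.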

Other APIs:

search, e.g. creator:"siznax"
http://www.archive.org/services/search.php?query=creator%3A%22siznax%22+&submit=submit

get an item's meta XML (with {file} set to the item's _meta.xml file)

http://www.archive.org/download/{id}/{file}

just locate item

http://www.archive.org/services/find_file.php?file={identifier}
http://www.archive.org/services/find_file.php?file={identifier}&loconly=1

another way to get _meta.xml

http://{host}/item_xml.php?dir={dir}&se=1
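
A rough Python sketch of building the search and find_file URLs listed above, so that query values are percent-encoded consistently. The helper names (search_url, find_file_url) are hypothetical; the endpoints and parameter names are copied from the URLs in these notes, and nothing here is confirmed beyond that. The example search URL above carries a trailing "+" (an encoded space) after the query, which this sketch leaves out on the assumption that it is not significant.

```python
from urllib.parse import urlencode

# Service base taken from the URLs listed in the notes.
SERVICES_BASE = "http://www.archive.org/services"


def search_url(field: str, value: str) -> str:
    """Build a search.php URL for a field:"value" query, e.g. creator:"siznax"."""
    query = f'{field}:"{value}"'
    return f"{SERVICES_BASE}/search.php?" + urlencode(
        {"query": query, "submit": "submit"}
    )


def find_file_url(identifier: str, location_only: bool = False) -> str:
    """Build a find_file.php URL; location_only adds the loconly=1 parameter."""
    params = {"file": identifier}
    if location_only:
        params["loconly"] = "1"
    return f"{SERVICES_BASE}/find_file.php?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("creator", "siznax"))
    print(find_file_url("genussalpa00broo"))
    print(find_file_url("genussalpa00broo", location_only=True))
```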

Next Steps:

Communicate this to Tim Cole & the Penn State folks.
Revise plan for BHL ingesting.
Discuss enhancing OAI at OCA meeting?