BHL Portal

To ingest materials from the Internet Archive (IA), we need to know:

1. What has been scanned?
Is the RSS feed for the biodiversity collection sufficient and scalable?

Example:

Bulletin of the Natural History Museum (Volume 1)
http://www.archive.org/details/bulletinofnatura01entolond
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond

For any scanned volume, we need to know:

1. What is its identifier?
bulletinofnatura01entolond
-source: RSS
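
A minimal sketch of pulling identifiers out of the RSS feed, assuming each feed item carries a <link> of the form http://www.archive.org/details/<identifier>. The sample feed below is hypothetical; the layout of the real feed has not been checked here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical sample feed, written only to exercise the function below.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>biodiversity</title>
    <item>
      <title>Bulletin of the Natural History Museum (Volume 1)</title>
      <link>http://www.archive.org/details/bulletinofnatura01entolond</link>
    </item>
  </channel>
</rss>"""


def identifiers_from_rss(rss_text):
    """Return the IA identifiers found in /details/<identifier> links."""
    root = ET.fromstring(rss_text)
    identifiers = []
    for link in root.iter("link"):
        path = urlparse((link.text or "").strip()).path
        parts = [p for p in path.split("/") if p]
        # Keep only links shaped like /details/<identifier>.
        if len(parts) == 2 and parts[0] == "details":
            identifiers.append(parts[1])
    return identifiers


print(identifiers_from_rss(SAMPLE_RSS))
```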

2. When was it scanned?
2007-03-29 02:27:49
-source: in ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

3. What server is it on?
ia340917
-source: apparently not in RSS; not yet clear how to get it.
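
This does not answer how to discover the server, but once one item URL is known, the server, the directory segment (/1/), and the identifier can be split out of it. A small sketch, assuming every item URL follows the <server>.us.archive.org/<dir>/items/<identifier> pattern seen in the example; split_item_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_item_url(url):
    """Split an item URL into its server, directory, and identifier parts."""
    parsed = urlparse(url)
    server = parsed.hostname.split(".")[0]
    parts = [p for p in parsed.path.split("/") if p]
    # Assumed path layout: <dir>/items/<identifier>[/<file>]
    if len(parts) < 3 or parts[1] != "items":
        raise ValueError("unexpected item URL layout: " + url)
    return {"server": server, "dir": parts[0], "identifier": parts[2]}


print(split_item_url(
    "ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond"
))
```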

4. Where are its pages?

Low res JPG:
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_flippy.zip

JP2:
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_jp2.zip

4.5. What are their page numbers?
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/scandata.zip&file=scandata.xml
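
A rough sketch of reading page numbers out of scandata.xml. The element names used here (<page leafNum="...">, <pageNumber>) and the sample document are assumptions about the file layout and should be checked against a real scandata.xml before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical scandata.xml fragment; the real layout may differ.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="1"><pageType>Normal</pageType></page>
    <page leafNum="2"><pageType>Normal</pageType><pageNumber>1</pageNumber></page>
  </pageData>
</book>"""


def page_numbers(scandata_text):
    """Map each leaf number to its printed page number (None if unnumbered)."""
    root = ET.fromstring(scandata_text)
    result = {}
    for page in root.iter("page"):
        leaf = int(page.get("leafNum"))
        number = page.findtext("pageNumber")
        result[leaf] = number.strip() if number else None
    return result


print(page_numbers(SAMPLE_SCANDATA))
```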

4.6. Which page is the title page?
<bookplateleaf>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

5. What title does it belong to?
<call_number> in _meta.xml
If there is no <call_number>, look in the zquery.

zquery:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_metasource.xml

245a from:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_marc.xml

<title> from:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml
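
A minimal sketch of pulling 245a out of the MARCXML, assuming _marc.xml uses the standard MARC21 slim namespace. The sample record is made up for illustration and is not the actual record for this volume.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; assumed to be what _marc.xml uses.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical record, for illustration only.
SAMPLE_MARC = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Example title statement</subfield>
  </datafield>
</record>"""


def marc_245a(marc_text):
    """Return the text of field 245 subfield a, or None if it is absent."""
    root = ET.fromstring(marc_text)
    node = root.find(
        ".//marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
    )
    if node is None or not node.text:
        return None
    return node.text.strip()


print(marc_245a(SAMPLE_MARC))
```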

6. Where is its PDF?
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond.pdf

7. Where is its MARCXML?
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_marc.xml

8. Where is the OCR?

Page-level?

ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_djvu.xml

Entire volume?

ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_djvu.txt
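
The locations in questions 4 through 8 all follow one pattern, so they can be generated from the server, directory, and identifier. A sketch under that assumption; item_urls is a hypothetical helper, and it would break if zipview.php or the file naming changes.

```python
def item_urls(server, directory, identifier):
    """Build the file locations for one scanned volume from the URL pattern."""
    host = f"{server}.us.archive.org"
    item_path = f"/{directory}/items/{identifier}"
    ftp_prefix = f"ftp://{host}{item_path}/{identifier}"
    zip_prefix = f"http://{host}/zipview.php?zip={item_path}"
    return {
        "low_res_jpg": f"{zip_prefix}/{identifier}_flippy.zip",
        "jp2": f"{zip_prefix}/{identifier}_jp2.zip",
        "scandata": f"{zip_prefix}/scandata.zip&file=scandata.xml",
        "pdf": f"{ftp_prefix}.pdf",
        "marcxml": f"{ftp_prefix}_marc.xml",
        "ocr_pages": f"{ftp_prefix}_djvu.xml",
        "ocr_volume": f"{ftp_prefix}_djvu.txt",
    }


urls = item_urls("ia340917", "1", "bulletinofnatura01entolond")
for name, url in urls.items():
    print(name, url)
```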

9. What institution does it belong to?
<contributor>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

10. Who sponsored scanning?
<sponsor>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

10.5. Where was it scanned?
<scanningcenter>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

11. How was it described as a volume by IA scanners?
<volume>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml
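
Questions 5 and 9 through 11 all read single elements from _meta.xml, so one pass over the file could collect them. A sketch, assuming the elements sit directly under the root element; the sample document and its values are placeholders, not the real metadata for this volume.

```python
import xml.etree.ElementTree as ET

# Element names taken from the questions above.
FIELDS = ("title", "volume", "call_number",
          "contributor", "sponsor", "scanningcenter")

# Hypothetical _meta.xml with placeholder values and no <call_number>.
SAMPLE_META = """<metadata>
  <title>Example title</title>
  <volume>1</volume>
  <contributor>Example Library</contributor>
  <sponsor>Example Sponsor</sponsor>
  <scanningcenter>example-center</scanningcenter>
</metadata>"""


def meta_fields(meta_text):
    """Return the fields of interest; missing or empty elements become None."""
    root = ET.fromstring(meta_text)
    fields = {}
    for name in FIELDS:
        text = (root.findtext(name) or "").strip()
        fields[name] = text or None
    return fields


print(meta_fields(SAMPLE_META))
```

A missing <call_number> comes back as None, which is the case where the notes above fall back to the zquery.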

12. What articles does it contain?
TBD

13. When was it added to BHL portal?
BHL responsibility

Concerns:

Will zipview.php change? Should we be using another way to get these behind-the-scenes bits of data for ingestion?
How do we know the server (ia340917) and cluster (/1/)? Is that the wrong terminology?
Should we copy the OCR, PDF, JPG, and/or JP2 files locally to the BHL portal?
1. Persistence
2. Stability
3. Scalability
4. Concerns from IA about our hitting files behind the scenes?