BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

BHL Portal

To ingest materials from the Internet Archive (IA), we need to know:

1. What has been scanned?
Is the RSS feed for the biodiversity collection sufficient and scalable?
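
A minimal sketch of pulling identifiers out of an RSS feed, assuming the feed is plain RSS 2.0 and each <item> carries a <link> of the form http://www.archive.org/details/<identifier>. The feed layout and the sample below are assumptions for illustration, not the confirmed structure of the biodiversity collection feed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical sample feed; the real feed's layout is an assumption.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>biodiversity</title>
    <item>
      <title>Bulletin of the Natural History Museum (Volume 1)</title>
      <link>http://www.archive.org/details/bulletinofnatura01entolond</link>
    </item>
  </channel>
</rss>"""


def identifiers_from_rss(rss_text):
    """Return the IA identifier found in the <link> of each RSS <item>."""
    root = ET.fromstring(rss_text)
    identifiers = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        # Path looks like /details/bulletinofnatura01entolond
        parts = [p for p in urlparse(link).path.split("/") if p]
        if len(parts) >= 2 and parts[0] == "details":
            identifiers.append(parts[1])
    return identifiers


print(identifiers_from_rss(SAMPLE_RSS))
```

Whether this scales depends on how many items the feed keeps; a feed that only lists recent additions would miss older scans.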



Example:

Bulletin of the Natural History Museum (Volume 1)
http://www.archive.org/details/bulletinofnatura01entolond
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond

For any scanned volume, we need to know:

1. What is its identifier?
bulletinofnatura01entolond
-source: RSS

2. When was it scanned?
2007-03-29 02:27:49
-source: in ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

3. What server is it on?
ia340917
-source: apparently not in the RSS feed; not yet clear how to obtain it.

4. Where are its pages?

Low res JPG:
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_flippy.zip

JP2:
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_jp2.zip
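
The URLs above follow one pattern, so once the server, cluster, and identifier are known they can be built rather than stored. A minimal sketch; the function name item_urls and the dictionary keys are made up for illustration, and the pattern is only what the example URLs show, not a documented IA interface.

```python
def item_urls(server, cluster, identifier):
    """Build the item URLs seen in the example from server, cluster, and identifier."""
    host = f"{server}.us.archive.org"
    base_path = f"/{cluster}/items/{identifier}"
    zip_prefix = f"http://{host}/zipview.php?zip={base_path}/{identifier}"
    return {
        "ftp_dir": f"ftp://{host}{base_path}",
        "meta_xml": f"ftp://{host}{base_path}/{identifier}_meta.xml",
        "low_res_jpg": f"{zip_prefix}_flippy.zip",
        "jp2": f"{zip_prefix}_jp2.zip",
    }


urls = item_urls("ia340917", "1", "bulletinofnatura01entolond")
for name, url in urls.items():
    print(name, url)
```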

4.5. What are their page numbers?
http://ia340917.us.archive.org/zipview.php?zip=/1/items/bulletinofnatura01entolond/scandata.zip&file=scandata.xml
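
A minimal sketch of reading page numbers from scandata.xml, assuming each scanned leaf is a <page leafNum="..."> element with an optional <pageNumber> child. That layout and the sample below are assumptions; check them against a real scandata.xml before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; element names are assumed, values are made up.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0">
      <pageType>Cover</pageType>
    </page>
    <page leafNum="9">
      <pageType>Normal</pageType>
      <pageNumber>1</pageNumber>
    </page>
    <page leafNum="10">
      <pageType>Normal</pageType>
      <pageNumber>2</pageNumber>
    </page>
  </pageData>
</book>"""


def page_numbers(scandata_text):
    """Map leaf number -> printed page number (None when the leaf has none)."""
    root = ET.fromstring(scandata_text)
    numbers = {}
    for page in root.iter("page"):
        leaf = page.get("leafNum")
        if leaf is None:
            continue
        printed = page.findtext("pageNumber")
        has_value = printed is not None and printed.strip() != ""
        # Page numbers stay strings: printed numbers can be roman numerals.
        numbers[int(leaf)] = printed.strip() if has_value else None
    return numbers


print(page_numbers(SAMPLE_SCANDATA))
```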

4.6. Which page is the title page?
<bookplateleaf>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml


5. What title does it belong to?
<call_number> in _meta.xml
If there is no <call_number>, look in the zquery.

zquery:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_metasource.xml

245a from:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_marc.xml

<title> from:
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml
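
A minimal sketch of collecting the three title clues above: <call_number> and <title> from _meta.xml, and 245 subfield a from _marc.xml. It assumes the meta fields are direct children of the root element and that the MARC file is MARCXML (with or without the MARC21 slim namespace). The sample values are placeholders, and the zquery fallback is left out because the layout of _metasource.xml is not described here.

```python
import xml.etree.ElementTree as ET

# Placeholder samples; the call number value is invented for illustration.
SAMPLE_META = """<metadata>
  <identifier>bulletinofnatura01entolond</identifier>
  <title>Bulletin of the Natural History Museum</title>
  <call_number>EXAMPLE-0001</call_number>
</metadata>"""

SAMPLE_MARC = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Bulletin of the Natural History Museum.</subfield>
  </datafield>
</record>"""


def local_name(tag):
    """Strip an XML namespace such as {http://www.loc.gov/MARC21/slim}."""
    return tag.rsplit("}", 1)[-1]


def marc_245a(marc_text):
    """Return the first 245 $a found in a MARCXML document, or None."""
    root = ET.fromstring(marc_text)
    for field in root.iter():
        if local_name(field.tag) == "datafield" and field.get("tag") == "245":
            for sub in field:
                if local_name(sub.tag) == "subfield" and sub.get("code") == "a":
                    return (sub.text or "").strip()
    return None


def title_clues(meta_text, marc_text):
    """Gather call number, meta title, and MARC 245a for one scanned volume."""
    meta = ET.fromstring(meta_text)
    return {
        "call_number": meta.findtext("call_number"),  # None -> fall back to zquery
        "meta_title": meta.findtext("title"),
        "marc_245a": marc_245a(marc_text),
    }


print(title_clues(SAMPLE_META, SAMPLE_MARC))
```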

6. Where is its PDF?
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond.pdf

7. Where is its MARCXML?
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_marc.xml

8. Where is the OCR?
    1. Page-level?
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_djvu.xml
    2. Entire volume?
ftp://ia340917.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_djvu.txt
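
A minimal sketch of splitting the page-level OCR into one text string per page, assuming _djvu.xml wraps each page in an <OBJECT> element containing <LINE> elements made of <WORD> elements. That nesting and the sample below are assumptions about the DjVu XML layout; verify against a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-page excerpt; element nesting is assumed, text is made up.
SAMPLE_DJVU_XML = """<DjVuXML>
  <BODY>
    <OBJECT>
      <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
        <LINE><WORD>BULLETIN</WORD><WORD>OF</WORD></LINE>
        <LINE><WORD>NATURAL</WORD><WORD>HISTORY</WORD></LINE>
      </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
    </OBJECT>
    <OBJECT>
      <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
        <LINE><WORD>Page</WORD><WORD>two</WORD></LINE>
      </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def page_texts(djvu_xml_text):
    """Return a list with the OCR text of each page, in document order."""
    root = ET.fromstring(djvu_xml_text)
    pages = []
    for obj in root.iter("OBJECT"):  # one OBJECT per scanned page (assumed)
        lines = []
        for line in obj.iter("LINE"):
            words = [(w.text or "").strip() for w in line.iter("WORD")]
            lines.append(" ".join(w for w in words if w))
        pages.append("\n".join(lines))
    return pages


for number, text in enumerate(page_texts(SAMPLE_DJVU_XML)):
    print(number, repr(text))
```

The list index is the page's position in the file; matching it to a printed page number would still need scandata.xml.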

9. What institution does it belong to?
<contributor>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

10. Who sponsored the scanning?
<sponsor>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

10.5. Where was it scanned?
<scanningcenter>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml

11. How was it described as a volume by IA scanners?
<volume>
ftp://ia340916.us.archive.org/1/items/bulletinofnatura01entolond/bulletinofnatura01entolond_meta.xml
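
Questions 9 through 11 all read from the same _meta.xml, so one pass can collect them. A minimal sketch, assuming each field is a direct child of the root element; the sample values below are placeholders, not the real values for this volume.

```python
import xml.etree.ElementTree as ET

# Fields named in the questions above.
META_FIELDS = ("contributor", "sponsor", "scanningcenter", "volume")

# Placeholder sample; values are invented for illustration.
SAMPLE_META = """<metadata>
  <identifier>bulletinofnatura01entolond</identifier>
  <contributor>Example Library</contributor>
  <sponsor>Example Sponsor</sponsor>
  <scanningcenter>examplecenter</scanningcenter>
  <volume>1</volume>
</metadata>"""


def meta_fields(meta_text, fields=META_FIELDS):
    """Return {field: text or None} for the requested _meta.xml elements."""
    root = ET.fromstring(meta_text)
    result = {}
    for name in fields:
        value = root.findtext(name)
        result[name] = value.strip() if value is not None else None
    return result


print(meta_fields(SAMPLE_META))
```

Missing elements come back as None rather than raising, since not every item is guaranteed to carry every field.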

12. What articles does it contain?
TBD

13. When was it added to the BHL portal?
BHL responsibility



Concerns:

  1. Will zipview.php change? Should we be using some other way to get these behind-the-scenes bits of data for ingestion?
  2. How do we know the server (ia340917) and cluster (/1/)? Is this the wrong terminology?
  3. Should we copy the OCR, PDF, JPG, and/or JP2 files locally to the BHL portal?
    1. Persistence
    2. Stability
    3. Scalability
    4. Does IA have concerns about us hitting files behind the scenes?