Uploading to IA
See also:
- Page Level Metadata for Ingesting Content
- Detailed instructions posted on Archive.org
Uploading Files to IA for submission to BHL
Brief instructions and pointers on what and how to upload directly to IA for ingest into BHL, for those who are doing their own in-house scanning.
IA uses Amazon S3-compatible storage, and some of the procedures described in the S3 documentation may help.
What You Need Before You Start:
- an account on www.archive.org (see the Account creation page)
- rights for that account to create items that are part of the Biodiversity collection (contact Mike Lichtenberg to coordinate access to the biodiversity collection)
Files you need to create and then send to IA for each book include:
- foo_orig_jp2.tar file containing all the scanned page images
- foo_marc.xml (your MARC record, in MARCXML)
- foo_scandata.xml (the page level data)
where
foo is a unique item identifier - unique in IA.
You will also need to provide your item-level data. This is not supplied in a file; it is provided as part of the
cURL command used to upload your files.
Scanning
When you scan, scan the ENTIRE book, including both covers and end papers. Scan ALL pages, with the following exceptions:
- do not scan tissue that covers illustrations, unless it contains printing or other information;
- do not scan blank pages if there are more than 6 blank pages in a row (as when a book has been re-bound and 'filler' pages added to the beginning or end); you may scan the first 6 and skip the remainder.
All images that are going to be uploaded to IA should be saved as JPEG2000, compressed to roughly 85% if you wish (this is the level of compression IA itself applies).
Naming Images:
Images must be named with your unique identifier, followed by a page image sequence number padded to 4 digits, e.g.,
foo_0000.jp2,
foo_0001.jp2,
foo_0002.jp2. How you choose and create the unique ID (which must be unique within IA) is up to you. All images must be named with a 4-digit sequential number to maintain the page order. (See "what do I do if I
missed a page?" below.) IA's unique-id formula is Author (8 chars) + Title (8 chars) + year (4 chars) + Volume (4-8 chars). You should always check (programmatically or otherwise) that your created ID is indeed unique at the Internet Archive; the worst that can happen is that you overwrite one of your own existing items with different data.
Image Compression
There are two choices when uploading the images: you can leave your JP2 files minimally compressed (larger files) or apply a greater level of compression (smaller files). If you send the larger files, you may encounter timeouts while uploading, and the Internet Archive will create the smaller files for you. If you upload the smaller files, IA will not need to create them and your item will be readable online sooner, but IA will not have the larger files on its servers. It is not known whether there is a true disadvantage to not having the larger files at the Internet Archive, aside from losing the benefit of a complete duplicate stored elsewhere for redundancy.
Naming Conventions: Minimally Compressed images
- Images should be named as follows: foo_orig_0000.jp2, foo_orig_0001.jp2, foo_orig_0002.jp2
- Images should be placed into a directory: foo_orig_jp2
- This directory should be compressed into a .tar (tape archive) file: foo_orig_jp2.tar
Naming Conventions: Compressed "final" images
- Use roughly 85% compression level to maintain quality and minimize size (use your best judgement)
- Images should be named as follows: foo_0000.jp2, foo_0001.jp2, foo_0002.jp2
- Images should be placed into a directory: foo_jp2
- This directory should be compressed into a .zip file: foo_jp2.zip
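The packaging steps for both sets of images can be sketched as follows; the mkdir/touch lines only create placeholder files so the commands run end-to-end, and foo is a placeholder identifier:

```shell
# Demo setup: placeholder directories and images for identifier "foo".
mkdir -p foo_orig_jp2 foo_jp2
touch foo_orig_jp2/foo_orig_0000.jp2 foo_jp2/foo_0000.jp2

# Minimally compressed originals go into a .tar archive.
tar -cf foo_orig_jp2.tar foo_orig_jp2

# Compressed "final" images go into a .zip (skipped here if zip is absent).
if command -v zip >/dev/null 2>&1; then
  zip -qr foo_jp2.zip foo_jp2
fi
```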
MARC
This is pretty easy. Just send the MARCXML for the record. There are plenty of programs that will transform your MARC to MARCXML. Just make sure to name your MARCXML file correctly (see above.)
Item data
See the
WonderFetch page for the mandatory item-level data. Essentially, everything in that _meta.xml list is mandatory, aside from IP/Copyright, title identifier, and volume/issue when it is not relevant.
This information is submitted as part of the cURL command used when uploading files (see Uploading via cURL below).
Page data
Please see
this example of a bare-bones scandata.xml file.
Mandatory elements in this file are:
- "bookData" must have the unique identifier and the count of total images (must match the number of image files you submit, or the derive will fail)
- "pageData" must have
- leafNum (sequence number)
- pageType (see Page Level Metadata for Ingesting Content for acceptable page type values)
- addToAccessFormats (this controls if the image displays in the page turner and is added to the pdf, used to suppress the color target at start/end of book)
- origWidth and origHeight (full width and height of the original image, in pixels)
- cropbox width and height (this can be the same as the origWidth and origHeight, unless you are able to capture and send cropbox sizes.)
- handSide (Right or Left for Recto and Verso),
- and bookStart=true for the title page
- the first page leafNum value must match the first image filename enumeration. if you start your filenaming with 0000, your first leafNum should be 0.
<page leafNum="1">
<pageType>Cover</pageType>
<addToAccessFormats>true</addToAccessFormats>
<origWidth>4698</origWidth>
<origHeight>6270</origHeight>
<cropBox>
<x>0</x>
<y>0</y>
<w>4698</w>
<h>6270</h>
</cropBox>
<bookStart>true</bookStart>
<handSide>RIGHT</handSide>
</page>
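Page stanzas like the one above can also be generated in a loop. A bare-bones sketch (the leaf range, page type, and handSide parity are placeholder assumptions; real values must come from your scan):

```shell
# Write bare-bones <page> stanzas for leaves 0-2 to pages.xml,
# alternating handSide (even leaves RIGHT/recto, odd leaves LEFT/verso).
: > pages.xml
for leaf in 0 1 2; do
  if [ $((leaf % 2)) -eq 0 ]; then side="RIGHT"; else side="LEFT"; fi
  {
    printf '<page leafNum="%d">\n' "$leaf"
    printf '  <pageType>Normal</pageType>\n'
    printf '  <addToAccessFormats>true</addToAccessFormats>\n'
    printf '  <handSide>%s</handSide>\n' "$side"
    printf '</page>\n'
  } >> pages.xml
done
```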
Uploading via cURL
The basic method for uploading to IA's S3 servers is to use the curl program to access the REST-based upload mechanism. I believe this is preferred over uploading via FTP, and it allows for more automation, since the curl command can be constructed and called from another program. For IA's (limited) documentation on this method, see this URL:
http://www.archive.org/help/abouts3.txt
Prerequisites
Besides having an account at archive.org and access to the collection you are interested in, you need an API key to use this system. To get one, log into your account and go to
http://www.archive.org/account/s3.php to generate (or regenerate) the key.
Of course, you must have
curl installed on your computer, too.
Uploading to IA
The basic form for uploading everything to IA is the following:
curl \
--location \
--header "authorization: LOW APIKEY:APISECRET"
--header "x-archive-auto-make-bucket:1"
--header "x-archive-queue-derive:1"
--header "x-archive-meta-mediatype:texts"
--header "x-archive-meta-sponsor:Smithsonian Institution Libraries"
--header "x-archive-meta-title:Annelides Polychetes"
--header "x-archive-meta-curation:[curator]biodiversitylibrary.org[/curator][date]20100727093829[/date][state]approved[/state]"
--header "x-archive-meta-collection:biodiversity"
--header "x-archive-meta-creator:Orleans, Louis Philippe Robert"
--header "x-archive-meta01-subject:Polychaeta"
--header "x-archive-meta02-subject:Annelida"
--header "x-archive-meta03-subject:Scientific expeditions"
--header "x-archive-meta-publisher:Brussels: Imprimerie Scientifique, 1911"
--header "x-archive-meta-date:1911"
--header "x-archive-meta-language:FRE"
--upload-file "IDENTIFIER_marc.xml" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_marc.xml"
--upload-file "IDENTIFIER_scandata.xml" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_scandata.xml"
--upload-file "IDENTIFIER_orig_jp2.tar" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_orig_jp2.tar"
Notes
The curl command's --upload-file switch takes two arguments: the path and filename of the file to upload, and the destination URL for the item. The path and filename are specific to your computer; in all of our examples, we assume the curl command is being run from the same folder as the files being uploaded.
The APIKEY, APISECRET and IDENTIFIER should be replaced with the appropriate values.
Although this example shows the command broken into multiple lines, curl must receive it as a single command: either remove the line breaks and combine everything into exactly one line, or, on Unix-like systems, end every line except the last with the backslash continuation character.
Creating an identifier
The Identifier is described by IA as:
- IDENTIFIER: Unique in Archive's collection, alphanumeric (URL safe), this is the original name adopted by the originating collection (alphanumeric characters and _-. Best if from 5 to 80 characters). One format is [title:8-16][vol:2][author:4][scanninglocation:0-4]
Cryptic, but generally we go with 8 to 16 letters of the title (excluding common words such as "a", "the", etc.), two letters of the volume (or "00" if there is no volume), and the first four letters of the author's last name.
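That recipe can be sketched in shell (the title, volume, and author values here are illustrative, not a real record):

```shell
# [title:8-16][vol:2][author:4] - illustrative values only.
title="annelidespolychetes"
vol="00"                       # "00" when there is no volume
author="orleans"
id="$(printf '%s' "$title" | cut -c1-16)${vol}$(printf '%s' "$author" | cut -c1-4)"
echo "$id"   # annelidespolyche00orle
```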
In the above command, the header
x-archive-auto-make-bucket:1 tells IA's system to create the bucket if it does not exist. If it already exists, the curl command will return an error; in that case, a new identifier should be used. A quick check for an identifier can be done by simply going to
http://www.archive.org/details/IDENTIFIER. If it says
Item cannot be found, there is a very good chance that the identifier is available, although there is a small chance that the item exists but is hidden from the outside world.
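A scripted check is also possible. This sketch assumes IA's public metadata endpoint (https://archive.org/metadata/IDENTIFIER), which is separate from the S3 interface described here and returns an empty {} document when no such item exists:

```shell
# Helper: fetch IA's metadata record for an identifier.
# An empty "{}" response suggests the identifier is not yet in use.
check_id() {
  curl -s "https://archive.org/metadata/$1"
}
# Example (requires network access): check_id annelidespolyche00orle
```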
Notes on the metadata
We tested sending metadata as an IDENTIFIER_meta.xml file, and that did not work: IA recognized that this file would clash with its own management of the meta.xml file, so it renamed our file to IDENTIFIER_meta.xml_, which was no good. All metadata must therefore be specified in the headers of the upload. If a metadata item (such as "description") is sent multiple times under the same header name, the values will be combined into one, separated by a comma and a space. IA does, however, allow multiple <description> and <subject> tags in the XML file; hence the method of adding a numeric counter to the header name when you want separate tags in the XML.
In this example, the two subjects will be combined into one <subject> tag and separated by a comma
curl \
--location \
--header "authorization: LOW APIKEY:APISECRET"
--header "x-archive-auto-make-bucket:1"
--header "x-archive-meta-subject:Polychaeta"
--header "x-archive-meta-subject:Annelida"
[ ... ]
In this example, the same two subjects will be separated into two <subject> tags in the XML
curl \
--location \
--header "authorization: LOW APIKEY:APISECRET"
--header "x-archive-auto-make-bucket:1"
--header "x-archive-meta01-subject:Polychaeta"
--header "x-archive-meta02-subject:Annelida"
[ ... ]
Special metadata
It has recently been discovered that some metadata fields have special meaning to the Internet Archive. For example, we can specify values for both "call_number" and "call-number"; the former is given special handling, and its value appears at the top of the page with the most prominent metadata for the item. It was not immediately clear how to upload this field via the curl headers, because an underscore is converted to a hyphen on the receiving end.
We learned from S3 tech support that a double hyphen is converted to an underscore, so the following two headers are treated differently (as of this writing, this has not been tested yet):
[ ... ]
--header "x-archive-meta-call-number: Z5351 .S77 1976 Suppl."
--header "x-archive-meta-call--number: Z5351 .S77 1976 Suppl."
[ ... ]
Sending the derive signal
The header
x-archive-queue-derive:1 tells IA to automatically start the derivation process, which converts the files you upload into other forms, such as the OCR text, the "Flippy Book", ePub, and other eBook formats. The eBook formats seem to depend on the scandata.xml file being perfect; we are still trying to figure out exactly what that means.
Updating Information
Sometimes we need to fix some data, usually by updating metadata or supplying a missing file. The jury is still out on whether we can use this technique to add missing pages to an item; when we have information on that, we will update this page.
Updating Metadata
The header
x-archive-ignore-preexisting-bucket:1 seems to erase all of the metadata, so if you use it, you must resend ALL of the metadata for that item in the headers.
The full command to update the metadata looks like the following. Note that we are on a Linux-like system and are uploading /dev/null to a nonexistent URL. This is the method given by IA to update the metadata without uploading another file.
curl \
--location \
--header "authorization: LOW APIKEY:APISECRET"
--header 'x-archive-ignore-preexisting-bucket:1'
--header "x-archive-meta-mediatype:texts"
--header "x-archive-meta-sponsor:Smithsonian Institution Libraries"
--header "x-archive-meta-title:Annelides Polychetes"
--header "x-archive-meta-curation:[curator]biodiversitylibrary.org[/curator][date]20100727093829[/date][state]approved[/state]"
--header "x-archive-meta-collection:biodiversity"
--header "x-archive-meta-creator:Orleans, Louis Philippe Robert"
--header "x-archive-meta-date:1911"
--header "x-archive-meta-language:FRE"
--header "x-archive-meta-noindex:true"
--upload-file /dev/null http://s3.us.archive.org/IDENTIFIER
Sending a missing file
To upload a missing file use a command of the following format.
curl --location \
--header "authorization: LOW APIKEY:APISECRET"
--upload-file new_file.txt http://s3.us.archive.org/IDENTIFIER/new_file.txt
Re-Deriving updated images
When we have to re-derive, it's best if we go into the item manager for the book in question. The URL is:
http://www.archive.org/item-mgr.php?identifier=IDENTIFIER
Other notes
In the scandata.xml, the page type "Delete" does not actually delete the page from the PDF/page turner/etc. The <addToAccessFormats> tag is what suppresses a page from the page-turning application: set it to true or false depending on whether you want the page to display. Updates to <addToAccessFormats> in the scandata.xml file take effect immediately for the online page turner (
http://www.archive.org/stream/IDENTIFIER) but the derive process will need to be redone in order to update the PDF and other downstream files.
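For example, a stanza that hides a leaf (such as a color target) from the access formats might look like the following; the leaf number and page type are illustrative (see Page Level Metadata for Ingesting Content for valid pageType values):

```xml
<page leafNum="0">
  <pageType>Normal</pageType>
  <!-- false = hide this leaf from the page turner and the PDF -->
  <addToAccessFormats>false</addToAccessFormats>
</page>
```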
Best time to upload to IA
Morning, before 9am ET. It'll process your changes or uploads almost immediately, but once 10am hits, you're guaranteed to be stuck in the queue waiting for the server to get to you. You may also wait up to 2 hours for an OCR server to become available to process your request. Best to send content up to IA overnight, possibly after 2am or 3am ET.