This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

Uploading to IA

See also Page Level Metadata for Ingesting Content
See also Detailed instructions posted on Archive.org

Uploading Files to IA for submission to BHL


Brief instructions and pointers on what and how to upload directly to IA for ingest into BHL, for those who are doing their own in-house scanning.
IA uses Amazon S3 storage, and some of the procedures found here in the S3 documentation may help.

What You Need Before You Start:

Files you need to create and then send to IA for each book include:
    foo_marc.xml (the MARCXML record)
    foo_scandata.xml (the page-level data)
    foo_orig_jp2.tar (a tar of the page images)
where foo is a unique item identifier - unique in IA.

You will also need to provide your item level data - this is not supplied in a file, but is provided as part of the cURL command used to upload your files.

Scanning
When you scan, scan the ENTIRE book, including both covers and end papers. Scan ALL pages, with the following exceptions:

All images that are going to be uploaded to IA should be saved as JPEG2000 (.jp2). If you wish, compress them at 85%; this is the amount of compression IA itself applies.

Naming Images:
Images must be named with your unique identifier, followed by a page image sequence number padded to 4 digits, e.g., foo_0000.jp2, foo_0001.jp2, foo_0002.jp2. How you choose and create the unique ID (which must be unique within IA) is up to you. All images must be named with a 4-digit sequential number to maintain the page order. (See what do I do if I missed a page? below). IA's unique ID formula is Author (8 chars) + Title (8 chars) + Year (4 chars) + Volume (4-8 chars). Always check (programmatically or otherwise) that your created ID is indeed unique at the Internet Archive; the worst that can happen is that you overwrite one of your own existing items with different data.
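The 4-digit padding is easy to produce with printf. A minimal sketch, using a hypothetical identifier loosely following IA's formula (not a real item):

```shell
#!/bin/sh
# Hypothetical identifier built loosely from IA's formula
# (author + title + year + volume); not a real IA item.
id="orleansannelide1911v1"

# printf's %04d pads the sequence number to 4 digits, which keeps
# the page order correct when files are sorted by name.
for seq in 0 1 2; do
    printf '%s_%04d.jp2\n' "$id" "$seq"
done
```

This prints foo-style names ending in _0000.jp2, _0001.jp2, _0002.jp2; in a real workflow the loop would iterate over your scanned images in order.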

Image Compression
There are two choices when uploading the images. You can leave your JP2 files minimally compressed (larger files) or apply a greater level of compression (smaller files). If you send the larger files, you may encounter timeouts while uploading, and the Internet Archive will create the smaller files for you. If you upload the smaller files, IA will not need to create them and your item will be readable online sooner, but IA will also not have the larger files on its servers. It is not known whether there is a true disadvantage to not having the larger files at the Internet Archive, aside from losing the benefit of a complete duplicate stored somewhere else for redundancy.

Naming Conventions: Minimally Compressed images

Naming Conventions: Compressed "final" images

MARC
This is pretty easy. Just send the MARCXML for the record. There are plenty of programs that will transform your MARC to MARCXML. Just make sure to name your MARCXML file correctly (see above.)

Item data
See the WonderFetch page for the mandatory item-level data. Essentially, though, everything in that _meta.xml list is mandatory, aside from IP/Copyright, title identifier, and volume/issue when it is not relevant.

This information is submitted as part of the cURL used when uploading files (see below under cURL.)

Page data
Please see this example of a bare-bones scandata.xml file.
Mandatory elements in this file are:

<page leafNum="1">
    <pageType>Cover</pageType>
    <addToAccessFormats>true</addToAccessFormats>
    <origWidth>4698</origWidth>
    <origHeight>6270</origHeight>
    <cropBox>
        <x>0</x>
        <y>0</y>
        <w>4698</w>
        <h>6270</h>
    </cropBox>
    <bookStart>true</bookStart>
    <handSide>RIGHT</handSide>
</page>
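Writing one <page> block per image by hand is tedious, so it is worth scripting. A minimal sketch under assumed fixed dimensions - in practice origWidth/origHeight should be read from each .jp2, and first-page-only elements such as bookStart and handSide added where appropriate:

```shell
#!/bin/sh
# Emit one <page> element with the mandatory fields. The fixed
# 4698x6270 dimensions are placeholders; read the real values
# from each image in practice.
emit_page() {
    leaf=$1; ptype=$2; w=4698; h=6270
    cat <<EOF
<page leafNum="$leaf">
    <pageType>$ptype</pageType>
    <addToAccessFormats>true</addToAccessFormats>
    <origWidth>$w</origWidth>
    <origHeight>$h</origHeight>
    <cropBox>
        <x>0</x>
        <y>0</y>
        <w>$w</w>
        <h>$h</h>
    </cropBox>
</page>
EOF
}

emit_page 1 Cover
emit_page 2 Normal
```

Looping emit_page over your image sequence and wrapping the output in the scandata.xml root elements gives a bare-bones file like the example above.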

Uploading via cURL

The basic method to upload to IA's S3 servers is to use the "curl" program to access the REST-based upload mechanism. I believe this is preferred over uploading via FTP. It also allows for more automation, in that the curl command can be constructed and called from another program. To read IA's limited documentation on this method, see this URL: http://www.archive.org/help/abouts3.txt

Prerequisites

Besides having an account at archive.org and having access to the collection you are interested in, you need an API key to use this system. To get one, log into your account and go to http://www.archive.org/account/s3.php to generate (or regenerate) the key.

Of course, you must have curl installed on your computer, too.

Uploading to IA


The basic form for uploading everything to IA is the following:
    curl \
    --location \
    --header "authorization: LOW APIKEY:APISECRET" \
    --header "x-archive-auto-make-bucket:1" \
    --header "x-archive-queue-derive:1" \
    --header "x-archive-meta-mediatype:texts" \
    --header "x-archive-meta-sponsor:Smithsonian Institution Libraries" \
    --header "x-archive-meta-title:Annelides Polychetes" \
    --header "x-archive-meta-curation:[curator]biodiversitylibrary.org[/curator][date]20100727093829[/date][state]approved[/state]" \
    --header "x-archive-meta-collection:biodiversity" \
    --header "x-archive-meta-creator:Orleans, Louis Philippe Robert" \
    --header "x-archive-meta01-subject:Polychaeta" \
    --header "x-archive-meta02-subject:Annelida" \
    --header "x-archive-meta03-subject:Scientific expeditions" \
    --header "x-archive-meta-publisher:Brussels: Imprimerie Scientifique, 1911" \
    --header "x-archive-meta-date:1911" \
    --header "x-archive-meta-language:FRE" \
    --upload-file "IDENTIFIER_marc.xml" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_marc.xml" \
    --upload-file "IDENTIFIER_scandata.xml" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_scandata.xml" \
    --upload-file "IDENTIFIER_orig_jp2.tar" "http://s3.us.archive.org/IDENTIFIER/IDENTIFIER_orig_jp2.tar"

Notes
The curl command's --upload-file switch takes two arguments, the path and filename of the file to upload and the destination URL for the item. The path and filename are specific to your computer. In all of our examples, we are assuming that the CURL command is being called from the same folder as the files that are being uploaded.
The APIKEY, APISECRET and IDENTIFIER should be replaced with the appropriate values.
Although this example shows the command broken into multiple lines, the shell needs to receive it either as exactly one line (with the line breaks removed) or, on Linux-like systems, with a trailing backslash continuation character at the end of each line so the command can span multiple lines.
Creating an identifier

The Identifier is described by IA as:

Cryptic, but generally we go with 8 to 16 letters of the Title, excluding common words such as "a", "the", etc., two letters of the volume, or "00" if there is no volume, and the first four letters of the author's last name.

In the above command, the header x-archive-auto-make-bucket:1 tells IA's system to create the bucket if it does not exist. If it already exists, the curl command will return an error; in that case, a new identifier should be used. A quick check for an identifier can be done by simply going to http://www.archive.org/details/IDENTIFIER. If it says "Item cannot be found" then there is a very good chance that the identifier is available, although there is a small chance that the item exists but is hidden from the outside world.
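That manual check can be scripted by fetching the HTTP status of the details page. The status-to-verdict mapping below is our assumption, not documented IA behavior (a 404 usually means the identifier is unused, but a hidden item may also be invisible at this URL):

```shell
#!/bin/sh
# Map the HTTP status of the details page to a verdict. The mapping
# is an assumption: 404 usually means the identifier is unused, but
# a hidden item can also return "not found" here.
check_status() {
    case "$1" in
        404) echo "probably available" ;;
        200) echo "taken" ;;
        *)   echo "unclear (HTTP $1)" ;;
    esac
}

# The live check (requires network access):
# status=$(curl -s -o /dev/null -w '%{http_code}' \
#     "http://www.archive.org/details/IDENTIFIER")
# check_status "$status"
```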

Notes on the metadata
We tested sending metadata as an IDENTIFIER_meta.xml and that did not seem to work. IA recognized that this file would clash with its management of the meta.xml file, so it renamed our file to IDENTIFIER_meta.xml_, and that wasn't good. So all metadata must be specified in the headers of the request. If a metadata item (such as "description") is sent multiple times, the two headers will be combined into one, separated by a comma and a space. IA does, however, allow multiple <description> and <subject> tags in the XML file; the method is therefore to add a numeric counter to the header name to indicate that you want separate tags in the XML.

In this example, the two subjects will be combined into one <subject> tag and separated by a comma
    curl \
    --location \
    --header "authorization: LOW APIKEY:APISECRET" \
    --header "x-archive-auto-make-bucket:1" \
    --header "x-archive-meta-subject:Polychaeta" \
    --header "x-archive-meta-subject:Annelida" \
    [ ... ]

In this example, the same two subjects will be separated into two <subject> tags in the XML
    curl \
    --location \
    --header "authorization: LOW APIKEY:APISECRET" \
    --header "x-archive-auto-make-bucket:1" \
    --header "x-archive-meta01-subject:Polychaeta" \
    --header "x-archive-meta02-subject:Annelida" \
    [ ... ]
Special metadata
It has recently been discovered that some metadata fields have special meaning to the Internet Archive. For example, we can manually specify values for "call_number" as well as "call-number". The former is given special handling, and its value appears at the top of the page with the most prominent metadata for the item. It was not immediately clear how to upload this information via the curl headers, because an underscore is converted to a hyphen on the receiving end.

We learned from S3 tech support that we can specify a double hyphen, which is converted to an underscore on the receiving end. For example, the following two headers will be treated differently (as of this writing, this has not been tested):

    [ ... ]
    --header "x-archive-meta-call-number: Z5351 .S77 1976 Suppl." \
    --header "x-archive-meta-call--number: Z5351 .S77 1976 Suppl." \
    [ ... ]

Sending the derive signal
The header x-archive-queue-derive:1 tells IA to automatically start the derivation process that converts the files you upload into other forms, such as the OCR text, the "Flippy Book", ePub, and other eBook formats. The eBook formats seem to depend on the scandata.xml file being perfect; we are still trying to figure out exactly what that means.

Updating Information

Sometimes we need to fix some data. This is usually updating metadata or updating a missing file. The jury is still out on whether or not we can use this technique to update the item to add missing pages, but when we have information on that, we'll update this page.

Updating Metadata
The header x-archive-ignore-preexisting-bucket:1 seems to erase all of the metadata, so if you use it, you should resend ALL of the metadata for that item in the headers.

The full command to update the metadata looks like the following. Note that we are on a Linux-like system and are uploading /dev/null to a nonexistent URL. This is the method given by IA to update the metadata without uploading another file.
    curl \
    --location \
    --header "authorization: LOW APIKEY:APISECRET" \
    --header 'x-archive-ignore-preexisting-bucket:1' \
    --header "x-archive-meta-mediatype:texts" \
    --header "x-archive-meta-sponsor:Smithsonian Institution Libraries" \
    --header "x-archive-meta-title:Annelides Polychetes" \
    --header "x-archive-meta-curation:[curator]biodiversitylibrary.org[/curator][date]20100727093829[/date][state]approved[/state]" \
    --header "x-archive-meta-collection:biodiversity" \
    --header "x-archive-meta-creator:Orleans, Louis Philippe Robert" \
    --header "x-archive-meta-date:1911" \
    --header "x-archive-meta-language:FRE" \
    --header "x-archive-meta-noindex:true" \
    --upload-file /dev/null http://s3.us.archive.org/IDENTIFIER

Sending a missing file
To upload a missing file use a command of the following format.
    curl --location \
    --header "authorization: LOW APIKEY:APISECRET" \
    --upload-file new_file.txt http://s3.us.archive.org/IDENTIFIER/new_file.txt

Re-Deriving updated images
When we have to re-derive, it's best if we go into the item manager for the book in question. The URL is: http://www.archive.org/item-mgr.php?identifier=IDENTIFIER

Other notes

In the scandata.xml, the page type "Delete" does not actually delete the page from the PDF/Page Turner/etc. The <addToAccessFormats> tag is what suppresses the page from the page-turning application. Set it to true or false depending on whether you want the page to display or not. Updates to the <addToAccessFormats> in the scandata.xml file will take effect immediately for the online page turner ( http://www.archive.org/stream/IDENTIFIER) but the derive process will need to be redone in order to update the PDF and other downstream files.
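As a rough sketch of suppressing pages in bulk: because each tag sits on its own line in scandata.xml, even a plain sed substitution can flip the flag. Note this flips every page at once; targeting a single leafNum is better done with a real XML tool:

```shell
#!/bin/sh
# Flip every <addToAccessFormats> from true to false. This hides ALL
# pages; it is only meant to show which tag controls page display.
hide_all_pages() {
    sed 's|<addToAccessFormats>true</addToAccessFormats>|<addToAccessFormats>false</addToAccessFormats>|g'
}

# Demonstration on a single line of scandata-style XML:
echo '<addToAccessFormats>true</addToAccessFormats>' | hide_all_pages
```

In practice you would run the filter over your IDENTIFIER_scandata.xml and re-upload the result.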

Best time to upload to IA
Morning, before 9am ET. It'll process your changes or uploads almost immediately, but once 10am hits, you're guaranteed to be stuck in the queue waiting for the server to get to you. You may also wait up to 2 hours for an OCR server to become available to process your request. Best to send content up to IA overnight, possibly after 2am or 3am ET.