This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

BHL Technical Meeting (18 October 2007)

BHL/Internet Archive Technical Meeting

18 October 2007



Item/Volume-level metadata in Packing Lists & stored in IA

  1. Need additional fields in MetaManager & in _meta.xml
    • Enum & Chron From/To
  2. Year(s) of publication in bound object
  3. Volume(s)/Issue(s) in bound object
  4. Optional; some libraries cannot support
  5. ItemID
Outcome of Discussions: IA will store the additional metadata sent to them. Additional fields may be needed, but existing fields will be reviewed first to determine whether any are appropriate for use. Fields in the Biblio software can be preloaded by passing parameters in the URL; the URL could be added to the Packing List, or a web interface could be provided from each library’s Picking/Packing DB. The focus is on storage of the data at IA; BHL will provide the indexing & citation resolution functionality.

Action Items: Steve to provide parameters (completed). Chris, Bernard (NHM), Keri (SIL), John (NYGB), Michael (BPL) to review and determine how best to match our metadata. Single-scanning operations at SIL & NHM to test procedures; once issues resolved larger scanning centers to adopt practices.

From Steve:
i've added the following possible GET args which may
be used to prepopulate the Metaform (which gets written
as the _meta.xml file on the cluster) in the biblio
tool in support of BHL metadata needs.

1) year Year of publication in bound object
2) volume Volume(s)/Issue(s) in bound object
3) bib_id Local number or other
4) license License selection (see usage below)
5) dd Due diligence statement selection (see usage)

programmatic args and additional usage is defined here:

http://www-steve.us.archive.org/biblio?f=help#advanced
http://www-steve.us.archive.org/biblio?f=usage

Click for an example of a pre-populated Metaform.
(please try not to submit)

note selected args:

   term      meta tag         GET arg
   ==================================================
1) volume    volume           &b_v=vol.+1,+series+1-3
2) year      year             &year=1960-1971
3) bib_id    identifier-bib   &b_ib=0987654321
4) license   licenseurl       &lic=by-nc-sa
5) dd        duediligence     &dd=dd3

which would result in meta XML tags:

1) <volume>vol. 1, series 1-3</volume>
2) <year>1960-1971</year>
3) <identifier-bib>0987654321</identifier-bib>
4) <licenseurl>Attribution + Noncommercial + ShareAlike</licenseurl>
5) <duediligence>Due Diligence statement 3</duediligence>

please review and comment on the license and DD definitions here:
License Selections (&lic=)...

[by] => Attribution alone
[by-nc] => Attribution + Noncommercial
[by-nd] => Attribution + NoDerivs
[by-sa] => Attribution + ShareAlike
[by-nc-nd] => Attribution + Noncommercial + NoDerivs
[by-nc-sa] => Attribution + Noncommercial + ShareAlike
[nr1] => purl.org/bhl/neg/rights/1
[nr2] => purl.org/bhl/neg/rights/2
[nr3] => purl.org/bhl/neg/rights/3

Due Diligence Statements (&dd=)...

[dd1] => Due Diligence statement 1
[dd2] => Due Diligence statement 2
[dd3] => Due Diligence statement 3
http://www-steve.us.archive.org/biblio?f=usage
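
The GET args in Steve's message can be assembled from a Packing List row roughly as follows. This is a minimal sketch, not IA's implementation: the row keys and the `metaform_url` helper are hypothetical, and the sketch assumes the form is served at the same `/biblio` path as the help links. Only the arg names (`b_v`, `year`, `b_ib`, `lic`, `dd`) come from the message itself.

```python
from urllib.parse import urlencode

# Assumed base path, taken from the help links in Steve's message.
BIBLIO_URL = "http://www-steve.us.archive.org/biblio"

# Hypothetical Packing List column name -> GET arg listed in Steve's message.
GET_ARGS = {
    "volume": "b_v",
    "year": "year",
    "bib_id": "b_ib",
    "license": "lic",
    "dd": "dd",
}


def metaform_url(row):
    """Build a URL that prepopulates the Metaform from one Packing List row.

    Unknown columns and empty values are skipped.
    """
    params = {GET_ARGS[k]: v for k, v in row.items() if k in GET_ARGS and v}
    # safe="," keeps commas literal, matching the example in the message.
    return BIBLIO_URL + "?" + urlencode(params, safe=",")


row = {
    "volume": "vol. 1, series 1-3",
    "year": "1960-1971",
    "bib_id": "0987654321",
    "license": "by-nc-sa",
    "dd": "dd3",
}
print(metaform_url(row))
```

A link built this way could be stored in the Packing List or generated by each library's Picking/Packing DB, the two options raised in the discussion.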


IPR

  1. Copyright notice
  2. Public Domain
  3. Negotiated Rights
  4. Creative Commons
  5. Orphaned Works w/ LC & Stanford
Outcome of Discussions: Fields are needed to store & display copyright & use status. History of why the Copyright notice box went away: Microsoft. IA has several fields behind the scenes; the two most appropriate for our needs are <license-url> and <copyright-status>. The group decided to narrow license scenarios to 1) Creative Commons flavors, 2) negotiated with publisher, and 3) orphaned works.

Action Items: Libraries will include the URL to save in <license-url> in the Packing List. If the object to be scanned is public domain or out of copyright, we will determine which flavor of Creative Commons to apply. If we’ve negotiated rights for a title, we’ll document that, including any paperwork necessary, on a web page in the BHL Portal and submit the URL to store in <license-url>. If the item is an orphaned work, we will include a text statement in the Packing List describing the measures taken to determine orphaned status; this text will be stored in <copyright-status>. Steve is working to give BHL some ideas of what he thinks he can do. Bernard will be the contact to see whether we are on target with solving this.
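
The routing agreed above (a URL goes into <license-url> for Creative Commons and negotiated rights, a text statement goes into <copyright-status> for orphaned works) can be sketched as a small check on Packing List values. The scenario labels and the `rights_fields` helper are hypothetical; only the two field names come from the discussion.

```python
def rights_fields(scenario, value):
    """Return the IA field (as a one-entry dict) that should hold `value`.

    scenario: "cc", "negotiated", or "orphan" (hypothetical labels for the
              three license scenarios the group settled on).
    value:    a license URL for "cc" and "negotiated", or a free-text
              due-diligence statement for "orphan".
    """
    if scenario in ("cc", "negotiated"):
        # Both scenarios store a URL, so reject anything that is not one.
        if not value.startswith(("http://", "https://")):
            raise ValueError("license-url expects a URL, got: %r" % value)
        return {"license-url": value}
    if scenario == "orphan":
        return {"copyright-status": value}
    raise ValueError("unknown scenario: %r" % scenario)


# The CC URL below is only an example value; the purl.org address is the
# [nr1] entry from the license list in Steve's message.
print(rights_fields("cc", "http://creativecommons.org/licenses/by-nc-sa/3.0/"))
print(rights_fields("negotiated", "http://purl.org/bhl/neg/rights/1"))
print(rights_fields("orphan", "Publisher defunct; no rights holder located."))
```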

Foldouts

  1. When available, what price?
Outcome of Discussions: New foldout solution to be rolled out in primary scanning centers (NHM, Boston, SIL) by end of November. Another option may be available to allow libraries to use their own scanning equipment to scan foldouts, transfer images to USB flash drive or network, and insert them at the time of Scribe scanning.

Action Items: Robert to roll out existing foldout solution. Option 2 will be tested at UNC after their scanning center is up and running.

Language in OCR

  1. MARC fixed field to indicate
Outcome of Discussions: Solution in place within a few days for all newly scanned items. Procedures needed for reprocessing books already scanned & derived; could be a problem as files will have to be wiped out to make processing possible.

Action Items: Betsy to work with Raj & Steve on reprocessing UIUC scans & evaluate procedures for other BHL partners. Who from BHL will be the point person to help Raj work on closing the loop on older files that we need?

Inconsistent practices across scanning centers

  1. Marking different items when multiple volumes bound together
  2. Recording IA identifier back in Packing List
Did not discuss

CiteSeer

Did not discuss

Post-scanning quality control checks

  1. Done by IA?
    1. No idea of what has passed or not or why
  2. If the host library wants to QC, how?
  3. What is the process?
  4. Some orgs can’t see red rows; some had to ask for rejects

OAI harvesting is the best way to get things out of IA (?) … There are some real problems with the use of "Date", especially with revised logs.

Editing metadata after scanning

  1. Wrong assignment of metadata
    1. Mismatched books
    2. Bound-with (title info for the first item is used for all items in the bound-with)
  2. Truncated fields
    1. Diacritics truncate field

Outcome of Discussions: There is a Post MetaFetch available for reassigning MARC after scanning, but it could be risky to open to users. There is also an interface for editing _meta.xml at www.us.archive.org/biblio. The diacritic truncation appears to be a character encoding problem inherent in fetch, but it can be resolved by loading the record in Biblio.

Action Items: When we find wrong assignment of MARC (bound-withs or otherwise), we should inform Marcus, who can safely make the change for us. IA to provide URL for editing _meta.xml. Keri and Suzanne to walk Robert through problems encountered at SIL & discuss solutions; they may get access to the editing tools. Diacritics are a known problem and are on the "being fixed" list; Steve is contact at IA (?)

Saving copies locally

  1. Speeds of download - outbound bandwidth
    1. Other methods, such as BitTorrent?
  2. APIs for inquiry & ingest

Outcome of Discussions: IA just increased outbound bandwidth. OAI better alternative for ingest than UIUC’s existing (but working!) solution. Should evaluate RSYNC in ingest process.

Action Items: Tim Cole to work with Raj & Steve to refine OAI implementation at IA.
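
For reference, an OAI-PMH harvest of the kind discussed here is a loop of ListRecords requests, each following the resumptionToken returned by the previous page. The sketch below only builds the request URLs and parses one page; it makes no network calls. The base URL and sample response are placeholders, since IA's actual OAI endpoint is not given in these notes; the verb, argument names, and XML namespace are those of the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://example.org/oai"  # placeholder, not IA's real endpoint


def list_records_url(base_url, from_date=None, until=None, token=None):
    """Build a ListRecords request URL for a first page or a follow-up page."""
    if token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments
        # besides the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date:
            params["from"] = from_date  # selective harvest by datestamp
        if until:
            params["until"] = until
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return ids, (token or None)  # an empty token marks the last page


# Made-up sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:item-1</identifier>
      <datestamp>2007-10-18</datestamp></header></record>
    <record><header><identifier>oai:example.org:item-2</identifier>
      <datestamp>2007-10-18</datestamp></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
print(ids, token)
print(list_records_url(BASE_URL, from_date="2007-10-01"))
print(list_records_url(BASE_URL, token=token))
```

Because selective harvesting filters on each record's datestamp, the "Date" problems noted under quality control would bear directly on incremental harvests that use `from` and `until`.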

Roundtripping metadata

  1. How to keep library, IA, BHL and OpenLibrary metadata in sync
Outcome of Discussions: Bibliographic Metadata: Changes to library MARC are rare after scanning; it is not important to keep MARC in IA in sync with the library. OpenLibrary was out of scope for those assembled. Other Metadata: IA wants it (such as names); exact specs to be defined. How can we update stuff when needed? IA does want as much "stuff" as possible, in XML. Continual dropping off of information is possible. Web IDing is an issue. Cautions about "over stomping" things.

Action Items: Chris to discuss naming conventions with Raj.