This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

BHLE_WP2_D2point1_round2

D2.1 - Here we will have a refined discussion of a couple of points, following further comments

[Henning 05/08/09]: D2.1 is now finished. Thanks for all the contributions. This does not mean that the discussion is now over, but needs to be rearranged. Kai will work on this after the STL meeting.



Deadline Fri 17th July

Latest version of D2.1
BHL-E_2pt1_20090710.doc

Table of Contents

1. Minimum encoding standards?
2. Structural datasets like DFG viewer
3. Sub-division of document types
4. Page numbers
5. Contributor / Sponsor clarification (in progress)
Questions (total 5):

1. Minimum encoding standards?

The DFG states
"Each digital reproduction must be catalogued, at least at the title level, according to applicable
library and archive standards and listed in a central reference system (library network, central portal,
virtual subject library, etc.). Analogous rules apply to archives. If data are recorded outside of existing
library networks or central portals, it is expected that interim results be archived in XML."
We cannot mandate MARC21 alone, because contributions of existing material may come from places using archival
cataloguing standards or non-standard formats (spreadsheets, etc.).
What do we expect as a minimum without raising the barrier too high?

Answer:
[Suzanne 31/7/09] I am interested in understanding not only the encoding of the material/schema (MARC, MARC21, UNIMARC, Dublin Core, MODS, etc.) but also the rules applied to the content of the fields: AACR, AACR2, IFLA ISBD, DAC
[Bernard 4/8/09]: Agreed, but this is something for interpretation via an Appendix, given that the deadline has passed. I wouldn't want to be too restrictive here and end up with less content.
[Suz 4/8/09] Yup. As long as we can collect what people are using, we can begin to learn how to interpret what we are getting. I think it will cut down on "assumptions" for some of the field definitions - knowing (for example only) that at Library 1 authors are always added entries and therefore need to be identified for matching with Library 2, which uses the concept of Main Entry. Or that Library 1 abbreviates publisher names and Library 2 doesn't. Or... etc. My intention is not to be restrictive so much as to be more inclusive, and to find out what their data means.

[Suzanne 4/9/09] Comments - Not sure if this is the right document to be looking for this information or not:
Do you want additional information to fill in the tables like the one in 4.3?
Title - Do we want to gather uniform titles? Would they be helpful in the de-duping and gathering of materials together (i.e. FRBRizing)?

The Start Date, End Date and "recorded date" can be found in MARC fixed field 008, bytes 07-14, with the code for the type of date in 008 byte 06.

Other dates are located in the 362 and, as indicated already, in the 260 $c. Here is where the rules for how to identify and record the data are important. Do they use punctuation marks to indicate specific rules? (Use of question marks, square brackets, the abbreviation 'ca.', use of digit fillers "m" or "u" or "-") Do they choose dates from copyright pages, title pages, colophons?
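Suzanne's description of the 008 date bytes can be sketched in code; the sample 008 string and the helper below are illustrative only, not part of any BHL-E specification:

```python
# Minimal sketch: pulling the date type and Date 1 / Date 2 out of a
# MARC 008 fixed field.  The sample 008 string is invented for
# illustration.

def parse_008_dates(field_008: str):
    """Return (date_type, date1, date2) from a MARC 008 fixed field.

    Byte 06 holds the type-of-date code ('s' single, 'm' multiple,
    'q' questionable, 'r' reprint, ...); bytes 07-10 are Date 1 and
    bytes 11-14 are Date 2.  'u' acts as a digit filler for unknown
    positions (e.g. '18uu' = sometime in the 1800s).
    """
    return field_008[6], field_008[7:11], field_008[11:15]

# Hypothetical 008 for a book published in 1849:
dtype, d1, d2 = parse_008_dates("090731s1849    enk           000 0 eng d")
```

Here `dtype` is 's' (a single known date) and `d1` is "1849"; Date 2 is blank because no second date applies.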

Call numbers - An indication of which system is used, and any unique local decision-making, is critical to know. We also need to add codes for other schemes besides the 050 and 090: 055 for our Canadian friends, 060-061 for our Medical friends, 092 for our Dewey-using friends.
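As an illustration of the extra call-number tags listed above, a hypothetical tag-to-scheme lookup might look like the sketch below; the scheme labels are informal glosses, not an agreed BHL-E vocabulary:

```python
# Hypothetical lookup from MARC call-number tags to classification
# schemes, covering the fields mentioned in the discussion above.
CALL_NUMBER_SCHEMES = {
    "050": "Library of Congress call number",
    "055": "Classification numbers assigned in Canada",
    "060": "National Library of Medicine call number",
    "061": "National Library of Medicine copy statement",
    "090": "Locally assigned LC-type call number",
    "092": "Locally assigned Dewey call number",
}

def scheme_for_tag(tag: str) -> str:
    """Name the classification scheme behind a MARC call-number tag."""
    return CALL_NUMBER_SCHEMES.get(tag, "unknown scheme")
```

A harvester could use such a table to record, per contributor, which scheme a given call number came from.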

Subjects - An indication of the thesaurus used, how the subjects are coded, etc. - to make sure we don't limit our content here at all! The more the better, especially if we understand whether or not they are choosing from a controlled vocabulary.

Language is also coded in MARC field 546.

Creator
In table 4.6 for Creator - I am not sure where this data will be coming from, but the more the better: "also known as", cross references, broader context for dates (flourished, worked on mushroom specimens from yyyy to yyyy, worked on mineral specimens from yyyy to yyyy, etc.)

2. Structural datasets like DFG viewer

Comment: "Is BHL working with structural datasets, like: http://dfg-viewer.de/en/structural-data-set/. More data
might be included under page level, like article, beginning and end of article, similar things."
i.e. - can we add extra page-level fields, and what would those be?

Answer:

[Tom Garnett, 10 July] Is the DFG viewer compatible with the NLM DTD http://dtd.nlm.nih.gov/archiving/ ? This is the XML format used by an enormous number of current scientific journals for their digital journal article format. The closer we can move the format of the digitized journal literature in BHL to this format, the easier it will be for the legacy, older journal literature in the BHL to interoperate with the current and future scientific literature. Without compatibility here, we run the risk of BHL journal literature existing in a specialized digital ghetto.

[Chris 7/10] The DFG-Viewer is based on METS/MODS. We should be able to crosswalk data between BHL/BHL-E's implementation of METS/MODS to NLM and other schema.
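A crosswalk of the kind Chris mentions could, in a minimal sketch, lift a field out of a MODS record and re-express it in an NLM-style element. The MODS snippet below is invented for illustration (the title is taken from the La Cepède example later in this page), and a real crosswalk would of course cover many more fields:

```python
# Minimal MODS -> NLM-style crosswalk sketch: extract a title from a
# MODS record and re-express it as an <article-title> element.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
mods = ET.fromstring(
    f'<mods xmlns="{MODS_NS}">'
    '<titleInfo><title>Histoire naturelle des poissons</title></titleInfo>'
    '</mods>'
)

# Pull the title out of the namespaced MODS record...
title = mods.findtext(f"{{{MODS_NS}}}titleInfo/{{{MODS_NS}}}title")

# ...and re-express it as an NLM Archiving DTD style element.
nlm_title = ET.Element("article-title")
nlm_title.text = title
```

The same pattern (find in source schema, re-emit in target schema) generalizes to creators, dates, and identifiers.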

3. Sub-division of document types

[Patricia, July 10] Concerning the sub-division of document types, will this also be part of the meta-data at a higher level. For example will they be sorted in categories and sub-categories? Like
Book
---- Atlas
---- Biography
---- Monograph
---- Encyclopedia
---- e-Book ?
.....
Journal
---- Scientific journal
---- Popular journal
....
or are these categories not considered and it is kept at a very generic level using only Title, Item, Page ?

Answer:
[Bernard 23/7/09]: I can see no reason not to include subdivisions as an optional field.

[Suzanne 31/7/09]: In our MARC records we have some coding that can help separate various types/formats of materials. The fixed-field bytes would be the place to help separate some of these categories and subcategories - possibly not the ones outlined above, but other concepts captured at the time of cataloguing. E.g. Leader byte 06 (type of record) or 07 (bibliographic level)
[Bernard 4/8/09]: Will add LDR bytes 06 and 07 as options.
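Reading Leader/06 and Leader/07 as Suzanne and Bernard discuss might be sketched like this; only a few common codes are shown, and the full code lists live in the MARC 21 documentation:

```python
# Sketch: decode Leader byte 06 (type of record) and byte 07
# (bibliographic level) from a MARC leader string.
TYPE_OF_RECORD = {
    "a": "language material",
    "e": "cartographic material",
    "g": "projected medium",
    "k": "two-dimensional nonprojectable graphic",
}
BIB_LEVEL = {
    "m": "monograph/item",
    "s": "serial",
    "a": "monographic component part",
    "b": "serial component part",
    "c": "collection",
}

def classify(leader: str):
    """Return (type of record, bibliographic level) from a MARC leader."""
    return (TYPE_OF_RECORD.get(leader[6], "other"),
            BIB_LEVEL.get(leader[7], "other"))

# Hypothetical leader for a printed monograph:
rec_type, bib_level = classify("00714cam a2200205 a 4500")
```

For the sample leader this yields "language material" and "monograph/item", which is the kind of category/subcategory split Patricia asks about.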

Question/Statement:
[Suzanne 31/7/09]: I believe we need to see about incorporating some of the work from the Virtual International Authority File, http://viaf.org/
[Bernard 4/8/09]: Will add this as an option.
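As a hedged first step toward using VIAF, a client might simply build a query URL for VIAF's public AutoSuggest name-lookup service; response handling is omitted here, and the endpoint's exact behaviour should be checked against VIAF's current documentation:

```python
# Sketch: build a VIAF AutoSuggest query URL for a personal name.
# The endpoint shown is VIAF's public AutoSuggest service; this sketch
# only constructs the URL and does not fetch or parse the response.
from urllib.parse import urlencode

def viaf_autosuggest_url(name: str) -> str:
    """Return a VIAF AutoSuggest lookup URL for the given name string."""
    return "https://viaf.org/viaf/AutoSuggest?" + urlencode({"query": name})

url = viaf_autosuggest_url("Lacepède, Bernard Germain Étienne")
```

Matching contributor name headings against VIAF clusters in this way could support the cross-reference and "also known as" needs raised under Creator above.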

There also needs to be discussion on the collection of series and serial statements and relationships (citation resolving of numbering, title change connections, etc.)

4. Page numbers

[Francisco 16 July]
Scans without page numbers again. It took some time until I found some examples to illustrate what I meant.
Reeve 1849
http://www.biodiversitylibrary.org/item/35904
This is an important zoological work (Reeve 1849). It has 89 plates, with explanations of plates, descriptions of hundreds of new names of species, and high-quality figures. Digitized in 2008 by Smithsonian Institution Libraries.
No page numbers in the scroll box. I see only "Title page" and text, text, text, text, text. Please try to find Plate 65. It will take you 10 minutes.
This is what I meant: we need page numbers. We cannot find anything in a work if we just have naked pages with no page or plate numbers associated with them.
This is also what I meant when I said, if I find something like that on the BHL website, I would scan the work once again from our own sources.
(I am also asking myself why the Smithsonian scan has such bad colour quality. Colour can be important for a taxonomist.)

Walckenær 1802
http://www.biodiversitylibrary.org/item/34049
This is a work from 1802, where the 130 pages with Roman page numbers are not shown in the scrollbox; the 303 pages of text do show up; plates 1-7 do not show up.

La Cepède 1802
Smithsonian (2009): http://www.biodiversitylibrary.org/item/44010
Harvard (2008): http://www.biodiversitylibrary.org/item/30730
Gallica (2007): http://gallica.bnf.fr/ark:/12148/bpt6k97533v
This is an example of failed deduplication, but it also shows that quite different metadata standards are applied by different libraries.
This would be the complete citation with pages:
La Cepède, B. G. E. de 1802. Histoire naturelle des poissons. Tome quatrième. - pp. j-xliv [= 1-44], 1-728, Pl. 1-16. Paris. (Plassan).
The 16 plates are inserted within the work, for example Pl. 16 comes after p. 674.
Harvard has simply skipped the numbering of the Roman pages in the scrollbox, in contrast to Smithsonian, where the Roman pages can also be selected. In both cases the plate numbers were not recorded in the scrollbox. A researcher who needs to find Plate 12 quickly does not get good service from either of the two scans.
Harvard has a higher resolution, but has cropped very closely at the margin, so that sometimes text with information was cut off, for example on Pl. 16. Smithsonian has scanned at a lower resolution, but without cutting the margins too sharply.
So although Harvard has a higher resolution and certainly a more expensive scanning machine, the result is inferior to that of Smithsonian, who have worked more carefully.
Gallica still has an extremely ugly search function; this has always been a problem and still is. But in its scans Gallica has applied, as so often, the best metadata standard of all. The scrollbox shows all 3 paginated sets: the Roman pages in Roman numerals, the Arabic pagination in Arabic numerals, and finally, the best service of all, the plates in Arabic numerals. After p. 674 we just see p. 16. This is simple, and of course incorrect because it is not a page (but a plate), but extremely useful.
Gallica's scanning quality is really ugly, just bitonal at a very bad resolution, including the plates (Harvard and Smithsonian have colour scans, and their black-and-white plates look much better than Gallica's), but on the other hand the images load much more quickly than those from Harvard and Smithsonian.
For the taxonomic user who needs to consult information quickly, the Gallica document will certainly be by far the best choice.

La Cepède 1800 (volume 2 of the same series)
Here we can compare 4 scans from within 2 years:
Gallica 2007: http://gallica.bnf.fr/ark:/12148/bpt6k975315
Göttingen 2008: http://resolver.sub.uni-goettingen.de/purl?PPN574644156 and http://resolver.sub.uni-goettingen.de/purl?PPN574644423
Harvard 2008: http://www.biodiversitylibrary.org/item/30017
Smithsonian 2009: http://www.biodiversitylibrary.org/item/44012
In Göttingen the plates were bound in another volume and scanned separately. The plate numbers do not show up in the scrollbox, but they do show up in the "Inhaltsverzeichnis" (table of contents) section. I have told them to put them in the scrollbox in future scans.
A page from Harvard or Smithsonian takes much longer to load than one from Göttingen, which has much higher scanning quality (Gallica and Göttingen 6-7 sec, Harvard and Smithsonian 12-15 sec, sometimes up to 30 sec).
It is almost impossible to compare the scanning quality of Harvard and Smithsonian with that of Göttingen and Gallica if you do not know where Plate 12 is bound in the Harvard and Smithsonian volumes. If you do not know that, you cannot find the plate in the scan without searching 30 minutes for it.

Answer:
[Chris Freeland: 8/3/2009]: This is an answer to Francisco's issues concerning page numbers. The Internet Archive applies page numbers to the books as, or shortly after, they are scanned. IA only records page numbers for text pages; they do not record plate numbers or tag the images as illustrations. We have raised this issue with them on numerous occasions, but they cite the need to keep their metadata creation & entry to a minimum to achieve the very affordable 10 cents a page for digitization. The lack of page numbers shouldn't force the rescan of a book; the page numbers are metadata applied alongside the images, meaning that you can always add missing page numbers and plate numbers to an existing scan. Missouri Botanical Garden in its scanning operations does apply plate numbers and does tag images with a page type, using an application we developed called the Paginator. It allows the description of metadata on every page, resulting in page-level metadata as shown here:
http://www.biodiversitylibrary.org/item/9519

We have a web version of the Paginator for BHL in the BHL Admin. The only issue with its use is the staff time required to do the work. From MOBOT's experience it can take as long as 15 minutes to properly paginate a single volume, though the vast majority take 2-5 minutes. I've added this to the list of items to discuss in Leiden. I'd like to demo the tool and then discuss implications with what AIT are building.
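The page-level metadata a Paginator-style workflow produces can be sketched as simple records of sequence number, printed label, and page type, which turns the "find Plate 12" problem into a direct lookup instead of 30 minutes of scrolling. The record shape and the sample data below are invented for illustration (the Pl. 16 / p. 674 positions echo the La Cepède example above):

```python
# Sketch of Paginator-style page-level metadata: each scanned image
# carries its scan position, its printed label, and a page type.
from dataclasses import dataclass

@dataclass
class Page:
    sequence: int   # position of the image within the scan
    label: str      # printed page or plate number ("xii", "674", "16")
    page_type: str  # "text", "plate", "title", ...

pages = [
    Page(700, "673", "text"),
    Page(701, "674", "text"),
    Page(702, "16", "plate"),  # Pl. 16 bound in after p. 674
    Page(703, "675", "text"),
]

def find_plate(pages, number: str) -> int:
    """Return the scan position of a plate by its printed number."""
    return next(p.sequence for p in pages
                if p.page_type == "plate" and p.label == number)

pos = find_plate(pages, "16")
```

With such records, a scrollbox can show "Pl. 16" between pages 674 and 675, and a reader jumps straight to it.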

[Francisco, 03 Aug 2009]
The given example is a perfect solution.
"meaning that you can always add in missing page numbers and plate numbers based on an existing scan." - who is "you", any BHL user or only a registered member?

[Chris: 8/3/2009]: Only registered users have the ability to change book or page metadata. Each BHL library that contributes materials to the portal has at least one person who is a registered user and who can make updates. Again, this is an issue we need to discuss as a group at the Leiden meeting.

5. Contributor / Sponsor clarification (in progress)

[Bianca Lipscomb: 8/5/2009]: It would be useful to clarify the language associated with the Item-level Definition table (4.4.2). The way I have come to understand it, "Scanning Contributor" is the institution(s) that contributed the physical objects for digitization, whereas "Scanning Sponsor" is the institution(s) funding and/or performing the actual scanning. This is at least how different US institutions have handled this for IA:
Examples: