BHLE_WP2_D2point1
D2.1 Catalogue of content holder requirements (quality, quantity, accessibility, standards, specs of content and metadata)
BHL_Digitization_Specs_20090520.pdf
Task: Circulate existing BHL digi specs for comment and review via BHL wiki
Deadline: 30 June
Intro:
This is what the BHL uses when advising external users on how to produce content of a minimum standard for ingest to BHL.
It should be detailed enough to cover what we need, but not raise the barrier to contribution to an unacceptable level.
Our main goal with BHL is to get clean scans from which reasonable OCR can be derived for indexing.
When referring to the above PDF, please state the page number as shown in Acrobat, to help the discussion.
Let the discussion begin ....
Bernard (20/5/09)
[continues below at section "Discussion 20 May to 30 June"]
[Bernard 10 Jul 2009]
Thanks for all contributions so far. More revisions required (suggestions from DFG).
CLICK HERE FOR NEW ROUND 2 WIKI PAGE.
*
[Bernard 25 June]
DEADLINE OF 30 JUNE APPROACHES - DRAFT 2 HERE D2point1-BHLE.doc
I've tried to incorporate all views here, so please take a last look.
Outstanding issues:
1. TIFF vs JP2
How about:
preservation format = TIFF,
access format (where supplied) = jp2k?
Note that jp2k is gaining support: open-source, scalable delivery is now possible via djatoka.
See http://bit.ly/15sZuG for its use in BHL.
2. Two pages per image file
Should we specify one page per image? I vote yes: if BHL takes on the cropping of double-page images, that is likely to be time-consuming if done manually, or to give poor-quality results under a highly automated model.
[end Bernard 25 June]
[Francisco 10 July] I went through the document. I have no objections. Very well done, good work.
Just some suggestions. Page type description: you could insert "Pretitle" and "Plate". Plate would be good, because in BHL works they are called Text, and there is usually no text on a plate. Blank is a good idea too.
You could also insert a note in the obligation column, e.g. "Page number required for paginated monographs and serials". I saw you inserted specifications above too, so this could be an option.
No objections to specifying one page per image. This should be the usual standard with modern scanning machines.
*
Discussion 20 May to 30 June
Comment from Francisco Welter-Schultes:
p. 6 "Page":
Page number and page type are set to be optional. If I understand this correctly, this means that books can be submitted without pagination, regardless of whether the original is paginated or not. This is insufficient for us. It should not be possible to submit a paginated book of 200 pages whose pages cannot then be accessed via the scrollbox in the viewer.
I know that, for example, Missouri Botanical Garden has already submitted many books without any documentation of page numbers for the scrollbox. This is
extremely unhelpful for the scientists who need to consult information on a particular page. Digitising is more than just scanning millions of pages. It is much more useful for a scientist to have a bitonal 100 dpi lowest-quality scan of a book than a 600 dpi highest-quality OCR'd colour scan for which the page numbers were never recorded for the scroll box. Taxonomists consulting early works rely much more on page numbers than on OCR.
The documentation of Roman page numbers (often used in the introductory sections of monographs) should also be done well, so that these too are shown in the pagination scroll boxes. Many early scientific works have plates with figures; these plates have numbers and should show up in the scroll box too. This should be reflected in the BHL standards.
For those works which have no, or insufficient, page numbers recorded for the scroll box, it should be made technically possible to add the missing metadata, for example by users. This would require at least two more fields in p. 5 section 2 "Item": not only the scanning institution and date should be recorded, but also, if different, the institution (or person, if an open system for this purpose can be created) that added the metadata (page numbers), and the date.
Having a scanned book without recorded page numbers is so unacceptable that we have already discussed scanning such works again.
[Chris] The books that Francisco mentions above were scanned for the BHL portal by the Internet Archive, and we just completed a retrospective cleanup of those books whose page numbers were missing - we missed the files in an early first load of the portal. I agree that page numbers are incredibly important, but this was originally set to
optional because a) not all books have page numbers, b) not all pages are paginated, and c) there's no standard to use for the description of page numbers (to my knowledge, and I am eager to be proven wrong here).
The process by which page numbers are applied to scanned books is not as straightforward as one would hope, for the reasons described above. MOBOT built a purpose-driven application called the Paginator to aid in the application of page numbers. But it's hooked into Botanicus and doesn't work standalone. The MOBOT developers have good experience and a viable application model that should be considered for reworking to meet the needs Francisco outlines above (and for turning into a standalone product).
[Bernard] The BHL model has been that it is better to have a scanned image which can then be enhanced with metadata than not to have the scanned image at all. Reason: the OCR is performed and added to the species index so that the species is discoverable to the researcher. BHL has developed a pagination tool to retro-add page numbers. We may want to think about developing this further. In theory, you add page 1 at scan n, populate through, and the job is done (see the sketch below). That may be crude, but my point is that there are ways and means, and I would prefer to leave this in. Anyone else have a view?
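A minimal sketch of the "populate through" idea Bernard describes (the data layout and field names are hypothetical; a real tool would also have to handle Roman numerals, plates and unnumbered leaves, as Francisco points out):

    # Hypothetical sketch: given the scan index at which printed page 1
    # (or any known page) appears, assign sequential numbers to the
    # following scans. Field names are illustrative only.
    def populate_pages(scans, anchor_index, first_page=1):
        for offset, scan in enumerate(scans[anchor_index:]):
            scan["page_number"] = first_page + offset
        return scans

    # Example: printed page 1 is the 5th scan (index 4) of a 200-page book.
    scans = [{"page_number": None} for _ in range(204)]
    populate_pages(scans, anchor_index=4)
    assert scans[4]["page_number"] == 1
    assert scans[203]["page_number"] == 200

This is exactly the "crude" model Bernard mentions: plates and front matter would break the simple arithmetic and need manual anchors.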
[Richard] Yes, I have one. I would definitely not let the users of the database add page numbers to the existing images, as there is a high risk of incorrect pagination and, therefore, of corrections being needed afterwards, which can be very difficult and tedious. There ought to be just one person at every institution delivering the scanned images who is responsible for adding the metadata, including the page numbers.
[Tom] Although I understand Francisco's concerns, I agree with Richard: adding page numbers can sometimes be quite complicated, especially in the case of old books. What is important is that a researcher is able to establish the page number himself or herself (e.g. by flipping back and forth a few pages). This may be inconvenient, but I think it is preferable to supplying a wrong page number.
[Francisco] Well, I just would like to see BHL being useful. It would be great if we as taxonomists could work efficiently with BHL files.
The usual standard in BHL is (or seems to be) that the title page and the main page blocks are recorded (and show up in the scrollbox). Only very exceptionally have I seen works with totally unrecorded page numbers (and I do not remember them offhand). A big problem taxonomists have with BHL files is that plate numbers do not show up in the scrollboxes.
Please try to verify the following statement: "Shuttleworth (1878: plate 5 fig. 3) gave a figure of Spiraxis mitraeformis Shuttleworth, 1852."
True or false?
http://www.biodiversitylibrary.org/item/41550
This is an example of how long it takes to find such a plate if its number does not appear in the scrollbox, and of my statement that OCR does not help us at all. OCR is not used as a tool by the AnimalBase team in our work because Latin names are often incorrectly OCR'd, and generic names were often abbreviated or not cited; both problems can be seen in the example (arbitrarily selected - it was just the first digitised work that I opened when looking for such an example; this is a very common problem).
Another point in favour of obligatory page numbers is quality control by the user.
If page numbers do not show up in the scrollbox, a user cannot see whether 10 pages in the middle of a book are missing, appear twice, or were mixed up.
An example of such an error in a BHL-digitised file is Forskål 1775.
http://www.biodiversitylibrary.org/item/18564
Forskål 1775 (a very important early work with hundreds of new zoological names) was digitized in 2008 by Missouri Botanical Garden. When I saw it online at BHL, I decided not to digitize this work in the AnimalBase project. Last month our team had to work with it and realized that something was wrong with the file; we compared the BHL files with the original print in our own library and realized that some pages were either not scanned or do not show up. Four important pages (pp. XVI, XVII, XX, XXI) are missing from the digitized file. We lost time and money with this research, but without page numbers in the scrollbox we would have lost much more time.
Maybe someone knows who is in charge at Missouri to fix the problem and can inform that person.
[Chris: Francisco, every book in BHL is attributed to the scanning library in the lower right corner of the page. This issue with Forskål is a problem with a book from the MBLWHOI Library, not MOBOT. This book was scanned by the Internet Archive. I will inform them and MBLWHOI of the problem.]
Yes, I would also prefer that library staff, and not users, add the metadata including page numbers. If the Goobi workflow administration system were used, the person who added the metadata would be recorded automatically.
[Antonio] As a personal trial, I have done a preliminary scan of an edited book with chapters by several authors. It seems to me that if you are interested in one specific contribution in such a book, indexing is a must, or you will have to look page by page until you reach your target. I do not know if this comment comes too late, but there are many different situations besides this example where an index can save you precious time.
p. 5 "Item": As a taxonomist I am not always familiar with the technical expressions, but I see that Start Volume and Start Date is required for serials. This might perhaps solve a serious problem for future scans. I observed that many journal volumes were submitted to BHL without that the
volume number and year are visible at the bibliography page. Instead of 65 (1876) - 66 (1877) - 67 (1878) etc. it simply says 1 - 1 - 1... (I saw journals with more than 20 volumes recorded this way), and the reader is forced to open every single volume, search the title page and verify on the scanned page year and volume number. We should regard this as absolutely unacceptable.
I understand that with the currently proposed standard submitting such badly recorded journal volumes will not be possible any more.
Here too my question: who will add the corresponding metadata for journal volumes already contained in BHL under the described low standard? I guess we will need tools.
[Bernard] Tricky. There is such diversity between journals that these are not always easy to find. Our standard will help.
I think at least one of start volume or start date should be "required". BHL USA has an admin interface for correcting the poor metadata, just no time to do it!
[Francisco] If nobody has the time to do it, that justifies developing a wiki-like tool so that volunteers can add these data.
I will see if I can find some examples. Maybe they have corrected some items in the meantime and the problem is solved.
[Francisco, 1 month later] It seems that this problem is solved. I have not seen any such examples in the meantime. Perhaps someone has found the time to fix the bugs and add the metadata.
p. 3 "Title": you gave
Start Date Published and
End Date Published as optional, without distinction serials and monographs. I am used to a standard Start Date Published and End Date Published only for serials, while for monographs I would rather say "Date Indicated on Title Page" - because this is the date a monograph is usually recorded in German library catalogues. Scientists also use this date, and I would also suggest to make all these dates an obligatory requirement. If the date was not recorded on the title page (this happens occasionally in monographs), library catalogues use solutions like [1801] or [?1799], only very rarely I observe bibliographical records really without dates. Note that if a year is given on the title page, and it later turned out to be incorrect, the date of the title page remains the bibliographically relevant date, and the true date of publication is not recorded by libraries. So the term "Date Published" is misleading, this is also true for serials. A journal volume recorded as vol. 15 (1854) might be published in 1855, the library does not record this if 1854 is indicated on the title page. This is why in AnimalBase we use the term "Bibliographically Relevant Date" instead of "Date Published".
For current BHL items that have no date recorded in their metadata, here too I propose adding a tool allowing external users to add the date manually.
[Bernard] Yes: Date indicated on title page (monographs) required.
p. 7 "Creator": Name is required, in the description it says "last name, first name". Taxonomists usually only work with "last name, first name initial s". In AnimalBase we use 3 fields: (1) first author last name, (2) first author initial(s), (3) all authors with last names and first names initials, all separated by commas, except that the last author's last name is preceded by a &.
Comment from Richard Sipek:
1) Why is it important to have 600 dpi bitonal TIFF images? Isn't it unnecessary to have black-and-white pictures at such a high resolution? On the other hand, I would increase the resolution of the colour pictures to 400 dpi. I have read the note as well; however, why is it preferable to have a black-and-white copy at 600 dpi?
[Bernard] Change bitonal to 300 dpi. I would prefer to leave colour at 300 as a minimum, as our primary interest in BHL is OCR (readable copy) and I don't want to exclude legacy material which meets that.
[Francisco] I agree, that's fine. No need for 600 dpi everywhere.
[Tom] Although I think it's a good thing to define the minimum requirements, I feel there's a risk that these might become the de facto standard. Might it not be an option to also define a preferred set of requirements? For instance:
Bitonal: 300 dpi, 1-bit or bitonal TIFF images.
Grayscale: 600 dpi, 8-bit grayscale uncompressed TIFF, or lossless compressed image (e.g. LZW, JPEG2000 [*.jp2]).
Color: 400 dpi, 24-bit color uncompressed TIFF, or lossless compressed images (e.g. LZW, JPEG2000 [*.jp2]).
This way you express what level of quality you prefer (and hence do not 'over'-ask partners) but - because of the minimum set - you also do not exclude material you might want to have.
[Tom] Furthermore, I remember from our meeting that Chris mentioned that BHL is changing towards using JPEG2000 instead of TIFF as the preferred image format because of the more compact file sizes. (Side note: our national library has recently also adopted JPEG2000 as the standard file format for digital images.) Can't we change the standard to JPEG2000 as well and make TIFF an accepted format?
[Henning] The question for me is: JPEG2000 for storage as well, or only for display? Currently our national funding body (DFG) requests TIFF as the format for storage. The reason: it is well accepted and widely used, and has already existed for many years. The chance of TIFF files being lost is considered low, as many archives work with TIFF and migration procedures will be available should newly established formats appear.
[Francisco, 24 June] Henning refers to the DFG (German Research Foundation) requirements. These are published here:
http://www.dfg.de/forschungsfoerderung/wissenschaftliche_infrastruktur/lis/download/praxisregeln_digitalisierung_en.pdf
In UGOE (Goettingen) these are taken as the Magna Carta of all technical requirements for digitisation.
This is a very interesting document.
"2.2.2. (...) Given that popularity and software support are key criteria when choosing a master format, JPEG2000 cannot currently be recommended for archiving purposes."
[Tom] The document doesn't mention cropping and/or deskewing of the scans. I'm not sure how this is treated in BHL at the moment. Perhaps Bernard can shed some light on this? I'm a bit hesitant on this point myself. From a preservation point of view, you should accept neither cropping nor deskewing. From a presentation point of view (the user perspective?), however, I think it is highly recommended.
Another issue may be colour management. Again, from a preservation perspective this is highly recommended. But it might also be needed by specialised user groups: colour can be very important in determining species. Note: this obviously doesn't concern the set of minimum requirements. It could, however, be included in the recommended set I proposed earlier.
[Richard] As for cropping, there should definitely be a recommendation (for materials to be scanned, perhaps even a rule) not to crop and not to change the scanned images in any way. This recommendation/rule ought to be applied at least to materials of historical value, where not only the text is important but also the volume itself. In short, wherever it is worthwhile to preserve an image of the physical medium as well as the information borne by the text and pictures.
[Bernard] Cropping/deskewing: Internet Archive retains the uncropped/undeskewed images as well as the modified ones. I agree that the unaltered images should be preserved, without any further colour management applied after the scan.
For cropping, two things:
1. If we only accept uncropped, we have to build the cropping system into the page turner. Our experience is that it is very difficult to achieve an aligned page-turner without cropping (black outlines showing, relative-size problems with roughly cut paper, etc.).
2. If we only accept cropped, we have preservation and loss-of-information problems.
I vote we mandate uncropped but ask for cropped too, if possible. For those who do not supply cropped, we would also need to build this function into the ingest process.
[Chris] I agree with Bernard - if we accept uncropped we'll have to build an app, but I'd suggest that it be separate from the page turner. This is an ingest process used by select members of BHL-Europe institutions (and hopefully others in the future), not a display mechanism used by all BHL users. There are open-source components that could be used/integrated here (like ImageMagick, among others). To get this right we'll need an interface for setting crops on a bookful of images, a queueing mechanism to accept those jobs, a processing engine to perform the crop and prepare derivative copies, and a way to recrop or adjust the individual pages that inevitably fail. The big crunch here will be the processing power required to run the cropping jobs, as those machines will need to be fairly robust.
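A minimal sketch of the processing step Chris describes (the directory layout, crop box and one-box-per-book simplification are all hypothetical; the ImageMagick convert -crop geometry call is standard):

    # Hypothetical sketch: apply an operator-reviewed crop box to a book's
    # page images with ImageMagick, keeping the uncropped masters untouched.
    import subprocess
    from pathlib import Path

    def crop_page(src, dst, box):
        # box = (width, height, x_offset, y_offset) in pixels
        w, h, x, y = box
        subprocess.run(
            ["convert", str(src), "-crop", f"{w}x{h}+{x}+{y}", "+repage", str(dst)],
            check=True,
        )

    # Example job: one crop box for a whole book; a real queue would allow
    # per-page adjustment and a retry path for the pages that fail.
    box = (2400, 3600, 150, 200)
    out = Path("book_0001/cropped")
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path("book_0001/raw").glob("*.tif")):
        crop_page(src, out / src.name, box)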
[Tom] "Mandate uncropped but ask for cropped too if possible" seems a reasonable solution to me. I agree with Chris that, when no cropped images are provided, these ideally should be added during the ingest process. I'm hesitant about the feasibility of it though. On the fly cropping might be more practicable?
May I perhaps again suggest that it might be appropriate to create also a set of preferred specs? It's OK to define the 'bottom line' but why not define a standard set of specs we wish everyone (on average) to meet? Consider it as the standard level of quality for contributions to BHL.
2) "Do we mean an authority record for the authors of the title?!" A connection could be established between BHL and an authority database of one or more national libraries. The records would be compared and, if needed, corrected automatically. This solution works in many libraries.
[Bernard] Yes. Do you know of a freely accessible catalogue we could refer people to for that?
[Richard] Well, e.g. the British Library uses NACO (Name Authority Cooperative Program,
http://www.loc.gov/catdir/pcc/naco/naco.html) run by the Library of Congress. I believe use of the database is free as long as the users contribute to it.
[Francisco, 26 June] In Germany they use a database called the Personennamendatei (PND)
http://de.wikipedia.org/wiki/Personennamendatei
created by the German National Library in Frankfurt, which has an open PICA interface:
(dead link removed - it was copied from within a session; see Henning's link below)
The shortcoming is that there is no standardized Application Programming Interface (API) which would allow automatic queries from the Goobi workflow system.
Ralf Stockmann (UGOE) considers this database "presumably the best one in the world". The PICA interface is only in German, as far as I could see. To create an API they would need to write a DFG application, and then this could be a very well usable database internationally. As far as we know, nobody is working on this. But it is possible that BHL is already able to work with the open Z39.50 PICA interface.
[Henning, 26 June] Concerning PND, I was not able to use Francisco's link, but this one works for me:
http://z3950gw.dbf.ddb.de/z3950/zfo_get_file.cgi?fileName=DDB/searchForm.html
Hope that works. You can look for "Autor" (author) or "Personen" (persons), for example.
[Francisco, 27 June] Henning's link is good.
Example of its use: insert "linnaeus" into the field "Personen" (= person names), then click "Suchen" (= search); you will get 41 results. Click on result no. 3 (Linné, Carl von (Biologe, 1707-1778)). There you will get the list of synonymous names for this author. Each person name has an ID (here 118573349). You can also try this with your own name, if you have published a monograph and are in the database.
3) "4. Creator: A 'Creator' is defined as a person or company responsible for the creation of the Title." I am not quite sure whether the field should always be "required", as there are books with neither a personal nor a corporate author. In such cases there should be an allowance to fill in e.g. "[Anonymous]" or "[S. n.]".
[Bernard] Add "(optionally you may use [Anonymous] or [S. n.])" to this field.
[Francisco] Anonymous is necessary, of course. In AnimalBase we set the creator last name to "Anonymous" and the first-name initials to "none"; this works perfectly.
Comment from Uwe Müller:
1) Metadata encoding: If I understand the document correctly, it is assumed that the metadata are not provided/recorded by the content providers themselves but by BHL. The given MARC21 example file does not contain any page-level information, for instance. This leaves most of the bibliographic work to the central level of BHL. Shouldn't we at least provide a recommendation on how to encode metadata prior to the ingest process, including the relations between the logical structures (volumes, issues, chapters, ...) and their physical representations (pages)? I would propose using METS here, as it is a self-contained metadata structure referencing all logical and physical parts of a scanned object in one single file. I understand that requiring METS files as a prerequisite would build a high hurdle for many content providers within BHL-Europe. But including a recommendation in the mentioned sense could help standardise the whole digitisation process.
[Bernard] Yes, I think we should explicitly state METS as an advisory and build our ingest to be able to import it easily. A sketch of what such a file could carry is below.
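As an illustration only (not an agreed BHL profile; the element choice and file layout are assumptions), a skeletal METS file of the kind Uwe proposes could link printed page numbers to image files like this:

    # Hypothetical sketch: build a skeletal METS document that ties logical
    # pages (with printed page labels) to the physical image files.
    import xml.etree.ElementTree as ET

    METS = "http://www.loc.gov/METS/"
    XLINK = "http://www.w3.org/1999/xlink"
    ET.register_namespace("mets", METS)
    ET.register_namespace("xlink", XLINK)

    mets = ET.Element(f"{{{METS}}}mets")
    file_grp = ET.SubElement(
        ET.SubElement(mets, f"{{{METS}}}fileSec"), f"{{{METS}}}fileGrp", USE="MASTER")
    volume = ET.SubElement(
        ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="PHYSICAL"),
        f"{{{METS}}}div", TYPE="volume")

    # Printed page labels, including Roman numerals, go in ORDERLABEL.
    for i, label in enumerate(["i", "ii", "1", "2"], start=1):
        f = ET.SubElement(file_grp, f"{{{METS}}}file", ID=f"IMG{i:04d}")
        ET.SubElement(f, f"{{{METS}}}FLocat", LOCTYPE="URL",
                      attrib={f"{{{XLINK}}}href": f"images/{i:04d}.tif"})
        page = ET.SubElement(volume, f"{{{METS}}}div", TYPE="page",
                             ORDER=str(i), ORDERLABEL=label)
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=f"IMG{i:04d}")

    ET.ElementTree(mets).write("item.mets.xml", encoding="UTF-8",
                               xml_declaration=True)

Note how this would also let the page-number problem discussed above travel with the package: ORDERLABEL carries the printed page number, ORDER the scan sequence.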
Comment from UH-Viikki, Helsinki
Concerning file submission guidelines and BHL metadata: we suggest the use of the OAI-PMH interface as an option for submitting metadata. This implies using Dublin Core metadata for the items. Dublin Core does not provide fields for page numbers, scanning institutions or scanning contributors, so the application of these fields should be defined and standardised by the BHL partners. This would enable the harvesting of metadata in a similar way to that planned for Europeana.
[Uwe] OAI-PMH is a good idea. As for the metadata format, we are not at all restricted to DC and can define and use our own metadata schema. But OAI-PMH is useful only for exchanging metadata. As the spec proposes, the digitised material itself would be transported by mail on either hard drives or DVDs. While this is a pragmatic solution, it surely is not a very nice one ...
[Bernard]: If OAI + METS is possible, I think we should offer it as an option for metadata upload.
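For the metadata leg, a minimal sketch of an OAI-PMH harvest (the endpoint URL is a placeholder; verb, metadataPrefix and resumptionToken are standard OAI-PMH protocol parameters):

    # Hypothetical sketch: harvest all records from a provider's OAI-PMH
    # endpoint, following resumption tokens. The endpoint is a placeholder.
    import urllib.request
    import xml.etree.ElementTree as ET

    ENDPOINT = "https://example.org/oai"  # placeholder
    OAI = "http://www.openarchives.org/OAI/2.0/"

    def list_records(metadata_prefix="oai_dc"):
        token = None
        while True:
            url = f"{ENDPOINT}?verb=ListRecords" + (
                f"&resumptionToken={token}" if token
                else f"&metadataPrefix={metadata_prefix}")
            root = ET.parse(urllib.request.urlopen(url)).getroot()
            for record in root.iter(f"{{{OAI}}}record"):
                yield record
            tok = root.find(f".//{{{OAI}}}resumptionToken")
            if tok is None or not (tok.text or "").strip():
                break
            token = tok.text.strip()

    for record in list_records():
        print(record.find(f"{{{OAI}}}header/{{{OAI}}}identifier").text)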
[Chris] I agree that we should use METS as the "container" object by which we transport page-level and other structural metadata, as Uwe suggests. I also agree that making it a prerequisite would build a hurdle for contributors. Does anyone know of tools already available that could do this page-level annotation and export a METS file (or any XML file that could be transformed to METS)? Also, METS has so many ways of being used, since it's a generic container. Who among us in BHL/BHL-E has experience generating and using METS files?
[Bernard]: MODS may be useful
http://www.loc.gov/standards/mods/ within the METS file. MARCEdit (a brilliant and free tool!) can probably convert MARC exchange format to this; it certainly will to MARCXML.
[Tom] I'm not an expert on this, but I understand that DRIVER (EU project: Digital Repository Infrastructure Vision for European Research) advises the use of MODS ->
Use of MODS for institutional repositories
Perhaps MODS should at least be included as an option for metadata upload?
Correction: this should have been: DRIVER advises the use of DIDL/MODS, so perhaps this can be taken into account next to METS/MODS?
Comment from Graham Hardy, RBGE
I have a question about the format of files contributed to BHL. It arises from a discussion of the specification with RBGE's Head of ICT.
Using specific existing RBGE hardware and software we can supply JPEG files. However, we would prefer to submit scans as multi-page rather than single-page files.
Looking at the way BHL displays pages (separately, in a drop-down structure), I'm not sure that multi-page files are something we can submit. Nevertheless I am asking, just in case there is some way to accommodate this. Please pardon my technical incompetence.
[Bernard] We have the same issue, Graham, and I'm sure others will too, as scanning standards have been refined over the years. Possibly using ImageMagick to split (crop) the pages into singles might help - see the sketch below. We have tried ImageMagick batch cropping with some success and could share the pain.
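A sketch of that approach (the paths are placeholders; -crop 50%x100% is standard ImageMagick geometry that tiles each scan into a left and a right half):

    # Hypothetical sketch: split two-pages-per-image JPEGs into single
    # pages with ImageMagick. Skewed gutters would still need manual review.
    import subprocess
    from pathlib import Path

    out = Path("singles")
    out.mkdir(exist_ok=True)
    for src in sorted(Path("doubles").glob("*.jpg")):
        # %d in the output name becomes 0 (left half) and 1 (right half)
        subprocess.run(
            ["convert", str(src), "-crop", "50%x100%", "+repage",
             str(out / f"{src.stem}_%d.jpg")],
            check=True,
        )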