This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

Rotating pages experiment

Smithsonian Libraries noticed that in some books there were tables with lists of scientific binomials that were oriented perpendicular to the page. Some of these tables had names which did not appear elsewhere in the text.

We conducted an experiment: scan those pages as is, then scan them again and rotate the second image so that the table text would (hopefully) be OCR'd along with the rest of the text. We shot each rotated page as a foldout (many in the one title below were in fact foldouts anyway) but rotated the image before uploading it to IA.

TL;DR version:

It didn't work reliably. Sometimes OCR would try to re-orient the rotated page (possibly because it detected the rotated header/page number). Sometimes it would try to OCR the page but produce only garbage. On rare occasions, it would actually OCR the names. Part of the problem may also be the low resolution of the foldout station camera, which was used to shoot most of these pages. Many pages were below 300 ppi, and as low as 150 ppi for larger foldouts.
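As a rough illustration of the resolution point: the effective resolution of a shot is the pixel count along an edge divided by the physical length of that edge, so a fixed camera covering a larger sheet yields fewer pixels per inch. The sketch below is only an assumption-laden example. The 3000-pixel capture width and the sheet sizes are hypothetical, not measurements from the foldout station; only the 300 ppi threshold comes from the figures above.

```python
# Minimal sketch: estimate the effective resolution (ppi) of a capture.
# The pixel counts and sheet sizes used here are hypothetical examples,
# not specifications of the foldout station camera.

TARGET_PPI = 300  # threshold mentioned in the text above


def effective_ppi(pixels_along_edge: int, inches_along_edge: float) -> float:
    """Pixels along one edge divided by the physical length of that edge."""
    if inches_along_edge <= 0:
        raise ValueError("physical length must be positive")
    return pixels_along_edge / inches_along_edge


def below_target(pixels_along_edge: int, inches_along_edge: float,
                 target: float = TARGET_PPI) -> bool:
    """True if the capture falls short of the target resolution."""
    return effective_ppi(pixels_along_edge, inches_along_edge) < target


if __name__ == "__main__":
    # Same hypothetical 3000-pixel capture, two hypothetical sheet widths.
    for label, pixels, inches in [("regular page", 3000, 10.0),
                                  ("large foldout", 3000, 20.0)]:
        ppi = effective_ppi(pixels, inches)
        flag = "below target" if below_target(pixels, inches) else "ok"
        print(f"{label}: {ppi:.0f} ppi ({flag})")
```

A check like this could be run on a shot before deciding whether a foldout with small text is worth capturing at that station.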

If you have questions about this process, feel free to contact Keri Thompson at Smithsonian Libraries (thompsonk@si.edu).




Example of a book with searchable scientific names that appear in the rotated shot and nowhere else:

http://www.biodiversitylibrary.org/ia/palaeontol43419611962pala#page/385/mode/1up

Here’s another example of a book in which the OCR worked, though there is some garbling (perhaps due to the low resolution of our foldout camera):

http://www.biodiversitylibrary.org/ia/palaeonto373419941995pala#page/405/mode/1up

Here is another example in which the OCR successfully picked up text, but it is heavily garbled (again, probably due to low resolution), making it highly unlikely that anyone would actually be able to search for and find any scientific names:

http://www.biodiversitylibrary.org/item/196338#page/175/mode/1up

The garbling issue raises questions about whether or not the foldout station camera is adequate for shooting anything with small text.
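One way to put a number on "garbled" would be to compare a name we know is printed on the page against the OCR text, once as an exact search (what a reader searching for a name effectively relies on) and once as a fuzzy match. The sketch below uses Python's standard `difflib`. The function names are made up for illustration, and the OCR strings are invented examples, not output from any of the books linked here.

```python
# Minimal sketch: how recoverable is a known name from OCR text?
# Function names and sample strings are hypothetical illustrations.
import difflib


def is_exact_hit(name: str, ocr_text: str) -> bool:
    """Case-insensitive exact search, with whitespace normalized."""
    normalized = " ".join(ocr_text.lower().split())
    return " ".join(name.lower().split()) in normalized


def best_match_ratio(name: str, ocr_text: str) -> float:
    """Best similarity (0.0 to 1.0) between the name and any run of
    the same number of consecutive words in the OCR text."""
    target = " ".join(name.lower().split())
    n = len(target.split())
    words = ocr_text.lower().split()
    best = 0.0
    for i in range(len(words) - n + 1):
        candidate = " ".join(words[i:i + n])
        ratio = difflib.SequenceMatcher(None, target, candidate).ratio()
        best = max(best, ratio)
    return best


if __name__ == "__main__":
    clean = "the species Homo sapiens was recorded"    # invented sample
    garbled = "the species Hom0 sap1ens was recorded"  # invented sample
    for text in (clean, garbled):
        print(is_exact_hit("Homo sapiens", text),
              round(best_match_ratio("Homo sapiens", text), 2))
```

A name that only shows up as a fuzzy match, not an exact one, is present in the OCR but would not be found by an ordinary search.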

Here is a comparison in image quality between a page shot at the scribe:

http://home.archive.org/bookview.php?identifier=palaeontology41561998pala&mode=book&leaf=0466

and the same page shot at the foldout station:

http://home.archive.org/bookview.php?identifier=palaeontology41561998pala&mode=book&leaf=0464

Here is another example of blurriness, this one unrelated to the “rotated tables” issue. This is a normal foldout that was shot according to normal foldout procedure. However, the text is so blurry that it is entirely unreadable, rendering the foldout image useless:

http://home.archive.org/bookview.php?identifier=palaeontol53419621963pala&mode=book&leaf=0179

In addition to these resolution/camera quality problems, there were some more perplexing OCR problems. I found several examples in which the OCR gave us nothing, not even gibberish. Whether or not this problem is related to the resolution/camera quality issue is up for debate.

In the following image, the table text does not seem to be significantly blurred, but the OCR is still unable to pick up anything but “Septal Spacing (per 5 mm)” from above the chart and, mysteriously, “PALAEONTOLOGY, VOLUME 41”, which only appears sideways in this image. This is a recurring issue and one of the most confusing parts of this experiment. It calls into question my very understanding of how OCR works.

http://www.biodiversitylibrary.org/ia/palaeontology41561998pala#page/464/mode/1up

In this instance, too, the table text doesn’t seem to be too small or blurry, but still OCR gives us nothing (and again, the sideways text is bafflingly present):

http://www.biodiversitylibrary.org/item/197382#page/88/mode/1up

Here is an example in which scientific names were picked up by OCR, but ONLY in the sideways version of the page:

http://www.biodiversitylibrary.org/ia/palaeontology41131998pala#page/390/mode/1up

The rotated version of the page gives us nothing:

http://www.biodiversitylibrary.org/ia/palaeontology41131998pala#page/393/mode/1up

In conclusion: I would recommend that we cease shooting any rotated images for the sake of OCR. The process is too unreliable. The second issue raised is whether or not our foldout station camera is of sufficient quality for even our normal foldout workflow. I would like to get a camera that is of comparable quality to the scribe cameras. If that is not possible, I would recommend that we change our standards for what warrants sending a foldout to Pennsy to be shot, especially when small text is involved.