NHM Internet Archive Work Document
Workflow (QA process) in use by IA at NHM London 20090203
Each day, the scanning centers review a set of books from the previous business day. The number of books to QA depends on the total number of books in the set.
books in set    9-15    16-25    26-50    51-90    91-150    151-280
number to QA    3       5        8        13       20        32
In general, QA should include either 2 books per scribe or 2 books per scanner. When new scanners are hired, QA should look at more books done by them (more experienced scanners can be excluded).
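The sample-size table above can be written as a small lookup. This is only a sketch: the names QA_SAMPLE_SIZES and books_to_qa are made up for illustration, and because the table does not say what to do for sets smaller than 9 or larger than 280 books, the function raises an error there instead of guessing.

```python
# QA sample sizes from the table above: (smallest set, largest set, books to QA).
QA_SAMPLE_SIZES = [
    (9, 15, 3),
    (16, 25, 5),
    (26, 50, 8),
    (51, 90, 13),
    (91, 150, 20),
    (151, 280, 32),
]


def books_to_qa(books_in_set):
    """Return how many books to QA for one day's set, per the table."""
    for low, high, sample in QA_SAMPLE_SIZES:
        if low <= books_in_set <= high:
            return sample
    # Assumption: the table is silent outside 9-280 books, so refuse to guess.
    raise ValueError("no sample size defined for a set of %d books" % books_in_set)


print(books_to_qa(120))  # a set of 120 books falls in the 91-150 band
```

The per-scribe or per-scanner rule still applies on top of this number, so treat the result as the starting point for choosing the sample.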
Procedure
1. Open the Meta-manager URL in a Firefox browser window.
UOFT:
__http://www.us.archive.org/metamgr?uoft__
RICH:
__http://www.us.archive.org/metamgr?rich__
IALA:
__http://www.us.archive.org/metamgr?la__
LOND:
__http://www.us.archive.org/metamgr?lond__
NYC:
__http://www.us.archive.org/metamgr?nyc__
ILL:
__http://www.us.archive.org/metamgr?ill__
BOSTON:
__http://www.us.archive.org/metamgr?boston__
WASHINGTONDC/SMITHSONIAN:
__http://www.us.archive.org/metamgr?washingtonDC__
CAPITOLHILL/LC:
__http://www.us.archive.org/metamgr?capitolhill__
MARYLAND/JHU:
__http://www.us.archive.org/metamgr?maryland__
CHAPELHILL/UNC:
__http://www.us.archive.org/metamgr?chapelhill__
INDIANA/FORTWAYNE:
__http://www.us.archive.org/metamgr?indiana__
NJ/PRINCETON:
__http://www.us.archive.org/metamgr?nj__
RALEIGH/NCSU:
__http://www.us.archive.org/metamgr?raleigh__
2. We inspect all the fields that our financial partners see in their sponsor-view of Meta-manager, plus a few others required for IA, so the following fields should be selected. (Using the links above will automatically load these fields into your view of Meta-manager.)
- date
- identifier
- title
- collection
- call_number
- collection_library (UC only)
- imagecount
- contributor
- sponsor
- posscopystatus
- scandate
- scanner
- scancenter
- operator
- curatestate
- curatenote
3. Filter the result set for the date you want to check:
In the "scandate" filter box, type the date in string format. For example, to filter for October 13, 2006, type 20061013*.
Click the "filter" button.
4. You should now see a result set of ~75 - 250 books, depending on how productive the scanners were that day. ;)
Click the "show all" link to view the results in one table.
5. Scan this table for obvious errors, anomalies, and gaps in the metadata.
- Missing title
- Wrong collection name
- Dates past 1923-01-01
- A missing local number in call_number field
- Wrong contributor
- Incorrect or missing collection_library (UC only)
- Wrong sponsor
- Missing or duplicate copyright evidence
Correct what you can in Item Manager using the Modify_XML tool, and mark that book with the appropriate error code.
6. Check the posscopystatus fields. Make sure that every book is marked NOT_IN_COPYRIGHT (unless there is some unusual circumstance where UNCLEAR or IN_COPYRIGHT books have been allowed). If copyright is UNCLEAR or does not appear, please report this to BooksQA.
7. Sort on the scanner or operator columns (by clicking on the column title) to choose a sample set, e.g., choose two books from each scribe, or two books from each scanner, etc.
8. Click on a bookid URL to launch the book's details page.
9. Verify that all the access formats (djvu, pdf, flip book) are available.
10. Note the posscopystatus and date values that are displayed on the details page, as well as title and author.
11. Open the flipbook and verify that the copyright information matches the report on the details page, and the bibliographic data matches the book.
12. Check every page of the flip book, looking for cropped text, washed out text, blurred text, missing pages, double-scanned spreads, and any other problems that affect readability. (Make a note of all errors found so they can be coded into Meta-manager.)
13. Return to the book's details page. Click on BookView and open the WebBook.
14. Verify that the following pages have been asserted. These will be indicated by words in black type next to the blue hyperlink page number.
e.g., 0007 Title page.
- front/back cover
- title page
- copyright page
- first page of table of contents
- all numbered pages
- two color cards
- two white cards
15. Use error code numbers for books (see below) to record errors in Meta-manager.
- if a book is ok, with no errors noted, mark it 'approved' with code 199
- if a book has minor errors, mark it 'approved' with appropriate codes. Use as many codes as necessary, separated by spaces, to indicate the issues with the book.
- for any freeze or dark errors, mark the book 'freeze' with the appropriate codes. Use as many codes as necessary, separated by spaces, to indicate the issues with the book.
- if you know the book will be rescanned, mark it freeze/195.
16. If you come across a book that would be a good spotlight / display book (prominent author, lots of illustrations, perfectly scanned) mark it with code 198.
17. Create a QA report. From the Meta-manager page:
- click "select columns"
- make sure 'scancenter' and 'curatedate' are checked
- press submit
- put the desired 'scancenter' in its filter box
- put the desired 'scandate' in its filter box
- remember that you can use * as a wildcard
- press the "filter" button
- if this is the result set you want, press "qa report"
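Steps 3, 5 and 6 of the procedure can be sketched in code, for anyone who wants to pre-screen an exported result table before reading it by eye. Everything here is an assumption made for illustration: Meta-manager is used through the browser, the record is a hypothetical dict keyed by the field names from step 2, the date field is assumed to start with a four-digit year, and the sample values are invented.

```python
from datetime import date


def scandate_filter(day):
    """Step 3: build the scandate filter string, with * as the wildcard."""
    return day.strftime("%Y%m%d") + "*"


def metadata_problems(record, expected):
    """Steps 5-6: list the obvious problems in one row of the result table.

    record   -- hypothetical dict keyed by Meta-manager field names
    expected -- expected values for collection, contributor and/or sponsor
    """
    problems = []
    if not record.get("title"):
        problems.append("missing title")
    # Only catches an empty field; a missing local number still needs a human.
    if not record.get("call_number"):
        problems.append("empty call_number")
    for field in ("collection", "contributor", "sponsor"):
        if field in expected and record.get(field) != expected[field]:
            problems.append("wrong " + field)
    # Assumption: the date field starts with a four-digit year.
    year = str(record.get("date", ""))[:4]
    if year.isdigit() and int(year) >= 1923:
        problems.append("date is 1923 or later")
    if "NOT_IN_COPYRIGHT" not in str(record.get("posscopystatus", "")):
        problems.append("posscopystatus is not NOT_IN_COPYRIGHT")
    return problems


# Invented sample row, for illustration only.
row = {
    "title": "",
    "call_number": "ABC-1234",
    "collection": "toronto",
    "date": "1925",
    "posscopystatus": "UNCLEAR",
}
print(scandate_filter(date(2006, 10, 13)))
for problem in metadata_problems(row, {"collection": "toronto"}):
    print(problem)
```

A check like this only flags candidates. The page-by-page review in steps 8 to 14 still has to be done in the flip book and the WebBook.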
Codes
see CurateCodeHistory.
DARK CODES.
-- Formats --
Books with errors 101-110 can be darked in the scanning centers.
- Make sure your QA staff know to check the task history when they suspect an access format is missing -- it might just not be finished deriving yet.
101 Test book
102 DjVu is missing or corrupt
103 PDF is missing or corrupt
104 Flip book is missing or corrupt
105 Text file is missing or corrupt (gutenberg)
106 Orphan bookstub that was never scanned
107 Yearbook scanned & downloaded by DDO
108
109 Item's condition makes it unscannable
FREEZE CODES.
-- Uploading or piping problems --
110 Truncated file(s)
111 Book uploaded from scribe before completed. Incomplete Scan
112 Missing files(s)
113 Cr2.tar file is malformed
114 Cameras assigned incorrectly
115 - 119 not used
-- Metadata --
120
121
123 Possibly not in public domain. Can be darked by QA, loaders, or coordinators.
124 Removed by request of copyright holder or library. Can be darked by QA, loaders, or coordinators.
125 - 129 not used
-- Images --
130 Cropped text
131 Blurred page(s)
132 Missing page(s)
133 Front/Back cover missing
134 White streak in scan that obscures text
135 Book was scanned twice; this copy to be darkened
136 Text is washed out or overly dark --This should be used when the lighting is so bad that it affects human readability and/or OCR-ability. Books with this code will be dark'd.
137 Evidence of scanner (fingers/shadows/etc) visible on page
138 Glass not centered in gutter; text is distorted or cropped
139 Foldout scanned as a normal page --More specifically, the foldout was scanned folded up rather than opened out.
140 Book and metadata do not match --Books with this problem should be fixed immediately in the scanning center (i.e., "post-Biblio'd", or "post-Metafetched" for you old-timers). The 140 code should only be used when something prevents this from being done right away, i.e., as a flag to fix the problem later.
141 Call Number is missing or incorrect*
142 Tissue pages marked incorrectly
143 Anomaly in image format is under investigation
144 Left/right pages are reversed
145 - 149 not used
INFORMATIONAL CODES
These elements do not prevent a book from being approved, but are helpful in improving the process.
150 Bibliographic data missing
151 Bookplate or watermark missing or corrupt*
152 Copyright evidence was reported incorrectly*
153 Bibliographic record from library is truncated
154 Possible error in bibliographic record from library
155 Foreign language character encoding is incorrect
156 - 159 not used
160 Light/dark pages (intermittent)
161 Light/dark pages (throughout)
162 Pages skewed
163 Color cards show in access formats
164 White cards show in access formats
165 n/a
166 Image of cradle is visible at front or back
167 Different crop-box sizes in same spread
168 Bad crop at page edges/gutter
169 Duplicate page spreads scanned
170 Page types not marked or marked incorrectly
171 Title page not marked b/c book does not have title page
172 Scan factors not marked or noted
173 - 194 not used
195 This book will be rescanned -- it should not be darked. Use with FREEZE.
197 This book was checked out and gutted
198 This would be a good display book
199 Approved with no problems noted
201 Google quality problems
2009 Un-dark in 2009
2010 Un-dark in 2010
2011 Un-dark in 2011
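The rules in step 15 of the procedure, combined with the code ranges listed above, can be sketched as a small helper. This is a rough illustration and not part of Meta-manager: the function names are hypothetical, and the ranges follow the section headings (101-109 dark, 110-149 freeze, 150-199 informational). That is an assumption, since the note about codes 101-110 places 110 with the dark codes.

```python
def code_category(code):
    """Map a curate code to the section of the code list it falls under."""
    if 101 <= code <= 109:
        return "dark"
    if 110 <= code <= 149:
        return "freeze"
    if 150 <= code <= 199:
        return "informational"
    return "other"  # e.g. 201 or the un-dark year codes


def curate_decision(codes):
    """Return (curatestate, code string) following step 15 of the procedure."""
    if not codes:
        # No errors noted: approved with code 199.
        return "approved", "199"
    categories = {code_category(code) for code in codes}
    # Any dark or freeze error, or a planned rescan (195), freezes the book.
    if 195 in codes or categories & {"dark", "freeze"}:
        state = "freeze"
    else:
        state = "approved"
    # Codes are recorded separated by spaces.
    return state, " ".join(str(code) for code in sorted(codes))


print(curate_decision([]))
print(curate_decision([162, 130]))
print(curate_decision([160, 198]))
```

Codes outside 101-199 are passed through unchanged and do not affect the state in this sketch.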
Publication dates
When entering copyright / publication dates into Biblio, follow these general guidelines.
- First choice of date is the year following the word copyright (or the copyright symbol). This is usually found on the reverse of the title page, which we refer to as the "copyright page."
- If there are multiple copyright dates, use the most recent one.
e.g., "Copyright 1894, 1897" -- use 1897.
- If the copyright word or symbol is not present on the copyright page, but a date is, use that date. Sometimes this will follow the phrase: "Entered according to Act of Congress..."
- If there are no dates on the copyright page, use the date (if present) from the title page.
- If no dates are found on the copyright page or the title page, leave the date field blank in Biblio.
- If any of these dates are 1923 or later, do not scan the book.
Examples of possible scenarios:
1. the word copyright is present, but no date:
- set the visible notice of copyright to "yes"
- set year to /blank/ (this will generate the "exact pub date unknown" wording in posscopystatus)
- on the final Biblio page, double-check the date field from the MARC record to make sure it's pre-1923
2. no "copyright" word or symbol and no date:
- set the visible notice of copyright to "no"
- set year to /blank/ (this will generate the "exact pub date unknown" wording in posscopystatus)
- on the final Biblio page, double-check the date field from the MARC record to make sure it's pre-1923
3. no "copyright" word or symbol and publication date of 1901 on title page:
- set the visible notice of copyright to "no"
- enter date as 1901.
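The order of preference for dates described above can be written out as a short function, which may help when training new staff. This is a sketch under assumptions: Biblio is a form filled in by hand, and the function and parameter names below are hypothetical.

```python
def choose_publication_year(copyright_years=(), other_copyright_page_year=None,
                            title_page_year=None):
    """Pick the year to enter in Biblio, or None to leave the field blank.

    copyright_years           -- years following the word or symbol "copyright"
    other_copyright_page_year -- a date on the copyright page with no
                                 copyright word or symbol next to it
    title_page_year           -- a date printed on the title page
    """
    if copyright_years:
        return max(copyright_years)  # multiple dates: use the most recent
    if other_copyright_page_year is not None:
        return other_copyright_page_year
    return title_page_year  # None when no date was found anywhere


def ok_to_scan(*years):
    """False if any date found in the book is 1923 or later."""
    # A blank date (None) does not block scanning by itself in this sketch;
    # the MARC date still has to be double-checked on the final Biblio page.
    return all(year < 1923 for year in years if year is not None)


print(choose_publication_year(copyright_years=[1894, 1897]))
print(choose_publication_year(title_page_year=1901))
print(ok_to_scan(1919, 1956))
```

The first call matches the "Copyright 1894, 1897" example and the second matches scenario 3. The third reflects the rule that a single date of 1923 or later is enough to stop the scan.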
Reprints
My understanding of the rules regarding reprints is this:
If the book is a simple reprint with no edits or additions, it is ok to scan. These are usually books that were deteriorated or only available on microfilm and reproduced on paper for preservation purposes by the library.
If the book was reprinted after 1923 with additional material (e.g., foreword, illustrations), it is not ok to scan. This will usually be noted on the copyright page. e.g.,
'Copyright 1919. Illustrations copyright 1956 by E.H. Shepard'
'Reprinted 1956 with additional material by A.J. Fowler'