SIL Conversation with JSTOR

9/23/05
Notes from the conversation with JSTOR’s Director of Production, John Kiplinger.

Production staffing:

23 positions
- Handle the pre-digital preparation of paper material
- Quality control of post-digital production
Metadata collection and OCR are contracted out (offshore vendors)
Production Librarians
- 3 to 4 positions with MLS degrees
  - Title History Record
  - Research title background for the Title History Record
    - What belongs with a title
      - Reviews LC records / NLM records
      - For non-US titles, research the country's national library
      - ISSN Online for more information than is in normal catalog records
    - Previous titles
    - Related titles
    - Documents relations
    - Definitive title
      - Beginning article
      - ISSN of electronic and/or paper
      - Previous title ISSN, range of volumes and dates, etc. Information from catalog records
  - Data is put in a local database based on FileMaker Pro but customized
  - Re-key of all data (not harvesting MARC or other metadata from researched sources)
  - JSTOR is actively trying to update this system of workflow
  - Number of pages to be scanned is estimated by formula: shelf space (in inches) multiplied by 340 gives a fairly accurate number. This physical extent is included in the title record.
  - Review titles, spot-checking for the “type” of material included:
    - Heavy use of images
    - Non-English
    - Non-ASCII
    - Kinds of articles
      - Newsletters
      - Bibliographies
      - Research Articles (more of what they are looking for)
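
The page estimate noted above (shelf space in inches multiplied by 340) can be sketched as a one-line calculation. This is only an illustration: the function name and the 36-inch example are assumptions, and only the multiplier of 340 comes from the conversation.

```python
PAGES_PER_SHELF_INCH = 340  # multiplier quoted in the notes


def estimate_pages(shelf_inches: float) -> int:
    """Estimate the page count of a journal run from its shelf space in inches."""
    if shelf_inches < 0:
        raise ValueError("shelf space cannot be negative")
    return round(shelf_inches * PAGES_PER_SHELF_INCH)


# Hypothetical example: a run that occupies 36 inches of shelf.
print(estimate_pages(36))  # 12240
```

The result is the "physical extent" figure that would be recorded in the title record.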

Title is then sent to other staff who use Title History Record data to contact publishers to arrange for inclusion in the JSTOR database. Titles are generally not put into production unless JSTOR has a complete run.

The Title History Record is the basis for the acquisitions record and provides information for planning the production workload against the monthly quota (250,000 pages per month).
(Production Librarians' duties, continued)

Create for contractor general Metadata/Indexing Guidelines for each journal title
Create for contractor specific guidelines for each journal run (reviewed issue by issue! N.b. most time spent on back-matter, indexes, stray articles)
- Note specific patterns and layout
- Note exceptions and peculiarities and give instructions

Guidelines are in a Word document that can range from 4 or 5 pages up to 70 or 80 pages, with examples

Technicians
- 8 Staff positions
- Collation – data kept in yet another database
  - Page by page review
  - Recording page numbering
  - Looking for digitizing problem pages
- Scanning specifications guidelines
  - Word document that highlights specs for scanning by journal run

Communication with vendor via shared Intranet to which the documents are posted.
Digital scanners can give feedback on problems they may have as they scan page by page. (Alerting JSTOR to find replacement pages, etc.)

Revising/Reviewing the return from the vendor
- First returns of example of the metadata/indexing files
- Datasets are grouped; when the full journal run has been returned, it is ready for authentication
- Loaded into system for verification
  - Automated - Checks for mandatory requirements
    - Metadata fields required
    - Scan resolution requirements
    - Other rules
  - Data rejected or accepted.
- Accepted data
  - Technicians sample 10% for review of page images
  - Follow specific criteria for acceptance.
  - If rejected, the data is sent back with notes; the vendor is asked to correct the specific problems and to review the entire journal run for other places where the error may have occurred
    - Returned corrections and a different sample are reviewed
- Scanning is accepted and payment is made
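
The review steps above (automated checks for mandatory requirements, then a 10% sample of page images for technicians) might be sketched as follows. This is a rough illustration, not JSTOR's system: the field names, the `PageRecord` type, and the 600 dpi threshold are all hypothetical, since the notes do not state the actual requirements.

```python
import random
from dataclasses import dataclass

# Hypothetical mandatory fields and resolution threshold (not given in the notes).
REQUIRED_FIELDS = {"journal_title", "issue_date", "page_label"}
MIN_DPI = 600


@dataclass
class PageRecord:
    metadata: dict
    dpi: int


def automated_check(pages, required=REQUIRED_FIELDS, min_dpi=MIN_DPI):
    """Return (page index, problem) tuples; an empty list means the data is accepted."""
    problems = []
    for i, page in enumerate(pages):
        missing = sorted(f for f in required if not page.metadata.get(f))
        if missing:
            problems.append((i, "missing fields: " + ", ".join(missing)))
        if page.dpi < min_dpi:
            problems.append((i, f"resolution {page.dpi} dpi is below {min_dpi} dpi"))
    return problems


def sample_for_review(pages, fraction=0.10, seed=None):
    """Pick a random sample of page indices (at least one) for manual review."""
    if not pages:
        return []
    k = max(1, round(len(pages) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(pages)), k))
```

Calling `sample_for_review` again with a different `seed` draws a different sample, which corresponds to reviewing a different sampling after the vendor returns corrections.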

In-house automation steps that are about to be contracted / outsourced
- OCR
- Tagging for accurate search results (automated)

Metadata review
- Review each issue (note: not comparing metadata to page image contents)
- Look for special exceptions to see if they are correct

Metadata deliverable: flat file with the Tulip EFFECT tagging. Use of basic text editors. Review and make corrections where needed. Errors spotted by users corrected within 24 hours.

Draft cataloging record for OCLC
- With the help of U. Michigan staff.
- If aggregate neutral record found, add to the record with information from Title History Record
- Create original record if need be based on paper version.

Specific questions answered in email:

What is your method of assigning a unique identifier when there is no ISSN?

It is not unusual for JSTOR to work with a previous title that is old enough not to have been assigned an ISSN. When this happens, JSTOR applies to the relevant national ISSN center for an ISSN assignment. In particular, we have found the US and UK centers to be very receptive to our questions and applications. If a currently publishing title does not have an ISSN, then we work with the publisher to get an ISSN assignment.

How do you link journal titles that have split, merged, etc.?

Currently, JSTOR has an electronic form that allows staff to input a limited amount of information regarding title relationships which is then pushed out to the public interface. We are able to handle "continues/continued by" and "absorbs/absorbed by" relationships, but, although we archive titles with cataloged relationships such as "supplement/supplemented by", "companion publication to", "continues in part", "merges with Y to form Z / created by the union of X and Y", etc., our current system/public interface is limited to showing the two aforementioned relationships. We hope to correct this as part of an overall systems rebuild that we are currently undergoing.

What is your numbering method for journals that have supplemental material published out of sequence, etc.? How does this affect the structure of your journal set?

I believe that this has happened, but only rarely in the titles that JSTOR has archived. We try to display journal issues in numerical order according to their designations. For example:

Vol. 20 (1995)
Issue 1, Jan. 1995
Issue 2, June 1995
Issue 3, April 1995
Issue 4, Sept. 1995

For issues that are supplementary to a specific numbered or dated issue, we display the supplementary issue immediately following the supplemented issue.

For issues that do not have numbering and are just labeled "supplement" or something similar, we try to place them in the volume according to their date. For example:

Vol. 20 (1995)
Issue 1, Jan. 1995
Issue 2, April 1995
Supplement, May 1995
Issue 3, June 1995
Issue 4, Sept. 1995

We have not yet run into a case where the numbered issues were out of date order and there was an unnumbered supplementary issue in the volume.
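
The ordering rules described in this answer can be sketched roughly as follows: numbered issues are shown in numerical order, and each unnumbered supplement is placed before the first numbered issue with a later date. The `Issue` type and `display_order` function are hypothetical names, the day of the month is assumed to be 1, and supplements tied to a specific issue are not modeled. How the untested combined case (numbered issues out of date order plus an unnumbered supplement) should behave is not stated, so the sketch's handling of it is only one possible choice.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Issue:
    label: str
    published: date               # day of month assumed to be 1 when unknown
    number: Optional[int] = None  # None marks an unnumbered supplement


def display_order(issues: List[Issue]) -> List[Issue]:
    """Numbered issues in numerical order, unnumbered supplements slotted in by date."""
    numbered = sorted((i for i in issues if i.number is not None), key=lambda i: i.number)
    supplements = sorted((i for i in issues if i.number is None), key=lambda i: i.published)
    ordered = list(numbered)
    for supp in supplements:
        # Insert before the first numbered issue dated later than the supplement.
        pos = len(ordered)
        for idx, item in enumerate(ordered):
            if item.number is not None and item.published > supp.published:
                pos = idx
                break
        ordered.insert(pos, supp)
    return ordered


volume_20 = [
    Issue("Issue 3, June 1995", date(1995, 6, 1), 3),
    Issue("Supplement, May 1995", date(1995, 5, 1)),
    Issue("Issue 1, Jan. 1995", date(1995, 1, 1), 1),
    Issue("Issue 4, Sept. 1995", date(1995, 9, 1), 4),
    Issue("Issue 2, April 1995", date(1995, 4, 1), 2),
]
for issue in display_order(volume_20):
    print(issue.label)
```

Run on the second example volume, this prints the issues in the same sequence as the listing above, with the May supplement between Issue 2 and Issue 3.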

If you are also referring to additional content to a specific issue that is published subsequent to the rest of the issue (e.g., microforms, computer disks, corrected pages), JSTOR will add the additional content to the end of the relevant article(s) if it is in a format that we can support. We can digitize microforms, but we can't yet archive content from CD-ROMs, for example. We may also place a JSTOR-created informational "insert" at the beginning of the article explaining what the additional content is.

How do you format the date information, especially with journals that differ in their date designations (i.e. Spring 1845, 1st quarter, etc.)?

JSTOR uses a numerical coding system for issue dates. Dates are entered as YYYYMMDD such that Jan. 1, 2006 would be captured as 20060101. Month codes can be replaced with two-digit codes for seasons and quarters. We can also accommodate date ranges. We've also compiled a set of digital tables that allow us to replace standard English terms with their non-English equivalents (e.g., January = janvier).
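
A minimal sketch of this kind of date coding is shown below. Only the YYYYMMDD pattern and the 20060101 example come from the answer above. The specific season and quarter codes, the zero-filling of unknown parts, and the function name are assumptions made for illustration; JSTOR's actual code values are not given.

```python
# Hypothetical two-digit codes that stand in for the month.
# They are chosen only so that they cannot collide with months 01-12.
PERIOD_CODES = {
    "spring": "21", "summer": "22", "autumn": "23", "winter": "24",
    "q1": "31", "q2": "32", "q3": "33", "q4": "34",
}


def encode_issue_date(year, month=None, day=None, period=None):
    """Return an 8-digit YYYYMMDD-style code; unknown parts are zero-filled (an assumption)."""
    if period is not None:
        middle = PERIOD_CODES[period.lower()]
    elif month is not None:
        if not 1 <= month <= 12:
            raise ValueError("month must be between 1 and 12")
        middle = f"{month:02d}"
    else:
        middle = "00"
    dd = f"{day:02d}" if day is not None else "00"
    return f"{year:04d}{middle}{dd}"


print(encode_issue_date(2006, 1, 1))             # 20060101
print(encode_issue_date(1845, period="spring"))  # 18452100
```

Keeping every date at a fixed width of eight digits means the codes sort chronologically as plain strings, at least within months; where seasons and quarters fall in that sort depends on the code values chosen.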

In at least one field (_jn Journal Name) you mention the Journal-Specific Indexing Guidelines. Is this something that you can share?

Unfortunately I cannot share this documentation, but I am happy to answer specific questions and try to relate my answers to how this documentation was compiled, formatted and used.

The _xt (Extra Title field) is not repeatable. How do you handle multiple parallel titles and alternative titles?

Are you referring to multiple journal titles or issue titles?

Happily, JSTOR has not yet encountered journal issues that are a part of multiple serial titles. As a former serials recorder, I am well aware of this situation. If, in the future, we could not capture more than one _jn field, we would probably go with the journal title with which we had negotiated the licensing agreement. Of course, for journals with multiple serial titles, we'd need to make sure that there weren't any outstanding rights issues before digitizing and displaying the content.

If you're referring to issue titles, we would most likely try to fit it all into the single _xt field either through the use of an "=" sign or through the use of "//" to denote a separation between two sections of the data in the _xt field.
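
The separator convention mentioned here ("=" or "//" inside a single _xt field) could be handled along these lines. This is a hedged sketch with hypothetical function names and sample titles, and it assumes neither separator occurs inside a title itself; a title containing a literal "=" or "//" would be split incorrectly.

```python
import re


def split_extra_title(xt_value):
    """Split an _xt value on '=' or '//' and trim whitespace around each part."""
    parts = re.split(r"\s*(?://|=)\s*", xt_value.strip())
    return [p for p in parts if p]


def join_extra_titles(titles, separator=" = "):
    """Pack several issue titles into one _xt value using the given separator."""
    return separator.join(t.strip() for t in titles)


print(split_extra_title("Annual Report = Rapport annuel"))
print(join_extra_titles(["Part A", "Part B"], separator=" // "))
```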

Have you ever needed a level beyond the _t4?

Not yet, but although the _t1-_t4 levels of metadata correspond with fairly standard ways of categorizing journal levels (i.e., journal, issue, article, subarticle), they are not hard and fast. We could potentially divide the presently captured metadata into additional levels. It can be arbitrary. Are reference citations at the same level as illustration captions, or is there a difference? As we move into XML, the dividing lines between some of these levels will become blurred or will express themselves in different ways.

Exploring XML and a move away from the Tulip EFFECT schema

Reviewing DTDs and schemas; they believe they will have to adapt something to fit their needs
Goal: data that is easier to share within the organization and outside JSTOR (with publishers and other partners)
More nimble, with easier ways to update, revise, etc. the schemas. Currently, they can add a field only with extreme difficulty.
Basic changes for better use of data, the prime example being the ability to separate author names (first name from last name, etc.)
NLM
- NLM Metadata Schema – archive tagging but with a focus on the hard sciences with things that JSTOR does not need
LC
CrossRef DTD – citation software
Looking at tag elements
- What to cover – what is covered – what is not covered
- What is different from what JSTOR has now
- How well does the current information transfer/translate
- Looking for others that have already done this
Currently not capturing any preservation or technical specs

Tool sets being developed by a vendor

Tools for review and editing
Tools for sampling
Tools for manipulating

JSTOR - Divisions

User Services – searching
Technology
IT Staff
Legal
Library Relations
Public Relations