Serials Analysis
FINAL DATA SET: Rod Pages Data Analyzed_Final.xlsx
SUMMARY/GOAL STATEMENT:
Any time there is a list of titles we always need to ask ourselves:
- Is the title in BHL?
- If so, is it complete?
- If not, can we scan it? If so:
  - Does a member library that is actively scanning hold the title?
  - Do we need permission before we can proceed?
The Collections Committee performed some serials analysis in the past. Rod Page has performed his own analysis of zoological serials.
Based on initial analysis (Rounds 1 & 2 below) we were able to determine that 249 titles overlap between our lists and Rod's. There were ~1875 titles in Rod's list that were not in our lists.
Additional rounds of analysis are needed (see Rounds 3-5 below) to normalize the data so that we can answer the first question above (is the title in BHL?) more fully. The biggest problem with the data sets was that they had only one field in common: TITLE. To remedy this, we need to get ISSN information for the Collections Committee list. We will try using the SHERPA/RoMEO API to accomplish this. We will then run all of the titles against OCLC's xISSN API to get OCLC numbers, publisher, and chronology information. From there, we will use the list to query the BHL corpus and get BHL ID responses. This will give us a list of titles that do not have a presence in BHL.
Notes from March 2015
1) This data is accurate as of the date it was pulled, which was around late summer 2014. More titles have probably been scanned since then, which would alter the figures in the report.
2) This last stat is more of an approximation:
"Of the 106 (7%) titles that are pre-1923 on Rod's list, 57 have not been digitized for BHL."
This is because the chronology data comes from the MARC 362 field. I was parsing values that looked like “Vol. 7, no. 1 (Feb. 23, 1854-Mar. 16, 1854)-v. 75.” or “Began with Feb. 1862 issue.” This data is very non-standard, so it was hard to tease out all of the dates and get an accurate count. I would err on the high side for the number of pre-1923 titles. We can discuss the reasoning on the call today if you like, but a close look at the data shows what I mean.
So, for presentation purposes, it might be safer to say something like: “We found that approximately 10% of the titles on Rod Page's list were pre-1923 and approximately 5% had no BHL presence, making them good scanning candidates.”
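The chronology parsing described in note 2 can be sketched roughly as follows. This is a minimal illustration, not the method actually used for the report: it assumes the earliest four-digit year in the 362 string is the start date, and the function names and the accepted year range are made up for the example. Strings with no recognizable year are reported as unknown rather than guessed, which is one reason the pre-1923 count should be treated as a lower bound.

```python
import re

# Four-digit years from 1500 through 2099. The range is an assumption
# chosen for this sketch, not something taken from the original analysis.
YEAR_RE = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")

def earliest_year(chronology):
    """Return the earliest four-digit year in a MARC 362 string, or None."""
    years = [int(y) for y in YEAR_RE.findall(chronology or "")]
    return min(years) if years else None

def is_pre_1923(chronology):
    """True or False when a year is found; None when no year can be parsed."""
    year = earliest_year(chronology)
    return None if year is None else year < 1923

samples = [
    "Vol. 7, no. 1 (Feb. 23, 1854-Mar. 16, 1854)-v. 75.",
    "Began with Feb. 1862 issue.",
    "v. 1-",  # no date at all: stays unknown
]
for text in samples:
    print(earliest_year(text), is_pre_1923(text), text)
```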
Original Data Sets
Rod Page List.xlsx
ZooPre23.xlsx
IPNIpre1923titles.xlsx
combined through-23 zoology priority titles.xls
combined post-23 zoology priority titles.xls
Round 1: Combining data sources and comparing them - Staff call 12/2/13
1) Combined the Collections Committee spreadsheets & Rod Page’s list to form one master list.
2) Tagged each list’s source for easy comparison.
3) Highlighted dupes by title.
4) Sorted the list alphabetically and looked for matches between our list(s) and Rod's list using the highlighted duplicates.
5) DUPE COUNT: 34 overlapping titles.
CombinedMasterList.xlsx
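The Round 1 steps were done in Excel, but the same combine, tag, and compare logic can be sketched in a few lines of Python. The source labels and sample titles below are placeholders, and the comparison is an exact string match, which is the same limitation the Excel dupe highlighting had.

```python
from collections import defaultdict

def combine(lists_by_source):
    """Merge several title lists into one master list of (title, source) rows."""
    return [(title, source)
            for source, titles in lists_by_source.items()
            for title in titles]

def overlapping_titles(master, source_a, source_b):
    """Titles that appear under both sources (exact string match only)."""
    sources_for = defaultdict(set)
    for title, source in master:
        sources_for[title].add(source)
    return sorted(title for title, sources in sources_for.items()
                  if source_a in sources and source_b in sources)

# Placeholder data; these are not titles from the real lists.
master = combine({
    "committee": ["Journal A", "Bulletin B"],
    "rod_page": ["Journal A", "Annals C"],
})
print(overlapping_titles(master, "committee", "rod_page"))
```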
Round 2: Identifying data integrity problems - Staff call 1/27/14
- Problem: Diacritics that didn’t encode properly.
- Problem: There might be alternate forms of the same titles.
- Problem: Missing title data for 100+ ISSNs.
- *These are all problems that would mess up Excel’s dupe highlighting function.
- Problem: Disparate fields from data sources. Even though we had 16 fields across the aggregate, the only field that all the data had in common was TITLE. It would cut down BHL workflow time substantially if we had fields for ISSN, OCLC #, Chronology, and BHL title ID (if scanned) for all the data.
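One way to reduce false mismatches from diacritics and punctuation is to compare a normalized key instead of the raw title. The sketch below is an assumption about how that could be done, not what was done in Round 2; `normalize_title` is a hypothetical helper. It only handles diacritics that are correctly encoded, so mis-encoded characters would still need to be repaired first, and true alternate forms of a title would still need manual review.

```python
import re
import unicodedata

def normalize_title(title):
    """Build a comparison key: no diacritics, no punctuation, lowercase."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces and collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.casefold())
    return " ".join(cleaned.split())

print(normalize_title("Revista Brasileira de Zoología."))
print(normalize_title("  Zoölogische   Mededelingen "))
```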
Round 3: Clean data, get more useful data.
1) Went through quickly and cleaned the data semi-manually. Probably missed some things, since this was done by a human (JJ).
- NEW DUPE COUNT: 249 overlapping titles; ~1875 titles in Rod's list that were not in our list(s).
2) Got most of the missing title data using OCLC's xISSN API and OpenRefine. Merged the missing title data into our master list.
API: http://oclc.org/developer/documentation/xissn/using-api
Missing ISSN.xlsx
3) Realized that OCLC's xISSN API also gives us OCLC #, chronology information, title information sans diacritics, and publisher info. NICE! (See the XML dump column for the data that was parsed.)
Diacritic-Clean-Get-ISSN-data.xlsx
Round 4: Figure out which titles in Rod Page's list are in the BHL already
Steps required:
1) Collapse duplicates in Rod Page's original list X
2) Append the data with OCLC fields:
OCLC title, OCLC #, Chronology, Publisher X
3) Append the data with BHL fields to show which titles are in the BHL: Title ID X
4) Create filterable spreadsheet X. Results: data below
Rod Pages Data Analyzed_Final.xlsx
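The four Round 4 steps can be sketched as a small script. This is only an illustration under stated assumptions: the column names, the choice of ISSN as the join key, and the two lookup dictionaries (standing in for the xISSN and BHL results) are all hypothetical, and the sample values are placeholders rather than real data.

```python
import csv
import io

# Column names are illustrative, not the headers of the actual spreadsheet.
FIELDS = ["title", "issn", "oclc_title", "oclc_number",
          "chronology", "publisher", "bhl_title_id"]

def collapse_duplicates(rows):
    """Step 1: keep the first row seen for each ISSN (or title if no ISSN)."""
    seen, unique = set(), []
    for row in rows:
        key = row.get("issn") or row["title"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def append_fields(rows, oclc_by_issn, bhl_id_by_issn):
    """Steps 2-3: add OCLC fields and the BHL title ID (blank if not in BHL)."""
    enriched = []
    for row in rows:
        issn = row.get("issn", "")
        merged = dict.fromkeys(FIELDS, "")
        merged.update(row)
        merged.update(oclc_by_issn.get(issn, {}))
        merged["bhl_title_id"] = bhl_id_by_issn.get(issn, "")
        enriched.append(merged)
    return enriched

def to_csv(rows):
    """Step 4: write a CSV that can be opened and filtered in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder rows and lookups; none of these values are real.
rows = [
    {"title": "Example Journal A", "issn": "1111-1111"},
    {"title": "Example Journal A", "issn": "1111-1111"},
    {"title": "Example Bulletin B", "issn": "2222-2222"},
]
oclc_by_issn = {"1111-1111": {"oclc_number": "000001", "publisher": "Example Press"}}
bhl_id_by_issn = {"1111-1111": "9999"}

table = append_fields(collapse_duplicates(rows), oclc_by_issn, bhl_id_by_issn)
not_in_bhl = [row["title"] for row in table if not row["bhl_title_id"]]
print(to_csv(table))
print("No BHL presence:", not_in_bhl)
```

Filtering on a blank `bhl_title_id` column gives the list of titles with no BHL presence, which is the output the goal statement asks for.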
Round 5: Figure out which titles in the Collections Committee lists are in the BHL already (if possible)
Steps required:
1a) JJ to try getting BHL IDs using a fuzzy title search query. Results: data below.
1) Get ISSNs for the Collections Committee lists (William, or someone else/a developer? A CLIR fellow?)
2) Append the data with OCLC fields:
OCLC title, OCLC #, Chronology, Publisher
3) Append the data with BHL fields to show which titles are in the BHL: Title ID
4) Create a filterable spreadsheet combining all the data and use it to assign Gemini issues.
BHLlistsw_BHLpresense.xlsx
The fuzzy title search (step 1a) produced a data set that matched titles to the correct BHL ID about 50% of the time, based on a quick-and-dirty random sample of 10.
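The notes do not record how the fuzzy title search was run, so the following is only a stand-in that uses Python's standard `difflib` module to show the general idea, plus the arithmetic behind a hand-checked accuracy figure. The function names, the 0.85 cutoff, and the sample titles and IDs are all assumptions made for the example. A stricter cutoff trades fewer wrong matches for more titles left unmatched.

```python
import difflib

def fuzzy_bhl_id(title, bhl_titles, cutoff=0.85):
    """Return the BHL title ID of the closest title above `cutoff`, or None."""
    candidates = {t.casefold(): bhl_id for t, bhl_id in bhl_titles.items()}
    match = difflib.get_close_matches(
        title.casefold(), list(candidates), n=1, cutoff=cutoff)
    return candidates[match[0]] if match else None

def spot_check_accuracy(checked):
    """Share of hand-checked (predicted_id, correct_id) pairs that agree."""
    return sum(1 for predicted, correct in checked if predicted == correct) / len(checked)

# Placeholder BHL titles and IDs; not real BHL data.
bhl_titles = {
    "Annals of Example Zoology": 101,
    "Bulletin of Sample Botany": 202,
}
print(fuzzy_bhl_id("Annals of example zoology.", bhl_titles))
print(fuzzy_bhl_id("Completely Different", bhl_titles))
```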