TechCall_26june2017
Agenda
1. Slack check-in
2. Publication data quality (see message from Mike below)
3. Other updates / questions
From Mike, re: Publication (publisher name; publisher place) data quality in BHL
In the BHL database, the values for publisher name and publication place are taken as-is from MARC (260b and 260a, respectively). That means many of the values end up with non-alphanumeric trailing characters, and some include square brackets around the values. For example:
NAMES
W.B. Shaw Aquatic Gardens,
Derby, Miller & Co.,
G.M. Bacon,
PLACES
[Argentorati [Strasbourg] :
San Franciso and New York,
[Danville, Ill.] :
Spartanburg, S.C. :
In a call with Henning Scholz (now with Europeana, previously with BHL-Europe) I saw an example of how that extra non-alphanumeric stuff can cause problems.
The Name and Place values are highlighted in the attached screenshot from a Europeana preview site. The “Paris :” value in particular is problematic. Because of the extra “ :” characters, the place was not recognized as “Paris”, and therefore is not linked to other material with a Location of “Paris”. Presumably, the messiness of the data could cause similar problems in other places (not just Europeana).
On the other hand… to my knowledge, no one has ever made an issue of this before. There is no guarantee that the Europeana problem is also a problem in other places. In addition, Bianca was also on the call, and thought it was acceptable (if not ideal). Finally, I believe there are places in BHL where these fields are concatenated together, so losing the non-alphanumeric characters could have some adverse effects.
So, in the big picture, is this worth worrying about? Do we want to try cleaning up these values?
This came up during the BHL-DPLA mapping discussion too. DPLA concatenates the publisher name and the publisher place (with existing BHL punctuation) and the result is reasonably good for human consumption. It is not, however, good for geocoding. I'm starting to think that we need a field for human consumption and a field for programmatic use. Susan
NOTES
Server interruption - Seems to happen Saturdays. Joel will contact the security person to see if it can be switched to midweek.
OCR files still copying to new server. Rsync process. Probably 90% complete.
Slack app; desktop notifications; can also have on phone
Publisher data includes random punctuation from MARC records; never really cleaned up. When ingested into Europeana, place names not recognized as such because of that extra punctuation.
DPLA - cleans up for us. So they weren't too concerned.
Europeana doesn't though.
Joel has done cleanup for another project; 'regular expression' to cut off those extra spaces and colons
They're using APIs to grab it.
Does it make sense to clean up in API?
We use those values to concatenate for some displays, so that doesn't make sense for us
For BHL version 2.0, we may be doing more data cleanup
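For reference, a minimal sketch of the kind of regular-expression cleanup discussed above, in Python. This is not Joel's actual expression. The function name clean_marc_value, the set of trailing characters trimmed, and the bracket handling are all assumptions for illustration. The raw value is left untouched, so displays that concatenate the original fields could keep using it while the cleaned value is used for matching.

```python
import re

# Assumption: trailing whitespace, commas, colons, semicolons and slashes are
# MARC punctuation. Periods are kept, since they often end an abbreviation.
_TRAILING = re.compile(r"[\s,:;/]+$")

# Assumption: a value wrapped in exactly one pair of square brackets.
_WRAPPED = re.compile(r"^\[(.*)\]$")


def clean_marc_value(raw: str) -> str:
    """Return a copy of a 260a/260b value with trailing punctuation removed."""
    value = _TRAILING.sub("", raw.strip())
    match = _WRAPPED.match(value)
    # Unwrap only when the brackets are a single balanced outer pair;
    # anything nested or unbalanced is left for manual review.
    if match and "[" not in match.group(1) and "]" not in match.group(1):
        value = match.group(1).strip()
    return value


# Hypothetical usage: keep the raw value for display, use the cleaned one for lookups.
raw_place = "Paris :"
place_for_matching = clean_marc_value(raw_place)
```

With this sketch, a value with unbalanced brackets such as "[Argentorati [Strasbourg] :" only loses its trailing " :" and keeps its brackets, because guessing at the intended place name seems riskier than leaving it alone.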
2 servers, 1 invoice. Now what?
The 1 we ordered is installed. 12TB RAID 5; 11TB allocated to OCR and search index.
Macaw updates; staff list shortly.