DeDupe Group Page
Action Item: Small workgroup/task force to examine requirements for a de-duping tool and bid list. Report to Henning. Designated by the Executive Committee. Review Open Library's de-duping algorithms.
- Assigned: Bianca Lipscomb, Diane Rielinger, Matt Person, Bernard Scaife, Ryan Schenk, Joe deVeer; John Furfey to advise BHL Europe
- Date: April 30
Dedupe Group conf. call scheduled for Friday April 24 @ 10am EST | Agenda | Meeting Notes
Questions
4/2/09 - Questions from Diane R. (MBLWHOI)
MARC Leader. For MBLWHOI, we add 856s after volumes are scanned so our MARC leaders are changing all the time. The MARC leader that is uploaded to the bid list may not be the MARC leader that goes with our scanned volume. Will the MARC leader still be the "match of last resort"?
Bernard: Very important question. Originally, when building the bidlist, I asked the suppliers of the ILS metadata for unique IDs. We've never fully used these, but they would at least give good matching on volumes supplied by a single institution even where the MARC LDR has changed.
Would adding MARC 780s to the serials bid list help?
OCLC has an interface that reveals historical ISSN information for a title. Inputting an ISSN reveals the title's ISSN family tree (there are sample ISSNs to play with)
http://worldcat.org/xissn/titlehistory and
this demo interface reveals associated metadata for a title
http://xissn.worldcat.org/xissndemo/index.htm. Nice visualization tools. I know there is a notes field now so people can write down some of their research on a title - could this be expanded a bit to "group" related items more clearly? Could we take advantage of some of this OCLC data for titles that have ISSNs? (I realize that many of our older titles do not have ISSNs.) (Many thanks to Matt P. for the heads-up on the ISSN tool.)
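Along those lines, here is a minimal sketch of how "grouping" related items might work on the bid list, assuming someone records predecessor/successor ISSN pairs (e.g. copied out of the xISSN title-history tool) alongside the titles. The ISSNs, titles, and data structures below are illustrative placeholders, not part of any existing tool:

```python
# Group serial bid-list rows into "title families" from recorded ISSN links.
# The issn_links pairs would come from research done with the xISSN tool;
# everything here is placeholder data, not an existing bidlist structure.
from collections import defaultdict

issn_links = [
    ("1234-5678", "8765-4321"),   # earlier-title ISSN -> later-title ISSN
]

bid_rows = {
    "1234-5678": "Example journal (earlier title)",
    "8765-4321": "Example journal (later title)",
    "1111-2222": "Unrelated journal",
}

parent = {}   # simple union-find so chains of title changes form one family

def find(issn):
    parent.setdefault(issn, issn)
    while parent[issn] != issn:
        parent[issn] = parent[parent[issn]]   # path compression
        issn = parent[issn]
    return issn

def union(a, b):
    parent[find(a)] = find(b)

for earlier, later in issn_links:
    union(earlier, later)

families = defaultdict(list)
for issn, title in bid_rows.items():
    families[find(issn)].append((issn, title))

for root, members in families.items():
    print("Family", root, "->", members)
```

Titles with no ISSN would still need title-based matching, as discussed below.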
Fuzzy matching on the monographic deduper is on my wish list. There is some stripping of punctuation, etc. that is done already to help in that regard.
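For reference, a rough sketch of the kind of normalization-plus-fuzzy-comparison that could sit behind such a wish-list feature. The normalization rules and the similarity threshold here are illustrative; they are not what the monographic deduper actually does today:

```python
# Illustrative title normalization and fuzzy comparison using the standard
# library; the existing deduper only strips punctuation, as noted above.
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase, fold diacritics, strip punctuation and collapse whitespace."""
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def similarity(a, b):
    """Ratio between 0 and 1 on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# A cut-off such as 0.9 (an arbitrary placeholder) could flag "possible matches".
print(similarity("Annales des sciences naturelles.",
                 "Annales des Sciences Naturelles"))   # ~1.0
```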
4/13/09 - More from Diane R.
How about the dreaded multi-volume sets? Right now, we have to upload them to the monographic deduper and then check the bid list as well. We have "monographs" with serial information in the MARC 440 or 490. It would be nice if the two tools could communicate with one another, so that when you upload something to the monographic deduper it would make a note (even if only denoted as a "possible match") on the serials bid list based on the 440/490 fields.
Can the tools communicate with the portal so that what is in the portal is included? That way, bulk uploads to the portal could be automatically updated to the tools. I realize these will not be "perfect" given the variations in cataloging, titles, etc. but at least it would be a big start.
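On the 440/490 idea above, a hedged sketch of how a monograph upload could raise a "possible match" note for the serials bid list, assuming the upload is available as a MARC file and using pymarc. The file name and the bid-list lookup stub are assumptions; no such hook exists in the current tools:

```python
# Flag monograph records whose 440/490 series statements may correspond to a
# serial on the bid list. Only the use of 440/490 comes from the discussion
# above; the upload file and check_bidlist stub are placeholders.
from pymarc import MARCReader

def series_statements(record):
    """Return the $a subfields of any 440/490 series fields on the record."""
    titles = []
    for field in record.get_fields("440", "490"):
        titles.extend(field.get_subfields("a"))
    return titles

def check_bidlist(series_title):
    """Placeholder: look the series title up on the serials bid list."""
    return None   # no bid-list API exists for this yet

with open("monograph_upload.mrc", "rb") as fh:   # hypothetical upload file
    for record in MARCReader(fh):
        if record is None:
            continue
        for series in series_statements(record):
            match = check_bidlist(series)
            if match:
                print("Possible serial match to note on the bid list:", match)
```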
4/24 - Post phone call note from Matt -
Just thinking, as I am faced with processing a difficult serials title as I write this: sometimes we encounter a serial with multiple titles, multiple enumeration schemes, and multiple pagination schemes. That makes automation even more difficult, not to mention handling things the old-fashioned way.
Matches to BHL-E metadata?
Title Matching and De-duplication Tools
Phase 1: Preparation
- Retrospectively correct diacritics in Serials Mashup from source files (where possible) – UTF-8 issue (see the sketch after this list).
- Retrospectively add MARC Leader in Serials Mashup from source files where possible. This will aid in later matching to BHL portal scanned titles.
- Journal Title Authority List (separate and yet crucial)
- Mono pre-export issues? – John F
- Export test data from Serials bidlist and Mono database
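A minimal sketch of the first two Phase 1 items, assuming the bad diacritics are UTF-8 text that was read as Latin-1 somewhere along the way (a common cause, but not confirmed here) and that the source files are available as MARC. File names are placeholders:

```python
# Phase 1 sketches: repair garbled diacritics and pull the MARC Leader from the
# source records. Assumes the garbling came from UTF-8 read as Latin-1; other
# causes would need different handling. The file name is a placeholder.
from pymarc import MARCReader

def fix_diacritics(text):
    """Re-decode UTF-8 mistakenly read as Latin-1, e.g. 'MusÃ©um' -> 'Muséum'."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text   # already clean, or damaged in some other way

def leaders_by_control_number(path):
    """Map each source record's 001 control number to its MARC Leader string."""
    leaders = {}
    with open(path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            control = record.get_fields("001")
            if control:
                leaders[control[0].value()] = str(record.leader)
    return leaders

print(fix_diacritics("MusÃ©um national d'histoire naturelle"))
```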
Phase 2: Build
- Import and preserve merges to date in new system (see the sketch after this list)
- Improve merge algorithm (based on the IA model?). Run this to improve matches.
- Open Library algorithm?
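For the first Phase 2 item, a sketch of carrying existing merges forward as a simple mapping, assuming the current system can export pairs of (merged-away record ID, surviving record ID); the CSV layout and file name are made up for illustration:

```python
# Preserve merges already made: load an exported "merged id -> surviving id"
# map and resolve chains so old merge decisions survive the re-import.
# The CSV layout and file name are hypothetical.
import csv

def load_merge_map(path):
    """Read merged_id,surviving_id pairs and resolve chains of merges."""
    direct = {}
    with open(path, newline="") as fh:
        for merged_id, surviving_id in csv.reader(fh):
            direct[merged_id] = surviving_id

    def resolve(record_id):
        seen = set()
        while record_id in direct and record_id not in seen:
            seen.add(record_id)
            record_id = direct[record_id]
        return record_id

    return {merged: resolve(merged) for merged in direct}

# On import, an incoming record whose ID appears in the map attaches to the
# surviving record instead of creating a fresh duplicate.
merge_map = load_merge_map("bidlist_merges.csv")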
Phase 3: Test
Phase 4: Add new sites' metadata
(BHL-E, California Digital Library, etc.)
Involves ingest of records from ZDB?
Functionality
(=Functional Requirements?)
Discovery
- Search across all by title, publisher, id
- Filter on MONO or SERIAL only.
- Filter on contributing library (e.g. NHM London) or consortium (e.g. BHL-E)
- Filter for subject classification?
Adding bulk datasets and merging
- Import new dataset
- Merge new dataset (automated routine – normalize, numeric matching and title pattern matching [see separate detailed spec + IA dedup spec])
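A hedged outline of that automated routine: exact numeric-identifier matching first, then a normalized-title pass for whatever is left, reusing a similarity function like the one sketched earlier. The identifier fields and the 0.95 threshold are assumptions; the separate detailed spec and the IA dedup spec remain the authority:

```python
# Two-pass merge sketch: shared numeric identifiers first, then near-identical
# normalized titles. Field names and the threshold are illustrative only.
def numeric_keys(record):
    """Identifiers worth exact-matching on (OCLC number, ISSN, LCCN)."""
    return {record.get(k) for k in ("oclc", "issn", "lccn") if record.get(k)}

def merge_pass(existing, incoming, title_similarity, threshold=0.95):
    """Return (matched, unmatched); matched items carry the record they merge into."""
    matched, unmatched = [], []
    for rec in incoming:
        # Pass 1: any shared numeric identifier is treated as a firm match.
        hit = next((old for old in existing
                    if numeric_keys(rec) & numeric_keys(old)), None)
        # Pass 2: fall back to title pattern matching on the normalized forms.
        if hit is None:
            hit = next((old for old in existing
                        if title_similarity(rec["title"], old["title"]) >= threshold),
                       None)
        if hit:
            matched.append((rec, hit))
        else:
            unmatched.append(rec)
    return matched, unmatched
```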
Bidding on material
- Upload a file of potentially scannable material from a library. Place automated bids which can be confirmed or rejected by the inputter before being committed (see the sketch below).
- Place single manual bids (as before on the Serials bidlist). Ensure fields and normalisation fit our needs properly.
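A small sketch of the confirm-or-reject step above, assuming a matcher has already proposed bids from the uploaded file. The Bid fields and status values are placeholders, not an existing schema:

```python
# Proposed bids sit as "pending" until the inputter confirms or rejects each
# one; only confirmed bids are committed. Fields and statuses are placeholders.
from dataclasses import dataclass

@dataclass
class Bid:
    title: str
    library: str
    status: str = "pending"   # pending -> confirmed | rejected

def review(bids, decisions):
    """Apply the inputter's decisions; anything undecided stays pending."""
    for bid in bids:
        bid.status = decisions.get(bid.title, bid.status)
    return [b for b in bids if b.status == "confirmed"]

proposed = [Bid("Journal of example botany", "MBLWHOI"),
            Bid("Annals of placeholder zoology", "NHM London")]
committed = review(proposed, {"Journal of example botany": "confirmed"})
print([b.title for b in committed])
```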
Reporting tools
- Report on own bids and others' bids – bids where scanning is still to be done (partial, none) or where scanning is complete.
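As an illustration of the sort of report meant here, a tiny sketch counting bids per library by scanning status; the bid structure and status values are assumed, since the real fields are not specified:

```python
# Illustrative report: count bids per contributing library by scanning status.
# The dict keys and status values are assumptions for the sake of the example.
from collections import Counter

bids = [
    {"library": "MBLWHOI", "status": "complete"},
    {"library": "MBLWHOI", "status": "none"},
    {"library": "NHM London", "status": "partial"},
]

report = Counter((b["library"], b["status"]) for b in bids)
for (library, status), count in sorted(report.items()):
    print(f"{library:12} {status:10} {count}")
```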
IPR
- Recording what we have permission to scan/not scan at title level.
- Recording existence of any publisher/institution agreement(s) signed or in process
Questions:
- How do we transition the process already underway from SIL to something more centrally accessible?
- Who would manage this?
- Should there be a notification system to alert libraries that new titles are available?
Other
- Access control per site (serials bidlist model but possibly more hierarchical to allow for operator and supervisor per site?)
- Public-facing website for non-BHL institutions to see and check what's been done, contribute their own material, and make suggestions for prioritising
- Interface for ingesting pre-scanned material that meets the minimum BHL standard and for which suitably formed metadata exists.