This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

DeDupe Group Page

Action Item: Small workgroup/task force to examine requirements for a deduping and bid list. Report to Henning. Executive Committee designed. Review Open Library’s deduping algorithms.

Dedupe Group conf. call scheduled for Friday April 24 @ 10am EST | Agenda | Meeting Notes

Table of Contents

Deliverable for May 8
Matches to BHL-E metadata?
Title Matching and De-duplication Tools
Phase 1: Preparation
Phase 2: Build
Phase 3: Test
Phase 4: Add new sites' metadata
Adding bulk datasets and merging
Bidding on material
Reporting tools

Deliverable for May 8


4/2/09 - Questions from Diane R. (MBLWHOI)
MARC Leader. For MBLWHOI, we add 856s after volumes are scanned so our MARC leaders are changing all the time. The MARC leader that is uploaded to the bid list may not be the MARC leader that goes with our scanned volume. Will the MARC leader still be the "match of last resort"?
Bernard: Very important question. Originally, when building the bid list, I asked the suppliers of the ILS metadata for unique IDs. We've never fully used these, but they would at least give good matching on volumes all supplied by one institution where the MARC LDR has changed.
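
Bernard's suggestion could be sketched roughly as below: try the supplier's unique ID first and keep the MARC leader as the match of last resort. This is only an illustration; the function name `find_bid_record` and the record keys `institution`, `local_id` and `leader` are hypothetical and do not reflect the actual bid list schema.

```python
def find_bid_record(scanned, bid_records):
    """Return (record, how) for the bid list record matching a scanned volume.

    Records are plain dicts with hypothetical keys:
    'institution', 'local_id' (supplier's unique ID) and 'leader' (MARC LDR).
    """
    # Index bid list records by (institution, local_id) where an ID was supplied.
    by_local_id = {
        (r.get("institution"), r["local_id"]): r
        for r in bid_records
        if r.get("local_id")
    }

    # 1. Preferred: the supplier's own ID, which survives leader changes.
    key = (scanned.get("institution"), scanned.get("local_id"))
    if key in by_local_id:
        return by_local_id[key], "local_id"

    # 2. Last resort: identical MARC leader.
    for r in bid_records:
        if r.get("leader") and r["leader"] == scanned.get("leader"):
            return r, "leader"

    return None, "unmatched"
```

With this ordering, a volume whose leader changed after an 856 was added would still match on its local ID, as long as the ID was uploaded with the original record.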

Would adding MARC 780s to the serials bid list help?

OCLC has an interface that reveals historical ISSN information for a title. Entering an ISSN in this interface shows the title's ISSN family tree (there are sample ISSNs to play with), and this demo interface shows the associated metadata for a title. Nice visualization tools. I know there is a notes field now so people can write down some of their research on a title; could this be expanded a bit to "group" related items more clearly? Could we take advantage of some of this OCLC data for titles that have ISSNs? (I realize that many of our older titles do not have ISSNs.) Many thanks to Matt P. for the heads-up on the ISSN tool.
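
One possible way to "group" related items is to treat each earlier/later title relationship (for example a MARC 780 preceding-entry link, or a relation taken from an ISSN family tree) as a link between two titles, and then collect everything that is connected into one family. The sketch below assumes the links have already been extracted as pairs of identifiers; `title_families` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def title_families(links):
    """Group title identifiers into families from (earlier, later) pairs."""
    # Build an undirected graph: a title change links both directions.
    neighbours = defaultdict(set)
    for earlier, later in links:
        neighbours[earlier].add(later)
        neighbours[later].add(earlier)

    seen = set()
    families = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Walk everything reachable from this title.
        family, stack = set(), [start]
        while stack:
            title = stack.pop()
            if title in family:
                continue
            family.add(title)
            stack.extend(neighbours[title] - family)
        seen |= family
        families.append(sorted(family))
    return families


if __name__ == "__main__":
    links = [("A", "B"), ("B", "C"), ("X", "Y")]
    print(title_families(links))
```

Titles with no ISSN and no 780 would simply not appear in any family and would still need manual notes.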

Fuzzy matching on the monographic deduper is on my wish list. There is some stripping of punctuation, etc. that is done already to help in that regard.
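
As a rough illustration of what fuzzy matching on top of punctuation stripping might look like, here is a minimal sketch using only Python's standard library. It is not how the monographic deduper works today; `normalize_title`, `title_similarity`, `is_possible_match` and the 0.9 threshold are all assumptions chosen for the example.

```python
import string
from difflib import SequenceMatcher

# Map every ASCII punctuation character to a space so words do not fuse together.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_title(title):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    cleaned = title.lower().translate(_PUNCT_TO_SPACE)
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def is_possible_match(a, b, threshold=0.9):
    """Flag a pair as a possible duplicate; the threshold is an arbitrary starting point."""
    return title_similarity(a, b) >= threshold
```

Pairs above the threshold would be better presented to a person as "possible matches" than merged automatically, since short titles can score high by accident.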

4/13/09 - More from Diane R.
How about the dreaded multi-volume sets? Right now, we have to upload them to the monographic deduper and then check the bid list as well. We have "monographs" with serial information in the MARC 440 or 490. It would be nice if the two tools could communicate with one another so that if you uploaded something on the monographic deduper, it would make a note (even if it is denoted as "possible match") on the serials bid list based on the 440/490 fields.
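
The cross-check described above might be sketched as follows: pull the series statement from the 440/490 of a monograph record and look it up against the serials bid list titles. The record layout (a dict of MARC tag to a list of subfield dicts), the `serials_index` mapping and the function names are hypothetical, chosen only to keep the example self-contained.

```python
def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def series_titles(record):
    """Return the subfield $a values of any 440 or 490 fields in the record."""
    titles = []
    for tag in ("440", "490"):
        for field in record.get(tag, []):
            value = field.get("a")
            if value:
                titles.append(value)
    return titles


def flag_possible_serial_matches(mono_record, serials_index):
    """Return notes to attach to the serials bid list for one monograph record.

    serials_index maps a normalized serial title to its bid list ID.
    """
    notes = []
    for series in series_titles(mono_record):
        bid_id = serials_index.get(normalize(series))
        if bid_id is not None:
            notes.append({"bid_id": bid_id, "status": "possible match", "series": series})
    return notes
```

An exact lookup on the normalized title is the simplest choice here; it will miss variant series statements, which is why the result is only ever labeled "possible match".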

Can the tools communicate with the portal so that what is in the portal is included? That way, bulk uploads to the portal could be automatically updated to the tools. I realize these will not be "perfect" given the variations in cataloging, titles, etc. but at least it would be a big start.

4/24 - Post phone call note from Matt -
Just thinking, as I am faced with processing a difficult serials title as I write this: sometimes we encounter a serial title with multiple titles, multiple enumeration schemes, and multiple pagination schemes. That makes automation even more difficult, not to mention handling things the old-fashioned way.

Matches to BHL-E metadata?


Title Matching and De-duplication Tools

Phase 1: Preparation

  1. Retrospectively correct diacritics in Serials Mashup from source files (where possible) – UTF-8 issue.
  2. Retrospectively add MARC Leader in Serials Mashup from source files where possible. This will aid in later matching to BHL portal scanned titles.
  3. Journal Title Authority List (separate and yet crucial)
  4. Mono pre-export issues? – John F
  5. Export test data from Serials bidlist and Mono database
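
For item 1, the exact nature of the UTF-8 issue is not recorded here. The sketch below assumes one common cause, UTF-8 bytes that were read as Latin-1, and also shows a diacritic-insensitive key that could be used for matching where the source files cannot be recovered. Both `repair_mojibake` and `match_key` are hypothetical helpers, not part of the existing tools.

```python
import unicodedata


def repair_mojibake(text):
    """Undo UTF-8 bytes misread as Latin-1; return the text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so this was not that kind of damage.
        return text


def match_key(text):
    """Build a comparison key that ignores diacritics and case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

The repair is a heuristic and should be spot-checked against the source files; the match key is only for comparison, and the stored title should keep its diacritics.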

Phase 2: Build

  1. Import and preserve merges to date in new system
  2. Improve merge algorithm (base upon IA model?). Run this to improve matches.
  3. Open Library algorithm?

Phase 3: Test

Phase 4: Add new sites' metadata

(BHL-E, California Digital Libraries, etc.)
Involves ingest of records from ZDB?


(=Functional Requirements?)


Adding bulk datasets and merging

Bidding on material

Reporting tools


  1. How do we transition the process already underway from SIL to something more centrally accessible?
  2. Who would manage this?
  3. Should there be a notification system to alert libraries that new titles are available?


This page has been edited 29 times. The last modification was made by chrisfreeland on Jul 27, 2009 7:03 am.