DeDupe Group Page
Action Item: Small workgroup/task force to examine requirements for a de-duping tool and bid list. Report to Henning. Designated by the Executive Committee. Review Open Library's de-duping algorithms.
- Assigned: Bianca Lipscomb, Diane Rielinger, Matt Person, Bernard Scaife, Ryan Schenk, Joe deVeer; John Furfey to advise BHL Europe
- Date: April 30
Dedupe Group conf. call scheduled for Friday April 24 @ 10am EST | Agenda | Meeting Notes
Questions
4/2/09 - Questions from Diane R. (MBLWHOI)
MARC Leader. For MBLWHOI, we add 856s after volumes are scanned so our MARC leaders are changing all the time. The MARC leader that is uploaded to the bid list may not be the MARC leader that goes with our scanned volume. Will the MARC leader still be the "match of last resort"?
Bernard: Very important question. Originally, when building the bidlist, I asked the suppliers of the ILS metadata for unique IDs. We've never fully used these, but they would at least give good matching on volumes supplied by a single institution even where the MARC LDR has changed.
Would adding MARC 780s to the serials bid list help?
OCLC has an interface that reveals historical ISSN information for a title. Inputting an ISSN reveals the title's ISSN family tree (there are sample ISSNs to play with)
http://worldcat.org/xissn/titlehistory and
this demo interface reveals associated metadata for a title
http://xissn.worldcat.org/xissndemo/index.htm. Nice visualization tools. I know there is a notes field now so people can write down some of their research on a title - could this be expanded a bit to "group" related items more clearly? Could we take advantage of some of this OCLC data for titles that have ISSNs? (I realize that many of our older titles do not have ISSNs.) (Many thanks to Matt P. for the heads-up on the ISSN tool.)
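Along those lines, here is a minimal sketch of how "grouping" related items might work on the bid list, assuming someone records predecessor/successor ISSN pairs (e.g. copied out of the xISSN title-history tool) alongside the titles. The ISSNs, titles, and data structures below are illustrative placeholders, not part of any existing tool:

```python
# Group serial bid-list rows into "title families" from recorded ISSN links.
# The issn_links pairs would come from research done with the xISSN tool;
# everything here is placeholder data, not an existing bidlist structure.
from collections import defaultdict

issn_links = [
    ("1234-5678", "8765-4321"),   # earlier-title ISSN -> later-title ISSN
]

bid_rows = {
    "1234-5678": "Example journal (earlier title)",
    "8765-4321": "Example journal (later title)",
    "1111-2222": "Unrelated journal",
}

parent = {}   # simple union-find so chains of title changes form one family

def find(issn):
    parent.setdefault(issn, issn)
    while parent[issn] != issn:
        parent[issn] = parent[parent[issn]]   # path compression
        issn = parent[issn]
    return issn

def union(a, b):
    parent[find(a)] = find(b)

for earlier, later in issn_links:
    union(earlier, later)

families = defaultdict(list)
for issn, title in bid_rows.items():
    families[find(issn)].append((issn, title))

for root, members in families.items():
    print("Family", root, "->", members)
```

Titles with no ISSN would still need title-based matching, as discussed below.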
Fuzzy matching on the monographic deduper is on my wish list. There is some stripping of punctuation, etc. that is done already to help in that regard.
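For reference, a rough sketch of the kind of normalization-plus-fuzzy-comparison that could sit behind such a wish-list feature. The normalization rules and the similarity threshold here are illustrative; they are not what the monographic deduper actually does today:

```python
# Illustrative title normalization and fuzzy comparison using the standard
# library; the existing deduper only strips punctuation, as noted above.
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase, fold diacritics, strip punctuation and collapse whitespace."""
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def similarity(a, b):
    """Ratio between 0 and 1 on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# A cut-off such as 0.9 (an arbitrary placeholder) could flag "possible matches".
print(similarity("Annales des sciences naturelles.",
                 "Annales des Sciences Naturelles"))   # ~1.0
```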
4/13/09 - More from Diane R.
How about the dreaded multi-volume sets? Right now, we have to upload them to the monographic deduper and then check the bid list as well. We have "monographs" with serial information in the MARC 440 or 490. It would be nice if the two tools could communicate with one another, so that when you upload something to the monographic deduper it would make a note (even if only denoted as a "possible match") on the serials bid list based on the 440/490 fields.
Can the tools communicate with the portal so that what is in the portal is included? That way, bulk uploads to the portal could be automatically updated to the tools. I realize these will not be "perfect" given the variations in cataloging, titles, etc. but at least it would be a big start.
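On the 440/490 idea above, a hedged sketch of how a monograph upload could raise a "possible match" note for the serials bid list, assuming the upload is available as a MARC file and using pymarc. The file name and the bid-list lookup stub are assumptions; no such hook exists in the current tools:

```python
# Flag monograph records whose 440/490 series statements may correspond to a
# serial on the bid list. Only the use of 440/490 comes from the discussion
# above; the upload file and check_bidlist stub are placeholders.
from pymarc import MARCReader

def series_statements(record):
    """Return the $a subfields of any 440/490 series fields on the record."""
    titles = []
    for field in record.get_fields("440", "490"):
        titles.extend(field.get_subfields("a"))
    return titles

def check_bidlist(series_title):
    """Placeholder: look the series title up on the serials bid list."""
    return None   # no bid-list API exists for this yet

with open("monograph_upload.mrc", "rb") as fh:   # hypothetical upload file
    for record in MARCReader(fh):
        if record is None:
            continue
        for series in series_statements(record):
            match = check_bidlist(series)
            if match:
                print("Possible serial match to note on the bid list:", match)
```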
4/24 - Post phone call note from Matt -
Just thinking, as I am faced with processing a difficult serials title as I write this: sometimes we encounter a serial with multiple titles, multiple enumeration schemes, and multiple pagination schemes. That makes automation even more difficult, not to mention handling things the old-fashioned way.
Matches to BHL-E metadata?
Title Matching and De-duplication Tools
Phase 1: Preparation
- Retrospectively correct diacritics in Serials Mashup from source files (where possible) – UTF-8 issue (see the sketch after this list).
- Retrospectively add MARC Leader in Serials Mashup from source files where possible. This will aid in later matching to BHL portal scanned titles.
- Journal Title Authority List (separate and yet crucial)
- Mono pre-export issues? – John F
- Export test data from Serials bidlist and Mono database
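A minimal sketch of the first two Phase 1 items, assuming the bad diacritics are UTF-8 text that was read as Latin-1 somewhere along the way (a common cause, but not confirmed here) and that the source files are available as MARC. File names are placeholders:

```python
# Phase 1 sketches: repair garbled diacritics and pull the MARC Leader from the
# source records. Assumes the garbling came from UTF-8 read as Latin-1; other
# causes would need different handling. The file name is a placeholder.
from pymarc import MARCReader

def fix_diacritics(text):
    """Re-decode UTF-8 mistakenly read as Latin-1, e.g. 'MusÃ©um' -> 'Muséum'."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text   # already clean, or damaged in some other way

def leaders_by_control_number(path):
    """Map each source record's 001 control number to its MARC Leader string."""
    leaders = {}
    with open(path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            control = record.get_fields("001")
            if control:
                leaders[control[0].value()] = str(record.leader)
    return leaders

print(fix_diacritics("MusÃ©um national d'histoire naturelle"))
```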
Phase 2: Build
- Import and preserve merges to date in new system (see the sketch after this list)
- Improve merge algorithm (based on the IA model?). Run this to improve matches.
- Open Library algorithm?
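For the first Phase 2 item, a sketch of carrying existing merges forward as a simple mapping, assuming the current system can export pairs of (merged-away record ID, surviving record ID); the CSV layout and file name are made up for illustration:

```python
# Preserve merges already made: load an exported "merged id -> surviving id"
# map and resolve chains so old merge decisions survive the re-import.
# The CSV layout and file name are hypothetical.
import csv

def load_merge_map(path):
    """Read merged_id,surviving_id pairs and resolve chains of merges."""
    direct = {}
    with open(path, newline="") as fh:
        for merged_id, surviving_id in csv.reader(fh):
            direct[merged_id] = surviving_id

    def resolve(record_id):
        seen = set()
        while record_id in direct and record_id not in seen:
            seen.add(record_id)
            record_id = direct[record_id]
        return record_id

    return {merged: resolve(merged) for merged in direct}

# On import, an incoming record whose ID appears in the map attaches to the
# surviving record instead of creating a fresh duplicate.
merge_map = load_merge_map("bidlist_merges.csv")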
Phase 3: Test
Phase 4: Add new sites' metadata
(BHL-E, California Digital Library, etc.)
Involves ingest of records from ZDB?
Functionality
(=Functional Requirements?)
Discovery
- Search across all by title, publisher, id
- Filter on MONO or SERIAL only.
- Filter on contributing library (e.g. NHM London) or consortium (e.g. BHL-E)
- Filter for subject classification?
Adding bulk datasets and merging
- Import new dataset
- Merge new dataset (automated routine – normalize, numeric matching and title pattern matching [see separate detailed spec + IA dedup spec])
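A hedged outline of that automated routine: exact numeric-identifier matching first, then a normalized-title pass for whatever is left, reusing a similarity function like the one sketched earlier. The identifier fields and the 0.95 threshold are assumptions; the separate detailed spec and the IA dedup spec remain the authority:

```python
# Two-pass merge sketch: shared numeric identifiers first, then near-identical
# normalized titles. Field names and the threshold are illustrative only.
def numeric_keys(record):
    """Identifiers worth exact-matching on (OCLC number, ISSN, LCCN)."""
    return {record.get(k) for k in ("oclc", "issn", "lccn") if record.get(k)}

def merge_pass(existing, incoming, title_similarity, threshold=0.95):
    """Return (matched, unmatched); matched items carry the record they merge into."""
    matched, unmatched = [], []
    for rec in incoming:
        # Pass 1: any shared numeric identifier is treated as a firm match.
        hit = next((old for old in existing
                    if numeric_keys(rec) & numeric_keys(old)), None)
        # Pass 2: fall back to title pattern matching on the normalized forms.
        if hit is None:
            hit = next((old for old in existing
                        if title_similarity(rec["title"], old["title"]) >= threshold),
                       None)
        if hit:
            matched.append((rec, hit))
        else:
            unmatched.append(rec)
    return matched, unmatched
```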
Bidding on material
- Upload a file of potentially scannable material from a library. Place automated bids which can be confirmed or rejected by the inputter before being committed (see the sketch below).
- Place single manual bids (as before on the Serials bidlist). Ensure fields and normalisation fit our needs properly.
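A small sketch of the confirm-or-reject step above, assuming a matcher has already proposed bids from the uploaded file. The Bid fields and status values are placeholders, not an existing schema:

```python
# Proposed bids sit as "pending" until the inputter confirms or rejects each
# one; only confirmed bids are committed. Fields and statuses are placeholders.
from dataclasses import dataclass

@dataclass
class Bid:
    title: str
    library: str
    status: str = "pending"   # pending -> confirmed | rejected

def review(bids, decisions):
    """Apply the inputter's decisions; anything undecided stays pending."""
    for bid in bids:
        bid.status = decisions.get(bid.title, bid.status)
    return [b for b in bids if b.status == "confirmed"]

proposed = [Bid("Journal of example botany", "MBLWHOI"),
            Bid("Annals of placeholder zoology", "NHM London")]
committed = review(proposed, {"Journal of example botany": "confirmed"})
print([b.title for b in committed])
```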
Reporting tools
- Report on own bids and others' bids – bids where scanning is still to be done (partial, none) or where scanning is complete.
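As an illustration of the sort of report meant here, a tiny sketch counting bids per library by scanning status; the bid structure and status values are assumed, since the real fields are not specified:

```python
# Illustrative report: count bids per contributing library by scanning status.
# The dict keys and status values are assumptions for the sake of the example.
from collections import Counter

bids = [
    {"library": "MBLWHOI", "status": "complete"},
    {"library": "MBLWHOI", "status": "none"},
    {"library": "NHM London", "status": "partial"},
]

report = Counter((b["library"], b["status"]) for b in bids)
for (library, status), count in sorted(report.items()):
    print(f"{library:12} {status:10} {count}")
```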
IPR
- Recording what we have permission to scan/not scan at title level.
- Recording existence of any publisher/institution agreement(s) signed or in process
Questions:
- How do we transition the process already underway from SIL to something more centrally accessible?
- Who would manage this?
- Should there be a notification system to alert libraries that new titles are available?
Other
- Access control per site (serials bidlist model but possibly more hierarchical to allow for operator and supervisor per site?)
- Public-facing website for non-BHL institutions to see and check what's been done, contribute their own material, and make suggestions for prioritising
- Interface for ingesting pre-scanned material that meets the minimum BHL standard and for which suitably formed metadata exists.