MonographicDeDuping
Additional DeDuping considerations:
We also need to de-dupe on volume and/or year for monographic series and multi-volume sets. That means including individual item information as well as the MARC fields below.
DeDuping Fields to try:
100 $a
700 $a
035 (OCLC number)
260 $a and/or $b and/or $c
245 $a and $b (not sure what to do about $b. We might want to see what happens and consider "fuzzy" matching on all the terms in $a/$b; see the sketch just below this list.)
NOTE: We decided NOT to use ISBNs. (SIL didn't have the data accessible this time 'round, and there were so many titles without ISBNs that it was deemed not worth the effort right now.)
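To make the field list concrete, here is a minimal sketch of how a match key might be built from these fields plus the item-level volume and year. The record layout (a flat dict keyed by tag and subfield) and the volume/year field names are assumptions for illustration, not the actual export format of either library's data.

    import re

    # Hypothetical sketch: records are assumed to be flat dicts keyed by
    # "tag$subfield" (e.g. "245$a"), with item-level "volume"/"year" fields.

    def normalize(text):
        """Lowercase, drop punctuation, and collapse whitespace."""
        text = re.sub(r"[^\w\s]", " ", (text or "").lower())
        return " ".join(text.split())

    def match_key(record):
        """Combine author (100/700 $a), OCLC number (035), imprint (260),
        title (245 $a/$b), and item-level volume/year into one key."""
        return (
            normalize(record.get("100$a") or record.get("700$a")),  # author
            normalize(record.get("035")),                           # OCLC number
            normalize(record.get("260$b")),                         # publisher
            normalize(record.get("260$c")),                         # date of pub.
            normalize(record.get("245$a")),                         # title
            normalize(record.get("245$b")),                         # subtitle
            record.get("volume"),  # item-level: keeps multi-volume sets
            record.get("year"),    # from collapsing into one "duplicate"
        )

Two records are then flagged as duplicates only when their keys agree, so volume 2 of a set no longer matches volume 3 just because the titles and authors line up.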
RefWorks Project:
To access your account, just go to http://www.refworks.com/refworks, click on the Individual Log-in Tab, and enter:
Log-in Name: tgarnett
Password: biodiversity
10/2/07 - First experiment with RefWorks de-duping
MBLWHOI uploaded 5441 references to RefWorks from a picklist Diane created. Our method was to start with Diane's picklist, which is in Excel format, parse it into RefWorks XML, and then import it. FYI, the OCLC number and barcode went into the descriptor field. SI put 302 items from their database into a separate folder. We knew there were some duplicates in there. We then asked for "exact match" and "close match" duplicates.
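The conversion step might look something like the sketch below, assuming the Excel picklist is first exported to CSV. The column names and XML element names are placeholders; the real RefWorks XML schema isn't reproduced here, so treat this as an illustration of the workflow, not the actual import format.

    import csv
    import xml.etree.ElementTree as ET

    # Hypothetical sketch of the picklist-to-XML step; column and element
    # names are placeholders, not the real RefWorks schema.
    def picklist_to_xml(csv_path, xml_path):
        root = ET.Element("references")
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                ref = ET.SubElement(root, "reference")
                ET.SubElement(ref, "title").text = row["title"]
                ET.SubElement(ref, "author").text = row["author"]
                # OCLC number and barcode were packed into the descriptor field
                ET.SubElement(ref, "descriptor").text = (
                    "OCLC:%s; barcode:%s" % (row["oclc"], row["barcode"])
                )
        ET.ElementTree(root).write(xml_path, encoding="utf-8",
                                   xml_declaration=True)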
RefWorks currently de-dups based on title and author only. Therefore, all of the monographic series and multi-volume sets in the folders were considered dups - it brought back over 4000 results!
Another issue we noticed was that the SI records contained MARC subfield marks (now converted to $a, $b, etc.) while the MBLWHOI records didn't, meaning an exact match would not be possible. The SI records also included articles at the beginning of titles, such as "The" and "A"; MBLWHOI records did not. A "The" in front of a title, an extra space in the author, subfield marks, etc. were enough to keep the records from being recognized as duplicates, even with the "close match" feature. It really appears that the match must be absolutely exact. We don't know at this stage what happens with diacritics. I'm not very optimistic about the matching, because things are likely to come in slightly differently depending on the source.
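A normalization pass along the lines of the sketch below would take care of most of what tripped the matcher (subfield marks, leading articles, stray spaces, diacritics), and a token-overlap score is one possible way to make matching less brittle than an exact string comparison. This is an illustration of the idea, not RefWorks' actual matching logic.

    import re
    import unicodedata

    ARTICLES = {"the", "a", "an"}

    def normalize_title(title):
        """Strip subfield marks, diacritics, leading articles, extra spaces."""
        # Remove MARC subfield markers like $a, $b
        title = re.sub(r"\$[a-z0-9]", " ", title.lower())
        # Fold diacritics to plain ASCII (e.g. "é" becomes "e")
        title = unicodedata.normalize("NFKD", title)
        title = title.encode("ascii", "ignore").decode("ascii")
        # Drop punctuation and collapse runs of whitespace
        words = re.sub(r"[^\w\s]", " ", title).split()
        if words and words[0] in ARTICLES:
            words = words[1:]
        return " ".join(words)

    def token_overlap(a, b):
        """Rough "close match" score: shared words / total distinct words."""
        ta = set(normalize_title(a).split())
        tb = set(normalize_title(b).split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    # e.g. token_overlap("The Origin of Species", "$aOrigin of species.") == 1.0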
1st wish list:
1. identification of duplicates needs to be WAY less picky
2. de-duping on more fields, especially volume and year, to eliminate multi-volume sets being considered duplicates
3. consistency in importation (e.g. MARC subfields, diacritics, etc.)
4. ability to sort so that the "true duplicates" between organizations appear at the top and your "internal" duplicates at the bottom
In our minds, #1 and #2 are show-stoppers. If the match has to be so exact and we cannot de-dupe on volume or year, it's not going to work at all.