Deduping Enhancements
Vision
To better coordinate the scanning workflow for the libraries involved with the Biodiversity Heritage Library (BHL), avoid duplication, and allow for more participating institutions, the monograph de-duplication tool and the serial bid list, which are currently separate, need to be combined to form a master list of monograph and serial titles. Such a master list would allow users to work with one discrete title set at a time, i.e. a union catalog with artificially separate sections for monographs and serials. Within each section users should be able to place "bids" on titles to claim them for scanning, manually merge like titles together to form a single record, and identify duplicate volumes before scanning, hereafter referred to as "deduping". The system should be designed to absorb fresh bibliographic data from contributing institutions at different times without jeopardizing the integrity of the existing database or invalidating bids or merges previously placed.
Rationale
- The system shall allow users to place bids on serial titles they intend to scan
- The system must automate title deduping as much as possible.
- The system shall identify potential monographic and serial title duplicates according to various bibliographic criteria (OCLC number and title; title and volume; etc.)
- The system shall predict likely matches for monographs that can be later verified manually.
- The system shall allow users to manually merge records for the same title where automated title merging fails.
- The system shall allow users to upload a list of bids for monographs and serials in .xls file format (and permit other formats as well).
- The system shall report exceptions in the process of batch bidding where no bid could be successfully placed on a given title as derived from the file uploaded.
- The system shall allow for singular deduping of monographic series by using MARC fields to trace monograph and serial titles
- The system shall allow users to add single volumes or monographs to the tool, or remove them from it (e.g. remove a volume that was rejected by the scanning center).
- The system shall communicate with the BHL Portal so that items scanned and ingested into the Portal from non-BHL partner institutions are included and marked appropriately within the master list.
- The system must assimilate the work that has already been completed on the serial bid list and monograph deduplication tool as they exist in their present state. Within each of the current tools, significant progress has been made to merge like titles and clean up records.
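The requirement to identify potential duplicates by bibliographic criteria could be sketched roughly as follows. This is only an illustration, not the tool's actual matching logic: the record layout (`id`, `oclc`, `title`), the function names, and the normalization rules are all assumptions made for the example. Records that share an OCLC number or a normalized title are grouped as candidates for later manual verification.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_marks.lower())
    return " ".join(cleaned.split())


def candidate_duplicates(records):
    """Group record IDs that share an OCLC number or a normalized title.

    Only keys shared by more than one record are returned; these are
    candidates for manual review, not confirmed duplicates.
    """
    groups = defaultdict(set)
    for rec in records:
        if rec.get("oclc"):
            groups[("oclc", rec["oclc"])].add(rec["id"])
        groups[("title", normalize_title(rec["title"]))].add(rec["id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records from three institutions (illustrative values only).
records = [
    {"id": "A1", "oclc": "1234", "title": "Journal of Generica"},
    {"id": "B7", "oclc": None, "title": "Journal of Génerica."},
    {"id": "C3", "oclc": "1234", "title": "J. of Generica"},
]
matches = candidate_duplicates(records)
```

Here A1 and C3 are paired through the shared OCLC number, and A1 and B7 through the normalized title. An abbreviated title such as "J. of Generica" is not caught by title matching alone, which is one reason manual merging remains necessary.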
Glossary
Bid: a claim placed on a title, or series of volumes within a title, indicating that a given institution will scan the item(s). Bids can be revisited after the process of scanning to indicate what intended claims were not fulfilled.
Bid List: Union catalog of serial titles assimilated from each of the BHL participating institutions.
Deduper: The tool used to identify potential duplicates in the monograph scanning workflow.
Deduping: Identifying potential duplicates before they are sent to be scanned.
Full Bid: a claim on a complete, finite title; for serials, the run of the title is finite and closed, the institution holds the entire run, and it will bid on and scan all of it.
Merging: manually merging records together that represent the exact same title - due to cataloging inconsistencies between institutions, these records could not be automatically merged.
Partial Bid: a claim on a portion of a title, which the user can edit and close out after their scanning process is complete.
Picklist: a list of items selected from a given institution’s collection that are potential candidates for scanning.
DRAFT Use Case Scenarios
Serial
The user has selected the Journal of Generica as a potential candidate for scanning. The user opens the master deduplication tool and logs in. Then the user filters the selection for serials. The user searches for the Journal of Generica within the list of serials by entering the full or partial title into the search box, in this case "Generica". The system returns the search results in a brief list format showing the title ID, title name, publisher/creator, abbreviated title (if any), the institution(s) where holdings exist, and the "bid" status for each record. For records with full or partial bids, the institution that placed the bid will be indicated as part of the record. Also within the brief list of search results, the system allows the user to select multiple titles for merging, or deduping, and allows the user to place a full or partial bid on the record that matches their scanning selection.
The user identifies 3 records that match the Journal of Generica which, due to cataloging inconsistencies among various institutions, were not automatically merged upon inclusion into the union catalog. The user selects record 1 of 3 for further investigation and the system displays a detailed view of the bibliographic information associated with that record. The record describes volumes 6-10 of the Journal of Generica, published 1895-1905. The MARC 780 and 785 fields describe the preceding and succeeding titles, Bulletin of Generica and Generica Journal respectively. If the system was able to automatically associate the MARC 780 and 785 fields with their respective bibliographic records in the catalog, then the user should be able to select a link to the preceding or succeeding title. (The same should be true for the MARC 247 former title field.) The system then generates a new window that displays the record which corresponds to the preceding or succeeding title. If the system was not able to automatically link the MARC 780 and/or 785 fields, then the user is able to select an option to manually search the catalog for the preceding or succeeding title. The system generates a new window allowing the user to search the catalog while maintaining access to the original record selected in the search for the Journal of Generica.
The user locates records for the associated title, Bulletin of Generica, published 1883-1894. The user places a full bid on the Journal of Generica and a partial bid on the Bulletin of Generica because vols. 3 and 5 are missing from the collection. The system allows the user to manually associate the Journal of Generica and the Bulletin of Generica through the appropriate MARC fields so that future users, in their search for one title, can see that any associated titles were "also bid upon". The user enters a note to enumerate the volumes in the Bulletin that will not be scanned and closes the window. The user examines the additional Journal of Generica records to confirm that they match record 1 of 3. The system allows the user to select the 3 records and merge them into 1. The system also allows the user to choose which record will serve as the representative record, in this case, record 1.
Note: As an option for placing partial bids on titles with volumes that are still within copyright, the system shall automatically calculate the end date of the partial bid from the national copyright date information attached to the user's login profile.
For Uploading an entire picklist into the Deduplication Tool:
A user has generated a picklist containing all of the items they wish to scan and check for duplication. This picklist contains both monographs and serials. The user goes to the deduper tool and uploads the picklist in .xls file format containing the minimum MARC fields for each item, i.e. each volume (see below). The deduplication tool compares each of the items in the picklist to other bids that have already been uploaded by other users. These bids include both monographs and serials. The deduplication tool then returns a list of potential duplicates (both for monographs and serials). The list that is displayed is made up of the records from the current user's picklist. In other words, the deduplication tool displays the records that may potentially be duplicates from the picklist that was just uploaded. When a user clicks on one of the displayed records, more information displays below the record from the uploaded picklist. This information details the records from other institutions that match the user's record (essentially the records that may be the duplicate(s) of that record). The user can then see all of the potential duplicate records at one time on one page. If the user ascertains that their record is a duplicate of something that has already been scanned, they can delete their record from the uploaded picklist. Once this process is complete, the user can click "confirm", and the deduplication tool includes these records in its database so that future uploaded picklists can be compared against them.
The key here is that users would be deduping monographs and serials at once, from one interface. That way, regardless of whether one institution catalogs a set of items as a monographic series and another catalogs it as a serial, this will not make a difference when de-duping. All items will be compared to all other items: monographs to monographs, serials to serials, and monographs to serials.
The unity of the serial bid list and monograph de-duplication tool also allows users to check one, or a handful, of items for their scan status. One-stop shopping for the user!
The system loads the bids and flags each one with a bid date and the institution ID linked with the login of the person who uploaded it. An exceptions report is generated for any bids which could not be placed because the title could not be found or because an existing bid with overlapping years (or a full bid) already exists.
*Minimum fields for upload xls file:
ILS bib ID as column 1, year start (2), year end (3), Partial/Full (4)
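The batch bidding step and its exceptions report could look something like the sketch below. It assumes the .xls rows have already been read into tuples in the column order given above; the `catalog` lookup, the bid dictionary layout, the function names, and the sample dates and IDs are hypothetical and only serve the illustration.

```python
def years_overlap(start_a, end_a, start_b, end_b):
    """True when two inclusive year ranges share at least one year."""
    return start_a <= end_b and start_b <= end_a


def place_bids(rows, catalog, existing_bids, institution_id, bid_date):
    """Place bids from uploaded rows and collect exceptions.

    rows: (ils_bib_id, year_start, year_end, "Partial" or "Full") tuples.
    catalog: maps ILS bib IDs to master-list title IDs (assumed lookup).
    existing_bids: maps title IDs to lists of bids; it is updated in place
    so that later rows in the same file are checked against earlier ones.
    """
    placed, exceptions = [], []
    for bib_id, year_start, year_end, kind in rows:
        title_id = catalog.get(bib_id)
        if title_id is None:
            exceptions.append((bib_id, "title not found"))
            continue
        conflict = any(
            bid["kind"] == "Full"
            or years_overlap(year_start, year_end, bid["year_start"], bid["year_end"])
            for bid in existing_bids.get(title_id, [])
        )
        if conflict:
            exceptions.append((bib_id, "overlapping or full bid exists"))
            continue
        bid = {
            "title_id": title_id,
            "kind": kind,
            "year_start": year_start,
            "year_end": year_end,
            "institution": institution_id,
            "date": bid_date,
        }
        existing_bids.setdefault(title_id, []).append(bid)
        placed.append(bid)
    return placed, exceptions


# Illustrative data only.
catalog = {"b100": "T1", "b200": "T2"}
existing = {
    "T1": [{"title_id": "T1", "kind": "Partial", "year_start": 1890,
            "year_end": 1900, "institution": "INST-A", "date": "2009-01-15"}]
}
rows = [
    ("b100", 1895, 1905, "Partial"),  # overlaps the existing 1890-1900 bid
    ("b200", 1883, 1894, "Partial"),  # no existing bid, so it is placed
    ("b999", 1900, 1910, "Full"),     # bib ID not in the master list
]
placed, exceptions = place_bids(rows, catalog, existing, "INST-B", "2009-06-01")
```

In this run only the bid on T2 is placed; the other two rows end up in the exceptions list with the reason they were rejected.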
Monographic series
A user has 3 volumes of a monographic series. Each volume has a separate title (e.g. v.1 "The relationship between shoelace length and intelligence") but is part of the Society of Irrelevancy Series on Rumors. The volume title is in the MARC 245 field and the series title is in the MARC 440 field. The title is uploaded to the deduper which checks both. The monographic title is automatically added to the database if no duplicate is found. The volumes are automatically added to the serials bid list as a bid for volumes 1-3 of the series if no bids exist already for the volumes. (Perhaps we should have an intermediate page for the serials bid that shows the uploaded 3 volumes, the series records in the bid list and the bids on those records. For example, it would show the series records for the Society of Irrelevancy Series on Rumors and indicate that volumes 5-10 have been bid. Since the user has v. 1-3, the user could click on "confirm bid" and the bid would be entered in the database.)
Inserting a new partner's bibliographic dataset
The system administrator ingests the file into the dedup tool using a MarcEdit plugin or a similar tool.
Following a database backup, the system administrator uses a configuration screen to fine-tune how title matching should occur.
The matching process runs in report mode and outputs its results in Excel format.
If satisfied with the matching, the system administrator re-runs the script in update mode. Matches are flagged with the update date so that they can be rolled back.
Note: The matching process treats the current set as set 1 (unaltered) and the new set as the comparison set (set 2) in order to preserve integrity.
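The two-pass ingest (report mode first, then update mode) could be sketched as below. This is a hedged illustration rather than a description of the actual script: `match_sets`, the record dictionaries, and the use of the OCLC number as match key are assumptions, and record IDs are assumed to be unique across both sets. The current set is never modified; update mode works on a copy and stamps each change with the update date.

```python
import copy


def match_sets(current_set, new_set, match_key, update=False, update_date=None):
    """Compare the new set (set 2) against the current set (set 1).

    Report mode (update=False) only returns the proposed matches.
    Update mode returns a merged copy in which every match and every
    added record carries the update date, so the run can be rolled back.
    """
    index = {match_key(rec): rec_id for rec_id, rec in current_set.items()}
    report = [(new_id, index.get(match_key(rec))) for new_id, rec in new_set.items()]
    if not update:
        return report, current_set

    merged = copy.deepcopy(current_set)  # set 1 itself stays unaltered
    for new_id, matched_id in report:
        if matched_id is None:
            merged[new_id] = dict(new_set[new_id], added_on=update_date)
        else:
            merged[matched_id].setdefault("matches", []).append(
                {"source_id": new_id, "matched_on": update_date}
            )
    return report, merged


# Illustrative data only.
current = {"T1": {"title": "Journal of Generica", "oclc": "1234"}}
incoming = {
    "N1": {"title": "Journal of Generica", "oclc": "1234"},
    "N2": {"title": "Bulletin of Generica", "oclc": "5678"},
}


def by_oclc(rec):
    return rec["oclc"]


report, _ = match_sets(current, incoming, by_oclc)
_, merged = match_sets(current, incoming, by_oclc, update=True, update_date="2009-06-01")
```

The report pairs N1 with T1 and leaves N2 unmatched. After the update run, `merged` holds the match on T1 and the new record N2, while `current` is unchanged.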
Functional Requirements
MARC Issues
One master list is needed to automate title matching on MARC fields: 247 (former title), 440 (series statement - title added entry), 490 (series statement), 780 (preceding title), and 785 (succeeding title) as well as the key MARC fields in use, 022 (ISSN), 020 (ISBN), 001 (OCLC number), 245 (title) and 260 (publisher).
Note: The MARC Leader changes each time a catalog record is updated. For this reason a persistent identifier is needed in order to link back to an ILS record. As the BHL Portal needs to be linked to the master list / de-duplication tool, there is an opportunity to retrieve the persistent ILS ID and resolve this problem.
Discovery
- Diacritics to be fixed by converting to UTF-8 (Note: some of this work will be done prior to migration from the current serial system)
- Improve journal metadata indexing and retrieval, for example by allowing keyword searches
- Improve filtering capability, for example by allowing users to filter search results by institution
- Improve sorting capability, for example by allowing users to sort search results by bid status
Dataset Manipulation
- The tool must be able to import datasets like those that will be used to support material ingest, for example from the California Digital Library (CDL).
- Allow for automated normalisation and merging of duplicate records post ingest (based on OCLC algorithm – further detail to be supplied)
- Ability to upload bids in bulk (.xls file format) and/or place single manual bids
Interoperability
- A mechanism must be put in place to allow the BHL Portal to communicate with the master list. In other words, there must be a way to align the volumes that have actually been scanned with the full and partial bids that have been placed on titles for scanning; this requires an ID that persists throughout the whole process between the master list and the BHL Portal.
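As a rough illustration of aligning scanned volumes with bids through a shared identifier, consider the sketch below. The `title_id` field stands in for the persistent ID; its name, the data layout, and the function name are assumptions for the example, not an existing Portal interface.

```python
def bid_fulfilment(bids, scanned_items):
    """Compare the volumes claimed in each bid with the volumes scanned.

    bids: dicts with "title_id", "institution" and a set of "volumes".
    scanned_items: dicts with "title_id" and "volume", as the Portal
    might report them (assumed shape).
    """
    scanned = {}
    for item in scanned_items:
        scanned.setdefault(item["title_id"], set()).add(item["volume"])

    status = []
    for bid in bids:
        done = bid["volumes"] & scanned.get(bid["title_id"], set())
        status.append({
            "title_id": bid["title_id"],
            "institution": bid["institution"],
            "scanned": sorted(done),
            "outstanding": sorted(bid["volumes"] - done),
        })
    return status


# Illustrative data only: a partial bid on three volumes of one title.
bids = [{"title_id": "T2", "institution": "INST-B", "volumes": {1, 2, 4}}]
scanned_items = [
    {"title_id": "T2", "volume": 1},
    {"title_id": "T2", "volume": 2},
    {"title_id": "T9", "volume": 1},
]
status = bid_fulfilment(bids, scanned_items)
```

The result shows volumes 1 and 2 as scanned and volume 4 as outstanding. The join only works because both sides carry the same `title_id`, which is the point of the persistent identifier.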
Reports
- The system shall produce inventories of the bids placed by a given institution, including the status of bids pending and fulfilled, i.e. scanned.
- The system shall produce lists of titles without bids
- The system shall generate a list of scanned items by institution.
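Two of the reports listed above could be derived along these lines. The sketch assumes each bid carries a `status` of either "pending" or "fulfilled"; that field, the other field names, and the sample data are hypothetical.

```python
from collections import defaultdict


def titles_without_bids(titles, bids):
    """Return the IDs of titles on which no bid has been placed."""
    bid_titles = {bid["title_id"] for bid in bids}
    return sorted(title_id for title_id in titles if title_id not in bid_titles)


def bid_inventory(bids):
    """Group bid title IDs by institution and by pending/fulfilled status."""
    inventory = defaultdict(lambda: {"pending": [], "fulfilled": []})
    for bid in bids:
        inventory[bid["institution"]][bid["status"]].append(bid["title_id"])
    return dict(inventory)


# Illustrative data only.
titles = {
    "T1": "Journal of Generica",
    "T2": "Bulletin of Generica",
    "T3": "Generica Journal",
}
bids = [
    {"title_id": "T1", "institution": "INST-A", "status": "fulfilled"},
    {"title_id": "T2", "institution": "INST-A", "status": "pending"},
]
unbid = titles_without_bids(titles, bids)
inventory = bid_inventory(bids)
```

With this data, T3 is the only title without a bid, and INST-A has one pending and one fulfilled bid.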
Intellectual Property Rights
Currently there is a permissions database that keeps track of the title metadata and scanning status for all copyrighted materials where BHL has received express permission to scan the titles. There is also a database under development to track the title metadata and scanning status for materials that have been requested for inclusion into the BHL Portal but have not yet been scanned. The portal requests database may also include information about "gap-fills", meaning volumes within a title run that could not be scanned by the bidding institution for whatever reason and must be addressed by another institution.
- Short-term goal: To establish communication, via universal identifiers, between the external permissions and portal requests databases and the master list, in order to highlight priority titles for scanning. A system must also be in place to announce new additions to the permissions and/or portal requests databases.
- Long-term goal: To integrate the permissions and portal requests databases with the master list to not only streamline communication about which titles need to be prioritized, but also allow institutions to place bids and post feedback about the titles.
Ingest sync notes
- The system must address the management of titles that have already been scanned by new BHL institutions or non-BHL participating institutions, such as the California Digital Library. It should be possible to automatically absorb the bibliographic data associated with these titles into the master list upon ingest into the BHL Portal. The system should allow for the synchronization of the titles ingested with the existing titles in the list so that scanning efforts are not duplicated.
Manual Synchronization of Bid List & De-duplication Tool ("Deduper")
- Would be helpful to see vols. & years of scanned materials in bid list to allow for quick checking when considering new bids
- Item level chronology & enumeration correction?
- Can we itemize the bid list at a finer granularity to clarify what is being scanned, i.e. which volumes? Communication is needed between the Portal and the bid list, for example: “The following vols. have been scanned”
Other
- Access control per site, with profiles for operator and supervisor roles that grant access to different functions according to position.