BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

Deduping Enhancements

Table of Contents

Vision
Rationale
Glossary
DRAFT Use Case Scenarios
Serial
ILS bib ID as column 1, year start (2), year end (3), Partial/Full (4)
Monographic series
Functional Requirements
MARC Issues
Discovery
Dataset Manipulation
Interoperability
Reports
Intellectual Property Rights
Ingest sync notes
Manual Synchronization of Bid List & De-duplication Tool ("Deduper")
Other

Vision

To better coordinate the scanning workflow among the libraries involved with the Biodiversity Heritage Library (BHL), avoid duplication, and allow more institutions to participate, the currently separate monograph de-duplication tool and serial bid list need to be combined into a master list of monograph and serial titles. Such a master list would allow users to work with one discrete title set at a time, i.e. a union catalog with artificially separate sections for monographs and serials. Within each section users should be able to place "bids" on titles to claim them for scanning, manually merge like titles together to form a single record, and identify duplicate volumes before scanning, hereafter referred to as "deduping". The system should be designed to absorb fresh bibliographic data from contributing institutions at different times without jeopardizing the integrity of the existing database or invalidating bids or merges previously placed.

Rationale

  1. The system shall allow users to place bids on serial titles they intend to scan.
  2. The system must automate title deduping as much as possible.
  3. The system shall identify potential monographic and serial title duplicates according to various bibliographic criteria (OCLC number and title; title and volume; etc.)
  4. The system shall predict likely matches for monographs that can be later verified manually.
  5. The system shall allow users to manually merge records for the same title where automated title merging fails.
  6. The system shall allow users to upload a list of bids for monographs and serials in .xls file format (and permit other formats as well).
  7. The system shall report exceptions in the process of batch bidding where no bid could be successfully placed on a given title as derived from the file uploaded.
  8. The system shall allow for singular deduping of monographic series by using MARC fields to trace monograph and serial titles.
  9. The system shall allow users to add single volumes or monographs to the tool, or remove them from it (e.g. remove a volume that was rejected by the scanning center).
  10. The system shall communicate with the BHL Portal so that items scanned and ingested into the Portal from non-BHL partner institutions are included and marked appropriately within the master list.
  11. The system must assimilate the work that has already been completed on the serial bid list and monograph deduplication tool as they exist in their present state. Within each of the current tools, significant progress has been made to merge like titles and clean up records.

Glossary

Bid: a claim placed on a title, or series of volumes within a title, indicating that a given institution will scan the item(s). Bids can be revisited after the process of scanning to indicate what intended claims were not fulfilled.

Bid List: Union catalog of serial titles assimilated from each of the BHL participating institutions.

Deduper: The tool used to identify potential duplicates in the monograph scanning workflow.

Deduping: Identifying potential duplicates before they are sent to be scanned.

Full Bid: a bid on the entire run of a title. For serials, this applies when the run of the title is finite and closed, the institution holds the entire run, and the institution will bid on and scan all of it.

Merging: manually merging records together that represent the exact same title - due to cataloging inconsistencies between institutions, these records could not be automatically merged.

Partial Bid: this enables the user to claim a portion of the title, and to edit and close out the bid after their scanning process is complete.

Picklist: a list of items selected from a given institution’s collection that are potential candidates for scanning.

DRAFT Use Case Scenarios

Serial

The user has selected the Journal of Generica as a potential candidate for scanning. The user opens the master deduplication tool, logs in, and filters the selection for serials. The user searches for the Journal of Generica within the list of serials by entering the full or partial title into the search box, in this case "Generica". The system returns the search results in a brief list format showing the title ID, title name, publisher/creator, abbreviated title (if any), the institution(s) where holdings exist, and the "bid" status for each record. For records with full or partial bids, the institution that placed the bid is indicated as part of the record. Also within the brief list of search results, the system allows the user to select multiple titles for merging, or deduping, and to place a full or partial bid on the record that matches their scanning selection.

The user identifies 3 records that match the Journal of Generica which, due to cataloging inconsistencies among various institutions, were not automatically merged upon inclusion into the union catalog. The user selects record 1 of 3 for further investigation and the system displays a detailed view of the bibliographic information associated with that record. The record describes volumes 6-10 of the Journal of Generica published from 1895-1905. The MARC 780 and 785 fields describe the preceding and succeeding titles, Bulletin of Generica and Generica Journal respectively.

If the system was able to automatically associate the MARC 780 and 785 fields with their respective bibliographic records in the catalog, then the user should be able to select a link to the preceding or succeeding title. (The same should be true for the MARC 247 former title field.) The system then generates a new window that displays the record corresponding to the preceding or succeeding title.

If the system was not able to automatically link the MARC 780 and/or 785 fields, then the user is able to select an option to manually search the catalog for the preceding or succeeding title. The system generates a new window allowing the user to search the catalog while maintaining access to the original record selected in the search for the Journal of Generica. The user locates records for the associated title, Bulletin of Generica, published 1883-1894.

The user places a full bid on the Journal of Generica and a partial bid on the Bulletin of Generica because vols. 3 and 5 are missing from the collection. The system allows the user to manually associate the Journal of Generica and the Bulletin of Generica through the appropriate MARC fields so that future users, in their search for one title, can see that any associated titles were "also bid upon". The user enters a note to enumerate the volumes in the Bulletin that will not be scanned and closes the window.

The user examines the additional Journal of Generica records to confirm that they match record 1 of 3. The system allows the user to select the 3 records and merge them into 1. The system also allows the user to choose which record will serve as the representative record, in this case, record 1.

Note: As an option for placing partial bids on titles with volumes that are still within copyright, the system shall automatically calculate the end date of the partial bid from the national copyright date information attached to the user's login profile.
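
One way the automatic end date could be calculated, as a rough sketch only. The parameter `copyright_cutoff_year` and the rule that volumes published in or after that year are still in copyright are assumptions made for illustration; this page does not specify how the national copyright date is stored in the login profile or how it is applied.

```python
from typing import Optional


def partial_bid_end_year(title_start: int, title_end: int,
                         copyright_cutoff_year: int) -> Optional[int]:
    """Return the last year a partial bid may cover, or None if no
    volume of the title falls before the cutoff.

    Assumption: volumes published in or after `copyright_cutoff_year`
    are treated as still in copyright for the user's country.
    """
    last_free_year = copyright_cutoff_year - 1
    if title_start > last_free_year:
        return None  # the whole run is still in copyright
    # Never extend the bid past the end of the run itself.
    return min(title_end, last_free_year)
```

Under these assumptions, a run published 1895-1950 with a cutoff year of 1923 would get a partial bid ending in 1922, while a run that ended in 1905 would keep 1905 as its end date.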

For Uploading an entire picklist into the Deduplication Tool:
A user has generated a picklist containing all of the items they wish to scan and check for duplication. This picklist contains both monographs and serials. The user goes to the deduper tool and uploads the picklist in .xls file format containing the minimum MARC fields for each item, i.e. volume (see below). The deduplication tool compares each of the items in the picklist to other bids that have already been uploaded by other users. These bids include both monographs and serials. The deduplication tool then returns a list of potential duplicates (both for monographs and serials). The list that is displayed is made up of the records from the current user's picklist. In other words, the deduplication tool displays the records that may potentially be duplicates from the picklist that was just uploaded. When a user clicks on one of the displayed records, more information displays below the record from the uploaded picklist. This information details the records from other institutions that match the user's record (essentially the records that may be the duplicate(s) to that record). The user can then see all of the potential duplicate records at one time on one page. If the user ascertains that their record is a duplicate to something that has already been scanned, they can delete their record from the uploaded picklist. Once this process is complete, the user can click "confirm", and the deduplication tool includes these records in its database to compare future uploaded picklists to. The key here is that users would be deduping monographs and serials at once, from one interface. That way, regardless of whether one institution catalogs a set of items as a monographic series, and another catalogs it as a serial, this will not make a difference when de-duping. All items will compare to all other items - monographs to monographs, serials to serials, and monographs to serials. 
The unification of the serial bid list and monograph de-duplication tool also allows users to check the scan status of a single item, or a handful of items, in one place. One stop shopping for the user!
The system loads the bids and flags each with a bid date and the institution ID linked to the login of the person who uploaded it. An exceptions report is generated for any bids that could not be placed, either because the title could not be found or because an existing bid with overlapping years (or a full bid) already exists.
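
The bid loading and exceptions report described above might look something like the following sketch. The `Bid` class, the `load_bids` function, the in-memory dictionaries, and the exception messages are all hypothetical names chosen for illustration. Treating a new full bid as conflicting with any existing bid on the same title is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class Bid:
    bib_id: str
    year_start: Optional[int]  # None for a full bid
    year_end: Optional[int]    # None for a full bid
    full: bool
    institution_id: str
    bid_date: date


def overlaps(a: Bid, b: Bid) -> bool:
    """A full bid conflicts with any other bid on the title; partial
    bids conflict when their year ranges intersect."""
    if a.full or b.full:
        return True
    return a.year_start <= b.year_end and b.year_start <= a.year_end


def load_bids(rows: Iterable[Tuple[str, Optional[int], Optional[int], bool]],
              known_titles: Set[str],
              existing: Dict[str, List[Bid]],
              institution_id: str,
              today: date) -> Tuple[List[Bid], List[Tuple[str, str]]]:
    """Place each uploaded bid or record why it could not be placed."""
    placed: List[Bid] = []
    exceptions: List[Tuple[str, str]] = []
    for bib_id, year_start, year_end, full in rows:
        if bib_id not in known_titles:
            exceptions.append((bib_id, "title not found"))
            continue
        new_bid = Bid(bib_id, year_start, year_end, full, institution_id, today)
        current = existing.setdefault(bib_id, [])
        if any(overlaps(new_bid, old) for old in current):
            exceptions.append((bib_id, "overlapping or full bid already exists"))
            continue
        # Stamp is already on the bid: institution ID and bid date.
        current.append(new_bid)
        placed.append(new_bid)
    return placed, exceptions
```

The returned `exceptions` list plays the role of the exceptions report: one entry per bid that could not be placed, with the reason.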

*Minimum fields for upload xls file:

ILS bib ID as column 1, year start (2), year end (3), Partial/Full (4)
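
As an illustration of the four-column layout, the sketch below validates uploaded rows. It assumes the spreadsheet has been exported to CSV with no header row, since reading .xls directly would need a third-party library. `UploadRow` and `parse_upload` are hypothetical names, and the validation rules (a partial bid needs a valid year range; a full bid may leave the years blank) are assumptions rather than stated requirements.

```python
import csv
import io
from typing import List, NamedTuple, Optional, Tuple


class UploadRow(NamedTuple):
    bib_id: str                # column 1: ILS bib ID
    year_start: Optional[int]  # column 2
    year_end: Optional[int]    # column 3
    full: bool                 # column 4: "Partial" or "Full"


def parse_upload(text: str) -> Tuple[List[UploadRow], List[str]]:
    """Split uploaded CSV text into valid rows and error messages."""
    rows: List[UploadRow] = []
    errors: List[str] = []
    for line_no, rec in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(rec) < 4 or not rec[0].strip():
            errors.append(f"line {line_no}: expected 4 columns with a bib ID")
            continue
        bib_id, start, end, kind = (cell.strip() for cell in rec[:4])
        kind = kind.lower()
        if kind not in ("partial", "full"):
            errors.append(f"line {line_no}: column 4 must be Partial or Full")
            continue
        try:
            y1 = int(start) if start else None
            y2 = int(end) if end else None
        except ValueError:
            errors.append(f"line {line_no}: years must be integers")
            continue
        if kind == "partial" and (y1 is None or y2 is None or y1 > y2):
            errors.append(f"line {line_no}: partial bid needs a valid year range")
            continue
        rows.append(UploadRow(bib_id, y1, y2, kind == "full"))
    return rows, errors
```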

Monographic series

A user has 3 volumes of a monographic series. Each volume has a separate title (e.g. v.1 "The relationship between shoelace length and intelligence") but is part of the Society of Irrelevancy Series on Rumors. The volume title is in the MARC 245 field and the series title is in the MARC 440 field. The title is uploaded to the deduper which checks both. The monographic title is automatically added to the database if no duplicate is found. The volumes are automatically added to the serials bid list as a bid for volumes 1-3 of the series if no bids exist already for the volumes. (Perhaps we should have an intermediate page for the serials bid that shows the uploaded 3 volumes, the series records in the bid list and the bids on those records. For example, it would show the series records for the Society of Irrelevancy Series on Rumors and indicate that volumes 5-10 have been bid. Since the user has v. 1-3, the user could click on "confirm bid" and the bid would be entered in the database.)
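
The intermediate "confirm bid" page suggested above needs to know which of the user's volumes are still free and which are already claimed. A minimal sketch follows, assuming existing bids on the series are available as a mapping from institution to volume numbers; the function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def series_bid_preview(user_volumes: Iterable[int],
                       existing_bids: Dict[str, Iterable[int]]
                       ) -> Tuple[List[int], Dict[int, str]]:
    """Return (volumes free to bid on, volumes already bid -> institution)."""
    taken: Dict[int, str] = {}
    for institution, volumes in existing_bids.items():
        for volume in volumes:
            # Keep the first institution found for each volume.
            taken.setdefault(volume, institution)
    wanted = sorted(set(user_volumes))
    available = [v for v in wanted if v not in taken]
    already_bid = {v: taken[v] for v in wanted if v in taken}
    return available, already_bid
```

In the example from the text, a user holding v. 1-3 of a series where volumes 5-10 are already bid would see all three volumes as available and could confirm the bid.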

Inserting a new partner's bibliographic dataset

System administrator ingests the file into the dedupe tool using a MarcEdit plugin tool or similar.
Following a database backup, the system administrator configures a screen that fine-tunes how title matching should occur.
The matching process runs in report mode and outputs its results in Excel format.
If satisfied with the matching, the system administrator re-runs the script in update mode. Matches are flagged with the update date so that they can be rolled back.

Note: The matching process considers the current set to be set 1 (unaltered) and the new set to be the comparison set (set 2) in order to preserve integrity.
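
The report-mode and update-mode runs, with set 1 left unaltered, could be sketched as follows. This is an illustration under stated assumptions: records are plain dictionaries with an `id` entry, the matching rule is passed in as a `key` function, and the `matched_to` and `match_date` entries are hypothetical names for the update flag. The Excel output of report mode is not shown; the function simply returns the matches.

```python
from datetime import date
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]


def match_sets(set1: List[Record], set2: List[Record],
               key: Callable[[Record], Optional[str]],
               mode: str = "report",
               run_date: Optional[date] = None) -> List[Tuple[str, str]]:
    """Compare set 2 against set 1. Set 1 is only read, never modified."""
    if mode not in ("report", "update"):
        raise ValueError("mode must be 'report' or 'update'")
    stamp = (run_date or date.today()).isoformat()
    index: Dict[str, str] = {}
    for rec in set1:
        k = key(rec)
        if k:
            index.setdefault(k, rec["id"])
    matches: List[Tuple[str, str]] = []
    for rec in set2:
        k = key(rec)
        if k and k in index:
            matches.append((index[k], rec["id"]))
            if mode == "update":
                # Flag the match with the update date for rollback.
                rec["matched_to"] = index[k]
                rec["match_date"] = stamp
    return matches


def rollback(set2: List[Record], run_date: date) -> int:
    """Undo every match flagged on the given update date."""
    stamp = run_date.isoformat()
    undone = 0
    for rec in set2:
        if rec.get("match_date") == stamp:
            del rec["matched_to"], rec["match_date"]
            undone += 1
    return undone
```

Running in report mode first leaves both sets untouched, so the administrator can inspect the matches before committing them.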

Functional Requirements

MARC Issues

One master list is needed to automate title matching on MARC fields: 247 (former title), 440 (series statement - title added entry), 490 (series statement), 780 (preceding title), and 785 (succeeding title) as well as the key MARC fields in use, 022 (ISSN), 020 (ISBN), 001 (OCLC number), 245 (title) and 260 (publisher).
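
As a rough illustration of automated matching on the key fields, the sketch below builds match keys from 001, 020, 022, and 245 and reports pairs of records that share any key. Records are assumed to be simple dictionaries from MARC tag to field text, which is a simplification of real MARC data (no subfields or repeated fields). The normalization rules are assumptions, and the linking fields 247, 440, 490, 780, and 785 are not covered here.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]  # hypothetical shape: MARC tag -> field text


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def match_keys(rec: Record) -> Set[Tuple[str, str]]:
    """Build comparable keys from identifier fields and the title."""
    keys: Set[Tuple[str, str]] = set()
    for tag in ("001", "020", "022"):
        # Keep digits (and the X check character) so that prefixes
        # and hyphens do not prevent a match.
        value = re.sub(r"[^0-9Xx]", "", rec.get(tag, "")).upper()
        if value:
            keys.add((tag, value))
    title = normalize_title(rec.get("245", ""))
    if title:
        keys.add(("245", title))
    return keys


def candidate_duplicates(records: Dict[str, Record]) -> Set[Tuple[str, str]]:
    """Return pairs of record IDs that share at least one match key."""
    by_key: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec_id, rec in records.items():
        for k in match_keys(rec):
            by_key[k].append(rec_id)
    pairs: Set[Tuple[str, str]] = set()
    for ids in by_key.values():
        for i, first in enumerate(ids):
            for second in ids[i + 1:]:
                pairs.add(tuple(sorted((first, second))))
    return pairs
```

Pairs found this way are only candidates; matches on title alone in particular would still need the manual verification described in the requirements.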

Note: The MARC Leader changes each time a catalog record is updated. For this reason a persistent identifier is needed in order to link back to an ILS record. As the BHL Portal needs to be linked to the master list / de-duplication tool, there is an opportunity to retrieve the persistent ILS ID and resolve this problem.

Discovery


Dataset Manipulation


Interoperability


Reports


Intellectual Property Rights

Currently there is a permissions database that keeps track of the title metadata and scanning status for all copyrighted materials where BHL has received express permission to scan the titles. There is also a database under development to track the title metadata and scanning status for materials that have been requested for inclusion in the BHL Portal but have not yet been scanned. The portal requests database may also include information about "gap-fills", meaning volumes within a title run that could not be scanned by the bidding institution for whatever reason and must be addressed by another institution.


Ingest sync notes


Manual Synchronization of Bid List & De-duplication Tool ("Deduper")


Other