This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018.

Ingest Analysis Summary

DRAFT
Please use four tildes (~~~~) if you would like to show your name and time stamp on your edits, e.g. lipscombb Sep 17, 2009

Table of Contents

Introduction
Overview of step 2
Analysis of ingest sample
Issues for consideration
Conclusion

Introduction

The following summary provides an overview of the process of “ingest” from the Internet Archive and outlines issues for consideration for the purposes of deciding how to proceed with the inclusion of content by non-member BHL libraries. As an open access repository, the corpus of the Internet Archive exists as an immense opportunity for the BHL to acquire content that is immediately interoperable with the portal and free of charge. The term “ingest” refers to the process of:
  1. Acquiring records from the Internet Archive via an application that downloads EVERY text item identifier and its associated MARC file. (Note: Of the entire IA dataset, approx. 800,000 items do not have MARC records).
  2. Running a complex query (database stored procedure) that selects from the downloaded information the identifiers that meet the established criteria to narrow the corpus of IA records to biodiversity related content.
  3. Pushing the selected pool of IA records to the BHL portal. This is effectively a modification of the “normal” IA ingest application, which brings into the portal the items that are assigned to the <biodiversity> collection during the digitization phase.
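
The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the dictionary representation of a MARC record, and the in-memory "portal" list are all hypothetical and do not describe the actual ingest application or stored procedure.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical representation: MARC tag -> list of field values.
MarcRecord = Dict[str, List[str]]


def acquire(ia_items: Dict[str, Optional[MarcRecord]]) -> Dict[str, Optional[MarcRecord]]:
    """Step 1: take every text item identifier with its MARC record, if it has one."""
    return dict(ia_items)


def select(records: Dict[str, Optional[MarcRecord]],
           is_relevant: Callable[[MarcRecord], bool]) -> List[str]:
    """Step 2: keep the identifiers whose MARC record meets the criteria.

    Assumption: items without a MARC record cannot be matched and are skipped.
    """
    return [ident for ident, marc in records.items()
            if marc is not None and is_relevant(marc)]


def push(identifiers: List[str], portal: List[str]) -> int:
    """Step 3: add the selected identifiers to the (stand-in) portal list."""
    portal.extend(identifiers)
    return len(identifiers)


# Made-up example data; "item-b" stands for an item with no MARC record.
ia_items = {
    "item-a": {"650a": ["Botany"]},
    "item-b": None,
    "item-c": {"650a": ["Law"]},
}
portal: List[str] = []
selected = select(acquire(ia_items), lambda marc: "Botany" in marc.get("650a", []))
pushed = push(selected, portal)
print(selected, pushed)
```

Passing the relevance test in as a function keeps step 2 replaceable, which matches the expectation that the selection criteria will be adjusted over time.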

This process has taken place for the BHL portal’s beta site http://beta.biodiversitylibrary.org/. Analysis was performed on a small sample of 240 items contributed by the California Digital Library (CDL), one of many IA contributors, to assess content relevance and the quality of the metadata and images. The proposed plan, with the approval of the IC, is to move the ingested content from the beta site to the production site and to run regular monthly ingests, as is currently done for the inclusion of newly scanned content into the BHL portal.

Overview of step 2

Suzanne Pilsk and Bianca Lipscomb worked closely with the Technical Development team to establish the first phase of criteria selected for step 2 of the ingest process. The methodological approach to identifying biodiversity related content in the MARC files of the entire IA corpus involved an analysis of existing BHL subject headings and classification numbers. By matching IA records to agreed-upon ‘650 a’ subject headings and germane classification schedules, and by eliminating irrelevant form and topic subdivisions, the IA ingest pool was reduced to a total of 28,793 items. Please see the attached documentation for details:

Limitations with the current methodology include:
  1. The difficulty of assessing item content by subject headings alone. It must be noted that subjects were matched regardless of underlying controlled vocabulary such as LCSH, MESH, etc.
  2. As every library may potentially assign call nos. differently, matching against call nos. is decidedly less reliable than matching against subject headings. However, it was discovered that 322,716 records from the total IA corpus do not have ‘650 a’ subject headings (144,540 of the 322,716 have no subject headings at all). It is therefore necessary to run matches against call nos. to tap into this pool of items.
  3. Only LC class nos. were used. The ingest query in the future may be adapted to accommodate matches against Dewey numbers for example.
  4. The subject list established for the initial ingest query reflects the majority of ‘650 a’ subject headings that are currently in the BHL. Some of these headings may not, in fact, be headings that we would proactively choose to pursue in matching against IA items. It is recommended that the “agreed upon” list of subject headings undergo further review.

The methodology for ingest based on subject heading and call no. analysis is certainly more art than science. The ability to cull irrelevant content from the ingest pool is only as good as the metadata associated with the items. It is therefore easier to include than it is to omit. Moving forward, we have the opportunity to tweak the subject and call no. lists used for step 2 of the ingest query.
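
The selection logic described in this section can be sketched as a single predicate. This is a sketch under loud assumptions, not the actual stored procedure: the three lists below are made-up stand-ins for the agreed-upon lists in the attached documentation, and the rule for combining the criteria (an excluded subdivision rejects a record; otherwise a subject heading match or an LC class match accepts it) is one plausible reading of the text.

```python
import re
from typing import Iterable, Optional

# Hypothetical stand-ins: the real agreed-upon lists are in the attached
# documentation and are not reproduced here.
SUBJECT_HEADINGS = {"botany", "zoology", "birds"}
LC_CLASSES = {"QH", "QK", "QL"}
EXCLUDED_SUBDIVISIONS = {"fiction", "juvenile literature"}


def normalize(term: str) -> str:
    """Lower-case a heading and drop surrounding spaces and trailing periods."""
    return term.strip().rstrip(".").strip().lower()


def lc_class(call_no: Optional[str]) -> Optional[str]:
    """Return the leading class letters of an LC call no., e.g. 'QL' from 'QL737 .C2'."""
    if not call_no:
        return None
    match = re.match(r"[A-Z]{1,3}", call_no.strip().upper())
    return match.group(0) if match else None


def is_biodiversity_related(subjects_650a: Iterable[str],
                            subdivisions: Iterable[str],
                            call_no: Optional[str]) -> bool:
    """Apply the assumed step 2 rule to one record's extracted MARC data."""
    # Irrelevant form and topic subdivisions rule a record out.
    if any(normalize(s) in EXCLUDED_SUBDIVISIONS for s in subdivisions):
        return False
    # Subjects are matched regardless of the underlying controlled vocabulary.
    if any(normalize(s) in SUBJECT_HEADINGS for s in subjects_650a):
        return True
    # Call no. matching reaches records that have no '650 a' headings.
    return lc_class(call_no) in LC_CLASSES


print(is_biodiversity_related(["Botany."], [], None))
print(is_biodiversity_related([], [], "QK98 .B7"))
print(is_biodiversity_related(["Birds"], ["Fiction"], "QL676"))
```

Because a record is accepted on any single match, a rule of this shape includes more readily than it omits, which is the behavior the paragraph above describes.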

Analysis of ingest sample


Total items selected from IA pool = 28,793
Total items in sample = 240 Note: items selected from CDL only; CDL collection at 10,055 items

Sample assessment:

|  | Relevance | Duplicates | Image Quality | Metadata Quality | Post-1923 | Potential gap-fill |
|---|---|---|---|---|---|---|
| High/Yes | 153 (64%) | 53 (22%) | 178 (75%) | 154 (65%) | 0 | 23 (10%) |
| Low/No | 48 (20%) | 178 (74%) | 17 (7%) | 10 (4%) | 236 | 181 (83%) |
| Medium/Maybe | 39 (16%) | 9 (4%) | 43 (18%) | 74 (31%) | 3 | 14 (7%) |
| No. of items NOT assessed* | 0 | 0 | 2 | 2 | 1 | 22 |
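
The counts in the assessment above can be cross-checked with a few lines of Python. The sketch assumes that each percentage is a share of the items actually assessed in that column (that is, excluding the "NOT assessed" count); this is inferred from the numbers, not stated in the summary. The dictionary layout and function names are illustrative only.

```python
SAMPLE_SIZE = 240

# Counts copied from the sample assessment: (high/yes, low/no, medium/maybe, not assessed).
COUNTS = {
    "Relevance": (153, 48, 39, 0),
    "Duplicates": (53, 178, 9, 0),
    "Image Quality": (178, 17, 43, 2),
    "Metadata Quality": (154, 10, 74, 2),
    "Post-1923": (0, 236, 3, 1),
    "Potential gap-fill": (23, 181, 14, 22),
}


def column_total(column: str) -> int:
    """Sum all four counts in a column; each should equal the sample size."""
    return sum(COUNTS[column])


def assessed_shares(column: str) -> tuple:
    """Return (high, low, medium) as whole-number percentages of assessed items."""
    high, low, medium, _not_assessed = COUNTS[column]
    assessed = high + low + medium
    return tuple(round(100 * n / assessed) for n in (high, low, medium))


for name in COUNTS:
    print(name, column_total(name), assessed_shares(name))
```

Every column sums to 240, and under this reading the Relevance, Duplicates, Image Quality, and Metadata Quality percentages are reproduced exactly. The Potential gap-fill column comes out at about 11%, 83%, and 6%, a point away from the reported 10% and 7%, so those two figures may have been rounded differently.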

Limitations: This was a very small sample, only 0.8% of the 28,793 items ingested. The decision to examine only the CDL items closely was a direct result of the fact that CDL comprises a significant portion of the ingested items, 35%. Finally, the assessments based on the high/low, yes/no scale were largely subjective. In assessing relevance, for example, there was much discussion among the reviewers as to what kinds of content should be considered irrelevant with respect to biodiversity subject matter. Comments were collected in addition to the data described above.

Reviewers: Bianca Lipscomb, Suzanne Pilsk, Keri Thompson, Grace Duke, Erin Thomas, Diane Rielinger, Matt Person, Don Wheeler, and Lisa Studier. Please note that all reviewers volunteered for this task.
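
The proportions cited in the limitations above follow directly from the counts reported in this summary. The short check below uses only those three figures; the variable names are illustrative.

```python
TOTAL_INGESTED = 28_793   # items selected from the IA pool
SAMPLE_SIZE = 240         # items reviewed, all from CDL
CDL_ITEMS = 10_055        # CDL items within the ingested pool

sample_share = 100 * SAMPLE_SIZE / TOTAL_INGESTED   # sample as % of all ingested items
cdl_share = 100 * CDL_ITEMS / TOTAL_INGESTED        # CDL as % of all ingested items
sample_of_cdl = 100 * SAMPLE_SIZE / CDL_ITEMS       # sample as % of the CDL items

print(f"{sample_share:.1f}%")   # 0.8%
print(f"{cdl_share:.0f}%")      # 35%
print(f"{sample_of_cdl:.1f}%")  # 2.4%
```

The last figure is not quoted in the summary, but it follows from the same counts: the sample covers roughly 2.4% of the CDL items it was drawn from.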

Issues for consideration

Follow the link to add your comments, suggestions, etc. Thank you!

Conclusion

The ingest of content in IA contributed by non-member BHL libraries presents both opportunities and challenges for the overall goal, or mission, of BHL and its identity as a digital library. To date, BHL has existed as a more or less vetted collection of natural history materials. Content included via the IA ingest is vetted only as well as the subject and call no. criteria executed in step 2 of the ingest process allow. As explained, this process is not perfect, but for all the ingested content that may be considered “irrelevant” there is a wealth of relevant, freely available content that many of our users will find useful.

As a result of the work completed thus far on the IA ingest task, it is highly recommended that the BHL more clearly document its role as a digital library, identify its core user groups and define its scope in a way that maximizes its potential to reach the widest possible audience with the most extensive biodiversity collection possible.

- chrisfreeland, Oct 28, 2009: XLS files of ingested titles in a form suitable for the Monographic Deduping Tool: IAIngest.zip