Ingest Issues for Consideration
See the BHL Beta site for the ingest test
Please use 4 "~" (tildes) in a row to create a signature for comments
Consideration: Duplication of content in BHL
Benefit: Duplicates in the collection provide more opportunities for users to find what they need -- more title/author metadata to search on; marginalia OK; different editions.
Problem: Users frustrated by too many duplicates of the same title; how many duplicates are too many?
- Oct 8, 2009: I also see rampant duplication as a potential concern for funding agencies.
Solution:
- Oct 1, 2009: Option: Continue to de-dup against our brothers and sisters in BHL. Let duplication from ingest happen.
Comments:
- Oct 1, 2009: Right now, things are in IA that we are not deduping against. Moving into BHL will not be that different, except titles will be put with ours.
- Oct 1, 2009: The benefits of the new content outweigh the duplication.
- Oct 1, 2009: A new search interface could potentially address users' concerns about seeing multiple copies.
Consideration: Dupes in scanning workflow
Benefit: (none identified)
Problem:
- We will duplicate scanning of content already provided for free.
- Current deduplication tools cannot support deduping against ingested content.
Solution:
- Oct 1, 2009: Continue to dedup against ourselves for our workflow.
Comments:
- Oct 1, 2009: We don't spend funders' money on duplicating our own stuff. We know our tools need to be improved, but this is a big problem that will not be solved in a timely fashion.

Solution:
- Oct 1, 2009: Stop deduping completely.
Comments:
- Oct 1, 2009: Not my first choice, but maybe go back to some more general way of "bidding" on topics.
Questions/Comments
- Oct 1, 2009: Chris, I believe that this issue over duplication and de-duplication that Keri and John are talking about has to do with the BHL libraries selecting materials from our own collections for scanning into BHL. There has to be a mechanism for us to identify materials from outside BHL libraries that have been ingested into BHL so that we don't scan the same thing again, using up the funding allocations without advancing the content. The current tools available to BHL libraries to assist with this work will not be useful for this purpose, making the identification of the non-BHL titles extremely cumbersome and time consuming: essentially a title-for-title search of the BHL portal for any packing list sent to IA.
- - Oct 1, 2009: The only way we can do *anything* with these books & get them into tools, etc., is to bring them into BHL. The duplication already exists; these books have been scanned & are sitting on IA servers. We can either choose to bring them in & deal with them, or ignore them & continue to duplicate scanning effort & $.
- - Sep 29, 2009: I just want to point out that this is the deal-breaker for the ingest right now. We, SIL, can't keep scanning once the ingest takes place unless there is a robust deduplication tool available that will let us dedup against the portal, or the beta portal. If none exists, we will be wasting time, money, or both.
- - Sep 29, 2009: I agree. The absolutely vital first step of this procedure is that the deduping mechanisms need a thorough, if not complete, overhaul. After the enormous amount of time spent hand-wringing over the need to prevent duplicate material, and given that the existing mechanisms for deduplication are not particularly robust (to use Keri's well-chosen word), to import this much material with the idea that we'll "dedupe it later" could lead to problems down the road.
- - Sep 29, 2009: Plus, on further reflection, I have to wonder about the issues of just getting this material into the deduping tools to begin with - wouldn't it have to be separated into monographic and serial files, as the current tools are? Would those loads then be deduped against the current tools? How would that work with the current serials tool bidding model?
- - Sep 30, 2009: To comment on John's question above about serials bidding (in Bernard's absence): if a large ingestion occurred without the serials content going into the Serials Mashup, then when selecting titles to bid upon, the librarian would need to consult the bid list and the portal, or a list of ingested serials, to make as informed a bidding decision as one could make. I am not sure if already-scanned, to-be-ingested content can automatically be created as "bids" in the Mashup; perhaps a manual bidding process would need to take place to bring the Mashup up to date. - Sep 30, 2009: (Bianca's suggestion directly below does document our discussion of automated serials bidding and the ingestion on 4/24/09.)
- - Sep 30, 2009: Revisit the dedupe group discussion for more details.
- - Oct 2, 2009: The materials we are going to ingest are already scanned. For us to avoid duplication, we need them in BHL for any sort of deduplication process. John's point above about "down the road" means we lack any means NOW to avoid scanning these already-scanned texts.
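The "title-for-title search" described in the comments above could, in principle, be partly automated. Below is a minimal sketch of one way to flag likely duplicates between a packing list and a list of portal titles using fuzzy string matching; the function names, sample titles, and similarity threshold are all illustrative assumptions, not any actual BHL tool.

```python
import difflib
import re

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def find_likely_duplicates(packing_list, portal_titles, threshold=0.8):
    """Return (packing title, portal title) pairs whose normalized
    similarity ratio meets the threshold."""
    matches = []
    for candidate in packing_list:
        for held in portal_titles:
            ratio = difflib.SequenceMatcher(
                None, normalize_title(candidate), normalize_title(held)
            ).ratio()
            if ratio >= threshold:
                matches.append((candidate, held))
    return matches

portal = ["The Birds of Australia", "Flora of British India"]
incoming = ["Birds of Australia, The", "Fauna of New Zealand"]
print(find_likely_duplicates(incoming, portal))
```

A real workflow would also need to compare author, date, and edition fields before declaring a true duplicate; title similarity alone would flag different editions as the same work.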
Short Term Solution
- - Sep 30, 2009: Potential solutions emerging here: see BHL-E developments for WP2+ from the recently held meeting in Berlin.
Long Term Solution
- - Sep 18, 2009: This potentially should morph into title control as well. Though not real FRBR, I believe duplication, editions, etc. will need to be grouped together to help our users.
Consideration: Contributors
Benefit: New pool of contributing members. BHL strengthens its position as a one-stop shop for biodiversity content online.
Problem: How to display contributor information? The current alpha display clutters the contributor list, obscuring BHL participating members.
Solution:
- Oct 1, 2009: Move the drop-down to an "advanced" search.
Comments:
- Oct 1, 2009: Does this have to wait until we have a new interface? Can it be done sooner rather than later?
Questions/Comments
- - Sep 29, 2009: Not necessarily a solution, but this is an interesting visual representation of many libraries contributing to a single library: http://www.wdl.org/en/browse/institution.html
- - Sep 30, 2009: As an interface for browsing by contributor this is a cool idea. But as we know, a very significant amount of BHL searching is known-item searching: someone has a citation for a text and wants it. In that case, imposing a contributor display between a user and the source content is not good.
- - Oct 1, 2009: I agree this is an interesting idea and maybe there is a place for it, although not as the first thing a user sees -- maybe available if requested in an advanced search. What do the European and Chinese colleagues think? Is this an interesting or redundant capability?
- - Oct 2, 2009: Not to be a wet blanket, but I found that interface nearly useless as anything but a clever web designer exercise. Few affordances, very little idea of what I was looking at, and I find tooltips very problematic as an interface feature.
Short Term Solution
- - Sep 18, 2009: I suggest we just have some generic "other" category and mark anything we get from anywhere else like that. If IA wants "credit", then we use them as the source in the drop-down. If we get content from other scanning vendors/aggregators in the future (Europeana?), they can have "credit".
- - Sep 29, 2009: Group contributors with BHL participating members on top and the contributing members below.
- - Sep 29, 2009: Order by the contributor with the most titles or items.
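The two ordering suggestions above could be combined into a single sort key: membership first, then item count. A minimal sketch, assuming each contributor record carries a membership flag and an item count (the names and counts below are made up for illustration):

```python
# Hypothetical contributor records: (name, is_bhl_member, item_count).
contributors = [
    ("California Digital Library", False, 240),
    ("Smithsonian Libraries", True, 5200),
    ("Internet Archive (other)", False, 900),
    ("Missouri Botanical Garden", True, 4100),
]

# Sort BHL participating members first, then by item count, descending.
# `not is_member` puts members (False sorts before True) at the top.
ordered = sorted(contributors, key=lambda c: (not c[1], -c[2]))

for name, is_member, count in ordered:
    label = "member" if is_member else "other "
    print(f"{label}  {count:>5}  {name}")
```

The same key could drive the drop-down rendering, so members are always listed above ingest-only contributors regardless of volume.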
Long Term Solution
- [MK] Move the contributing library drop-down box into the advanced search options and eliminate it from home page
- - Sep 29, 2009: I also think that the contributor drop-down was much more for our own benefit and egoboo than the users'. I certainly don't think that it's essential on the home page; it should move to an advanced search page, which would eliminate this problem entirely.
Consideration: Metadata Quality
Benefit: Addition of good quality metadata to the collection increases access for users.
Problem: Poor quality metadata will further complicate existing metadata synchrony issues.
Solution:
- Oct 1, 2009: Ingest. Don't let this concern stop the work. Encourage Gemini reporting of metadata concerns.
Comments:
- Oct 1, 2009: Establish the workload of metadata clean-up as Gemini reports are created.
Questions/Comments
- - Sep 29, 2009: Do we know if non-BHL contributors (e.g. California Digital Library) have worse metadata than BHL libraries?
- - Sep 30, 2009: Analysis of the small sample of CDL items (240) showed that 65% were considered as having high quality metadata. I would venture to say that this is no better or worse than our own, but what this shows (to me) is that there is an even greater need to institute metadata oversight, i.e. make resources available to address the 35% of metadata that invariably results in frustration for our users. Poor metadata = poor access ==> not so "library"-like.
- - Sep 29, 2009: Our definition of metadata problems != users' definition of metadata problems. E.g., monographs cataloged separately aren't bad duplicates, just different. However, we do guess that most [CDL et al.] libraries won't necessarily have volume and issue info for their scans. Um, I think?
- - Sep 29, 2009: The concerns which have been expressed regarding the ingestion timetable relative to metadata and other issues are impressive, as I think these concerns demonstrate that BHL Librarians are in it for the "long haul". I am reminded that commercial publishers have entire departments devoted to metadata and imaging correction issues for their scanned content. That which cannot be corrected by long-term or short-term technological solutions (as we have seen in manual portal editing) will need to be addressed as a fact of any such project, over time, with the positive caveat that BHL Librarians are committed to getting it as correct as possible.
- - Oct 2, 2009: Re: Keri's question about volume and issue information for serials, as well as dates for these. Do we know from our test sample if any of that information is present and, if so, in the correct fields to ingest properly into BHL? (I looked at just monographs.) If we are going to get a lot of material added with "no volume information", then it would be good to have a way of identifying multi-volume ingested materials and making the clean-up of that metadata a high priority for the group.
Short Term Solution
- - Sep 29, 2009: We have a responsibility to manage ingested content, whether good or bad, so we need to fix it via portal editing.
Long Term Solution
- - Sep 29, 2009: Crowdsource corrections to metadata issues.
- - Sep 29, 2009: Implement algorithmic solutions for metadata clean-up.
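As one concrete example of an algorithmic clean-up, free-text volume statements (relevant to the "no volume information" concern raised in this section) could be parsed into a structured field. A minimal sketch; the patterns and field handling are assumptions for illustration, not BHL's actual schema or tooling:

```python
import re

# Patterns commonly seen in volume statements, e.g. "v.3", "Vol. 12", "Bd. 2".
VOLUME_PATTERN = re.compile(r"\b(?:v|vol|bd)\.?\s*(\d+)", re.IGNORECASE)

def extract_volume(statement):
    """Return the volume number as an int, or None if nothing parseable."""
    if not statement:
        return None
    match = VOLUME_PATTERN.search(statement)
    return int(match.group(1)) if match else None

for raw in ["v.3 (1887)", "Vol. 12", "Bd.2, Heft 1", "no volume information"]:
    print(raw, "->", extract_volume(raw))
```

Statements that fail to parse could be routed to the manual portal-editing queue, so algorithmic clean-up handles the easy majority and staff time goes to the genuinely ambiguous records.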
Consideration: Image Quality
Benefit: Addition of new pages increases the BHL collection and gets us closer to our page count goals. More content available for the name services we provide and for data mining opportunities.
Problem: Ingested scans of poor quality will only frustrate our users; bad scans from non-BHL member libraries will be difficult, if not impossible, to "fix".
Solution:
- Oct 1, 2009: Ingest and encourage Gemini reporting of missing pages, bad scans, etc.
Comments:
- Oct 1, 2009: Gemini will help us determine the workload of dealing with these issues as reported. Decisions can be made to take down an entire title, redirect users to ILL services, or offer a potential new "digital" ILL service that BHL provides. Hard to throw stones at ingest quality when few titles supplied by BHLers are being quality reviewed.
Questions/Comments
- - Sep 29, 2009: I have repeatedly asked for examples of poor quality scans from these other libraries, but haven't received anything. Do they really exist, or are we worried about something that isn't even a problem?
- This relates back to QA issues. SIL has encountered carts that fail QA on multiple occasions, which leads me to believe that other libraries must experience the same phenomenon. Agreed, IA scanning cannot be 100% perfect -- that's impossible -- but at least with BHL member contributed content we have the opportunity to send the book back to IA for rescanning should we discover pages missing, dark or light pages that fail to render OCR, blurred plates, skewed pages, tissue covering pages, etc. On the scanning workflow / portal management end, we need to know what to do when a user reports an image quality error with ingested content. Here are some examples:
- http://beta.biodiversitylibrary.org/bibliography/19481 - many pages are quite skewed, and starting at page 49 the right-hand side of the pages is all very, very dark, while the left-hand side is fine.
- http://beta.biodiversitylibrary.org/bibliography/21516 - missing pages, incl. title page
- In the analysis, there seemed to be lots of comments noticing differences between the flip book view and the portal view, as in: http://beta.biodiversitylibrary.org/bibliography/19407 - To begin with, when you go to the alternate page viewer for this, all of the pages are reversed, meaning that what are supposed to be the right-hand pages are actually the left-hand pages. Secondly, the margins in this particular volume are so tight that text is obscured. Thirdly, the quality of the scans themselves is low - as though from a photocopy. This is interesting, however, because the scans in the actual portal are completely different - it doesn't even seem like they are the same files - the margins are good and the image quality is good with those files. I'm assuming, though, that we want all derivatives of these files to be good, so I maintain that the image quality is poor.
- http://beta.biodiversitylibrary.org/item/60265#1 - page 48, post-it note in image (- Oct 1, 2009)
- - Sep 30, 2009: Concerning the definition of the problem, I would modify the wording to say "frustrate some users." For other users, a scan with some missing pages is still a text that they don't have at all. Otherwise why would they be searching for it? 80% is not 100%, but it is far better than 0% if none of a text is available at all.
- - Sep 30, 2009: Except when the missing page(s) is exactly what the user is looking for. As the librarian in charge of acquisitions for our collection, if an item is incomplete (i.e. missing some page(s), illustration(s), chart(s), index(es)), then we do not accession the item. If the user doesn't find what they need, there is a library somewhere that has the complete document, as published, as the author intended, and as the editors reviewed and approved. (Is there an editor anywhere with any self-respect who would approve of a work going public that was incomplete?) Traditional libraries have established relationships for interlibrary loan to provide users with materials that they do not themselves hold. Tom's argument undercuts all of the work that the original BHL libraries have done regarding the quality and review of our own input. It also undercuts the emphasis that he placed on our own Q/A earlier this year. If users come to BHL materials and find incomplete or otherwise unusable materials more than once or twice, they will begin to lose confidence in the product. If we (BHL) have to go back and replace something that we did not scan initially, we end up doing the work and spending money anyway. As to Chris's request for examples of poor scans from other than BHL libraries: our sample test set of the UCDL was only 2.4 percent of the total. This is not a good representation. If we had more time to review the materials (say, the 12+ months that the Tech team has been working on this project), then perhaps we might have more examples, or at least be able to state confidently that the issue is not significant. I would like to see another 'issue' category set up for consideration: the amount of time / effort that the BHL library staffs will be putting into the correction of problems occurring from an ingest of this size, not just for image quality, but for metadata / search problems as well. There are issues already accruing from our own input, some of them being a bit complex.
- - Oct 2, 2009: Concerning Don's comments: most acquisitions librarians have a limited budget that constrains what they can and cannot purchase. The scanning for these volumes has already been paid for by others. I fail to see how acquiring these works "undercuts" all the valuable work we have done on quality. As Don points out, the BHL Portal even now has several quality problems with metadata, and I see creative examination of both bottom-up and algorithmic solutions being discussed by BHL staff. We receive emails right now about the quality of metadata. I view this as good: users are pointing out errors and we can use their knowledge to help us.
- - Oct 1, 2009: It has been made clear that funders expect more content posted than has been posted, so it makes sense that there is an argument for mass ingestion of available content. It also makes sense that there are arguments for QA of content as a prerequisite for ingestion. Having such discussions will serve to further distinguish this library from other mass digitization projects. I wonder if it would be possible, as has been done already, to download content intended for ingestion to the beta site, then make the beta site live, and then, over the period of time it takes, have BHL QA Librarians perform QA on titles. As this is done, titles would be moved from the live beta site to the live portal. This would take time, all content would be live, and, just as it is for BHL scanned content, live content with errors is posted until errors are discovered and resolved. I stress that the above is just an idea.
- - Oct 2, 2009: Does anyone have a sense of whether other institutions are doing QA or correcting problems? We could approach error corrections with a step-wise process. 1) Alert the contributing library and see if they are willing to send the book back for correction. If not, 2) ILL the book from the contributing library and have a BHL member send it for rescanning, especially if IA won't charge us to fix it (we would have to work out details with IA and the original library). If no ILL, 3) if a BHL member has the book, are we willing to make a "Frankenbook" by just using our copy to fix the problem, since IA can do page insertions/replacements now (and hopefully wouldn't charge us just to fix the problem)? There are lots of issues surrounding a "Frankenbook" that we have discussed already, but now this would involve changing the digital representation of a book from a non-BHL library. So should "Frankenbooks" even be a consideration? Finally, 4) if no BHL member has the book, either unpublish it or find a way to put a prominent user note with the book alerting them to the problem.
Short Term Solution
- - Sep 29, 2009: Permanently unpublish the content from the portal.
- - Sep 29, 2009: If the problems are found by a user, will that user be a little upset when we unpublish the entire book? We really need to tell folks that we can't magically insert pages into books "we" don't have. Expectation management.
- - Sep 29, 2009: Temporarily unpublish and re-scan if we have the book.
- - Sep 29, 2009: Contact the scanning institution to notify them of the problem and investigate possible solutions.
- - Sep 29, 2009: Contact IA to let them know; what do they have in place to address this, if anything?
- - Sep 29, 2009: I believe IA has nothing particular in place for this. They can scan a book. The book is scanned, or the book is rejected. Not a lot of post-processing.
Long Term Solution
- - Sep 29, 2009: Set up an ILL procedure with the scanning institution to re-scan materials that we don't own at any of our member libraries.
Consideration: Resource Allocation
Benefit: Ingested content is free and allows BHL to grow without spending $$ on digitization.
Problem: Staff resources need to be reassessed for managing the considerations listed above, namely deduplication, image quality (QA), and metadata quality (portal editing).
Solution:
- Oct 1, 2009: Doesn't seem to be a real issue.
Comments:
- Oct 1, 2009: Use Gemini tracking of reports to establish true workload issues.
Questions/Comments
- - Oct 2, 2009: We all have to-do lists a mile long for our own materials and our own projects. If we have to dedupe against the ingested materials without robust deduping tools, that will only add to our workloads. I think we need to clearly decide if we are going to simply ingest the materials and not touch them unless a problem is reported through Gemini, or whether we feel that some specific projects with the ingested books, such as merging serial titles, need to be accomplished.
Short Term Solution
Long Term Solution
Consideration: Scope
Benefit: BHL expanded to include a wide range of fields that relate to biodiversity; expansion of the user base.
Problem: Irrelevant content cluttering the collection.
Solution:
- Oct 1, 2009: Possibly not a real problem, given the qualifications imposed on ingest.
Comments:
- Oct 1, 2009: Search and display may alleviate some of these issues. The Smithsonian is used to having an ILS that covers a wide range of topics. To my knowledge, no user has complained that an "out of scope" record appeared when searching for one topic.
Questions/Comments
Short Term Solution
Long Term Solution
- - Sep 29, 2009: Implement collection-level metadata oversight to help users access targeted sub-collections within BHL.
Consideration: Search/Browse
Benefit: Known-item searches are conducted anyway; users benefit from BHL as is; improvements to come with the development of CiteBank.
Problem: More content = decreased precision & recall.
Questions/Comments
- - Sep 18, 2009: We haven't discussed relevancy much, which I am a little worried about: relevancy ranking for the returned results of a search.
- - Sep 29, 2009: Are we talking about relevance in the regular library way, or relevance specifically as it applies to species descriptions, etc.? Because I would imagine the latter to be a total nightmare.
Short Term Solution
Long Term Solution
Consideration: Controlled Vocabulary
Benefit: Non-issue: no controlled vocabulary at present in BHL.
Problem: Lack of synchrony with author/title lists makes it difficult for users to find related items.
Questions/Comments
Short Term Solution
Long Term Solution
- - Sep 18, 2009: I believe we will soon see a need for personal name control. This is a lesson libraries have learned over the years with large amounts of data: we need to group people together. I do not think the library's solution of having a "right" form of the name is necessarily the way to go, as much as having a way to group "name strings" of the same "identity" together so that our users find what they need.
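Grouping "name strings" rather than choosing one authoritative form could start from a simple normalized key. A minimal sketch, assuming names arrive as "Surname, Forename(s)" strings; the choice of surname plus first initial as the key is an assumption for illustration and would miss many real-world variants (married names, transliterations, etc.):

```python
from collections import defaultdict
import re

def name_key(name_string):
    """Crude grouping key: lowercase surname plus first initial of forenames."""
    surname, _, forenames = name_string.partition(",")
    forenames = re.sub(r"[^A-Za-z ]", "", forenames).strip()
    initial = forenames[:1].lower() if forenames else ""
    return (surname.strip().lower(), initial)

def group_identities(name_strings):
    """Bucket raw name strings that likely refer to the same identity."""
    buckets = defaultdict(list)
    for name in name_strings:
        buckets[name_key(name)].append(name)
    return dict(buckets)

names = ["Gray, John Edward", "Gray, J. E.", "Gray, Asa"]
groups = group_identities(names)
```

This keeps every original string intact while letting the display layer present one cluster per identity, which matches the "group name strings" idea above better than rewriting records to a single "right" form.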
Consideration: Ingest Methodology
Benefit: As the collection increases with regularly ingested content, so will our BHL subject headings & call nos. -- do we consider these additions helpful in expanding our scope?
Problem: OR are the additions of subject headings and call nos. as a result of ingest only adding to scope "creep"?
Questions/Comments
Short Term Solution
- - Sep 18, 2009: I would really like us to examine Dewey numbers (or at least get a count of how many records in this test set had Dewey numbers).
Long Term Solution
- - Sep 18, 2009: Do we need to look at specific thesauri to grow the subject terms we are using? Or is our broad reach good enough?