TechCall_08May2017
Notes
- User-generated metadata from PDF downloads for articles
Martin discussed with EC and they voted to continue with this
- How to proceed?
- What role EABL might have?
- What might Mike need to get into work planning process?
Mike doesn't need any further input
Susan - Is what we're trying to do here to create new records in the Segment table?
Trish - A lot has changed since 2013. At the time, people were creating PDFs and it would be fairly simple for them to submit the article metadata in the process.
Susan - Will these be created the same way as if they came in via BioStor?
Mike - yes, that was the intent.
The metadata that users supply is quite incomplete. Some of it can be added programmatically, but such attempts are not guaranteed to succeed.
The fields of concern are volume, issue, page number, and year; we don't ask BHL users to provide this information.
Either the record stays very incomplete or Mike attempts to derive the values from the Item record. That's problematic because, at the item level, the volume field frequently contains ranges.
The year field in the Item record is often sparsely populated or missing entirely.
It's difficult to produce a complete citation given what we have to work with. The concern is generating a lot of incorrect records.
Giving some versus none - is what we have better than nothing?
We were going to tag this as user-generated metadata.
Code could derive some of it, but it wouldn't be foolproof; a rough sketch is below.
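A minimal sketch of what that derivation could look like, assuming item-level strings like "12-14" for volume and "1897-1899" for year (field formats and the "refuse to guess" policy are assumptions, not BHL's actual schema or code):

```python
# Minimal sketch, not BHL's actual code: deriving article-level volume/year
# from item-level fields that may contain ranges or free text.
import re

def derive_volume(item_volume):
    """Return a single volume number only when the item-level field is unambiguous."""
    value = (item_volume or "").strip()
    if re.fullmatch(r"\d+", value):            # e.g. "27" -> usable as-is
        return value
    if re.fullmatch(r"\d+\s*-\s*\d+", value):  # e.g. "12-14" -> a range, ambiguous
        return None                            # refuse to guess which volume applies
    return None                                # anything else: leave the field blank

def derive_year(item_year):
    """Pick out the first plausible four-digit year, if any."""
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", item_year or "")
    return match.group(1) if match else None

print(derive_volume("12-14"))    # None -- can't assign a range to one article
print(derive_year("1897-1899"))  # "1897" -- a guess; the article may be from 1899
```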
Agree with the idea of this; but as soon as we have full text search, we would have these access points anyway. Once full text search is in place, the benefit falls away while the problems remain.
Maybe Susan, Trish and Bianca could discuss a bit more.
Meanwhile we'll go forward with full text search.
Work we discussed last week - ability of Members and Partners to provide article level metadata
Similar but much more strategic
Fits into idea of defining our own version of BioStor
How many user-generated PDF metadata records do we get per month / per year, and how many have we accumulated since this started in 2009?
217,317 PDFs with article info
But we can't just take all of it blindly; maybe half is good, so around 100K+.
Susan, Trish, Bianca:
Do we need to mark these as user-generated article metadata?
Or do you see full text search meeting that need instead?
We could also use it to inform systematic article definition, or to prioritize it. If we see PDFs for the same title over and over again, that indicates the title should be articleized (defined at the article level) from beginning to end.
JSTOR, BioOne, and other sources have good article metadata.
IIIF server things
Ben Brumfield mentioned Missouri might be installing an IIIF server.
Trish - interest but not at this point.
Martin was talking to Tom Cramer, who asked: would BHL be interested in participating in discussions on science content?
Tom would like to talk with Trish and Ari; Martin will send a virtual intro
The NEH grant at NYBG includes programming time from Ben Brumfield. In the absence of IIIF, he may do a simple rotate feature and add it to FTP as part of the grant.
IA IIIF viewer? What is the status? Is it active or was it a test?
The long-term goal is to have it available on our own end to reduce blocking in other countries, etc.
Full Text Search server ordered
In transit, should be delivered this week
EC Calls - minutes are available on the wiki, including both details and highlights.
Martin may ask Bianca to remind BHL staff that those are available
Because we have enhanced APIs into OCLC, we would like to do some pre-emptive work getting bibliographic info into BHL systems, on the assumption that Z39.50 might be going away. At least a couple of institutions have had trouble with Z39.50.
The creator of MarcEdit has been talking about SRU, which is functionally the same as Z39.50.
SRU stands for Search/Retrieve via URL - an XML-based protocol for querying data.
Probably related to BibFrame
SRU has been around since 2004
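For reference, an SRU request is a plain HTTP GET carrying a CQL query and returning XML. A minimal sketch against a hypothetical endpoint (the URL, index name, and query are placeholders, not a real OCLC or BHL service):

```python
# Minimal sketch of an SRU searchRetrieve request; the endpoint is a placeholder.
# The response namespace is the standard SRU one (http://www.loc.gov/zing/srw/).
import requests
import xml.etree.ElementTree as ET

SRU_ENDPOINT = "https://catalog.example.org/sru"  # placeholder, not a real service

params = {
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": 'dc.title = "Proceedings of the Zoological Society"',  # CQL query
    "maximumRecords": "5",
    "recordSchema": "marcxml",   # schema support varies by server
}

resp = requests.get(SRU_ENDPOINT, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
ns = {"srw": "http://www.loc.gov/zing/srw/"}
print(root.findtext("srw:numberOfRecords", default="0", namespaces=ns), "records matched")
```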
Some issues with Scribe software and Z39.50
OCLC doesn't hold local notes
Agenda
Background: See PDF+Generator
For discussion:
User generated metadata was reviewed and assessed as "good enough" in 2013 for use in providing article level access.
An excerpt from that page shows that the proposed parameters for implementation included, among others:
- User-generated articles where only one word is provided for title, author, and subject will be omitted from BHL, as it is thought that the metadata for these articles is insufficient. Approx. 4-5K articles will be omitted.
- User-generated content will be clearly marked as such in search results and on article landing pages (article records) with "User contribution" in the 'Contributed by' field.
- Problem of duplicate user-generated articles exists but this is similar to the issue we have with title duplication. Mike L. will be taking steps to cut down on the amount of duplicate articles where possible. We will have the option to merge duplicate articles manually, just like we do for duplicate titles as we are made aware of them.
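As a rough illustration of the one-word omission rule in the first bullet above, a minimal sketch (the record structure and field names are assumptions, not BHL's schema):

```python
# Sketch of the "one-word" omission rule: drop user-generated articles where
# title, author, and subject each contain at most a single word.
def is_too_sparse(article):
    """True when title, author, and subject are all empty or a single word."""
    for field in ("title", "author", "subject"):
        value = (article.get(field) or "").strip()
        if len(value.split()) > 1:
            return False          # at least one field has real content
    return True

articles = [
    {"title": "Beetles", "author": "Smith", "subject": "Coleoptera"},
    {"title": "On the beetles of Borneo", "author": "Smith, J.", "subject": ""},
]
kept = [a for a in articles if not is_too_sparse(a)]
print(len(kept))  # 1 -- the one-word record would be omitted
```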
How does a typical user generated article compare to what is provided by BioStor?
From Susan's analysis:
It looks like Title and Secondary Title are the only fields whose metadata can be derived with a high level of accuracy. How should we handle other metadata (e.g., author names) that may not be derivable with accuracy? How should we normalize the formatting of user-submitted author names?
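One possible illustration of author-name normalization (the input variants handled and the target "Last, Initials" form are assumptions, not an agreed BHL rule):

```python
# Sketch: normalize user-submitted author strings to "Last, F. M." form.
# Capitalization handling is naive (e.g. "McDonald" would become "Mcdonald").
import re

def normalize_author(raw):
    name = re.sub(r"\s+", " ", raw).strip().rstrip(";,")
    if "," in name:                       # already "Last, First ..." order
        last, rest = [p.strip() for p in name.split(",", 1)]
    else:                                 # assume "First ... Last" order
        parts = name.split(" ")
        last, rest = parts[-1], " ".join(parts[:-1])
    initials = " ".join(f"{p[0].upper()}." for p in rest.split() if p)
    return f"{last.title()}, {initials}".strip().rstrip(",")

for raw in ["j. b. smith", "Smith, John B.", "JOHN SMITH"]:
    print(normalize_author(raw))
# Smith, J. B.
# Smith, J. B.
# Smith, J.
```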
What is the status of year, volume, issue and page numbers in the review process?