BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

DiscoveryTools

Charge

Scope of Work
1. Investigate concerns about the quality of title-level and item-level descriptive metadata, especially as found in discovery layer systems. Focus on OCLC, ProQuest, Ex Libris, DPLA, EBSCO, SirsiDynix, and GoKB.

2. Review article-level/segment-level content from BHL into discovery layers. OCLC and ProQuest have expressed an interest in a feed of article-level metadata.

3. Create a NISO KBART Recommended Practice formatted feed for books and journals.

4. When appropriate, inform Knowledge Base vendors about improvements to the BHL metadata.

Membership

Name | Institution | Email | System(s) | Schema | Confirmed Member
Adam Chandler | Cornell | alc28@cornell.edu | Voyager, WorldCat Local, ProQuest | | Yes
Suzanne Pilsk | Smithsonian Institution | pilsks@si.edu | ProQuest, OCLC | | Yes
Diana Duncan | Field Museum | dduncan@fieldmuseum.org | OCLC WMS | | Yes
Matt Person | MBLWHOI Library | mperson@mbl.edu | Voyager | | Yes
Bianca Crowley | Smithsonian Institution | crowleyb@si.edu | DPLA (gretchen@dp.la), OCLC | MODS Item Set | Yes
Michael Loran | Natural History Museum | m.loran@nhm.ac.uk | Ex Libris | | Yes
William Ulate-Rodriguez | Missouri Botanical Garden | william.ulate@mobot.org | | | Yes
Mike Lichtenberg | Missouri Botanical Garden | mike.lichtenberg@mobot.org | | | Yes
Susan Lynch | New York Botanical Garden | slynch@nybg.org | | | Yes

Deadline

December 2015

Working Documents

https://docs.google.com/spreadsheets/d/1oIKbrKsuARqPMSayDp_KLPPeQ0o3kJsFTo4vabcS4cM/edit#gid=0

KBART Test file(s)

https://www.google.com/fusiontables/DataSource?docid=1a28t9MlwWFlEzYFHZK7umEXfKxn1YNBtWIPbF0W-
https://docs.google.com/document/d/1BWRYqBBcnBJMt6t4EN4Bsydc5SoKxAq7Vd1_KtRownQ/edit?usp=sharing

Supporting Documentation

  1. BHL Metadata Schema (as of 05/01/2015) - lipscombb, Apr 30, 2015: Out of date; William, can you update this please?
  2. BHL Data Model (jpg) (current version on github) | searchable spreadsheet version (downloadable via github)
  3. NISO RP-9-2014, KBART Phase II Recommended Practice (see Table 5, pg. 19)
  4. Metadata Schemas Comparison to BHL (gdoc)
  5. 2014 Library Systems Report helps to sort out the companies, systems, services, products (oh my!) https://web.archive.org/web/20140820085043/http://www.americanlibrariesmagazine.org/sites/americanlibrariesmagazine.org/files/content/Charts_MarshallBreeding.pdf
  6. marc7xx.txt
  7. hathi_trust_sample_records.xml
  8. OCLC requirements for ingesting article level metadata

Final Report


The BHL Discovery Tools Task Force met bi-weekly (22 times) over Webex from March 2015 - March 2016. Meeting notes and all the group's documents are available on the project wiki page: DiscoveryTools

The group's effort focused on improving the title- and article-level metadata that BHL makes available to discovery systems and knowledge bases. The group feels it has accomplished all it can as constituted and that the outstanding work should be passed along to the Technical Advisory Group.


Completed
  1. The group completed all the analysis and code for a MODS title level feed for DPLA.
  2. The group completed all the analysis for a NISO KBART title level feed.
  3. The group completed all the analysis for JATS formatted article level feed for BHL.

Within scope, but not completed (Gemini tickets):

57285 Indication of Publication type requirement
57286 Documentation/Requirements for Metadata - KBART Needs
57400 Create article level feed in JATS format
57401 Retrospective Title Clean Up (removing extra data): Remove the 245c portion
57474 Add Fields to Item Table To Capture Additional Volume Details
57475 Update Machine Interfaces to Use New Item Volume Fields
57476 Parse Volume Details into New Fields During Data Ingest
57477 Change Sequencing of Serial Volumes by the BHL Ingest Process
57478 Populate New Item.Volume Fields by Parsing Existing Volume Values
57479 Create New Reports For Managing Volume Metadata Cleanup
57480 Create a KBART Data Export
57483 Change Post-Scan Process For Item Updates and Re-Sequencing
57484 IA Partner Meta App Updates for Volume Metadata
57485 Macaw Updates for Volume Metadata
57486 BHL Ingest Updates for Volume Metadata
57487 Updated Item Edit Page with New Fields Related to Volume


Meetings


March 16

Agenda:
  1. Review action items from March 2
  2. Discuss Mike's Gemini background emails and try to finish Gemini tickets

Notes:
In attendance: Adam, Diana, Bianca, Matt, Mike, Susan, Suzanne and Trish

Review of action items:
Adam will be joining the tech call on March 21st. Bianca has added him to the agenda.
Susan emailed Laura at OCLC to ask how they would like to receive JATS and will update the wiki with Laura's response.
Neither Susan nor Mike has selected records for a sample JATS file yet.
We did not decide what to do with publication type = collection.
Matt did not hear back from Regina about alternative pricing for ISSNs, so there may not be another option for us.

Next we reviewed Mike's emails.
From his email of March 10th, Mike's first point was to clarify that the ISO standard for date does not require format YYYY-MM-DD. YYYY is the minimum. The next portion of that email dealt with the question of gap fills which we decided to discuss after we figured out the steps for capturing data from the second email.

We next went through the separate points in Mike's second email.

For no.1, the background was that the current volume strings are not usable for parsing out the data we need for KBART. We have text such as "Narrative" or "Zoology" that do not fit anywhere in KBART but are useful to the user for navigation. We need to keep these in the description for display purposes but Mike needs to add separate fields for KBART.

Proposed change is Gemini ticket no. 1: add fields to the item table (see action items for the specific list). Mike will attempt to parse the information as it comes in; this could mean modifying Macaw or the partner app. Bianca noted that people would prefer not to have to add these after the fact, but we would need to get people on board first. Ingested materials would also go through the parser. Bianca thought it would be great to get a full-time person for dedicated metadata cleanup. Adam noted that he sees something missing from the workflows in the BHL project: there is metadata work upfront and software to handle it, but there is no metadata cleanup step in the process. Bianca brought up that this has come up in other groups such as User Feedback--now BHL is in a different phase. While we still have a lot of material to scan, we need to figure out how to sustain what is already in BHL. Adam asked where this fits in BHL 2.0, and the answer is that no one knows--Bianca thinks it is more of a catchphrase than a plan. Trish mentioned that, with the limited staff we have, she would rather see us focus on the things we can control (BHL member scans) and not worry as much about non-BHL materials.

Adam asked a question about what he was presenting to the tech committee on Monday. He planned to add a couple of paragraphs to the wiki that summarize all that our group has accomplished. He also wants to make a statement of the problems with metadata cleanup mentioned above. Everyone agreed that was a good plan.

We went back to the question of whether we needed the Macaw change. Susan liked the idea of doing as much as we can in Macaw and the partner meta app. She had some questions about the hierarchy of the new fields. We don't always have this information for legacy items, and it gets confusing with foreign-language parts. Matt suggested maybe thinking of it as 1st, 2nd, and 3rd levels of enumeration. For KBART we need to capture the most significant number. Mike emphasized that labels can remain as is. We decided we would add one ticket for updates to Macaw and another ticket for updates to the partner meta app. Bianca is a bit concerned about the IA partner meta app. We've had problems with some of the scanning centers' interpretations of the metadata in the app in the past, so it will not be a smooth process to change it. We'd also need to let all the institutions using the app know about the changes. Some institutions are resistant to changing the process.

Susan asked the question about whether we would need a field to track whether a human had corrected these new fields like we have for pagination. Mike thought it was a good suggestion but will have to think about how to do it.

No.2: update all the different data feeds to use the new fields where appropriate. Not much discussion--Suzanne just wanted to make sure that we record all the places the metadata goes for this ticket, to make sure we include everything.

No.3: update the ingest process to attempt to parse values for the new fields out of the volume string. This may not work well. Diana suggested adding a ticket to change the post-scan process to manually update these fields--agreed to by the group.
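The parsing step described in no.3 could be sketched roughly as below. The field names (StartVolume, EndVolume, EndYear) come from the proposed item-table changes in the action items; the regex patterns and the example strings are assumptions for illustration--real BHL volume strings (e.g. "Narrative", "Zoology") are far more varied, which is exactly why the group expected this step to "not work well":

```python
import re

# Hypothetical patterns; actual BHL volume strings are much messier.
VOLUME_RE = re.compile(r"v\.?\s*(\d+)(?:\s*-\s*(\d+))?", re.IGNORECASE)
YEAR_RE = re.compile(r"\((\d{4})(?:\s*-\s*(\d{4}))?\)")

def parse_volume(volume: str) -> dict:
    """Attempt to pull Start/End volume and year out of a raw Volume value.

    Unparseable strings return empty fields, consistent with the group's
    decision that records without the new fields are excluded from KBART.
    """
    fields = {"StartVolume": "", "EndVolume": "", "EndYear": ""}
    m = VOLUME_RE.search(volume)
    if m:
        fields["StartVolume"] = m.group(1)
        # A single volume is both start and end.
        fields["EndVolume"] = m.group(2) or m.group(1)
    y = YEAR_RE.search(volume)
    if y:
        fields["EndYear"] = y.group(2) or y.group(1)
    return fields
```

Descriptive labels such as "Narrative" would fall through with empty fields, leaving the original Volume string intact for display, per the discussion above.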

No.4: when new volumes appear for a title, the process currently attempts to put them in sequence. All this does is mess up the sequence. Change the process to just add them at the end. Everyone thought this was a great idea. We also need to add re-sequencing to the post-scan process ticket suggested in no. 3.

No.5: create some scripts to attempt to fill the new fields for existing data. Mike is not sure if this would be a one-time process or on-going. No.3 might take care of this for new issues added.

No.6: new reports to help with metadata cleanup. Not much discussion. We could add more as needed.

No.7: Report to support Gap Fill activities. Mike needed help with this one. Suzanne mentioned that we need to look at this after we figure out how to identify gaps for KBART. Mike said that would be part of the KBART feed. Bianca felt that KBART needs come first and this is not necessary for KBART. Suzanne suggested this fit better with Bianca's collections committee. Adam mentioned that the KBART feed data might take care of this need. Bianca will take care of this ticket.

No. 8: Create the feed

Adam asked how we reconcile these to the original tickets (57280, 57281, 57282, 57284)? Mike--these new tickets will replace the old ones.

Last thing was the last point from Mike's first email: if the new fields are not populated for a record, it will not be included in KBART. Agreed to by everyone.

Who will add to Gemini? Mike
What will we do with old tickets? Bianca will close as duplicates. She will work with Mike on these.

Action Items:
Mike and Bianca to redo Gemini tickets:

1) Add the following fields to the Item table. This will allow for more specific detail about each item to be recorded in a more useful format. Currently, this detail is embedded within the Volume values:

StartVolume - this should contain JUST a volume designation (e.g., "8")
EndVolume - this should contain JUST a volume designation (e.g., "8")
EndYear
StartIssue
EndIssue
StartNumber
EndNumber
StartSeries
EndSeries
StartPart
EndPart

Possibly add field(s) to track if human has corrected these fields.

1a) Make changes to Macaw for these new fields

1b) Make changes to IA partner meta app to capture these new fields

2) Update the OpenURL resolver, APIs, OAI feeds, and data exports to use StartVolume and the other new fields when appropriate.

3) Update the ingest process to attempt to parse elements of the Volume values into each of the new fields.

3a) Change the post-scan process for BHL scanning institutions to include correcting the new item fields and re-sequencing of items for a title.

4) Change how the ingest process sequences volumes in a serial. Instead of attempting to insert newly ingested volumes into the proper place in the sequence of existing volumes, assign all newly ingested items a sequence value of 10000, effectively putting them at the end of the list of volumes.

5) Create scripts to automatically clean up the data as best as is possible. They should be created with the assumption that they may be used periodically (once a month/quarter/year), though they might end up being used only once (when the new fields are added to the item table).

6) Create the following new reports to aid in identifying and cleaning up serial volumes:
A list of Items that have Volume values, but no values in the new Volume/Number/Issue/Series/Part/Date fields
A list of Titles (Serials) with "unsequenced" volumes

7) Create a report to support "Gap Filling" activities. This is a nice-to-have to be recorded, but it is not part of the KBART Gemini tickets.

8) Create a process to produce KBART data exports.
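The KBART export in no. 8 is, at bottom, a tab-delimited UTF-8 file with a fixed header row. A minimal sketch follows; the column list is taken from the KBART Phase II recommended practice (NISO RP-9-2014, the document cited in Supporting Documentation--verify against its Table 5), and the sample record and `write_kbart` helper are invented for illustration:

```python
import csv
import io

# KBART Phase II column set per NISO RP-9-2014 (verify against Table 5).
KBART_FIELDS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "num_first_vol_online", "num_first_issue_online",
    "date_last_issue_online", "num_last_vol_online", "num_last_issue_online",
    "title_url", "first_author", "title_id", "embargo_info", "coverage_depth",
    "notes", "publisher_name", "publication_type",
    "date_monograph_published_print", "date_monograph_published_online",
    "monograph_volume", "monograph_edition", "first_editor",
    "parent_publication_title_id", "preceding_publication_title_id",
    "access_type",
]

def write_kbart(records, out):
    """Write records (dicts keyed by KBART field names) as a tab-delimited file.

    Missing fields are emitted as empty strings; unknown keys are ignored.
    """
    writer = csv.DictWriter(out, fieldnames=KBART_FIELDS, delimiter="\t",
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
```

Per the group's decision above, a record whose new volume fields were never populated would simply be left out of the list passed to `write_kbart`.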

Bianca to close as duplicates tickets 57280, 57281, 57282, 57284.

Outstanding item from last call:
Susan/Mike select records and create sample JATS file for Mike to send to OCLC

March 2

Agenda:

Review tickets for outstanding work that our task force does not have the resources to complete. See DTTF ticket discussion so far for background.

Notes:
A month ago Suzanne created Gemini tickets as placeholders for the work that needs to be done. The group may not be able to figure out all the next steps--maybe the Tech team needs to do this.

We added ticket BHLFEED-57400 for JATS work. Susan completed the JATS mapping, but it may need tweaking once Mike produces data. The tab on the spreadsheet is "JATS version 2." We decided we are only including articles that are in BHL, not articles that point outside of BHL. Susan spoke to Laura Falconi of OCLC. The procedure will be to send a sample of 100 records, and Susan wants to make sure that we include book chapters, journals, and items with diacritics. She will pull 10 and let Mike pull the remainder. Laura will attempt to process the sample file. If this is successful, BHL will sign a written agreement or contract. After that, we will be in production. Mike will send to Laura but copy Susan. An outstanding question is how OCLC would like to get the feed: OAI or monthly data export. Susan will ask Laura.

Bianca asked if there was any metadata cleanup required for JATS. Susan--nothing that will hold us back, but there are a couple of issues. BHL uses the preferred form of an author's name, but the practice for journal article metadata is to use the form of the name as it appears on the article. Another issue was ISSN/ISBN. For book chapters we use ISBN if we have it. For ISSN, we could retrospectively assign, but the tech team decided not to because of the expense: the ISSN registry charges one steep fee for access to the registry and another steep fee to contract with them to match titles from BHL against their database and retrieve ISSNs, though there is only a modest charge for registering an ISSN for a single title. Susan said we might consider buying an ISSN occasionally for titles. Suzanne suggested that someone contact Regina of the ISSN registry to see if they would work with us on the price; Matt will do this. We will just send the data that we have and see what OCLC's feedback is. Matt added post-meeting: titles we have without ISSNs which are in JSTOR will have ISSNs.

We then moved on to GEMINI tickets that were already entered.
57279--we separated it into 2 issues. 57279 is now just for DOI cleanup, which is a broader issue, not just one for the Discovery Tools group. Currently we just have a one-time process for registering titles for DOIs. We need a process that will update the metadata registered with CrossRef when changes that affect the metadata are made in BHL.
New ticket 57401 created for 245 subfield c cleanup (title includes statement of responsibility)

57286--documentation/training materials changes. Changed resources from Mike and Joel to Bianca.

57285--Publication type. Mike had a problem with this one because there were a lot of open-ended statements without decisions. We made it clear in the issue now: we are not including records with NA for publication type. Mike will map monograph component part to monograph and serial component part to serial. Raw data publication type will not be changed. Going forward, we need to make it clear that this field is mandatory.

We still have a few tickets to review. We agreed to continue meeting until we finish these. Mike will put together some thoughts about these and send the group an email.

Adam wanted to include Martin on our next call but Martin is not available. Plan is for Adam to join the March 14th tech team call to discuss our group's report and recommendations (as long as Martin is available for that call). Bianca will talk to Carolyn about putting this discussion on the agenda for that call.


Action items:
Susan will contact OCLC rep Laura Falconi to ask how they would like to get JATS feed.
Susan/Mike select records and create sample JATS file for Mike to send to OCLC
Matt to contact Regina at ISSN to let her know we won't be assigning ISSNs due to pricing and to see if they will work with us. - Matt wrote to Regina updating her; he has not received a response as yet regarding assistance or alternate funding schemes they may have available for projects such as ours. 3/15
For 57285: what are we doing about publication type = collection?
Mike to send an email with thoughts about remaining tickets
Adam to join tech team call on March 14th [correction: March 21]
Bianca to contact Carolyn about adding the DT tools discussion to the March 14th tech team call.

February 3

Notes:
Need to review the JATS documentation, which is all on the wiki. The questions need to be reviewed; then Bianca and Susan will contact Laura Falconi with those questions and see what needs to be done from there. Thanks, Susan, for a thorough job. The data has now been copied into a sheet in the Google doc that has all the mappings, called "JATS version 2."

Matt's response to the ISSN questions from Regina is linked from the wiki. We need to follow up with Regina to see how BHL can work with the ISSN office to get these done. Funds might need to be allocated. Potentially a Gemini ticket--and who to assign it to? Thanks, Matt!!

Gemini tickets:
JATS - Attach the files to a ticket to create a JATS export for BHL
ISSN - Follow up with Regina Reynolds to find out what the next steps are and identify tasks
KBART - Update the working document to point to the Gemini tickets
Suggestions for actually getting the tickets into Gemini and closing out our task force.

February 17: The next meeting will be the last, to review the Gemini issues/questions--and invite Martin to the call.
Suzanne will check in with Martin to see if he can attend.

Gemini tickets created by Suzanne (with potential help from others) - as much completed by the 17th.

Agenda:
  1. Volunteer to take notes?
  2. JATS update
    1. Susan/ Bianca will contact OCLC rep Laura Falconi with questions about JATS including whether we must use the publishing tag set and other questions about specific elements. Also whether they want only articles or would they like book chapters, too? ISSN is marked as required--will they reject records without this data?
    2. Susan: determine JATS tags needed and how to map data from BHL for license and copyright information.
    3. Susan will update the JATS reference page with the final list of tags: probably one set for public domain materials and one set for in copyright materials.
    4. Latest BHL JATS draft: JATS
  3. Matt will contact Regina Reynolds of US ISSN Office with questions about ISSN lookup and assignment for our journals. He will also send the group an email with examples of JSTOR legacy journals with ISSN's.
    1. ISSN question and reply
    2. Two examples of titles in BHL without an ISSN which have ISSNs in JSTOR (the scanning library's record preceded JSTOR development):
    Florida Buggist
Notizblatt des Königl. botanischen Gartens und Museums zu Berlin

4. Everyone: Review the report Adam produced and add comments for the Gemini tasks that need to be created. Report available at: https://docs.google.com/document/d/14xQz5mTdg4y9v_12tqk-J0bXOCii0y4HGAsUhh_JdrM/edit#heading=h.9p37z1xt69g8
5. Our last task force meeting, February 17? Should we invite Martin?


January 20

Agenda:

  1. Review BHL JATS XML draft
  2. Review and complete BHL JATS draft data mapping
  3. How to handle Link to full record (DOI or URL)?
  4. Proposal: Start migrating our KBART report and other remaining task force products into the BHL issue tracking system

----------------------------
From: Lynch, Susan
Sent: Tuesday, January 19, 2016 11:48 AM
To: Adam L. Chandler <alc28@cornell.edu>
Subject: JATS and BHL

Adam,
Attached is some sample XML that I’ve been working on. I tried to plug in the metadata for an existing part in BHL. I also added a couple of attributes and tags not present in your XML e.g. permissions, license and self-uri. The example uses BHL part 167868. See http://biodiversitylibrary.org/part/167868
I also have the following notes and questions:

Landing page for JATS, Journal Article Tag Suite
http://jats.nlm.nih.gov/

Landing page for Journal Archiving and Interchange Tag Set which is 1 of 3 sets in the Tag Suite

Information about the tags in the archiving tag set
http://jats.nlm.nih.gov/archiving/tag-library/1.1/
There’s a sample of the JATS publishing tag set here http://jats.nlm.nih.gov/publishing/tag-library/1.1/FullArticleSamples/pnas_sample.xml

Questions:
(1) The issue that contains this article is volume 6, number 2. The 2 isn't visible in the BHL public portal or the admin dash. Is it in the database? Did Rod Page provide it? There isn’t a data entry field for issue in the BHL Dashboard.
(2) Does the metadata describe the original or the facsimile in BHL?
(3) In an email from OCLC, the ISSN is described as required. Is it really required? Do we have both e-ISSNs and print ISSNs in BHL?
What about book chapters? Should we code the <isbn> tag? Are there any ISBNs for parts in BHL?

Please make any of this available to the group as you see fit.
Best,
Susan Lynch

Notes:
Susan reviewed the work she and Adam did on researching the JATS tag suite. There are 3 different tag sets: publishing, archiving and article authoring. Each tag set can be used to structure metadata or the text itself. BHL will only be using it for metadata.
After looking at the different tag sets, Susan decided that the archiving tag set would work the best because it is the most flexible. OCLC is using the publishing tag set, but many of the elements are interchangeable so it would not be a problem if we had to use that one. This is one thing we'll need to clear up.

Susan took an existing article and put its metadata into the archiving tag suite--See example on wiki page linked from no.1 on the agenda. She had some questions about elements:
Would it be helpful to include BHL part number in metadata? Yes, this would be helpful for testing.
What is our strategy regarding DOIs--when are they assigned and when not? Currently we do not assign them. The only ones that are in the database were assigned by someone else. Mike said we do plan to assign these at some point in the future. Susan suggested that it would be better to leave this element out until we are assigning them. Susan then asked, if we added the DOI later, would it be picked up by OAI harvesting? Mike said yes, if they are doing the harvesting correctly, the change would be picked up. Adam asked whether Mike would produce an OAI option; Mike said yes.

Another question was about language, which is an optional field for OCLC. Susan asked whether we had this at the article level. Bianca did not think this was being filled in for segments. Mike was not sure if Rod Page was providing this; he spot-checked some data, and it appears that it is not being provided.

There was some discussion about license and copyright data. This is stored at the item level, not the segment level. This is somewhat complicated. Susan and Bianca are going to work together to figure out the JATS tags needed and how to map from BHL metadata.

Next we discussed the issue of ISSN. Adam thought we should leave the decision about ingesting materials without ISSNs to the vendor. Suzanne wondered how HathiTrust and JSTOR legacy journals were ingested into OCLC. Mike mentioned that one reason we do not assign DOIs to articles is that CrossRef does not know what to do with articles that lack ISSNs. Suzanne wondered if there was another ID that CrossRef would accept since we are a trusted repository. Diana asked if there was a way to assign these to defunct journals. Matt looked up some JSTOR journals and they all have ISSNs, so this must have been done. He will send us an email with examples. He knows Regina Reynolds of the US ISSN Office and will contact her to find out how to look up ISSNs for our legacy journals and how we could assign them if they don't have them already.

Adam's last agenda item was whether we should take our report and enter tasks needed for our recommendations as Gemini tasks. The members may make a decision not to do the work but at least we would have documentation of what would need to be done to meet KBART and JATS requirements. We decided that we all need to read the report and add comments of what Gemini tasks need to be added for each area of the report by the next meeting.
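Susan's metadata-only JATS record (see her email above, which uses BHL part 167868) might be sketched along these lines. The tag names below are standard JATS elements from the archiving tag set she recommended, but the exact selection and mapping are illustrative assumptions--the group's actual mapping lives in the "JATS version 2" spreadsheet tab, and the sample title and author are invented:

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

def build_article_meta(part_id, article_title, journal_title, authors):
    """Build a minimal metadata-only JATS <article> element.

    authors: list of (surname, given_names) tuples in sequence order,
    matching the segment-author table's sequence number.
    """
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")
    journal_meta = ET.SubElement(front, "journal-meta")
    jtg = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(jtg, "journal-title").text = journal_title
    article_meta = ET.SubElement(front, "article-meta")
    # Carrying the BHL part number helps testing, as the group agreed.
    pid = ET.SubElement(article_meta, "article-id", {"pub-id-type": "other"})
    pid.text = str(part_id)
    tg = ET.SubElement(article_meta, "title-group")
    ET.SubElement(tg, "article-title").text = article_title
    cg = ET.SubElement(article_meta, "contrib-group")
    for surname, given in authors:
        contrib = ET.SubElement(cg, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    # self-uri pointing back to the BHL part, as in Susan's sample XML.
    ET.SubElement(article_meta, "self-uri", {
        f"{{{XLINK}}}href": f"https://www.biodiversitylibrary.org/part/{part_id}"
    })
    return article
```

License/copyright (`<permissions>`), ISSN/ISBN, and DOI elements are deliberately omitted here, since those are exactly the open questions in the notes above.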

Action items:
Susan or Bianca will contact OCLC rep Laura Falconi with questions about JATS including whether we must use the publishing tag set and other questions about specific elements. Also whether they want only articles or would they like book chapters, too? ISSN is marked as required--will they reject records without this data?

Susan and Bianca: determine JATS tags needed and how to map data from BHL for license and copyright information.

Susan will update the JATS reference page with the final list of tags: probably one set for public domain materials and one set for in copyright materials.

Matt will contact Regina Reynolds of US ISSN Office with questions about ISSN lookup and assignment for our journals. He will also send the group an email with examples of JSTOR legacy journals with ISSN's.

Everyone: Review the report Adam produced and add comments for the Gemini tasks that need to be created.

December 16

Agenda:

1. Discuss draft meeting schedule for Q1 2016.

2. Progress on action items from December 2 meeting:
Adam will continue discussions with the vendors.

He will also speak to Marty about the analytics but that is of secondary importance.

He will plug our estimates for title merging into the report.


3. BHL segment data and compare it to the JATS format.

Notes:
First we went over the action items from the last meeting.
Adam started off the call by giving an update on his communications with vendors. He had further contact with OCLC and Ex Libris but has not heard back from ProQuest.

Adam also gave us an overview of BHL metadata in Cornell's catalog. There is only article-level metadata. He showed us a journal title available from BHL and other vendors. The other vendors show coverage information, but BHL does not. Within Serials Solutions, there is a package-level record for BHL that shows the total number of monograph and serial titles. If you look at the title information, BHL does not have ISSN and coverage the way BioOne does. Though a few titles in Serials Solutions for BHL had coverage dates, that would be because a Cornell staff member overrode the dates. Susan asked how ProQuest got the data. Adam did not know, but thought they might have used the BHL export page. Mike said they could also be pulling it in from the OAI feed.

Adam plugged in our estimate for title merging into the report for KBART. At some point in a future call, we will have to revisit the report to answer a few of his questions.

Next we spoke about JATS. It is very confusing because of all the permutations. There are Tag Suites and Tag Sets. Adam clarified that KBART contains journal title level metadata and JATS contains article level metadata. KBART is a tab-delimited file and JATS is in XML.

We decided to finish our mapping of BHL article metadata even though we may not have the exact Tag Suite/Tag Set. Mike explained that the Segment table is where we should go first to get most of this metadata. Sometimes a field may be blank in the Segment table, and he would have to get that data from another table. Susan asked if we would need to retain information about who is the first author for an article. We weren't sure, but the segment-author table has a sequence number for that information. Another question came up as to whether there is an abstract; there is a field for that as well. All of our mapping data is found in the spreadsheet linked to in section 3 above. Susan had a question about whether we would be creating an OAI feed or an export file. Mike said we could do either one, or both.

Adam next proposed that we have a smaller group take a closer look at the DTD and report back for our next call: Adam, William, Mike and Susan. He also proposed a call with some of the vendor reps. William suggested we start a notes document for questions that we would have for the vendors to present to them first. Adam thought the next step after reviewing the DTD would be for Mike to make a sample file for them to review and at that point we would probably have questions for them.

Susan had one more question about the segment data. There is a page range and also a start page and end page. Is there ever a case where the page range would not be start page [dash] end page? Mike said that in theory there could be an article that started on p. 5-17 and then continued on p. 30-40, etc., but we do not currently have any article metadata like that.

Action item:
Small group to review DTD further and report back at our next call.

December 2

Agenda:

What standards are used by vendors when ingesting article level metadata?

OCLC can ingest either NLM PubMed xml schema or ONIX. See: OCLC requirements for ingesting article level metadata

Ex Libris supports PubMed XML and several other formats. See: Ex Libris requirements for ingesting article level metadata

Proquest: Adam is waiting for his Proquest contact to return from vacation.

Our best bet is NLM (PubMed) JATS:
Hi Adam,
Here are a couple of links to review. http://dtd.nlm.nih.gov/publishing/ is a link to the general NLM JATS page and http://jats.nlm.nih.gov/publishing/1.0/ is a link to their current standard which is marked NISO JATS. From there, you can navigate to the DTD, samples and other documentation.
Laura Falconi from our Content Integration team would be more than happy to help you further if you have any other technical questions. I’ve copied her in on this email. Have a great day!
Tim Martin (OCLC)

Sample PubMed XML

Follow up: Bianca estimates that 10% of serial titles in BHL are duplicates.
As of 11/18/15 there were 3,983 serial titles in BHL.
Using a pivot table to analyze 961 duplicate values, she found 388 "unique" titles.
388/3,983 = ~10% of BHL serial titles need to be merged into other titles.
Was this analysis logical?

Notes:
First discussion was about the scope of our committee and where we stand. We have done significant work on all 4 items listed above, but we will not be able to finish this year. Adam thought we would be able to finish in the first quarter of 2016. Everyone agreed they'd be able to stay on board.

Bianca spoke about her discussion with Gretchen from DPLA about the data we sent. There were problems with titles that had null values in the copyright field which Bianca will have to repair. Hopefully we can do some sort of batch update. Gretchen also asked for clarification on a couple of minor things. DPLA won't be ready for another harvest until January, which will give Bianca time to fix the copyright issue.

Bianca then explained how she went about her serials title analysis. Mike provided a spreadsheet of titles with bib level = serial or serial component. There were 3 parts: one match on title string, another on OCLC number, and the third on ISSN. Bianca combined these 3 parts to come up with the 388 titles that require merging. These would not necessarily have to be fixed by a cataloger. The work would only include title merging--not resequencing the items or fixing metadata. Matt and Bianca gave estimates of 10-30 minutes per title. We agreed to use an estimate of 20 minutes, which comes out to 129 hours of work.
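The three-part match Bianca describes (title string, OCLC number, ISSN) amounts to grouping the serial-title spreadsheet on each key in turn and flagging any group with more than one record. A minimal sketch, with assumed field names since the actual spreadsheet columns are not documented here:

```python
from collections import defaultdict

def find_duplicates(titles):
    """Flag serial titles that match another record on title string,
    OCLC number, or ISSN.

    titles: list of dicts with 'id', 'title', 'oclc', and 'issn' keys
    (assumed names; values may be empty strings).
    Returns the set of ids that belong to some duplicate group.
    """
    dupes = set()
    for key in ("title", "oclc", "issn"):
        groups = defaultdict(list)
        for rec in titles:
            value = (rec.get(key) or "").strip().lower()
            if value:  # skip blanks so empty ISSNs don't all "match"
                groups[value].append(rec["id"])
        for ids in groups.values():
            if len(ids) > 1:
                dupes.update(ids)
    return dupes
```

Exact string matching, as sketched here, is conservative; near-duplicate titles with variant punctuation or transcription would still need the manual review the notes describe.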

Article metadata
Adam spoke to 2 of the vendors. It appears that PubMed would be the best option. NLM created its original article DTD for its own needs, and a few years ago a NISO group was formed to make this an industry standard.
First we spoke about the feasibility of doing a feed. Rod Page is providing most of our metadata for articles, and Suzanne asked if we needed to talk to him about our requirements. Mike L. noted that we are getting most of the metadata we need from him already. The only problems he sees are that we are not getting journal year and month, and that authors' names are in one field instead of being broken down by surname, middle name, and given name.
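As a rough illustration of the single-field author name problem, a heuristic split might look like the sketch below. The function name and the comma-based rule are assumptions; real-world names will defeat any simple heuristic, so this is not a substitute for receiving structured names in the feed.

```python
def split_author(name):
    """Heuristically split a single-field author name into parts.

    Assumes the common "Surname, Given Middle" form; without a comma,
    the last token is treated as the surname.
    """
    if "," in name:
        surname, _, rest = name.partition(",")
        parts = rest.split()
        given = parts[0] if parts else ""
        middle = " ".join(parts[1:])
        return {"surname": surname.strip(), "given": given, "middle": middle}
    tokens = name.split()
    return {"surname": tokens[-1] if tokens else "",
            "given": " ".join(tokens[:-1]),
            "middle": ""}

print(split_author("Darwin, Charles Robert"))
# -> {'surname': 'Darwin', 'given': 'Charles', 'middle': 'Robert'}
```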

Mike L. gave a summary of what Rod Page is doing to create this data. He runs the website biostor.org, where he takes available bibliographies, extracts their article data, and matches it to BHL. He publishes this article metadata on his website and makes it available to BHL via APIs. Mike L.: BHL pulls in weekly updates and changes.

Suzanne asked if we had ever used his code against other bibliographies to get article metadata and Mike said no. She made a proposal that the Collections Committee look into that. Bianca stated that the CC could find bibliographies to submit to Rod and that this would be a nice-to-have project but not necessarily something the CC could pursue. - lipscombb lipscombb Dec 2, 2015 The CC can add sending bibliographies to Rod to our list as a project for later, but we cannot do anything about using Rod's code itself.

Susan mentioned that Cornell & Harvard both have analytics for many journal titles in their ILS databases that could possibly be dumped into BHL. Bianca mentioned that this would require Mike L.'s time, and Matt said that Joe DeVeer had told him that Harvard had already done a lot of work to create these segments in BHL. Suzanne was concerned that we need to think about a way to dedupe these before we start bringing in more article metadata. We would need to make sure we articulate for future contributors what Rod Page sends us. Susan mentioned that 6% of the segments were generated from the Admin Dashboard, and we would need to make sure that all the data we need is getting updated there. The agreement was that we should work on creating the metadata feed with the data we already have before we think about getting more article metadata from other sources.

Action items:
Adam will continue discussions with the vendors.
He will also speak to Marty about the analytics but that is of secondary importance.
He will plug our estimates for title merging into the report.

Next meeting we will look at BHL segment data and compare it to the JATS format.

November 18

Agenda:

  1. Finish estimation of work and other outstanding work related to KBART data problems report
  2. Start shifting our attention to the next item in our charge: Review article level/segment level content from BHL into discovery layers. OCLC and Proquest have expressed an interest in a feed of article level metadata.

Comparison of BHL vs. BioOne as they appear in the ProQuest Serials Solutions client center.

Notes:
First discussion was about Adam's idea to separate our KBART proposal into two phases. He showed that in the Serials Solutions product only the dates appear for BioOne and BHL, not the volumes, and it is the dates that are critical to fix. The proposal would be to fix publication title and dates in the first phase, with the volume field fix in a second phase. Everyone agreed that this was a good idea.
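A minimal sketch of what a phase-one row might look like, assuming the full Phase II column set from NISO RP-9-2014 with only the title, date, URL, and id fields populated. The Nature coverage dates shown are illustrative values, not BHL's actual holdings.

```python
import csv
import io

# Full KBART Phase II column set (NISO RP-9-2014). In a first phase only
# the title, date, URL, and id columns would be populated.
KBART_COLUMNS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "num_first_vol_online", "num_first_issue_online",
    "date_last_issue_online", "num_last_vol_online", "num_last_issue_online",
    "title_url", "first_author", "title_id", "embargo_info", "coverage_depth",
    "notes", "publisher_name", "publication_type",
    "date_monograph_published_print", "date_monograph_published_online",
    "monograph_volume", "monograph_edition", "first_editor",
    "parent_publication_title_id", "preceding_publication_title_id",
    "access_type",
]

def phase_one_row(title):
    """Build one KBART row with only the phase-one fields filled in."""
    row = dict.fromkeys(KBART_COLUMNS, "")
    row["publication_title"] = title["title"]
    row["date_first_issue_online"] = title.get("first_year", "")
    row["date_last_issue_online"] = title.get("last_year", "")
    row["title_url"] = "http://www.biodiversitylibrary.org/bibliography/%d" % title["id"]
    row["title_id"] = str(title["id"])
    row["publication_type"] = "serial"
    return row

# KBART files are tab-delimited with a header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=KBART_COLUMNS, delimiter="\t")
writer.writeheader()
writer.writerow(phase_one_row(
    {"id": 40302, "title": "Nature", "first_year": "1869", "last_year": "1923"}))
```

Leaving the volume columns present but empty keeps the file structurally valid, so phase two only fills in values rather than changing the layout.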

Secondly we discussed concern that the KBART problems might have a low priority compared to everything else on the Tech wish list. William stated that the strategic plan for next year included the goal of sharing content with other aggregators and making it more visible so that might not be the case.

We looked at some examples of metadata with invalid values in the year fields, and the question came up as to how we could ensure that the data is valid going forward. Bianca: we recommend that people use the separate year fields in the partner meta app. Some libraries are using their own spreadsheets. While we could add requirements to the Macaw template, we can't control the rest of the data that is going directly into IA. Diana: maybe we could have a report that would show the invalid data, which we could bring to the attention of those contributors.

Adam asked whether we had a place to store summaries that showed gaps. No. The question came up as to how common gaps are and whether there was a report of this. Mike L.: it would be extremely difficult to figure this out programmatically.

Adam asked if we had ever extracted the data into a spreadsheet, fixed it, and then sent it back to Mike for updating the database. Mike L.: we have not.

Adam then asked whether we needed to provide this much detail about how to fix the data. Everyone thought that the more detail we provide the better.

We discussed how to move forward. We could get interns, students and volunteers to fix the data. The question would be to figure out the best way to chunk up the data. Diana mentioned the problem was not just the gaps but the need to merge titleids for the same journal title. Mike L. did not think it was a good idea to merge records using a spreadsheet upload. We also have the problem of matching titleids for the same title that don't have the OCLC number.

In order to figure out the scope of the problem, Mike L is going to produce a simple report of title, titleid and contributor of the serial titles by next week. Bianca will review the data and report back in one month.

Adam moved on to the other part of our charge: determining how to provide article level metadata to aggregators. He has contacted reps from OCLC and Ex Libris and will also contact someone at ProQuest (which has purchased Ex Libris). He has already heard back from OCLC. The problem is that there is no equivalent to KBART for article metadata. His rep was going to find out how other publishers are providing this data.

Lastly, Mike L. and Bianca are talking to DPLA to finalize that metadata process.

November 4

Agenda:

  1. Our plan for meeting at upcoming BHL staff meeting?
  2. Continue our work on BHL KBART Data Problems: Summary and Recommendations. Note, we do not need every KBART element listed in this document. What we need is a description of the problematic elements, suggestions for how to fix them, and estimates for how much work is required.
    https://docs.google.com/document/d/14xQz5mTdg4y9v_12tqk-J0bXOCii0y4HGAsUhh_JdrM/edit#heading=h.9p37z1xt69g8
    Along with a very brief narrative that puts it into the context of the task force scope and purpose

Notes:
At the Staff Meeting we will only have 15 minutes so we will just meet informally--perhaps have lunch together on Day 2.

Adam has been working on streamlining the wiki and our group's report in progress. We have made some policy decisions on what data to include. Some clean up is necessary and we still have some policy decisions to make. We have proposed some workflow changes and have recorded some possible changes to how we format the data for BHL in order for us to capture the data necessary for KBART. These are documented in our report in progress (no. 2 above).

We also need to work on our cost estimates.

If Adam has time, he will start filling in the narrative portion of our report.
October 21

Agenda:

FYI: DPLA has been informed of our new MODS, and they are working on cross-walking and implementing a new harvest!

BLvLs for KBART:

Here are counts of materials with no Bibliographic Level indication, by institution:
Decisions:
If there is no MARC record, we will exclude those titles
We will do one Gemini request for BHL institutions for fixing the BibLevel in the portal
We will do one Gemini request for ingested materials to see if we can make a blanket decision about what to do with these

California Academy of Sciences - 8 - To do: ask them to fix records
Fisher - University of Toronto (archive.org) - 1
Harvard University Botany Libraries - 39
Harvard University, Museum of Comparative Zoology, Ernst Mayr Library - 2
Institute of Botany, Chinese Academy of Sciences - 976, ingest exception for no MARC records, vote to omit from KBART
Internet Archive (archive.org) - 20
Natural History Museum Library, London - 1
Naval Postgraduate School (archive.org) - 1
Royal Botanic Garden Edinburgh - 25
Royal Botanic Gardens Kew, Library, Art & Archives - 19
Smithsonian Institution Archives - 589, all field notebooks with nonstandard MARC that lacked Leaders - can be corrected by blanket decision
UNEP-WCMC, Cambridge (archive.org) - 234, ingest exception for no MARC records, vote to omit from KBART
United States Geological Survey Libraries Program - 6
University of Southampton (archive.org) - 1


As far as the items coded as “collections”, here is the breakdown:
Decision:
Exclude if publication type is not monograph or serial

American Museum of Natural History Library - 1
California Academy of Sciences - 1
Cornell University Library - 41, seed/nursery catalogs
Harvard University Botany Libraries - 1
Harvard University, Museum of Comparative Zoology, Ernst Mayr Library - 3
Internet Archive (archive.org) - 1
Lincoln Financial Collection (archive.org) - 1
MBLWHOI Library - 1
Missouri Botanical Garden, Peter H. Raven Library - 578 = correspondence, seems OK yes?
Natural History Museum Library, London - 2
New York Botanical Garden, LuEsther T. Mertz Library - 11
Research Library, The Getty Research Institute (archive.org) - 1
Smithsonian Libraries - 7
The Bancroft Library (archive.org) - 1
U.S. Department of Agriculture, National Agricultural Library - 224, Suzanne and Bianca to check in with them
University Library, University of Illinois Urbana Champaign - 1
University of California Libraries (archive.org) - 45, unusual (example: http://www.biodiversitylibrary.org/bibliography/19282#/summary)
University of North Carolina at Chapel Hill (archive.org) - 1

Continue to work on our document, BHL KBART Data Problems: Summary and Recommendations
https://docs.google.com/document/d/14xQz5mTdg4y9v_12tqk-J0bXOCii0y4HGAsUhh_JdrM/edit#

Supporting documents:
Our wiki page: DiscoveryTools

KBART test1 file: http://beta.biodiversitylibrary.org/data/bhl-kbart.zip
KBART test1 online:
https://www.google.com/fusiontables/DataSource?docid=1a28t9MlwWFlEzYFHZK7umEXfKxn1YNBtWIPbF0W-

KBART RP:
http://www.niso.org/apps/group_public/download.php/12720/rp-9-2014_KBART.pdf

Action items:

Bianca to create Gemini requests for the Biblevel problems
Mike L. is going to rerun his query to include the source institution in the KBART test data

September 23

Agenda

Discuss draft: BHL KBART Data Problems: Summary and Recommendations



September 9

Agenda

Review KBART test file

KBART test1 file: http://beta.biodiversitylibrary.org/data/bhl-kbart.zip
KBART test1 online:
https://www.google.com/fusiontables/DataSource?docid=1a28t9MlwWFlEzYFHZK7umEXfKxn1YNBtWIPbF0W-

KBART test1 file review notes:
https://docs.google.com/document/d/1BWRYqBBcnBJMt6t4EN4Bsydc5SoKxAq7Vd1_KtRownQ/edit?usp=sharing


Action items:

Suzanne/Adam: Contact OCLC about their KBART requirements
Adam: Summarize data problems with BHL KBART test file for BHL decision makers

August 26
Review action items from August 12:

Bianca: Gretchen's answers to our questions about:

Mike: Generate csv version of DPLA file based on our mapping

August 12
Follow-up on action items :


July 29
Agenda:
Status of action items from July 15 call:

Continue working on DPLA mapping

Action items from call:

July 15
Agenda:
Continue working on DPLA mapping
Action items from call:


July 7
Bianca spoke with Gretchen of DPLA about our BHL to DPLA MODS mapping questions, and Gretchen was very helpful with answers and examples
See our updated google doc to see answers
Gretchen showed Bianca the DPLA internal MODS mapping wiki
She mentioned that Empire State and NCDHC were good MODS mapping examples
HathiTrust and UIUC are good MARC mapping examples
Since BHL derives its metadata from MARC into its own format, which we are mapping to MODS, we need to look at both MARC and MODS examples

July 1
Agenda:
  1. Continue BHL to DPLA mapping: https://docs.google.com/spreadsheets/d/1oIKbrKsuARqPMSayDp_KLPPeQ0o3kJsFTo4vabcS4cM/edit#gid=784960145
    1. worked on this with Mike L and identified some questions
    2. Bianca to follow up with Gretchen of DPLA
  2. If time, compare notes on BHL to KBART test1 results: https://docs.google.com/document/d/1BWRYqBBcnBJMt6t4EN4Bsydc5SoKxAq7Vd1_KtRownQ/edit - no time, did not cover

June 17
Agenda:
  1. BHL to DPLA mapping, see notes in wiki (40 minutes)
  2. Orientation to reviewing BHL to KBART sample file (10 minutes)
  3. Schedule next call (5 minutes)

DPLA


More Background Info
BHL OAI feed info http://biodivlib.wikispaces.com/Developer+Tools+and+API
I found this site useful for reviewing OAI http://validator.oaipmh.com/ enter "http://www.biodiversitylibrary.org/oai" into search field and hit "check now," choose ListRecords MODS option at left, then select XML Result tab
OLD BHL MODS metadata mapping documentation Brief Detailed and MODS displays in BHL 2011_02_11.docx - lipscombb lipscombb May 20, 2015 probably just need to start fresh based on KBART model exercise
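The OAI checks described above boil down to issuing request URLs against the BHL endpoint. A small sketch is below; the helper name is an assumption, and since fetching requires network access, only URL construction is shown.

```python
from urllib.parse import urlencode

BHL_OAI = "http://www.biodiversitylibrary.org/oai"

def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL against the BHL endpoint.

    Empty parameters are dropped; this only builds the URL, it does
    not fetch it.
    """
    query = {"verb": verb, **{k: v for k, v in params.items() if v}}
    return BHL_OAI + "?" + urlencode(query)

# The same request the validator makes for its ListRecords MODS option
url = oai_request_url("ListRecords", metadataPrefix="mods")
print(url)  # -> http://www.biodiversitylibrary.org/oai?verb=ListRecords&metadataPrefix=mods
```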

KBART

KBART test1 file: http://beta.biodiversitylibrary.org/data/bhl-kbart.zip
KBART test1 online:
https://www.google.com/fusiontables/DataSource?docid=1a28t9MlwWFlEzYFHZK7umEXfKxn1YNBtWIPbF0W-

KBART test1 file review notes:
https://docs.google.com/document/d/1BWRYqBBcnBJMt6t4EN4Bsydc5SoKxAq7Vd1_KtRownQ/edit?usp=sharing

May 27
Adam and Bianca touched base with Mike Lichtenberg to bring him up to speed in light of William's absence. Still no word yet on William's return.
Together we reviewed the KBART to BHL mapping and tweaked a couple fields esp. related to VolumeInfo
Mike might be able to use the dbo.Item.Year field to get date_first_issue_online and date_last_issue_online, but as the Year field is not completely or accurately populated, this may need some cleanup, which Cornell may be able to do
Decided that BHL could not reasonably supply number_first_issue_online, as most of our volume numbering is at the volume, not issue, level, so we removed this as a requirement
Some questions came up about understanding what exactly is needed for parent_publication_title_id -- Adam to follow up with a colleague who works with Summon about this field; perhaps it has a relationship to the series statement in MARC 490?
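The dbo.Item.Year cleanup issue could be handled with a filter like the sketch below, which derives coverage dates from raw Year values and flags titles needing manual cleanup. The function name and the four-digit rule are assumptions, not the actual export logic.

```python
def coverage_dates(item_years):
    """Derive KBART date_first_issue_online / date_last_issue_online
    from per-item Year values, skipping blanks and non-numeric noise.

    A real export would query dbo.Item.Year; this sketch takes raw strings.
    """
    years = []
    for y in item_years:
        y = (y or "").strip()
        if y.isdigit() and len(y) == 4:  # keep only clean four-digit years
            years.append(int(y))
    if not years:
        return None, None  # flag: needs cleanup before entering the feed
    return min(years), max(years)

print(coverage_dates(["1869", "", "19??", "1923", "c1900"]))  # -> (1869, 1923)
```

Values like "19??" or "c1900" are dropped rather than guessed at, which matches the decision to fix dates rather than infer them.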
MODS discussion started, BHL has an existing MODS server supplying a variety of sets:
  1. item = This set contains individual volumes hosted by BHL. The content is viewable in BHL.
  2. itemexternal = This set contains individual volumes not hosted by BHL. The content must be viewed on a site not maintained by BHL.
  3. title = This set contains the monographs and journals represented in BHL.
  4. part = This set contains articles/chapters/treatments/etc hosted by BHL. The content is viewable in BHL.
  5. partexternal = This set contains articles/chapters/treatments/etc not hosted by BHL. The content must be viewed on a site not maintained by BHL.
If a value is NULL, then no MODS field is passed
BHL MODS currently does supply <access condition> field where "Copyright Status" value is available, see
http://www.biodiversitylibrary.org/oai?verb=GetRecord&metadataPrefix=mods&identifier=oai:biodiversitylibrary.org:item/182973
Mike points out that <enumerationAndChronology> maps to the VolumeInformation field
Bianca to start DPLA to BHL MODS mapping and check in with group for next call

April 30
In the meeting we finished our initial crosswalk. Bianca will see if she can answer the handful of outstanding questions. Then we will ask William how much time he needs to generate a test KBART file that we can collectively review.
Questions
  1. What is the relationship between the Title.PublicationDetails field and Title.Datafield_260_b? A: The PublicationDetails field concatenates Datafield_260_a through c; it is a holdover from Botanicus. Confirmed that we should use Title.Datafield_260_b for the KBART publisher_name field
  2. How are genres being coded for field books in BHL? A: Smithsonian Archives field books do not have the Genre fields filled out; Connecting Content field books use the Genre = "Book" (based on the small sample I reviewed)
  3. What is a good field to use to populate the KBART "date_monograph_published_online" field? A: Use the dbo.Item.CreationDate field which is the date the record is created in the BHL database. For date scanned we'd have to pull from IA.
  4. What data goes into the TitleAssociation.Relationship field? A: It pulls Subfield g from one of these MARC fields: 770, 772, 773, 775, 777, and 787
- lipscombb lipscombb May 8, 2015 I have added this information to our KBART gdoc
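The subfield $g extraction described in answer 4 can be sketched against MARCXML with the standard library; the function name and sample record are illustrative, not BHL's actual ingest code.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"
# The linking-entry fields that feed TitleAssociation.Relationship
LINKING_FIELDS = {"770", "772", "773", "775", "777", "787"}

def relationship_values(marcxml):
    """Pull subfield $g from the MARC linking-entry fields above."""
    root = ET.fromstring(marcxml)
    values = []
    for field in root.iter(MARC_NS + "datafield"):
        if field.get("tag") in LINKING_FIELDS:
            for sub in field.findall(MARC_NS + "subfield"):
                if sub.get("code") == "g":
                    values.append(sub.text)
    return values

sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="780" ind1="0" ind2="0">
    <subfield code="g">ignored: 780 is not in the list</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2=" ">
    <subfield code="t">Nature</subfield>
    <subfield code="g">Vol. 67 (1902-1903)</subfield>
  </datafield>
</record>"""
print(relationship_values(sample))  # -> ['Vol. 67 (1902-1903)']
```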

Next meeting will be dependent upon William's schedule - how much time he needs to create the KBART test file.

April 2
Agenda:
Compare NISO RP-9-2014, KBART Phase II Recommended Practice (79 pages) to what BHL currently offers:
http://www.niso.org/apps/group_public/download.php/12720/rp-9-2014_KBART.pdf

BHL Metadata Overview
BHL Metadata Schema Doc http://www.biodiversitylibrary.org/data/BHLExportSchema.pdf
Title
MARC
BHL rec
Data Export samples
Nature
https://ia902604.us.archive.org/fetchmarc.php?path=%2F1%2Fitems%2Fnature6719021903lock%2Fnature6719021903lock_marc.xml
http://biodiversitylibrary.org/bibliography/40302

Annals and Magazine of Natural History
https://ia700406.us.archive.org/fetchmarc.php?path=%2F1%2Fitems%2Fannalsmagazineof211848lond%2Fannalsmagazineof211848lond_marc.xml
http://biodiversitylibrary.org/bibliography/15774

On the Origin of Species
https://ia601407.us.archive.org/fetchmarc.php?path=%2F30%2Fitems%2Fonoriginofspecie1878huxl%2Fonoriginofspecie1878huxl_marc.xml
http://biodiversitylibrary.org/bibliography/4183

Astragalus darwinianus (Leguminosae, Galegeae), a new species from Argentina

http://biodiversitylibrary.org/part/17539


Notes:
KBART Phase II Comparison to BHL: https://docs.google.com/spreadsheets/d/1oIKbrKsuARqPMSayDp_KLPPeQ0o3kJsFTo4vabcS4cM/edit?usp=sharing

Questions...


March 13
We started to divide up the liaison responsibilities, but then concluded that we all need to understand BHL data better before reaching out to vendors. When we do get ready to talk to vendors, we divided up responsibility as follows:

Bianca: DPLA
Adam: ProQuest
Diana: OCLC
Sarah: Ex Libris
Matt: EBSCO

Action item: Suzanne: Check on status of discontinuing OCLC OAI harvest
Next call Date/time: April 2, 2:00-3:00 Eastern

Agenda for next call:
- Bianca and William will lead the group through a review of the available BHL export file options (http://biodivlib.wikispaces.com/Data+Exports).

- As prep for the April 2 call, please read NISO RP-9-2014, KBART Phase II Recommended Practice (79 pages):
http://www.niso.org/apps/group_public/download.php/12720/rp-9-2014_KBART.pdf
amy@dp.la: ISSN question