BHL Technical Meeting Break-out Notes
Sep. 27, 2012
Attendees
Keri Thompson, Mike Lichtenberg, Trish Rose-Sandler, John Mignault, Chris Freeland, Joe deVeer, Joel Richard, Frances Webb, Jenna Nolt, William Ulate, Martin Kalfatovic, Connie Rinaldo
Introduction
William introduced the goals for the Tech Meeting and went through the planned topics in the Agenda:
https://docs.google.com/spreadsheet/ccc?key=0AgZbYGIZHhuYdGpOQ1V3WnlEcTZMYzhMRk12V1Vja3c
Martin explained the plans for the sustainability of BHL. If funds are available, he expects the Technical Team at the Missouri Botanical Garden to keep supporting and developing the software. In the meantime, a maintenance plan is being developed for after 2013, in case we don’t get funds to keep developing further at MBG.
Martin mentioned he has coordinated with IT staff at the Smithsonian to look into a Fedora installation at SIL where we could keep a copy of the content. He expects to have the specs before the Christmas holidays, to start moving content by spring and, if storage for the content is available on Jan 1st, to have all 90 TB of BHL content in the Fedora installation by the end of 2013. He also pointed out that as we approach other nodes, we are making clear that IA is our staging area and they should upload content there.
Replication of the BHL site is also going to happen at the Smithsonian, where they are looking to keep a copy of the BHL metadata, not as a mirror nor a failover, but as a kind of “cold failover”.
Connie asked if there is still going to be a copy in place in London. The answer was yes, although they have not updated their content since they got the first shipment of content and finished copying from the cluster. They are not uploading their content to IA yet either.
John Mignault mentioned the work of Ben Brumfield, who works on manuscript transcription (http://manuscripttranscription.blogspot.com/), as a potential way to get the content transcribed. For transcribing, Martin and Keri mentioned that Trove is pretty much the solution they like the best.
Article-ization
Mike presented this topic. Up to now, BHL has only had monographs and serials; we now want to store articles, chapters, etc. (that is, any piece of an item). Internally, we call them segments. We are getting Rod Page’s clean and useful data on 70,000 articles to start with. The model is in place, and we are already harvesting the data now. Article-ization is also part of the work that we are doing with Simon. Authors, keywords, names and other tables are now associated with articles as well as books; this was done during the summer. We have tables to store the segment name, genre, status (so we can take an article offline), authors and keywords. We allow segments to cluster together, so if we get the same article from different sources, you can assign one to be the main one to show in the UI. This approach has also helped to conceptually store treatments. Chris asked if the clustering is automatic. Mike explained that there is no algorithm in place, so it’s going to be semi-automatic, and said he’s looking for ideas and approaches to solve the problem. There are nice algorithms to compare documents, created by joining fields; what he has not tried is fuzzy matching. Frances explained that it may go wrong when the same author publishes the next segment with the same title, and Connie mentioned she has examples where an author publishes the same title over and over. Frances said she might be able to gather use cases, and Mike would appreciate any use cases that could be sent his way. Chris mentioned that in assigning DOIs we are finding that Springer or others have already assigned DOIs, and CrossRef doesn’t provide a service to detect this. Mike thought that CrossRef would indicate the duplication, but Chris explained it might not. Connie mentioned a similar problem with the Smithsonian and Harvard content.
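As a starting point for the semi-automatic clustering discussion above, a minimal sketch of fuzzy matching on a joined field string, assuming Python and the standard library's difflib; the field names, records, and 0.90 threshold are illustrative, not part of the BHL data model.

    from difflib import SequenceMatcher

    def segment_key(segment):
        """Join the comparison fields into one normalized string."""
        fields = ("title", "authors", "container_title", "volume")
        return " ".join(str(segment.get(f, "")).lower().strip() for f in fields)

    def likely_same_segment(seg_a, seg_b, threshold=0.90):
        """True when two harvested segment records look like the same article.

        SequenceMatcher gives a 0..1 similarity ratio; the threshold would need
        tuning against real data, and (as Frances noted) a series where the same
        author reuses the same title will still cluster incorrectly.
        """
        ratio = SequenceMatcher(None, segment_key(seg_a), segment_key(seg_b)).ratio()
        return ratio >= threshold

    # Two illustrative records for the same article harvested from different sources.
    a = {"title": "Notes on the genus Carex", "authors": "Smith, J.",
         "container_title": "Botanical Gazette", "volume": "12"}
    b = {"title": "Notes on the genus Carex.", "authors": "Smith, J",
         "container_title": "Botanical Gazette", "volume": "12"}
    print(likely_same_segment(a, b))   # True for these near-identical records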
Mike explained how the articles would be addressed in the new UI and how they will be shown in the search. Chris asked if we can get the PDF of the article. Frances suggested being able to do either: jump to the first page of the article, or download the PDF.
Martin mentioned that it is popular to get only the first 3 pages. Mike mentioned that we are getting 700 PDFs a week. Frances said that having a PDF of the article or chapter would help, and the use of the PDF-generation function could decrease. On the question of whether we need to keep a PDF if we have the page range of the article, Chris explained that there might be a need for the duplicated title. There used to be a 100-page limit. There are a lot of pieces of articles getting downloaded. iText is being used to create the PDFs. Frances explained that in their case, using a tool allowed them to make the PDFs smaller. A Windows server is running ABBYY and Acrobat, and both programs share the storage. ABBYY has the option to monitor a folder. She has also set up batches that run 7,000 PDFs in the background, and they just come back to get the results later. Frances indicated that it pays to have smaller PDFs, and their whole PDF generation and delivery is automatic. Mike mentioned that the current turnaround is 3 minutes from when the PDF is requested to the time it gets sent, and that includes the generation process polling every 2 minutes and taking the content from IA.
Joel explained that ImageMagick does work with PDFs, and tiffcp can copy multiple TIFF files quickly.
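To make the PDF discussion concrete, here is a minimal sketch of extracting an article's page range with ImageMagick's convert (which needs Ghostscript for PDF input); the filenames and page numbers are placeholders, and note that, unlike iText, this rasterizes the pages.

    import subprocess

    def extract_pdf_pages(src_pdf, first_page, last_page, dest_pdf):
        """Pull a page range out of an existing PDF via ImageMagick.

        ImageMagick's page index is zero-based, so pages 5-9 become [4-8].
        This rasterizes the pages (larger output, no text layer); a PDF
        library such as iText keeps the original page objects instead.
        """
        page_spec = "{}[{}-{}]".format(src_pdf, first_page - 1, last_page - 1)
        subprocess.check_call(["convert", page_spec, dest_pdf])

    # Hypothetical item identifier and page range, for illustration only.
    extract_pdf_pages("mobot31753000022803.pdf", 5, 9, "article_p5-9.pdf")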
There’s a consideration when submitting article metadata: Frances recommends submitting serials with article metadata. Mike explained that there has been some exchange about updating article metadata at Internet Archive. Frances said that when someone finds there is something wrong with the metadata, they could correct it, but there should be a way to update it.
Citation
Mike mentioned that citations are also going to be stored in BHL. Some citations in Citebank may not have any content, others could point to other repositories, and others could point to content we have. The issue here is that Global Names wants to receive a blob of text and interpret that; Mike would appreciate any input on this. Frances said that we can’t process that if we can’t guarantee the same citation mechanism; we might just become a store for the text blobs. But Chris explained that we need to break up the text blob and convert it into an article record. This is the problem of deduplication.
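As a rough illustration of what “breaking up the text blob” could involve, a hedged sketch that matches one common citation shape; real Citebank blobs are far messier, and the regular expression, field names, and example string are for demonstration only.

    import re

    # Matches one common shape: "Author(s). Year. Title. Journal Volume: pages."
    CITATION_RE = re.compile(
        r"^(?P<authors>.+?)\.\s+"
        r"(?P<year>1[5-9]\d{2}|20\d{2})\.\s+"
        r"(?P<title>.+?)\.\s+"
        r"(?P<journal>.+?)\s+"
        r"(?P<volume>\d+)\s*:\s*"
        r"(?P<pages>\d+(?:-\d+)?)\.?\s*$"
    )

    def parse_citation_blob(blob):
        """Return a dict of article fields, or None when the blob doesn't match."""
        match = CITATION_RE.match(blob.strip())
        return match.groupdict() if match else None

    print(parse_citation_blob(
        "Smith, J. 1912. Notes on the genus Carex. Botanical Gazette 53: 101-110."
    ))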
The other topic is the possibility of uploading the content and the safe harbor consideration. Should we upload it to IA? Martin explained that for safe harbor we need an institution to get involved and designate a responsible person. We can upload anything that we can legally upload. The EFF used to be interested, but they do not want to get involved internationally.
Action Item: From a technical point of view we would have things uploaded into a vetting area.
gBHL Topics
Mirroring and synchronization
How do we handle it? With or without the cluster?
Content from China - some in copyright. OCR not attempted for text with CJK characters.
Content from Brazil - Portuguese characters present OCR problem - bad quality OCR. Add English as a way to recognize species names.
Anything that IA doesn’t recognize isn’t getting OCR’d.
Latest decision is to use IA as the source for mirroring. Need to find a place to store metadata. What happens if we both update content at the same time?
We can put anything we want on IA - files are basically ignored by IA. Problem: IA uses a flat file structure, so we cannot nest folders. We would have to create a flat file structure containing a file that maps the location of metadata.
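A sketch of what such a mapping file could look like, written from Python as JSON; the `_bhl_*` filenames and the `bhl_file_map.json` name are invented for illustration (IA's own `identifier_marc.xml` / `identifier_meta.xml` naming is the pattern being imitated).

    import json

    # Hypothetical manifest: since IA items are a flat list of files, any
    # "folder" structure has to be encoded in the filenames, and this map
    # records which flat file holds which kind of metadata.
    file_map = {
        "item": "mobot31753000022803",
        "files": {
            "marc": "mobot31753000022803_marc.xml",
            "mods": "mobot31753000022803_mods.xml",
            "page_metadata": "mobot31753000022803_bhl_pagemeta.xml",
            "segment_metadata": "mobot31753000022803_bhl_segments.xml",
        },
    }

    with open("bhl_file_map.json", "w") as fh:
        json.dump(file_map, fh, indent=2)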
Ideal to expose data from the BHL portal (API?) for sync.
How do we synchronize files and metadata across nodes? For example, when one node merges titles, how is this shared with other nodes? How share pagination metadata? Tabled for a later discussion.
Question: How much content will we get from BHL Europe? Will we get it from NHM? Also, Egypt, Brazil, Africa, etc. How do edits done at one node get propagated to the others?
Small changes like page inserts - IA not specific about where in the file the change has occurred. Best to replace entire file (images and metadata) rather than trying to find the change and replace the specific files. Algorithm to interrogate files to detect changes might be possible.
No automated solution to detect page metadata changes when there are page inserts. Page metadata gets thrown off. IA does not use checksums. IA uses the original item-level identifier for re-scanned items. A change in the file can only be identified by the date of the file.
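A date-based check along those lines might look like this sketch, which assumes the archive.org metadata endpoint (https://archive.org/metadata/<identifier>) and its per-file mtime values; the `last_seen` mtimes are assumed to come from BHL's own records.

    import json
    import urllib.request

    def changed_files(identifier, last_seen):
        """Return (filename, mtime) pairs newer than what we last recorded.

        `last_seen` maps filename -> unix mtime from our own database; the IA
        metadata endpoint returns an mtime for each file in the item.
        """
        url = "https://archive.org/metadata/{}".format(identifier)
        with urllib.request.urlopen(url) as resp:
            item = json.load(resp)
        changed = []
        for f in item.get("files", []):
            name = f.get("name")
            mtime = int(f.get("mtime", 0) or 0)
            if mtime > last_seen.get(name, 0):
                changed.append((name, mtime))
        return changed

    # Illustrative call: an empty history means every file shows up as changed.
    print(changed_files("mobot31753000022803", {}))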
How do we work within the limitations of IA? Primary problem is that our content is on a system that we have no control over.
Martin: might be able to harvest back from Europeana for BHL Europe content. Content resides at NHM? Or perhaps they plan to harvest from BHL-US?
Action Item: Martin and William to talk to Egypt.
How do we secure storage?
What size is BHL? 90 TB. Need startup number only.
Current cluster cannot be used - out of date.
Action Item: Martin and William to talk to Nathan and Bob Corrigan from EOL about syncing the cluster. There is some money for syncing the cluster, but not a lot.
Global Names
Name Finding Algorithm
MBL staff have improved the name-finding algorithm, to improve the ratio of names found. More IDs besides the EOL ID - can link to external resources. The structure of synonyms and related names can be included in the results. The original plan was to be done by October, but the algorithm is finding too many false positives. It goes through the text, compares names to NameBank or TaxonFinder, and links to the results. Work will resume in January. Want to find a better solution; gaming is a possibility. Still possible to get better content.
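Not MBL's algorithm, but a toy sketch of the general approach (scan text for candidate names, then check them against a known-names source) to show where false positives come from; the regex and the `known` set stand in for a NameBank/TaxonFinder lookup.

    import re

    # Candidate Latin binomials: a capitalized genus followed by a lowercase epithet.
    BINOMIAL_RE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")

    def find_names(text, known_names):
        """Yield (genus, species) pairs confirmed against a known-names set.

        The lookup against `known_names` (standing in for NameBank/TaxonFinder)
        is what filters out false positives such as ("Museum", "specimens").
        """
        for genus, species in BINOMIAL_RE.findall(text):
            if (genus, species) in known_names:
                yield genus, species

    known = {("Quercus", "alba"), ("Homo", "sapiens")}
    page = "Museum specimens of Quercus alba were examined and figured."
    print(list(find_names(page, known)))   # [('Quercus', 'alba')]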
Where do found names get written to? Mike L.: possibly to same name table we’re using now.
Mapping Articles to Items (also Disambiguation?):
In the case where we don’t have an item, we will selectively create a stub item for the article info. Articles can also stand alone, but something like chapters can’t and would need to be grouped together. If we later get the item, we need to associate it with the stub item. It is still somewhat an open issue whether we create the stub item.
If there is no stub parent, it may not matter for articles, but it does matter for chapters. If we have the item, should we keep the article as it came in (e.g., as a PDF)? The metadata of an article can be different from person to person or system to system. Should we go with (prefer) the first source of metadata? So multiple segments (articles) will be listed for the item, but only one would make it to the table of contents. Right now in Citebank, people can indicate which pages make up the article, but it’s not known whether that is correct or incorrect.
Did Bianca and Trish do an analysis of the PDFs generated and how many were accurate or not? They looked at things that had a title. The results were pretty good.
If the “crowd” is creating PDFs for articles, could the crowd be the ones who vet those articles as useful or not? It is not possible for us to manually vet all of the “articles” that come through as PDFs. Cornell has a variety of people doing article-ization with pretty good results from relative non-experts.
Could we have a way of allowing Person 2 to edit an item that Person 1 created, making a just-in-time correction to metadata that Person 1 entered incorrectly? This is sort of a wiki model. Wikipedia, by comparison, is a system that needs a certain amount of education to make edits to the content. Similarly, our users would need to be educated to make edits to the “wikified” metadata.
NEH Art of Life Project:
Trish explained about the NEH Art of Life Project that goes from May 2012 to May 2013 to discover images in the BHL digitized items. Right now identifying images is being done manually and we want to increase that by using an automated approach.
Trish shared the workflow diagram of the Art of Life project (https://docs.google.com/drawings/d/109a5pnE2FZgG2M-IpwxVGNJQMe10FDTJM-jPOmqBSng/edit) and explained that 4 stages are considered. In the diagram, each stage section explains who the users are, what the functionality is, the platform, and the developers of the tools used.
For the Classification stage, two tools have been mentioned: the paginator or Macaw. BHL Staff will be doing this Classification stage.
The Description stage plans to crowdsource the descriptions. Flickr and Wikimedia Commons are being considered.
The last stage is sharing the images to platforms like EOL, ARTstor and iTunes U.
The schema developed fits into the Extract phase. Flickr has some limitations, Wikimedia Commons allows for more information.
Trish explained the schema in further detail, indicating which elements are mandatory; it uses the VRA schema as a basis, plus Darwin Core.
Keri asked if it was mapped into IPTC. Trish answered it would be discussed next week in our face to face meeting.
Frances asked if there’s a platform. Trish said that it’s true some of the information is in the book or the surrounding pages.
Frances asked if we wanted to slice and dice the multiple images on plates. Trish answered that is exactly one of the questions she wanted to present to the group meeting next week.
Frances offered that some of the books at Cornell are already paginated, indicating which pages are illustrations.
Joel asked if the algorithm was used. Mike explained it was.
Martin suggested using the schema first in Wikimedia Commons and pushing it to Flickr. Trish explained that Rob Guralnick’s PhD student looked into the editing in Wikimedia Commons.
Chris asked about the source of the elements in the schema. Trish explained that all elements in the schema are borrowed from an existing schema. Chris recommended mentioning the namespace being used for each element.
Trish asked if we should include only plates or not.
Frances related that they used the amount of OCR text on the page to find little thumbnails.
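A minimal sketch of that heuristic, assuming the per-page OCR text is already available; the 200-character threshold is invented and would need tuning, and mixed pages (like the Darwin’s finches example mentioned below) will be misclassified either way.

    def looks_like_plate(ocr_text, max_chars=200):
        """Heuristic: pages that are mostly illustration carry very little OCR text.

        The threshold is a guess; plates with long captions or text pages with
        poor OCR will be misclassified, which is why a review step still helps.
        """
        return len(ocr_text.strip()) < max_chars

    def candidate_plates(pages):
        """pages: iterable of (page_id, ocr_text) pairs; returns likely plate IDs."""
        return [page_id for page_id, text in pages if looks_like_plate(text)]

    # Illustrative input: a text-heavy page and a nearly empty (probable plate) page.
    print(candidate_plates([
        (101, "Chapter II. On the distribution of the genus. " * 20),
        (102, "Plate VI."),
    ]))   # [102]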
Martin explained there are two competing interests: one to satisfy the scientists and the other to satisfy new audiences that are not interested in the diagnostic characters.
Darwin’s finches is an example of an image that has both text and images and would be lost. Currently, this would not be included in Flickr.
Martin suggested we might need to fork into the scientific interest.
Trish explained that one consideration is that ABBYY might have issues with images that don’t have well-defined borders.
Another question is whether we should push images into Flickr automatically (accepting certain false positives) or classify them first. The feeling was that we should push them automatically.
This project was initially planned as a two phase project.
Trish said that another question is whether we want to include this metadata in BHL. The answer in the room was yes.
Gilbert indicated that anything that has been manually paginated shouldn’t be overwritten.
We should take the manual pagination on top of the automatic pagination, but Mike indicated he has no auditing at that level. Using a user ID that indicates automated identification could probably help solve this problem.
Gilbert indicated that the process of constructing the collections should be done manually and not all at once. The decision was to have a separate Flickr account for the massive unrevised content.
If there are time limits we may need to throttle the process of uploading content.
Martin said he expects that Flickr might see a rebirth with the new CEO at Yahoo!.
Martin would like to see some tags, like the contributing library, pushed as machine tags so as to facilitate the manual process.
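Flickr machine tags use a namespace:predicate=value convention; a sketch of how that could carry BHL fields, where the bhl: namespace and the predicate names are invented for illustration rather than an agreed standard.

    def machine_tags(record):
        """Build Flickr-style machine tags (namespace:predicate="value").

        The bhl: namespace and the predicate names are illustrative only;
        values with spaces are quoted, as Flickr expects.
        """
        mapping = {
            "contributinglibrary": record.get("contributing_library"),
            "titleid": record.get("title_id"),
            "pageid": record.get("page_id"),
        }
        return ['bhl:{}="{}"'.format(k, v) for k, v in mapping.items() if v is not None]

    print(machine_tags({
        "contributing_library": "Missouri Botanical Garden",
        "title_id": 12345,
        "page_id": 6789012,
    }))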
Macaw:
Requirements and dependencies:
Joel:
- Linux (it should work in theory on Windows); you need the ImageMagick libraries to do JPEG2000, otherwise it won’t be able to upload to IA. The YAZ extensions for Z39.50 avoid the need to type the MARC record manually into Macaw, which processes it into MODS. Jenna said she had problems because of a lack of documentation. Frances also pulls the MARC and uses it as authoritative data.
Reqs:
Apache, PostgreSQL (but it can be modified to use MySQL). YUI is used for the review page. There’s an issue with lossless JPEG2000. If you are running on Windows, you might need to run Cygwin, a UNIX shell and set of UNIX libraries that have been compiled for Windows.
Macaw was originally developed on a Mac, but it is better to run it on a virtual machine. There’s a virtual machine almost ready that can be run on VMware or VirtualBox; the base OS is Ubuntu.
Action Item: Joel will review the installation; there is minimal configuration, but it has not been tested.
Virtualization:
Martin and others want to set up Macaw so that those who don’t have the resources to install it can still use it. Museum Victoria (Simon and Joe) have provided a way to upload through the web and changed the workflow to be more flexible. They are testing the model of one site managing it and several users accessing it and uploading. The question is space; hosting it on paid infrastructure might get costly soon.
As soon as Museum Victoria is done, it will be uploaded to GitHub. At the Smithsonian they are using a 2 TB disk and doing fewer than 200 items a year, moving 6-8 TB a year. We might need to purge often. We are uploading the compressed JP2; if you provide TIFF, it gets converted to JP2. Time is not the critical issue, space is.
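The TIFF-to-JP2 conversion step can be sketched as a call out to ImageMagick (which Macaw’s requirements above already assume); the folder name is a placeholder and ImageMagick must be built with JPEG 2000 support.

    import os
    import subprocess

    def tiff_to_jp2(tiff_path):
        """Convert one TIFF to JPEG 2000 by shelling out to ImageMagick's convert.

        Assumes `convert` is on the PATH and has a working JP2 delegate
        (the JPEG2000 dependency noted in the Macaw requirements above).
        """
        jp2_path = os.path.splitext(tiff_path)[0] + ".jp2"
        subprocess.check_call(["convert", tiff_path, jp2_path])
        return jp2_path

    # Convert every TIFF in a hypothetical scan folder before upload.
    for name in os.listdir("scans"):
        if name.lower().endswith((".tif", ".tiff")):
            tiff_to_jp2(os.path.join("scans", name))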
Frances asked if the interface would get laggy. A JPG and a thumbnail are created first for browser-friendly display, and it’s fast either locally or remotely.
It usually takes 30-40 minutes to enter the metadata, including the pagination.
Frances mentioned that they have a cascading button to keep values in the following step. They do 4 simple volumes an hour.
Jenna indicated she checks everything before uploading. Keri mentioned they use pagination as another QA.
Joel mentioned there’s a demo at: macaw.joelrichard.com with demo/demo.
We took a look at it and Joel ran us through the demo.
It saves every minute. Museum Victoria is doing some bug fixing with the saving.
The new interface has been improved by Museum Victoria: macaw-mv.joelrichard.com
Frances indicated that, in order to review, they might need a popup window that opens a bigger view.
Australia has done article capturing; others have done captioning. Australia added institutions to associate users with, and therefore determine the default collection and other data. They also added a Flash uploader and PDF handling.
Jenna asked if deleting the files was difficult. Joel indicated it was delicate, but easy. There has been a request to delete from the Admin interface and it might be coming sooner rather than later.
Because IA is flaky, Joel has not implemented a deletion function but does it on a periodic basis.
Jenna mentioned the ratio of TIFF to JPEG is ⅛. Joel added that uploading the JP2 takes 3-5 minutes at most; using JP2 reduces that time.
Action Item: John Mignault suggested that we could look for grants, given that it makes sense to have a single Macaw. We should form a committee to put up a copy of Macaw at Amazon or some other provider: John Mignault, Joel Richard, William Ulate.
Code extension / fork:
Joel mentioned that the metadata part can be customized: a tab appears underneath and it can be added. We can create
Joel mentioned that the code is in PHP, the selecting part is a lot of JavaScript, and the JSON structure is proprietary. The code was forked by MV; it will be merged into one.
Jenna indicated that they have customized their version of Macaw. She believes it has been done using version control. In that case, Joel mentioned it might be easy to merge.
Joel mentioned that there might be changes made to their own export.
Jenna pointed out that they made changes to the . Author and title were being asked for, even when they were included in the MARC. The source file is under Macaw, but the changes made are not under version control.
Frances mentioned that they usually don’t trust extracting the MARC data for author and title. They are building a new interface for their OPAC and therefore need to process millions of MARC records. This has made it evident that some fields (for example the 245p subfield, the partial title), if correctly catalogued, could make catalogue searching much more “cool”.
Action Item: Joel would like to find somewhere to put it up and to set up a list for sharing information about Macaw.
New Functionality Requested
Full Text Searching: (William)
Mike explained this is a long-standing request and it depends more on infrastructure than on technical issues. Solr seems to be the way to go, and Mike said he could maintain it but would like to discuss the possibility of setting it up in other places. In the case of the index, the initial load is the part that would be more time-consuming, but you can use the same service. Mike indicated that the load is 200 items a week. The Solr installation doesn’t need to be on the same machine. BHL Australia has done a proof of concept with Solr.
Chris said seed cataloguing is one of the priorities decided on the Staff Call.
Mike asked how quickly a fast-and-dirty version could be done. Frances indicated the installation is not hard: all you need to do is define a schema and decide what you want to index; it works with web services, and it’s easy to submit documents and get back XML with the search results. It weights your search terms (by default according to how the field compares with the whole collection). You can single out a field or even a source and give it precedence.
Frances explained her experience with Solr. She assigned a record to each page, and clicking on the initial search produces more detailed results on a second search. It also helps browsing, but each search looks for certain types of records. One thing that is very useful is the faceting capability that Solr has to give you a summary.
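A sketch of the kind of weighted, faceted query being described, sent to Solr’s standard /select handler; the host, the 'bhl-pages' core, and the field names are placeholders rather than an existing BHL index.

    import json
    import urllib.parse
    import urllib.request

    def search_pages(query, rows=10):
        """Query a hypothetical Solr core of OCR pages and return the JSON response.

        qf weights title matches above OCR text, and the facet on
        contributing_library is the kind of summary mentioned above.
        """
        params = urllib.parse.urlencode({
            "q": query,
            "defType": "edismax",
            "qf": "title^4 ocr_text",
            "rows": rows,
            "facet": "true",
            "facet.field": "contributing_library",
            "wt": "json",
        })
        url = "http://localhost:8983/solr/bhl-pages/select?" + params
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    results = search_pages('"Quercus alba"')
    print(results["response"]["numFound"])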
Mike asked if anybody knew about the differences between Windows and Linux. Nobody in the room has done it, but it looks like it has been done elsewhere.
This makes obvious the need for a cluster.
Frances mentioned they might be able to get a server if Cornell decided this would be a project they wanted to support. They are running everything virtualized on the Enterprise version of Red Hat Linux because of their preferences (except for the servers that need to be Windows). Cornell is not a paying member, but it may consider the issue as an in-kind contribution.
Trish recommended considering both the technical and the political issues, particularly if the member retires. Frances recommended having it at an external service provider. John indicated it may be better to have it outside of the member institutions for better control.
Frances estimated (with a certain error margin) that the size of the OCR is about 100 GB for 40 million pages (average size for a page: 2,300 bytes).
Full text search (Trish)
Solr index – can assign weights to different fields (title vs. subject); can also say content from this contributor comes up higher. Solr gives you a summary.
Q: where to build it? Linux vs. Windows differences for implementing? Better to do it on Windows for MOBOT because they are more comfortable with supporting it. Not aware of any challenges with either OS. Maybe look at Australia’s option.
The cluster has not been updated for some time and we don’t have staff behind it at MBL.
Frances said Cornell has more solid infrastructure and might be a candidate. Running all virtual machines currently.
Current search is on metadata only and is part of SQL Server. Don’t think the relational db could support full-text searching. We’re on SQL Server 2005.
OCR improvement – the OCR is generated by IA. We can’t modify 1/3 of the collection. It is done for the whole item and separated by page; it is brought into MOBOT and separated. Some projects have tried to use our OCR but then realize it’s not so clean.
Not so much how to improve the OCR, but if other folks have better OCR to offer us, how can we incorporate it into the workflow? IA could override it if they needed to. We have copies of the OCR at MOBOT as text files. It’s not worth sending better files back to IA if they could override them at any time; keep them out of the equation. How do we share it publicly? Do it out of BHL instead. If we create a new file they may not be able to override it. We’ve discussed this a lot. OCR corrections could come through annotations or crowdsourcing.
Workflow issue – how does corrected OCR become a part of BHL content.
Getting dependable crowdsourced data allows for delivery of books without page images. If we follow Gutenberg’s model we need a fanbase. Flickr attracts new audiences. What are the statistics of current users?
Technically easy, but it takes a lot of manpower to manage the users. People go to Gutenberg because it’s proofread (known quality) – this is important for attracting audiences.
Once corrected, how do you incorporate the corrections back into the eReader formats? These are hosted at IA.
If we did an OCR proofreading project, it would be nice to have the illustrations included. What about the PDF? That would be nice to include too.
Need someone to coordinate efforts to alert folks when a popular book is almost done.
Since all derivative files come from the OCR at IA, we need to verify whether IA overrides the OCR file (we think they do, but we need definitive proof). If you make a change to the OCR, it triggers a re-derive in IA.
Do we then store and serve our own files? IA is a curse because of this, and a blessing because they do a lot of stuff we don’t have the infrastructure for.
Acrobat vs. LuraTech? To generate PDFs and OCR. How fast can Acrobat produce files? It is very cost-effective to use IA to generate the files.
The cluster would provide a tech base to store what we need to supplement what’s at IA – a framework for doing our own thing.
IA has allowed us to scale up quickly since we don’t have the people and tech support to do it ourselves. Let’s try to utilize the IA services to make this work. We could store the corrected OCR locally, but
Frances looked at her OCR file sizes for 2.7 million pages and, based on that, we guess we would need a total file size of 86 GB (English language only); foreign languages require more bytes to encode, so the estimate is 100 GB for 40 million pages.
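The arithmetic behind those figures, as a quick back-of-the-envelope check (decimal gigabytes assumed):

    # Back-of-the-envelope check of the OCR storage estimate.
    bytes_per_page = 2300          # average OCR size per page from the sample
    pages = 40_000_000             # approximate BHL page count
    total_gb = bytes_per_page * pages / 1e9
    print(round(total_gb))         # ~92 GB, in line with the 86-100 GB range above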
Parked Ideas
Action Item: Full-text searching should be the first priority. It’s an expected baseline functionality for any text project.
Action Item: Frances is not sure what resource commitments Cornell can make to BHL. Marty, Frances, and her supervisor need to talk and discuss how they could help.
How to track IA updates and changes? It is much easier if we have our own database to store and serve all the files. This comes into play for replication. It is going to have to be date-based.
Replication at London was started but never finished online. It was finally copied to hard drives and mailed over. Similar issues with Egypt – they have 102,000 items so far.
Need to solve these issues in terms of solving some other task – e.g., find a way to handle the insertion of pages.
Keri feels that maybe we should not depend on IA (it should only be a staging area and not a repository). The problem is that, globally, IA needs to be the primary source for copying files.
Action Item: We are planning a backup at SI for BHL material by 2014. It will happen in 2 stages: 1) copy the metadata, 2) copy the files.
New Ideas/Next steps
Action Item: Full text search yes, OCR maybe not yet.
At the EOL semantic reasoning workshop they expressed a desire to use BHL content for projects. The OCR is not good enough, but the names are. They need the literature content to carry out the project.
John wonders whether, if we took corrected OCR from someone, we would only update the portal and not update IA. He also wondered where to download all the OCR from: at IA it has to be done book by book, and they want all the data at once. We have a significant dataset now that could be good for NLP.
If we have better OCR we could correct it in IA or not.
We can only get the OCR page by page via the API, not all 40 million pages at once.
The cluster’s purpose is getting very muddy – backup server, espresso maker, etc. – and we need to clarify what we need it to do. We need an infrastructure database.
Frances has a tech issue with the portal: Serials Solutions is grabbing our metadata and making bad MARC records. They are grabbing our custom files. Should they be grabbing the MODS? Does MODS represent the holdings info?
Parking Lot
Action Item: How to track IA updates and notice changes in IA. (For articles and more?)
Particularly important as we start having other nodes updating the content. The solution has to consider the possibility of changing things that are not in the biodiversity collection.
Action Item: How do we know what has changed in IA in terms of notifying other repositories (for replication purposes)?