TechCall_25apr2016
Agenda
- Review 5-6 requirements for basic full text search, submitted by all
- Discuss updates to User Feedback form on the website (i.e., adding additional categories)
- Contributor field - change request from EABL on how used/assigned
For Full Text Discussion
From Mike:
General Wish-List
- Full-text search over both metadata and text (OCR/transcriptions)
- Search across all text
- Search within text of a single book - IA has this
- Search using words or phrases – (Boolean? Phase 2 or 3)
- Results should include variations (similar spellings, etc.)
- Search should disregard line breaks
- Search should disregard page breaks (how to handle if broken across pages? OCR is already messy. We may not be able to implement)
- Proximity searches (find words that are close to one another in the text)
- Limit searches by genre (Book vs Serial vs Collection)
- Provide for faceting of search results
- Type-ahead (find-as-you-type) search
- Highlight search terms in results
- Include a "backup" search technology? Currently, when the SQL Server full-text indexes are rebuilt (once a day), they go offline for a brief period of time (a minute or two). During that time, the site search falls back to direct database queries. Depending on the effect of re-indexing SOLR/ElasticSearch, we may need to consider a similar "backup" search strategy.
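The "backup" idea in the last item can be sketched as a thin wrapper: try the index first, and use the direct database query only when the index is unreachable. This is a minimal illustration, not the site's actual code; `search_with_fallback`, `index_search_offline`, and `database_search` are hypothetical names, and treating a `ConnectionError` as the "index offline" signal is an assumption.

```python
from typing import Callable, List


def search_with_fallback(
    query: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Try the index-backed search first; use the fallback if it is unreachable."""
    try:
        return primary(query)
    except ConnectionError:
        # Index offline (for example during a rebuild): use the slower path.
        return fallback(query)


def index_search_offline(query: str) -> List[str]:
    """Hypothetical stand-in for an index that is down for re-indexing."""
    raise ConnectionError("search index unavailable")


def database_search(query: str) -> List[str]:
    """Hypothetical stand-in for the direct database query."""
    return [f"db-result-for:{query}"]


if __name__ == "__main__":
    print(search_with_fallback("zea mays", index_search_offline, database_search))
```

Whether this is needed at all depends on whether SOLR/ElasticSearch actually goes offline while re-indexing, which the notes leave open.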
Fields to Index for Search
These are based mostly on what is indexed currently.
- Title.FullTitle
- Title.UniformTitle
- Title.PublisherName
- Title.PublicationPlace
- Title.StartYear
- Title.EndYear
- Title.EditionStatement
- TitleAssociation.Title
- TitleVariant.Title
- Identifier.IdentifierValue
- DOI.DOIName
- Item.Volume
- Item.Year
- Language.LanguageName (related to an Item or Segment)
- Institution.InstitutionName (related to an Item or Segment)
- Author.FullName
- Author.FullerForm
- Segment.Title
- Segment.TranslatedTitle
- Segment.ContainerTitle
- Segment.PublicationDetails
- Segment.Volume
- Segment.Date
- Segment.Series
- Segment.Issue
- Keyword.Keyword
- PageType.PageTypeName
- Name.ResolvedNameString
- Page Text (OCR/Transcription)
- HasSegments (flag)
- HasLocalContent (flag)
- HasExternalContent (flag)
Fields to be Faceted
- Item.Year/Title.StartYear
- Keyword
- Author
- PageTypeName (Map, Drawing, Foldout, etc)
- LanguageName
- InstitutionName
From Carolyn :
Could it be possible to use faceted search fields in conjunction with
- Scientific name?
- Collection?
In other words, Zea mays only in items published between 1890 and 1910?
How do we want results to be ranked? By term frequency (in both metadata and item’s OCR)?
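Carolyn's Zea mays question above could be expressed, if Elasticsearch ends up being the engine, as a bool query that matches the name and filters on a year range. This is a sketch under that assumption; the field names `scientific_name` and `item_year` are hypothetical and would depend on how the index is actually mapped.

```python
import json


def build_name_year_query(name: str, start_year: int, end_year: int) -> dict:
    """Build an Elasticsearch-style bool query: phrase match plus year-range filter."""
    return {
        "query": {
            "bool": {
                # Hypothetical field holding resolved scientific names.
                "must": [{"match_phrase": {"scientific_name": name}}],
                # Hypothetical numeric field for the item's publication year.
                "filter": [
                    {"range": {"item_year": {"gte": start_year, "lte": end_year}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_name_year_query("Zea mays", 1890, 1910), indent=2))
```

Putting the year range in `filter` rather than `must` means it narrows the result set without influencing the relevance score, which leaves the ranking question open.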
Basic Top Five Functional Requirements
• Search over both metadata and text
• Search within text of single book
• Search using words or phrases – would Boolean logic be used?
• Stop words – does the number of stop words affect the level of complexity for implementing? Is it significantly more difficult to activate stop words from multiple languages? Or should we just start with English? This might be useful but I’m sure there are other lists:
http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html
• Results ranking based on frequency of search term within text
Future requirements
• Stemming (I see this as being eventually necessary; if somewhat simple, should be moved up to basic)
• Results ranking based on proximity of terms within text
• Type-ahead functionality (Nice to have; if very simple could be moved up to basic)
• And anything else on Mike’s list
From Susan L
In general, I like the search capabilities of both Internet Archive and HathiTrust. Search in HathiTrust is described in detail at
https://www.hathitrust.org/help_digital_library#SearchTips.
I think that Mike’s list is excellent. I especially want to see:
· Separate searching of metadata and full text as is done in HathiTrust. Full text should include both OCR text and transcription text.
· Ability to do full text search within a single book or BHL item.
· Ability to do full text search across the BHL corpus.
· HathiTrust also supports full text search across a user-defined collection. I like this but don’t know how much it would cost to implement.
· Removal of stop words in the absence of quotation marks. This includes stop words in other languages e.g. le, la, die…
· Stemming. I don’t have a strong opinion about whether this should be done automatically or controlled by the user
· Search term highlighted in the text. (IA does highlighting on the page image. HT does highlighting in the OCR text.)
· In a metadata search, I want to support searching for a range of years e.g. in Item.Year
· I don’t find type-ahead very useful
· For a metadata-type search, search on copyright status is important. Many people want to restrict the results to public domain, CC0… This would be useful as a facet too.
Joel's Notes Regarding Full Text Search
Phase 1
- Full-text search over both metadata and full text (OCR or transcription)
- Search should disregard line breaks (this happens by default)
- Proximity searches (find words that are close to one another in the text)
- Include a "backup" search technology
- Results ranked by relevancy
Phase 2
- Provide for faceting of search results
- Search using words or phrases – Boolean? (Unless supported by the search engine for Phase 1)
- Limit searches by genre (Book vs Serial vs Collection)
- Highlight search terms in results (Unless supported by the search engine for Phase 1)
Facets for Phase 2
- Item.Year/Title.StartYear/Date Range
- Keyword
- Author
- PageTypeName (Map, Drawing, Foldout, etc)
- LanguageName
- InstitutionName
- Genre (Book / Serial / Collection)
- Scientific Name (maybe)
- Collection
- Copyright Status
Phase 3 or 4
- Results should include variations (similar spellings, etc.)
- Search within text of a single book
- Search should disregard page breaks
- Type-ahead (find-as-you-type) search
Minutes
Attending: Carolyn, Martin, Joel, Susan, Trish, Mike
Action Items
Mike will move forward with the addition of the table / updates to the database
Joel will look into ElasticSearch some more
Notes
Trish will send travel cost estimate to Carolyn and Martin
FULL TEXT REQUIREMENTS DISCUSSION
Reviewed and discussed prioritization of requirements as submitted by Mike, Carolyn, and Susan
From Mike's requirements, the General Wish-List includes some things we might want to prioritize, including items that come from user requests.
Facets -- these include what we have and what we want to search on. Maybe our first step will be to just get full text search implemented in a basic way and then add facets later.
How do we provide this information to the index, such that the facet data is available when indexing happens?
Do we give it a structured file?
User specification for full text over metadata and/or text?
Big question is whether it automatically searches over both or whether user can specify if they want to search over just text or metadata
Susan – providing the option to specify enables higher precision
If we can give it metadata and text and search both at the same time, we would weight the metadata higher than the full text and rank those matches higher in the search results
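One way to picture both points (user-selectable scope, and metadata weighted above full text) is an Elasticsearch-style `multi_match` query with per-field boosts. This is only a sketch: the field names, the `scope` parameter, and the boost of 3.0 are assumptions for illustration, not decisions from the call.

```python
from typing import List


def build_search_query(terms: str, scope: str = "both", metadata_boost: float = 3.0) -> dict:
    """Build a multi_match query over metadata, text, or both (metadata boosted)."""
    metadata_fields: List[str] = ["full_title", "author_name", "keyword"]  # hypothetical
    text_fields: List[str] = ["page_text"]  # hypothetical OCR/transcription field

    if scope == "metadata":
        fields = metadata_fields
    elif scope == "text":
        fields = text_fields
    elif scope == "both":
        # "field^N" is the query-time boost syntax: metadata matches count more.
        fields = [f"{name}^{metadata_boost}" for name in metadata_fields] + text_fields
    else:
        raise ValueError(f"unknown scope: {scope!r}")

    return {"query": {"multi_match": {"query": terms, "fields": fields}}}


if __name__ == "__main__":
    print(build_search_query("European starling"))
```

The boost value would need tuning against real results, which fits the iterative approach discussed in these notes.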
Realizing we won’t get this done perfectly the first time, is there a development pass we could do to make mistakes early?
Make iterative changes to indexing and see what results look like
We can also look at what others are doing
Infrastructure things – when looking at our options, do any of those lock us into anything?
Should we go with most flexible infrastructure build?
What is the baseline of what we need?
Do we need a new server?
Storage will be an issue based on the amount of data we have
1.5 TB – is Joel’s ballpark of size of OCR text
Our index size will be on that order
Memory usage will be a bigger issue than storage space
Unfortunately there is not a great way to ballpark ahead of time
Any white papers we can find, reports at conferences would be useful for mapping out how we want to do this.
Would it be useful to get in touch with DPLA people?
Joel looked at their stuff on GitHub and didn’t see much
They are using Elasticsearch
Not yet sure how to set it up to index, what do we give it to index?
Google has done a good job of full text indexing of IA. Search with site:archive.org and put in a term
Boolean – may not be available, not sure. How many would use that? What people are used to is putting quotes around a phrase. So probably not a top priority for Phase 1.
Synonyms, query extension, related words, would be nice to have.
Facets
First phase might be full text with no facets,
Second phase add in facets
Type Ahead – Phase 4, nice to have but significant impact on server
Highlight the terms – In SOLR you would need to store the text in addition to indexing it, so storage is effectively doubled (at least).
Could use coordinates, requires custom building, more complex
Balancing complexity of programming and storage needs
IA does the highlighting on page image, Hathi does it on the OCR
For BHL would most likely be on OCR
If we want to highlight the image on the page, very tricky. We don’t have the coordinates of image in OCR text. Would be tricky once we get to searching within book. We’ll come back to that
IIIF has support for coordinates
Backup search technology
If index server is down, do we want a fallback?
Yes, we would have a backup search technology
Stopwords – we’ll probably tweak these over time
If there’s a particular word we don’t want to index we would add it to that list
Shouldn’t affect complexity
Would need to re-index to add new ones, so get as close as possible at the beginning
Would be good to pay attention to up front
Eliminate articles in 10 most common languages in corpus
Get a list of which non-stopwords appear most commonly
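The last two points (drop articles from several languages, then list the most common remaining words) can be tried on sample OCR text with a few lines of Python. This is a sketch only: the `ARTICLES` table covers three languages with partial lists as an illustration, and the simple letter-run tokenizer is an assumption, not what a search engine's analyzer would do.

```python
import re
from collections import Counter
from typing import List, Tuple

# Illustrative, incomplete article lists; the real set would cover the
# ten most common languages in the corpus.
ARTICLES = {
    "en": {"the", "a", "an"},
    "fr": {"le", "la", "les", "un", "une"},
    "de": {"der", "die", "das", "ein", "eine"},
}
STOP_WORDS = set().union(*ARTICLES.values())


def most_common_non_stopwords(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent lowercase words that are not stop words."""
    # Runs of letters only (Unicode-aware); digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The starling and the sparrow; la starling, die Starling."
    print(most_common_non_stopwords(sample, n=3))
```

Running this over a sample of the corpus would surface high-frequency words that are candidates for the stop list before the first full index is built.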
Now basic search would be results on a page
Indexing at Item or Page level?
Affects how we handle results ranking
If we’re searching on European starling, if there’s a book about it and it’s not in the title, and we get 40 pages that mention that, how should we handle that?
But if we’re indexing at page level,
Would it make sense to index also at an item level?
What are we indexing?
Maybe it makes sense to do a survey of users,
Do you like this or this one?
Which one is most helpful?
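To make the page-versus-item question concrete: if the index is built per page, item-level results can still be produced by grouping page hits afterwards. The sketch below assumes each hit carries an item ID, a page ID, and a relevance score, and it ranks items by the sum of their page scores; `PageHit`, `group_hits_by_item`, and the summing rule are all illustrative assumptions, and other roll-ups (best page, hit count) are equally plausible.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class PageHit(NamedTuple):
    item_id: int
    page_id: int
    score: float


def group_hits_by_item(hits: List[PageHit]) -> List[Tuple[int, float, List[int]]]:
    """Roll page-level hits up to items: (item_id, total_score, page_ids)."""
    by_item: Dict[int, List[PageHit]] = defaultdict(list)
    for hit in hits:
        by_item[hit.item_id].append(hit)

    ranked = []
    for item_id, pages in by_item.items():
        # Best-matching pages first within each item.
        pages.sort(key=lambda h: h.score, reverse=True)
        total = sum(h.score for h in pages)
        ranked.append((item_id, total, [h.page_id for h in pages]))

    # Items with the highest combined page score come first.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


if __name__ == "__main__":
    hits = [PageHit(1, 10, 0.5), PageHit(2, 20, 0.9), PageHit(1, 11, 0.7)]
    for row in group_hits_by_item(hits):
        print(row)
```

With a rule like this, a book with 40 pages mentioning European starling would rise above a book with one strong page, which is the kind of behavior a user survey could confirm or reject.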
Stemming -- Plural one is pretty important
Ranking – indexing engine
Collection as a facet? Existing BHL collections could be something that we could facet on
User defined collections gets tricky with user accounts
Metadata – building a query
Faceting on copyright status might be messy because it’s not a controlled field
Fifteen terms for rights statements – DPLA, Europeana categories may be a good fit for us. But it’s a separate effort.
Mike and Joel will pull something together
Lean more towards index per page
USER FEEDBACK FORM
Members approved adding new categories to the Feedback Form to help streamline incoming issues
Details will be forthcoming
CONTRIBUTOR
This is from the EABL team, specifically a desire to capture a scanning institution as well as the rights holder
Mike proposed adding a table
Item currently associated with contributor
Add a table, item associated with this institution that role = contributor and this other institution that role = rights holder
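The proposed table can be pictured as a link between an item and an institution that also carries a role. The sketch below uses Python's built-in `sqlite3` purely for illustration (the production database is SQL Server), and the table and column names (`institution`, `item_institution`, `role`) are hypothetical, not the schema Mike will implement.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical item/institution/role link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE institution (
            institution_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE item_institution (
            item_id INTEGER NOT NULL,
            institution_id INTEGER NOT NULL REFERENCES institution(institution_id),
            role TEXT NOT NULL,  -- e.g. 'contributor' or 'rights holder'
            PRIMARY KEY (item_id, institution_id, role)
        );
        """
    )
    return conn


def institutions_for_item(conn: sqlite3.Connection, item_id: int) -> List[Tuple[str, str]]:
    """Return (role, institution name) pairs for one item, ordered by role."""
    return conn.execute(
        """
        SELECT ii.role, i.name
        FROM item_institution AS ii
        JOIN institution AS i ON i.institution_id = ii.institution_id
        WHERE ii.item_id = ?
        ORDER BY ii.role
        """,
        (item_id,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    conn.executemany("INSERT INTO institution VALUES (?, ?)",
                     [(1, "Scanning Library"), (2, "Rights Org")])
    conn.executemany("INSERT INTO item_institution VALUES (?, ?, ?)",
                     [(100, 1, "contributor"), (100, 2, "rights holder")])
    print(institutions_for_item(conn, 100))
```

Because the role lives on the link row, the same institution can be contributor for one item and rights holder for another, and reports currently pulled by contributor would filter on `role`.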
Will need to look at reports, pulled by contributor
We’ll have to look at how this impacts ingest.
Probably is right way forward, but is a significant amount of work.
Would like to get everything added at once in database including the fields proposed by DTWG
EABL will need to edit what we’ve already ingested so the sooner we can implement, the better as the body of material will continue to grow.
Mike – timeline for implementation would be probably a couple of weeks
Martin approved moving forward with the addition of the table / updates to the database
Joel will look into ElasticSearch some more
Mike – news about the Google book decision settled by Supreme Court. Does that change what we want to ingest?
There’s like 8,000 items in IA that we’ve been avoiding
We had a number of reasons, other than just legal
Those were uploaded by Aaron Swartz
Part of it was quality issues and duplication
Martin will think about it a bit more
Would be one line of code if we want to change
As far as BHL move goes, another round with legal dept, they requested more info and they’re reviewing
Trish – sent Joel and Bianca the metadata document, a couple of iterations with the group, Need to involve you with some conversations about workflow
Someone who has metadata for segments and those who don’t have any and would be creating it from scratch in Macaw. Mapping container and segment metadata or creating from scratch. Thinking about UI.
Joel to join May 5 – EABL call, 3pm EDT?