TechCall_25apr2016

Agenda

Review 5-6 requirements for basic full text search, submitted by all
Discuss updates to User Feedback form on the website (i.e., adding additional categories)
Contributor field - change request from EABL on how used/assigned

For Full Text Discussion

From Mike:
General Wish-List

Full-text search over both metadata and text (OCR/transcriptions)
Search across all text
Search within text of a single book - IA has this
Search using words or phrases – (Boolean? Phase 2 or 3)
Results should include variations (similar spellings, etc.)
Search should disregard line breaks
Search should disregard page breaks (how to handle if broken across pages? OCR is already messy. We may not be able to implement)
Proximity searches (find words that are close to one another in the text)
Limit searches by genre (Book vs Serial vs Collection)
Provide for faceting of search results
Type-ahead (find-as-you-type) search
Highlight search terms in results
Include a "backup" search technology? Currently, when the SQL Server full-text indexes are rebuilt (once a day), they go offline for a brief period of time (a minute or two). During that time, the site search falls back to direct database queries. Depending on the effect of re-indexing SOLR/ElasticSearch, we may need to consider a similar "backup" search strategy.
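A minimal sketch of the "backup" search idea from the last item above, assuming the primary backend signals an offline index by raising an exception. The names IndexUnavailable, index_search and database_search are hypothetical stand-ins, not part of any existing BHL code.

```python
from typing import Callable, List


class IndexUnavailable(Exception):
    """Raised by the primary search backend while its index is offline."""


def search_with_fallback(
    query: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Try the index-backed search first; use the fallback if it is offline."""
    try:
        return primary(query)
    except IndexUnavailable:
        # Index is rebuilding or down: degrade to the slower direct query.
        return fallback(query)


# Hypothetical stand-ins for the real backends.
def index_search(query: str) -> List[str]:
    raise IndexUnavailable("index is being rebuilt")


def database_search(query: str) -> List[str]:
    titles = ["Birds of Europe", "The European starling"]
    return [t for t in titles if query.lower() in t.lower()]


results = search_with_fallback("starling", index_search, database_search)
print(results)
```

The caller never needs to know which backend answered, which is the same behavior the site has today during the daily re-index.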

Fields to Index for Search

These are based mostly on what is indexed currently.

Title.FullTitle
Title.UniformTitle
Title.PublisherName
Title.PublicationPlace
Title.StartYear
Title.EndYear
Title.EditionStatement
TitleAssociation.Title
TitleVariant.Title
Identifier.IdentifierValue
DOI.DOIName
Item.Volume
Item.Year
Language.LanguageName (related to an Item or Segment)
Institution.InstitutionName (related to an Item or Segment)
Author.FullName
Author.FullerForm
Segment.Title
Segment.TranslatedTitle
Segment.ContainerTitle
Segment.PublicationDetails
Segment.Volume
Segment.Date
Segment.Series
Segment.Issue
Keyword.Keyword
PageType.PageTypeName
Name.ResolvedNameString
Page Text (OCR/Transcription)
HasSegments (flag)
HasLocalContent (flag)
HasExternalContent (flag)

Fields to be Faceted

Item.Year/Title.StartYear
Keyword
Author
PageTypeName (Map, Drawing, Foldout, etc)
LanguageName
InstitutionName

From Carolyn:
Could it be possible to use faceted search fields in conjunction with

Scientific name?
Collection?

In other words, Zea mays only in items published between 1890 and 1910?
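The Zea mays example could be expressed as a phrase match combined with a year-range filter. The sketch below builds a request body in the style of the Elasticsearch query DSL (bool / must / filter / range). The field names scientific_name and year are assumptions for illustration only; no index schema has been decided.

```python
import json


def name_in_year_range(name: str, start_year: int, end_year: int) -> dict:
    """Build a query body: phrase match on a name, filtered to a year range."""
    return {
        "query": {
            "bool": {
                # Must match the scientific name as a phrase.
                "must": [{"match_phrase": {"scientific_name": name}}],
                # Filter (no effect on ranking) to the publication years.
                "filter": [
                    {"range": {"year": {"gte": start_year, "lte": end_year}}}
                ],
            }
        }
    }


body = name_in_year_range("Zea mays", 1890, 1910)
print(json.dumps(body, indent=2))
```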

How do we want results to be ranked? By term frequency (in both metadata and item’s OCR)?

Basic Top Five Functional Requirements
• Search over both metadata and text
• Search within text of single book
• Search using words or phrases – would Boolean logic be used?
• Stop words – does the number of stop words affect the level of complexity for implementing? Is it significantly more difficult to activate stop words from multiple languages? Or should we just start with English? This might be useful but I’m sure there are other lists: http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html
• Results ranking based on frequency of search term within text

Future requirements
• Stemming (I see this as being eventually necessary; if somewhat simple, should be moved up to basic)
• Results ranking based on proximity of terms within text
• Type-ahead functionality (Nice to have; if very simple could be moved up to basic)
• And anything else on Mike’s list

From Susan L
In general, I like the search capabilities of both Internet Archive and HathiTrust. Search in HathiTrust is described in detail at https://www.hathitrust.org/help_digital_library#SearchTips.

I think that Mike’s list is excellent. I especially want to see:

· Separate searching of metadata and full text as is done in HathiTrust. Full text should include both OCR text and transcription text.
· Ability to do full text search within a single book or BHL item.
· Ability to do full text search across the BHL corpus.
· HathiTrust also supports full text search across a user-defined collection. I like this but don’t know how much it would cost to implement.
· Removal of stop words in the absence of quotation marks. This includes stop words in other languages e.g. le, la, die…
· Stemming. I don’t have a strong opinion about whether this should be done automatically or controlled by the user
· Search term highlighted in the text. (IA does highlighting on the page image. HT does highlighting in the OCR text.)
· In a metadata search, I want to support searching for a range of years e.g. in Item.Year
· I don’t find type-ahead very useful
· For a metadata-type search, search on copyright status is important. Many people want to restrict the results to public domain, CC0… This would be useful as a facet too.

Joel's Notes Regarding Full Text Search

Phase 1

Full-text search over both metadata and full text (OCR or transcription)
Search should disregard line breaks (this happens by default)
Proximity searches (find words that are close to one another in the text)
Include a "backup" search technology
Results ranked by relevancy
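To make the proximity item in the Phase 1 list concrete, here is a small sketch in plain Python. It is not how a search engine implements proximity internally; it only illustrates the behavior being asked for, including the fact that tokenizing on word characters makes line breaks irrelevant. The function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into words; whitespace, including line breaks, is ignored."""
    return re.findall(r"\w+", text.lower())


def within_distance(text: str, a: str, b: str, max_gap: int) -> bool:
    """True if some occurrence of a is at most max_gap word positions from b."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


page = "The European\nstarling was first observed near the river."
print(within_distance(page, "European", "starling", 1))
print(within_distance(page, "starling", "river", 3))
```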

Phase 2

Provide for faceting of search results
Search using words or phrases – Boolean? (Unless supported by the search engine for Phase 1)
Limit searches by genre (Book vs Serial vs Collection)
Highlight search terms in results (Unless supported by the search engine for Phase 1)

Facets for Phase 2

Item.Year/Title.StartYear/Date Range
Keyword
Author
PageTypeName (Map, Drawing, Foldout, etc)
LanguageName
InstitutionName
Genre (Book / Serial / Collection)
Scientific Name (maybe)
Collection
Copyright Status

Phase 3 or 4

Results should include variations (similar spellings, etc.)
Search within text of a single book
Search should disregard page breaks
Type-ahead (find-as-you-type) search

Minutes
Attending: Carolyn, Martin, Joel, Susan, Trish, Mike

Action Items
Mike will move forward with the addition of the table / updates to the database
Joel will look into ElasticSearch some more

Notes
Trish will send travel cost estimate to Carolyn and Martin

FULL TEXT REQUIREMENTS DISCUSSION
Reviewed and discussed prioritization of requirements as submitted by Mike, Carolyn, and Susan

From Mike's requirements, the General Wishlist includes some things we might want to prioritize including from user requests.
Facets -- these include what we have and what we want to search on. Maybe our first step will be to just get full text search implemented in a basic way and then add facets later.

How do we provide this information to the index, such that the facet data is available when indexing happens?
Do we give it a structured file?
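One possible answer to the structured-file question is a file with one JSON document per page, each carrying the item-level facet values alongside the page text. The sketch below assumes that approach. The field names are hypothetical, loosely based on the fields listed above, and the sample item is invented.

```python
import json
from typing import Dict, List


def page_document(item: Dict, page: Dict) -> Dict:
    """Combine item-level metadata (facet values) with one page of text."""
    return {
        "page_id": page["page_id"],
        "item_id": item["item_id"],
        "title": item["title"],
        "authors": item["authors"],
        "year": item["year"],
        "language": item["language"],
        "institution": item["institution"],
        "page_types": page["page_types"],
        "text": page["text"],
    }


# Invented sample data for illustration.
item = {
    "item_id": 1,
    "title": "Birds of Europe",
    "authors": ["Example, A."],
    "year": 1895,
    "language": "English",
    "institution": "Example Library",
}
pages = [
    {"page_id": 10, "page_types": ["Text"], "text": "The starling is common."},
    {"page_id": 11, "page_types": ["Map"], "text": "Distribution of the starling."},
]

# One JSON document per line, ready to hand to an indexer.
lines: List[str] = [json.dumps(page_document(item, p)) for p in pages]
print("\n".join(lines))
```

Repeating the item metadata on every page costs some space but means facet data is present at indexing time without a separate lookup.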

User specification for full text over metadata and/or text?
Big question is whether it automatically searches over both or whether user can specify if they want to search over just text or metadata

Susan – providing the option to specify enables higher precision

If we can give the index both metadata and text and search them at the same time, we would weight metadata matches higher than full-text matches so they rank higher in the search results
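A rough illustration of ranking metadata matches above full-text matches. This is a toy term-count score, not the relevance formula of any search engine, and the weights 5.0 and 1.0 are arbitrary assumptions.

```python
from typing import List, Tuple


def weighted_score(term: str, metadata: str, text: str,
                   metadata_weight: float = 5.0,
                   text_weight: float = 1.0) -> float:
    """Count term occurrences, counting metadata hits more heavily than text hits."""
    term = term.lower()
    metadata_hits = metadata.lower().split().count(term)
    text_hits = text.lower().split().count(term)
    return metadata_weight * metadata_hits + text_weight * text_hits


# (item id, metadata, full text); invented sample records.
records: List[Tuple[str, str, str]] = [
    ("item-A", "Birds of Europe", "starling starling starling"),
    ("item-B", "The starling in Europe", "starling starling"),
]
ranked = sorted(
    records,
    key=lambda r: weighted_score("starling", r[1], r[2]),
    reverse=True,
)
print([r[0] for r in ranked])
```

Here item-B ranks first even though item-A mentions the term more often in its text, because the single title match outweighs the extra text matches.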

Realizing we won’t get this done perfectly the first time, is there a development pass we could do to make mistakes early?
Make iterative changes to indexing and see what results look like

We can also look at what others are doing

Infrastructure things – when looking at our options, do any of those lock us into anything?
Should we go with most flexible infrastructure build?
What is the baseline of what we need?
Do we need a new server?
Storage will be an issue based on the amount of data we have
1.5 TB – is Joel’s ballpark of size of OCR text
Our index size will be on that order

Memory usage will be a bigger issue than storage space
Unfortunately there is not a great way to ballpark ahead of time

Any white papers we can find, reports at conferences would be useful for mapping out how we want to do this.

Would it be useful to get in touch with DPLA people?
Joel looked at their stuff on GitHub and didn’t see much
They are using Elasticsearch
Not yet sure how to set it up to index, what do we give it to index?

Google has done a good job of full text indexing of IA. Search Google with site:archive.org and put in a term

Boolean – may not be available, not sure. How many would use that? What people are used to is putting quotes around a phrase. So probably not a top priority for Phase 1.

Synonyms, query extension, related words, would be nice to have.

Facets
First phase might be full text with no facets,
Second phase add in facets

Type Ahead – Phase 4, nice to have but significant impact on server

Highlight the terms – In SOLR you would need to store the text in addition to indexing it, which effectively doubles storage (at least).
Could use coordinates, requires custom building, more complex
Balancing complexity of programming and storage needs

IA does the highlighting on page image, Hathi does it on the OCR
For BHL would most likely be on OCR
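Highlighting on the OCR text, as opposed to the page image, can be as simple as wrapping matches in the stored text. A minimal sketch, assuming the page text is available at display time; the em markers are an arbitrary choice.

```python
import re


def highlight(text: str, term: str,
              before: str = "<em>", after: str = "</em>") -> str:
    """Wrap each whole-word, case-insensitive match of term in markers."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: before + m.group(0) + after, text)


marked = highlight("The Starling and the starling.", "starling")
print(marked)
```

Highlighting on the image would need word coordinates instead, which is the more complex option noted above.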

If we want to highlight the image on the page, very tricky. We don’t have the coordinates of image in OCR text. Would be tricky once we get to searching within book. We’ll come back to that

IIIF has support for coordinates

Backup search technology
If index server is down, do we want a fallback?
Yes, we would have a backup search technology

Stopwords – we’ll probably tweak these over time
If there’s a particular word we don’t want to index we would add it to that list
Shouldn’t affect complexity
Would need to re-index to add new ones, so get as close as possible at the beginning
Would be good to pay attention to up front
Eliminate articles in 10 most common languages in corpus

Get a list of which non-stopwords appear most commonly
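A sketch of how the list of most common non-stopwords could be produced, using a small stop list made up of articles from a few languages. The stop list and sample pages are assumptions for illustration; a real run would read the OCR corpus.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Articles only, in a few languages; a real list would be longer.
STOP_WORDS = {"the", "a", "an", "le", "la", "les", "der", "die", "das"}


def top_terms(texts: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent words that are not in the stop list."""
    counts: Counter = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)


sample_pages = [
    "The starling and the sparrow",
    "Le starling et la rose",
    "Die Stare: the starling",
]
print(top_terms(sample_pages, 1))
```

Very frequent words that surface here and carry no search value are candidates for the stop list before the first full index build.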

Now basic search would be results on a page

Indexing at Item or Page level?
Affects how we handle results ranking
If we’re searching on European starling, if there’s a book about it and it’s not in the title, and we get 40 pages that mention that, how should we handle that?
But if we’re indexing at page level,
Would it make sense to index also at an item level?
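If the index is built per page, item-level results can still be derived by grouping page hits. A minimal sketch, assuming each hit is an (item id, page id, score) tuple and that summing page scores is an acceptable way to rank items; both are assumptions, not decisions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PageHit = Tuple[str, int, float]  # (item id, page id, score)


def rank_items(page_hits: List[PageHit]) -> List[Tuple[str, float, int]]:
    """Group page hits by item; return (item id, total score, page count)."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for item_id, _page_id, score in page_hits:
        grouped[item_id].append(score)
    ranked = [(item_id, sum(scores), len(scores))
              for item_id, scores in grouped.items()]
    # Highest total score first.
    ranked.sort(key=lambda r: r[1], reverse=True)
    return ranked


hits = [("item1", 1, 1.0), ("item1", 2, 1.5), ("item2", 7, 2.0)]
print(rank_items(hits))
```

With this shape, a book with 40 matching pages shows up once in the results with a page count, rather than as 40 separate rows.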

What are we indexing?

Maybe it makes sense to do a survey of users,
Do you like this or this one?
Which one is most helpful?

Stemming -- Plural one is pretty important

Ranking – indexing engine

Collection as a facet? Existing BHL collections could be something that we could facet on
User-defined collections get tricky with user accounts

Metadata – building a query

Faceting on copyright status might be messy because it’s not a controlled field

Fifteen terms for rights statements – DPLA, Europeana categories may be a good fit for us. But it’s a separate effort.

Mike and Joel will pull something together
Lean more towards index per page

USER FEEDBACK FORM
Members approved adding new categories to the Feedback Form to help streamline incoming issues
Details will be forthcoming

CONTRIBUTOR
This is from the EABL team, specifically a desire to capture a scanning institution as well as the rights holder
Mike proposed adding a table
Item currently associated with contributor
Add a table in which an item is associated with one institution with role = contributor and with another institution with role = rights holder
Will need to look at reports, pulled by contributor
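The proposed table could look roughly like the sketch below, shown with an in-memory SQLite database for convenience. The table names, column names and sample rows are hypothetical and do not reflect the actual BHL schema, which runs on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Institution (
    InstitutionCode TEXT PRIMARY KEY,
    InstitutionName TEXT NOT NULL
);
-- One row per (item, institution, role) instead of a single contributor column.
CREATE TABLE ItemInstitution (
    ItemID INTEGER NOT NULL,
    InstitutionCode TEXT NOT NULL REFERENCES Institution(InstitutionCode),
    Role TEXT NOT NULL CHECK (Role IN ('Contributor', 'Rights Holder')),
    PRIMARY KEY (ItemID, InstitutionCode, Role)
);
""")
conn.executemany(
    "INSERT INTO Institution VALUES (?, ?)",
    [("LIBA", "Library A"), ("MUSB", "Museum B")],
)
conn.executemany(
    "INSERT INTO ItemInstitution VALUES (?, ?, ?)",
    [(1, "LIBA", "Contributor"), (1, "MUSB", "Rights Holder")],
)

rows = conn.execute(
    """SELECT Role, InstitutionName
       FROM ItemInstitution JOIN Institution USING (InstitutionCode)
       WHERE ItemID = ? ORDER BY Role""",
    (1,),
).fetchall()
print(rows)
```

Reports currently pulled by contributor would then filter on Role = 'Contributor' rather than reading a single column on the item.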

We’ll have to look at how this impacts ingest.

Probably is right way forward, but is a significant amount of work.
Would like to get everything added at once in database including the fields proposed by DTWG

EABL will need to edit what we’ve already ingested so the sooner we can implement, the better as the body of material will continue to grow.
Mike – timeline for implementation would be probably a couple of weeks

Martin approved moving forward with the addition of the table / updates to the database
Joel will look into ElasticSearch some more

Mike – news about the Google book decision settled by Supreme Court. Does that change what we want to ingest?
There’s like 8,000 items in IA that we’ve been avoiding
We had a number of reasons, other than just legal
Those were uploaded by Aaron Swartz
Part of it was quality issues and duplication
Martin will think about it a bit more

Would be one line of code if we want to change

As far as BHL move goes, another round with legal dept, they requested more info and they’re reviewing

Trish – sent Joel and Bianca the metadata document, a couple of iterations with the group, Need to involve you with some conversations about workflow
Someone who has metadata for segments and those who don’t have any and would be creating it from scratch in Macaw. Mapping container and segment metadata or creating from scratch. Thinking about UI.

Joel to join May 5 – EABL call, 3pm EDT?