BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

SolrInstallation

Meeting to define Full Text Search by using Solr

Skype call, March 1, 2013
Attendees: Mike Lichtenberg, Frances Webb, and William Ulate


With the native SQL Server search, the names search is more of a direct names lookup.

We would like to index 150 million documents

Solr field for the names: attach the names as a repeatable field on the documents themselves, without adding more documents.
Solr tends to treat a document as the object; the fields attached to it can be searched, faceted, etc.

It's a faceted search: you can configure a field as repeatable and get a summary of its different values.
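As a rough sketch of the repeatable names field discussed above, a document could be assembled like this before being sent to Solr. The field names (`id`, `title`, `names`) and the sample values are assumptions for illustration, not the BHL schema.

```python
# Hypothetical document shape; field names are illustrative, not the BHL schema.
def make_document(doc_id, title, names):
    """Build a dict with one repeatable 'names' field (a list of values)."""
    return {
        "id": doc_id,                 # unique ID for the document
        "title": title,
        "names": sorted(set(names)),  # repeatable field, de-duplicated
    }

doc = make_document(
    "item-1",
    "A hypothetical volume",
    ["Apis mellifera", "Bombus terrestris", "Apis mellifera"],
)
print(doc["names"])
```

The names stay attached to the one document as a list, so no extra documents are created for them.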

Solr doesn't have a front end; it returns results as XML. You can configure to show the first record and
Solr gives a summary breakdown; the front-end implementation decides what to do with it, for example with the languages.
Solr does the work of tabulating the results for whatever facet you decide on: what values appear and how often...
Solr leaves it to the front end to keep state, qualify the search, and create a query that matches the documents.
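A minimal sketch of the front end's side of this, assuming a hypothetical `language` facet field: the front end keeps the selected facet values as its state and turns them into a query string. `q`, `wt`, `facet`, `facet.field`, and `fq` are standard Solr request parameters; the helper name and field name are made up for illustration.

```python
from urllib.parse import urlencode

def build_facet_query(text, facet_field, filters=()):
    """Build a Solr query string from the search text and the front end's state."""
    params = [
        ("q", text),
        ("wt", "xml"),                 # response format
        ("facet", "true"),             # ask Solr to tabulate facet counts
        ("facet.field", facet_field),  # hypothetical field to summarize
    ]
    # Each facet value the user has already selected becomes a filter query.
    for field, value in filters:
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)

query_string = build_facet_query("bees", "language", [("language", "English")])
print(query_string)
```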

Designing a first incarnation of the Solr Index is not too difficult, but then there will be an incremental process to define specific searches.

A first pass could be put together in a couple of hours by looking at which fields could be searched.

I don't know what's involved in importing data and the same with the update protocols.

Adjusting the front-end mechanism may need to be interactive, and that is where the hiccups may be...

Solr only requires the data that needs to be searched and returned. It also requires a unique ID for every document that is going to be searched.

Bibliographic Metadata
Derived Metadata (scientific names)
OCR Text
Page Number

Search right now is at the volume level, but if we are going to search the OCR text we may want to go to page level. To return both, you need a volume object plus a page object for each page.
For example, searching for bees, the search result might return a volume, but a second search might look at the .
This requires each object to be indexed twice, once as a volume and once as a page.
If you also need to return textual subjects, that increases the space required.
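The double indexing described above could be sketched as follows: one volume-level document plus one document per page, each with its own unique ID. The field names (`doc_type`, `volume_id`, `page_number`, `ocr_text`) and the ID format are assumptions for illustration only.

```python
def build_documents(volume_id, title, page_texts):
    """Return one volume-level document plus one document per page."""
    volume_key = f"volume-{volume_id}"
    volume_doc = {
        "id": volume_key,
        "doc_type": "volume",
        "title": title,
        "ocr_text": " ".join(page_texts),  # full text indexed at volume level
    }
    page_docs = [
        {
            "id": f"{volume_key}-page-{number}",
            "doc_type": "page",
            "volume_id": volume_key,       # link back to the parent volume
            "page_number": number,
            "ocr_text": text,              # same text indexed again per page
        }
        for number, text in enumerate(page_texts, start=1)
    ]
    return [volume_doc] + page_docs

docs = build_documents(7, "A hypothetical volume", ["bees and hives", "wasps"])
print(len(docs))
```

Because the OCR text appears once in the volume document and again in the page documents, the index holds it twice, which is the space cost noted above.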

Mike proposed that we keep the hierarchical relations (scientific names are at the page level); pieces of a book may or may not relate to scanned items.
Representing this in Solr is not that difficult, as long as we can group the items to be indexed in a way similar to what is needed for the others.

The way Solr indexes is very arbitrary: you tell it what fields are going to be in your data.

Out of the box, Solr treats a match in a shorter text field as more important. You can configure it otherwise, for example making Title more important than Subject.
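One way to express "Title more important than Subject" is per-field boosts in the query, sketched here with Solr's eDisMax parser (`defType=edismax`, `qf`). The field names and weights are assumptions, and whether boosts would be set at query time or in the Solr configuration was not decided in this meeting.

```python
from urllib.parse import urlencode

def build_boosted_query(text, boosts):
    """Build a query string whose 'qf' parameter weights each field."""
    # "title^3 subject^1" means a match in title counts three times as much.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return urlencode({"q": text, "defType": "edismax", "qf": qf})

boosted = build_boosted_query("bees", {"title": 3, "subject": 1})
print(boosted)
```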

If we did page level documents for every item of every volume we could get

Volume relevancy search is probably not better, but not nearly as bad as page level relevancy.

Sorting, when you click on a particular volume gives a volume search. Show resulting relevant pages but in the order that they appear in the

Title level search + Page level search

We would need to model the external model as well as the external segment.

There is a possibility that an external segment associated to the

There is a segment search in production. There are currently 4 catalogs in BHL: Authors, Subjects, Title/Item/Volume, and Segments.

Solr is flexible enough to handle the different types of metadata we have, as long as it is metadata or fields attached to a document.

A two-step search might be adequate: return titles first, then let the user see which pages match.
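The two steps could look roughly like this as Solr request parameters. This is only a sketch: `doc_type`, `volume_id`, and `page_number` are hypothetical fields, while `q`, `fq`, and `sort` are standard Solr parameters.

```python
def volume_search_params(text):
    """Step 1: search volume-level documents only."""
    return {"q": text, "fq": "doc_type:volume"}

def page_search_params(text, volume_id):
    """Step 2: matching pages of one chosen volume, in page order."""
    return {
        "q": text,
        "fq": ["doc_type:page", f'volume_id:"{volume_id}"'],
        "sort": "page_number asc",  # order of appearance, not relevancy
    }

step_one = volume_search_params("bees")
step_two = page_search_params("bees", "volume-7")
print(step_two["fq"])
```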

People would like faceting.

Outliers may get cut off at the end of the list if the facet lists the most frequent values first.
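The cutoff can be illustrated locally, without Solr, by counting values and keeping only the most frequent ones. In Solr itself this behavior is governed by the `facet.sort` and `facet.limit` parameters; the sample data below is made up.

```python
from collections import Counter

def top_facet_values(values, limit):
    """Count the values and keep only the 'limit' most frequent ones."""
    return Counter(values).most_common(limit)

# Made-up sample: one rare value among two common ones.
languages = ["English"] * 5 + ["German"] * 3 + ["Latin"]
top = top_facet_values(languages, 2)
print(top)
```

With a limit of 2, the rare value ("Latin" here) never reaches the list, which is the outlier problem noted above.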

Next Steps

For Step 1.

1 We need to know granularity. We need to define searches.
2 Line up the 2 schemas: here's how it is in the SQL Database, here's how it can be represented in Solr.
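Lining up the two schemas could start as a simple column-to-field table, sketched below. Every column and field name here is hypothetical; the real SQL schema and the Solr fields still have to be defined.

```python
# Hypothetical mapping: SQL column name -> Solr field name.
COLUMN_TO_FIELD = {
    "ItemID": "id",
    "FullTitle": "title",
    "Volume": "volume_label",
}

def row_to_document(row):
    """Translate one SQL row (a dict) into a Solr-style document."""
    return {
        field: row[column]
        for column, field in COLUMN_TO_FIELD.items()
        if column in row  # skip columns missing from this row
    }

solr_doc = row_to_document({"ItemID": 42, "FullTitle": "A hypothetical title"})
print(solr_doc)
```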

Then open it up for experimentation.

Full text requires IA querying and Metadata... Single service from BHL can provide text and metadata.

Full text in SQL Search can be strange.

Whether Cornell can provide a Linux server is still a question... if we end up running a Solr index in the DB. If we put it in the Garden, we'll put it on a separate box.

We may want to keep it in Linux.

For Step 2.

=> Page level, if we wanted by page or scientific name.

Will talk again in a couple of weeks!