TechCall_30oct2017
Agenda & Notes
Attendees: Dima, Carolyn, Pam, Joel, Susan, Mike, Trish, Ari
Introductions
Global Names - re-indexing possibilities and future collaborations
Background
Did index a couple of times, and each time it was very slow
This means we cannot improve name finding as much as we would like
So at TDWG, reported on improvements to the speed of the process
50 million pages, marked up titles in 11 minutes.
Able to run name finding in about 1 hour. Probably will fluctuate a bit, might be one day for whole process.
About a year ago.
MBG - GN will run Name Finding Service and BHL will use API
With the new technology, this appears to be no longer necessary
Running process locally
Wouldn't need to sync BHL to GN all the time
You can run the service any time you want
What would server requirements be?
more CPU the better
1 server for namefinding might be enough
don't believe we need a cluster
Maybe if have 20 threads, could be a few hours (Dima used 40)
On Dima's laptop took 4 hours
So far Dima has tried on Linux
Postgres database, and SSD hard drives
Theoretically any machine would work.
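The thread counts discussed above (Dima used 40; ~20 may be enough) can be sketched as a simple fan-out of pages across a worker pool. This is a hypothetical illustration, not the actual GN implementation: `find_names` here is a toy stand-in that flags capitalized word pairs, where the real service would do actual name finding.

```python
# Hypothetical sketch: fan page texts across worker threads.
# find_names is a placeholder, NOT the real name-finding logic.
from concurrent.futures import ThreadPoolExecutor

def find_names(page_text):
    # Toy stand-in: flag "Capitalized lowercase" word pairs as candidate names.
    words = page_text.split()
    return [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a[:1].isupper() and b.islower()
    ]

def find_names_parallel(pages, threads=20):
    """Run name finding over many pages using a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with pages.
        return list(pool.map(find_names, pages))

results = find_names_parallel(["Puma concolor was seen", "no names here"])
```

With CPU-bound work like this, a process pool (or the Go-style concurrency GN actually uses) would scale better than Python threads; the structure of the fan-out is the same either way.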
Writes outputs into Postgres DB
Would just need a process from there to pull into BHL DB
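The pull step mentioned above might look like the sketch below: read the name-finding output from the GN results database and copy it into a BHL table. The table and column names (`found_names`, `page_names`) are assumptions, not the real schemas, and `sqlite3` stands in for Postgres so the sketch is self-contained; with `psycopg2` the queries would be essentially the same.

```python
# Hypothetical pull step: GN results DB -> BHL DB.
# Schemas are invented for illustration; sqlite3 stands in for Postgres.
import sqlite3

def pull_names(gn_conn, bhl_conn):
    """Copy found names from the GN output table into a BHL table."""
    rows = gn_conn.execute(
        "SELECT page_id, name_string, offset_start FROM found_names"
    ).fetchall()
    bhl_conn.executemany(
        "INSERT INTO page_names (page_id, name, offset_start) VALUES (?, ?, ?)",
        rows,
    )
    bhl_conn.commit()
    return len(rows)

# Minimal demonstration with in-memory databases.
gn = sqlite3.connect(":memory:")
gn.execute("CREATE TABLE found_names (page_id INTEGER, name_string TEXT, offset_start INTEGER)")
gn.execute("INSERT INTO found_names VALUES (1, 'Puma concolor', 120)")
bhl = sqlite3.connect(":memory:")
bhl.execute("CREATE TABLE page_names (page_id INTEGER, name TEXT, offset_start INTEGER)")
copied = pull_names(gn, bhl)
```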
Is it in a container? Easy to move?
Dima: One file. You execute it and you're done.
Scheduling - Martin question
Virtual server front-end at SI
How often would we be re-running?
Several stages - 1. Just logistics to figure out everything running as it should.
Complete rewrite of everything, so some will be better some will be worse.
So would have to go back and fix name finding and the dictionary.
Some time to refine it based on continued updates
Maybe in a year or so, we will really nail it
Looking at a Plugin for namefinding, another piece for QC
For example, if there is a large book with many errors, annotating for corrections is a lot of work.
Dima is working on a project that will dramatically simplify name finding data
Preliminary results - finding false positives. Test volume = 600 pages
Averaging 4-5 hours per 500 pages
If false negatives, maybe will take a day of work. But not a month :-)
Will Dima need access to server?
No.
GNames on GitHub
Project: BHL Index
Might be easy since it would be behind firewalls
Could try on a laptop w 1TB, 8CPU
Could we try on full text server?
Joel: How much will full text and global names collide over time? 20 threads should be enough for both.
Would just need to install Postgres if not already there
Dima: hope it will be better and better over time.
Mike - is there no way to say we have a week's worth of new data, just send that?
Dima - right now, more of a proof of concept.
Mike - that would be one of the first things we'd want to do, be able to run just the new things every week. Not the whole thing every week.
Dima - would eventually like to contain previous result, and one that runs from scratch and uses statistical information from previous runs.
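The incremental run Mike asks for above could be as simple as filtering pages by a last-run timestamp before handing them to name finding. This is a sketch of the idea only; the page record shape and the `last_run` bookkeeping are assumptions, not part of the current proof of concept.

```python
# Hypothetical incremental selection: only re-process pages changed
# since the previous run, instead of all 50 million every week.
from datetime import datetime

def pages_since(pages, last_run):
    """Return only the pages modified after the previous run's timestamp."""
    return [p for p in pages if p["updated_at"] > last_run]

pages = [
    {"id": 1, "updated_at": datetime(2017, 10, 20)},
    {"id": 2, "updated_at": datetime(2017, 10, 29)},
]
new_pages = pages_since(pages, last_run=datetime(2017, 10, 23))
```

In practice the filter would be a `WHERE updated_at > %s` clause against the pages table rather than an in-memory scan, but the bookkeeping is the same: store the timestamp of the last successful run, select only newer pages, and merge their results with the previous output.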
Dima: once we get started, it will be helpful to check in, share screens, etc.
Timeline - Check in with Martin
No change on full text.
Article - should be ready tomorrow.