NOTES! Action Items! Further Considerations!
BHL-Europe Meeting, Berlin. 23rd-25th February, 2011
Wednesday 23rd February [Andreas, Henning, Boris, Melita, Adrian, Lola]
9.30-12.00 BHL-Europe architecture with particular regard to the data workflow and data enrichment. (Based on the recent discussions and questions on the data workflow and data enrichment during Pre-Ingest, we need to revise the architecture diagram.)
Taxonomic Data in BHL-E .docx
a)
Taxonomic Considerations [Andreas]
There are 3 main areas to consider:-
a) Taxonomic data in the dataflow during the mapping and pre-ingest vs requirements
b) Alternative Scenarios
c) BHL-Europe metadata Schema limitations regarding taxonomic data.
There are 2 alternative scenarios for dealing with the BHL-E Metadata Schema limitations regarding taxonomic data (based on diagrams):-
1) Periodically Re-ingest and redo enrichment or
2) Use external web services during Solr/Lucene index generation (periodically).
The 2nd option is preferable, with a slight alteration to the diagram suggested by Adrian, whereby metadata enrichment can be done ‘one-time’ offline at the Pre-Ingest stage, so there will be no need to re-ingest.
ACTION 1:
Andreas will revise the dataflow diagram and present it during the next tech call, together with any further questions/input for the group.
Andreas might need to go along with Adrian to Vienna in March…during talks with AIT.
b)
Using the Fedora Option: [Adrian]
The Fedora model is our contractual option as per the DoW but will probably have to be replaced anyway after 5 years.
Using an atomistic model could ‘kill’ us at any time; however, the question is at what stage, i.e. what is the risk factor?
Such considerations are outside of the lifespan of the project.
EXECUTIVE DECISION: Use the atomistic model and accept the Risk.
We need to consider how we define our specifications of service, i.e. do we want to include Semantic Enrichment (relationship of various types of metadata to other relationships of metadata)?
ACTION 2:
Andreas to structure this information as per action above using existing data flow diagram and to link this with the diagram used during his presentation.
Hence, the Workflow will be used as base for a diagrammatic view.
c)
Adaptation of the BHL-Europe Architecture diagram –
Adrian confirmed that the architecture diagram recently provided by Lee (Atos) is really the Implementation model.
The BHL-Europe Architecture diagram needs to be slightly revised with the following:- the Taxonomic Metadata enrichment needs to be outside of the Schema Mapping, it needs to be moved into the Pre-Ingest area.
The diagram has recently been updated by Henning/Melita and can be found here:-
https://docs0.google.com/drawings/edit?id=19mx3M967NQIyR_7CDVSO9fqMHTjNoQ25V9zV7o9AiBY
d)
Questions raised from tech call discussions related to the Feature List:-
As no members of the techgroup were really in a position to discuss these particular items on the list (i.e. search for journal titles, metadata edit, tagging as part of myBHL?), Boris will need to take this offline and discuss further with Jana/techgroup members via techcall.
ACTION 3:
Boris to arrange a separate call with AIT/Lee after facilitating questions from the WG.
Considerations: How easy will it be to integrate the GRIB? Evaluate the possibilities and provide a framework etc. Is the data integration of the GRIB into the Portal a high priority? How will Boris integrate this with the Portal work?
Lee will need to determine the dependencies and whether this is something that can be done easily and then establish a proposal.
Thursday 24th February [Chris,Adrian, Adrian, Boris, Melita, Lola]
GUID Options (GUID Minting for BHL-Europe and implications for BHL-Europe architecture)
Adrian presented a proposal for building a GUID Mint which was ratified along with Chris Freeland.
[Adrian]
Handles – Resolution (works similarly to DNS, resolves GUIDs, provides a mechanism for distribution on a global architecture)
GUID Mint – providing a UUID – needs to be very robust/resilient for a global architecture.
Decided to use Random GUIDs
The objective is to build a Global Mint (a Global Service Availability)
Using a GeoIP/Proxy and creating a Web service providing GUID Handles.
Viability of our solution can be checked by Bibliotheca Alexandrina since they’re currently using V6.
N.B. Refer to the Handle.NET Manual v.7 (Technical/User manual) for further info.
The diagram for the GlobalBHLResolver was revised by Chris, doing away with the Proxy, which has been combined with a SAV Monitor applet. Refer to the PowerPoint diagram to follow the step-by-step process below:-
1) User submits a handle to SAV.
2) SAV passes the user’s IP to GeoIP to determine location.
3) Proxy requests from the Handle Server all of the URLs for the Handle.
4) SAV checks the availability of the regional node based on the GeoIP-determined location: a) if online, resolves the user to that node; b) if offline, iterates through the possible URLs until it finds an available node.
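The four steps above could be sketched roughly as follows. This is only an illustration of the flow in the diagram, not an agreed design: the function name `resolve_handle`, the three callables passed in (standing for the GeoIP lookup, the Handle Server query and the availability check) and the example URLs are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Optional

def resolve_handle(
    handle: str,
    user_ip: str,
    geoip_lookup: Callable[[str], str],            # hypothetical: IP -> region code
    handle_urls: Callable[[str], Dict[str, str]],  # hypothetical: handle -> {region: URL}
    is_online: Callable[[str], bool],              # hypothetical: availability check
) -> Optional[str]:
    """Return the URL of an available node for the handle, or None if all are down."""
    region = geoip_lookup(user_ip)          # step 2: locate the user
    urls_by_region = handle_urls(handle)    # step 3: all URLs for the handle
    preferred = urls_by_region.get(region)

    # Step 4a: the regional node is online, so resolve the user to it.
    if preferred is not None and is_online(preferred):
        return preferred

    # Step 4b: otherwise iterate through the other URLs until one is available.
    for url in urls_by_region.values():
        if url != preferred and is_online(url):
            return url
    return None

# Made-up example data: two regional nodes holding the same item.
EXAMPLE_NODES = {
    "EU": "http://eu.example.org/item/1",
    "US": "http://us.example.org/item/1",
}
```

Passing the three services in as callables keeps the sketch independent of whichever GeoIP and Handle components are finally chosen.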
ACTION 4
Chris will take this model back to the US and discuss further with Anthony/Phil followed by further talks with Atos (Lee).
Afternoon Tech call Discussion with Atos (Lee, Ameth) - providing update on recent decisions/discussions.
a)
Update on Fedora testing [Lee, Ameth]
Presently not ready to do load testing, but we do have the Fedora system with some books from BHLUS (Internet Archive), which we will use as working data.
We have been working on the Access part and implementation of Islandora….
Ameth will send out the URL.
We will start on performance testing once populated with data.
b)
Update given to Lee based on decisions recently made [Adrian]
A fundamental decision has been made. Yes, we are contractually bound to use Fedora, but do we use a compound or an atomistic model within Fedora?
Lee: Awaiting evidence from Bernd that we can use both – currently waiting for a hybrid model from him - needs to demonstrate if this is do-able.
Adrian: The Requirement for users is that: we need to deliver an atomistic model to associate metadata with pages at that level. This will increase load significantly over that of a compound model.
We should just accept this as a limitation (with a caveat) that we will reach physical limits of Fedora at a later stage.
Lee’s Considerations:-
Bearing in mind Fedora is not really scalable -
A limiting factor will be the database that Fedora uses and not Fedora itself.
Options: replace Fedora with something more distributable, or scale up (i.e. more memory) and then move to a MySQL cluster.
Adrian: There is an upgrade path possible.
We should use an atomistic model and we can state that we will need to upgrade at some point which is currently outside the scope of the project.
This will negate the point of testing.
The bottleneck, however, is the volume involved: 20 million pages as per the DoW (an exponential number of triples associated with 20 million pages). This was previously discussed in Egypt.
The triple store becomes the bottleneck before the database does.
We don’t have a requirement for doing RDF triples currently. We haven’t looked at the classification of Semantic requirements yet, have we?
i.e. there is currently no model to map into BHL-Europe.
We could shelve the requirement altogether which will remove some technical specs within Fedora.
AGREEMENT
Adrian: We can move forward as we accept that we can mitigate limitations by using MySQL. No testing is required with the atomistic model.
We can move quickly with the implementation of Fedora with the atomistic model.
ACTION 5
Ameth will need to validate and work through with Bernd at what point ingest delivers objects to Fedora: as individual objects at page level, or together with the book. Alternatively, do a pre-test with Fedora.
Q: What is the role of the GRIB here?
Lee demonstrates a typical scenario, e.g. a book with 100 pages: the CP goes to the GRIB to scan the book and is told “here’s the GUID for it”, or the CP says “I have a book with 100 objects, so give me 100 GUIDs”.
Consideration: Using the GRIB as an option and not as a requirement?
GRIB is out of the Global block i.e. there is no link between the GRIB and the GUID.
AGREEMENT
GRIB integration remains an option but is deferred; for now we will just work with the approach of assigning GUIDs to items coming in from CPs for scanning.
c)
Update on the Global Mint Solution: ratified with Chris Freeland; the PowerPoint diagram (Global BHL Resolver) is currently available on the Wiki.
Adrian walks through the Global BHL Resolver diagram – Lee is familiar with this approach.
Adrian: We plan to create APIs from the GUID Mint.
Lee: There are different technologies used to generate GUIDs, e.g. a database sequence is the easiest.
Adrian: Globally there can only be one GUID Mint. Handles Resolution system to apply a GUID to an object.
Major revisions have been made in V7. It is now possible to resolve multiple requirements from one UID. One GUID can have multiple URLs associated with it.
You can also use a load balancing service called weight and another called country.
QLee: Are the variables part of the ID?
Adrian: High level abstraction is used to create global resiliency if one of the regions is not available – SAV.
This is outside the scope of what we need for BHL-Europe. Within the handles system you can associate a number of bits of additional information: a number of URLs, weights for them, and also country codes.
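To make the idea of one GUID carrying several weighted, country-tagged URLs concrete, here is a minimal sketch. It is not the Handle System's own API or record format: the `Location` class, the `pick_location` function and the selection rule (prefer URLs tagged with the user's country, then choose by weight) are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Location:
    """One URL stored against a GUID (hypothetical structure)."""
    url: str
    weight: float = 1.0             # relative share of traffic
    country: Optional[str] = None   # country code, or None for "any country"

def pick_location(locations: List[Location], user_country: str,
                  rng: random.Random) -> str:
    """Pick a URL for the user; expects a non-empty list with positive weights."""
    # Prefer URLs tagged with the user's country; fall back to all URLs.
    candidates = [loc for loc in locations if loc.country == user_country] or locations
    weights = [loc.weight for loc in candidates]
    # Weighted random choice among the candidates.
    return rng.choices(candidates, weights=weights, k=1)[0].url

# Made-up example: one GUID with three URLs.
EXAMPLE_LOCATIONS = [
    Location("http://eu.example.org/item/1", weight=3.0, country="DE"),
    Location("http://us.example.org/item/1", weight=1.0, country="US"),
    Location("http://mirror.example.org/item/1", weight=1.0),
]
```

For instance, `pick_location(EXAMPLE_LOCATIONS, "DE", random.Random())` returns the URL tagged `DE`, since it is the only country match.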
Lee’s Consideration:-
What is the URL?
Adrian: The URL is the DOI. We can automatically prefix it with the DOI that we have, i.e. we give out URLs from Fedora and prefix them with the DOI to give both.
Lee: Do you want people hitting Fedora directly? i.e. you don’t want to give them direct access to the archive repository.
Lee: By changing the book reader the URLs change, and then you would have to do a massive update to the handles system, i.e. for every DOI in handles for the Europe version, replace this URL with that one, for every single page….
Adrian: This would mean that, we need to choose the right bookreader from the very start.
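Lee's concern about a bulk update can be illustrated with a small sketch. It assumes, purely for illustration, that the resolution records can be viewed as a mapping from identifier to URL; the function name `rebase_urls`, the identifiers and the URL prefixes are made up.

```python
from typing import Dict

def rebase_urls(records: Dict[str, str], old_prefix: str, new_prefix: str) -> Dict[str, str]:
    """Return a copy of the records with old_prefix replaced by new_prefix."""
    updated = {}
    for identifier, url in records.items():
        if url.startswith(old_prefix):
            # Swap only the leading prefix; the rest of the URL is kept.
            updated[identifier] = new_prefix + url[len(old_prefix):]
        else:
            # Records that point elsewhere are left untouched.
            updated[identifier] = url
    return updated

# Made-up example: two page-level records and one external record.
EXAMPLE_RECORDS = {
    "id/page-1": "http://old-reader.example.org/book/7/page/1",
    "id/page-2": "http://old-reader.example.org/book/7/page/2",
    "id/other": "http://elsewhere.example.org/x",
}
REBASED = rebase_urls(EXAMPLE_RECORDS,
                      "http://old-reader.example.org/",
                      "http://new-reader.example.org/")
```

Every page-level record has to be visited once, so with records at page level the update grows with the number of pages.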
CONSIDERATIONS to be discussed further for the BHL-Europe Architecture –
Where should the GUID mint and the handles resolution be located?
How do we implement this proposed solution? (Presently it is just in the box without the handles resolution system.)
Why not let Fedora create the IDs, i.e. the GUIDs? We need robust APIs from the GUID Mint.
We have no scanning process within our project at the moment. They can use our GUID mint and every object will have a unique ID.
Does this mean the Pre-Ingest, Ingest will be minting the GUIDs at the start of the GRIB?
Example of scenario:- Has this book been scanned yet? You need a GUID to associate the metadata against.
The book (object) has a GUID, and each page within the object is associated with its own GUID; these will be linked together.
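A minimal sketch of this book-and-pages linkage, using random GUIDs as decided above. Python's standard `uuid.uuid4()` stands in for the GUID Mint here; the function names and the shape of the returned record are assumptions for illustration, not the Mint's actual API.

```python
import uuid
from typing import Dict, List, Union

def mint_guid() -> str:
    """Stand-in for a call to the GUID Mint: returns a random (version 4) UUID."""
    return str(uuid.uuid4())

def mint_book(page_count: int) -> Dict[str, Union[str, List[str]]]:
    """Mint one GUID for the book and one per page, linked in a single record."""
    book_guid = mint_guid()
    page_guids = [mint_guid() for _ in range(page_count)]
    # The record links the page GUIDs to their parent book GUID.
    return {"book": book_guid, "pages": page_guids}

# Lee's scenario: a book with 100 pages needs 1 + 100 GUIDs.
EXAMPLE_BOOK = mint_book(100)
```

Metadata arriving later can then be associated with either the book GUID or an individual page GUID.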
QLee: Looking at batch ingest in Fedora, will most CPs be providing books in batches or ingesting books one at a time?
AMelita: Uploading differs from one CP to the other.
Lee: When interacting with the system, will they say “here are my books”, and do they know they will have to handle one book at a time? This depends on the Ingest model.
Lee confirms: Based on decisions made, we can start doing this on a page level.
NEXT STEPS/FURTHER CONSIDERATIONS: Regarding Global GUID Mint proposal:-
Chris will take the base model back to the US guys and then a separate call will be arranged with Atos thereafter.
The GUID Mint – is it regional or is it only the one?
Concerning The Global architectural diagram – based on the whole GUID solution proposal further discussion is needed before amending at a later stage.
Hence, at the moment the BHL-Europe architecture diagram is okay in terms of resilience and the GUIDs; maybe it can be amended later.
We need to determine the technical resilience/Implementation of the GUID Mint.
QMelita:
Will the GUID model be at the front of the GRIB or at the Ingestion phase?
As per Wolfgang’s question – where does the GUID Mint process fit in as part of the dataflow process?
Adrian: We now have a framework to test the GUID Mint on. In terms of delivery, this should probably take approx. 4 weeks?
QMelita: Who’s going to be responsible for the Handles Implementation/GUID build?
Adrian:- Chris Sleep (NHM) and Atos – potentially.
NHM will be the home for the GUID and the Master Handles Resolver for other regions to mirror.
Confirmed:- Deliverable for March end.
FURTHER CONSIDERATIONS: gaps in the GUID Mint process that are missing in the Dataflow Process:-
Further discussion is needed concerning the process flow from the front end for the Content Providers (i.e. the flow from CP to Ingest needs to be defined at a granular level).
You can ingest but not resolve without the GUID Mint, i.e. it won’t be accessible in that way.
ACTION 6
Adrian to include Gaps (GUID Mint aspects) in the dataflow process ahead of next Wednesday tech call.
d)
Update on Bookviewer on the Islandora System (Atos)
Lee gave a live demo via web link of the book viewer currently on the Islandora system:-
The bookviewer is good for ingesting one book at a time.
Custom ingest in the Pre-Ingest process – Bernd is working on this now.
Code for the Internet Archive book reader: Open Source.
This is a good solution as a demonstrator. – Ingesting digital objects on Islandora.
ACTION 7
Atos to arrange a call with both Jiris regarding portal design work, i.e. providing different themes.
e)
ISSUE - MODS vs BHLE Schema
Lee: Bernd would like to organise a session to take the BHLE Schema and the MODS schema and see if we can fit the BHLE Schema into the MODS schema, based on the fact that there are many tools available for MODS out there.
METS will tell you that you have 100 files located in a directory.
Descriptive metadata within the METS for title, and detailed content as per technical envelope.
BHL-E Schema contains technical metadata and descriptive metadata and taxa data…
Most systems like Fedora use METS as the technical envelope to ingest digital objects, hence the question: can Fedora ingest other types of XML?
Most people in libraries use METS, so it sounds like a better choice for this project.
We will potentially be replicating data from the schema in the METS....
The BHLE Schema has been ratified by Atos; however, the issue remains of MODS being used instead of the BHLE Schema. Discussion to take place to resolve whether to stick with our schema or use MODS.
ACTION 8
Issue to be included as an agenda item for the next techgroup call.