BHL
Archive
This is a read-only archive of the BHL Staff Wiki as it appeared on Sept 21, 2018. This archive is searchable using the search box on the left, but the search may be limited in the results it can provide.

William Research, Informatics, and the Published Record Notes

Official Notes for Research, Informatics, and the Published Record:


Tom Garnett introduced the speakers:

Sandy Knapp
We will discuss Life, the Universe and Everything. In the first half, each panelist will speak for about 15 minutes; the other half is for discussion.

================================================================

SANDY KNAPP

The literature as seen by researchers: it is about books, but increasingly we think of the literature as digital. We are the first generation to do something about this extension. Researchers are not only from First World countries with First World access.
Recently TREE published an article saying that a lot of taxonomy is being done by many more people than in the past, although that doesn’t necessarily mean that more species are being described.
Oldenburg (1665), founder of the Philosophical Transactions of the Royal Society of London, set out the 4 functions of publications:
  1. Dissemination
  2. Registration: Marking how things work together
  3. Certification
  4. Archiving (the “Minutes of Science”)

This is no longer a simple task: there are more than 25,000 journals and 2 million papers published a year. No one person knows it all or works alone; science now tends to be done by groups.
The literature has changed from only books to things published on websites, even peer-reviewed ones.
Rich alluded to electronic publication starting in January, but electronic publication could be done before that; Sandy did an experiment publishing 4 new names in PLoS.
There’s a plethora of journals, none overarching like the ones in Oldenburg’s time. Literature now arrives by different routes, and from next January on there are going to be interesting new adjustments to adapt to the electronic publication of names.
When researchers were asked why they publish where they do, the reasons were mostly reputation, readership, impact factor, speed of publication and editorial board [Rowlands et al. (2004), ciber report].

Dissecting this: when asked how they decide which of the 24,000 journals to submit to, the single biggest factor is reputation and the second is Impact Factor, which is Thomson Reuters’ way of measuring how important something is. There’s a relation between Impact Factor and the number of rejections. Having an e-publication is not important, because publication has become a way to further the career of a researcher. The literature published today is the legacy literature of tomorrow.

Comparing two high-impact journals, Physical Review Letters and Syst. Biol., with Nature, the one with the most authors per title is Nature… that’s an impact factor of 32.

Science published a story that the number of papers is increasing astronomically… in the future, it’s going to be a nightmare. Scientists are reading more articles but spending less time on each article… but it’s not because we read faster, it’s because the access is mainly to the abstracts, rather than the whole content!

“Scientists have always strived to avoid unnecessary reading”

A strategy described in a blog to avoid reading would still take you about 4-5 hours a day, which leaves not much time for doing science. The position is: “I don’t really follow the literature anymore. If there’s something really important, it’ll find its way to me.”

In Faculty of 1000 reviews, ResearcherID uses the Impact Factor (a “spawn of the devil”) and Google Scholar.

Sandy said that in a quick poll among colleagues asking “Have you ever cited a paper after only reading the abstract?”, 95% said yes.

Why are we reproducing the literature if we are not reading the content?

Maybe we need something more monolithic, like Systema Naturae or EOL, or Rod Page’s work, which allows finding a word within text.

Should we consider collective behavior and connecting research informatics? This is done by communities. Maybe we should think about communities and how they use this literature.
=======================================================================

STAN BLUM

Biodiversity Informatics and the Biodiversity Literature

Progress over the last decade on informatics
Organism occurrence data, one of our

To publish organism occurrence data, we have all information flowing to GBIF.
A density map of 300 million organisms… we can see that organisms occupy the world and how they are distributed.

Combining spatial habitat and environmental data with species occurrence data creates an interpolation model that predicts the probability of distribution. Adding climatic data, we can predict optimistic and pessimistic scenarios: for example, how changes in rainfall could contract the distribution.

Remaining challenges with Occurrence Data:
Lots of Digitization still to do
Taxonomic identifications need to be updated
Georeferencing still needs to be done

Relationship to literature:
Specimens and observations are primary data.
Literature contains both reports of primary data, as well as summarized data.
Large scale digitization efforts in museums might (will) swamp the content of literature.

Taxonomic databases
Richard Pyle already covered this topic. Over the last 25 years, and before that on paper, we have had nomenclators to establish the names; then checklists of valid/accepted names (plus synonyms); then catalogs of usages in taxonomic works; and finally indexes: all unique name-strings mapped to valid names/concepts.
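
The last layer, an index mapping every name-string to a valid name, can be pictured with a small sketch. This is only an illustration, not how any of the databases mentioned here is built: the class `NameIndex`, its methods and the `normalise` helper are invented for the example, and the single synonym pair (Felis concolor as a synonym of Puma concolor) is used purely as sample input.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def normalise(name_string: str) -> str:
    """Collapse whitespace and case so trivial spelling variants share a key."""
    return " ".join(name_string.split()).lower()


@dataclass
class NameIndex:
    """Toy index (hypothetical): each known name-string points at one valid name."""

    valid_for: Dict[str, str] = field(default_factory=dict)

    def add_valid(self, name: str) -> None:
        # A valid name resolves to itself.
        self.valid_for[normalise(name)] = name

    def add_synonym(self, synonym: str, valid_name: str) -> None:
        # The target must already be resolvable in the index.
        key = normalise(valid_name)
        if key not in self.valid_for:
            raise KeyError(f"unknown valid name: {valid_name!r}")
        self.valid_for[normalise(synonym)] = self.valid_for[key]

    def resolve(self, name_string: str) -> Optional[str]:
        # Returns None for name-strings the index has never seen.
        return self.valid_for.get(normalise(name_string))


# Sample input only: one valid name and one synonym.
index = NameIndex()
index.add_valid("Puma concolor")
index.add_synonym("Felis concolor", "Puma concolor")

print(index.resolve("felis  concolor"))
print(index.resolve("Unknown name"))
```

A real index would also have to carry authorship, name usages and citations, which is where the Name, Name usage, Citation model comes in.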

Emergent consensus

- Philosophical / methodological debates

Name – Name usage – Citation (publication metadata)

What we need is to anchor this name…

Remaining challenges with taxonomic data:
Taxa are concepts created in literature
Physical instances of the same published work are “equivalent”
Need to develop shared logical identifiers that allow reconciliation across “authoritative” databases, with fewer “same as” records.

RECAP
Taxonomical names are key to:
Observational data

WHAT’s NEXT?
What other classes of information remain in literature to extract and structure to be really useful?

The Paper just sits there… while in biodiversity Informatics we want to slice and dice.

Over the next 100 years, is it going to be genetic and genomic data? These are not communicated or stored in the literature!

The broad description in a : she was working in Phenoscape.
The zebrafish is a model organism.

Understanding the origins of species through structured descriptions of diversity.

Morphological variation across species is difficult to find and synthesize. It’s a lot of work because the terms used are not standard: information retrieval from text is difficult, and descriptions are not computable across studies.

In Phenoscape they set up an ontology to represent the knowledge. Ontologies quickly become very large and complex; a guiding philosophy is required: how do you reason using the ontology?
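
As a rough illustration of what "reasoning using the ontology" can mean, the sketch below maps non-standard terms from different studies onto shared ontology terms and then answers "is X a kind of Y?" by walking is-a links. Everything in it is hypothetical: the terms, the synonym table and the function names are made up for the example and are not taken from any of the anatomy ontologies mentioned in these notes.

```python
from typing import Dict, Optional, Set

# Hypothetical mini-ontology: term -> parent term (None marks the root).
IS_A: Dict[str, Optional[str]] = {
    "anatomical structure": None,
    "fin": "anatomical structure",
    "dorsal fin": "fin",
    "pectoral fin": "fin",
}

# Hypothetical free-text variants, as different studies might write them.
SYNONYMS: Dict[str, str] = {
    "dorsal-fin": "dorsal fin",
    "back fin": "dorsal fin",
}


def normalise(term: str) -> Optional[str]:
    """Map a free-text term to an ontology term, or None if it is unknown."""
    cleaned = term.strip().lower()
    cleaned = SYNONYMS.get(cleaned, cleaned)
    return cleaned if cleaned in IS_A else None


def ancestors(term: str) -> Set[str]:
    """All terms reachable from an ontology term by following is-a links upward."""
    found: Set[str] = set()
    parent = IS_A[term]
    while parent is not None:
        found.add(parent)
        parent = IS_A[parent]
    return found


def is_kind_of(term: str, candidate_ancestor: str) -> bool:
    """True if `term` equals or descends from `candidate_ancestor`."""
    child = normalise(term)
    parent = normalise(candidate_ancestor)
    if child is None or parent is None:
        return False
    return child == parent or parent in ancestors(child)


print(is_kind_of("Back fin", "fin"))
print(is_kind_of("fin", "dorsal fin"))
```

Once two studies' terms resolve to the same ontology term, their descriptions become comparable; the difficulty noted above is that real ontologies are far larger and need a consistent philosophy for what the links mean.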

Phenoscape II & Research Coordination Network (RCN)

Working with the Amphibian Anatomy Ontology, Hymenoptera Anatomy Ontology and Plant Ontology.
Hong Cui from the University of Arizona is working with NLP and term extraction.

What’s next?
Descriptions of biological phenomena: how best to do them? This will take time: top-down design, guided by functional demonstration (what can you do?), and bottom-up curation of existing descriptions… What is in the literature tomorrow and what will come in the future?
===========================================================================

ELY WALLIS

Digitising for what?

We want to provoke discussion while sitting in the comfortable chairs, in two ways:
The library is shown as a museum.
What are we digitizing for? This rhinoceros is now extinct. Martin Kalfatovic made the point that this animal can only live in the literature; hunters are naturalists.

My first provocation is about Access. We digitize for Access to anyone, anywhere, anytime (as long as you have a decent collection).

The British Library and Google Books will digitize. The use of this vast resource is for the specialized researcher and the simply curious. Just by making it available online, they will find and access the collection. General public should be banned…

Why would the general public (the simply curious) want to access what the specialists would like to access? Is there proof?

We heard specialized researchers’ use cases; how can we think of all possible use cases?

We think people are doing cool things, and not only science. What sort of tools are they using?

The ‘simply curious’ want to access a blog like the scientific illustration blog, where images are set aside from the context.

Another reason to digitize is Preservation.

You can digitize to make art of it, like Alexander Korser-Robinson: The Gardener. If we are not reading the text anyway, then the image is a way to go.

We can cage them if we can’t digitize them, like the founder of the Internet Archive was doing.

The physicality of the book is still compelling, like very old notebooks from 1659.

There is something important in the physical object. I want to touch the object, I want to see the cover.

For one use, you can

By Printing a PDF, are we getting a souvenir? Or are we planning to read the rest after the abstract and then not having time?

We are providing tools… provocation #2

Provocation #3: Digital Scholarship tools

The book as a configuration is dead, when put online, it is just text.

We only see what fits us, but literature tools are helping us search and find.

OCR has not reached the point of being able to find all occurrences of a name.

There is a huge quantity of literature copied

What are the real tools that we need to access?

What are the real use cases for accessing literature, and how do we support those?
What are the tools we provide for that?
What does the scholarship need in the digital life and how do we support that?
=====================================================================================
DONAT AGOSTI
Recognition to:
Terry Catapano is the architect behind the XML that we use to markup.
Lyubo Penev from Pensoft uses markup schemas

Why?

In Iran, the Internet is dangerous…

Research
Conservation
Wealth of Nations

Not only scientists have the responsibility to
In most cases, we don’t have a list of the species disappearing.
We are doing science because taxpayers think that it will be useful

1. Research What taxon?
What sort of species are there? Which ones?

We should come from standing questions and use literature as a means to communicate.

We publish quickly, we tweet, we publish fast and in little bits.

What is the taxonomic currency? The Treatment is something we are interested in.

We are not interested in Journals, nor Articles, but in Treatments. That is where our things are.

EOL wants to get there.

Where are the units?

2. People want to get Credit!

3. Communications are Sharing. Is it really important to share? Sharing means encapsulating what we have and saying: here’s the metadata. We defined a set of vocabulary to publish the terms and it can get to them. A description is full of elements defining characters and character states linked to resources.

Science is testing hypotheses. We have to be able to go back to data, so we need access to this plethora of information. We are not going to have only links to a certain thing, but all Linked Data.

But even being an abstract reader we can’t get anywhere, we need data deluge, so we need machines to do it. We don’t want it to be wrong and we

We need Open Access, not new elites… the tragic point is that only 20% have agreed to open access.

We need a Mandate, but to get the Mandate to work, we need to provide Incentives.

We took down the treatments, added some marked data.

All of this applies to Conservation; many scientists don’t care about it. We need to make sure the data index is findable.

There are very interesting projects going on (GEOSS, IPBES, COP/SBSTTA) but there is very little link between what they do and what we have.

We have to look at the data… it’s not what’s the species…

We have to make the information accessible, we need to re-use.

We need to do some Quality Control Management… QCM

What does this data represent? Is it part of a research study? Was it collected randomly?

This way we can inform how to collect data, using a detailed GPS…


Finally a very important point is that we are here to

There are OECD Guidelines on how to give access to data.
The Berlin Declaration on Open Access says we want to make all our content openly accessible.
The mandate is to make the data always accessible.
NIH made it a requirement for their field.

Think about Open Access, not only here in the bowl you are living in; we should care.

We have to be very sure we know what we do, should we put it in Mendeley? Can Citebank maintain that?

We should not forget it is money that matters… I was very critical about the idea of motherships: EOL, Global Name Architecture… we should focus and focus and focus again.

We have to keep in mind money has to be revised, change the business model where the impediment of money determines what gets done.

The future is that we don’t do more BHL and how do we prevent that markup is included? Forget about old communications.. concentrate

LIFE COMMUNICATION
UNIT
CREDIT
METADATA
LINKED DATA
OPEN ACCESS
MANDATE
INCENTIVES
SPLENDID ISOLATION
QCM
OPEN ACCESS
CARE
LEDOM SSENISUB (Business Model)
FUTURE

QUESTIONS
Chris: Could it be that people are only accessing the abstract because that is all they can access and the rest is behind a paywall?
Sandy: This is after the payment…
Donat: People only read the treatment part.

Question from Nebraska Lincoln Library: One thing that we had at LCSH is that we tried with Ontologies, but we couldn’t extend the Ontology unless we had someone (a person) who understands the nuances of the Ontology.
Stan: What we are seeing in the Phenoscape project is a variety of approaches being compared. Multiple languages, terms out of use, etc. And I forgot to mention the use of characters to allow for identification, like the use that Australia has enabled to identify their species.
Ely: It’s the people doing the work to allow that.
Donat: Flora of America and the Flora of China folks are trying to extract those characters from literature, like the project from Hong Cui in Arizona.
Stan:
Donat: we need to create that first!

Tom: Anything that is not in digital format takes money, markup of what already exists costs money, and there is little money. What should we prioritize?
Born in print literature, digitized and marked up is ideal.
Focus in fine-tuning
Sandy: The interesting thing of scientific questions is that the ones now are not the ones later, so digitizing the info, we can decide to do this for us or for our grandchildren.
Donat: It’s an academic question. There’s money there and you can go to the politician and make the case that we need more of that. We need to go beyond digitization.
Ely: A good point is not to create more Motherships.
Donat: You have to ask what are the most important things in Conservation? For example, taking only what is in Madagascar may not be enough.
Dean Penchett: One of the things that is intimidating is the scale of the work, but there will be no more 20th century literature; there is not going to be any more. Some people are dealing with it. Like Sandy said: take into consideration that 30-40% of current publications happening today are coming from a combination of Zootaxa, Zookeys, etc. So it is manageable!
Ely: I like that there is no more 20th century literature. The elephant in the room is not an elephant; it is Mickey Mouse. There are difficulties that I am positive that we could solve.
Sandy: An idea is to make all the treatments online which would take care of impact factors. This is important for young scientists; we need to make a way to make a different impact factor.
Stan: Taxonomy needs to separate from that model. What needs to be done now to change the trajectory? The way we do business has to change; the analysis, the sharing, all needs to change. The Drier Project we are in, where a publication is a snapshot in time; in bioinformatics we need to share, combine, etc.
Sandy: The Administration agrees. We need to work with our institutions because that is the way institutions measure and rank themselves against each other: which of the children do you love more? To change to a different paradigm, one way is to change yourself, another is to train others in a changed way…
Donat: Here are the people from BioOne, SciELO, and other publishers; they should know that the next generation could push to provide treatments out.