[API-users] Is there a GBIF specific LSID that can be used?

Tue Aug 19 17:43:23 CEST 2014

Hi Markus,

Sorry for the long diatribe – I agree it was probably not the right forum
for it. Back to the original question, speaking from the perspective of API
users, I would say:

1) If you are going to expose your internal integer ids (as you do, and
as I’m very GLAD that you do), make sure that your internal identifiers are
as stable as they can be (i.e., don’t re-assign them in the future without
warning).  If you do expose them, no doubt API users will cache them and
cross-link to them, and rely on them to be there.

2) If you provide a dereferencing service in the form of
[httpURL_prefix]/[internal_integer_identifier] (as you do with the
http://www.gbif.org/species/ and http://www.gbif.org/occurrence/ restful
portals, and as I’m very GLAD that you do), try to persist this service
using the same http URL prefix.

3) If you must re-assign internal integer identifiers in the future
(for whatever reason), then provide an API that cross-links the old integers
to the corresponding new identifier, so people will be able to update their
local cached cross-links. Also, avoid using overlapping numbers in this
case, so that the [httpURL_prefix]/[internal_integer_identifier] for both
old integers and new integers remain globally unique without collision.
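
As a rough illustration of point 3, the old-to-new cross-link could be as
simple as a lookup table that clients replay against their cached keys.
Everything in this sketch is hypothetical: the table contents, the function
names, and the assumption that a retired id always points at exactly one
replacement. It does not describe any existing GBIF API.

```python
# Hypothetical cross-link table: retired integer id -> replacement id.
# The numbers are invented for illustration only.
REASSIGNED = {
    1001: 5001,
    1002: 5002,
    5002: 9002,  # an id that was reassigned a second time
}


def current_id(cached_id: int, reassigned: dict[int, int] = REASSIGNED) -> int:
    """Follow the cross-links until reaching an id that was never retired."""
    seen = set()
    while cached_id in reassigned:
        if cached_id in seen:  # guard against a cyclic (corrupt) table
            raise ValueError(f"cycle in cross-link table at {cached_id}")
        seen.add(cached_id)
        cached_id = reassigned[cached_id]
    return cached_id


def has_overlap(reassigned: dict[int, int], live_ids: set[int]) -> bool:
    """True if any retired id is also in use as a live id (a collision)."""
    return any(old in live_ids for old in reassigned)
```

A client holding the cached key 1002 would call current_id(1002) to update
its cross-link, and has_overlap() is the kind of check that would catch the
number collisions warned about above.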

I think these are by far the most important three things from the
perspective of an API user.  I think the addition of support for DOIs in the
future would be wonderful as an additional dereferencing service, and I
think it would be very valuable and appropriate for GBIF to provide this
service to end-users.  However, if the DOIs take the form of
10.[GBIFDOIDomain]/[Identifier], then I would STRONGLY recommend that the
“[Identifier]” part itself be globally unique, independently of the
“10.[GBIFDOIDomain]”.  Doing so costs you nothing, and gains you something
VERY valuable – which is independence of the identifier and the
dereferencing service.  Such independence means that the global uniqueness
of the Identifier (sensu stricto) does not depend on the persistence of the
dereferencing service.  Such cross-dependence weakens the utility of
identifiers, because it means that if EITHER the dereferencing service
httpURL Prefix OR the identifier suffix needs to change for some reason,
then the GUID dies.
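
To make the independence argument concrete, here is a small sketch that
strips the resolver prefix from a few URLs and looks at what is left. The
species URL and the UUID are taken from messages quoted further down this
thread; the occurrence URL that reuses the same integer is an invented
pairing for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit
import uuid


def bare_identifier(url: str) -> str:
    """Return the last path segment, i.e. the identifier without its resolver."""
    return urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]


def is_uuid(text: str) -> bool:
    """True if the text parses as a UUID."""
    try:
        uuid.UUID(text)
        return True
    except ValueError:
        return False


# Integer ids: the same number under two different prefixes.
# (The occurrence URL is an assumed example, not a known record.)
species = "http://www.gbif.org/species/2396049"
occurrence = "http://www.gbif.org/occurrence/2396049"

# UUID-based id under the ZooBank prefix.
tnu = "http://zoobank.org/4B913B74-E880-4EC9-B0A9-F3AB9F02288B"

# Without their prefixes the two integers are indistinguishable,
# while the UUID is still a usable identifier on its own.
collide = bare_identifier(species) == bare_identifier(occurrence)
print(collide, is_uuid(bare_identifier(species)), is_uuid(bare_identifier(tnu)))
```

The integer only means something while the prefix survives; the UUID keeps
its identity even if the dereferencing service changes.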

I’ll reply in more detail to the rest of your email off-list, as it involves
more specifics than are appropriate here; any subscribers interested in those
details should ping me and I’ll forward it along.

Aloha,

Rich

From: Markus Döring [mailto:mdoering at gbif.org] 
Sent: Tuesday, August 19, 2014 12:38 AM
To: Richard Pyle
Cc: Roderic Page; Rob Guralnick; api-users at lists.gbif.org
Subject: Re: [API-users] Is there a GBIF specific LSID that can be used?

Hi Rich, Rod & Rob,

thanks for this interesting taxonomic / GNA discussion. It might be a little
confusing and boring to GBIF API users, so maybe we continue privately and
restrict discussions on this list to GBIF API related topics.

The original question raised was if GBIF provides LSIDs or other globally
unique identifiers for GBIF backbone taxa. As we only have local ids now and
GBIF will be able to issue DataCite DOIs very soon I wondered if it helps to
mint DOIs on top of the local ids to make them globally unique. Any thoughts
on this would be appreciated.

A checklist bank id refers to a "name usage" and is similar to a TNU in GNUB
I suppose. It identifies a taxon name being used within a certain
(taxonomic) dataset and can refer to either an accepted taxon or a synonym.
Identifiers are stable over different versions of the backbone, but the
exact classification and list of synonyms for an accepted taxon is allowed
to change. In the near future I would also like to allow the name string to
change in case of misspellings and other small variations. 

For concrete implementations it is quite a challenge to come up with a clear
definition when a *taxon* identifier should change and when it should remain
the same. Would users like to see true taxon concept identifiers for the
GBIF backbone that remain stable as long as GBIF still regards the taxon as
the same, whatever scientific name is used as the currently accepted label? If we
had better information about types and original name usages (protonyms,
basionyms) we could try to assign stable ids to a fixed set of protonyms in
the GBIF backbone. Does that sound reasonable?

Cheers,

Markus

On 18 Aug 2014, at 23:20, Richard Pyle <deepreef at bishopmuseum.org> wrote:

Since Rod opened the can of worms, I’ll dig into it and feast along with the
others.

Here is what seven years of NOMINA (http://globalnames.org/Nomina) meetings,
plus millions of conversations at TDWG, Pro-iBiosphere, ICZN, ICB, iDigBio
and many other regional, national, and international conferences, plus
millions of dollars of targeted funding from various sources to drive the
Global Names initiative... have led us to.

First, the biodiversity informatics realm is full of name-strings.  These
are strings of text characters, usually encoded as UTF-8, purported to
represent taxon names of organisms.  They may or may not include
authorships, and/or abbreviations, and/or qualifiers of various sorts.
These are the things that are indexed in GNI
(http://gni.globalnames.org).

I completely agree with Rod that a “taxon name” is much more than just the
string of UTF-8 characters used to render it.  For clarity of communication
(as if that were even possible in these kinds of discussions), I refer to
these as “name objects”.  They are conceptual (abstract) constructs, and are
uniquely represented by a rich suite of metadata (publication metadata in
which the name was originally established in accordance with a nomenclatural
Code, authorship metadata, type specimen or type taxon metadata, etc.). A
single taxon name might be represented via different name-strings (e.g.,
different alternate spellings, different genus combinations, etc.), and a
single name-string might be applied to different name-objects (homonyms &
homographs).

And, again, I completely agree with Rod that a “taxon” (=taxon concept, =
taxonomic circumscription) is something else – it is another conceptual
(abstract) construct, typically represented by a broader collection of
metadata, including things like included child taxa, included synonym taxa,
biological characters, and possibly other stuff such as geographic
distribution. A single taxon might have more than one taxon name applied to
it (synonyms), and a single taxon name (in the name-object sense, not just
the name-string sense) might have been used to represent different taxon
concepts (e.g., sensu stricto vs. sensu lato senses of the same
name-object). The most practical way to refer to a taxon is the combination
of a name-object (as described above), plus a usage instance, e.g. “Aus bus
Linnaeus 1758 sec. Pyle 2014”  (the part before the “sec.” represents the
name-object, and the part after the “sec.” refers to the specific usage
instance that applies the name-object to a taxon concept).

Classifications (per se) are a little bit different, but are often included
in the taxon concept space, even though they are technically not (logically)
part of the taxon concept.  The taxon concept is really the circumscribed
set of organisms included within the concept.  Changing the higher
classification, by itself, has no impact on the circumscribed set of
organisms included within the concept.  However, that’s a topic for another
can-of-worms discussion.

So... the seven years of NOMINA meetings, millions of conversations and
millions of dollars have revealed that the notion of a “Taxon Name Usage”
instance (TNU), as indexed in the Global Names Usage Bank (GNUB), is an
extremely powerful unit that addresses taxon names (name-objects), taxon
concepts, and classifications; all with a single domain of identifiers
(minted for TNUs).  Rob Whitton and I have functioning prototypes that
demonstrate the power of TNUs for managing nomenclatural, taxonomic, and
classification data; and we just last week submitted a proposal to NSF to
expand these prototypes into full-function services.

The seven years and millions of conversations and dollars have also taught us
that the most practical way to manage this information in biodiversity
informatics-land is through two nodes:  a “dirty bucket” (GNI name-strings),
and a “clean bucket” (GNUB).  Dima Mozzherin has new funding from NSF to
begin developing the service workflows to bridge name-strings (as they exist
in most biodiversity databases) to Protonyms (the subset of TNUs that
represent name-objects).  Starting in October, we will begin to bridge our
respective prototypes (funded by NSF through the Global Names project) into
a seamless tool.  We hope to have something more meaningful to say about
this at TDWG; but one of the key things to keep in mind is that GNA (which
includes GNI & GNUB) are low-level cross-linking tools and services – NOT
replacements for CoL, ITIS, EOL, GBIF, WoRMS, NCBI taxonomy, etc., etc.,
etc.  These other initiatives provide the information that end-users
actually want.  The role of GNA is to provide a core infrastructure
(analogous to DNS) that most people use every day without ever knowing it.

The DOI thing is a bit of a misdirection.  The “identifiers” (sensu non-LOD
world) for name-strings are managed by GNI, and for TNUs by GNUB.  Both are
UUIDs, and as such are pure identifiers (i.e., not actionable by
themselves). DOI is one of many possible identifier dereferencing services
(ARK is another, and there are a host of others).  DOI happens to be a
particularly robust and useful dereferencing service, and as such it makes
perfect sense to me to represent TNU identifiers as DOIs, as long as someone
has the funding to make it happen.

So... to follow on Rod’s example, the TNU representing the “name-object” for
the species epithet “vilcabambae”, as originally established in the
publication Lehr 2007, is:

4B913B74-E880-4EC9-B0A9-F3AB9F02288B

Alone, that UUID does even less for you than the text-string “Pristimantis
vilcabambae” does.  However, by combining it with a dereferencing service,
such as http://zoobank.org/, you can start doing some more interesting
things:

http://zoobank.org/4B913B74-E880-4EC9-B0A9-F3AB9F02288B

For example, you can get to the original publication as registered in
ZooBank (http://zoobank.org/37BFC245-DDD6-4AB4-B4B1-DD6826B86873), which gets you a
link to the DOI and the ResearchGate page for this reference.  You can also
get a link to the GBIF page, ITIS page, EOL page, ION page, and a few others
(you’d also get links to the ASW site, if they had continued to expose their
internal identifiers; though now it seems that they don’t anymore).  You
also see a call to BHL’s OpenURL service to “automagically” get the page
image of the original description.  And you get a resultset from GNI to see
links to other datasets.

And that’s all from just ONE metadata dereferencing service (ZooBank).  I
think it would be WONDERFUL to have this identifier represented within
DOI-space as well (e.g.,
http://dx.doi.org/10.XXXXX/4B913B74-E880-4EC9-B0A9-F3AB9F02288B), but
someone needs to step forward as the “XXXXX” domain to mint the DOI.  By
doing so, not only would you be plugged into the GNA infrastructure (as
described above), but also the CrossRef infrastructure and all the whizbang
services that it provides.  PLAZI and GNA have agreed that a taxon treatment
= a TNU, and hence will share the same UUIDs for them (thus opening up the
PLAZI services for use with the same identifiers).
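
The idea of one bare identifier served by several dereferencing services can
be sketched in a few lines. The ZooBank prefix comes from the example above;
the DOI prefix keeps the "10.XXXXX" placeholder because nobody has stepped
forward as that domain, so that URL will not resolve. The function and
dictionary names are made up for this sketch.

```python
import uuid

# Resolver prefixes. "10.XXXXX" is a placeholder, not a registered DOI prefix.
RESOLVERS = {
    "zoobank": "http://zoobank.org/",
    "doi": "http://dx.doi.org/10.XXXXX/",
}


def dereference_urls(identifier: str,
                     resolvers: dict[str, str] = RESOLVERS) -> dict[str, str]:
    """Pair one bare UUID with every known resolver prefix."""
    # uuid.UUID() validates the string; str() normalises it to lowercase,
    # so upper() restores the casing used in this thread.
    canonical = str(uuid.UUID(identifier)).upper()
    return {name: prefix + canonical for name, prefix in resolvers.items()}


urls = dereference_urls("4B913B74-E880-4EC9-B0A9-F3AB9F02288B")
for name, url in urls.items():
    print(f"{name}: {url}")
```

Adding another dereferencing service is then a one-line change to the table,
and the identifier itself never changes.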

In summary, taxon name-strings, name-objects, concepts (and also
classifications) are very different things, with different implied
properties, and different implied meanings.  GNA is well on its way to
serving robust services based on persistent identifiers that are actionable
through multiple dereferencing services. Including more dereferencing
services (like DOI) is a GOOD thing!  Re-using identifiers is a GOOD thing.
Unnecessarily re-inventing wheels is NOT a particularly good thing.

Aloha,

Rich

P.S. The astute among you will have noticed that the GNA cross-links and
services (including ZooBank registrations) described above did not exist
before I started replying to this email. And that is the POINT.  GNA is an
INFRASTRUCTURE to allow *US* (we the biodiversity practitioners of the
world) to cross-link content.  The fact that I was able to use the EXISTING
GNA infrastructure to cross-link all these resources associated with the
text-string name “Pristimantis vilcabambae” in FAR LESS time than it took me
to compose this email message, speaks volumes about the potential that such
an infrastructure can have.

From: api-users-bounces at lists.gbif.org
[mailto:api-users-bounces at lists.gbif.org] On Behalf Of Roderic Page
Sent: Monday, August 18, 2014 3:53 AM
To: Rob Guralnick
Cc: api-users at lists.gbif.org
Subject: Re: [API-users] Is there a GBIF specific LSID that can be used?

Hi Rob,

At the risk of opening the whole taxon/name/concept can of worms, I’d see
this a little differently.

For me a taxon name is a name + the original publication, rather than simply
a text string. A taxon is different again, being essentially a statement
about a collection of things that belong to the same taxon, and a statement
of what to call them.

Taxon databases (e.g., GBIF) tend to use strings for names, when it would be
more elegant to use identifiers for names + publications.  We could go some
way towards cleaning the mess we’ve accumulated if we adopted (and reused)
identifiers for these things. For a start, name strings that don’t map to
identifiers in nomenclators would immediately be under suspicion as being
potentially erroneous. It also links names to evidence, which is something
we’re spectacularly bad at doing at the moment. 

For example, “Pristimantis vilcabambae” is a text string which isn’t
terribly useful. But if we combine that with details on where and when it
was published we get something a bit more useful:

“Pristimantis vilcabambae Lehr 2007 published in DOI
http://dx.doi.org/10.3099/0027-4100(2007)159[145:NEFLPP]2.0.CO;2”. This is
the information I’m accumulating in BioNames, by combining metadata from ION
LSIDs with data from CrossRef and BioStor; see
http://bionames.org/names/cluster/1949681

Should this “name string + publication” get a DOI? Sure. Then I’d want GBIF
(and other taxon databases) to link to this name on their taxon pages. In
other words, http://www.gbif.org/species/2425396 should have an identifier
for the taxon name, instead of simply using a text string.

I’m beginning to sound like Rich Pyle, and he and I would almost certainly
model these things differently, but name strings <> taxon names <> taxa

Regards

Rod

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: Roderic.Page at glasgow.ac.uk
Tel: +44 141 330 4778
Skype: rdmpage
Facebook: http://www.facebook.com/rdmpage
LinkedIn: http://uk.linkedin.com/in/rdmpage
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
ORCID: http://orcid.org/0000-0002-7101-9767

Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ

On 18 Aug 2014, at 14:29, Robert Guralnick <Robert.Guralnick at colorado.edu> wrote:

  Markus --- I think the answer to the question: "Would a taxon DOI be a
valuable feature for you?" really depends on some of the details.  With a
taxon name, you are putting a DOI on a string and one that has been
dissociated from its source(s).  I would think more valuable would be a DOI
linked to the checklist that contained the name, and maybe a passthrough (a
la suffix passthroughs in the EZID system) to the individual name.  That way
I can resolve that taxon name to the source from whence it came.  

Best, Rob

On Mon, Aug 18, 2014 at 3:44 AM, Markus Döring <mdoering at gbif.org> wrote:

Hello Geoff,

GBIF uses simple integers as taxon identifiers, for example 2396049 for
Ecsenius bicolor.
These ids are stable, but obviously not globally unique. If you need a URI
right now, I would recommend using our RESTful portal URL:
http://www.gbif.org/species/2396049
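
For a client that already stores LSIDs from other sources, one possible way
to hold GBIF keys next to them is sketched here. The helper names are
hypothetical, and the parser only handles the plain
urn:lsid:authority:namespace:object form seen in this thread; it ignores the
optional revision part that an LSID may carry.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str


def parse_lsid(lsid: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object' into its parts."""
    parts = lsid.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return Lsid(parts[2], parts[3], parts[4])


def gbif_species_uri(key: int) -> str:
    """Globally unique stand-in for a GBIF integer key: the portal URL."""
    return f"http://www.gbif.org/species/{key}"


# Ecsenius bicolor as identified by three of the sources named in this thread.
print(parse_lsid("urn:lsid:itis.gov:itis_tsn:636326"))
print(parse_lsid("urn:lsid:marinespecies.org:taxname:277652"))
print(gbif_species_uri(2396049))
```

Storing the full URL rather than the bare integer keeps the GBIF key from
colliding with integer ids from other databases, such as the ITIS TSN.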

For the future I could imagine us assigning DOIs to taxa reusing the current
integer ids, but that has to be carefully evaluated first.
Would a taxon DOI be a valuable feature for you?

Cheers,
Markus

--
Markus Döring
Software Developer
Global Biodiversity Information Facility (GBIF)
mdoering at gbif.org
http://www.gbif.org

On 05 Aug 2014, at 05:17, Geoff Shuetrim <geoff at galexy.net> wrote:

> Working with a range of web services, I have found myself making extensive
> use of the LSIDs that are specific to each data source.  For example, for
> ITIS, I use the Ecsenius bicolor LSID: urn:lsid:itis.gov:itis_tsn:636326
>
> For WoRMS, the LSID for Ecsenius bicolor is:
> urn:lsid:marinespecies.org:taxname:277652
>
> For Atlas of living Australia the LSID for Ecsenius bicolor is:
> urn:lsid:biodiversity.org.au:afd.taxon:99c29e7c-5b04-4e57-8e6b-82aa442a801a
>
> Is there a GBIF LSID that can similarly be used as a unique identifier for
> a taxon? I have come across the various GBIF unique keys but these are not
> unique outside of the GBIF environment and within the Gaia Guide systems I
> am deciding how best to work with these, ensuring their uniqueness,
> alongside identifiers from other data sources.
>
> Thanks again for your assistance.
>
> Geoff Shuetrim
> Gaia Guide Association
> http://www.gaiaguide.info/
> _______________________________________________
> API-users mailing list
> API-users at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/api-users
