[Pipelines] How stable are GBIF occurrence ids?

Mon Oct 11 09:14:53 UTC 2021

Hello,

Context:

We are currently developing an Early Alert system for invasive species
where all data transits through GBIF:

- on one side, we are helping multiple data providers to push their
observations to GBIF at a high frequency (probably daily)
- on the other side, we have a web application
<https://github.com/riparias/early-warning-webapp> that allows users to be
alerted of specific species in specific locations. To do so, that web
application *refreshes its internal database daily, by dropping and
recreating all occurrences based on an automated GBIF data download*. Data
in the gbifId field is used as a long-term occurrence identifier in this
web application.

It is crucial for our system that we can unambiguously distinguish new
occurrences from existing occurrences that were recently republished (and
maybe updated), and also detect occurrences that have been updated in the
source dataset.
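
To make this requirement concrete, here is a minimal sketch of the daily
comparison we have in mind. It assumes that gbifId is stable across
republications (which is exactly what I am asking about below). The function
name and the record fields are hypothetical and are not taken from the web
application's code.

```python
def classify_occurrences(previous, current):
    """Compare two daily snapshots of occurrences.

    Each snapshot is a dict mapping gbifId -> record (a dict of fields).
    Assumption: a republished occurrence keeps its gbifId.
    """
    prev_ids = set(previous)
    curr_ids = set(current)
    shared = prev_ids & curr_ids
    return {
        # Ids seen for the first time today
        "new": sorted(curr_ids - prev_ids),
        # Ids present yesterday but missing today
        "removed": sorted(prev_ids - curr_ids),
        # Same id, but at least one field changed
        "updated": sorted(i for i in shared if current[i] != previous[i]),
        # Same id, identical content
        "unchanged": sorted(i for i in shared if current[i] == previous[i]),
    }
```

If a republished occurrence received a fresh gbifId, it would show up here as
one "removed" plus one "new" entry, and users would be alerted again for an
observation they have already seen.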

Finally, the question:

In the GBIF infrastructure, when exactly is a new occurrence GBIF ID issued
(vs. reused)? Is it just a matter of following the publisher's occurrenceID,
or is the algorithm that chooses GBIF IDs more complex?

Or, more practically: what should I say to our data providers to guarantee
all new occurrences they publish get a new gbifId and existing occurrences
that get republished keep the same gbifId?

I'd also be interested to know if you think the concept of using GBIF ids
as stable identifiers in an external tool is fundamentally flawed (but I
hope not!)

Thanks a lot, don't hesitate to redirect me if this is not the right place
to ask this question.

Nicolas
