Re: [API-users] What happens to previous data after dataset/crawl?

29 Aug 2016

On Monday, 29 August 2016, Tim Robertson <trobertson@gbif.org> wrote:
...
(...) I’m afraid it is a little messy, but I will try and explain.
The original thinking by the DwC authors was that *dwc:occurrenceID*
would be used to identify an occurrence in nature, while *dc:identifier*
would be a digital record identifier.
(...)
If a provider adds occurrenceID and leaves those 3 fields the same, we
will recognise this and update the records.  If they were to remove one of
those triplets, we would insert new records and the old ones would be
deleted at some point.  It is a manual process to allow us to engage with
publishers before running deletions.
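
As a rough illustration of the matching behaviour described above, here is a
minimal Python sketch. It is a toy model only, not GBIF's actual indexing
code: the ToyIndex class, its ingest method, and the sample values are all
hypothetical names invented for this example. It assumes an incoming record
is matched first by occurrenceID and then by triplet, and that a match means
"update" while no match means "insert".

```python
from typing import Optional

# (institutionCode, collectionCode, catalogNumber)
Triplet = tuple[str, str, str]


class ToyIndex:
    """Hypothetical index that remembers every identifier seen for a record."""

    def __init__(self) -> None:
        self._by_triplet: dict[Triplet, int] = {}
        self._by_occurrence_id: dict[str, int] = {}
        self._next_key = 1

    def ingest(
        self, triplet: Optional[Triplet], occurrence_id: Optional[str]
    ) -> tuple[int, str]:
        """Return (record key, "update" or "insert") for one incoming record."""
        key = None
        # Assumption: occurrenceID is tried first, then the triplet.
        if occurrence_id is not None:
            key = self._by_occurrence_id.get(occurrence_id)
        if key is None and triplet is not None:
            key = self._by_triplet.get(triplet)

        action = "update"
        if key is None:
            # Nothing matched, so this model treats the record as new.
            key = self._next_key
            self._next_key += 1
            action = "insert"

        # Remember whichever identifiers were supplied for later crawls.
        if occurrence_id is not None:
            self._by_occurrence_id[occurrence_id] = key
        if triplet is not None:
            self._by_triplet[triplet] = key
        return key, action


index = ToyIndex()
triplet = ("SANT", "SANT-Herb", "12345")  # illustrative values only
print(index.ingest(triplet, None))        # first crawl, triplet only
print(index.ingest(triplet, "occ-1"))     # occurrenceID added, triplet unchanged
print(index.ingest(("SANT", "OTHER", "12345"), None))  # triplet changed, no occurrenceID
```

In this model the second call resolves to the same record as the first, and
the third call creates a new record, leaving the old one behind until someone
deletes it.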
Thanks Tim. I understood your explanation about occurrenceID / identifier
differences.

Can you elaborate a bit more on what happens if a provider using triplets
decides to add occurrenceID?
You said the GBIF reindex (1st reindex) will recognize them and update the
records.
What if LATER ON (after the 1st reindex) a record is changed so that the
occurrenceID STAYS STABLE but the triplet value is deleted or modified?
Shouldn't the next GBIF reindex (2nd and later) keep using the existing
occurrenceID? So, the record should be updated, not inserted.
...
From your comment I understood it would be inserted but I guess you meant
that happens if both changes (occurrenceID added + triplet changed) are
done at the same time (same GBIF reindex).
Would this two-step strategy make a difference, so that we avoid having to
ask you to delete records?
And yes, I think it makes sense to use some kind of UUIDs for occurrenceID
values.
I understand these only need to be unique within the dataset, but
it's better to make them globally unique for linking purposes, right?

Thanks for your help, and indeed for the upcoming GBIF filtering options.

David
...
I notice that the dataset you linked to was published in 2007, before
dwc:occurrenceID existed.  It is therefore using the *dwc:institutionCode*,
*dwc:collectionCode* and *dwc:catalogNumber* identifier strategy.
Please note that Darwin Core recommends concatenating the 3 fields to
create a *dwc:occurrenceID*.  Please be aware that this approach means
that should someone choose to e.g. change the collection code, the
occurrence record ID will also change, thus removing all linkability.  If
this is expected, then forging unique ids for records using e.g. UUIDs or
similar would be a more robust longer-term solution, and in general we
recommend targeting this.
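
To make the trade-off concrete, here is a small Python sketch. The
triplet_id helper, the ":" separator, and all field values are assumptions
made up for this example; they are not taken from the dataset discussed in
this thread. It shows that an identifier derived from the triplet changes
when the collection code changes, while a UUID minted once and stored with
the record does not.

```python
import uuid


def triplet_id(institution_code: str, collection_code: str, catalog_number: str) -> str:
    """Build an identifier by concatenating the three fields.

    The ":" separator is an assumption for this sketch.
    """
    return ":".join([institution_code, collection_code, catalog_number])


# Illustrative values only.
before = triplet_id("SANT", "Herb", "12345")
after = triplet_id("SANT", "Herbarium", "12345")  # collection code renamed
print(before != after)  # the derived identifier changed with the code

# An opaque identifier is minted once and stored with the record,
# so later edits to descriptive fields do not affect it.
record = {"occurrenceID": str(uuid.uuid4()), "collectionCode": "Herb"}
minted = record["occurrenceID"]
record["collectionCode"] = "Herbarium"
print(record["occurrenceID"] == minted)
```

The important point for the second half is that the UUID has to be persisted
in the source database; generating a fresh one on every export would change
the identifier each time.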
We do recommend people strive to provide occurrenceID, even on older
data.  This simplifies things going forward.
I hope this helps, but please feel free to ask me any questions around
this.
This is off topic, but while you are reading please know that we expect
stateOrProvince to be a filter on GBIF next week along with locality,
protocol, license, organismID, publishingOrgKey (API only), crawlID (API
only).  I know you have interest in this functionality.
Best wishes,
Tim
From: API-users <api-users-bounces@lists.gbif.org> on behalf of
Herbario SANT <sant.herbarium@gmail.com>
Date: Sunday 28 August 2016 at 16:00
To: "api-users@lists.gbif.org" <api-users@lists.gbif.org>
Subject: Re: [API-users] What happens to previous data after
dataset/crawl?
Hi
I take the opportunity to ask about the difference between two GBIF terms:
What is the difference between "occurrenceID" and "identifier"?  Both
have the same value in this dataset:
http://www.gbif.org/occurrence/1291766512/verbatim
http://api.gbif.org/v1/occurrence/1291766512
I see "occurrenceID" well explained here:
http://gbif.blogspot.com.es/2014/04/ipt-v21.html
http://rs.tdwg.org/dwc/terms/#occurrenceID
But I can't find the explanation for "identifier", which I think some
institutions have been incorrectly understanding as "occurrenceID".
For example:
http://www.gbif.org/occurrence/142907792/verbatim
http://api.gbif.org/v1/occurrence/142907792
There is an "identifier" in that occurrence, but no "occurrenceID".
1) What exactly is the meaning of that "identifier"?  Why is it not
explained on the DwC terms page?
2) What happens if the data provider keeps all data UNCHANGED, but adds
the "occurrenceID" which was missing?
    Would the next GBIF reindex keep the same number of records and add
their occurrenceIDs? (perhaps looking at the triplet in that "identifier"?)
    Would it later be safe to change any fields in the dataset (even
"identifier", "catalognumber", ...) if that data provider keeps those
occurrenceIDs stable?
Thanks
On 27 August 2016 at 08:01, Roderic Page <Roderic.Page@glasgow.ac.uk> wrote:
...
Just wanted to check the consequences of the following dataset operation.
Say I have a dataset with 10 occurrences with occurrence ids 1-10. In my
local database I now assign those 10 occurrences new identifiers a-j. If I
create a new DwCA file for my data and crawl the new archive, my
expectation is:
1. Old data with ids 1-10 is deleted from GBIF index
2. New data with ids a-j is indexed
So the end result is that the dataset has 10 occurrences. I'm asking
because I know that in the past some datasets have changed identifiers and
this has resulted in records with old and new identifiers coexisting in the
GBIF index, resulting in duplicated data.
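
The scenario above can be sketched in Python as a plain set comparison. The
compare_crawls function and its result keys are hypothetical names for this
example; this only illustrates the arithmetic of the question and says
nothing about how GBIF actually processes a crawl.

```python
def compare_crawls(indexed_ids: set[str], crawled_ids: set[str]) -> dict[str, set[str]]:
    """Classify identifiers by comparing the existing index with a new crawl."""
    return {
        "unchanged": indexed_ids & crawled_ids,  # present before and after
        "new": crawled_ids - indexed_ids,        # only in the new archive
        "orphaned": indexed_ids - crawled_ids,   # only in the old index
    }


old_ids = {str(n) for n in range(1, 11)}  # ids 1-10
new_ids = set("abcdefghij")               # ids a-j
result = compare_crawls(old_ids, new_ids)

# If orphaned records are not deleted, old and new records coexist.
total_if_orphans_kept = (
    len(result["unchanged"]) + len(result["new"]) + len(result["orphaned"])
)
total_if_orphans_deleted = len(result["unchanged"]) + len(result["new"])
print(total_if_orphans_kept, total_if_orphans_deleted)
```

With no identifiers in common, every old record is orphaned, so the dataset
shows 20 records until the 10 old ones are removed.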
Obviously it would be nice to have stable, unchanging identifiers for
occurrences, but for the data set I'm working with, the creators have
changed their minds between versions of the data :(
Regards,
Rod
_______________________________________________
API-users mailing list
API-users@lists.gbif.org
http://lists.gbif.org/mailman/listinfo/api-users
-- 
David García San León
(dixitalización / control de fondos)
Herbario SANT
Facultade de Farmacia - Laboratorio de Botánica
Universidade de Santiago de Compostela
15782 - Galicia (Spain)
http://www.usc.es/herbario
Tel. +34 881815022
Fax +34 981594912
Skype: herbarium_sant
Twitter: @SANT_Herbarium