Hi David,
I hope you are well.
This stems from the community-developed Darwin Core (DwC) standard building on top of the
Dublin Core (DC) standard, rather than something GBIF have introduced. I’m afraid it is a little messy, but I will try and explain.
The original thinking by the DwC authors was that dwc:occurrenceID would be used to identify an occurrence in nature, while
dc:identifier would be a digital record identifier.
In practice, almost everyone mapped their database foreign key to occurrenceID, and thus the two terms became pretty much synonymous.
GBIF recommend that publishers put unique identifiers on records using dwc:occurrenceID. The IPT helps enforce this. This way the publisher can very easily control how many records appear in GBIF. We pass through
dc:identifier, as you noticed, but simply treat it like any other non-interpreted field. We have never interpreted this field, and it does not control how we identify records in the GBIF.org index. However, it makes sense to provide it, as there could
well be systems other than GBIF indexing the data, and they may well recognise Dublin Core terms.
Historically, publishers who did not provide occurrenceID had to supply data mapped to
dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber, or the equivalent in the
ABCD standard.
If a provider adds occurrenceID and leaves those 3 fields the same, we will recognise this and update the existing records. If they were to change or remove any one of the fields in that triplet, we would insert new records and the old ones would be deleted at some point. Deletion is a manual
process, which allows us to engage with publishers before running deletions.
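As a rough illustration of the matching behaviour described above, here is a minimal Python sketch. It is not GBIF's actual implementation: the function names, the dictionary-based lookup tables and the integer record keys are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical types for this sketch only; not GBIF's real data model.
Triplet = tuple[str, str, str]


def triplet_of(record: dict) -> Optional[Triplet]:
    """Return (institutionCode, collectionCode, catalogNumber), or None if any part is missing."""
    parts = (
        record.get("institutionCode"),
        record.get("collectionCode"),
        record.get("catalogNumber"),
    )
    return parts if all(parts) else None


def match_existing(record: dict, by_occurrence_id: dict, by_triplet: dict) -> Optional[int]:
    """Find the key of an already-indexed record, or None if the record looks new.

    occurrenceID is tried first; the triplet is the fallback, so a record that
    gains an occurrenceID but keeps its triplet still matches the old entry.
    """
    occurrence_id = record.get("occurrenceID")
    if occurrence_id and occurrence_id in by_occurrence_id:
        return by_occurrence_id[occurrence_id]
    triplet = triplet_of(record)
    if triplet is not None and triplet in by_triplet:
        return by_triplet[triplet]
    return None
```

In this sketch, a record that arrives with a new occurrenceID and an unchanged triplet resolves to the existing key (an update), whereas a record whose collectionCode has changed and whose occurrenceID is unknown resolves to None (a new record, leaving the old one as a deletion candidate).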
I notice that the dataset you linked to was published in 2007, before dwc:occurrenceID existed. It is therefore using the dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber
identifier strategy.
Please note that Darwin Core recommends concatenating the 3 fields to create a dwc:occurrenceID. Please be aware that with this approach, should someone choose to change e.g. the collection code, the occurrence record ID will also change, thus
breaking any existing links to the record. If such changes are expected, then minting unique IDs for records using e.g. UUIDs or similar would be a more robust long-term solution, and in general we recommend aiming for this.
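To make the trade-off concrete, here is a small Python sketch of the two approaches. The helper names and the ":" separator are assumptions for illustration only; Darwin Core does not mandate them here.

```python
import uuid


def triplet_occurrence_id(institution_code: str, collection_code: str,
                          catalog_number: str, sep: str = ":") -> str:
    """Build an occurrenceID by concatenating the triplet (separator is an assumption)."""
    return sep.join([institution_code, collection_code, catalog_number])


def new_uuid_occurrence_id() -> str:
    """Mint an opaque occurrenceID; store it with the record and never regenerate it."""
    return str(uuid.uuid4())


# The concatenated ID is only as stable as its parts: renaming the collection
# code produces a different ID for what is really the same record.
before = triplet_occurrence_id("INST", "BIRDS", "12345")
after = triplet_occurrence_id("INST", "AVES", "12345")
ids_differ = before != after  # True: anything linked to the old ID no longer resolves
```

A UUID minted once and stored alongside the record is unaffected by later edits to institutionCode, collectionCode or catalogNumber, which is why it is the more robust choice.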
We do recommend people strive to provide occurrenceID, even on older data. This simplifies things going forward.
I hope this helps, but please feel free to ask me any questions around this.
This is off topic, but while you are reading, please know that we expect stateOrProvince to be available as a filter on GBIF next week, along with locality, protocol, license, organismID, publishingOrgKey (API only) and crawlID (API only). I know you have an interest in this
functionality.
Best wishes,
Tim
Hi
I take the opportunity to ask about the difference between two GBIF terms:
What is the difference of "occurrenceID" compared to "identifier"? Both have the same value in this dataset:
1) What exactly is the meaning of that "identifier"? Why is it not explained on the DwC terms page?
2) What happens if the data provider keeps all data UNCHANGED, but adds the "occurrenceID" which was missing?
Would the next GBIF reindex keep the same number of records and add their occurrenceIDs? (perhaps looking at the triplet in that "identifier"?)
Would it later be safe to change any fields in the dataset (even "identifier", "catalognumber", ...) if that data provider keeps those occurrenceIDs stable?
Thanks