Just wanted to check the consequences of the following dataset operation.
Say I have a dataset with 10 occurrences with occurrence ids 1-10. In my local database I now assign those 10 occurrences new identifiers a-j. If I create a new DwCA file for my data and crawl the new archive, my expectation is:
1. Old data with ids 1-10 is deleted from the GBIF index
2. New data with ids a-j is indexed
So, the end result is that the dataset has 10 occurrences. I'm asking because I know that in the past some datasets have changed identifiers, and this has resulted in records with old and new identifiers coexisting in the GBIF index, resulting in duplicated data.
Obviously it would be nice to have stable, unchanging identifiers for occurrences, but for the data set I'm working with, the creators have changed their minds between versions of the data :(
Regards,
Rod
Hi Rod
The old records are not deleted automatically, because a change of identifiers normally results from a mapping error rather than from a deliberate decision.
Today we trigger the deletion manually, but we do want to automate it, probably only for cases where the change seems genuine.
Cheers, Tim
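To make the outcome concrete, here is a toy model in Python of an index keyed by occurrence identifier. It is only a sketch, under the assumption that a crawl inserts or updates records by identifier and never deletes anything; the `crawl` function and the record layout are hypothetical and are not GBIF's actual indexing code.

```python
def crawl(index, archive):
    """Upsert every record of the archive into the index, keyed by its id.

    Assumption for illustration only: a crawl inserts or updates records,
    it never deletes records that are missing from the new archive.
    """
    for record in archive:
        index[record["id"]] = record
    return index


# Version 1 of the dataset: ten occurrences with ids 1-10.
version_1 = [{"id": str(n), "name": f"specimen {n}"} for n in range(1, 11)]
# Version 2: the same ten occurrences, re-identified as a-j.
version_2 = [{"id": letter, "name": f"specimen {n}"}
             for n, letter in enumerate("abcdefghij", start=1)]

index = crawl({}, version_1)
index = crawl(index, version_2)

# Ids still in the index that no longer appear in the latest archive.
orphaned = sorted(set(index) - {r["id"] for r in version_2}, key=int)
print(len(index), orphaned)
```

In this model the index ends up with 20 records rather than 10, and the ids 1-10 are left behind until someone deletes them explicitly.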
_______________________________________________ API-users mailing list API-users@lists.gbif.org http://lists.gbif.org/mailman/listinfo/api-users
Thanks Tim,
The specific example I'm working on is DNA barcoding data from BOLD. Their data dumps and web API differ in how they identify the same record (basically whether they include the suffix '.COI-5P' or not), which is deeply annoying. So I may have a case where I need to update ids for a large number of records, and want the other version of those records to be replaced. It sounds like I would need to ask you specifically to delete the old ones if I want this to happen.
Regards,
Rod
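As an illustration of the identifier mismatch described above, the following Python sketch normalises record ids by dropping the '.COI-5P' suffix and builds an old-to-new mapping for the records that would change. The function names and the sample ids are made up for the example; only the suffix itself comes from the message.

```python
MARKER_SUFFIX = ".COI-5P"  # suffix mentioned in the message above


def strip_marker(record_id, suffix=MARKER_SUFFIX):
    """Return the id without the marker suffix, if it carries one."""
    if record_id.endswith(suffix):
        return record_id[: -len(suffix)]
    return record_id


def id_mapping(old_ids, suffix=MARKER_SUFFIX):
    """Map each suffixed id to its suffix-free form; unchanged ids are skipped."""
    return {old: strip_marker(old, suffix)
            for old in old_ids if old.endswith(suffix)}


# Hypothetical ids, invented for this example.
sample_ids = ["EXAMPLE001-16.COI-5P", "EXAMPLE002-16", "EXAMPLE003-16.COI-5P"]
mapping = id_mapping(sample_ids)
print(mapping)
```

A mapping like this gives the list of old identifiers that would need to be removed from the index once the new archive is crawled.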
The easiest approach is actually for us to delete all records before indexing the new version. Let us know just before you are ready for the dataset to be reindexed; it's a trivial thing to do and would be complete in around 1 hour.
Thanks, Tim
OK, I’ll do that. Many thanks!
Regards,
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: Roderic.Page@glasgow.ac.uk Tel: +44 141 330 4778 Skype: rdmpage Facebook: http://www.facebook.com/rdmpage LinkedIn: http://uk.linkedin.com/in/rdmpage Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com ORCID: http://orcid.org/0000-0002-7101-9767 Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ ResearchGate: https://www.researchgate.net/profile/Roderic_Page
Hi
I'd like to take the opportunity to ask about the difference between two GBIF terms:
What is the difference between "occurrenceID" and "identifier"? Both have the same value in this dataset: http://www.gbif.org/occurrence/1291766512/verbatim http://api.gbif.org/v1/occurrence/1291766512
I see "occurrenceID" well explained here: http://gbif.blogspot.com.es/2014/04/ipt-v21.html http://rs.tdwg.org/dwc/terms/#occurrenceID
But I can't find an explanation for "identifier", which I think some institutions have incorrectly understood to be "occurrenceID". For example: http://www.gbif.org/occurrence/142907792/verbatim http://api.gbif.org/v1/occurrence/142907792
There is an "identifier" in that occurrence, but no "occurrenceID".
1) What exactly is the meaning of that "identifier"? Why is it not explained on the DwC terms page?
2) What happens if the data provider keeps all data UNCHANGED, but adds the "occurrenceID" which was missing? Would the next GBIF reindex keep the same number of records and add their occurrenceIDs (perhaps by looking at the triplet in that "identifier")? Would it later be safe to change any fields in the dataset (even "identifier", "catalognumber", ...) if the data provider keeps those occurrenceIDs stable?
Thanks
Hi David,
I hope you are well.
This stems from the community developed Darwin Core (DwC) standard building on top of the Dublin Core (DC) standard, rather than something GBIF have introduced. I’m afraid it is a little messy, but I will try and explain.
The original thinking by the DwC authors was that dwc:occurrenceID would be used to identify an occurrence in nature, while dc:identifier would be a digital record identifier. In practice almost everyone mapped their database key to occurrenceID, and thus the two became pretty much synonymous.
GBIF recommends that publishers put unique identifiers on records using dwc:occurrenceID. The IPT helps enforce this. This way the publisher can very easily control how many records they wish to appear in GBIF. We pass through dc:identifier, as you noticed, but simply treat it like any other non-interpreted field; we have never interpreted it, and it does not control how we identify records in the GBIF.org index. However, it makes sense to provide it, as there could well be systems other than GBIF indexing the data, and they may well recognise Dublin Core terms.
Historically, where publishers have not provided occurrenceID, they must have given data mapped with dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber, or the equivalent in the ABCD standard. If a provider adds occurrenceID and leaves those 3 fields the same, we will recognise this and update the records. If they were to change or remove one of those three fields, we would insert new records and the old ones would be deleted at some point. Deletion is a manual process, so that we can engage with publishers before running it.
I notice that the dataset you linked to was published in 2007, before dwc:occurrenceID existed. It is therefore using the dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber identifier strategy.
Please note that Darwin Core recommends concatenating the 3 fields to create a dwc:occurrenceID. Please be aware that with this approach, should someone choose to e.g. change the collection code, the occurrence record ID will also change, thus removing all linkability. If such changes are expected, then minting unique ids for records using e.g. UUIDs or similar would be a more robust longer-term solution, and in general we recommend targeting this.
We do recommend people strive to provide occurrenceID, even on older data. This simplifies things going forward.
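The trade-off between the two identifier strategies can be sketched in a few lines of Python. This is an illustration only: the `:` separator, the function names and the sample field values are assumptions made for the example, not something prescribed by Darwin Core or GBIF.

```python
import uuid


def triplet_id(rec, sep=":"):
    """Derive an id by concatenating the three codes (separator is an assumption)."""
    return sep.join([rec["institutionCode"], rec["collectionCode"], rec["catalogNumber"]])


def ensure_occurrence_id(rec):
    """Mint a random UUID once and keep it; never regenerate an existing id."""
    if not rec.get("occurrenceID"):
        rec["occurrenceID"] = str(uuid.uuid4())
    return rec["occurrenceID"]


# Hypothetical record, invented for this example.
record = {"institutionCode": "XYZ", "collectionCode": "Plants", "catalogNumber": "000123"}

minted_before = ensure_occurrence_id(record)
derived_before = triplet_id(record)

record["collectionCode"] = "Vascular"  # the collection is renamed later

derived_after = triplet_id(record)
minted_after = ensure_occurrence_id(record)

print(derived_before != derived_after)  # the concatenated id changed
print(minted_before == minted_after)    # the minted UUID did not
```

The minted value only stays stable if it is stored alongside the record in the source database and published unchanged in every later version of the archive.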
I hope this helps, but please feel free to ask me any questions around this.
This is off topic, but while you are reading please know that we expect stateOrProvince to be a filter on GBIF next week along with locality, protocol, license, organismID, publishingOrgKey (API only), crawlID (API only). I know you have interest in this functionality.
Best wishes, Tim
-- David García San León Herbario SANT Facultade de Farmacia - Laboratorio de Botánica Universidade de Santiago de Compostela 15782 - Galicia (Spain) http://www.usc.es/herbario
On Monday, 29 August 2016, Tim Robertson trobertson@gbif.org wrote:
(...) I’m afraid it is a little messy, but I will try and explain.
The original thinking by the DwC authors was that *dwc:occurrenceID* would be used to identify an occurrence in nature, while *dc:identifier* would be a digital record identifier. (...) If a provider adds occurrenceID and leaves those 3 fields the same, we will recognise this and update the records. If they were to remove one of those triplets, we would insert new records and the old ones would be deleted at some point. It is a manual process to allow us to engage with publishers before running deletions.
Thanks Tim. I understood your explanation about occurrenceID / identifier differences.
Can you elaborate a bit more on what happens if a provider using triplets decides to add occurrenceID? You said the GBIF reindex (1st reindex) will recognize them and update the records. What if LATER ON (after the 1st reindex) a record is changed so that occurrenceID STAYS STABLE but the triplet value is deleted or modified? Shouldn't the next GBIF reindex (2nd and later) keep using the already existing occurrenceID? So, the record should be updated, not inserted.
From your comment I understood it would be inserted, but I guess you meant that this happens if both changes (occurrenceID added + triplet changed) are made at the same time (same GBIF reindex). Would this two-step strategy make a difference, so that we avoid having to ask you to delete records?
And yes, I think it makes sense to use some kind of UUIDs for occurrenceID values. I understand these only need to be unique within the dataset, but it's better to make them globally unique for linking purposes, right?
Thanks for your help, and also for the upcoming GBIF filtering options.
David
Thanks David,
Once occurrenceIDs are established and indexed in GBIF.org, they take precedence, so subsequent changes will be updates. Therefore, if you foresee making changes and want to keep stable records in GBIF, it is important to first add occurrenceID without changing the triplet. Any subsequent change can then be made, even to the triplet.
All the best, Tim
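The two-step strategy can be simulated with a small toy index in Python. This is a sketch of the behaviour described in this thread, not GBIF's implementation: the `ToyIndex` class, its lookup order (occurrenceID first, then the triplet) and the sample values are assumptions made for illustration.

```python
TRIPLET_FIELDS = ("institutionCode", "collectionCode", "catalogNumber")


class ToyIndex:
    """Toy index: matches on a known occurrenceID first, then on the triplet."""

    def __init__(self):
        self.records = {}            # internal key -> latest record
        self.by_occurrence_id = {}   # occurrenceID -> internal key
        self.by_triplet = {}         # triplet tuple -> internal key

    def _triplet(self, rec):
        values = tuple(rec.get(field) for field in TRIPLET_FIELDS)
        return values if all(values) else None

    def crawl(self, archive):
        """Index an archive; return (inserted, updated) counts. Nothing is deleted."""
        inserted = updated = 0
        for rec in archive:
            occ_id = rec.get("occurrenceID")
            trip = self._triplet(rec)
            key = self.by_occurrence_id.get(occ_id) if occ_id else None
            if key is None and trip is not None:
                key = self.by_triplet.get(trip)
            if key is None:
                key = len(self.records) + 1
                inserted += 1
            else:
                updated += 1
            self.records[key] = rec
            if occ_id:
                self.by_occurrence_id[occ_id] = key
            if trip is not None:
                self.by_triplet[trip] = key
        return inserted, updated


# Hypothetical record versions, invented for this example.
base = {"institutionCode": "XYZ", "collectionCode": "Plants", "catalogNumber": "000123"}
with_id = dict(base, occurrenceID="example-occurrence-id-1")
changed = dict(with_id, collectionCode="Vascular")

# Two steps: add occurrenceID first, change the triplet in a later version.
two_step = ToyIndex()
two_step_results = [two_step.crawl([v]) for v in (base, with_id, changed)]

# One step: add occurrenceID and change the triplet in the same version.
one_step = ToyIndex()
one_step_results = [one_step.crawl([v]) for v in (base, changed)]

print(two_step_results, len(two_step.records))
print(one_step_results, len(one_step.records))
```

In this model the two-step route ends with a single record that was updated twice, while the one-step route leaves two records in the index, the old one needing manual deletion.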
-- David García San León (dixitalización / control de fondos) Herbario SANT Facultade de Farmacia - Laboratorio de Botánica Universidade de Santiago de Compostela 15782 - Galicia (Spain) http://www.usc.es/herbario Tel. +34 881815022 Fax +34 981594912 Skype: herbarium_sant Twitter: @SANT_Herbarium
participants (3): Herbario SANT, Roderic Page, Tim Robertson