February 2019 - IPT - lists.gbif.org

Re: [IPT] Daily feeds and archive history
by Kennedy, Jonathan 19 Feb '19


Thanks very much, this is helpful feedback.

On a related note, Harvard-IQSS created a platform called Dataverse (https://dataverse.org/about) around 2007 and one interesting element is that they published a method for hashing datasets. This is done for the purpose of creating a citation element that can be used to verify that you have downloaded the same data. Passing along in case this is of interest to the group: http://best-practices.dataverse.org/data-citation/

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology

From: Daniel Noesgaard <dnoesgaard(a)gbif.org>
Date: Tuesday, February 19, 2019 at 3:22 AM
To: Quentin Groom <quentin.groom(a)plantentuinmeise.be>, Tim Robertson <trobertson(a)gbif.org>
Cc: "Kennedy, Jonathan" <jonathan_kennedy(a)harvard.edu>, "ipt(a)lists.gbif.org list" <ipt(a)lists.gbif.org>, helpdesk <helpdesk(a)gbif.org>
Subject: Re: [IPT] Daily feeds and archive history

I might also add that every download from GBIF.org – be it a single dataset or an aggregate – is archived and given a unique, persistent DOI for citation. And that citations of downloads count against all the datasets that contributed to that download.

--
Daniel Noesgaard
Science Communications Coordinator
GBIF | Global Biodiversity Information Facility - Secretariat
Universitetsparken 15
DK-2100 Copenhagen, Denmark
E: dnoesgaard(a)gbif.org
W: www.gbif.org
T: +45 35 32 08 74

From: Quentin Groom <quentin.groom(a)plantentuinmeise.be>
Date: Tuesday, 19 February 2019 at 08.38
To: Tim Robertson <trobertson(a)gbif.org>
Cc: "Kennedy, Jonathan" <jonathan_kennedy(a)harvard.edu>, "ipt(a)lists.gbif.org list" <ipt(a)lists.gbif.org>, helpdesk <helpdesk(a)gbif.org>, Daniel Noesgaard <dnoesgaard(a)gbif.org>
Subject: Re: [IPT] Daily feeds and archive history

While it would be great to have versioned datasets I generally create a snapshot of the data used in a paper and archive this in Zenodo. This gives complete reproducibility without putting extra demands on the data providers. I do however need to cite the source and the snapshot.

Regards
Quentin

On Mon, 18 Feb 2019, 17:45 Tim Robertson <trobertson(a)gbif.org> wrote:

Hi Jonathan (adding GBIF helpdesk to the CC)

This is just a quick answer which I expect will result in follow up questions.

In terms of citation, we use a DOI to identify the concept of a dataset, not the specific version. E.g. https://doi.org/10.15468/cup0nk

If you start deleting copies of data (e.g. a background housekeeping task) what will break are links to the downloads in the IPT pages. https://ipt.huh.harvard.edu/ipt/resource?r=huh_all_records&v=1.3 This may or may not be considered a problem for you.

I think others might have contacted you about suggestions for improving the dataset titles being used but if not I would suggest considering correctly formatted titles as they are used in many places (https://www.gbif.org/dataset/4e4f97d2-4670-4b24-b982-261e0a450faf).

I hope this helps as a start,
Tim

From: IPT <ipt-bounces(a)lists.gbif.org> on behalf of "Kennedy, Jonathan" <jonathan_kennedy(a)harvard.edu>
Date: Monday, 18 February 2019 at 18.31
To: "ipt(a)lists.gbif.org" <ipt(a)lists.gbif.org>
Subject: [IPT] Daily feeds and archive history

Hi All,

I am finishing an upgrade to the Harvard University Herbaria IPT instance and have configured our feeds for daily auto-publish. The HUH has invested in a mass digitization workflow and we are currently creating ~20,000 new vascular records per month (with minimal data), so we do have new records on a daily basis. However, our DwC archives are fairly large (100MB+), so we can’t keep the daily archive history.

I am looking for guidance on how it will work with GBIF dataset citation if we do not preserve each daily archive. It seems problematic if a version of our dataset is used and cited but cannot be reconstructed.

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology

_______________________________________________
IPT mailing list
IPT(a)lists.gbif.org
https://lists.gbif.org/mailman/listinfo/ipt
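
Editorial sketch: the idea raised at the top of this message, verifying that a downloaded dataset is the same one that was cited, can be approximated with an ordinary file checksum. The Python below is only an illustration under stated assumptions. It is not the Dataverse hashing method linked above, and nothing in this thread says that the IPT or GBIF publish such a digest. The function names and the idea of a "published digest" are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for archives of 100MB or more.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare a local file against a digest recorded with a citation.

    `published_hex` is a hypothetical value that a publisher would have to
    record alongside the cited version; it is an assumption of this sketch.
    """
    return sha256_of_file(path) == published_hex.strip().lower()
```

A byte-level checksum like this only confirms that two files are identical. It changes whenever the archive is re-zipped or rows are reordered, which is one reason a content-based fingerprint may be preferable for citation.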


Re: [IPT] Daily feeds and archive history
by Tim Robertson 19 Feb '19


Hi Jonathan (adding GBIF helpdesk to the CC)

This is just a quick answer which I expect will result in follow up questions.

In terms of citation, we use a DOI to identify the concept of a dataset, not the specific version. E.g. https://doi.org/10.15468/cup0nk

If you start deleting copies of data (e.g. a background housekeeping task) what will break are links to the downloads in the IPT pages. https://ipt.huh.harvard.edu/ipt/resource?r=huh_all_records&v=1.3 This may or may not be considered a problem for you.

I think others might have contacted you about suggestions for improving the dataset titles being used but if not I would suggest considering correctly formatted titles as they are used in many places (https://www.gbif.org/dataset/4e4f97d2-4670-4b24-b982-261e0a450faf)

I hope this helps as a start,
Tim

From: IPT <ipt-bounces(a)lists.gbif.org> on behalf of "Kennedy, Jonathan" <jonathan_kennedy(a)harvard.edu>
Date: Monday, 18 February 2019 at 18.31
To: "ipt(a)lists.gbif.org" <ipt(a)lists.gbif.org>
Subject: [IPT] Daily feeds and archive history

Hi All,

I am finishing an upgrade to the Harvard University Herbaria IPT instance and have configured our feeds for daily auto-publish. The HUH has invested in a mass digitization workflow and we are currently creating ~20,000 new vascular records per month (with minimal data), so we do have new records on a daily basis. However, our DwC archives are fairly large (100MB+), so we can’t keep the daily archive history.

I am looking for guidance on how it will work with GBIF dataset citation if we do not preserve each daily archive. It seems problematic if a version of our dataset is used and cited but cannot be reconstructed.

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology


Daily feeds and archive history
by Kennedy, Jonathan 18 Feb '19


Hi All,

I am finishing an upgrade to the Harvard University Herbaria IPT instance and have configured our feeds for daily auto-publish. The HUH has invested in a mass digitization workflow and we are currently creating ~20,000 new vascular records per month (with minimal data), so we do have new records on a daily basis. However, our DwC archives are fairly large (100MB+), so we can’t keep the daily archive history.

I am looking for guidance on how it will work with GBIF dataset citation if we do not preserve each daily archive. It seems problematic if a version of our dataset is used and cited but cannot be reconstructed.

Best regards,

Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology
