Hi Jonathan (adding GBIF helpdesk to the CC)
This is just a quick answer, which I expect will result in follow-up questions.
In terms of citation, we use a DOI to identify the concept of a dataset, not a specific version, e.g. https://doi.org/10.15468/cup0nk. If you start deleting copies of the data (e.g. as a background housekeeping task), what will break are the links to the version downloads on the IPT pages, such as https://ipt.huh.harvard.edu/ipt/resource?r=huh_all_records&v=1.3. This may or may not be a problem for you.
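To make the concept-versus-version point concrete, here is a minimal sketch that looks the dataset up in the GBIF registry by its UUID (the one in the gbif.org link below) and prints its DOI. It relies only on the standard registry call GET https://api.gbif.org/v1/dataset/{key}; the exact values returned are not guaranteed here.

    import json
    import urllib.request

    # Dataset UUID taken from the gbif.org link below.
    DATASET_KEY = "4e4f97d2-4670-4b24-b982-261e0a450faf"

    # Standard GBIF registry lookup for a single dataset record.
    with urllib.request.urlopen(f"https://api.gbif.org/v1/dataset/{DATASET_KEY}") as resp:
        dataset = json.load(resp)

    # The "doi" field identifies the dataset as a whole (the concept),
    # not any one published version.
    print(dataset.get("title"))
    print(dataset.get("doi"))

Because the DOI identifies the dataset as a whole, it continues to resolve even after old archives are removed; it is only the per-version download links on the IPT resource page that stop working.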
I think others may already have contacted you with suggestions for improving the dataset titles in use, but if not, I would suggest considering properly formatted titles, as they appear in many places (https://www.gbif.org/dataset/4e4f97d2-4670-4b24-b982-261e0a450faf).
I hope this helps as a start,
Tim
From: IPT <ipt-bounces@lists.gbif.org> on behalf of "Kennedy, Jonathan" <jonathan_kennedy@harvard.edu>
Date: Monday, 18 February 2019 at 18.31
To: "ipt@lists.gbif.org" <ipt@lists.gbif.org>
Subject: [IPT] Daily feeds and archive history
Hi All,
I am finishing an upgrade to the Harvard University Herbaria IPT instance and have configured our feeds for daily auto-publish. The HUH has invested in a mass digitization workflow and we are currently creating ~20,000 new vascular records per month (with minimal data), so we do have new records on a daily basis. However, our DwC archives are fairly large (100 MB+), so we can't keep the daily archive history. I am looking for guidance on how GBIF dataset citation will work if we do not preserve each daily archive. It seems problematic if a version of our dataset is used and cited but cannot be reconstructed.
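As an illustration of the kind of background housekeeping task mentioned in the reply above, the following sketch prunes all but the newest few archives for a resource. The data-directory path and the dwca-*.zip file naming are assumptions for illustration only, not the IPT's documented on-disk layout, and (as noted above) deleting old versions will break the corresponding version download links on the IPT pages.

    import re
    from pathlib import Path

    KEEP = 7  # number of recent versions to retain, e.g. one week of daily publishes
    RESOURCE_DIR = Path("/srv/ipt-data/resources/huh_all_records")  # hypothetical path

    def version_of(path: Path) -> tuple:
        # Pull a version like "1.3" out of a hypothetical name such as "dwca-1.3.zip"
        # and turn it into a sortable (major, minor) tuple.
        m = re.search(r"(\d+)\.(\d+)", path.name)
        return (int(m.group(1)), int(m.group(2))) if m else (0, 0)

    archives = sorted(RESOURCE_DIR.glob("dwca-*.zip"), key=version_of)
    for old in archives[:-KEEP]:
        print(f"would remove {old}")  # swap in old.unlink() once the layout is confirmed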
Best regards,
Jonathan A. Kennedy
Director of Biodiversity Informatics
Harvard University Herbaria, Department of Organismic and Evolutionary Biology