Re: [IPT] [EXTERNAL] Re: How does one upload large datasets to GBIF?

7 Jul 2020

Hi Annie,

With additional RAM allocated, the IPT can publish proportionally larger 
datasets.  However, this can be inefficient (or expensive in terms of 
RAM), especially if the dataset has extensions.

To construct the Darwin Core Archive (DWCA) outside an IPT you will need:

- data files.  It's a good idea to check them for common errors -- 
incorrect number of columns, duplicate occurrenceIds and so on (the IPT 
does several checks like this).

- a meta.xml data description file, linking columns to Darwin Core 
terms. This can be written by hand, using various programming languages, 
or (often easiest if the process isn't to be repeated) by using an IPT 
to make a suitable mapping and extracting the resulting file.

- an eml.xml metadata file, describing the dataset.  The same applies 
here -- the IPT is useful for providing a UI to write this metadata, 
especially if all 8 datasets are similar.
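
The data file checks mentioned in the first item can be scripted before the archive is built. Below is a minimal sketch in Python, assuming a UTF-8, tab-delimited file with a header row. The function name check_delimited and the sample rows are made up for illustration, and this is not the IPT's own validation code.

```python
import csv
import io
from collections import Counter


def check_delimited(handle, id_column="occurrenceID", delimiter="\t"):
    """Return a list of problems found in a delimited text file with a header row."""
    # QUOTE_NONE treats quote characters as ordinary data, as in plain tab-delimited exports.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if id_column not in header:
        return [f"no {id_column!r} column in header"]

    id_index = header.index(id_column)
    problems = []
    seen = Counter()
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        seen[row[id_index]] += 1

    # Report every identifier that occurs more than once.
    for value, count in seen.items():
        if count > 1:
            problems.append(f"{id_column} {value!r} appears {count} times")
    return problems


if __name__ == "__main__":
    sample = io.StringIO(
        "occurrenceID\tscientificName\n1\tPuma concolor\n1\tLynx rufus\n2\n"
    )
    for problem in check_delimited(sample):
        print(problem)
```

For a real file, pass open(path, encoding="utf-8", newline="") instead of the in-memory sample. The identifiers are all held in memory, which should be workable for tens of millions of rows on a reasonably sized machine but is worth keeping in mind.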

Once the DWCA exists, it should be copied to a webserver.
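
Before that copy, the three pieces have to be zipped together. The following is a rough sketch of writing a meta.xml by script and assembling the archive, under several assumptions: a single tab-delimited core file named occurrence.txt with one header line, every column header being the name of a term in the Darwin Core namespace, and an eml.xml that already exists (for example, extracted from an IPT). The function names are hypothetical.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr

DWC = "http://rs.tdwg.org/dwc/terms/"


def build_meta_xml(columns, data_file="occurrence.txt", id_column="occurrenceID"):
    """Return a meta.xml string mapping each column of the core file to a term.

    Assumption: each column name is a term in the Darwin Core namespace.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">',
        # "\\t" and "\\n" put a literal backslash escape into the XML attribute.
        '  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"'
        ' fieldsEnclosedBy="" ignoreHeaderLines="1"'
        f' rowType={quoteattr(DWC + "Occurrence")}>',
        f"    <files><location>{escape(data_file)}</location></files>",
        f'    <id index="{columns.index(id_column)}"/>',
    ]
    for index, name in enumerate(columns):
        lines.append(f'    <field index="{index}" term={quoteattr(DWC + name)}/>')
    lines += ["  </core>", "</archive>"]
    return "\n".join(lines) + "\n"


def write_archive(target, data_bytes, meta_xml, eml_xml, data_file="occurrence.txt"):
    """Write the data file, meta.xml and eml.xml into a zip at target (path or file object)."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(data_file, data_bytes)
        archive.writestr("meta.xml", meta_xml)
        archive.writestr("eml.xml", eml_xml)


if __name__ == "__main__":
    meta = build_meta_xml(["occurrenceID", "scientificName", "eventDate"])
    print(meta)
```

For files of the size discussed here, ZipFile.write(path, arcname=...) is a better fit than writestr, since it reads the data file from disk rather than holding it in memory. Datasets with extensions would also need extension elements in meta.xml, which this sketch does not cover.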

Note that using the registry API is not strictly necessary, and a 
publisher with a small, unchanging number of datasets outside the IPT 
need not use it.  They can simply give the helpdesk a URL for each 
dataset's DWCA file, and update the DWCA files at those URLs as necessary.

Using the API is useful for registering additional datasets, making changes 
(e.g. changing the URL) to the existing 8 datasets, or prompting GBIF to 
reprocess a dataset.  To use the API the technical team should create a 
suitable username (e.g. "usgs" or "bison") on both gbif.org and 
gbif-uat.org.  The latter is our test system.  They should then contact 
helpdesk@gbif.org to ask for permission for that account to make changes 
under the USGS 
<https://www.gbif.org/publisher/c3ad790a-d426-4ac1-8e32-da61f81f0117> 
publisher, or whichever publisher(s) are appropriate.  At first, this 
permission will apply only on the test system.

It's then possible to register a new dataset under that publisher, 
following the example here: 
https://github.com/gbif/registry/tree/master/registry-examples/src/test/scri... 
and see the result.
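
As a rough illustration of what such a registration involves, here is a Python sketch that builds (and optionally sends) two requests: one creating the dataset, one attaching the DWCA URL as an endpoint. Everything specific in it is an assumption to be checked against the linked example: the base URL https://api.gbif-uat.org/v1, the JSON field names, the need for an installation key, and the license value, which is only a placeholder. The function names are hypothetical.

```python
import base64
import json
import urllib.request

UAT_API = "https://api.gbif-uat.org/v1"  # assumed base URL of the test system


def _post(url, payload, username, password):
    """Build (but do not send) an authenticated JSON POST request."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


def dataset_request(publisher_key, installation_key, title, username, password,
                    api=UAT_API):
    """Request that creates a dataset under a publisher (field names are assumptions)."""
    payload = {
        "publishingOrganizationKey": publisher_key,
        "installationKey": installation_key,
        "type": "OCCURRENCE",
        "title": title,
        "language": "eng",
        # Placeholder only: use the license that actually applies to the data.
        "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
    }
    return _post(f"{api}/dataset", payload, username, password)


def endpoint_request(dataset_key, archive_url, username, password, api=UAT_API):
    """Request that tells the registry where the dataset's DWCA file lives."""
    payload = {"type": "DWC_ARCHIVE", "url": archive_url}
    return _post(f"{api}/dataset/{dataset_key}/endpoint", payload, username, password)


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the requests separately from sending them makes it easy to inspect the payloads before anything touches the registry. Under these assumptions, send(dataset_request(...)) would return the new dataset's key, which is then passed to endpoint_request.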

For general questions on this, the GBIF API mailing list is probably 
most appropriate: https://lists.gbif.org/mailman/listinfo/api-users

If you have problems or errors with a specific dataset, 
helpdesk@gbif.org will be the best contact.  (They also read both 
mailing lists.)

Cheers

Matt

On 07/07/2020 17:48, Simpson, Annie wrote:
...
Thank you, Laura, for your replies.
The datasets have been exported from databases and cleaned. They are 
generally UTF-8 tab delimited files. So it seems that the GBIF 
Registry API would be the correct solution.
We currently have 8 of these large datasets, only 2 of which would not 
be updated in the future. Do you have names of GBIF Product Team 
Members whom my technical team should contact to begin this process? 
Is there "how to" documentation you can point me to that they should 
read first?
Annie
------------------------------------------------------------------------
*From:* Laura Anne Russell <larussell@gbif.org>
*Sent:* Tuesday, July 7, 2020 11:17 AM
*To:* Simpson, Annie <asimpson@usgs.gov>; ipt@lists.gbif.org 
<ipt@lists.gbif.org>
*Subject:* [EXTERNAL] Re: [IPT] How does one upload large datasets to 
GBIF?
I could also mention that it is possible to script the creation of the 
Darwin Core Archives and then use the GBIF Registry API for the 
connections with GBIF. Symbiota, PlutoF and some others are 
successfully doing this. It does require some initial coordination 
with our Product Team on how to set up and coordinate the registration 
process and potentially with our Informatics Team.
Best,
Laura
Laura Anne Russell
Programme Officer for Participation and Engagement
Global Biodiversity Information Facility (GBIF) Secretariat
larussell@gbif.org (email)
laura.anne.russell (Skype)
@pagodarose (Twitter)
#CiteTheDOI @GBIF
https://www.gbif.org/
+45 35 33 35 51 (office, direct line)
GBIF
Universitetsparken 15
DK-2100 Copenhagen Ø
Denmark
*From: *IPT <ipt-bounces@lists.gbif.org> on behalf of "Simpson, Annie" 
<asimpson@usgs.gov>
*Date: *Tuesday, 7 July 2020 at 16.48
*To: *"ipt@lists.gbif.org" <ipt@lists.gbif.org>
*Subject: *[IPT] How does one upload large datasets to GBIF?
Colleagues:
What is the easiest or most popular way to send large datasets to 
GBIF, ones that are too large for the IPT software (I think that is 
more than 100 MB zipped, 10+ million records)? Does one modify their IPT 
instance? How? Or is there another process that is preferred?
We currently have IPT Version 2.3.6-r3985b6a installed and plan to 
upgrade to 2.4.0 soon.
A technical answer is what I seek (on behalf of our technical team).
Again my apologies if the answer to my question is easily found and 
I'm just not finding it.
Annie Simpson, BISON product owner
(she/her/hers)
BioFoundational Data Team
Science Analytics & Synthesis Program
U.S. Geological Survey
12201 Sunrise Valley Dr. Mailstop 302
Reston VA   20192
asimpson@usgs.gov
+1 703-648-4281
https://orcid.org/0000-0001-8338-5134
https://bison.usgs.gov
_______________________________________________
IPT mailing list
IPT@lists.gbif.org
https://lists.gbif.org/mailman/listinfo/ipt

Matthew Blissett