[Ala-portal] DwC-A loading problems

David.Martin at csiro.au David.Martin at csiro.au
Fri May 9 06:03:09 CEST 2014


Thanks Daniel.

The unique key is specified in the collectory.
For the dr0 resource, what are the settings in your instance of the collectory?
If it's possible, please send through a URL to your collectory instance.

Cheers

Dave

________________________________
From: Daniel Lins [daniel.lins at gmail.com]
Sent: 09 May 2014 13:59
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
Subject: Re: [Ala-portal] DwC-A loading problems

Thanks David,

We use the DwC term "occurrenceID" to identify the records. It's a unique key.

However, when I reload a dataset to update some DwC terms of the records, the system duplicates this data (keeps the old record and creates another with changes).

For instance (update of locality).

Load 1 ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l dataset.csv -b true)

{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...}

Process 1 (biocache$ process dr0)
Index 1 (biocache$ index dr0)

Load 2 (updated records and new records) ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true)

{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

Process 2 (biocache$ process dr0)
Index 2 (biocache$ index dr0)

Results shown by ALA:

{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...},
{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

But I expected:

{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

Do I need to delete the existing data (delete-resource function) before the reload? If not, what did I do wrong to cause this data duplication?
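
One way to confirm the duplication independently of the portal UI is to group the returned records by occurrenceID and list the IDs that occur more than once. The following is a minimal Python sketch only: the record dictionaries mirror the results listed above (other fields omitted), and `find_duplicate_ids` is a hypothetical helper, not part of biocache.

```python
from collections import defaultdict

def find_duplicate_ids(records, key_field="occurrenceID"):
    """Return {key: [records]} for every key value that appears more than once."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key_field]].append(record)
    return {key: recs for key, recs in grouped.items() if len(recs) > 1}

# Records as shown by ALA after the second load (other fields omitted).
results = [
    {"occurrenceID": "1", "municipality": "Sao Paulo"},
    {"occurrenceID": "2", "municipality": "Sao Paulo"},
    {"occurrenceID": "1", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "2", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "3", "municipality": "Sao Paulo"},
]

duplicates = find_duplicate_ids(results)
for occurrence_id, recs in sorted(duplicates.items()):
    print(occurrence_id, [r["municipality"] for r in recs])
```

With the data above, IDs 1 and 2 are reported, each with both the old and the new municipality.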

Thanks!


Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins at usp.br
daniel.lins at gmail.com





2014-05-07 0:46 GMT-03:00 <David.Martin at csiro.au>:
Thanks Daniel. Natasha has now left the ALA.

The uniqueness of records is determined by information stored in the collectory. See screenshot [1].
By default, "catalogNumber" is used but you can change this to any number of fields that should be stable in the data.
Using unstable fields for the ID isn't recommended (e.g. scientificName).  To update the records, the process is to just re-load the dataset.
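
To illustrate why a stable key lets a re-load update records in place, here is a minimal Python sketch of a keyed load. It is an illustration only and not the biocache implementation: the in-memory `store`, `build_unique_key` and `load` are hypothetical names, joining the data resource ID and the key field values with `|` is an assumption, and the sample values are borrowed from the example elsewhere in this thread.

```python
def build_unique_key(resource_uid, record, key_terms):
    """Join the resource ID and the configured key field values into one key."""
    return "|".join([resource_uid] + [str(record[term]) for term in key_terms])

def load(store, resource_uid, records, key_terms):
    """Insert or overwrite records in `store`, keyed on the configured terms."""
    for record in records:
        store[build_unique_key(resource_uid, record, key_terms)] = record
    return store

first = [{"occurrenceID": 1, "municipality": "Sao Paulo"},
         {"occurrenceID": 2, "municipality": "Sao Paulo"}]
second = [{"occurrenceID": 1, "municipality": "Rio de Janeiro"},
          {"occurrenceID": 2, "municipality": "Rio de Janeiro"},
          {"occurrenceID": 3, "municipality": "Sao Paulo"}]

# Stable key: the second load overwrites records 1 and 2 and adds record 3.
store = {}
load(store, "dr0", first, ["occurrenceID"])
load(store, "dr0", second, ["occurrenceID"])
print(len(store))  # 3

# Key that includes a field whose value changed: old rows are not matched.
unstable = {}
load(unstable, "dr0", first, ["occurrenceID", "municipality"])
load(unstable, "dr0", second, ["occurrenceID", "municipality"])
print(len(unstable))  # 5
```

The second case only demonstrates the general recommendation against unstable key fields; it is not a diagnosis of any particular installation.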

Automatically loaded - this isn't in use and we may remove it from the UI in future iterations.
Incremental Load - restricts the sample/process/index steps to the new records only. Load is always incremental based on the key field(s), but if the incremental load box isn't checked, the sample/process/index steps run against the whole data set. This can cause a large processing overhead when there's a minor update to a large data set.
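
The difference between the two modes can be sketched as a choice of which record keys are handed to the sample/process/index steps. This is a hedged Python illustration of the description above, not biocache code: `keys_to_process` is a hypothetical function, the key strings are made up, and treating "new records" as "keys that were absent before this load" is an assumption.

```python
def keys_to_process(existing_keys, loaded_keys, incremental):
    """Pick the record keys for the sample/process/index steps."""
    if incremental:
        # Assumption: "new records" means keys that were absent before this load.
        return set(loaded_keys) - set(existing_keys)
    # Box unchecked: the steps run against the whole data set.
    return set(existing_keys) | set(loaded_keys)

existing = {"dr0|1", "dr0|2"}
loaded = {"dr0|1", "dr0|2", "dr0|3"}

print(sorted(keys_to_process(existing, loaded, incremental=True)))   # ['dr0|3']
print(sorted(keys_to_process(existing, loaded, incremental=False)))  # ['dr0|1', 'dr0|2', 'dr0|3']
```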

Cheers

Dave Martin
ALA

[1] http://bit.ly/1g72HFN

________________________________
From: Daniel Lins [daniel.lins at gmail.com]
Sent: 05 May 2014 15:39
To: Quimby, Natasha (CES, Black Mountain)
Cc: ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Martin, Dave (CES, Black Mountain); Pedro Corrêa
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Natasha,

I managed to import the DwC-A file following the steps reported in the previous email. Thank you!

However, when I tried to update some metadata of an occurrence record (already stored in the database), the system created a new record with the duplicated information, so I now have several records with the same occurrenceID (I set the data resource configuration to use "OcurrenceID" to uniquely identify a record).

How can I update existing records in the database? For instance, the location metadata of an occurrence record stored in my database?

I also would like to better understand the behavior of the properties "Automatically loaded" and "Incremental Load".

Thanks!!

Regards,

Daniel Lins da Silva


2014-04-28 3:52 GMT-03:00 Daniel Lins <daniel.lins at gmail.com>:
Thanks Natasha!

I will try your recommendations. Once finished, I will contact you.

Regards

Daniel Lins da Silva



2014-04-28 3:26 GMT-03:00 <Natasha.Quimby at csiro.au>:

Hi Daniel,

When you specify a local DwC-A load, the archive needs to be unzipped first. Try unzipping 2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip and then running the following:
sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b
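
The unzip step can also be scripted. A minimal Python sketch using the standard `zipfile` module follows; `unzip_archive` is a hypothetical helper, and extracting into a directory named after the archive (without `.zip`) is an assumption chosen to match the path passed to `-l` in the command above.

```python
import zipfile
from pathlib import Path

def unzip_archive(zip_path):
    """Extract a DwC-A zip into a sibling directory named after the archive."""
    zip_path = Path(zip_path)
    target_dir = zip_path.with_suffix("")  # strips the trailing ".zip"
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

Calling `unzip_archive` on the `.zip` path returns the extracted directory, which is then the value to pass to the loader's `-l` option.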

If you configure the collectory to provide the DwC-A, the biocache automatically unzips the archive for you. You would need to configure dr7 with the following connection parameters:

"protocol":"DwCA",
"termsForUniqueKey":["occurrenceID"],
"url":"file:////data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"

You could then load the resource by:
sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7
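
For reference, the three connection parameters listed above can be written out as one well-formed JSON object. The Python sketch below only assembles and prints that object; whether the collectory stores or accepts the parameters in exactly this shape is an assumption, and `connection_parameters` is a hypothetical variable name.

```python
import json

# Values copied from the connection parameters listed above.
connection_parameters = {
    "protocol": "DwCA",
    "termsForUniqueKey": ["occurrenceID"],
    "url": "file:////data/collectory/upload/1398658607824/"
           "2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip",
}

print(json.dumps(connection_parameters, indent=2))
```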

If you continue to have issues please let us know.

Hope that this helps.

Regards
Natasha

From: Daniel Lins <daniel.lins at gmail.com>
Date: Monday, 28 April 2014 3:54 PM
To: "ala-portal at lists.gbif.org" <ala-portal at lists.gbif.org>, "dos Remedios, Nick (CES, Black Mountain)" <Nick.Dosremedios at csiro.au>, "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>
Subject: [Ala-portal] DwC-A loading problems

Hi Nick and Dave,

We are having some problems in Biocache during the upload of DwC-A files.

As shown below, after running "au.org.ala.util.DwCALoader", our system returns the error message "Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter".

I ran tests using DwC-A files containing tab-delimited text files and comma-delimited text files. In both cases the same error was generated.

What causes these problems? (Note: the CSV loader works fine.)

tab-delimited file test

poliusp@poliusp-VirtualBox:~/dev/biocache$ sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
2014-04-28 01:44:02,837 INFO : [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2014-04-28 01:44:03,090 INFO : [ConfigModule] - Initialise SOLR
2014-04-28 01:44:03,103 INFO : [ConfigModule] - Initialise name matching indexes
2014-04-28 01:44:03,605 INFO : [ConfigModule] - Initialise persistence manager
2014-04-28 01:44:03,606 INFO : [ConfigModule] - Configure complete
Loading archive /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip for resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces false incremental false testing false
Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
        at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
        at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
        at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
        at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
        at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
        at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
        at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
        at au.org.ala.util.DwCALoader.main(DwCALoader.scala)


comma-delimited file test

poliusp@poliusp-VirtualBox:~/dev/biocache$ sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l ./dwca-teste3.zip
2014-04-28 01:56:04,683 INFO : [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2014-04-28 01:56:04,940 INFO : [ConfigModule] - Initialise SOLR
2014-04-28 01:56:04,951 INFO : [ConfigModule] - Initialise name matching indexes
2014-04-28 01:56:05,437 INFO : [ConfigModule] - Initialise persistence manager
2014-04-28 01:56:05,438 INFO : [ConfigModule] - Configure complete
Loading archive ./dwca-teste3.zip for resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces false incremental false testing false
Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
        at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
        at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
        at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
        at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
        at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
        at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
        at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
        at au.org.ala.util.DwCALoader.main(DwCALoader.scala)


Thanks!

Regards.
--
Daniel Lins da Silva
