[Ala-portal] DwC-A loading problems
Daniel Lins
daniel.lins at gmail.com
Fri May 9 05:59:13 CEST 2014
Thanks David,
We use the DwC term "occurrenceID" to identify the records. It's a
unique key.
However, when I reload a dataset to update some DwC terms of the records,
the system duplicates this data (keeps the old record and creates another
with changes).
For instance (update of locality).
Load 1 ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l
dataset.csv -b true)
{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...}
Process 1 (biocache$ process dr0)
Index 1 (biocache$ index dr0)
Load 2 (updated records and new records) ($ java -cp .:biocache.jar
au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true)
{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}
Process 2 (biocache$ process dr0)
Index 2 (biocache$ index dr0)
Results shown by ALA:
{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...},
{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}
But I expected:
{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}
Do I need to delete the existing data (delete-resource function) before the
reload? If not, what did I do wrong to cause this data duplication?
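
To make the expectation concrete, here is a minimal sketch of the keyed update I expected a reload to perform. It is only an illustration in Python, not biocache code: the `load` helper and the in-memory dict are hypothetical stand-ins for the loader and the record store, and I am assuming the key is the data resource plus the occurrenceID value.

```python
def load(store, data_resource, records, key_field="occurrenceID"):
    """Upsert records into store, keyed on (data resource, unique key value).

    Hypothetical helper for illustration only; it is not part of biocache.
    """
    for record in records:
        key = (data_resource, str(record[key_field]))
        # Merge into any existing record so a reload updates instead of duplicating.
        store.setdefault(key, {}).update(record)
    return store


store = {}

# Load 1
load(store, "dr0", [
    {"occurrenceID": 1, "municipality": "Sao Paulo"},
    {"occurrenceID": 2, "municipality": "Sao Paulo"},
])

# Load 2 (updated records and new records)
load(store, "dr0", [
    {"occurrenceID": 1, "municipality": "Rio de Janeiro"},
    {"occurrenceID": 2, "municipality": "Rio de Janeiro"},
    {"occurrenceID": 3, "municipality": "Sao Paulo"},
])

for key in sorted(store):
    print(key, store[key]["municipality"])
```

With this behaviour the second load leaves three records, with records 1 and 2 updated in place, which is the result I expected from ALA.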
Thanks!
Regards,
Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins at usp.br
daniel.lins at gmail.com
2014-05-07 0:46 GMT-03:00 <David.Martin at csiro.au>:
> Thanks Daniel. Natasha has now left the ALA.
>
> The uniqueness of records is determined by information stored in the
> collectory. See screenshot [1].
> By default, "catalogNumber" is used but you can change this to any number
> of fields that should be stable in the data.
> Using unstable fields for the ID isn't recommended (e.g. scientificName).
> To update the records, the process is to just re-load the dataset.
>
> Automatically loaded - this isn't in use and we may remove it from the UI in
> future iterations.
> Incremental Load - affects the sample/process/index steps to only run
> these against the new records. Load is always incremental based on the key
> field(s) but if the incremental load box isn’t checked it runs the
> sample/process/index steps against the whole data set. This can cause a
> large processing overhead when there’s a minor update to a large data set.
>
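
As I read the description of "Incremental Load" above, the flag only changes which records the sample/process/index steps run against. A rough sketch of that selection follows, in Python rather than biocache code. The function name is hypothetical, and treating "new records" as keys not seen before the load is my assumption; the message above does not say whether updated records also count as new.

```python
def keys_to_process(existing_keys, loaded_keys, incremental):
    """Return the record keys the sample/process/index steps would run against.

    Hypothetical helper: 'new' is assumed to mean a key that was not
    present before this load.
    """
    existing = set(existing_keys)
    loaded = set(loaded_keys)
    if incremental:
        # Incremental Load checked: only the newly added records.
        return sorted(loaded - existing)
    # Box not checked: the whole data set is sampled, processed and indexed.
    return sorted(existing | loaded)


before = ["1", "2"]          # keys already stored for the resource
reloaded = ["1", "2", "3"]   # keys present in the reloaded file

print(keys_to_process(before, reloaded, incremental=True))
print(keys_to_process(before, reloaded, incremental=False))
```

This is why a minor update to a large data set is much cheaper with the box checked: only a handful of keys go through the later steps.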
> Cheers
>
> Dave Martin
> ALA
>
> [1] http://bit.ly/1g72HFN
>
> ------------------------------
> *From:* Daniel Lins [daniel.lins at gmail.com]
> *Sent:* 05 May 2014 15:39
> *To:* Quimby, Natasha (CES, Black Mountain)
> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
> Mountain); Martin, Dave (CES, Black Mountain); Pedro Corrêa
> *Subject:* Re: [Ala-portal] DwC-A loading problems
>
> Hi Natasha,
>
> I managed to import the DwC-A file following the steps reported in the
> previous email. Thank you!
>
> However, when I tried to update some metadata of an occurrence record
> (already stored in the database), the system created a new record with
> the duplicated information. So I now have several records with the
> same occurrenceID (I did set the data resource configuration to use
> "OcurrenceID" to uniquely identify a record).
>
> How can I update existing records in the database? For instance, the
> location's metadata of an occurrence record stored in my database?
>
> I also would like to better understand the behavior of the properties
> "Automatically loaded" and "Incremental Load".
>
> Thanks!!
>
> Regards,
>
> Daniel Lins da Silva
> (Mobile) 55 11 96144-4050
> Research Center on Biodiversity and Computing (Biocomp)
> University of Sao Paulo, Brazil
> daniellins at usp.br
> daniel.lins at gmail.com
>
>
> 2014-04-28 3:52 GMT-03:00 Daniel Lins <daniel.lins at gmail.com>:
>
>> Thanks Natasha!
>>
>> I will try your recommendations. Once finished, I will contact you.
>>
>> Regards
>>
>> Daniel Lins da Silva
>> (Mobile) 55 11 96144-4050
>> Research Center on Biodiversity and Computing (Biocomp)
>> University of Sao Paulo, Brazil
>> daniellins at usp.br
>> daniel.lins at gmail.com
>>
>>
>>
>> 2014-04-28 3:26 GMT-03:00 <Natasha.Quimby at csiro.au>:
>>
>> Hi Daniel,
>>>
>>> When you specify a local DwC-A load, the archive needs to be unzipped.
>>> Try unzipping 2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip and then
>>> running the following:
>>> sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b
>>>
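
The unzip step suggested above can be scripted. The following is a minimal sketch in Python, assuming the archive should be extracted into a sibling directory named after the zip file without its extension, as in the command above. `unzip_archive` is a hypothetical helper, not part of biocache.

```python
import zipfile
from pathlib import Path


def unzip_archive(zip_path, target_dir=None):
    """Extract a DwC-A zip and return the directory it was extracted to.

    By default the target is the zip path with the '.zip' suffix removed.
    """
    zip_path = Path(zip_path)
    target = Path(target_dir) if target_dir else zip_path.with_suffix("")
    with zipfile.ZipFile(zip_path) as archive:
        # Extracts meta.xml and the data files into the target directory.
        archive.extractall(target)
    return target
```

The returned directory is what would then be passed to the loader with `-l`, in place of the path to the zip file.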
>>> If you configure the collectory to provide the DwC-A, the biocache
>>> automatically unzips the archive for you. You would need to configure dr7
>>> with the following connection parameters:
>>>
>>> "protocol":"DwCA",
>>> "termsForUniqueKey":["occurrenceID"],
>>> "url":"file:////data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"
>>>
>>> You could then load the resource by:
>>> sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7
>>>
>>> If you continue to have issues please let us know.
>>>
>>> Hope that this helps.
>>>
>>> Regards
>>> Natasha
>>>
>>> From: Daniel Lins <daniel.lins at gmail.com>
>>> Date: Monday, 28 April 2014 3:54 PM
>>> To: "ala-portal at lists.gbif.org" <ala-portal at lists.gbif.org>,
>>> "dos Remedios, Nick (CES, Black Mountain)" <Nick.Dosremedios at csiro.au>,
>>> "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>
>>> Subject: [Ala-portal] DwC-A loading problems
>>>
>>> Hi Nick and Dave,
>>>
>>> We are having some problems in Biocache during the upload of DwC-A
>>> files.
>>>
>>> As shown below, after running "au.org.ala.util.DwCALoader", our
>>> system returns the error message "Exception in thread "main"
>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>> delimiter".
>>>
>>> I ran tests using DwC-A files with tab-delimited text files
>>> and comma-delimited text files. In both cases the same error was
>>> generated.
>>>
>>> What causes these problems? (** CSV Loader works great)
>>>
>>> *tab-delimited file test*
>>>
>>> poliusp at poliusp-VirtualBox:~/dev/biocache$ sudo java -cp
>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
>>> 2014-04-28 01:44:02,837 INFO : [ConfigModule] - Loading configuration
>>> from /data/biocache/config/biocache-config.properties
>>> 2014-04-28 01:44:03,090 INFO : [ConfigModule] - Initialise SOLR
>>> 2014-04-28 01:44:03,103 INFO : [ConfigModule] - Initialise name
>>> matching indexes
>>> 2014-04-28 01:44:03,605 INFO : [ConfigModule] - Initialise persistence
>>> manager
>>> 2014-04-28 01:44:03,606 INFO : [ConfigModule] - Configure complete
>>> Loading archive /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
>>> for resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces
>>> false incremental false testing false
>>> Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
>>>     at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
>>>     at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
>>>     at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
>>>     at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
>>>     at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>     at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>     at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>     at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
>>>
>>>
>>> *comma-delimited file test*
>>>
>>> poliusp at poliusp-VirtualBox:~/dev/biocache$ sudo java -cp
>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l ./dwca-teste3.zip
>>> 2014-04-28 01:56:04,683 INFO : [ConfigModule] - Loading configuration
>>> from /data/biocache/config/biocache-config.properties
>>> 2014-04-28 01:56:04,940 INFO : [ConfigModule] - Initialise SOLR
>>> 2014-04-28 01:56:04,951 INFO : [ConfigModule] - Initialise name
>>> matching indexes
>>> 2014-04-28 01:56:05,437 INFO : [ConfigModule] - Initialise persistence
>>> manager
>>> 2014-04-28 01:56:05,438 INFO : [ConfigModule] - Configure complete
>>> Loading archive ./dwca-teste3.zip for resource dr7 with unique terms
>>> List(dwc:occurrenceID) stripping spaces false incremental false testing
>>> false
>>> Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
>>>     at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
>>>     at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
>>>     at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
>>>     at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
>>>     at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>     at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>     at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>     at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
>>>
>>>
>>> Thanks!
>>>
>>> Regards.
>>> --
>>> Daniel Lins da Silva
>>> (Mobile) 55 11 96144-4050
>>> Research Center on Biodiversity and Computing (Biocomp)
>>> University of Sao Paulo, Brazil
>>> daniellins at usp.br
>>> daniel.lins at gmail.com
>>>
>>>