[Ala-portal] DwC-A loading problems

Daniel Lins daniel.lins at gmail.com
Fri May 9 06:27:45 CEST 2014


David,

The dr0 configuration:

https://www.dropbox.com/s/lsy11jadwmyghjj/collectoryConfig1.png

Sorry, but this server doesn't have external access yet.




2014-05-09 1:06 GMT-03:00 <David.Martin at csiro.au>:

>  As an example of what it should look like, see:
>
> http://ala-demo.gbif.org/collectory/dataResource/edit/dr8?page=contribution
>
>
>  ------------------------------
> *From:* Daniel Lins [daniel.lins at gmail.com]
>
> *Sent:* 09 May 2014 13:59
> *To:* Martin, Dave (CES, Black Mountain)
> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
> Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
> *Subject:* Re: [Ala-portal] DwC-A loading problems
>
>   Thanks David,
>
>  We use the DwC term "occurrenceID" to identify the records. It's a
> unique key.
>
>  However, when I reload a dataset to update some DwC terms of the
> records, the system duplicates this data (keeps the old record and creates
> another with changes).
>
>  For instance, an update of the locality field:
>
>  Load 1 ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l
> dataset.csv -b true)
>
>  {OccurrenceID: 1, municipality: Sao Paulo, ...},
> {OccurrenceID: 2, municipality: Sao Paulo, ...}
>
>  Process 1 (biocache$ process dr0)
> Index 1 (biocache$ index dr0)
>
>  Load 2 (updated records and new records) ($ java -cp .:biocache.jar
> au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true)
>
>  {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
> {OccurrenceID: 2, municipality: Rio de Janeiro, ...},
>  {OccurrenceID: 3, municipality: Sao Paulo, ...}
>
>  Process 2 (biocache$ process dr0)
> Index 2 (biocache$ index dr0)
>
>  Results shown by ALA:
>
>  {OccurrenceID: 1, municipality: Sao Paulo, ...},
> {OccurrenceID: 2, municipality: Sao Paulo, ...},
>   {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
> {OccurrenceID: 2, municipality: Rio de Janeiro, ...}
>  {OccurrenceID: 3, municipality: Sao Paulo, ...}
>
>  But I expected:
>
>  {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
>  {OccurrenceID: 2, municipality: Rio de Janeiro, ...}
>  {OccurrenceID: 3, municipality: Sao Paulo, ...}
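>
> The expected upsert behaviour above can be sketched in Python (an
> illustrative model only, not biocache's actual storage code): records that
> share the configured unique key should overwrite the stored version
> instead of accumulating duplicates.

```python
# Hypothetical sketch of upsert-by-key loading: records sharing an
# OccurrenceID replace the stored version rather than creating duplicates.
# All names here are illustrative, not biocache API.

def load(store, records, key="OccurrenceID"):
    """Upsert each record into the store, keyed by the unique-key field."""
    for rec in records:
        store[rec[key]] = rec  # same key -> overwrite, not duplicate
    return store

store = {}
# Load 1
load(store, [
    {"OccurrenceID": 1, "municipality": "Sao Paulo"},
    {"OccurrenceID": 2, "municipality": "Sao Paulo"},
])
# Load 2: two updated records plus one new record
load(store, [
    {"OccurrenceID": 1, "municipality": "Rio de Janeiro"},
    {"OccurrenceID": 2, "municipality": "Rio de Janeiro"},
    {"OccurrenceID": 3, "municipality": "Sao Paulo"},
])

print(sorted((k, v["municipality"]) for k, v in store.items()))
# -> [(1, 'Rio de Janeiro'), (2, 'Rio de Janeiro'), (3, 'Sao Paulo')]
```

> Under this model the second load leaves exactly three records, matching
> the expected results above.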
>
>  Do I need to delete the existing data (via the delete-resource function)
> before the reload? If not, what did I do wrong to cause this duplication?
>
>  Thanks!
>
>
>  Regards,
>
>   Daniel Lins da Silva
>  (Mobile) 55 11 96144-4050
>  Research Center on Biodiversity and Computing (Biocomp)
> University of Sao Paulo, Brazil
>  daniellins at usp.br
>  daniel.lins at gmail.com
>
>
>
>
>
> 2014-05-07 0:46 GMT-03:00 <David.Martin at csiro.au>:
>
>>   Thanks Daniel. Natasha has now left the ALA.
>>
>>  The uniqueness of records is determined by information stored in the
>> collectory. See screenshot [1].
>> By default, "catalogNumber" is used but you can change this to any number
>> of fields that should be stable in the data.
>> Using unstable fields for the ID isn't recommended (e.g. scientificName).
>>  To update the records, the process is to just re-load the dataset.
>>
>>  Automatically loaded - this isn't in use and we may remove it from the
>> UI in future iterations.
>> Incremental Load - affects the sample/process/index steps so that they
>> only run against the new records. Load is always incremental based on the
>> key field(s), but if the incremental load box isn't checked, the
>> sample/process/index steps run against the whole data set. This can cause
>> a large processing overhead when there's a minor update to a large data set.
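>>
>> As a rough sketch of the idea (hypothetical Python, not the actual
>> biocache implementation): record identity can be modelled as the data
>> resource UID joined with the configured key field values, which is why
>> those fields must be stable across re-loads.

```python
# Minimal sketch of stable-key record identity (an assumption about the
# mechanism, not biocache's real code): a record's identity is derived from
# the data resource UID plus the configured unique-key field(s), so a
# re-load with the same key values updates rather than duplicates.

def record_key(data_resource_uid, record, key_fields):
    """Build a stable identifier from the configured unique-key fields."""
    parts = [str(record[f]).strip() for f in key_fields]
    return "|".join([data_resource_uid] + parts)

rec_v1 = {"catalogNumber": "B-1021", "scientificName": "Tapirus terrestris"}
rec_v2 = {"catalogNumber": "B-1021", "scientificName": "T. terrestris"}

# Keyed on the stable catalogNumber, both versions map to the same record.
assert record_key("dr0", rec_v1, ["catalogNumber"]) == \
       record_key("dr0", rec_v2, ["catalogNumber"])

# Keyed on an unstable field (scientificName), the versions split in two -
# which is why unstable fields are discouraged for the ID.
assert record_key("dr0", rec_v1, ["scientificName"]) != \
       record_key("dr0", rec_v2, ["scientificName"])
```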
>>
>>  Cheers
>>
>>  Dave Martin
>>  ALA
>>
>>  [1] http://bit.ly/1g72HFN
>>
>>  ------------------------------
>> *From:* Daniel Lins [daniel.lins at gmail.com]
>> *Sent:* 05 May 2014 15:39
>> *To:* Quimby, Natasha (CES, Black Mountain)
>> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
>> Mountain); Martin, Dave (CES, Black Mountain); Pedro Corrêa
>> *Subject:* Re: [Ala-portal] DwC-A loading problems
>>
>>     Hi Natasha,
>>
>>  I managed to import the DwC-A file following the steps reported in the
>> previous email. Thank you!
>>
>>  However, when I tried to update some metadata of an occurrence record
>> (already stored in the database), the system created a new record
>> containing the duplicated information. So I ended up with several records
>> sharing the same occurrenceID (I had configured the data resource to use
>> "occurrenceID" to uniquely identify a record).
>>
>>  How can I update existing records in the database - for instance, the
>> locality metadata of an occurrence record already stored there?
>>
>>  I would also like to better understand the behavior of the properties
>> "Automatically loaded" and "Incremental Load".
>>
>>  Thanks!!
>>
>>  Regards,
>>
>>   Daniel Lins da Silva
>>  (Mobile) 55 11 96144-4050
>>  Research Center on Biodiversity and Computing (Biocomp)
>> University of Sao Paulo, Brazil
>>  daniellins at usp.br
>>  daniel.lins at gmail.com
>>
>>
>> 2014-04-28 3:52 GMT-03:00 Daniel Lins <daniel.lins at gmail.com>:
>>
>>> Thanks Natasha!
>>>
>>>  I will try your recommendations. Once finished, I will contact you.
>>>
>>>  Regards
>>>
>>>  Daniel Lins da Silva
>>>  (Mobile) 55 11 96144-4050
>>>  Research Center on Biodiversity and Computing (Biocomp)
>>> University of Sao Paulo, Brazil
>>>  daniellins at usp.br
>>>  daniel.lins at gmail.com
>>>
>>>
>>>
>>>  2014-04-28 3:26 GMT-03:00 <Natasha.Quimby at csiro.au>:
>>>
>>>  Hi Daniel,
>>>>
>>>>  When you specify a local DwC-A load, the archive needs to be unzipped.
>>>> Try unzipping 2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip and then
>>>> running the following:
>>>>  sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b
>>>>
>>>>  If you configure the collectory to provide the DwC-A, the biocache
>>>> automatically unzips the archive for you. You would need to configure
>>>> dr7 with the following connection parameters:
>>>>
>>>>  "protocol":"DwCA"
>>>> "termsForUniqueKey":["occurrenceID"],
>>>> "url":"file:////data/collectory/upload/
>>>> 1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"
>>>>
>>>>  You could then load the resource with:
>>>>  sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7
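>>>>
>>>> A small sketch of assembling those connection parameters as JSON
>>>> (illustrative Python; the field names and URL come from the message
>>>> above, everything else is assumed):

```python
# Sketch of building the dr7 connection parameters as a JSON payload
# (illustrative only; this is not a real collectory API call).
import json

conn_params = {
    "protocol": "DwCA",
    "termsForUniqueKey": ["occurrenceID"],
    "url": ("file:////data/collectory/upload/"
            "1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"),
}

payload = json.dumps(conn_params, indent=2)
print(payload)

# Round-trip check: the unique-key list survives serialisation intact.
assert json.loads(payload)["termsForUniqueKey"] == ["occurrenceID"]
```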
>>>>
>>>>  If you continue to have issues please let us know.
>>>>
>>>>  Hope that this helps.
>>>>
>>>>  Regards
>>>> Natasha
>>>>
>>>>   From: Daniel Lins <daniel.lins at gmail.com>
>>>> Date: Monday, 28 April 2014 3:54 PM
>>>> To: "ala-portal at lists.gbif.org" <ala-portal at lists.gbif.org>,
>>>> "dos Remedios, Nick (CES, Black Mountain)" <Nick.Dosremedios at csiro.au>,
>>>> "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>
>>>> Subject: [Ala-portal] DwC-A loading problems
>>>>
>>>>   Hi Nick and Dave,
>>>>
>>>>  We are having some problems in Biocache during the upload of DwC-A
>>>> files.
>>>>
>>>>  As shown below, after running "au.org.ala.util.DwCALoader", our
>>>> system returns the error message "Exception in thread "main"
>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>> delimiter".
>>>>
>>>>  I ran tests using DwC-A files with tab-delimited text files and
>>>> comma-delimited text files. In both cases the error generated was the
>>>> same.
>>>>
>>>>  What causes this problem? (The CSV Loader works great.)
>>>>
>>>>  tab-delimited file test
>>>>
>>>>  poliusp at poliusp-VirtualBox:~/dev/biocache$ sudo java -cp
>>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
>>>> 2014-04-28 01:44:02,837 INFO : [ConfigModule] - Loading configuration
>>>> from /data/biocache/config/biocache-config.properties
>>>> 2014-04-28 01:44:03,090 INFO : [ConfigModule] - Initialise SOLR
>>>> 2014-04-28 01:44:03,103 INFO : [ConfigModule] - Initialise name
>>>> matching indexes
>>>> 2014-04-28 01:44:03,605 INFO : [ConfigModule] - Initialise persistence
>>>> manager
>>>> 2014-04-28 01:44:03,606 INFO : [ConfigModule] - Configure complete
>>>> Loading archive /data/collectory
>>>> /upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip for
>>>> resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces
>>>> false incremental false testing false
>>>> Exception in thread "main"
>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>> delimiter
>>>>         at org.gbif.file.CSVReaderFactory.buildArchiveFile(
>>>> CSVReaderFactory.java:129)
>>>>         at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:
>>>> 46)
>>>>         at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(
>>>> ArchiveFactory.java:344)
>>>>         at org.gbif.dwc.text.ArchiveFactory.openArchive(
>>>> ArchiveFactory.java:289)
>>>>         at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>>         at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>>         at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>>         at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
>>>>
>>>>
>>>>  comma-delimited file test
>>>>
>>>> poliusp at poliusp-VirtualBox:~/dev/biocache$ sudo java -cp
>>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l ./dwca-teste3.zip
>>>> 2014-04-28 01:56:04,683 INFO : [ConfigModule] - Loading configuration
>>>> from /data/biocache/config/biocache-config.properties
>>>> 2014-04-28 01:56:04,940 INFO : [ConfigModule] - Initialise SOLR
>>>> 2014-04-28 01:56:04,951 INFO : [ConfigModule] - Initialise name
>>>> matching indexes
>>>> 2014-04-28 01:56:05,437 INFO : [ConfigModule] - Initialise persistence
>>>> manager
>>>> 2014-04-28 01:56:05,438 INFO : [ConfigModule] - Configure complete
>>>> Loading archive ./dwca-teste3.zip for resource dr7 with unique terms
>>>> List(dwc:occurrenceID) stripping spaces false incremental false
>>>> testing false
>>>> Exception in thread "main"
>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>> delimiter
>>>>         at org.gbif.file.CSVReaderFactory.buildArchiveFile(
>>>> CSVReaderFactory.java:129)
>>>>         at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:
>>>> 46)
>>>>         at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(
>>>> ArchiveFactory.java:344)
>>>>         at org.gbif.dwc.text.ArchiveFactory.openArchive(
>>>> ArchiveFactory.java:289)
>>>>         at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>>         at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>>         at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>>         at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
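>>>>
>>>> For intuition, delimiter auto-detection generally behaves like
>>>> Python's csv.Sniffer (used here only as a stand-in for GBIF's
>>>> CSVReaderFactory): it succeeds on clearly delimited rows and raises
>>>> when no delimiter can be inferred from the sampled data.

```python
# Illustration of delimiter auto-detection using Python's csv.Sniffer as a
# stand-in for GBIF's CSVReaderFactory: detection works on clearly
# delimited rows and raises when it cannot infer a delimiter - the same
# class of failure as UnkownDelimitersException above.
import csv

good_sample = "occurrenceID,municipality\n1,Sao Paulo\n2,Rio de Janeiro\n"
dialect = csv.Sniffer().sniff(good_sample)
print(repr(dialect.delimiter))  # -> ','

bad_sample = "occurrenceID\n1\n2\n"  # single column: nothing to detect
try:
    csv.Sniffer().sniff(bad_sample)
except csv.Error as err:
    print("detection failed:", err)
```

>>>> A common cause of this symptom is a mismatch between the delimiter
>>>> declared in the archive's meta.xml and the delimiter actually used in
>>>> the data file, so that is worth checking first.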
>>>>
>>>>
>>>>  Thanks!
>>>>
>>>>  Regards.
>>>> --
>>>>  Daniel Lins da Silva
>>>> (Mobile) 55 11 96144-4050
>>>>  Research Center on Biodiversity and Computing (Biocomp)
>>>> University of Sao Paulo, Brazil
>>>>  daniellins at usp.br
>>>> daniel.lins at gmail.com
>>>>
>>>>
>>>
>>>
>>>  --
>>> Daniel Lins da Silva
>>> (Cel) 11 6144-4050
>>> daniel.lins at gmail.com
>>>
>>
>>
>>
>>  --
>> Daniel Lins da Silva
>> (Cel) 11 6144-4050
>> daniel.lins at gmail.com
>>
>
>
>
>  --
> Daniel Lins da Silva
> (Cel) 11 6144-4050
> daniel.lins at gmail.com
>



-- 
Daniel Lins da Silva
(Cel) 11 6144-4050
daniel.lins at gmail.com