[Ala-portal] DwC-A loading problems
Daniel Lins
daniel.lins at gmail.com
Fri May 9 07:38:48 CEST 2014
Thanks Dave! I will test in our server.
Regards,
Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins at usp.br
daniel.lins at gmail.com
2014-05-09 2:14 GMT-03:00 <David.Martin at csiro.au>:
> Thanks Daniel.
>
> I've spotted the problem:
>
> java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l
> dataset-updated.csv -b true
>
> this bypasses lookups against the collectory for the metadata.
>
> To load this dataset, you can use the biocache commandline tool like so:
>
> $ java -cp /usr/lib/biocache:/usr/lib/biocache/biocache-store-1.1-assembly.jar \
>     -Xms2g -Xmx2g au.org.ala.biocache.cmd.CommandLineTool
>
>
> ----------------------------
> | Biocache management tool |
> ----------------------------
>
> Please supply a command or hit ENTER to view command list.
>
> biocache> ingest dr8
>
> This will:
>
> 1) Retrieve the metadata from the configured instance of the collectory
> 2) Load, process, sample (if there are layers configured and available)
> and index
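
The order of the steps listed above can be sketched as follows. This is only an illustration of the sequence, not the biocache implementation: `ingest`, `fetch_metadata`, `steps` and `layers_available` are hypothetical names introduced here, and the real tool's internals may differ.

```python
from typing import Callable, Dict, List

Metadata = Dict[str, object]
Step = Callable[[str, Metadata], None]


def ingest(resource_uid: str,
           fetch_metadata: Callable[[str], Metadata],
           steps: Dict[str, Step],
           layers_available: bool) -> List[str]:
    """Run the stages in the order described in the message and return that order.

    All names here are hypothetical; this only models the sequence
    metadata lookup -> load -> process -> sample (optional) -> index.
    """
    # 1) Retrieve the metadata for the data resource (e.g. from a registry).
    metadata = fetch_metadata(resource_uid)

    # 2) Load, process, sample (only if layers are available) and index.
    order = ["load", "process"]
    if layers_available:
        order.append("sample")
    order.append("index")

    for name in order:
        steps[name](resource_uid, metadata)
    return order
```

The point of the sketch is that the metadata lookup happens first, so the later stages see the data resource's configuration.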
>
> Cheers
>
> Dave
>
> ------------------------------
> *From:* Daniel Lins [daniel.lins at gmail.com]
> *Sent:* 09 May 2014 14:27
>
> *To:* Martin, Dave (CES, Black Mountain)
> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
> Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
> *Subject:* Re: [Ala-portal] DwC-A loading problems
>
> David,
>
> The dr0 configuration:
>
> https://www.dropbox.com/s/lsy11jadwmyghjj/collectoryConfig1.png
>
> Sorry, but this server doesn't have external access yet.
>
>
>
>
> 2014-05-09 1:06 GMT-03:00 <David.Martin at csiro.au>:
>
>> As an example of what it should look like, see:
>>
>>
>> http://ala-demo.gbif.org/collectory/dataResource/edit/dr8?page=contribution
>>
>>
>> ------------------------------
>> *From:* Daniel Lins [daniel.lins at gmail.com]
>>
>> *Sent:* 09 May 2014 13:59
>> *To:* Martin, Dave (CES, Black Mountain)
>> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
>> Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
>> *Subject:* Re: [Ala-portal] DwC-A loading problems
>>
>> Thanks David,
>>
>> We use the DwC term "occurrenceID" to identify the records. It's a
>> unique key.
>>
>> However, when I reload a dataset to update some DwC terms of the
>> records, the system duplicates this data (keeps the old record and creates
>> another with changes).
>>
>> For instance (update of locality).
>>
>> Load 1 ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l
>> dataset.csv -b true)
>>
>> {OccurrenceID: 1, municipality: Sao Paulo, ...},
>> {OccurrenceID: 2, municipality: Sao Paulo, ...}
>>
>> Process 1 (biocache$ process dr0)
>> Index 1 (biocache$ index dr0)
>>
>> Load 2 (updated records and new records) ($ java -cp .:biocache.jar
>> au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true)
>>
>> {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
>> {OccurrenceID: 2, municipality: Rio de Janeiro, ...},
>> {OccurrenceID: 3, municipality: Sao Paulo, ...}
>>
>> Process 2 (biocache$ process dr0)
>> Index 2 (biocache$ index dr0)
>>
>> Results shown by ALA:
>>
>> {OccurrenceID: 1, municipality: Sao Paulo, ...},
>> {OccurrenceID: 2, municipality: Sao Paulo, ...},
>> {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
>> {OccurrenceID: 2, municipality: Rio de Janeiro, ...}
>> {OccurrenceID: 3, municipality: Sao Paulo, ...}
>>
>> But I expected:
>>
>> {OccurrenceID: 1, municipality: Rio de Janeiro, ...},
>> {OccurrenceID: 2, municipality: Rio de Janeiro, ...}
>> {OccurrenceID: 3, municipality: Sao Paulo, ...}
>>
>> Do I need to delete the existing data (delete-resource function) before the
>> reload? If not, what did I do wrong to cause this data duplication?
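
A small sketch for confirming this kind of duplication in an exported result set. It uses the five records listed under "Results shown by ALA" as sample data; the function name `duplicated_ids` and the lower-case `occurrenceID` field name are assumptions made for the example, not part of any ALA tool.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicated_ids(records: Iterable[Dict[str, str]],
                   key_term: str = "occurrenceID") -> Dict[str, int]:
    """Return every key value that occurs more than once, with its count."""
    counts = Counter(r[key_term] for r in records if key_term in r)
    return {value: n for value, n in counts.items() if n > 1}


# The records reported after the second load, as plain dictionaries.
results = [
    {"occurrenceID": "1", "municipality": "Sao Paulo"},
    {"occurrenceID": "2", "municipality": "Sao Paulo"},
    {"occurrenceID": "1", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "2", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "3", "municipality": "Sao Paulo"},
]

print(duplicated_ids(results))  # prints {'1': 2, '2': 2}
```

An empty result would mean the reload replaced the existing records, which is the expected outcome described above.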
>>
>> Thanks!
>>
>>
>> Regards,
>>
>> Daniel Lins da Silva
>> (Mobile) 55 11 96144-4050
>> Research Center on Biodiversity and Computing (Biocomp)
>> University of Sao Paulo, Brazil
>> daniellins at usp.br
>> daniel.lins at gmail.com
>>
>>
>>
>>
>>
>> 2014-05-07 0:46 GMT-03:00 <David.Martin at csiro.au>:
>>
>>> Thanks Daniel. Natasha has now left the ALA.
>>>
>>> The uniqueness of records is determined by information stored in the
>>> collectory. See screenshot [1].
>>> By default, "catalogNumber" is used but you can change this to any
>>> number of fields that should be stable in the data.
>>> Using unstable fields for the ID isn't recommended (e.g. scientificName).
>>> To update the records, the process is to just re-load the dataset.
>>>
>>> Automatically loaded - this isn't in use and we may remove it from the UI
>>> in future iterations.
>>> Incremental Load - affects the sample/process/index steps to only run
>>> these against the new records. Load is always incremental based on the key
>>> field(s) but if the incremental load box isn’t checked it runs the
>>> sample/process/index steps against the whole data set. This can cause a
>>> large processing overhead when there’s a minor update to a large data set.
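
The behaviour described above (records matched on the configured key fields, with the incremental option limiting later steps to the touched records) can be modelled roughly as below. This is a sketch under stated assumptions, not the biocache code: `unique_key`, `load` and `keys_to_process` are hypothetical names, the in-memory dictionary stands in for the real store, and treating changed records as "new" for the incremental case is an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]
Key = Tuple[str, ...]


def unique_key(resource_uid: str, record: Record, terms: List[str]) -> Key:
    """Build a composite key from the resource UID and the configured key terms."""
    return (resource_uid,) + tuple(record.get(term, "") for term in terms)


def load(store: Dict[Key, Record], resource_uid: str,
         records: Iterable[Record], terms: List[str]) -> Set[Key]:
    """Insert or overwrite records by key; return keys that were new or changed."""
    touched: Set[Key] = set()
    for record in records:
        key = unique_key(resource_uid, record, terms)
        if store.get(key) != record:
            store[key] = dict(record)  # same key overwrites, so no duplicate
            touched.add(key)
    return touched


def keys_to_process(store: Dict[Key, Record], touched: Set[Key],
                    incremental: bool) -> Set[Key]:
    """Incremental: only the touched records; otherwise the whole data set."""
    return set(touched) if incremental else set(store)
```

With `terms=["occurrenceID"]`, reloading a record whose `occurrenceID` already exists replaces the stored copy. If the key terms differ between loads, the same record lands under a different key and appears twice.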
>>>
>>> Cheers
>>>
>>> Dave Martin
>>> ALA
>>>
>>> [1] http://bit.ly/1g72HFN
>>>
>>> ------------------------------
>>> *From:* Daniel Lins [daniel.lins at gmail.com]
>>> *Sent:* 05 May 2014 15:39
>>> *To:* Quimby, Natasha (CES, Black Mountain)
>>> *Cc:* ala-portal at lists.gbif.org; dos Remedios, Nick (CES, Black
>>> Mountain); Martin, Dave (CES, Black Mountain); Pedro Corrêa
>>> *Subject:* Re: [Ala-portal] DwC-A loading problems
>>>
>>> Hi Natasha,
>>>
>>> I managed to import the DwC-A file following the steps reported in the
>>> previous email. Thank you!
>>>
>>> However, when I tried to update some metadata of an occurrence record
>>> (already stored in the database), the system created a new record with
>>> the duplicated information. So I started to have several records with the
>>> same occurrenceID (I did set the data resource configuration to use
>>> "occurrenceID" to uniquely identify a record).
>>>
>>> How can I update existing records in the database? For instance, the
>>> location's metadata of an occurrence record stored in my database?
>>>
>>> I also would like to better understand the behavior of the properties
>>> "Automatically loaded" and "Incremental Load".
>>>
>>> Thanks!!
>>>
>>> Regards,
>>>
>>> Daniel Lins da Silva
>>> (Mobile) 55 11 96144-4050
>>> Research Center on Biodiversity and Computing (Biocomp)
>>> University of Sao Paulo, Brazil
>>> daniellins at usp.br
>>> daniel.lins at gmail.com
>>>
>>>
>>> 2014-04-28 3:52 GMT-03:00 Daniel Lins <daniel.lins at gmail.com>:
>>>
>>>> Thanks Natasha!
>>>>
>>>> I will try your recommendations. Once finished, I will contact you.
>>>>
>>>> Regards
>>>>
>>>> Daniel Lins da Silva
>>>> (Mobile) 55 11 96144-4050
>>>> Research Center on Biodiversity and Computing (Biocomp)
>>>> University of Sao Paulo, Brazil
>>>> daniellins at usp.br
>>>> daniel.lins at gmail.com
>>>>
>>>>
>>>>
>>>> 2014-04-28 3:26 GMT-03:00 <Natasha.Quimby at csiro.au>:
>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> When you specify a local DwC-A load, the archive needs to be unzipped.
>>>>> Try unzipping 2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip and then
>>>>> running the following:
>>>>> sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b
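
A minimal sketch of the unzip step, for cases where it is convenient to script it. It extracts the archive into a directory named after the zip file, which matches the directory path passed to the loader above. `unpack_archive` and `has_descriptor` are hypothetical helper names, and the `meta.xml` check assumes the archive ships the usual Darwin Core Archive descriptor file.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path: str, dest_dir: str) -> Path:
    """Extract zip_path into dest_dir/<zip name without .zip> and return that directory."""
    source = Path(zip_path)
    if not zipfile.is_zipfile(source):
        raise ValueError(f"{source} is not a zip archive")
    target = Path(dest_dir) / source.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as archive:
        # Only use extractall on archives from a trusted source.
        archive.extractall(target)
    return target


def has_descriptor(archive_dir: Path) -> bool:
    """Check for the meta.xml descriptor at the top level of the unpacked archive."""
    return (archive_dir / "meta.xml").is_file()
```

If `has_descriptor` returns False, the data files may sit in a subdirectory of the zip, in which case the path given to the loader would need to point at that subdirectory instead.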
>>>>>
>>>>> If you configure the collectory to provide the DwC-A, the biocache
>>>>> automatically unzips the archive for you. You would need to configure dr7
>>>>> with the following connection parameters:
>>>>>
>>>>> "protocol":"DwCA",
>>>>> "termsForUniqueKey":["occurrenceID"],
>>>>> "url":"file:////data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"
>>>>>
>>>>> You could then load the resource by:
>>>>> sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7
>>>>>
>>>>> If you continue to have issues please let us know.
>>>>>
>>>>> Hope that this helps.
>>>>>
>>>>> Regards
>>>>> Natasha
>>>>>
>>>>> From: Daniel Lins <daniel.lins at gmail.com>
>>>>> Date: Monday, 28 April 2014 3:54 PM
>>>>> To: "ala-portal at lists.gbif.org" <ala-portal at lists.gbif.org>,
>>>>> "dos Remedios, Nick (CES, Black Mountain)" <Nick.Dosremedios at csiro.au>,
>>>>> "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>
>>>>> Subject: [Ala-portal] DwC-A loading problems
>>>>>
>>>>> Hi Nick and Dave,
>>>>>
>>>>> We are having some problems in Biocache during the upload of DwC-A
>>>>> files.
>>>>>
>>>>> As shown below, after running "au.org.ala.util.DwCALoader", our system
>>>>> returns the error message 'Exception in thread "main"
>>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>>> delimiter'.
>>>>>
>>>>> I ran tests using DwC-A files with tab-delimited text files and
>>>>> comma-delimited text files. In both cases the error was the same.
>>>>>
>>>>> What causes these problems? (Note: the CSV loader works fine.)
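
A rough way to check, outside the loader, which delimiter the header row of a data file uses. This is an illustrative heuristic only and is not how the GBIF reader detects delimiters. `detect_delimiter` and its candidate list are assumptions for the example, and quoted fields are not handled.

```python
from typing import Sequence


def detect_delimiter(header_line: str,
                     candidates: Sequence[str] = ("\t", ",")) -> str:
    """Return the candidate delimiter that occurs most often in the header line.

    Raises ValueError when none of the candidates appears at all.
    On a tie, the earlier candidate wins.
    """
    counts = {delim: header_line.count(delim) for delim in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("Unable to detect field delimiter")
    return best


# Header rows for the two kinds of test file mentioned in the message.
tab_header = "occurrenceID\tmunicipality\tscientificName"
comma_header = "occurrenceID,municipality,scientificName"

print(repr(detect_delimiter(tab_header)))
print(repr(detect_delimiter(comma_header)))
```

Running the check on the first line of each data file inside the unpacked archive shows whether the file itself is well-formed, separately from how the archive is passed to the loader.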
>>>>>
>>>>> *tab-delimited file test*
>>>>>
>>>>> poliusp at poliusp-VirtualBox:~/dev/biocache$ sudo java -cp
>>>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l
>>>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
>>>>> 2014-04-28 01:44:02,837 INFO : [ConfigModule] - Loading configuration
>>>>> from /data/biocache/config/biocache-config.properties
>>>>> 2014-04-28 01:44:03,090 INFO : [ConfigModule] - Initialise SOLR
>>>>> 2014-04-28 01:44:03,103 INFO : [ConfigModule] - Initialise name
>>>>> matching indexes
>>>>> 2014-04-28 01:44:03,605 INFO : [ConfigModule] - Initialise
>>>>> persistence manager
>>>>> 2014-04-28 01:44:03,606 INFO : [ConfigModule] - Configure complete
>>>>> Loading archive
>>>>> /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
>>>>> for resource dr7 with unique terms List(dwc:occurrenceID) stripping
>>>>> spaces false incremental false testing false
>>>>> *Exception in thread "main"
>>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>>> delimiter*
>>>>> at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
>>>>> at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
>>>>> at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
>>>>> at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
>>>>> at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>>> at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>>> at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>>> at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
>>>>>
>>>>>
>>>>> *comma-delimited file test*
>>>>>
>>>>> poliusp at poliusp-VirtualBox:~/dev/biocache$ *sudo java -cp
>>>>> .:biocache.jar au.org.ala.util.DwCALoader dr7 -l ./dwca-teste3.zip*
>>>>> 2014-04-28 01:56:04,683 INFO : [ConfigModule] - Loading configuration
>>>>> from /data/biocache/config/biocache-config.properties
>>>>> 2014-04-28 01:56:04,940 INFO : [ConfigModule] - Initialise SOLR
>>>>> 2014-04-28 01:56:04,951 INFO : [ConfigModule] - Initialise name
>>>>> matching indexes
>>>>> 2014-04-28 01:56:05,437 INFO : [ConfigModule] - Initialise
>>>>> persistence manager
>>>>> 2014-04-28 01:56:05,438 INFO : [ConfigModule] - Configure complete
>>>>> Loading archive ./dwca-teste3.zip for resource dr7 with unique terms
>>>>> List(dwc:occurrenceID) stripping spaces false incremental false
>>>>> testing false
>>>>> *Exception in thread "main"
>>>>> org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field
>>>>> delimiter*
>>>>> at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
>>>>> at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
>>>>> at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
>>>>> at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
>>>>> at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
>>>>> at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
>>>>> at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
>>>>> at au.org.ala.util.DwCALoader.main(DwCALoader.scala)
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Regards.
>>>>> --
>>>>> Daniel Lins da Silva
>>>>> (Mobile) 55 11 96144-4050
>>>>> Research Center on Biodiversity and Computing (Biocomp)
>>>>> University of Sao Paulo, Brazil
>>>>> daniellins at usp.br
>>>>> daniel.lins at gmail.com
>>>>>
>>>>>
>>>>