Re: [Ala-portal] DwC-A loading problems

27 Jun 2014

Thanks for resending, Pedro.

We recommend using the Ansible scripts [1] to install the components.
If this isn't possible, it's worth checking out the templates used by these scripts [2].
There is a property at the bottom of this file that disables the API key check.

security.apikey.checkEnabled=false

That said, that error looks like a network issue between the machine that is running the collectory and the machine you are loading the data to.
It's worth checking to see if the ports are open between these machines.
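As a rough illustration of the port check suggested above, here is a minimal Python sketch. The function name port_is_open is hypothetical, and the host and port in the usage comment are taken from the registry.url quoted elsewhere in this thread; substitute your own collectory machine.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage, run from the machine that loads the data:
#   port_is_open("192.168.15.132", 8080)
```

Note that an open port only shows that a TCP connection can be established. A "Read timed out" error can still occur if the service accepts the connection but is slow to respond.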

Cheers

Dave

[1] https://github.com/gbif/ala-install<https://github.com/gbif/ala-install/blob/master/ansible/roles/collectory/templates/config/collectory-config.properties>
[2] https://github.com/gbif/ala-install/blob/master/ansible/roles/collectory/tem...

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 27 June 2014 13:54
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; Pedro Corrêa
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Dave,

Did you see this mail? Do you think this issue could be related to the configuration of the api_key property?

Thanks.

Regards,

2014-06-25 2:14 GMT-03:00 Daniel Lins <daniel.lins@gmail.com>:
Hi Dave,

Thanks for the support.

Data loading into the biocache is working properly now, but the error still occurs during the update of the collectory (see below).

java.net.SocketTimeoutException: Read timed out
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at scalaj.http.Http$Request.liftedTree1$1(Http.scala:107)
at scalaj.http.Http$Request.process(Http.scala:103)
at scalaj.http.Http$Request.responseCode(Http.scala:120)
at au.org.ala.biocache.load.DataLoader$class.updateLastChecked(DataLoader.scala:354)
at au.org.ala.biocache.load.DwCALoader.updateLastChecked(DwCALoader.scala:74)
at au.org.ala.biocache.load.DwCALoader.load(DwCALoader.scala:103)
at au.org.ala.biocache.load.Loader.load(Loader.scala:75)
at au.org.ala.biocache.cmd.CMD$$anonfun$executeCommand$7.apply(CMD.scala:69)
at au.org.ala.biocache.cmd.CMD$$anonfun$executeCommand$7.apply(CMD.scala:69)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at au.org.ala.biocache.cmd.CMD$.executeCommand(CMD.scala:69)
at au.org.ala.biocache.cmd.CommandLineTool$.main(CommandLineTool.scala:22)
at au.org.ala.biocache.cmd.CommandLineTool.main(CommandLineTool.scala)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at scalaj.http.Http$Request$$anonfun$responseCode$1.apply(Http.scala:120)
at scalaj.http.Http$Request$$anonfun$responseCode$1.apply(Http.scala:120)
at scalaj.http.Http$Request.liftedTree1$1(Http.scala:104)
... 13 more

In the external configuration file (/data/biocache/config/biocache-config.properties), the registry.url property is correct (registry.url=http://192.168.15.132:8080/collectory/ws) and points to the collectory WS page.

Could it be something related to permissions for external access? How does the api_key property work in the collectory?

Thanks!

Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-06-20 2:26 GMT-03:00 <David.Martin@csiro.au>:
Hi Daniel.

There is an updated version of biocache-store, 1.1.1, that helped fix some of the problems Burke spotted when loading Darwin Core archives downloaded from the GBIF portal. The symptoms were similar (only one record loaded for a dataset).

The exception in point 3) indicates that the URL you have configured for the collectory (registry.url in biocache.properties) is either incorrect, or the collectory cannot be accessed for some reason. At the end of a data load, the collectory is updated to indicate the last loaded date for that dataset. This is done using a web service.
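To rule out the "incorrect URL" case, one option is to read registry.url from the properties file and check that it parses into a host and port. The sketch below is only an illustration: the function names are hypothetical, and the parser handles plain key=value lines rather than the full Java properties syntax (no ":" separators or line continuations).

```python
from urllib.parse import urlsplit


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def registry_endpoint(props: dict):
    """Return (host, port) for registry.url, or None if unset or malformed."""
    parts = urlsplit(props.get("registry.url", ""))
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return None
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, port or default_port
```

The resulting host and port can then be used to test whether the collectory is reachable from the machine running the biocache command-line tool.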

One thing to mention - if you want to remove all data from your database, the easiest thing to do is use the cassandra-cli and run the command:
...
...
truncate occ;
This will remove all occurrence records from the database, but not from the index.

The warnings you are seeing in the processing phase e.g.

2014-06-20 01:51:20,505 WARN : [ALANameSearcher] - Unable to parse Abaca bunchy top  (Babuvirus). Name of type virus unparsable: Abaca bunchy top  (Babuvirus)

are normal. This is referring to the sensitive species list in use.

Cheers

Dave

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 20 June 2014 15:11
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Dave, thanks for the information from the last email.

I'm following your advice and updating our test environment to biocache version 1.1. But I'm having some problems, and I would like to know if you or anyone else has already come across this issue and knows a solution.

To update the biocache version I performed the steps below (based on the Vagrant/Ansible installation process):

1. Cleaning of the database and index through delete-resource function (delete-resource dr0 dr1 dr2 ...);
2. An update of the Biocache config file (/data/biocache/config/biocache-config.properties) (copied from the Vagrant VM, with some configuration changes);
3. An update of the biocache build file (biocache.jar) (copied from the Vagrant VM - /usr/lib/biocache);
4. Deployment of the new biocache-service build (copied from the Vagrant VM - tomcat7/webapps/biocache-service.war)
5. An update of the Solr config files (schema.xml, solrconfig.xml) (copied from the Vagrant VM - /data/solr/biocache);
6. Deletion of the index folder of the Biocache core (/data/solr/biocache/data);

Note 1: No changes were made to the Hubs-Webapp or the Collectory.

Note 2: Importing CSV files works (using load-local-csv dr0 /<file_location>/xxx.csv).

I tried to import a Darwin Core Archive by following these steps:

1. Created a data resource (dr0);

2. Uploaded a DWC-A zip file into the DR using the "Upload File" option.

Protocol: DarwinCore archive
Location URL: file:////data/collectory/upload/1403239521145/dwca-ocorrencias_lobo_guara_1.zip
Automatically loaded: false
DwC terms that uniquely identify a record: occurrenceID
Strip whitespaces in key: false
Incremental Load: false

3. Used the Command Line Tool (Biocache) to Load (load dr0), Process (process dr0) and Index (index dr0) data.

During the data loading phase, the system generated these errors:

...
2014-06-20 01:49:12,506 INFO : [DataLoader] - Finished DwC loader. Records processed: 32
java.net.SocketTimeoutException: Read timed out
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at scalaj.http.Http$Request.liftedTree1$1(Http.scala:107)
at scalaj.http.Http$Request.process(Http.scala:103)
at scalaj.http.Http$Request.responseCode(Http.scala:120)
at au.org.ala.biocache.load.DataLoader$class.updateLastChecked(DataLoader.scala:354)
at au.org.ala.biocache.load.DwCALoader.updateLastChecked(DwCALoader.scala:74)
at au.org.ala.biocache.load.DwCALoader.load(DwCALoader.scala:103)
at au.org.ala.biocache.load.Loader.load(Loader.scala:75)
at au.org.ala.biocache.cmd.CMD$$anonfun$executeCommand$7.apply(CMD.scala:69)
at au.org.ala.biocache.cmd.CMD$$anonfun$executeCommand$7.apply(CMD.scala:69)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at au.org.ala.biocache.cmd.CMD$.executeCommand(CMD.scala:69)
at au.org.ala.biocache.cmd.CommandLineTool$.main(CommandLineTool.scala:22)
at au.org.ala.biocache.cmd.CommandLineTool.main(CommandLineTool.scala)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at scalaj.http.Http$Request$$anonfun$responseCode$1.apply(Http.scala:120)
at scalaj.http.Http$Request$$anonfun$responseCode$1.apply(Http.scala:120)
at scalaj.http.Http$Request.liftedTree1$1(Http.scala:104)
... 13 more

And only one record was saved in Cassandra:

cqlsh:occ> select * from occ;

 key      | portalId | uuid
----------+----------+--------------------------------------
 dr0|null |     null | 1b5b21fc-594a-46e6-b8db-cf37c50b8f7b

During the data processing phase, the system generated these additional errors:

...
Jun 20, 2014 1:51:08 AM org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory createDataSource
INFO: Building new data source for org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory
Jun 20, 2014 1:51:08 AM org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory createBackingStore
INFO: Building backing store for org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory
2014-06-20 01:51:20,505 WARN : [ALANameSearcher] - Unable to parse Abaca bunchy top  (Babuvirus). Name of type virus unparsable: Abaca bunchy top  (Babuvirus)
2014-06-20 01:51:20,509 WARN : [ALANameSearcher] - Unable to parse Abaca mosaic, sugarcane mosaic (Potyvirus). Name of type virus unparsable: Abaca mosaic, sugarcane mosaic (Potyvirus)
2014-06-20 01:51:21,210 WARN : [ALANameSearcher] - Unable to parse Acute bee paralysis  (Cripavirus). Name of type virus unparsable: Acute bee paralysis  (Cripavirus)
2014-06-20 01:51:21,255 WARN : [ALANameSearcher] - Unable to parse Agropyron mosaic  (Rymovirus). Name of type virus unparsable: Agropyron mosaic  (Rymovirus)
2014-06-20 01:51:21,289 WARN : [ALANameSearcher] - Unable to parse Alphacrytovirus vicia. Name of type virus unparsable: Alphacrytovirus vicia
2014-06-20 01:51:21,334 WARN : [ALANameSearcher] - Unable to parse American plum line pattern  (APLPV, Ilaravirus). Name of type virus unparsable: American plum line pattern  (APLPV, Ilaravirus)
2014-06-20 01:51:21,525 WARN : [ALANameSearcher] - Unable to parse Apis iridescent  (Iridovirus). Name of type virus unparsable: Apis iridescent  (Iridovirus)
2014-06-20 01:51:21,546 WARN : [ALANameSearcher] - Unable to parse Apricot ring pox  (Unassigned). Name of type blacklisted unparsable: Apricot ring pox  (Unassigned)
2014-06-20 01:51:21,549 WARN : [ALANameSearcher] - Unable to parse Arabis mosaic  (Nepovirus). Name of type virus unparsable: Arabis mosaic  (Nepovirus)
2014-06-20 01:51:21,623 WARN : [ALANameSearcher] - Unable to parse Artichoke Italian latent  (Nepovirus). Name of type virus unparsable: Artichoke Italian latent  (Nepovirus)
2014-06-20 01:51:21,640 WARN : [ALANameSearcher] - Unable to parse Asparagus   (Ilarvirus). Name of type virus unparsable: Asparagus   (Ilarvirus)
2014-06-20 01:51:21,641 WARN : [ALANameSearcher] - Unable to parse Asparagus   (Potyvirus). Name of type virus unparsable: Asparagus   (Potyvirus)
...

During the last phase there were no errors. However, only one record was indexed.

2014-06-20 01:54:07,739 INFO : [SolrIndexDAO] - >>>>>>>>>>>>> Document count of index: 1
2014-06-20 01:54:07,741 INFO : [SolrIndexDAO] - Finalise finished.

I attached a file with the complete messages generated by Biocache during this test.

Thanks!

Cheers.

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-06-18 6:15 GMT-03:00 <David.Martin@csiro.au>:
Hi Daniel,
...
From what you've said, I'm not clear on what customisations you have made, so it's difficult to make a call on the impact of migrating to 1.1. We also do not know what subversion revisions you started with.
We can tell you that functionally there wasn't a great deal of difference between the later snapshots of 1.0 and 1.1.
The changes were largely structural, i.e. a clean-up of packages and removal of redundant code. We did this largely because we needed to (this code base is now over 5 years old) and we wanted to clean things up before other projects started to work with the software.

Upgrading to biocache-service 1.1 and biocache-store shouldn't require any changes to Cassandra, but it may require an upgrade of SOLR. If this is the case, you'll need to regenerate your index using the biocache command-line tool. Upgrading to 1.1 shouldn't require any changes to hubs-webapp if you've customised this component.

I'd really recommend moving to 1.1 sooner rather than later as it'll give you a stable baseline to work against.

Hope this helps,

Dave Martin
ALA

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 18 June 2014 15:54
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Dave,

How can I update Biocache-1.0-SNAPSHOT to version 1.1? I updated the biocache-store (biocache.jar) and the config file (/data/biocache/conf/config.properties-biocache) but I still have problems. What other steps do I need to take? Apparently this new version of the biocache configuration file directly affects my Biocache-Services and Solr.

Will this update affect other components?

I cannot use the installation process based on Vagrant/Ansible because our environment is different and already has customizations. So I would like to update the biocache with minimum impact, if possible. Afterwards we will have to plan the update of the other components.

Can you advise me as to the best way forward?

Thanks!!

Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-05-26 3:58 GMT-03:00 <David.Martin@csiro.au>:
Thanks Daniel.

I'd recommend upgrading to 1.1 and I'd recommend installation with the Ansible scripts. This will give you a baseline configuration.
The scripts can be tested on a local machine using Vagrant.
The configuration changed significantly between 1.0 and 1.1: removal of redundant, legacy properties and adoption of a standard format for property names.
Here's the template used for the configuration file in the Ansible scripts:

https://github.com/gbif/ala-install/blob/master/ansible/roles/biocache-servi...

Cheers

Dave Martin
ALA

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 26 May 2014 15:02
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Dave,

When I ran the ingest command (ingest dr0), the system showed errors like these below. However, after the error messages, I ran the index command (index dr0), and data were published on the Portal.

2014-05-20 14:15:05,412 ERROR: [Grid] - cannot find GRID: /data/ala/data/layers/ready/diva/worldclim_bio_19
2014-05-20 14:15:05,414 ERROR: [Grid] - java.io.FileNotFoundException: /data/ala/data/layers/ready/diva/worldclim_bio_19.gri (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:122)
at org.ala.layers.intersect.Grid.getValues3(Grid.java:1017)
at org.ala.layers.intersect.SamplingThread.intersectGrid(SamplingThread.java:112)
at org.ala.layers.intersect.SamplingThread.sample(SamplingThread.java:97)
at org.ala.layers.intersect.SamplingThread.run(SamplingThread.java:67)

2014-05-20 14:15:05,447 INFO : [Sampling] - ********* END - TEST BATCH SAMPLING FROM FILE ***************
2014-05-20 14:15:05,496 INFO : [Sampling] - Finished loading: /tmp/sampling-dr0.txt in 49ms
2014-05-20 14:15:05,496 INFO : [Sampling] - Removing temporary file: /tmp/sampling-dr0.txt
2014-05-20 14:15:05,553 INFO : [Consumer] - Initialising thread: 0
2014-05-20 14:15:05,575 INFO : [Consumer] - Initialising thread: 1
2014-05-20 14:15:05,575 INFO : [Consumer] - Initialising thread: 2
2014-05-20 14:15:05,577 INFO : [Consumer] - In thread: 0
2014-05-20 14:15:05,579 INFO : [Consumer] - Initialising thread: 3
2014-05-20 14:15:05,579 INFO : [ProcessWithActors] - Starting with dr0| endingwith dr0|~
2014-05-20 14:15:05,581 INFO : [Consumer] - In thread: 2
2014-05-20 14:15:05,581 INFO : [Consumer] - In thread: 1
2014-05-20 14:15:05,584 INFO : [Consumer] - In thread: 3
2014-05-20 14:15:05,592 INFO : [ProcessWithActors] - Initialised actors...
2014-05-20 14:15:05,647 INFO : [ProcessWithActors] - First rowKey processed: dr0|urn:lsid:icmbio.gov.br:icmbio.parnaso.occurrence:MA120999
2014-05-20 14:15:05,998 INFO : [ProcessWithActors] - Last row key processed: dr0|urn:lsid:icmbio.gov.br:icmbio.parnaso.occurrence:MA99991
2014-05-20 14:15:06,006 INFO : [ProcessWithActors] - Finished.
2014-05-20 14:15:06,015 INFO : [AttributionDAO] - Calling web service for dr0
2014-05-20 14:15:06,017 INFO : [Consumer] - Killing (Actor.act) thread: 3
2014-05-20 14:15:06,016 INFO : [Consumer] - Killing (Actor.act) thread: 2
2014-05-20 14:15:06,015 INFO : [Consumer] - Killing (Actor.act) thread: 1
2014-05-20 14:15:06,289 INFO : [AttributionDAO] - Looking up collectory web service for ICMBIO|PARNASO
May 20, 2014 2:15:10 PM org.geotools.referencing.factory.epsg.ThreadedEpsgFactory <init>
INFO: Setting the EPSG factory org.geotools.referencing.factory.epsg.DefaultFactory to a 1800000ms timeout
May 20, 2014 2:15:10 PM org.geotools.referencing.factory.epsg.ThreadedEpsgFactory <init>
INFO: Setting the EPSG factory org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory to a 1800000ms timeout
May 20, 2014 2:15:10 PM org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory createDataSource
INFO: Building new data source for org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory
May 20, 2014 2:15:10 PM org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory createBackingStore
INFO: Building backing store for org.geotools.referencing.factory.epsg.ThreadedHsqlEpsgFactory
2014-05-20 14:15:32,105 INFO : [Consumer] - Killing (Actor.act) thread: 0
Indexing live with URL: null, and params: null&dataResource=dr0
java.lang.NullPointerException
at au.org.ala.util.CMD$.au$org$ala$util$CMD$$indexDataResourceLive$1(CommandLineTool.scala:371)
at au.org.ala.util.CMD$$anonfun$executeCommand$2.apply(CommandLineTool.scala:90)
at au.org.ala.util.CMD$$anonfun$executeCommand$2.apply(CommandLineTool.scala:86)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:105)
at au.org.ala.util.CMD$.executeCommand(CommandLineTool.scala:86)
at au.org.ala.util.CommandLineTool$.main(CommandLineTool.scala:26)
at au.org.ala.util.CommandLineTool.main(CommandLineTool.scala)

We currently use biocache-1.0-SNAPSHOT in our environment. But in your last mail you mentioned the version biocache-1.1-assembly.

I downloaded this newer version, but when I ran it (ingest dr0) in our environment the system showed many errors (see below).

log4j:WARN custom level class [org.ala.client.appender.RestLevel] not found.
Exception in thread "main" java.lang.ExceptionInInitializerError
at au.org.ala.biocache.load.DataLoader$class.$init$(DataLoader.scala:28)
at au.org.ala.biocache.load.Loader.<init>(Loader.scala:34)
at au.org.ala.biocache.cmd.CMD$.executeCommand(CMD.scala:29)
at au.org.ala.biocache.cmd.CommandLineTool$.main(CommandLineTool.scala:22)
at au.org.ala.biocache.cmd.CommandLineTool.main(CommandLineTool.scala)
Caused by: com.google.inject.CreationException: Guice creation errors:

1) No implementation for java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.max.connections) was bound.
  while locating java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.max.connections)
    for parameter 4 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

2) No implementation for java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.max.retries) was bound.
  while locating java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.max.retries)
    for parameter 5 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

3) No implementation for java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.port) was bound.
  while locating java.lang.Integer annotated with @com.google.inject.name.Named(value=cassandra.port)
    for parameter 1 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

4) No implementation for java.lang.Integer annotated with @com.google.inject.name.Named(value=thrift.operation.timeout) was bound.
  while locating java.lang.Integer annotated with @com.google.inject.name.Named(value=thrift.operation.timeout)
    for parameter 6 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

5) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.hosts) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.hosts)
    for parameter 0 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

6) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.keyspace) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.keyspace)
    for parameter 3 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

7) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.pool) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=cassandra.pool)
    for parameter 2 at au.org.ala.biocache.persistence.CassandraPersistenceManager.<init>(CassandraPersistenceManager.scala:24)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:184)

8) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=exclude.sensitive.values) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=exclude.sensitive.values)
    for parameter 1 at au.org.ala.biocache.index.SolrIndexDAO.<init>(SolrIndexDAO.scala:28)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:164)

9) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=extra.misc.fields) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=extra.misc.fields)
    for parameter 2 at au.org.ala.biocache.index.SolrIndexDAO.<init>(SolrIndexDAO.scala:28)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:164)

10) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=solr.home) was bound.
  while locating java.lang.String annotated with @com.google.inject.name.Named(value=solr.home)
    for parameter 0 at au.org.ala.biocache.index.SolrIndexDAO.<init>(SolrIndexDAO.scala:28)
  at au.org.ala.biocache.ConfigModule.configure(Config.scala:164)

10 errors
at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:354)
at com.google.inject.InjectorBuilder.initializeStatically(InjectorBuilder.java:152)
at com.google.inject.InjectorBuilder.build(InjectorBuilder.java:105)
at com.google.inject.Guice.createInjector(Guice.java:92)
at com.google.inject.Guice.createInjector(Guice.java:69)
at com.google.inject.Guice.createInjector(Guice.java:59)
at au.org.ala.biocache.Config$.<init>(Config.scala:24)
at au.org.ala.biocache.Config$.<clinit>(Config.scala)
... 5 more
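The ten Guice errors above each name a property for which no binding was found, presumably because the configuration file read by the 1.1 jar did not define it. As a hedged sketch (the function name missing_keys is hypothetical, and the key list is copied only from the errors in this trace, so it is not a complete list of biocache properties), a properties file can be checked for those keys like this:

```python
# Property names copied from the "No implementation ... was bound" errors above.
REQUIRED_KEYS = [
    "cassandra.max.connections",
    "cassandra.max.retries",
    "cassandra.port",
    "thrift.operation.timeout",
    "cassandra.hosts",
    "cassandra.keyspace",
    "cassandra.pool",
    "exclude.sensitive.values",
    "extra.misc.fields",
    "solr.home",
]


def missing_keys(properties_text: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that have no key=value line in the text."""
    defined = set()
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (# or !).
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, _ = line.partition("=")
        if sep:
            defined.add(key.strip())
    return [key for key in required if key not in defined]
```

Calling missing_keys on the contents of the configuration file returns the subset of these ten keys that are absent; an empty list means all of them are defined.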

Regarding these specific issues (data update and incremental load), will I need to upgrade the biocache version (to 1.1 or newer), or can I keep working with version 1.0-SNAPSHOT? If I upgrade, will it remain compatible with the other components? How should I proceed?

Which layer files should I include in my environment to run these tests?

Thanks!

Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-05-09 2:14 GMT-03:00 <David.Martin@csiro.au>:
Thanks Daniel.

I've spotted the problem:

java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true

This bypasses lookups against the collectory for the metadata.

To load this dataset, you can use the biocache commandline tool like so:

$ java -cp /usr/lib/biocache:/usr/lib/biocache/biocache-store-1.1-assembly.jar -Xms2g -Xmx2g au.org.ala.biocache.cmd.CommandLineTool

----------------------------
| Biocache management tool |
----------------------------

Please supply a command or hit ENTER to view command list.

biocache> ingest dr8

This will:

1) Retrieve the metadata from the configured instance of the collectory
2) Load, process, sample (if there are layers configured and available) and index

Cheers

Dave

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 09 May 2014 14:27
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
Subject: Re: [Ala-portal] DwC-A loading problems

David,

The dr0 configuration:

https://www.dropbox.com/s/lsy11jadwmyghjj/collectoryConfig1.png

Sorry, but this server doesn't have external access yet.

2014-05-09 1:06 GMT-03:00 <David.Martin@csiro.au>:
As an example of what it should look like, see:

http://ala-demo.gbif.org/collectory/dataResource/edit/dr8?page=contribution

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 09 May 2014 13:59
To: Martin, Dave (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Pedro Corrêa; Nicholls, Miles (CES, Black Mountain)
Subject: Re: [Ala-portal] DwC-A loading problems

Thanks David,

We use the DwC term "occurrenceID" to identify the records. It's a unique key.

However, when I reload a dataset to update some DwC terms of the records, the system duplicates this data (keeps the old record and creates another with changes).

For instance (update of locality).

Load 1 ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l dataset.csv -b true)

{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...}

Process 1 (biocache$ process dr0)
Index 1 (biocache$ index dr0)

Load 2 (updated records and new records) ($ java -cp .:biocache.jar au.org.ala.util.DwcCSVLoader dr0 -l dataset-updated.csv -b true)

{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

Process 2 (biocache$ process dr0)
Index 2 (biocache$ index dr0)

Results shown by ALA:

{OccurrenceID: 1, municipality: Sao Paulo, ...},
{OccurrenceID: 2, municipality: Sao Paulo, ...},
{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

But I expected:

{OccurrenceID: 1, municipality: Rio de Janeiro, ...},
{OccurrenceID: 2, municipality: Rio de Janeiro, ...},
{OccurrenceID: 3, municipality: Sao Paulo, ...}

Do I need to delete the existing data (with the delete-resource function) before the reload? If not, what did I do wrong to cause this data duplication?

Thanks!

Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-05-07 0:46 GMT-03:00 <David.Martin@csiro.au>:
Thanks Daniel. Natasha has now left the ALA.

The uniqueness of records is determined by information stored in the collectory. See screenshot [1].
By default, "catalogNumber" is used but you can change this to any number of fields that should be stable in the data.
Using unstable fields for the ID isn't recommended (e.g. scientificName).  To update the records, the process is to just re-load the dataset.

Automatically loaded - this isn't in use and we may remove it from the UI in future iterations.
Incremental Load - affects the sample/process/index steps to only run these against the new records.  Load is always incremental based on the key field(s) but if the incremental load box isn't checked it runs the sample/process/index steps against the whole data set. This can cause a large processing overhead when there's a minor update to a large data set.
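
To illustrate the point about key fields, here is a minimal Python sketch of a load that is keyed on the configured terms. It is not the biocache implementation; the function names, the in-memory dict standing in for the record store, and the way the key is built from the resource UID plus the key terms are all assumptions made for illustration. The sample records are the ones from the dr0 example earlier in this thread.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]
Key = Tuple[str, ...]


def unique_key(resource_uid: str, record: Record, key_terms: List[str]) -> Key:
    """Build a key from the data resource UID plus the configured key terms.

    Hypothetical helper: the real key format used by the biocache may differ.
    """
    return (resource_uid,) + tuple(record[term] for term in key_terms)


def load(store: Dict[Key, Record], resource_uid: str,
         records: Iterable[Record], key_terms: List[str]) -> None:
    """Insert new records and overwrite existing ones that share a key."""
    for record in records:
        store[unique_key(resource_uid, record, key_terms)] = dict(record)


first = [
    {"occurrenceID": "1", "municipality": "Sao Paulo"},
    {"occurrenceID": "2", "municipality": "Sao Paulo"},
]
second = [
    {"occurrenceID": "1", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "2", "municipality": "Rio de Janeiro"},
    {"occurrenceID": "3", "municipality": "Sao Paulo"},
]

# Stable key: the reload updates records 1 and 2 and adds record 3.
stable: Dict[Key, Record] = {}
load(stable, "dr0", first, ["occurrenceID"])
load(stable, "dr0", second, ["occurrenceID"])

# Key that includes a field whose value changed: the reload adds new
# records instead of updating the old ones.
unstable: Dict[Key, Record] = {}
load(unstable, "dr0", first, ["occurrenceID", "municipality"])
load(unstable, "dr0", second, ["occurrenceID", "municipality"])

print(len(stable), len(unstable))
```

With a stable key the second load leaves three records; once a changing field is part of the key, the same two loads leave five. This is only meant to show why unstable fields make poor key terms, not to diagnose any particular installation.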

Cheers

Dave Martin
ALA

[1] http://bit.ly/1g72HFN

________________________________
From: Daniel Lins [daniel.lins@gmail.com]
Sent: 05 May 2014 15:39
To: Quimby, Natasha (CES, Black Mountain)
Cc: ala-portal@lists.gbif.org; dos Remedios, Nick (CES, Black Mountain); Martin, Dave (CES, Black Mountain); Pedro Corrêa
Subject: Re: [Ala-portal] DwC-A loading problems

Hi Natasha,

I managed to import the DwC-A file following the steps reported in the previous email. Thank you!

However, when I tried to update some metadata of an occurrence record (already stored in the database), the system created a new record with the duplicated information. So I now have several records with the same occurrenceID (I set the data resource configuration to use "OcurrenceID" to uniquely identify a record).

How can I update existing records in the database? For instance, the location's metadata of an occurrence record stored in my database?

I also would like to better understand the behavior of the properties "Automatically loaded" and "Incremental Load".

Thanks!!

Regards,

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-04-28 3:52 GMT-03:00 Daniel Lins <daniel.lins@gmail.com>:
Thanks Natasha!

I will try your recommendations. Once finished, I will contact you.

Regards

Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

2014-04-28 3:26 GMT-03:00 <Natasha.Quimby@csiro.au>:

Hi Daniel,

When you specify a local DwC-A load, the archive needs to be unzipped first. Try unzipping 2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip and then running the following:
sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b
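
The unzip step can be done with any tool. As one option, here is a small Python sketch using only the standard library; the function name unzip_archive is made up for this example, and extracting into a directory named after the zip file (without the .zip suffix) is an assumption chosen to match the path in the command above.

```python
import zipfile
from pathlib import Path


def unzip_archive(zip_path: str) -> Path:
    """Extract a DwC-A zip next to itself and return the extracted directory."""
    source = Path(zip_path)
    if not zipfile.is_zipfile(source):
        raise ValueError(f"{source} is not a zip archive")
    # Same location and name as the zip, minus the ".zip" suffix.
    target = source.with_suffix("")
    with zipfile.ZipFile(source) as archive:
        archive.extractall(target)
    return target
```

Calling unzip_archive on the uploaded .zip returns the directory to pass to the loader's -l option.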

If you configure the collectory to provide the DwC-A, the biocache automatically unzips the archive for you. You would need to configure dr7 with the following connection parameters:

"protocol":"DwCA",
"termsForUniqueKey":["occurrenceID"],
"url":"file:////data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip"

You could then load the resource by:
sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7
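
If it helps, the connection parameters can be checked before they are entered in the collectory. The Python sketch below assumes the three parameters are held together as a single JSON object; the quoting style above suggests this, but it is an assumption, and the variable names are illustrative only.

```python
import json

# Values copied from the connection parameters listed above.
connection_parameters = {
    "protocol": "DwCA",
    "termsForUniqueKey": ["occurrenceID"],
    "url": "file:////data/collectory/upload/1398658607824/"
           "2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip",
}

# Serialising and re-parsing confirms the text is well-formed JSON
# before it is pasted anywhere.
text = json.dumps(connection_parameters, indent=2)
parsed = json.loads(text)
print(text)
```

The printed text can then be compared with what the collectory stores for dr7.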

If you continue to have issues please let us know.

Hope that this helps.

Regards
Natasha

From: Daniel Lins <daniel.lins@gmail.com>
Date: Monday, 28 April 2014 3:54 PM
To: "ala-portal@lists.gbif.org" <ala-portal@lists.gbif.org>, "dos Remedios, Nick (CES, Black Mountain)" <Nick.Dosremedios@csiro.au>, "Martin, Dave (CES, Black Mountain)" <David.Martin@csiro.au>
Subject: [Ala-portal] DwC-A loading problems

Hi Nick and Dave,

We are having some problems in Biocache during the upload of DwC-A files.

As shown below, after running the class "au.org.ala.util.DwCALoader", our system returns the error message "Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter"

I ran tests using DwC-A files with tab-delimited text files and comma-delimited text files. In both cases the error generated was the same.

What causes these problems? (** CSV Loader works great)

tab-delimited file test

poliusp@poliusp-VirtualBox:~/dev/biocache$ sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip
2014-04-28 01:44:02,837 INFO : [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2014-04-28 01:44:03,090 INFO : [ConfigModule] - Initialise SOLR
2014-04-28 01:44:03,103 INFO : [ConfigModule] - Initialise name matching indexes
2014-04-28 01:44:03,605 INFO : [ConfigModule] - Initialise persistence manager
2014-04-28 01:44:03,606 INFO : [ConfigModule] - Configure complete
Loading archive /data/collectory/upload/1398658607824/2f676abc-4503-489e-8f0c-fcb6e1bc554b.zip for resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces false incremental false testing false
Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
        at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
        at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
        at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
        at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
        at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
        at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
        at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
        at au.org.ala.util.DwCALoader.main(DwCALoader.scala)

comma-delimited file test

poliusp@poliusp-VirtualBox:~/dev/biocache$ sudo java -cp .:biocache.jar au.org.ala.util.DwCALoader dr7 -l ./dwca-teste3.zip
2014-04-28 01:56:04,683 INFO : [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2014-04-28 01:56:04,940 INFO : [ConfigModule] - Initialise SOLR
2014-04-28 01:56:04,951 INFO : [ConfigModule] - Initialise name matching indexes
2014-04-28 01:56:05,437 INFO : [ConfigModule] - Initialise persistence manager
2014-04-28 01:56:05,438 INFO : [ConfigModule] - Configure complete
Loading archive ./dwca-teste3.zip for resource dr7 with unique terms List(dwc:occurrenceID) stripping spaces false incremental false testing false
Exception in thread "main" org.gbif.dwc.text.UnkownDelimitersException: Unable to detect field delimiter
        at org.gbif.file.CSVReaderFactory.buildArchiveFile(CSVReaderFactory.java:129)
        at org.gbif.file.CSVReaderFactory.build(CSVReaderFactory.java:46)
        at org.gbif.dwc.text.ArchiveFactory.readFileHeaders(ArchiveFactory.java:344)
        at org.gbif.dwc.text.ArchiveFactory.openArchive(ArchiveFactory.java:289)
        at au.org.ala.util.DwCALoader.loadArchive(DwCALoader.scala:129)
        at au.org.ala.util.DwCALoader.loadLocal(DwCALoader.scala:106)
        at au.org.ala.util.DwCALoader$.main(DwCALoader.scala:52)
        at au.org.ala.util.DwCALoader.main(DwCALoader.scala)

Thanks!

Regards.
--
Daniel Lins da Silva
(Mobile) 55 11 96144-4050
Research Center on Biodiversity and Computing (Biocomp)
University of Sao Paulo, Brazil
daniellins@usp.br
daniel.lins@gmail.com

--
Daniel Lins da Silva
(Cel) 11 6144-4050
daniel.lins@gmail.com

Re: [Ala-portal] DwC-A loading problems

David.Martin＠csiro.au