Problems loading huge DwC Archive
Dear all, I'm still stuck with my dataset of more than 20 million occurrences.
I understood that the issue was due to the large size of the zip file: it is too big for the ZipFile Java API. I used a little trick and was able to create the data resource: I registered the DwC Archive with occurrence and verbatim files containing just 15 occurrences, then replaced those files with the real ones, and it seems to work.
Now the problem is that when I try to load the zip file into Cassandra using the biocache load function, I get a Java out-of-memory (heap) error, because the code uses RAM to download, unzip and read the file. Unfortunately, 4 GB (zip file) and 23 GB (unzipped) is too big for that.
Do you know if there is another way to do it? Can I unzip the file and run the loading afterwards? Or can I "manually" load the data into Cassandra?
Thanks in advance for your help. Cheers, Marie
Hi Marie
Do you see any output at all? E.g. are there log lines like:
1000 >> last key : 123, UUID: … records per sec: 4567
2000 >> last key : 423, UUID: … records per sec: 4568
As far as I can tell, the code [1] is using the usual GBIF DwC-A reader, which is just an iterator and should not use significant memory.
I notice it appears to be batching [2]. I am not familiar with that code, but I'd expect to find a config setting for the batch size, and perhaps if it is not set it defaults to unlimited?
The DwC-A reader also handles pre-deflated zips (i.e. a directory rather than a zip file). You could try unzipping the archive using standard tools and then, rather than using a URL, reference a local directory, so that loadLocal [3] is called instead of the download/unzip/load path that a URL invokes.
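Purely for illustration, here is a rough, untested Groovy sketch of what a constant-memory unzip looks like (all paths are placeholders; a standard command-line unzip does the same job):

import java.util.zip.ZipInputStream

// Placeholder paths -- adjust to wherever the archive actually lives.
def zipFile   = new File('/data/dwca/dr179.zip')
def targetDir = new File('/data/dwca/dr179')
targetDir.mkdirs()

zipFile.withInputStream { raw ->
    def zis = new ZipInputStream(raw)
    def entry
    while ((entry = zis.nextEntry) != null) {
        def out = new File(targetDir, entry.name)
        if (entry.isDirectory()) {
            out.mkdirs()
        } else {
            out.parentFile?.mkdirs()
            out.withOutputStream { os ->
                // Copy each entry through a small buffer rather than
                // reading it into memory in one go.
                byte[] buf = new byte[8192]
                int n
                while ((n = zis.read(buf)) != -1) {
                    os.write(buf, 0, n)
                }
            }
        }
        zis.closeEntry()
    }
}

Either way, you would then point the data resource at the resulting directory rather than a URL, so that loadLocal [3] is used.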
I'm sorry I have not loaded a large file myself, but I know it should be possible.
I hope this helps provide some ideas at least.
Many thanks, Tim
[1] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
[2] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
[3] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130
Hi Tim, all,
Unfortunately, the error occurs before the loading even begins. You can find below the log output from running the "biocache load dr179" command. The error comes from the upload URL: I get the same error when I try to access this URL directly: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda6...
The VM that hosts the collectory module used to have 4 GB of RAM. I asked whether it could be increased (for a limited time) and it now has 32 GB. Even with this extra memory, we still get the error.
Thanks for your answer, Cheers, Marie
----------------------------
root@vm-6:/home/ubuntu# biocache load dr179
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO
2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching indexes
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from /data/lucene/namematching
2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence manager
2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete
2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with pool name: biocache-store-pool
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with hosts: 192.168.0.19
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with port: 9160
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max connections: -1
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max retries: 6
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with operation timeout: 80000
2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store
2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of blacklisted media URLs
2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179
2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading
2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda6...
2016-06-01 15:57:09,235 INFO : [DataLoader] - Content-Disposition: attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP response code: 500 for URL: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda6...
Hi all,
If it helps, you can find attached the error that I got in the collectory.
Cheers, Marie
Thanks Marie.
It looks like a bug in the collectory when supplying large archives to the biocache CMD tool. I'm surprised, because we've loaded large archives for the ALA (10 GB zip files with images), so we'll need to investigate. If you can log a bug, we'll look at it as soon as we can.
In the meantime, as a workaround you could put the archive somewhere web-accessible and put the URL of that location into the config for the data resource.
See the "LocationURL" field in "Connection parameters".
This way the archive won't be downloaded from the collectory as such, but from another web server (which hopefully won't have a problem serving a large file).
Hope this helps,
Dave
I just found a fix (not totally perfect, because it causes an error in the SRSF Filter that I think we can ignore for now).
I changed this line:
https://github.com/AtlasOfLivingAustralia/collectory-plugin/blob/ec60b0e23a2...
with this code:
file.withInputStream { response.outputStream << it }
I used this code (https://github.com/pdorobisz/grails-file-server/blob/master/grails-app/contr...) to help me understand how the system works.
Basically, it just processes the file in chunks.
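To make the difference concrete, here is an illustrative contrast (not the actual collectory code; 'file' and 'response' stand for the Grails controller's archive file and HTTP response):

// In-memory: materialises the whole ~4 GB archive as a byte[] on the heap.
// response.outputStream << file.bytes

// Streamed: Groovy's '<<' copies from the InputStream to the OutputStream
// through a small internal buffer, so heap usage stays constant.
file.withInputStream { ins ->
    response.outputStream << ins
}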
The loading began 10 minutes ago, so fingers crossed!
Cheers, Marie
participants (4):
- David.Martin@csiro.au
- Marie Elise Lecoq
- melecoq
- Tim Robertson