[Ala-portal] Problems loading huge DwC Archive

Marie Elise Lecoq melecoq at gbif.fr
Wed Jun 1 18:58:33 CEST 2016


Hi all,

In case it helps, I have attached the error that I got from the
collectory.

Cheers,
Marie

On Wed, Jun 1, 2016 at 9:43 AM, Marie Elise Lecoq <melecoq at gbif.fr> wrote:

> Hi Tim, all,
>
> Unfortunately, the error occurs before loading even begins. Below is the
> log from running the "biocache load dr179" command. The error comes from
> the upload side: I get the same error when I try to access this URL
> directly:
> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>
> The VM that hosts the collectory modules used to have 4 GB of RAM. I asked
> whether it could be increased (for a limited time) and it now has 32 GB.
> Even with this extra memory, we still get the error.
>
> Thanks for your answer,
> Cheers,
> Marie
>
> ----------------------------
> root at vm-6:/home/ubuntu# biocache load dr179
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file:
> /data/biocache/config/biocache-config.properties
> 2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from
> /data/biocache/config/biocache-config.properties
> 2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO
> 2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR
> 2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching
> indexes
> 2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from
> /data/lucene/namematching
> 2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence
> manager
> 2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete
> 2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with pool name: biocache-store-pool
> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with hosts: 192.168.0.19
> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with port: 9160
> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with max connections: -1
> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with max retries: 6
> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
> Initialising cassandra connection pool with operation timeout: 80000
> 2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store
> 2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of
> blacklisted media URLs
> 2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179
> 2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading
> 2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from
> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
> 2016-06-01 15:57:09,235 INFO : [DataLoader] -  Content-Disposition:
> attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
> 2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP
> response code: 500 for URL:
> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>
> On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <trobertson at gbif.org>
> wrote:
>
>> Hi Marie
>>
>>
>> Do you see any output at all?  E.g. are there log lines like:
>>   1000 >> last key : 123, UUID: … records per sec: 4567
>>   2000 >> last key : 423, UUID: … records per sec: 4568
>>
>> As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
>> which is just an iterator and should not use significant memory for that.
>>
>> I notice it appears to be batching [2].  I am not familiar with that code
>> but I'd expect to find config for a batch size, and perhaps if it is not
>> set it might default to unlimited?
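>>
>> For illustration, the pattern I would look for is a capped batch that is
>> flushed and cleared as the iterator is consumed (generic Java sketch; the
>> names are hypothetical, not the biocache-store API). With a cap, heap use
>> stays flat however many records the iterator yields; without one, the
>> batch list grows with the whole archive:
>>
>> import java.util.ArrayList;
>> import java.util.Iterator;
>> import java.util.List;
>> import java.util.Map;
>>
>> class BatchedLoader {
>>     private static final int BATCH_SIZE = 1000; // hypothetical cap
>>
>>     static void load(Iterator<Map<String, String>> records) {
>>         List<Map<String, String>> batch = new ArrayList<>(BATCH_SIZE);
>>         while (records.hasNext()) {
>>             batch.add(records.next());
>>             if (batch.size() >= BATCH_SIZE) {
>>                 writeBatch(batch); // e.g. one batched store mutation
>>                 batch.clear();     // without this, the list grows unbounded
>>             }
>>         }
>>         if (!batch.isEmpty()) {
>>             writeBatch(batch);     // flush the remainder
>>         }
>>     }
>>
>>     static void writeBatch(List<Map<String, String>> batch) {
>>         // stands in for the real persistence call
>>     }
>> }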
>>
>> The DwC-A reader also handles pre-deflated zips (i.e. a directory, not a
>> zip file).  You could try unzipping it using standard tools and then,
>> rather than using a URL, reference a local directory so that it calls
>> loadLocal [3] instead of downloading, unzipping and then loading, which is
>> what a URL will invoke.
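>>
>> Whether you use a standard unzip tool or a small program, the point is to
>> stream the archive to disk rather than hold it in memory. A minimal Java
>> sketch of such an extraction (paths are hypothetical; java.util.zip on
>> Java 7 and later reads Zip64 archives over 4 GB):
>>
>> import java.io.IOException;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.nio.file.Paths;
>> import java.util.zip.ZipEntry;
>> import java.util.zip.ZipInputStream;
>>
>> class UnzipToDir {
>>     public static void main(String[] args) throws IOException {
>>         Path zip = Paths.get("/data/dr179.zip"); // hypothetical paths
>>         Path outDir = Paths.get("/data/dr179");
>>         Files.createDirectories(outDir);
>>         try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zip))) {
>>             ZipEntry entry;
>>             while ((entry = in.getNextEntry()) != null) {
>>                 Path target = outDir.resolve(entry.getName()).normalize();
>>                 if (entry.isDirectory()) {
>>                     Files.createDirectories(target);
>>                 } else {
>>                     Files.createDirectories(target.getParent());
>>                     // Files.copy streams the entry in small chunks and
>>                     // leaves the stream open, so heap use stays flat
>>                     Files.copy(in, target);
>>                 }
>>             }
>>         }
>>     }
>> }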
>>
>> I'm sorry I have not loaded a large file myself, but I know it should be
>> possible.
>>
>> I hope this helps provide some ideas at least.
>>
>> Many thanks,
>> Tim
>>
>>
>>
>> [1]
>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
>> [2]
>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
>> [3]
>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130
>>
>> On 01/06/16 05:05, "Ala-portal on behalf of melecoq"
>> <ala-portal-bounces at lists.gbif.org on behalf of melecoq at gbif.fr> wrote:
>>
>> > Dear all,
>> > I'm still stuck with my dataset of more than 20 million occurrences.
>> >
>> > I understood that the issue was due to the large size of the zip file:
>> > it's too big for the Java ZipFile API (standard zip archives are limited
>> > to 4 GB unless Zip64 is supported).
>> > With a little trick I was able to create the data resource: I set it up
>> > with a DwC Archive whose occurrence and verbatim files contained just
>> > 15 occurrences, then replaced those files with the real ones, and that
>> > seems to work.
>> >
>> > Now the problem is that when I try to load the zip file into Cassandra
>> > using the biocache load function, I get a Java out-of-memory heap error,
>> > because the code uses RAM to download, unzip and read the file.
>> > Unfortunately, 4 GB (zipped) and 23 GB (unzipped) is too big for that.
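>> >
>> > What I would expect instead is a streaming copy like the sketch below,
>> > where the heap only ever holds one small buffer, whatever the file size
>> > (URL and paths are hypothetical):
>> >
>> > import java.io.IOException;
>> > import java.io.InputStream;
>> > import java.io.OutputStream;
>> > import java.net.URL;
>> > import java.nio.file.Files;
>> > import java.nio.file.Paths;
>> >
>> > class DownloadToDisk {
>> >     public static void main(String[] args) throws IOException {
>> >         URL source = new URL("http://example.org/archive.zip");
>> >         try (InputStream in = source.openStream();
>> >              OutputStream out =
>> >                  Files.newOutputStream(Paths.get("/data/tmp/archive.zip"))) {
>> >             byte[] buffer = new byte[8192]; // only 8 KB on the heap
>> >             int n;
>> >             while ((n = in.read(buffer)) != -1) {
>> >                 out.write(buffer, 0, n);
>> >             }
>> >         }
>> >     }
>> > }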
>> >
>> > Do you know if there is another way to do it? Can I unzip the file and
>> > run the loading afterwards? Or can I "manually" load the data into
>> > Cassandra?
>> >
>> > Thanks in advance for your help.
>> > Cheers,
>> > Marie
>>
>>
>
>
-------------- next part --------------
2016-06-01 16:19:41,895 ERROR [GrailsExceptionResolver]  OutOfMemoryError occurred when processing request: [GET] /upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
Stacktrace follows:
org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Executing action [fileDownload] of controller [au.org.ala.collectory.DataController] in plugin [collectory] caused exception: Runtime error executing action
	at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
	at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
	at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Runtime error executing action
	... 6 more
Caused by: java.lang.reflect.InvocationTargetException
	... 6 more
Caused by: java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at au.org.ala.collectory.DataController$_closure20.doCall(DataController.groovy:291)
	... 6 more
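
The root cause visible in the trace is the collectory's fileDownload action
copying the entire zip into a java.io.ByteArrayOutputStream before writing
the response, so the largest file it can serve is bounded by heap rather
than by disk. A minimal servlet-style sketch of the streaming alternative
(illustration only, not the actual collectory code):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.servlet.http.HttpServletResponse;

class FileDownload {
    static void stream(Path file, HttpServletResponse response) throws IOException {
        response.setContentType("application/zip");
        response.setHeader("Content-Disposition",
                "attachment;filename=" + file.getFileName());
        response.setHeader("Content-Length", String.valueOf(Files.size(file)));
        try (OutputStream out = response.getOutputStream()) {
            // Files.copy writes in small buffers, so a 4 GB zip no longer
            // needs 4 GB of heap to be served
            Files.copy(file, out);
        }
    }
}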

