[Ala-portal] Problems loading huge DwC Archive

Marie Elise Lecoq melecoq at gbif.fr
Wed Jun 1 20:06:48 CEST 2016


I just found a fix (not totally perfect, because it causes an error in the SRSF
Filter that I think we can ignore for now).

I changed this line:

https://github.com/AtlasOfLivingAustralia/collectory-plugin/blob/ec60b0e23a26419a3dd27ad9c67bf762bb4e94c1/grails-app/controllers/au/org/ala/collectory/DataController.groovy#L291

to this code:

file.withInputStream {
    response.outputStream << it
}


I used this code to help me understand the system:
https://github.com/pdorobisz/grails-file-server/blob/master/grails-app/controllers/org/grails/plugins/fileserver/FileController.groovy

Basically, it just processes the file in chunks.
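For anyone who wants to see the idea outside of Groovy, here is a minimal Java sketch of the same chunked copy (the class name, helper name and buffer size are just illustrative, not from the collectory code). The point is that only one small buffer is ever in memory, no matter how large the file is:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {

    // Copy the stream in fixed-size chunks so only one small buffer is
    // held in memory, regardless of how large the source file is.
    // This is what Groovy's `response.outputStream << it` does internally.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // 8 KB chunks (size is arbitrary)
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the archive and the servlet response stream.
        byte[] data = new byte[100000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        copy(new ByteArrayInputStream(data), sink);
        System.out.println(sink.size());
    }
}
```

In a servlet/Grails controller the `OutputStream` would be `response.outputStream`, so the download is streamed to the client instead of being materialised in RAM first.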

The loading began 10 minutes ago, so fingers crossed!

Cheers,
Marie




On Wed, Jun 1, 2016 at 10:13 AM, <David.Martin at csiro.au> wrote:

> Thanks Marie.
>
>
> Looks like a bug with the collectory supplying large archives to the
> biocache CMD tool.  I'm surprised because we've loaded large archives for
> the ALA (10 GB zip files with images), so we'll need to investigate. If
> you can log a bug we'll look at it as soon as we can.
>
>
> In the meantime, what you could do as a workaround is put the archive
> somewhere web accessible, grab a URL to this web accessible place and put
> this URL into the config for the data resource.
>
> See the "LocationURL" field in "Connection parameters".
>
>
> This way the archive won't be downloaded from the collectory as such, but
> from another webserver (which hopefully won't have the problem serving a
> large file).
>
>
> Hope this helps,
>
>
> Dave
>
>
> ------------------------------
> *From:* Ala-portal <ala-portal-bounces at lists.gbif.org> on behalf of Marie
> Elise Lecoq <melecoq at gbif.fr>
> *Sent:* 02 June 2016 02:58
> *To:* Tim Robertson
> *Cc:* Ala-portal at lists.gbif.org
> *Subject:* Re: [Ala-portal] Problems loading huge DwC Archive
>
> Hi all,
>
> In case it helps, attached is the error that I got from the
> collectory.
>
> Cheers,
> Marie
>
> On Wed, Jun 1, 2016 at 9:43 AM, Marie Elise Lecoq <melecoq at gbif.fr> wrote:
>
>> Hi Tim, all,
>>
>> Unfortunately, the error occurs before the loading begins. You
>> can find below the log trace from when I executed the "biocache load dr179"
>> command. The error comes from the upload; I got the same error when I tried
>> to access this URL:
>> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>> .
>>
>> The VM that hosts the collectory modules used to have 4 GB of RAM. I
>> asked if it was possible to increase it (for a limited time) and now have
>> 32 GB. Even with this increase in memory, we still get the error.
>>
>> Thanks for your answer,
>> Cheers,
>> Marie
>>
>> ----------------------------
>> root at vm-6:/home/ubuntu# biocache load dr179
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file:
>> /data/biocache/config/biocache-config.properties
>> 2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration
>> from /data/biocache/config/biocache-config.properties
>> 2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO
>> 2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR
>> 2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name
>> matching indexes
>> 2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from
>> /data/lucene/namematching
>> 2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence
>> manager
>> 2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete
>> 2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with pool name: biocache-store-pool
>> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with hosts: 192.168.0.19
>> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with port: 9160
>> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with max connections: -1
>> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with max retries: 6
>> 2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] -
>> Initialising cassandra connection pool with operation timeout: 80000
>> 2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store
>> 2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of
>> blacklisted media URLs
>> 2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179
>> 2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading
>> 2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from
>> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>> 2016-06-01 15:57:09,235 INFO : [DataLoader] -  Content-Disposition:
>> attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>> 2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP
>> response code: 500 for URL:
>> http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
>>
>> On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <trobertson at gbif.org>
>> wrote:
>>
>>> Hi Marie
>>>
>>>
>>> Do you see any output at all?  E.g. Are there log lines like:
>>>   1000 >> last key : 123, UUID: … records per sec: 4567
>>>   2000 >> last key : 423, UUID: … records per sec: 4568
>>>
>>> As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
>>> which is just an iterator and should not use significant memory for that.
>>>
>>> I notice it appears to be batching [2].  I am not familiar with that code
>>> but I'd expect to find config for a batch size, and perhaps if it is not
>>> set it might default to unlimited?
>>>
>>> The DwC-A reader also handles pre-deflated zips (i.e. a directory, not a
>>> zip file).  You could try unzipping it using standard tools and then
>>> rather than using a URL, reference a local directory so that it calls the
>>> loadLocal [3] instead of downloading, unzipping and then loading which is
>>> what a URL will invoke.
>>>
>>> I'm sorry I have not loaded a large file myself, but I know it should be
>>> possible.
>>>
>>> I hope this helps provide some ideas at least.
>>>
>>> Many thanks,
>>> Tim
>>>
>>>
>>>
>>> [1]
>>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
>>> [2]
>>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
>>> [3]
>>> https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130
>>>
>>> On 01/06/16 05:05, "Ala-portal on behalf of melecoq"
>>> <ala-portal-bounces at lists.gbif.org on behalf of melecoq at gbif.fr> wrote:
>>>
>>> > Dear all,
>>> > I'm still stuck with my dataset of more than 20 million occurrences.
>>> >
>>> > I understood that the issue was due to the large size of the zip file:
>>> > it's too big for the ZipFile Java API.
>>> > I did a little trick and was able to create the data resource. I
>>> > integrated the DwC Archive with occurrence and verbatim files containing
>>> > just 15 occurrences, then I replaced those files with the real ones and
>>> > it seems to work.
>>> >
>>> > Now, the problem is that when I try to load the zip file into Cassandra
>>> > using the load function from the biocache, I get a Java out-of-memory
>>> > heap error, because the code uses RAM to download, unzip and read the
>>> > file. Unfortunately, 4 GB (zipped) and 23 GB (unzipped) is too big for
>>> > it.
>>> >
>>> > Do you know if there is another way to do it? Can I unzip the file and
>>> > run the loading after that? Or can I "manually" integrate the data into
>>> > Cassandra?
>>> >
>>> > Thanks in advance for your help.
>>> > Cheers,
>>> > Marie
>>> >_______________________________________________
>>> >Ala-portal mailing list
>>> >Ala-portal at lists.gbif.org
>>> >http://lists.gbif.org/mailman/listinfo/ala-portal
>>>
>>>