Thanks Marie.
Looks like a bug with the collectory supplying large archives to the biocache CMD tool. I'm surprised, because we've loaded large archives for the ALA (10 GB zip files with images), so we'll need to investigate. If you can log a bug, we'll look at it as soon as we can.
In the meantime, as a workaround you could put the archive somewhere web-accessible, grab the URL of that location, and put that URL into the config for the data resource.
See the "LocationURL" field in "Connection parameters".
This way the archive won't be downloaded from the collectory as such, but from another web server (which hopefully won't have the problem serving a large file).
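If it helps to sanity-check the web-accessible copy before pointing the collectory at it, here is a minimal Python sketch (not part of biocache; the URL and destination path are placeholders) that streams the archive to disk in 1 MB chunks, so even a multi-GB file never has to fit in RAM:

```python
import os
import shutil
import urllib.request

CHUNK = 1024 * 1024  # copy 1 MB at a time so memory use stays constant

def fetch(url: str, dest: str) -> int:
    """Stream the file at `url` to `dest`; return the number of bytes written."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, length=CHUNK)
    return os.path.getsize(dest)

# Placeholder URL -- substitute wherever you actually host the archive:
# fetch("http://example.org/archives/dr179.zip", "/tmp/dr179.zip")
```

If this completes without an HTTP 500, the web server is serving the full file and the LocationURL route should work.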
Hope this helps,
Dave
Hi Tim, all,
Unfortunately, the error occurs before loading begins. You can find below the log trace from executing the "biocache load dr179" command. The error comes from the upload; I got the same error when I tried to access this URL: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip.
The VM that hosts the collectory modules used to have 4 GB of RAM. I asked whether it was possible to increase it (for a limited time) and now have 32 GB. Even with this increase in memory, we still get the error.
Thanks for your answer,
Cheers,
Marie
----------------------------
root@vm-6:/home/ubuntu# biocache load dr179
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO
2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching indexes
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from /data/lucene/namematching
2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence manager
2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete
2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with pool name: biocache-store-pool
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with hosts: 192.168.0.19
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with port: 9160
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max connections: -1
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max retries: 6
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with operation timeout: 80000
2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store
2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of blacklisted media URLs
2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179
2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading
2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
2016-06-01 15:57:09,235 INFO : [DataLoader] - Content-Disposition: attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP response code: 500 for URL: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
--
On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <trobertson@gbif.org> wrote:
Hi Marie
Do you see any output at all? E.g. Are there log lines like:
1000 >> last key : 123, UUID: … records per sec: 4567
2000 >> last key : 423, UUID: … records per sec: 4568
As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
which is just an iterator and should not use significant memory for that.
I notice it appears to be batching [2]. I am not familiar with that code
but I'd expect to find config for a batch size, and perhaps if it is not
set it might default to unlimited?
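To illustrate the concern (this is not the biocache code, just a sketch of the idea): consuming an iterator in fixed-size batches keeps peak memory bounded by the batch size, whereas an unlimited batch degenerates into buffering the whole dataset in RAM:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(records: Iterable, batch_size: int) -> Iterator[List]:
    """Yield lists of at most `batch_size` items from `records`.

    Memory use is O(batch_size), not O(total records): only one batch
    is materialised at a time, however long the input iterator is.
    """
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With 20 million records and a batch size of, say, 1000, at most 1000 records are held in memory at once; with no limit, all 20 million end up in the heap.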
The DwC-A reader also handles pre-deflated zips (i.e. A directory, not a
zip file). You could try unzipping it using standard tools and then
rather than using a URL, reference a local directory so that it calls the
loadLocal [3] instead of downloading, unzipping and then loading which is
what a URL will invoke.
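As a sketch of that pre-deflation step (paths are placeholders; Python's zipfile module handles zip64 archives over 4 GB), you could extract once on disk and then reference the resulting directory:

```python
import zipfile

def extract_archive(zip_path: str, target_dir: str) -> list:
    """Extract every entry of `zip_path` into `target_dir`.

    Entries are streamed to disk one at a time, so the unzipped 23 GB
    never has to sit in memory. Returns the list of entry names.
    """
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()

# Placeholder paths -- point these at the real archive and a scratch dir:
# extract_archive("/tmp/dr179.zip", "/data/biocache-load/dr179")
```

The extracted directory can then be referenced in the data resource config so the local-directory code path [3] is taken.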
I'm sorry I have not loaded a large file myself, but I know it should be
possible.
I hope this helps provide some ideas at least.
Many thanks,
Tim
[1] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
[2] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
[3] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130
On 01/06/16 05:05, "Ala-portal on behalf of melecoq"
<ala-portal-bounces@lists.gbif.org on behalf of melecoq@gbif.fr> wrote:
> Dear all,
> I'm still stuck with my dataset of more than 20 million occurrences.
>
> I understood that the issue was due to the large size of the zip file:
> it's too big for the Java ZipFile API.
> I did a little trick and was able to create the data resource: I
> integrated a DwC archive whose occurrence and verbatim files contained
> just 15 occurrences, then replaced those files with the real ones, and
> it seems to work.
>
> Now the problem is that when I try to load the zip file into Cassandra
> using the biocache load function, I get a Java out-of-memory heap
> error, because the code uses RAM to download, unzip and read the file.
> Unfortunately, 4 GB (zip file) and 23 GB (unzipped) is too big for it.
>
> Do you know if there is another way to do it? Can I unzip the file and
> run the loading afterwards? Or can I "manually" load the data into
> Cassandra?
>
> Thanks in advance for your help.
> Cheers,
> Marie
>_______________________________________________
>Ala-portal mailing list
>Ala-portal@lists.gbif.org
>http://lists.gbif.org/mailman/listinfo/ala-portal