Hi Tim, all,

Unfortunately, the error occurs before loading even begins. Below is the log output from running the "biocache load dr179" command. The error comes from the upload: I get the same error when I try to access this URL directly: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip.
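
For anyone who wants to reproduce that check outside a browser, here is a minimal Python sketch. It is not part of biocache, and http_status is a made-up helper name; it only reports the status code the server sends back for a GET request.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for a GET on url, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still on the exception.
        return err.code
```

Calling it with the archive URL above should report the same 500 that the loader logs, for as long as the server-side problem persists.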

The VM that hosts the collectory modules used to have 4 GB of RAM. I asked whether it could be increased (for a limited time), and it now has 32 GB. Even with this extra memory, we still get the error.

Thanks for your answer,

Cheers,

Marie

----------------------------

root@vm-6:/home/ubuntu# biocache load dr179

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties

2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties

2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO

2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR

2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching indexes

2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from /data/lucene/namematching

2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence manager

2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete

2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with pool name: biocache-store-pool

2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with hosts: 192.168.0.19

2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with port: 9160

2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max connections: -1

2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max retries: 6

2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with operation timeout: 80000

2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store

2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of blacklisted media URLs

2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179

2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading

2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip

2016-06-01 15:57:09,235 INFO : [DataLoader] - Content-Disposition: attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip

2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP response code: 500 for URL: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip

On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <trobertson@gbif.org> wrote:

Hi Marie

Do you see any output at all? E.g. are there log lines like:
1000 >> last key : 123, UUID: ... records per sec: 4567
2000 >> last key : 423, UUID: ... records per sec: 4568

As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
which is just an iterator and should not use significant memory for that.

I notice it appears to be batching [2]. I am not familiar with that code
but I'd expect to find config for a batch size, and perhaps if it is not
set it might default to unlimited?
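
To illustrate why a bounded batch size matters for memory, here is a minimal Python sketch. It is not the biocache-store code (which is Scala), and in_batches is a hypothetical helper; it only shows the general pattern of pulling records from an iterator and handing them on in fixed-size groups.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            # Hand over the full batch and start a fresh one, so at most
            # batch_size records are held here at any time.
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

With a bounded batch_size, memory use depends on the batch and not on the total number of records; if the limit were effectively unlimited, all 20 million records would accumulate before anything was written.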

The DwC-A reader also handles already-unzipped archives (i.e. a directory, not a
zip file). You could try unzipping it using standard tools and then,
rather than using a URL, reference a local directory so that it calls
loadLocal [3] instead of downloading, unzipping and then loading, which is
what a URL will invoke.
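
If a command-line unzip tool is not convenient, the extraction step could be sketched in Python as below. This is only an illustration under my own assumptions: extract_archive is a hypothetical helper, not something biocache provides. It copies each archive member to disk in fixed-size chunks, so the whole file never has to sit in memory.

```python
import shutil
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: str, dest_dir: str,
                    chunk_size: int = 1024 * 1024) -> List[str]:
    """Extract every member of zip_path into dest_dir, chunk by chunk."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted: List[str] = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            target = (dest / member.filename).resolve()
            # Refuse entries that would land outside dest_dir (e.g. "../x").
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.filename}")
            if member.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # Stream the member to disk instead of reading it all at once.
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
            extracted.append(str(target))
    return extracted
```

The resulting directory is what you would then point the loader at in place of the URL.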

I'm sorry I have not loaded a large file myself, but I know it should be
possible.

I hope this helps provide some ideas at least.

Many thanks,
Tim

[1] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
[2] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
[3] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130

On 01/06/16 05:05, "Ala-portal on behalf of melecoq" <ala-portal-bounces@lists.gbif.org on behalf of melecoq@gbif.fr> wrote:

> Dear all,
> I'm still stuck with my dataset of more than 20 million occurrences.
>
> I understood that the issue was due to the large size of the zip file:
> it's too big for the Java ZipFile API.
> I used a little trick and was able to create the data resource: I
> integrated the DwC Archive with occurrence and verbatim files containing
> just 15 occurrences, then replaced those files with the real ones, and it
> seems to work.
>
> Now the problem is that when I try to load the zip file into Cassandra
> using the biocache load function, I get a Java out-of-memory (heap)
> error, because the code uses RAM to download, unzip and read the file.
> Unfortunately, 4 GB (zipped) and 23 GB (unzipped) is too big for it.
>
> Do you know if there is another way to do it? Can I unzip the file and
> run the load afterwards? Or can I "manually" integrate the data into
> Cassandra?
>
> Thanks in advance for your help.
> Cheers,
> Marie
>_______________________________________________
>Ala-portal mailing list
>Ala-portal@lists.gbif.org
>http://lists.gbif.org/mailman/listinfo/ala-portal