[Ala-portal] Problems loading huge DwC Archive

Wed Jun 1 19:13:56 CEST 2016

Thanks Marie.

Looks like a bug with the collectory supplying large archives to the biocache CMD tool.  I'm surprised because we've loaded large archives for the ALA (10 GB zip files with images), so we'll need to investigate. If you can log a bug we'll look at it as soon as we can.

In the meantime, as a workaround, you could put the archive somewhere web accessible and put the URL of that location into the config for the data resource.

See the "LocationURL" field in "Connection parameters".

This way the archive won't be downloaded from the collectory as such, but from another webserver (which hopefully won't have the problem serving a large file).
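
Before pointing the data resource at a new location, it may be worth checking that the other webserver really can serve the whole file. The following is a minimal Python sketch of such a check, not part of the biocache or collectory tooling; the function name `probe_download` and the chunk size are assumptions made for illustration. It streams the response in fixed-size chunks and discards them, so the check itself does not need much memory.

```python
import urllib.error
import urllib.request


def probe_download(url, chunk_size=1024 * 1024):
    """Stream the resource at `url` and return (status_code, bytes_read).

    The body is read in fixed-size chunks and thrown away, so memory use
    stays flat however large the archive is.
    """
    try:
        with urllib.request.urlopen(url) as response:
            total = 0
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
            return response.status, total
    except urllib.error.HTTPError as err:
        # Covers error responses such as an HTTP 500 from the server.
        return err.code, 0
```

A result of 200 together with a byte count equal to the size of the zip file would suggest the server can deliver the archive in full; an error status would point at the webserver rather than at the loader.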

Hope this helps,

Dave

________________________________
From: Ala-portal <ala-portal-bounces at lists.gbif.org> on behalf of Marie Elise Lecoq <melecoq at gbif.fr>
Sent: 02 June 2016 02:58
To: Tim Robertson
Cc: Ala-portal at lists.gbif.org
Subject: Re: [Ala-portal] Problems loading huge DwC Archive

Hi all,

In case it helps, attached is the error that I got in the collectory.

Cheers,
Marie

On Wed, Jun 1, 2016 at 9:43 AM, Marie Elise Lecoq <melecoq at gbif.fr> wrote:
Hi Tim, all,

Unfortunately, the error occurs before loading begins. Below is the log output from when I ran the "biocache load dr179" command. The error comes from the upload: I got the same error when I accessed the URL http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip directly.

The VM that hosts the collectory modules used to have 4 GB of RAM. I asked whether it could be increased (for a limited time) and it now has 32 GB. Even with this extra memory, we still get the error.

Thanks for your answer,
Cheers,
Marie

----------------------------
root at vm-6:/home/ubuntu# biocache load dr179
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties
2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO
2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching indexes
2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from /data/lucene/namematching
2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence manager
2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete
2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with pool name: biocache-store-pool
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with hosts: 192.168.0.19
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with port: 9160
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max connections: -1
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max retries: 6
2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with operation timeout: 80000
2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store
2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of blacklisted media URLs
2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179
2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading
2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
2016-06-01 15:57:09,235 INFO : [DataLoader] -  Content-Disposition: attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip
2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP response code: 500 for URL: http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip

On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <trobertson at gbif.org> wrote:
Hi Marie

Do you see any output at all?  E.g. Are there log lines like:
  1000 >> last key : 123, UUID: S records per sec: 4567
  2000 >> last key : 423, UUID: S records per sec: 4568

As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
which is just an iterator and should not use significant memory for that.

I notice it appears to be batching [2].  I am not familiar with that code
but I'd expect to find config for a batch size, and perhaps if it is not
set it might default to unlimited?
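
To illustrate why a bounded batch size matters, here is a minimal Python sketch of batching over an iterator. It is not the biocache-store code (which is Scala and which I have not reproduced here); the name `batched_records` and the default of 1000 are assumptions for the example. With a fixed batch size, only one batch is held in memory at a time, whereas an unlimited batch would accumulate every record.

```python
from itertools import islice


def batched_records(records, batch_size=1000):
    """Yield lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        # islice pulls only the next `batch_size` items from the iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, iterating `batched_records(range(2500), 1000)` produces three batches of 1000, 1000 and 500 items.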

The DwC-A reader also handles already-extracted archives (i.e. a directory, not a
zip file).  You could try unzipping it using standard tools and then
rather than using a URL, reference a local directory so that it calls the
loadLocal [3] instead of downloading, unzipping and then loading which is
what a URL will invoke.
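
If standard unzip tools are not at hand, the extraction step can be sketched in Python as below. This is only an illustration of unzipping ahead of the load, under the assumption that the archive is a regular (possibly ZIP64) zip file; the function name `extract_archive` and the chunk size are made up for the example. Each member is copied to disk in chunks, so the unzipped data never has to fit in memory.

```python
import shutil
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir, chunk_size=1024 * 1024):
    """Extract every member of `zip_path` into `target_dir`, chunk by chunk."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            destination = (target / member.filename).resolve()
            # Refuse entries that would land outside the target directory.
            if target != destination and target not in destination.parents:
                raise ValueError(f"unsafe path in archive: {member.filename}")
            if member.is_dir():
                destination.mkdir(parents=True, exist_ok=True)
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Copy in fixed-size chunks instead of reading the member whole.
            with archive.open(member) as source, open(destination, "wb") as sink:
                shutil.copyfileobj(source, sink, chunk_size)
            extracted.append(destination)
    return extracted
```

The resulting directory is what would then be referenced in place of the URL.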

I'm sorry I have not loaded a large file myself, but I know it should be
possible.

I hope this helps provide some ideas at least.

Many thanks,
Tim

[1] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
[2] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
[3] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130

On 01/06/16 05:05, "Ala-portal on behalf of melecoq"
<ala-portal-bounces at lists.gbif.org on behalf of melecoq at gbif.fr> wrote:

> Dear all,
> I'm still stuck with my dataset of more than 20 million occurrences.
>
> I understood that the issue was due to the large size of the zip file:
> it is too big for the Java ZipFile API.
> I used a little trick and was able to create the data resource. I
> first uploaded a DwC Archive whose occurrence and verbatim files held just
> 15 occurrences, then I swapped those files for the real ones, and it
> seems to work.
>
> Now the problem is that when I try to load the zip file into Cassandra
> using the load function of the biocache, I get a Java heap out-of-memory
> error, because the code uses RAM to download, unzip and read the file.
> Unfortunately, 4 GB (zipped) and 23 GB (unzipped) is too big for it.
>
> Do you know if there is another way to do it? Can I unzip the file and
> run the load afterwards? Or can I "manually" load the data into
> Cassandra?
>
> Thanks in advance for your help.
> Cheers,
> Marie
>_______________________________________________
>Ala-portal mailing list
>Ala-portal at lists.gbif.org
>http://lists.gbif.org/mailman/listinfo/ala-portal
