[Ala-portal] Problems loading huge DwC Archive

Tim Robertson trobertson at gbif.org
Wed Jun 1 09:21:42 CEST 2016

Hi Marie

Do you see any output at all?  E.g. Are there log lines like:
  1000 >> last key : 123, UUID: ... records per sec: 4567
  2000 >> last key : 423, UUID: ... records per sec: 4568

As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader
which is just an iterator and should not use significant memory for that.
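
To illustrate why an iterator-style reader should stay within a small,
constant amount of memory, here is a minimal Python sketch. It is not the
GBIF DwC-A Reader and not the biocache code; the function names, the
tab-delimited layout and the sample rows are assumptions made purely for
illustration. Only the current record is held in memory, and a progress
line is printed every N records.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one record at a time from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def count_records(handle: TextIO, report_every: int = 1000) -> int:
    """Consume the iterator without ever collecting the records in a list."""
    count = 0
    for count, _record in enumerate(iter_records(handle), start=1):
        if count % report_every == 0:
            print(f"{count} records read")
    return count


# Tiny in-memory stand-in for a real occurrence file (hypothetical data).
sample = io.StringIO(
    "occurrenceID\tscientificName\n"
    "1\tPuma concolor\n"
    "2\tLynx lynx\n"
)
total = count_records(sample)
```

If a loader built this way still runs out of heap, the memory is probably
being consumed somewhere other than the read loop itself.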

I notice it appears to be batching [2].  I am not familiar with that code
but I'd expect to find config for a batch size, and perhaps if it is not
set it might default to unlimited?
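
As a rough sketch of what a bounded batch looks like (this is not the
biocache batching code, which I have not read in detail; the helper name
and the batch_size parameter are hypothetical), the following groups any
iterator into fixed-size lists so that at most batch_size records are
held at once:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        # An unset or zero size must not silently mean "everything at once".
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven records in batches of three: the last batch is simply shorter.
batches = list(batched(range(7), batch_size=3))
```

With an explicit size, peak memory depends on the batch and not on the
size of the archive, which is the behaviour worth checking for in the
real configuration.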

The DwC-A reader also handles already-unzipped archives (i.e. a directory,
not a zip file).  You could try unzipping it using standard tools and then,
rather than using a URL, reference a local directory so that it calls
loadLocal [3] instead of downloading, unzipping and then loading, which is
what a URL will invoke.
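
Any standard unzip tool will do for this step. If you prefer to script
it, here is a minimal Python sketch that extracts an archive to a
directory while copying each entry in fixed-size chunks, so memory use
does not grow with the file size. The function name, the paths and the
tiny demo archive are assumptions for illustration; the resulting
directory is what you would then point the loader at.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: str, target_dir: str,
                    chunk_size: int = 1024 * 1024) -> List[str]:
    """Extract every entry of zip_path into target_dir, chunk by chunk."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            destination = (target / info.filename).resolve()
            # Refuse entries that would be written outside the target directory.
            if destination != target and target not in destination.parents:
                raise ValueError(f"unsafe path in archive: {info.filename}")
            if info.is_dir():
                destination.mkdir(parents=True, exist_ok=True)
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Stream the entry to disk instead of reading it whole into memory.
            with archive.open(info) as source, open(destination, "wb") as sink:
                shutil.copyfileobj(source, sink, chunk_size)
            extracted.append(info.filename)
    return extracted


# Demo with a tiny throwaway archive (hypothetical contents).
with tempfile.TemporaryDirectory() as work:
    archive_path = Path(work) / "dwca.zip"
    with zipfile.ZipFile(archive_path, "w") as demo:
        demo.writestr("meta.xml", "<archive/>")
        demo.writestr("occurrence.txt", "occurrenceID\n1\n")
    names = extract_archive(str(archive_path), str(Path(work) / "dwca"))
    extracted_ok = all((Path(work) / "dwca" / n).is_file() for n in names)
```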

I'm sorry I have not loaded a large file myself, but I know it should be

I hope this helps provide some ideas at least.

Many thanks,


On 01/06/16 05:05, "Ala-portal on behalf of melecoq"
<ala-portal-bounces at lists.gbif.org on behalf of melecoq at gbif.fr> wrote:

> Dear all,
> I'm still stuck with my dataset of more than 20 million occurrences.
> I understood that the issue was due to the large size of the zip file.
> It's too big for the ZipFile Java API.
> I used a little trick and was able to create the data resource: I
> integrated a DwC Archive whose occurrence and verbatim files held just
> 15 occurrences, then replaced those files with the real ones, and it
> seems to work.
> Now the problem is that when I try to load the zip file into Cassandra
> using the load function from the biocache, I get a Java out-of-memory
> (heap) error, because the code uses RAM to download, unzip and read the
> file. Unfortunately, 4 GB (zip file) and 23 GB (unzipped) is too big for
> it. Do you know if there is another way to do it? Can I unzip the file
> and run the loading afterwards? Or can I "manually" integrate the data
> into Cassandra?
> Thanks in advance for your help.
> Cheers,
> Marie