Hi Marie
Do you see any output at all? E.g. are there log lines like:

1000 >> last key : 123, UUID: … records per sec: 4567
2000 >> last key : 423, UUID: … records per sec: 4568
As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader, which is just an iterator and should not use significant memory for that.
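To illustrate, here is a minimal sketch of reading an archive record by record with the GBIF reader. The exact package name varies between dwca-reader versions, so treat the imports as an approximation:

import java.io.File
import org.gbif.dwc.text.ArchiveFactory  // package name differs across dwca-reader versions
import scala.collection.JavaConverters._

object DwcaIterateCheck {
  def main(args: Array[String]): Unit = {
    // Open an archive from an already-extracted directory
    val archive = ArchiveFactory.openArchive(new File(args(0)))
    var count = 0
    // StarRecords are streamed one at a time; nothing is accumulated in memory
    for (star <- archive.iterator().asScala) {
      count += 1
      if (count % 1000 == 0) println(count + " >> core id: " + star.core().id())
    }
    println("Total records: " + count)
  }
}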
I notice it appears to be batching [2]. I am not familiar with that code, but I'd expect to find a config setting for the batch size, and perhaps if it is not set it defaults to unlimited?
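If there is such a setting, bounding it is what keeps memory flat: only one batch of records is resident at a time. As a sketch of the idea only — the real config key and default live in [2], which I haven't dug into — using Scala's grouped over the same iterator:

import java.io.File
import org.gbif.dwc.text.ArchiveFactory
import scala.collection.JavaConverters._

object BatchedLoadSketch {
  def main(args: Array[String]): Unit = {
    val archive = ArchiveFactory.openArchive(new File(args(0)))
    // A bounded batch size means at most batchSize StarRecords are held at once;
    // an unset/"unlimited" batch would effectively buffer the whole archive.
    val batchSize = 1000
    archive.iterator().asScala.grouped(batchSize).foreach { batch =>
      // stand-in for whatever the loader does per batch (e.g. a Cassandra write)
      println("flushing batch of " + batch.size + " records")
    }
  }
}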
The DwC-A reader also handles pre-deflated zips (i.e. a directory, not a zip file). You could try unzipping it with standard tools and then, rather than using a URL, referencing a local directory so that it calls loadLocal [3] instead of downloading, unzipping and then loading, which is what a URL will invoke.
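And if the standard unzip tool struggles too, a streaming extraction sidesteps the ZipFile API limit you mentioned, because ZipInputStream reads one entry at a time from a stream instead of opening the whole file up front. A rough, self-contained sketch (paths are placeholders, and it does not validate entry names against path traversal):

import java.io.{BufferedInputStream, BufferedOutputStream, File, FileInputStream, FileOutputStream}
import java.util.zip.ZipInputStream

object StreamingUnzip {
  def main(args: Array[String]): Unit = {
    val Array(zipPath, outDir) = args
    val zin = new ZipInputStream(new BufferedInputStream(new FileInputStream(zipPath)))
    val buf = new Array[Byte](64 * 1024)
    var entry = zin.getNextEntry
    while (entry != null) {
      val target = new File(outDir, entry.getName)
      if (entry.isDirectory) target.mkdirs()
      else {
        Option(target.getParentFile).foreach(_.mkdirs())
        // Copy the entry to disk in small chunks; memory use stays constant
        val out = new BufferedOutputStream(new FileOutputStream(target))
        var n = zin.read(buf)
        while (n >= 0) { out.write(buf, 0, n); n = zin.read(buf) }
        out.close()
      }
      zin.closeEntry()
      entry = zin.getNextEntry
    }
    zin.close()
  }
}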
I'm sorry I have not loaded a large file myself, but I know it should be possible.
I hope this helps provide some ideas at least.
Many thanks, Tim
[1] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159
[2] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325
[3] https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130
On 01/06/16 05:05, "Ala-portal on behalf of melecoq" <ala-portal-bounces@lists.gbif.org on behalf of melecoq@gbif.fr> wrote:
Dear all, I'm still stuck with my dataset of more than 20 million occurrences.
I understood that the issue was due to the large size of the zip file: it's too big for the ZipFile Java API. I used a little trick and was able to create the data resource. I registered the DwC archive with occurrence and verbatim files containing just 15 occurrences, then replaced those files with the real ones, and it seems to work.
Now, the problem is that when I try to load the zip file into Cassandra using the biocache load function, I get a Java out-of-memory heap error, because the code uses RAM to download, unzip and read the file. Unfortunately, at 4 GB zipped and 23 GB unzipped, the file is too big for that.
Do you know if there is another way to do it? Can I unzip the file and run the loading after that? Or can I "manually" integrate the data into Cassandra?
Thanks in advance for your help. Cheers, Marie