<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none"><!-- p { margin-top: 0px; margin-bottom: 0px; }--></style>

</head>

<body dir="ltr" style="font-size:12pt;color:#000000;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">

<p>Thanks Marie.<br>

</p>

<p><br>

</p>

<p>This looks like a bug in how the collectory serves large archives to the biocache command-line tool. I'm surprised, because we've loaded large archives for the ALA (10 GB zip files with images), so we'll need to investigate. If you can log a bug, we'll look at it as soon as we can.</p>

<p><br>

</p>

<p>In the meantime, as a workaround, you could put the archive somewhere web accessible and enter the URL of that location in the config for the data resource.<br>

</p>

<p>See the "LocationURL" field in "Connection parameters".<br>

</p>

<p><br>

</p>

<p>This way the archive won't be downloaded from the collectory itself, but from another web server (which hopefully won't have a problem serving a large file).<br>

</p>
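<p>As a rough sketch (this is not part of biocache or the collectory), the Python below shows one way to check that the other web server can deliver the archive in full. It copies the response to disk in fixed-size chunks, so memory use stays flat whatever the archive size. The function name <code>stream_to_disk</code> and any URL or path passed to it are hypothetical.<br>
</p>
<pre>
```python
import shutil
import urllib.request


def stream_to_disk(url, dest_path, chunk_size=1024 * 1024):
    """Copy the response body to dest_path in fixed-size chunks.

    Only one chunk is held in memory at a time, so the size of the
    archive does not matter. Returns the number of bytes written.
    """
    # urlopen raises urllib.error.HTTPError for an error status such as 500.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```
</pre>
<p>If the call returns a byte count that matches the size of the archive, the URL should be safe to put in the data resource config.<br>
</p>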

<p><br>

</p>

<p>Hope this helps,<br>

</p>

<p><br>

</p>

<p>Dave</p>

<p><br>

</p>

<div style="color: rgb(33, 33, 33);">

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Ala-portal &lt;ala-portal-bounces@lists.gbif.org&gt; on behalf of Marie Elise Lecoq &lt;melecoq@gbif.fr&gt;<br>

<b>Sent:</b> 02 June 2016 02:58<br>

<b>To:</b> Tim Robertson<br>

<b>Cc:</b> Ala-portal@lists.gbif.org<br>

<b>Subject:</b> Re: [Ala-portal] Problems loading huge DwC Archive</font>

<div> </div>

</div>

<div>

<div dir="ltr">Hi all,

<div><br>

</div>

<div>In case it helps, I have attached the error that I got in the collectory.</div>

<div><br>

</div>

<div>Cheers,</div>

<div>Marie</div>

</div>

<div class="gmail_extra"><br>

<div class="gmail_quote">On Wed, Jun 1, 2016 at 9:43 AM, Marie Elise Lecoq <span dir="ltr">

&lt;<a href="mailto:melecoq@gbif.fr" target="_blank">melecoq@gbif.fr</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex; border-left:1px #ccc solid; padding-left:1ex">

<div dir="ltr">

<div>

<div>Hi Tim, all,</div>

<div><br>

</div>

<div>Unfortunately, the error occurs before loading begins. Below is the log output from running the "biocache load dr179" command. The error comes from the upload: I got the same error when I tried to access this URL:

<a href="http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip" target="_blank">

http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip</a>. </div>

<div><br>

</div>

<div>The VM that hosts the collectory modules used to have 4 GB of RAM. I asked whether it could be increased (for a limited time), and it now has 32 GB. Even with this extra memory, we still get the error. </div>

<div><br>

</div>

<div>Thanks for your answer,</div>

<div>Cheers,</div>

<div>Marie</div>

</div>

<div><br>

</div>

<div>----------------------------</div>

<div>

<div>root@vm-6:/home/ubuntu# biocache load dr179</div>

<div>SLF4J: Class path contains multiple SLF4J bindings.</div>

<div>SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]</div>

<div>SLF4J: Found binding in [jar:file:/usr/lib/biocache/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]</div>

<div>SLF4J: See <a href="http://www.slf4j.org/codes.html#multiple_bindings" target="_blank">

http://www.slf4j.org/codes.html#multiple_bindings</a> for an explanation.</div>

<div>SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]</div>

<div>2016-06-01 15:57:04,029 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties</div>

<div>2016-06-01 15:57:04,036 DEBUG: [ConfigModule] - Loading configuration from /data/biocache/config/biocache-config.properties</div>

<div>2016-06-01 15:57:04,203 DEBUG: [ConfigModule] - Initialising DAO</div>

<div>2016-06-01 15:57:04,210 DEBUG: [ConfigModule] - Initialising SOLR</div>

<div>2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Initialising name matching indexes</div>

<div>2016-06-01 15:57:04,212 DEBUG: [ConfigModule] - Loading name index from /data/lucene/namematching</div>

<div>2016-06-01 15:57:04,796 DEBUG: [ConfigModule] - Initialising persistence manager</div>

<div>2016-06-01 15:57:04,798 DEBUG: [ConfigModule] - Configure complete</div>

<div>2016-06-01 15:57:05,064 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with pool name: biocache-store-pool</div>

<div>2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with hosts: 192.168.0.19</div>

<div>2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with port: 9160</div>

<div>2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max connections: -1</div>

<div>2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with max retries: 6</div>

<div>2016-06-01 15:57:05,065 DEBUG: [CassandraPersistenceManager] - Initialising cassandra connection pool with operation timeout: 80000</div>

<div>2016-06-01 15:57:05,361 DEBUG: [Config] - Using local media store</div>

<div>2016-06-01 15:57:05,419 INFO : [Config] - Using the default set of blacklisted media URLs</div>

<div>2016-06-01 15:57:05,763 INFO : [Loader] - Starting to load resource: dr179</div>

<div>2016-06-01 15:57:06,748 INFO : [DataLoader] - Darwin core archive loading</div>

<div>2016-06-01 15:57:06,952 INFO : [DataLoader] - Downloading zip file from <a href="http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip" target="_blank">

http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip</a></div>

<div>2016-06-01 15:57:09,235 INFO : [DataLoader] -  Content-Disposition: attachment;filename=75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip</div>

<div>2016-06-01 15:57:09,236 ERROR: [DataLoader] - Server returned HTTP response code: 500 for URL:

<a href="http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip" target="_blank">

http://metadonnee.gbif.fr/upload/1464722424642/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d.zip</a></div>

</div>

</div>

<div class="gmail_extra">

<div>

<div class="h5"><br>

<div class="gmail_quote">On Wed, Jun 1, 2016 at 12:21 AM, Tim Robertson <span dir="ltr">

&lt;<a href="mailto:trobertson@gbif.org" target="_blank">trobertson@gbif.org</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex; border-left:1px #ccc solid; padding-left:1ex">

Hi Marie<br>

<br>

<br>

Do you see any output at all? For example, are there log lines like:<br>

  1000 &gt;&gt; last key : 123, UUID: ... records per sec: 4567<br>

  2000 &gt;&gt; last key : 423, UUID: ... records per sec: 4568<br>

<br>

As far as I can tell, the code [1] is using the usual GBIF DwC-A Reader<br>

which is just an iterator and should not use significant memory for that.<br>

<br>

I notice it appears to be batching [2].  I am not familiar with that code<br>

but I'd expect to find config for a batch size, and perhaps if it is not<br>

set it might default to unlimited?<br>

<br>

The DwC-A reader also handles already-unzipped archives (i.e. a directory, not a<br>

zip file).  You could try unzipping it using standard tools and then<br>

rather than using a URL, reference a local directory so that it calls the<br>

loadLocal [3] instead of downloading, unzipping and then loading which is<br>

what a URL will invoke.<br>
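<br>

As a minimal sketch of the unzipping step, Python's standard zipfile module can stand in for "standard tools" here. It reads ZIP64 archives and writes each member straight to disk. The function name unzip_archive and the paths passed to it are hypothetical, and how the resulting directory is then referenced in the data resource is not shown.<br>

<pre>
```python
import zipfile
from pathlib import Path


def unzip_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall copies each member to disk rather than loading the
        # whole archive into memory.
        archive.extractall(target)
        return archive.namelist()
```
</pre>

The returned list of names gives a quick check that the expected archive files are present in the directory before a load is attempted.<br>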

<br>

I'm sorry I have not loaded a large file myself, but I know it should be<br>

possible.<br>

<br>

I hope this helps provide some ideas at least.<br>

<br>

Many thanks,<br>

Tim<br>

<br>

<br>

<br>

[1]<br>

<a href="https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159" rel="noreferrer" target="_blank">https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L159</a><br>

[2]<br>

<a href="https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325" rel="noreferrer" target="_blank">https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L325</a><br>

[3]<br>

<a href="https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130" rel="noreferrer" target="_blank">https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/load/DwCALoader.scala#L130</a><br>

<br>

On 01/06/16 05:05, "Ala-portal on behalf of melecoq"<br>

<div>

<div>&lt;<a href="mailto:ala-portal-bounces@lists.gbif.org" target="_blank">ala-portal-bounces@lists.gbif.org</a> on behalf of

<a href="mailto:melecoq@gbif.fr" target="_blank">melecoq@gbif.fr</a>&gt; wrote:<br>

<br>

> Dear all,<br>

> I'm still stuck with my dataset of more than 20 million occurrences.<br>

><br>

> I understood that the issue was due to the large size of the zip file.<br>

> It's too big for the Java ZipFile API.<br>

> I used a little trick and was able to create the data resource: I<br>

> uploaded a DwC Archive whose occurrence and verbatim files held just<br>

> 15 occurrences, then I swapped those files for the real ones, and it<br>

> seems to work.<br>

><br>

> Now the problem is that when I try to load the zip file into Cassandra<br>

> using the load function of the biocache, I get a Java out-of-memory (heap)<br>

> error, because the code uses RAM to download, unzip and read the file.<br>

> Unfortunately, 4 GB (zipped) and 23 GB (unzipped) is too big for it.<br>

><br>

> Do you know if there is another way to do it? Can I unzip the file and<br>

> run the loading afterwards? Or can I "manually" load the data into<br>

> Cassandra?<br>

><br>

> Thanks in advance for your help.<br>

> Cheers,<br>

> Marie<br>

>_______________________________________________<br>

>Ala-portal mailing list<br>

><a href="mailto:Ala-portal@lists.gbif.org" target="_blank">Ala-portal@lists.gbif.org</a><br>

><a href="http://lists.gbif.org/mailman/listinfo/ala-portal" rel="noreferrer" target="_blank">http://lists.gbif.org/mailman/listinfo/ala-portal</a><br>

<br>

</div>

</div>

</blockquote>

</div>

<br>

<br clear="all">

<div><br>

</div>

</div>

</div>

<span class="HOEnZb"><font color="#888888">-- <br>

<div>

<div dir="ltr"><img src="https://mail.google.com/mail/u/0/?ui=2&ik=f2990c326c&view=fimg&th=143720bd12d267c4&attid=0.1&disp=inline&safe=1&attbid=ANGjdJ-dPPvdDXYTQLEz3sMkSC8MXmmlvNMhRcnZ-5COf76BRKDzNBzALARmrD-ZLTteeCriuqRYcwDCZxnWA6ZjOt8rVgydWnc6h2aRU_hfLDYFdDAPI0uUiC8Do9o&ats=1389188740078&rm=143720bd12d267c4&zw&sz=w1325-h522"><br>

</div>

</div>

</font></span></div>

</blockquote>

</div>

<br>

<br clear="all">

<div><br>

</div>


</div>

</div>

</div>

</body>

</html>