Hi all!
I have a few questions about indexing:
1. It seems that some occurrences are wrongly indexed. For example, if I search for "Pica Pica", the first three results are not relevant (http://recherche.gbif.fr/occurrences/search?taxa=Pica+pica). Do I need to change something in the nameindexer? I don't have a BIE instance on our system; do I need to install one to fix this?
2. We have some provider codes with punctuation (e.g. comma, dot). It seems that the link between collection, institution and data resource is not made because of this. It works with accents.
3. I am trying to index a data resource with more than 20 million occurrences and I get a NullPointerException; it seems that the guid is not found. I can upload data resources with much less data, so I guess the problem comes from the data resource itself (its size?). Do you have a special way to deal with huge data resources?
Thanks in advance for your help :-)! Cheers, Marie
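For question 2, it may help to list which provider codes contain punctuation, whitespace or accented characters before comparing them with the codes that fail to link. A minimal sketch in Python, assuming the codes are available as plain strings; the function name and the sample codes are hypothetical and not part of any ALA tool:

```python
import string


def classify_code(code):
    """Return the character classes in a provider code that might affect matching."""
    flags = set()
    for ch in code:
        if ch.isspace():
            flags.add("whitespace")
        elif ch in string.punctuation:
            flags.add("punctuation")
        elif ord(ch) > 127:
            flags.add("non-ascii")
    return flags


# Hypothetical sample codes, for illustration only.
for sample in ["MNHN", "Univ. Lyon", "Musée", "A,B"]:
    print(repr(sample), sorted(classify_code(sample)))
```

Grouping the failing codes by these flags should show whether punctuation alone explains the missing links.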
--
Thanks Marie. Just quick answers (I'm currently on leave).
1. BIE isn't required, but there should be an index on the biocache service machine in the usual place (/data/lucene/namematching). This will then be used for taxon resolution.
2. I'm surprised this causes an issue. Whitespace in those codes can be an issue.
3. Can you supply more detail? An NPE would suggest a bug or bad config. The way we index large datasets is to use the offline method of indexing, using the "bulk-processor" option in the command line tool.
Dave
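A quick way to confirm that a name-matching index is actually in place on the biocache service machine is to check that the directory exists and is not empty. A minimal sketch in Python; the function name is hypothetical, and the check only looks for files, it does not validate the index contents:

```python
from pathlib import Path


def index_looks_present(index_dir):
    """Return True if index_dir exists and contains at least one file."""
    root = Path(index_dir)
    if not root.is_dir():
        return False
    # Look in subdirectories too, in case the index is split across several.
    return any(p.is_file() for p in root.rglob("*"))
```

Call it with the name-matching directory mentioned in answer 1. A result of False would suggest the index is missing or was never built.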
From: Ala-portal ala-portal-bounces@lists.gbif.org on behalf of Marie Elise Lecoq melecoq@gbif.fr
Sent: 25 May 2016 03:36
To: ala-portal@lists.gbif.org
Subject: [Ala-portal] [Indexation] Questions
--
Thanks a lot Dave, especially as you are currently on leave :-)!
1. This index should be the Catalogue of Life, if I have understood correctly. Maybe I should create a new name index (using the nameindexer tool) with the backbone taxonomy list from GBIF.
2. It works with other codes that contain whitespace. The only difference that I can see between those codes is the punctuation.
3. Sorry that my first explanation was not really helpful! :-) Actually, I was wrong, it's not an NPE. The error takes place before the indexing itself: it happens when I try to create the data resource (using the GBIF tool, or by creating a data resource directly and then uploading a ZIP file). The DwC archive is downloaded and, directly after, I get the error (see the error trace below). I think that the error comes from this function (https://github.com/AtlasOfLivingAustralia/collectory-plugin/blob/master/grai...), so I guess it happens when the ZIP file is unzipped.
-----------------------------------------------------
2016-05-23 16:56:08,179 INFO [DataResourceController] Downloading file: http://api.gbif.org/v1/occurrence/download/request/0007506-160118175350007.z...
2016-05-23 16:56:37,965 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2016-05-23 16:56:37,976 DEBUG [org.jasig.cas.services.DefaultServicesManagerImpl] - <Adding registered service ^(https?|imaps?)://.*>
2016-05-23 16:56:37,976 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - <Loaded 1 services.>
2016-05-23 16:57:57,911 INFO [GbifService] dr172 null null
2016-05-23 16:57:58,155 ERROR [DataResourceController] JSONObject["guid"] not found.
org.codehaus.groovy.grails.web.json.JSONException: JSONObject["guid"] not found.
    at au.org.ala.collectory.GbifService.createOrUpdateGBIFResource(GbifService.groovy:324)
    at au.org.ala.collectory.GbifService.createGBIFResourceFromArchiveURL(GbifService.groovy:294)
    at au.org.ala.collectory.ProviderGroupController$_closure23.doCall(ProviderGroupController.groovy:557)
    at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
    at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
    at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
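The exception says that a JSON object had no "guid" key at the point where it was read; the thread does not establish why. As an illustration of the general pattern only (this is Python, not the collectory's Groovy code, and the function name and payloads are hypothetical), checking for a required key before using it lets the error report what was actually returned:

```python
import json


def require_key(payload, key):
    """Parse a JSON string and return the value at key, or raise a descriptive KeyError."""
    obj = json.loads(payload)
    if not isinstance(obj, dict) or key not in obj:
        # Report the keys that are present, or the type if it is not an object.
        found = sorted(obj) if isinstance(obj, dict) else type(obj).__name__
        raise KeyError(f"expected key {key!r}; got {found}")
    return obj[key]
```

For example, `require_key('{"guid": "abc"}', "guid")` returns `"abc"`, while a payload without that key raises an error listing the keys it did contain.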
Cheers, Marie
On Wed, May 25, 2016 at 12:26 PM, David.Martin@csiro.au wrote:
--
participants (2)
- David.Martin@csiro.au
- Marie Elise Lecoq