[API-users] wrong faceted API results/counts (at least in our datasets) and lowercased facet values (everywhere)

Herbario SANT sant.herbarium at gmail.com
Fri Feb 10 22:24:36 CET 2017


Hello.

I am trying out faceted results from the occurrence API, but the
returned values look very odd to me.

Perhaps I am misunderstanding how faceting should work? Or there might be
some problem with the indexing of these particular datasets.
I am pretty lost.  This is what I found:


*(1) RESULTS COUNT NOT MATCHING SUM OF ALL FACET COUNTS*

Here is a simple example in which everything is returned in one page.

http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=10&collectionCode=SANT-Lich&genusKey=2581943

The count value is 4, the number of results is 4.
But the number of facets is 1, and its count is 2.

The faceted term ScientificName is a mandatory field, so no null values
should occur. I would expect every occurrence to have a value for it.
The number of distinct values is also small, so everything is returned in
one request (no paging needed).
So, in such a case, shouldn't the sum of facet counts equal the number
of results?
Why is the count of the faceted name not 4?
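
The check described above can be sketched in Python. This is only a sketch under assumptions: it supposes the response is JSON with a top-level `count` and a `facets` list whose entries carry a `counts` list of `name`/`count` pairs. The helper names (`fetch`, `facet_total`, `facet_gap`) are made up for illustration, and the sample dictionary is a hand-written stand-in for the response, not real API output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://api.gbif.org/v1/occurrence/search"


def fetch(params):
    """Fetch one occurrence search page and return the parsed JSON (network call)."""
    with urlopen(API + "?" + urlencode(params)) as resp:
        return json.load(resp)


def facet_total(response):
    """Sum the counts of every value in the first facet of the response."""
    facets = response.get("facets", [])
    if not facets:
        return 0
    return sum(entry["count"] for entry in facets[0].get("counts", []))


def facet_gap(response):
    """Total record count minus the summed facet counts (0 means they agree)."""
    return response["count"] - facet_total(response)


# Hand-written stand-in shaped like the case above (assumed layout):
# 4 records in total, one facet value with count 2.
sample = {
    "count": 4,
    "results": [{}, {}, {}, {}],
    "facets": [
        {
            "field": "SCIENTIFIC_NAME",
            "counts": [
                {"name": "generic_name specific_name (basionym_authors) name_authors",
                 "count": 2}
            ],
        }
    ],
}
print(facet_gap(sample))  # 2
```

Against the live API one would call `facet_gap(fetch({...}))` with the same parameters as in the URL above. A non-zero gap for a mandatory, fully paged field is the symptom reported here.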



*(2) LOWERCASE FACETS (facet values not matching result values):*
Look at the same API request as above (plant names).

results:
Scientificname: "Generic_name specific_name (Basionym_Authors) Name_Authors"

facets:
name: "generic_name specific_name (basionym_authors) name_authors"

Why are the facet names always in lowercase?
I would say that this is an error which shouldn't happen.

But I reported it some days ago and got no answer, so I wonder whether this
is the intended API behaviour.

http://dev.gbif.org/issues/browse/PF-2758

Not only scientific names are lowercased. This also happens to
collectionCode in the next question.
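
Until this is clarified, a client-side workaround could look like the Python sketch below. It matches each lowercased facet value back to an original-case value taken from the results. The function name `restore_case` is hypothetical, and it assumes that lowercasing is the only transformation applied to facet values.

```python
def restore_case(facet_names, result_names):
    """Map each facet name to the original-case value seen in the results.

    Facet names with no case-insensitive match map to None.
    """
    by_lower = {name.lower(): name for name in result_names}
    return {facet: by_lower.get(facet.lower()) for facet in facet_names}


# Placeholder values taken from the example above.
results = ["Generic_name specific_name (Basionym_Authors) Name_Authors"]
facets = ["generic_name specific_name (basionym_authors) name_authors"]
print(restore_case(facets, results))
```

This only recovers casing for values that also appear in the returned results, so it does not help with `limit=0` requests.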


*(3) FACETING COLLECTIONCODE VALUES of a single institution fails depending
on the filtering parameter used to match the institution (code or uuid):*

Our institution (uuid= def87a70-0837-11d9-acb2-b8a03c50a862 ,
institutionCode=SANT) serves datasets from 4 collections, which should add
up to more than 100000 records.

Why do I get only 2 of our 4 datasets faceted in the following request,
which uses our publishingOrg uuid? (The uuid should be the preferred option
for this, as the code might not be unique to our institution.)

http://api.gbif.org/v1/occurrence/search?facet=collectionCode&limit=0&publishingOrg=def87a70-0837-11d9-acb2-b8a03c50a862

Why do I get 4 of 4 if I filter the request using institutionCode instead?
(Fortunately, nobody else uses the same institutionCode yet, so the numbers
are correct.)

http://api.gbif.org/v1/occurrence/search?facet=collectionCode&limit=0&institutionCode=SANT

And why do counts differ for the same facet value (sant-lich) in those two
requests?
(9960 in the 1st request, 10007 in the 2nd one)

Why are facet values lowercase again? ("sant-lich" instead of "SANT-Lich")
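
The comparison between the two requests can be sketched in Python as follows. The helper names (`facet_url`, `facet_counts`, `diff_counts`) are hypothetical, and the response layout (a `facets` list with `counts` of `name`/`count` pairs) is an assumption. Only the sant-lich counts reported above are filled in; the other facet values are left out rather than guessed.

```python
from urllib.parse import urlencode

API = "http://api.gbif.org/v1/occurrence/search"


def facet_url(filter_name, filter_value, facet="collectionCode"):
    """Build a facet-only request (limit=0) restricted by one filter parameter."""
    return API + "?" + urlencode({"facet": facet, "limit": 0, filter_name: filter_value})


def facet_counts(response):
    """Turn the first facet of a parsed response into {value: count}."""
    facets = response.get("facets", [])
    if not facets:
        return {}
    return {entry["name"]: entry["count"] for entry in facets[0]["counts"]}


def diff_counts(a, b):
    """Return {value: (count_in_a, count_in_b)} where they differ.

    None marks a value that is missing from one side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


by_org_url = facet_url("publishingOrg", "def87a70-0837-11d9-acb2-b8a03c50a862")
by_code_url = facet_url("institutionCode", "SANT")

# Counts reported above for the one value present in both requests.
by_org = {"sant-lich": 9960}
by_code = {"sant-lich": 10007}
print(diff_counts(by_org, by_code))  # {'sant-lich': (9960, 10007)}
```

With live data, `facet_counts` would be applied to the parsed JSON of each URL before calling `diff_counts`. A collection missing from one request then shows up with `None` on that side.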


*(4) FACETING SCIENTIFICNAME FAILS FOR SOME DATASETS, but works as expected
for others:*

More than 1000 faceted scientific names are returned for our SANT-Lich and
SANT-Algae collections. Both results look correct:

http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&collectionCode=SANT-Lich&ScientificName.facetLimit=50000&ScientificName.facetOffset=0

http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&collectionCode=SANT-Algae&ScientificName.facetLimit=50000&ScientificName.facetOffset=0

But no facets are returned for SANT-Bryo (which contains several hundred
distinct scientific name values):

http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&collectionCode=SANT-Bryo&ScientificName.facetLimit=50000&ScientificName.facetOffset=0

And only 7 facets are returned for SANT scientific names (there should be
over 10 thousand, as this is by far our largest dataset):

http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&collectionCode=SANT&ScientificName.facetLimit=50000&ScientificName.facetOffset=0
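
To rule out the very large facetLimit as a cause, the same facets could be requested in smaller pages. The Python sketch below builds the request URLs with the `ScientificName.facetLimit` and `ScientificName.facetOffset` parameters used above and walks the pages until a short one is returned. The names `facet_page_url` and `collect_facets`, and the page size of 1000, are assumptions for illustration. Whether small pages behave the same as one large request is untested here.

```python
from urllib.parse import urlencode

API = "http://api.gbif.org/v1/occurrence/search"


def facet_page_url(collection_code, offset, page_size=1000, facet="ScientificName"):
    """Build the URL for one page of facet values of a collection."""
    params = {
        "facet": facet,
        "limit": 0,  # no occurrence records, facets only
        "collectionCode": collection_code,
        facet + ".facetLimit": page_size,
        facet + ".facetOffset": offset,
    }
    return API + "?" + urlencode(params)


def collect_facets(fetch_page, page_size=1000):
    """Gather facet entries page by page.

    fetch_page(offset) must return the list of facet entries starting at
    that offset. Paging stops at the first page shorter than page_size.
    """
    entries, offset = [], 0
    while True:
        page = fetch_page(offset)
        entries.extend(page)
        if len(page) < page_size:
            return entries
        offset += page_size


print(facet_page_url("SANT-Bryo", 0))
```

Here `fetch_page` is left to the caller, for example a function that downloads `facet_page_url(code, offset)` and returns the `counts` list of the first facet. The number of entries collected can then be compared with the number of distinct names expected for each collection.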



Other than the lowercase facets issue (2), I couldn't reproduce issues
1, 3 and 4 in other institutions' datasets.
So I wonder whether all this is somehow related to faulty indexing of our IPT.

Has anyone else detected these problems?

Thanks a lot in advance for your help

David

-- 
David García San León
Herbario SANT
Universidade de Santiago de Compostela