Hello.
I am trying to work with faceted results from the occurrence API, but the returned values look very odd to me.
Perhaps I am misunderstanding how faceting should work? Or there might be some problem with the indexing of these particular datasets. I am pretty lost. This is what I found:
*(1) RESULTS COUNT NOT MATCHING SUM OF ALL FACET COUNTS*
Here is a simple example in which everything is returned on one page.
http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=10&a...
The count value is 4, the number of results is 4. But the number of facets is 1, and its count is 2.
The faceted term ScientificName is a mandatory field, so no null values should occur. I would expect every occurrence to have a value for it. And the number of distinct values is small, so everything is returned in one request (no paging needed). So, in such a case, shouldn't the sum of the facet counts be equal to the number of results? Why is the count of the faceted name not 4?
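The check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the parsed JSON response has a top-level "count" and a "facets" list whose entries hold a "counts" list of name/count pairs, and the sample data is hypothetical, built from the numbers reported above rather than copied from a real response.

```python
def facet_count_gap(response, facet_index=0):
    """Return the total result count minus the sum of one facet's counts.

    A gap of 0 means every matching record is represented in the facet.
    The response layout used here is an assumption about the API output.
    """
    facet = response["facets"][facet_index]
    facet_total = sum(entry["count"] for entry in facet["counts"])
    return response["count"] - facet_total


# Hypothetical response mirroring the numbers reported above:
# 4 results in total, a single facet value with a count of 2.
sample = {
    "count": 4,
    "facets": [
        {
            "field": "SCIENTIFIC_NAME",
            "counts": [
                {
                    "name": "generic_name specific_name (basionym_authors) name_authors",
                    "count": 2,
                }
            ],
        }
    ],
}

gap = facet_count_gap(sample)
print(gap)  # number of records not accounted for by the facet
```

For a mandatory field where all facet values fit in one response, I would expect this gap to be 0.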
*(2) LOWERCASE FACETS (facet values not matching result values):* Look at the same API request above (plant names)
results: Scientificname: "Generic_name specific_name (Basionym_Authors) Name_Authors"
facets: name: "generic_name specific_name (basionym_authors) name_authors"
Why are the facet names always in lowercase? I would say that is an error which shouldn't happen.
But I reported it some days ago and got no answer, so I wonder if this is the intended api behaviour.
http://dev.gbif.org/issues/browse/PF-2758
Not only scientific names are lowercased. This also happens to collectionCode in the next question.
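Until this is clarified, a client-side workaround is to match the lowercased facet names back to the original values found in the results. The sketch below is an assumption-laden illustration, not part of the API: restore_case is a hypothetical helper, "scientificName" is assumed to be the key used in each result record, and the sample values are the placeholder names from the example above.

```python
def restore_case(facet_counts, results, key="scientificName"):
    """Replace lowercased facet names with the original spelling from results.

    Facet names with no matching result record are left unchanged.
    """
    originals = {}
    for record in results:
        value = record.get(key)
        if value is not None:
            # Keep the first spelling seen for each case-folded value.
            originals.setdefault(value.casefold(), value)
    return [
        {
            "name": originals.get(entry["name"].casefold(), entry["name"]),
            "count": entry["count"],
        }
        for entry in facet_counts
    ]


# Placeholder data taken from the example above (not a real response).
results = [
    {"scientificName": "Generic_name specific_name (Basionym_Authors) Name_Authors"}
]
facet_counts = [
    {"name": "generic_name specific_name (basionym_authors) name_authors", "count": 2}
]

restored = restore_case(facet_counts, results)
print(restored[0]["name"])
```

This only works when the original value appears in the returned page of results, so it is no substitute for the facets preserving case themselves.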
*(3) FACETING COLLECTIONCODE VALUES of a single institution fails depending on filtering parameter used to match the institution (code or uuid):*
Our institution (uuid= def87a70-0837-11d9-acb2-b8a03c50a862 , institutionCode=SANT) serves datasets from 4 collections, which together should add up to more than 100000 records.
Why do I get only 2 of our 4 datasets faceted in the following request, which uses our publishingOrg uuid? (uuid should be the preferred option to do this, as code might not be unique for our institution)
http://api.gbif.org/v1/occurrence/search?facet=collectionCode&limit=0&am...
Why do I get 4 of 4 if I filter the request using institutionCode instead? (fortunately, nobody else uses the same institutionCode yet, so numbers are correct)
http://api.gbif.org/v1/occurrence/search?facet=collectionCode&limit=0&am...
And why do counts differ for the same facet value (sant-lich) in those two requests? (9960 in the 1st request, 10007 in the 2nd one)
Why are facet values lowercase again? ("sant-lich" instead of "SANT-Lich")
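To make the differences between the two requests explicit, the facet lists can be compared value by value. This is a minimal sketch under assumptions: compare_facets is a hypothetical helper working on lists of name/count pairs, the "sant-lich" counts are the ones reported above, and "placeholder-code" is an invented value standing in for a collection that appears in only one of the two responses.

```python
def compare_facets(first, second):
    """Return {name: (count_in_first, count_in_second)} for differing values.

    A count of None means the facet value is missing from that response.
    """
    a = {entry["name"]: entry["count"] for entry in first}
    b = {entry["name"]: entry["count"] for entry in second}
    differences = {}
    for name in sorted(a.keys() | b.keys()):
        if a.get(name) != b.get(name):
            differences[name] = (a.get(name), b.get(name))
    return differences


# Facets when filtering by publishing organisation uuid (1st request).
by_uuid = [{"name": "sant-lich", "count": 9960}]

# Facets when filtering by institutionCode (2nd request).
# "placeholder-code" is invented for illustration only.
by_code = [
    {"name": "sant-lich", "count": 10007},
    {"name": "placeholder-code", "count": 1},
]

differences = compare_facets(by_uuid, by_code)
for name, (count_uuid, count_code) in differences.items():
    print(name, count_uuid, count_code)
```

If both filters selected the same set of records, the result would be an empty dictionary.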
*(4) FACETING SCIENTIFICNAME FAILS FOR SOME DATASETS, but works as expected for others:*
More than 1000 faceted Scientificnames are returned for our SANT-Lich and SANT-Algae collections. Both results look correct:
http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&am...
http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&am...
But no facets are returned for SANT-Bryo (which contains several hundred distinct scientificname values):
http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&am...
And only 7 facets for SANT scientificnames (should be over 10 thousand, as this is by far our largest dataset):
http://api.gbif.org/v1/occurrence/search?facet=ScientificName&limit=0&am...
Other than the lowercase facets issue (2), I couldn't reproduce issues 1, 3 and 4 in other institutions' datasets. So I wonder if all this is somehow related to incorrect indexing of our IPT.
Has anyone else detected these problems?
Thanks a lot in advance for your help
David