[API-users] funny characters in common names / vernaculars for species

Guido Sautter sautter at ipd.uka.de
Tue Nov 24 06:11:57 CET 2015

Hi Jorrit,
> Ok. Sounds like we are on the same page. What do you think would be the most 
> effective way to document this content issue?
Collecting a bunch of links to API responses that include mangled characters 
looks like a good option to me.
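
In case it helps with the collecting, here is a rough sketch of how one might 
flag suspicious names automatically. It assumes the mangling is UTF-8 text 
that was decoded as Windows-1252, and that each record in a response carries a 
"vernacularName" field. The function names and the sample records are made up 
for illustration; only the URL comes from this thread.

```python
def repair_mojibake(text):
    """Return the repaired string if `text` looks like UTF-8 that was
    decoded as Windows-1252, otherwise None.

    Assumption: cp1252 is the encoding that was wrongly applied.
    """
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the text does not fit this mangling pattern.
        return None
    # Pure ASCII round-trips unchanged and is not mangled.
    return repaired if repaired != text else None


def flag_mangled(url, records):
    """Collect (url, mangled, repaired) for every suspicious name.

    `records` is a list of dicts; the "vernacularName" key is an
    assumption about the shape of the parsed API response.
    """
    hits = []
    for record in records:
        name = record.get("vernacularName", "")
        repaired = repair_mojibake(name)
        if repaired is not None:
            hits.append((url, name, repaired))
    return hits


# Hypothetical sample records, not real API output.
sample = [
    {"vernacularName": "WintergrÃ¼n"},
    {"vernacularName": "Wintergrün"},
    {"vernacularName": "wintergreen"},
]
url = "http://api.gbif.org/v1/species/2882753/vernacularNames"
for hit in flag_mangled(url, sample):
    print(hit)
```

This only catches one mangling pattern, and it can miss names whose mangled 
form contains bytes that Windows-1252 leaves undefined, so a manual look at 
the flagged links is still needed.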

Also, you might want to follow the links to the datasets and their providers, 
all the way back to the dataset source pages (about three links deep), and see 
whether the mangled characters also show up on the pages of the original data 
providers.

If the latter is the case, it's likely the providers' responsibility to fix the 
data. If not, there might be an issue along the transfer routes between the 
original providers and GBIF.
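
For reference, the kind of mangling discussed in this thread can be reproduced 
in a few lines. This is only a sketch of the mechanism, assuming that "ANSI" 
means Windows-1252 (cp1252); the function name is made up for illustration, and 
it says nothing about where along the route the misreading actually happened.

```python
def misread_utf8_as_cp1252(text):
    """Encode `text` as UTF-8, then decode the bytes as Windows-1252.

    Assumption: "ANSI" in this thread refers to cp1252.
    """
    return text.encode("utf-8").decode("cp1252")


# "ü" (U+00FC) becomes the two UTF-8 bytes 0xC3 0xBC, and cp1252 shows
# each of those bytes as a separate character.
print(misread_utf8_as_cp1252("Wintergrün"))   # WintergrÃ¼n

# Characters up to 127 (0x7F) are a single identical byte in both
# encodings, so plain ASCII text passes through unchanged.
print(misread_utf8_as_cp1252("wintergreen"))  # wintergreen
```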

Just a thought,

>> On Nov 23, 2015, at 3:35 PM, Guido Sautter <sautter at ipd.uka.de> wrote:
>> Hi Jorrit,
>>> Thanks for your reply.
>> welcome as can be.
>>> Thanks for confirming that there’s a character conversion issue happening 
>>> somewhere.
>>> Since the mangled characters appear in both html and json provided by GBIF, 
>>> I’d say it is probably a gbif issue.
>> Well, what we can say at this point is that GBIF _has_ mangled characters ... 
>> which doesn't mean the mangling necessarily happened at their facilities.
>>> Is there a way to find out whether the invalid character handling occurs in 
>>> a data provider or within GBIF itself?
>> Sorry to say, no. That's why I stated that characters got mangled "at some 
>> point". All we can say is that it happened upstream from GBIF's API.
>> Best,
>> Guido
>>>> On Nov 23, 2015, at 3:14 PM, Guido Sautter <sautter at ipd.uka.de> wrote:
>>>> That usually happens when, at some point, UTF-8 encoded text is read as 
>>>> ANSI (a single-byte encoding such as Windows-1252). It only happens if 
>>>> the text contains characters above 127 (0x7F), however.
>>>> Hope that helps,
>>>> Guido
>>>>> Hey y’all:
>>>>> I am noticing some funny characters (e.g. “WintergrÃ¼n”) for species 
>>>>> available here:
>>>>> http://www.gbif.org/species/2882753/vernaculars
>>>>> Same is observed using the api:
>>>>> http://api.gbif.org/v1/species/2882753/vernacularNames
>>>>> I am assuming that the actual common name should be something like 
>>>>> “Wintergrün”.
>>>>> While I was looking into this, I also noticed that no characterset is 
>>>>> specified in http response headers.
>>>>> Please confirm that this is expected behavior.
>>>>> thx,
>>>>> -jorrit
>>>>> _______________________________________________
>>>>> API-users mailing list
>>>>> API-users at lists.gbif.org
>>>>> http://lists.gbif.org/mailman/listinfo/api-users