funny characters in common names / vernaculars for species
Hey y’all:
I am noticing some funny characters (e.g. "WintergrÃ¼n") for species available here:
http://www.gbif.org/species/2882753/vernaculars
The same is observed using the API:
http://api.gbif.org/v1/species/2882753/vernacularNames
I am assuming that the actual common name should be something like “Wintergrün”.
While I was looking into this, I also noticed that no character set is specified in the HTTP response headers.
Please confirm that this is expected behavior.
thx, -jorrit
That usually happens when, at some point, UTF-8 encoded text is read as ANSI. It only happens if the text contains characters above 127 (0x7F), however.
Hope that helps, Guido
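To illustrate the mechanism described above, here is a minimal Python sketch. It assumes that "ANSI" means Windows-1252 (`cp1252`), which is only one plausible candidate; the function name `misread_utf8` and the sample strings are made up for this example and do not come from GBIF.

```python
def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with a single-byte encoding.

    This reproduces the kind of mangling discussed in the thread.
    The choice of cp1252 is an assumption, not a confirmed diagnosis.
    """
    return text.encode("utf-8").decode(wrong_encoding)


correct = "Wintergr\u00fcn"        # the name with a proper u-umlaut
mangled = misread_utf8(correct)    # the umlaut turns into two characters

print(mangled)
# Pure ASCII text (all code points <= 127) passes through unchanged.
print(misread_utf8("wintergreen"))
```

The u-umlaut occupies two bytes in UTF-8 (0xC3 0xBC), and each byte is shown as a separate character when read as a single-byte encoding, which is why only non-ASCII characters are affected.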
API-users mailing list API-users@lists.gbif.org http://lists.gbif.org/mailman/listinfo/api-users
Hi Guido:
Thanks for your reply.
Thanks for confirming that there’s a character conversion issue happening somewhere.
Since the mangled characters appear in both the HTML and the JSON provided by GBIF, I’d say it is probably a GBIF issue.
Is there a way to find out whether the invalid character handling occurs in a data provider or within GBIF itself?
thx, -jorrit
In reply to the message from Guido Sautter sautter@ipd.uka.de of Nov 23, 2015, 3:14 PM.
Hi Jorrit,
> Thanks for your reply.
Welcome as can be.
> Thanks for confirming that there’s a character conversion issue happening somewhere.
> Since the mangled characters appear in both the HTML and the JSON provided by GBIF, I’d say it is probably a GBIF issue.
Well, what we can say at this point is that GBIF _has_ mangled characters ... which doesn't mean the mangling necessarily happened at their facilities.
> Is there a way to find out whether the invalid character handling occurs in a data provider or within GBIF itself?
Sorry to say, no. That's why I stated that characters got mangled "at some point". All we can say is that it happened upstream from GBIF's API.
Best, Guido
Ok. Sounds like we are on the same page. What do you think would be the most effective way to document this content issue?
thx, -jorrit
In reply to the message from Guido Sautter sautter@ipd.uka.de of Nov 23, 2015, 3:35 PM.
Hi Jorrit,
> Ok. Sounds like we are on the same page. What do you think would be the most effective way to document this content issue?
Collecting a bunch of links to API responses that include mangled characters looks like a good option to me.
Also, you might want to follow the links to the datasets and their providers, all the way back to the dataset source pages (some three links to follow or so), and see if the mangled characters show up on the pages of the original data providers as well.
If they do, it's likely the providers' responsibility to fix the data. If not, there might be an issue along the transfer routes between the original providers and GBIF.
Just a thought, Guido
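One way to collect candidates for such a list is to flag names that can be "un-mangled". The sketch below is a heuristic only: it assumes the mangling was UTF-8 read as Windows-1252, and the helper `repair_mojibake` and the sample records (including the `vernacularName` and `language` keys) are hypothetical stand-ins rather than actual API output.

```python
from typing import Optional


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> Optional[str]:
    """Return a repaired string if text looks like UTF-8 misread as
    wrong_encoding, otherwise None. Heuristic; false positives are possible."""
    try:
        candidate = text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the round trip fails, so this is not that kind of mangling
    return candidate if candidate != text else None


# Hypothetical records, loosely shaped like a vernacular-name listing.
records = [
    {"vernacularName": "Wintergr\u00c3\u00bcn", "language": "deu"},  # mangled
    {"vernacularName": "Wintergr\u00fcn", "language": "deu"},        # correct
    {"vernacularName": "wintergreen", "language": "eng"},            # plain ASCII
]

suspects = []
for record in records:
    repaired = repair_mojibake(record["vernacularName"])
    if repaired is not None:
        suspects.append((record["vernacularName"], repaired))

for mangled, repaired in suspects:
    print(f"{mangled!r} -> {repaired!r}")
```

Running the same check against the original provider's data would show whether the suspicious names are already present at the source.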
Hi both,
The JSON specification defaults to UTF-8, that's why you often do not see the encoding specified again in HTTP: https://en.wikipedia.org/wiki/JSON#Data_portability_issues
Still, as you have spotted, this is not correct UTF-8, as you can see on our portal page: http://api.gbif.org/v1/species/2882753/vernacularNames
The species is from our backbone, which claims it is taken from the GRIN dataset, where you can see the same problem: http://www.gbif.org/species/101354008 http://www.gbif.org/species/101354008/vernaculars
It is also in the verbatim data as it came in: http://www.gbif.org/species/101354008/verbatim
I'll try to see how that ended up there. Markus
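For clients that want to handle a missing charset explicitly, a small sketch follows. It reads the `charset` parameter from a `Content-Type` header value and falls back to UTF-8 when none is given. The function name `charset_from_content_type` and the header values are illustrative, and the UTF-8 fallback is an assumption that fits JSON but not necessarily other media types.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when no charset is declared. The UTF-8
    default is an assumption suited to JSON responses.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("application/json"))
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
```

Note that a correctly declared (or defaulted) encoding does not help when the stored text was already mangled before it was served, as appears to be the case here.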
In reply to the message from Guido Sautter sautter@ipd.uka.de of 24 Nov 2015, 06:11.
participants (3)
- Guido Sautter
- jorrit poelen
- Markus Döring