Hi all,
Not sure where is best to ask this... so here goes. Let me know if there's a better place.
The following are examples some users have highlighted for me as leading to confusion when searching for taxa.
1. Macrozamia platyrachis (http://www.gbif.org/species/4928834) vs. Macrozamia platyrhachis (http://www.gbif.org/species/2683551)
Here, both spellings (with and without the h) are accepted, and both are exact matches. The sci. authority seems to differ: F. M. Bailey vs. F.M.Bailey. The first is from the GRIN taxonomy and the second from COL.
Anyway, for users of, e.g., the R client, this is a bit confusing. I had thought the backbone taxonomy would have only one master taxon key and name for each real taxon, but here it seems like there are two?
2. Cycas circinalis (http://www.gbif.org/species/2683264) vs. Cycas circinnalis (http://www.gbif.org/species/3594916)
Here, both spellings (with one or two n's) are accepted, and both are exact matches. The sci. authorities here are exactly the same. The first is from COL and the second from the IPNI taxonomy.
3. Isolona perrieri (http://www.gbif.org/species/3648546) vs. Isolona perrierii (http://www.gbif.org/species/6308376)
Here, both spellings (with one or two i's) are accepted, and both are exact matches. The sci. authorities here are exactly the same. The first is from TPL and the second from COL.
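A rough way to flag pairs like the three above in a list of names is a string-similarity check. The following is only a minimal sketch in Python using the standard library; the function names and the 0.95 threshold are assumptions chosen for illustration, not anything GBIF or the R client provides. A heuristic like this can also flag genuinely distinct names, so hits are best treated as candidates for manual review.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_variants(names, threshold=0.95):
    """Return pairs of distinct names whose similarity is >= threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        if similarity(a, b) >= threshold:
            pairs.append((a, b))
    return pairs


# The six names discussed in this message.
names = [
    "Macrozamia platyrachis",
    "Macrozamia platyrhachis",
    "Cycas circinalis",
    "Cycas circinnalis",
    "Isolona perrieri",
    "Isolona perrierii",
]

for a, b in likely_variants(names):
    print(f"possible spelling variants: {a!r} / {b!r}")
```

With these six names, each spelling is paired only with its own variant; names from different genera fall well below the threshold.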
--------
Should I advise users to limit their searches to COL when searching the backbone taxonomy, to avoid any confusion about names?
Best, Scott Chamberlain
Hi Scott,
I’m sure that someone with more direct knowledge of the GBIF taxonomy backbone will answer more specifically. But in general, essentially all large taxonomic databases have these sorts of duplicate records due to spelling variations, etc. Most such databases began by harvesting lists of (messy) text-string names from various sources, with the early emphasis being on quantity rather than quality. In recent years, the emphasis has shifted towards improving quality, and to greater or lesser degrees, most large databases and aggregators have made tremendous progress in reconciling and correcting these sorts of issues. However, these kinds of lexical variants (i.e., two slightly different spellings being mistakenly represented as separate names) continue to exist, and probably will continue to for quite some time (especially in large taxonomic aggregators, such as GBIF). The Global Names Architecture has current NSF funding (PI: Dima Mozzherin) to develop tools to help reconcile these sorts of lexical variants, and we have another NSF grant pending that will flesh those cleaned/reconciled text-string names out into metadata-rich names and name-usages… so there is some additional hope of accelerated clean-up in the next few years. But until then, I’m afraid these kinds of duplicates will continue to be discovered and addressed on a case-by-case basis.
Not sure if that helps… But if you do restrict to a single source (like CoL), you’re less likely to encounter these kinds of duplicates, and the presumption is that linking to either one will eventually get straightened out.
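Restricting to a single source can be sketched on the client side as a simple filter. This is a minimal Python illustration only: the record layout and the "source" field with short labels such as "COL" are hypothetical, made up for this example, and how the source is identified in real API responses is not covered here. The keys, names and sources are the ones listed in the original question.

```python
# Hypothetical record layout: "source" is an illustrative label,
# not a field name taken from the GBIF API.
records = [
    {"key": 4928834, "name": "Macrozamia platyrachis", "source": "GRIN"},
    {"key": 2683551, "name": "Macrozamia platyrhachis", "source": "COL"},
    {"key": 2683264, "name": "Cycas circinalis", "source": "COL"},
    {"key": 3594916, "name": "Cycas circinnalis", "source": "IPNI"},
    {"key": 3648546, "name": "Isolona perrieri", "source": "TPL"},
    {"key": 6308376, "name": "Isolona perrierii", "source": "COL"},
]


def restrict_to_source(recs, source="COL"):
    """Keep only the records whose (hypothetical) source label matches."""
    return [r for r in recs if r["source"] == source]


for rec in restrict_to_source(records):
    print(rec["key"], rec["name"])
```

The trade-off is that any name present only in the other sources is dropped as well, so this suits users who prefer one consistent list over maximum coverage.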
Aloha,
Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences | Associate Zoologist in Ichthyology | Dive Safety Officer
Department of Natural Sciences, Bishop Museum, 1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
From: API-users [mailto:api-users-bounces@lists.gbif.org] On Behalf Of Scott Chamberlain
Sent: Wednesday, May 11, 2016 11:23 AM
To: api-users@lists.gbif.org
Cc: juli g. pausas
Subject: [API-users] Scientific names questions
Hi Scott,
These are indeed unwanted duplications of species in our backbone. The software that builds the backbone does not yet try to synonymize these spelling variations automatically, as it is quite easy to get that wrong. We will work on this in an improved version of the algorithm; the open issue is here: http://dev.gbif.org/issues/browse/POR-2812
… which is part of the next round of improving the backbone building: http://dev.gbif.org/issues/browse/POR-3029
Until then, please let us know about those duplicate names. It helps us understand the problem better, and as a last resort we could add those names to our patch list as known spelling variations, i.e. synonyms. They then get synonymized in future backbone versions: https://github.com/gbif/backbone-patch
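Until a backbone version includes such patches, a client could keep its own small list of known spelling variations and apply it before looking names up. The sketch below is a local Python analogue of that idea, not the format or mechanism of the backbone-patch repository, which is not shown in this thread. The mapping name, the function name, and the choice of the COL spelling as the target are assumptions for illustration and say nothing about which spelling is nomenclaturally correct.

```python
# Local list of known spelling variations from this thread.
# Mapping each variant to the spelling of the COL record is an
# illustrative assumption, not a nomenclatural judgement.
KNOWN_VARIANTS = {
    "Macrozamia platyrachis": "Macrozamia platyrhachis",
    "Cycas circinnalis": "Cycas circinalis",
    "Isolona perrieri": "Isolona perrierii",
}


def normalize_name(name: str) -> str:
    """Collapse extra whitespace, then map a known variant to its target."""
    cleaned = " ".join(name.split())
    return KNOWN_VARIANTS.get(cleaned, cleaned)


print(normalize_name("Cycas  circinnalis "))
print(normalize_name("Isolona perrierii"))
```

Names that are not in the list pass through unchanged, so the mapping only affects spellings someone has already reviewed.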
Many thanks, Markus
Richard,
Thanks for the response! All makes sense. I'll work on making sure my users are aware of these issues and give them options according to their use case.
---
Markus,
Thanks, I'll follow those issues.
> please let us know about those duplicate names.
Where? In JIRA, or in a GitHub repo's issues?
Best, Scott
Scott, you can use GitHub issues in this repo if you wanna provide names to be added/corrected in the backbone: https://github.com/gbif/backbone-patch/issues
Otherwise feel free to just send them to me by mail in whatever way is comfortable.
Markus
participants (3)
- Markus Döring
- Richard Pyle
- Scott Chamberlain