Hi Scott,
I’m sure that someone with more direct knowledge of the GBIF taxonomy backbone will answer more specifically. But in general, essentially all large taxonomic databases have these sorts of duplicate records due to spelling variations, etc. Most such databases began by harvesting lists of (messy) text-string names from various sources, with the early emphasis being on quantity rather than quality. In recent years, the emphasis has shifted towards improving quality, and to greater or lesser degrees, most large databases and aggregators have made tremendous progress in reconciling and correcting these sorts of issues. However, these kinds of lexical variants (i.e., two slightly different spellings being mistakenly represented as separate names) continue to exist, and probably will continue to exist for quite some time (especially in large taxonomic aggregators, such as GBIF). The Global Names Architecture has current NSF funding (PI: Dima Mozzherin) to develop tools to help reconcile these sorts of lexical variants, and we have another NSF grant pending that will flesh those cleaned/reconciled text-string names out into metadata-rich names and name-usages… so there is some additional hope of accelerated clean-up in the next few years. But until then, I’m afraid these kinds of duplicates will continue to be discovered and addressed on a case-by-case basis.
Not sure if that helps... But if you do restrict to a single source (like CoL), you’re less likely to encounter these kinds of duplicates, and the presumption is that links to either record will eventually get straightened out.
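For anyone who wants to screen a name list for this kind of lexical variant in the meantime, here is a rough sketch in Python. It is not how GBIF or the Global Names tools do it; it simply compares names within the same genus using the standard library's `difflib.SequenceMatcher`. The function name `likely_variants` and the 0.9 threshold are assumptions made for illustration, and the names are the ones from Scott's message below.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Name strings taken from the examples in this thread.
NAMES = [
    "Macrozamia platyrachis",
    "Macrozamia platyrhachis",
    "Cycas circinalis",
    "Cycas circinnalis",
    "Isolona perrieri",
    "Isolona perrierii",
]


def likely_variants(names, threshold=0.9):
    """Return (name_a, name_b, similarity) for near-identical names.

    Hypothetical helper: the threshold is an arbitrary starting point,
    not a validated cut-off, and should be tuned on real data.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # Only compare names that share a genus (the first word).
        if a.split()[0] != b.split()[0]:
            continue
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


for name_a, name_b, score in likely_variants(NAMES):
    print(f"{name_a} <-> {name_b} (similarity {score})")
```

A high similarity only flags a candidate pair. Two names that differ by one letter can still be distinct, valid names, so each hit needs to be checked by hand.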
Aloha,
Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences | Associate Zoologist in Ichthyology | Dive Safety Officer Department of Natural Sciences, Bishop Museum, 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
From: API-users [mailto:api-users-bounces@lists.gbif.org] On Behalf Of Scott Chamberlain Sent: Wednesday, May 11, 2016 11:23 AM To: api-users@lists.gbif.org Cc: juli g. pausas Subject: [API-users] Scientific names questions
Hi all,
Not sure where is best to ask this... so here goes. Let me know if there's a better place.
The following are examples some users have highlighted for me as leading to confusion when searching for taxa.
1. Macrozamia platyrachis (http://www.gbif.org/species/4928834) vs. Macrozamia platyrhachis (http://www.gbif.org/species/2683551)
Here, the two spellings (with/without h) are accepted, and exact matches. The sci. authority seems to differ with F. M. Bailey vs. F.M.Bailey. The first is from GRIN taxonomy and the second from COL.
Anyway, for users e.g., of the R client, this is a bit confusing. I had thought the backbone taxonomy would only have one master taxon key and name for each real taxon, but here it seems like there's two?
2. Cycas circinalis (http://www.gbif.org/species/2683264) vs. Cycas circinnalis (http://www.gbif.org/species/3594916)
Here, the two spellings (with 1 or 2 "n"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from COL and the second from IPNI taxonomy.
3. Isolona perrieri (http://www.gbif.org/species/3648546) vs. Isolona perrierii (http://www.gbif.org/species/6308376)
Here, the two spellings (with 1 or 2 "i"'s) are accepted, and exact matches. The sci. authorities here are exactly the same. The first is from TPL and the second from COL
--------
Should I advise users, when searching the backbone taxonomy, to limit results to COL to avoid any confusion about names?
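To make the question concrete, here is a small offline sketch in Python of what "limit to COL" would mean for the three pairs above. The keys, names, and sources are the ones listed in this message; the record layout (the `key`, `name`, and `source` fields) and the helper `restrict_to_source` are made up for illustration and are not the GBIF API response format or an R client function.

```python
# The six records discussed above, with the source each one came from.
RECORDS = [
    {"key": 4928834, "name": "Macrozamia platyrachis", "source": "GRIN"},
    {"key": 2683551, "name": "Macrozamia platyrhachis", "source": "COL"},
    {"key": 2683264, "name": "Cycas circinalis", "source": "COL"},
    {"key": 3594916, "name": "Cycas circinnalis", "source": "IPNI"},
    {"key": 3648546, "name": "Isolona perrieri", "source": "TPL"},
    {"key": 6308376, "name": "Isolona perrierii", "source": "COL"},
]


def restrict_to_source(records, source="COL"):
    """Keep only the records contributed by a single source (hypothetical helper)."""
    return [rec for rec in records if rec["source"] == source]


col_only = restrict_to_source(RECORDS)
for rec in col_only:
    print(rec["key"], rec["name"])
```

In each of these three cases the filter leaves exactly one record per pair, which is the behavior users would expect. The trade-off is that a user who types the other spelling would then get no exact match.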
Best,
Scott Chamberlain