NamesIndex missing translations?
Dear list,
I'm using the very handy NamesIndex to translate between different taxonomic sources. In the case of CoL and iNaturalist, I found that only roughly half of the entries (from a recent export) have both the CoL ID and the iNaturalist ID set. For entries of rank "species" the situation is particularly bad: only roughly 10% (4353 of 41109) of the entries do any real translation.
1. Why does the nidx export contain rows with either of the IDs empty at all?
2. Why are so many of the species in particular not correctly mapped? Is this due to missing authorship in the iNaturalist checklist coupled with a strict matching algorithm?
Best regards,
--
Raffael Mancini
IT administrator and developer
Service d'information digital sur le patrimoine naturel (SIDPNAT)
Musée National d'Histoire Naturelle Luxembourg
T: +352 247 66667 - https://mnhn.lu
Thanks Raffael,
There is a parameter "min" that defines how many sources must have the same name before the name gets included in the results. By default this is just one, so either iNat or COL. If you only want those where both identifiers exist you would POST to this:
POST https://api.checklistbank.org/nidx/export?datasetKey=3LR&datasetKey=1398...
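As a rough illustration of the request described above, the following sketch only assembles the export URL with a repeated `datasetKey` parameter and the `min` parameter. The two dataset keys are the ones listed in the PS of this message; whether the truncated URL above used exactly these keys is an assumption, and the helper name `nidx_export_url` is hypothetical. The request itself is deliberately not sent, since the thread does not say whether a body or authentication is required.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the message above.
BASE = "https://api.checklistbank.org/nidx/export"


def nidx_export_url(dataset_keys, min_sources=1):
    """Build the export URL (hypothetical helper, not part of any client library).

    dataset_keys: dataset keys to include, one `datasetKey` parameter each.
    min_sources:  value of the `min` parameter, i.e. how many sources must
                  share a name before it is included in the export.
    """
    params = [("datasetKey", str(key)) for key in dataset_keys]
    params.append(("min", str(min_sources)))
    return f"{BASE}?{urlencode(params)}"


# Keys from the PS: 9923 (Catalogue of Life), 139831 (iNaturalist).
url = nidx_export_url([9923, 139831], min_sources=2)
print(url)
```

With `min_sources=1` the same helper would produce the default export that includes names present in only one of the two datasets.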
At first glance I can see lots of species names with two versions: one with authorship (COL) and one without (iNat). These are different entries in the names index, as the "canonical" one without authorship acts as a superset of all such names with any authorship.
For example these first rows in the result:
species Aa achalensis Schltr. 8H9J
species Aa achalensis https://www.inaturalist.org/taxa/602054
species Aa argyrolepis Rchb.f. 8H9K
species Aa argyrolepis https://www.inaturalist.org/taxa/829647
So this is expected behavior. To align these names as well, one would have to use the regular name matching instead.
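To make the effect concrete, here is a small sketch that pairs the sample rows by rank and canonical name, ignoring authorship. It is not the ChecklistBank matching algorithm, only an illustration of why the authored and the canonical entry end up as separate rows and how they could be joined locally. The column layout (rank, name, authorship, COL ID, iNat ID) and the function name `pair_by_canonical` are assumptions.

```python
from collections import defaultdict

# Sample rows from the export above, split into assumed columns:
# (rank, canonical name, authorship, COL ID, iNat ID)
rows = [
    ("species", "Aa achalensis", "Schltr.", "8H9J", None),
    ("species", "Aa achalensis", None, None,
     "https://www.inaturalist.org/taxa/602054"),
    ("species", "Aa argyrolepis", "Rchb.f.", "8H9K", None),
    ("species", "Aa argyrolepis", None, None,
     "https://www.inaturalist.org/taxa/829647"),
]


def pair_by_canonical(rows):
    """Group rows by (rank, name) and keep only unambiguous 1:1 pairs."""
    groups = defaultdict(lambda: {"col": set(), "inat": set()})
    for rank, name, _authorship, col_id, inat_id in rows:
        key = (rank, name)
        if col_id:
            groups[key]["col"].add(col_id)
        if inat_id:
            groups[key]["inat"].add(inat_id)

    pairs = {}
    for key, ids in groups.items():
        # Skip names with several candidate IDs on either side
        # (e.g. homonyms that differ only in authorship).
        if len(ids["col"]) == 1 and len(ids["inat"]) == 1:
            pairs[key] = (next(iter(ids["col"])), next(iter(ids["inat"])))
    return pairs


pairs = pair_by_canonical(rows)
for (rank, name), (col_id, inat_id) in pairs.items():
    print(rank, name, col_id, inat_id)
```

Dropping ambiguous groups is a cautious choice: joining on the bare name alone would silently merge homonyms, which is presumably what the stricter names index avoids.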
To match all of iNat's names to COL one would do this instead:
POST https://api.checklistbank.org/dataset/3LR/match/nameusage/job?sourceDatasetK...
In a similar manner one can also upload CSV files with names to be matched against any of the datasets in ChecklistBank.
Best, Markus
PS: here are the results of the calls above:
NIDX EXPORT, MIN=1: https://download.checklistbank.org/job/34/34a68b4c-3a6a-41fc-a681-654dfcdbf2... [81.5 MB]
NIDX EXPORT, MIN=2: https://download.checklistbank.org/job/0a/0af45acd-f723-4b27-8189-cdacda64b7... [1.9 MB]
NAME MATCH iNat->COL: https://download.checklistbank.org/job/6b/6bbb3e5f-2a99-4d56-a87d-a5d2a3af6b... [36.6 MB]

With the following dataset keys:
9923: Catalogue of Life Checklist, version 2023-08-17
139831: iNaturalist Taxonomy, version 2023-08-01
On 14. Sep 2023, at 14:20, MANCINI Raffael via COL-Users col-users@lists.gbif.org wrote:
COL-Users mailing list COL-Users@lists.gbif.org https://lists.gbif.org/mailman/listinfo/col-users
Dear Markus,
thanks for the quick reply. I still have some questions left:
1. How does the NameIndex match names? Only by strict correspondence of the scientific name and authorship (modulo white-space and special characters)?
2. Do I understand correctly that the min parameter only controls whether a name gets included in the result, not whether it gets "collapsed"/"matched" into a single row? The second NIDX export with min=2 still has a lot of rows with only one of the IDs set.
3. The name "Abelona gigliotosi" shows up in your export with min=2, but it's only present in CoL, not iNat (confirmed by a manual search on CLB). How come? To my understanding it should only show up in the min=1 export.
4. Is either of the matching algorithms (NameIndex and the regular /match/nameusage/job matching) documented somewhere?
In general, if I want some rather liberal matching, should I just go for the matching API instead of the NameIndex facility?
Best regards!
--
Raffael Mancini
IT administrator and developer
Service d'information digital sur le patrimoine naturel (SIDPNAT)
Musée National d'Histoire Naturelle Luxembourg
T: +352 247 66667 - https://mnhn.lu
________________________________
From: Markus Döring mdoering@gbif.org
Sent: 15 September 2023 08:41:03
To: MANCINI Raffael; Catalogue of Life user announcements and discussion
Cc: Tim Robertson
Subject: Re: [COL-Users] NamesIndex missing translations?