[Ala-portal] Names Generator Issues

Allan Koch allan.kv at gmail.com
Wed Apr 2 19:53:07 CEST 2014


Hi Natasha,

I ran DwcaNameIndexer with the default CoL list, without any modification,
and it works correctly.

But it does not work when I run it with a modified file. I merged our taxon
concepts and vernacular names into the CoL taxon concepts list and
vernacular names list.

For now, I have not included the IRMNG_DWC_HOMONYMS.

So, I ran this command:

 java -cp .:names.jar au.org.ala.checklist.lucene.DwcaNameIndexer --all \
   --dwca /[ABSOLUT_PATH]/dwca-col-merge-animais/col_dwc.txt \
   --target /[ABSOLUT_PATH]/index/mergeColAnimaisCommon/ \
   --common /[ABSOLUT_PATH]/col_vernacular_merge_animais.txt

This is the output:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Apr 02, 2014 1:54:17 PM au.org.ala.checklist.lucene.DwcaNameIndexer createLoadingIndex
INFO: Starting to create the temporary loading index.
Apr 02, 2014 1:59:55 PM au.org.ala.checklist.lucene.DwcaNameIndexer createLoadingIndex
INFO: Finished creating the temporary load index with 2474452 concepts
Apr 02, 2014 2:05:00 PM au.org.ala.checklist.lucene.CBCreateLuceneIndex createALAIndexDocument
WARNING: urn:lsid:catalogueoflife.org:taxon:e0bc6ece-2dc5-11e0-98c6-2ce70255a436:col20120124 X humeralis has issues creating a soundex: String index out of range: -1
Apr 02, 2014 2:05:00 PM au.org.ala.checklist.lucene.CBCreateLuceneIndex createALAIndexDocument
WARNING: urn:lsid:catalogueoflife.org:taxon:e0aac25d-2dc5-11e0-98c6-2ce70255a436:col20120124 X cinerea has issues creating a soundex: String index out of range: -1
Apr 02, 2014 2:05:00 PM au.org.ala.checklist.lucene.CBCreateLuceneIndex createALAIndexDocument
WARNING: urn:lsid:catalogueoflife.org:taxon:e0c2f680-2dc5-11e0-98c6-2ce70255a436:col20120124 X lineatus has issues creating a soundex: String index out of range: -1
Apr 02, 2014 2:05:00 PM au.org.ala.checklist.lucene.CBCreateLuceneIndex createALAIndexDocument
WARNING: urn:lsid:catalogueoflife.org:taxon:e1590b50-2dc5-11e0-98c6-2ce70255a436:col20120124 X has issues creating a soundex: String index out of range: -1
Apr 02, 2014 2:21:18 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d74fcd5e-29c1-102b-9a4a-00304854f820:col20120124 Animalia 1 2455490
Apr 02, 2014 2:26:37 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d755b8fe-29c1-102b-9a4a-00304854f820:col20120124 Plantae 2455491 3065868
Apr 02, 2014 2:26:50 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d755c2e0-29c1-102b-9a4a-00304854f820:col20120124 Bacteria 3065869 3089350
Apr 02, 2014 2:27:10 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d76c1e8c-29c1-102b-9a4a-00304854f820:col20120124 Protozoa 3089351 3127338
Apr 02, 2014 2:27:20 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d7656902-29c1-102b-9a4a-00304854f820:col20120124 Chromista 3127339 3145930
Apr 02, 2014 2:27:22 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d76df252-29c1-102b-9a4a-00304854f820:col20120124 Viruses 3145931 3150922
Apr 02, 2014 2:28:21 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d770c428-29c1-102b-9a4a-00304854f820:col20120124 Fungi 3150923 3262754
Apr 02, 2014 2:28:22 PM au.org.ala.checklist.lucene.DwcaNameIndexer generateIndex
INFO: Finished loading urn:lsid:catalogueoflife.org:taxon:d77cda9c-29c1-102b-9a4a-00304854f820:col20120124 Archaea 3262755 3263584
java.lang.NullPointerException
        at au.org.ala.checklist.lucene.CBCreateLuceneIndex.isBlacklisted(CBCreateLuceneIndex.java:769)
        at au.org.ala.checklist.lucene.CBCreateLuceneIndex.createALAIndexDocument(CBCreateLuceneIndex.java:779)
        at au.org.ala.checklist.lucene.CBCreateLuceneIndex.createALAIndexDocument(CBCreateLuceneIndex.java:747)
        at au.org.ala.checklist.lucene.DwcaNameIndexer.addIndex(DwcaNameIndexer.java:321)
        at au.org.ala.checklist.lucene.DwcaNameIndexer.generateIndex(DwcaNameIndexer.java:252)
        at au.org.ala.checklist.lucene.DwcaNameIndexer.create(DwcaNameIndexer.java:85)
        at au.org.ala.checklist.lucene.DwcaNameIndexer.main(DwcaNameIndexer.java:386)

Any idea what could be the problem?
If you want, I can send you the col_dwc.txt and
col_vernacular_merge_animais.txt files that I used.
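
One possible check on the merged file, in case the NullPointerException
comes from a merged row with a missing name (this is only a guess, not a
confirmed cause): scan col_dwc.txt for rows whose column count differs from
the header or whose name field is empty. This is a minimal sketch, not part
of ala-name-matching. It assumes the file is tab-delimited, has a header
row, and has a column called scientificName; the function name
find_suspect_rows is hypothetical.

```python
import csv


def find_suspect_rows(path, name_column="scientificName", delimiter="\t"):
    """Return (line_number, reason) pairs for rows that look incomplete.

    Assumptions (not confirmed by the indexer's documentation): the file has
    a header row, is delimited by `delimiter`, and has a `name_column` field.
    """
    suspects = []
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text, one row per line.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Raises ValueError if the assumed column is not in the header.
        name_index = header.index(name_column)
        # Data rows start on line 2 because line 1 is the header.
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                suspects.append((line_number, "column count differs from header"))
            elif not row[name_index].strip():
                suspects.append((line_number, "empty " + name_column))
    return suspects
```

Calling find_suspect_rows with the path to the merged col_dwc.txt returns a
list of line numbers to inspect by hand. An empty list would only mean that
these two particular problems are absent, not that the file is valid.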

Thank you,


Allan Koch Veiga

Núcleo de Pesquisa em Biodiversidade e Computação - BioComp
Laboratório de Automação Agrícola - LAA
Depto. de Engenharia de Computação e Sistemas Digitais - PCS
Engenharia Elétrica - Escola Politécnica da USP
Celular: +55 11 8401-2277
Email: allan.kv at usp.br

"*Stay hungry, stay foolish.*" Stewart Brand


2014-03-20 18:44 GMT-03:00 <Natasha.Quimby at csiro.au>:

>  Hi Allan,
>
>  We have finished writing a DWCA Names index generator.  It requires a
> single DwCA that contains all the scientific names that you wish to add
> (including synonyms).
>
>  There is an example Catalogue of Life DwcA that can be downloaded here:
> http://biocache.ala.org.au/archives/dwca-col.zip You will need to modify
> the col_dwc.txt file to include any additional species.
>
>  The name matching index can also support common names. Here are the
> Catalogue of Life common names that can be loaded in conjunction with the
> Darwin Core Archive:
> http://biocache.ala.org.au/archives/col_vernacular.txt.zip
>
>  The name matching supports homonym detection. Homonym detection is
> supported through the use of IRMNG. You can download the IRMNG DwCA for
> homonyms from the following URL:
> www.cmar.csiro.au/datacentre/downloads/IRMNG_DWC_HOMONYMS.zip
>
>  Here is the code for the DwcaNameIndexer :
> http://code.google.com/p/ala-portal/source/browse/trunk/ala-name-matching/src/main/java/au/org/ala/checklist/lucene/DwcaNameIndexer.java
>
>  An assembly jar file for this can be downloaded from our maven
> repository :
> http://maven.ala.org.au/repository/au/org/ala/ala-name-matching/1.3-SNAPSHOT/ala-name-matching-1.3-SNAPSHOT-assembly.jar
>
>  To generate the names index using the assembly jar:
> 1) Rename the jar :
> mv ala-name-matching-1.3-SNAPSHOT-assembly.jar names.jar
>
>  2) Extract the lib directory:
>  jar -xf names.jar lib
>
>  3) Generate the names index - here is the command that I used.
> java -cp .:names.jar au.org.ala.checklist.lucene.DwcaNameIndexer --all
> --dwca /data/bie-staging/names-lists/dwca-col --target
> /data/lucene/testdwc-namematching --irmng
> /data/bie-staging/irmng/IRMNG_DWC_HOMONYMS --common
> /data/bie-staging/ala-names/col_vernacular.txt
>
>  Please be aware that the names indexing could take over an hour to
> complete.
>
>  Let me know if you have any questions.
>
>  Regards
> Natasha
>
>   From: Allan Koch <allan.kv at gmail.com>
> Date: Wednesday, 19 March 2014 12:44 AM
>
> To: Natasha Carter <natasha.quimby at csiro.au>
> Cc: "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>, "
> Ala-portal at lists.gbif.org" <Ala-portal at lists.gbif.org>
> Subject: Re: [Ala-portal] Names Generator Issues
>
>    Hi Natasha,
>
>  That's great. Thank you very much.
> We are waiting for the DwC-A with the CoL list so that we can merge our
> names into it.
>
>  We are excited about this news.
> Thank you again,
>
>  Allan Koch Veiga
>
> Núcleo de Pesquisa em Biodiversidade e Computação - BioComp
> Laboratório de Automação Agrícola - LAA
> Depto. de Engenharia de Computação e Sistemas Digitais - PCS
> Engenharia Elétrica - Escola Politécnica da USP
> Celular: +55 11 8401-2277
> Email: allan.kv at usp.br
>
> "*Stay hungry, stay foolish.*" Stewart Brand
>
>
> 2014-03-18 3:03 GMT-03:00 <Natasha.Quimby at csiro.au>:
>
>>  Hi Allan,
>>
>>  We would provide a DwCA with the Catalogue of Life species in it.  Yes,
>> you will need to add your species to this file in the same format. If you
>> want your species to be merged into the Catalogue of Life hierarchy you
>> will need to provide appropriate parentIds.
>>
>>  We would provide a tool within the ala-name-matching (available as a
>> jar file in our maven repository) to generate a list based on a DwCA. You
>> would need to run the tool pointing at your modified DwCA.
>>
>>  We will let you know when this is available.
>>
>>  Hope that this all makes sense.
>>
>>  Regards
>> Natasha
>>
>>   From: Allan Koch <allan.kv at gmail.com>
>> Date: Tuesday, 18 March 2014 7:34 AM
>> To: Natasha Carter <natasha.quimby at csiro.au>
>> Cc: "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>, "
>> Ala-portal at lists.gbif.org" <Ala-portal at lists.gbif.org>
>>
>> Subject: Re: [Ala-portal] Names Generator Issues
>>
>>    Hi Natasha,
>>
>>  That would be great. We have 109 species that we need to include in the
>> current name matching index.
>>
>>  It would be great if you could send me a DwC-A with the CoL names. I will
>> just need to add these 109 species to that archive in the same format,
>> right?
>>
>>  How will this proposed solution work?
>> Will you provide source code or a compiled program (JAR), or will we send
>> you the DwC-A so that you can generate the Lucene index?
>>
>>  Thank you very much for helping.
>>
>>  Best regards,
>>
>>  Allan Koch Veiga
>>
>> Núcleo de Pesquisa em Biodiversidade e Computação - BioComp
>> Laboratório de Automação Agrícola - LAA
>> Depto. de Engenharia de Computação e Sistemas Digitais - PCS
>> Engenharia Elétrica - Escola Politécnica da USP
>> Celular: +55 11 8401-2277
>> Email: allan.kv at usp.br
>>
>> "*Stay hungry, stay foolish.*" Stewart Brand
>>
>>
>> 2014-03-17 3:12 GMT-03:00 <Natasha.Quimby at csiro.au>:
>>
>>>  Hi Allan,
>>>
>>>  The ala-name-generator is useful if you want to use the Australian
>>> National Species list as the main source for your namematching index.  We
>>> would not suggest using this to supplement the name matching index with
>>> additional species.
>>>
>>>  In order to support custom species lists we are planning an
>>> enhancement to generate the namematching index from a DarwinCore Archive.
>>> We would envision that all the species would be provided as a single DWCA
>>> with the attached meta.xml.   We think that this could be achieved in the
>>> 1-2 week window that you mentioned.  We could provide a DWCA which contains
>>> Catalogue of Life  as a basis for you to start with. You can then add
>>> additional names to the DWCA as you please. Do you think that this would
>>> suit your needs?
>>>
>>>  Regards
>>> Natasha
>>>
>>>  From: Allan Koch <allan.kv at gmail.com>
>>> Date: Saturday, 15 March 2014 4:19 AM
>>> To: "Martin, Dave (CES, Black Mountain)" <David.Martin at csiro.au>
>>>
>>> Cc: "Ala-portal at lists.gbif.org" <Ala-portal at lists.gbif.org>
>>> Subject: Re: [Ala-portal] Names Generator Issues
>>>
>>>     Thank you for the quick answers, David and Tim.
>>>
>>> We have studied the process for creating the namematching index based on
>>> the National List of Australia, and I see that reproducing the same process
>>> to create a new National List is quite complex.
>>>
>>> But, for now, we only need to include some names that aren't
>>> included in the current namematching index.
>>>
>>> If we understood correctly, we first need to run the Names Generator with
>>> a set of CSVs from APNI, APC and AFD as input. Based on the output of the
>>> Names Generator, we then run the Name Matching to create the Lucene index,
>>> right?
>>>
>>> If that's right, it would be great if we could execute this same standard
>>> process, but with the input CSVs modified to include our set of names
>>> (in the same format).
>>>
>>> In the future (after these 3 months) we can study the possibility of
>>> generating our complete National List.
>>>
>>> But for now, we need to include a set of names in the namematching index.
>>> Could this be done in a short time, in one or two weeks?
>>>
>>>  Best regards,
>>>
>>>  Allan Koch Veiga
>>>
>>> Núcleo de Pesquisa em Biodiversidade e Computação - BioComp
>>> Laboratório de Automação Agrícola - LAA
>>> Depto. de Engenharia de Computação e Sistemas Digitais - PCS
>>> Engenharia Elétrica - Escola Politécnica da USP
>>> Celular: +55 11 8401-2277
>>> Email: allan.kv at usp.br
>>>
>>> "*Stay hungry, stay foolish.*" Stewart Brand
>>>
>>>
>>> 2014-03-12 23:15 GMT-03:00 <David.Martin at csiro.au>:
>>>
>>>>  Thanks Allan, Paulo, Tim.
>>>>
>>>>  We appreciate your efforts in setting this software up locally, and
>>>> thanks for emailing the list.
>>>>
>>>>  1) Versioning
>>>>
>>>>  While we are on the track of making this software re-usable by other
>>>> projects/organisations, it is still very early days. Versioning and
>>>> packaging are things that we need to tackle properly in the 3 month
>>>> evaluation period [2] and we are working with GBIF on the best approach
>>>> here (see Tim's email regarding Ansible). To date, the ALA environment
>>>> itself is the only place these components are used in production and we
>>>> manage these closely ourselves. We haven't had a need to tightly version
>>>> components, but as other projects become reliant we need to do this
>>>> properly. At this point in time, I'd recommend ignoring developments on
>>>> branches within SVN.
>>>>
>>>>  2) ala-name-generator
>>>>
>>>>  We didn't anticipate that other projects would be using the
>>>> ala-name-generator code at this stage (or at all), and instead would rely
>>>> on the Catalogue of Life names lucene index we've produced [1]. The
>>>> ala-name-generator code as it currently is isn't suitable for use outside
>>>> the Australian context. It is dealing with some of the quirks of Australian
>>>> species lists and merging some elements from different sources. We should
>>>> have marked wikis to that effect.
>>>>
>>>>  That said, we appreciate the need for other projects to use their own
>>>> taxonomic checklists. This is something I'd hope we can tackle in the 3
>>>> month evaluation period [2]. There are a few potential approaches here we are
>>>> exploring and we'll email this list soon with some progress on this front.
>>>> I suggest in the meantime, projects make use of the existing index [1].
>>>>
>>>>  Thanks again,
>>>>
>>>>  Dave Martin
>>>> ALA
>>>>
>>>>  [1] http://biocache.ala.org.au/archives/col_namematching.tgz
>>>> [2] See GBIF's email sent 21st Feb 2014 - "Biodiversity data portals:
>>>> Using the ALA tooling"
>>>>
>>>>
>>>>  From: "Tim Robertson [GBIF]" <trobertson at gbif.org>
>>>> Date: Thursday, 13 March 2014 2:16 am
>>>> To: Paulo André <pfilipak at gmail.com>
>>>> Cc: "Ala-portal at lists.gbif.org" <Ala-portal at lists.gbif.org>
>>>> Subject: Re: [Ala-portal] Names Generator Issues
>>>>
>>>>  Hi Paulo
>>>>
>>>>  Those are all good comments - I'll make sure the ALA dev team are
>>>> following those issues.
>>>> As this goes forward, it is clear that code releases are going to be
>>>> needed, so we get immutable binaries in nexus and tagged SVN branches.
>>>>  I'll try and raise this with Dave Martin.
>>>>
>>>>  I'll try and follow the resolutions for the issues you log, build an
>>>> artifact and verify the same results.
>>>> I'm not so much into scala, but IIRC I saw that issue with another
>>>> artifact.  The solution was to run this before running the command line:
>>>>   jar -xf ala-names-generator-1.0-SNAPSHOT-assembly.jar lib
>>>>
>>>>  I found this in the way they run the biocache command line tools in:
>>>>
>>>> https://ala-portal.googlecode.com/svn/trunk/biocache-install/ubuntu/install.sh
>>>>
>>>>  It may not be the solution, but worth trying.
>>>>
>>>>  I hope this helps,
>>>> Tim
>>>>
>>>>
>>>>
>>>>  On Mar 12, 2014, at 3:57 PM, Paulo André wrote:
>>>>
>>>>  Tim
>>>>
>>>>  I have had several issues with
>>>> https://code.google.com/p/ala-portal/source/browse/#svn%2Ftrunk%2Fala-names-generator
>>>>
>>>>  I reported them on Jira: http://dev.gbif.org/issues/browse/ALA
>>>>
>>>>  []'s
>>>> Paulo Andre Filipak
>>>>
>>>>
>>>> 2014-03-12 11:51 GMT-03:00 Tim Robertson [GBIF] <trobertson at gbif.org>:
>>>>
>>>>>  Hi Allan,
>>>>>
>>>>>  I am sure the ALA folks will comment when they wake up.  But...
>>>>>
>>>>>  It doesn't appear to be published as an artifact in the ALA maven
>>>>> repository:
>>>>>   http://maven.ala.org.au/repository/au/org/ala/
>>>>>
>>>>>  You could build from source from:
>>>>> https://code.google.com/p/ala-portal/source/browse/#svn%2Ftrunk%2Fala-names-generator
>>>>>
>>>>>  I presume using something along the lines of "mvn clean
>>>>> assembly:assembly"
>>>>>
>>>>>  I hope this helps provide some options,
>>>>> Tim
>>>>>
>>>>>
>>>>>   On Mar 12, 2014, at 3:32 PM, Allan Koch wrote:
>>>>>
>>>>>    Does anyone know where I can download this jar:
>>>>> *ala-names-generator-1.0-SNAPSHOT-assembly.jar*?
>>>>>
>>>>> I'm trying to generate a new Taxon Name List based on NSL for the
>>>>> Biocache processing.
>>>>> I have followed these instructions:
>>>>>
>>>>> http://code.google.com/p/ala-portal/wiki/UpgradeALANames
>>>>>
>>>>> According to the instructions, I need to run this command:
>>>>>
>>>>> java -Xmx1G -Xms1G -cp .:ala-names-generator-1.0-SNAPSHOT-assembly.jar
>>>>> au.org.ala.names.NamesGenerator --all
>>>>>
>>>>> But I can't find this JAR.
>>>>>
>>>>>  We are trying to build the Scala project, but we are having some
>>>>> trouble.
>>>>> It would help me, for now, if I could run a prebuilt JAR.
>>>>>
>>>>>  Best regards,
>>>>>
>>>>>   Allan Koch Veiga
>>>>>
>>>>> Research Center on Biodiversity and Computing - BioComp
>>>>> University of São Paulo
>>>>>
>>>>> Laboratório de Automação Agrícola - LAA
>>>>> Depto. de Engenharia de Computação e Sistemas Digitais - PCS
>>>>> Engenharia Elétrica - Escola Politécnica da USP
>>>>> Celular: +55 11 98401-2277
>>>>> Email: allan.kv at usp.br
>>>>>
>>>>> "*Stay hungry, stay foolish.*" Stewart Brand
>>>>>     _______________________________________________
>>>>> Ala-portal mailing list
>>>>> Ala-portal at lists.gbif.org
>>>>> http://lists.gbif.org/mailman/listinfo/ala-portal
>>>>>
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------------------------
>>>>>  Tim Robertson - GBIF Head of Informatics - trobertson at gbif.org
>>>>>  Global Biodiversity Information Facility http://www.gbif.org/
>>>>>  GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
>>>>>  Tel: +45 3532 1487  Mob: +45 2826 1487  Fax: +45 2875 1480
>>>>> ----------------------------------------------------------------------------------------
>>>>>
>>>
>>
>