Re: [API-users] Is there any NEO4J or graph-based driver for this API ?
Jorrit pointed out this thread to us at iDigBio. Downloading and importing data into a relational database will work great, especially if, as Jan said, you can cut the data size down to a reasonable amount.
Another approach we've been working on in a collaboration called GUODA [1] is to build an Apache Spark environment with pre-formatted data frames containing common data sets for researchers to use. This would be a remote service where you could write arbitrary Spark code, probably in Jupyter notebooks, to iterate over the data. Spark does a lot of cool stuff, including GraphX, which might be of interest. This is definitely pre-alpha at this point, and if anyone is interested, I'd like to hear your thoughts. I'll also be at SPNHC talking about this.
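As a rough sketch of what notebook code against such a service might look like (the Parquet path and column names below are made up for illustration; GUODA's actual layout may differ, since it is pre-alpha):

```python
# Hypothetical sketch: counting occurrence records per genus in a
# pre-formatted data frame a GUODA-style Spark service might expose.
# The path and column names are assumptions, not GUODA's real layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("guoda-sketch").getOrCreate()

# Assume the service has already loaded a dataset as columnar Parquet.
occ = spark.read.parquet("hdfs:///guoda/idigbio-occurrences.parquet")

# Whole-dataset iteration becomes a few lines of Spark SQL:
(occ.groupBy("genus")
    .count()
    .orderBy("count", ascending=False)
    .show(20))
```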
One thing we've found in working on this is that importing data into a structured data format isn't always easy. If you only want a few columns, it'll be fine, but getting the data typing, format standardization, and column naming right across the whole width of an iDigBio record requires some code. I looked to see whether EcoData Retriever [2] had a GBIF data source; they have an eBird one that you might find useful as a starting point if you want to reuse someone else's code to download and import data.
For other data structures like BHL, we're kind of making things up as we go, since we're packaging a relational structure and not something nearly as flat as GBIF and DwC data.
[1] http://guoda.bio/
[2] http://www.ecodataretriever.org/
Matthew Collins
Technical Operations Manager
Advanced Computing and Information Systems Lab, ECE
University of Florida
352-392-5414

From: jorrit poelen <jhpoelen@xs4all.nl>
Sent: Monday, May 30, 2016 11:16 AM
To: Collins, Matthew; Thompson, Alexander M; Hammock, Jennifer
Subject: Fwd: [API-users] Is there any NEO4J or graph-based driver for this API ?
Hey y’all:
Interesting request below on the GBIF mailing list - sounds like a perfect fit for the GUODA use cases.
Would it be too early to jump onto this thread and share our efforts/vision?
thx, -jorrit
Begin forwarded message:
From: Jan Legind <jlegind@gbif.org>
Subject: Re: [API-users] Is there any NEO4J or graph-based driver for this API ?
Date: May 30, 2016 at 5:48:51 AM PDT
To: Mauro Cavalcanti <maurobio@gmail.com>, "Juan M. Escamilla Molgora" <j.escamillamolgora@lancaster.ac.uk>
Cc: api-users@lists.gbif.org
Dear Juan,
Unfortunately we have no tool for creating these kinds of SQL-like queries against the portal. I am sure you are aware that the filters in the occurrence search pages can be applied in combination in numerous ways. The API can go even further in this regard [1], but it is not well suited for retrieving occurrence records, since there is a 200,000-record ceiling, making it unfit for species exceeding this number.
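To make that ceiling concrete, a paging loop against the search endpoint [1] might look like the sketch below (illustrative code, not an official GBIF client); it simply cannot go past the cap:

```python
# Sketch: paging the GBIF occurrence search API with requests. Page
# size maxes out at 300, and limit + offset cannot exceed the
# 200,000-record ceiling, so large species cannot be fetched this way.
import requests

API = "http://api.gbif.org/v1/occurrence/search"
records, offset, limit = [], 0, 300

while True:
    page = requests.get(API, params={"taxonKey": 2435099,  # example key
                                     "limit": limit,
                                     "offset": offset}).json()
    records.extend(page["results"])
    if page["endOfRecords"] or offset + limit >= 200000:
        break
    offset += limit
```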
There are going to be updates to the pygbif package [2] in the near future that will enable you to launch user downloads programmatically, where a whole list of different species can be used as a query parameter, as well as adding polygons [3].
In the meantime, Mauro’s suggestion is excellent. If you can narrow your search down until it returns a manageable download (say less than 100 million records), importing this into a database should be doable. From there, you can refine using SQL queries.
Best, Jan K. Legind, GBIF Data manager
[1] http://www.gbif.org/developer/occurrence#search [2] https://github.com/sckott/pygbif [3] https://github.com/jlegind/GBIF-downloads
From: API-users [mailto:api-users-bounces@lists.gbif.org] On Behalf Of Mauro Cavalcanti
Sent: 30 May 2016 14:06
To: Juan M. Escamilla Molgora
Cc: api-users@lists.gbif.org
Subject: Re: [API-users] Is there any NEO4J or graph-based driver for this API ?
Hi,

One solution I have successfully adopted for this is to download the records (either "manually" via browser or, better yet, with a Python script using the fine pygbif library), store them in a MySQL or SQLite database, and then perform the relational queries there. I can provide examples if you are interested.
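As a minimal sketch of that workflow (the species, schema, and column choice are illustrative only):

```python
# A minimal sketch of the approach described above: page records from
# GBIF with pygbif and store a few columns in SQLite for relational
# querying. Species, schema, and column choice are illustrative only.
import sqlite3
from pygbif import occurrences as occ

conn = sqlite3.connect("gbif.db")
conn.execute("""CREATE TABLE IF NOT EXISTS occurrence (
                    key INTEGER PRIMARY KEY,
                    species TEXT, lat REAL, lon REAL, year INTEGER)""")

offset = 0
while True:
    page = occ.search(scientificName="Puma concolor", limit=300, offset=offset)
    for rec in page["results"]:
        conn.execute("INSERT OR REPLACE INTO occurrence VALUES (?,?,?,?,?)",
                     (rec["key"], rec.get("species"),
                      rec.get("decimalLatitude"), rec.get("decimalLongitude"),
                      rec.get("year")))
    if page["endOfRecords"]:  # note the 200,000-record ceiling Jan mentions
        break
    offset += 300
conn.commit()

# From here, ordinary SQL does the relational work, e.g.:
#   SELECT species, COUNT(*) FROM occurrence GROUP BY species;
```

Best regards,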
2016-05-30 8:59 GMT-03:00 Juan M. Escamilla Molgora <j.escamillamolgora@lancaster.ac.uk>:

Hello,
Is there any API for making relational queries on taxonomy, location, or timestamp?
Thank you and best wishes
Juan
--
Dr. Mauro J. Cavalcanti
E-mail: maurobio@gmail.com
Web: http://sites.google.com/site/maurobio
Matthew,
As a matter of fact, I have already developed such a tool. It is called "Feronia": a program written in Python that uses several available software libraries for downloading data from online databases that provide APIs (including, of course, GBIF). This tool downloads data into a relational database using a generic schema I have devised (http://sites.google.com/site/acaciadb). Examples of databases already implemented using this tool can be found here: http://coralfish.netne.net and neocop2.biotupe.com
Feronia is not yet available, but I will make it available on GitHub ASAP.
Hope this helps.
Best regards,
--
Dr. Mauro J. Cavalcanti
E-mail: maurobio@gmail.com
Web: http://sites.google.com/site/maurobio
Dear all,
Thank you very much for your valuable feedback!
I'll explain a bit of what I'm doing just to clarify; sorry if this is spam to some.
I want to build a model of species assemblages based on the co-occurrence of taxa within an arbitrary area. I'm building a 2D lattice in which, for each cell, I collapse the occurrence data into a taxonomic tree. To do this I first need to obtain the data from the GBIF API and then, based on the IDs (or names) of each taxonomic level (from kingdom to occurrence), build a tree coupled to each cell.
The implementation uses PostgreSQL (PostGIS) for storing the raw GBIF data and Neo4j for storing the relation "is a member of the [species, genus, family, ...] [name/id]". The idea is to include data from different sources, similar to the project Matthew and Jennifer mentioned (which I'm very interested in and would like to hear more about), and to traverse the network looking for significant merged information.
One of the immediate problems I've found is importing big chunks of GBIF data into my specification. Thanks to this thread I've found the tools most used by the community (pygbif, rgbif, and python-dwca-reader); I was using urllib2 and things like that.
I'll be happy to share code or ideas with anyone interested.
By the way, I've checked out the TinkerPop project, which uses the Gremlin traversal language and is independent of the DBMS.
Perhaps it's possible to use it with Spark and GUODA as well?
Is GUODA working now?
Best wishes
Juan.
Hi Juan
That sounds like a fun project!
Can you please describe your grid / cells?
Most likely your best bet will be to use the download API (as CSV data) and ingest that. The other APIs will likely hit limits (e.g. you can't page through indefinitely).
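For example, once the programmatic downloads mentioned earlier land in pygbif, launching and fetching one might look roughly like the sketch below (the function and argument names are assumptions based on [2] and [3], not a released API):

```python
# Sketch only: hypothetical pygbif download calls, based on the
# work-in-progress described earlier in this thread [2][3].
from pygbif import occurrences as occ

# Ask GBIF to prepare a download server-side (requires a GBIF account).
key = occ.download(["taxonKey = 2435099", "hasCoordinate = true"],
                   user="gbif_user", pwd="...", email="you@example.org")

# Fetch the finished archive later, then ingest it locally.
occ.download_get(key, path=".")
```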
Thanks, Tim
Hi Tim,
The grid is made by selecting a square area and dividing it into n×n subsquares, which form a partition of the bigger square.
Each grid is a table in PostGIS, and there's a mapping between this table and a Django model (class).
The class constructor has the attributes: id, cell, and neighbours (next release).
The cell is a polygon (a square) and, through GeoDjango, inherits the properties of the osgeo module for polygons.
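A minimal sketch of such a model in GeoDjango (class and field names are illustrative, not the actual biospytial code):

```python
# Sketch of a grid-cell model in GeoDjango; names are illustrative.
from django.contrib.gis.db import models

class Cell(models.Model):
    # Django adds the integer primary key "id" automatically.
    cell = models.PolygonField(srid=4326)  # the square, WGS84 by default

    def neighbours(self):
        # Planned attribute; as a PostGIS-backed lookup it could be:
        return Cell.objects.filter(cell__touches=self.cell)
```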
I've tried to use the CSV data (downloaded as a CSV request), but I couldn't find a way to obtain the global IDs for each taxonomic level (idspecies, idgenus, idfamily, etc.).
Do you know a way to obtain these fields?
Thank you for your email and best wishes,
Juan
Thanks Juan
You're quite right - you need the DwC-A download format to get those IDs.
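For instance, with python-dwca-reader (mentioned earlier in this thread) the per-rank keys can be read out of the archive. A sketch, with the exact term URIs best checked against the archive itself:

```python
# Sketch: reading a GBIF Darwin Core Archive download with
# python-dwca-reader. The filename is hypothetical, and the exact
# term URIs should be verified against the archive's own descriptor.
from dwca.read import DwCAReader

with DwCAReader("gbif-download.zip") as dwca:
    # See which columns (terms) the archive actually carries:
    print(dwca.descriptor.core.terms)

    for row in dwca:
        print(row.data)  # dict keyed by full term URI; the per-rank
        break            # keys appear under rs.gbif.org term URIs
```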
Are the cells decimal degrees, and then partitioned into smaller units, or equal area cells or maybe UTM grids or something else perhaps? I am just curious.
Are you developing this as OSS? I'd like to follow progress if possible.

Thanks, Tim
Hi Tim,
Thank you, especially for the DwC-A hint.
The cells are by default in decimal degrees (WGS84), but the functions for generating them are general enough to use any projection supported by GDAL through PostGIS. They can be generated "on the fly" or stored on the server side.
I was thinking (daydreaming) about a standard way of encoding unique but universal grids (similar to Geohash or Open Location Code), but I didn't find anything fast and ready. Maybe later :)
I only use open-source software: Python, Django, GDAL, NumPy, PostGIS, Conda, Py2Neo, and ete2, among others.
Currently I don't have an official release; the project is quite immature and unstable, and the installation can be non-trivial. I'm fixing all these issues, but it will take some time. Sorry for this.
The GitHub repository is:
https://github.com/molgor/biospytial.git
And there's some very old documentation here:
http://test.holobio.me/modules/gbif_taxonomy_class.html
Please feel free to follow!
Best wishes
Juan
P.S. The functions for generating the grid are in: biospytial/SQL_functions
Hi Juan et al
Thanks a lot for triggering this discussion. I am currently working on a web processing service (http://birdhouse.readthedocs.io/en/latest/) that includes a species distribution model based on GBIF data (and climate model data). A good connection to the GBIF database is still missing, and all these hints were quite useful!
If you want to share code: https://github.com/bird-house/flyingpigeon/blob/master/flyingpigeon/processe...
Thanks, Nils
On 31/05/2016 22:08, Juan M. Escamilla Molgora wrote:
Hi Tim,
Thank you! specially for the DwC-A hint.
The cells are by default in decimal degrees, (wgs84 ) but the functions for generating them are general enough to use any projection supported by gdal using postgis. It could be done "on the fly" or stored on the server side,
I was thinking (day dreaming) in a standard way for coding unique but universal grids (similar to geohash or open location code), but didn't find something fast and ready. Maybe later :)
I only use Open Source Software, Python, Django, GDAL, Numpy, Postgis, Conda, Py2Neo, ete2 among others.
Currently I don't have an official release and the project is quite inmature, unstable as well as the installation could be non trivial. I'm fixing all these issues but will take some time,sorry for this.
The github repository is:
https://github.com/molgor/biospytial.git
An there's a very old documentation here:
http://test.holobio.me/modules/gbif_taxonomy_class.html
Please feel free to follow!
Best wishes
Juan
P.s. The functions for generating the grid are in: biospytial/SQL_functions
On 31/05/16 19:47, Tim Robertson wrote:
Thanks Juan
You're quite right - you need the DwC-A download format to get those IDs.
Are the cells decimal degrees, and then partitioned into smaller units, or equal area cells or maybe UTM grids or something else perhaps? I am just curious.
Are you developing this as OSS? I'd like to follow progress if possible?
Thanks, Tim,
On 31 May 2016, at 20:31, Juan M. Escamilla Molgora j.escamillamolgora@lancaster.ac.uk wrote:
Hi Tim,
The grid is made by selecting a square area and divide it in nxn subsquares which form a partition on the bigger square.
Each grid is a table in postgis and there's a mapping between this table to a django model (class).
The class constructor have attributes: id, cell and neighbours (next release).
The cell is a polygon (square) and with geodjango inherits the properties of the osgeo module for polygons.
I've tried to use the CSV data (downloaded as a CSV request ) but I couldn't find a way to obtain the global id's for each taxonomic level (idspecies, idgenus, idfamily, etc).
Do you know a way for obtaining these fields?
Thank you for your email and best wishes,
Juan
On 31/05/16 19:03, Tim Robertson wrote:
Hi Juan
That sounds like a fun project!
Can you please describe your grid / cells?
Most likely your best bet will be to use the download API (as CSV data) and ingest that. The other APIs will likely hit limits (e.g. You can't page through indefinitely).
Thanks, Tim
On 31 May 2016, at 18:55, Juan M. Escamilla Molgora j.escamillamolgora@lancaster.ac.uk wrote:
Dear all,
Thank you very much for your valuable feedback!
I'll explain a bit what I'm doing just to clarify, sorry if this spam to some.
I want to build a model for species assemblages based on co-occurrence of taxa within an arbitrary area. I'm building a 2D lattice in which for each cell I'm collapsing the data into a taxonomic tree (the occurrences). For doing this I need first to obtain the data from the gbif api and later, based on the ids (or names) of each taxonomic level (from kingdom to occurrence) build a tree coupled to each cell.
The implementation is done with postgresql (postgis) for storing the raw gbif data and neo4j for storing the relation
"Being a member of the [ specie, genus, family,,,] [name/id]" The idea is to include data from different sources similar to the project Matthew and Jennifer had mentioned (which I'm very interested and like to hear more) and traverse the network looking for significant merged information.
One of the immediate problems I've found is to import big chunks of the gbif data into my specification. Thanks to this thread I've found the tools that are the most used by the community (pygbif,rgbif, and python-dwca-reader). I was using urlib2 and things like that.
I'll be happy to share any code or ideas with the people interested.
Btw, I've checked the tinkerpop project which uses the Gremlin traversal language as independent from the DBMS.
Perhaps it's possible to use it with spark and Guoda as well?
Does GOuda is working now?
Best wishes
Juan.
Nils,
Really great... 🙂 Thanks for sharing!
Salud!
2016-06-01 6:09 GMT-03:00 Nils Hempelmann info@nilshempelmann.de:
Hi Juan et al
Thanks a lot for triggering this discussion. I am currently working on a web processing service (http://birdhouse.readthedocs.io/en/latest/) that includes a species distribution model based on GBIF data (and climate model data). A good connection to the GBIF database is still missing, so all hints were quite useful!
If you want to share code:
https://github.com/bird-house/flyingpigeon/blob/master/flyingpigeon/processes/wps_sdm.py
Merci Nils
On 31/05/2016 22:08, Juan M. Escamilla Molgora wrote:
Hi Tim,
Thank you! Especially for the DwC-A hint.
The cells are by default in decimal degrees (WGS84), but the functions for generating them are general enough to use any projection supported by GDAL via PostGIS. It can be done "on the fly" or stored on the server side.
I was thinking (daydreaming) of a standard way of encoding unique but universal grids (similar to geohash or Open Location Code), but didn't find anything fast and ready. Maybe later :)
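As a rough sketch of the n-by-n partition idea (plain Python with placeholder bounds, not the project's actual SQL functions):

    # Sketch only: split a WGS84 bounding box into an n x n grid of square
    # cells, each as WKT ready for a PostGIS insert.
    def make_grid(xmin, ymin, xmax, ymax, n):
        dx, dy = (xmax - xmin) / n, (ymax - ymin) / n
        cells = []
        for i in range(n):
            for j in range(n):
                x0, y0 = xmin + i * dx, ymin + j * dy
                x1, y1 = x0 + dx, y0 + dy
                cells.append(f"POLYGON(({x0} {y0}, {x1} {y0}, {x1} {y1}, "
                             f"{x0} {y1}, {x0} {y0}))")
        return cells

    # e.g. a 10 x 10 lattice over a placeholder study area:
    grid = make_grid(-118.0, 14.0, -86.0, 33.0, 10)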
I only use open source software: Python, Django, GDAL, NumPy, PostGIS, Conda, Py2Neo, and ete2, among others.
Currently I don't have an official release; the project is quite immature and unstable, and the installation can be non-trivial. I'm fixing all these issues, but it will take some time. Sorry for this.
The github repository is:
https://github.com/molgor/biospytial.git
And there's some very old documentation here:
http://test.holobio.me/modules/gbif_taxonomy_class.html
Please feel free to follow!
Best wishes
Juan
P.S. The functions for generating the grid are in: biospytial/SQL_functions
On 31/05/16 19:47, Tim Robertson wrote:
Thanks Juan
You're quite right - you need the DwC-A download format to get those IDs.
Are the cells decimal degrees, then partitioned into smaller units, or equal-area cells, or maybe UTM grids, or something else perhaps? I am just curious.
Are you developing this as OSS? I'd like to follow progress if possible.
Thanks, Tim
Dear all
Time is running quickly, and it's already a while since I discovered GBIF/pygbif.
Meanwhile there is an initial version of a species distribution model in the Birdhouse Web Processing Service, based on pygbif to fetch the species occurrence data, plus a data search interface (wizard) to connect to the appropriate climate model data.
Here is the (also initial) documentation: http://flyingpigeon.readthedocs.io/en/latest/descriptions/sdm.html
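As a rough sketch of the occurrence-fetching step (the species name and limits are placeholders; the real code is in the flyingpigeon repository linked earlier):

    # Sketch only: fetch georeferenced occurrences with pygbif as SDM input.
    from pygbif import species, occurrences

    # Resolve a name to a GBIF taxon key, then search georeferenced records.
    key = species.name_backbone(name="Fagus sylvatica")["usageKey"]
    results = occurrences.search(taxonKey=key, hasCoordinate=True, limit=300)

    coords = [(r["decimalLongitude"], r["decimalLatitude"])
              for r in results["results"]
              if "decimalLatitude" in r]
    print(len(coords), "presence points")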
And if you want to have a look at the GUI (including lots of other processes as well): https://mouflon.dkrz.de/
Looking forward to your feedback, suggestions, ideas, hopes and wishes ... :-)
Nils
Hi Nils,
Thank you for sharing!
What is Phoenix about? Does it connect to the ESGF network? It's the first time I've read about this. Looks very, very interesting!
Thanks, everybody, for this valuable feedback.
Best wishes
Juan
Hi Juan et al
Yes, Phoenix has a search interface to ESGF data (but you can use other climate data archives as well).
Here are some preliminary screenshots: http://flyingpigeon.readthedocs.io/en/latest/tutorials/sdm.html
Best Nils
participants (5)
- Collins, Matthew
- Juan M. Escamilla Molgora
- Mauro Cavalcanti
- Nils Hempelmann
- Tim Robertson