[COL-Users] COL downloads changes
mdoering at gbif.org
Thu May 6 12:29:45 UTC 2021
I must have missed your email in March, sorry for that. Let me still answer your important questions about the monthly downloads.
COL aims to produce a monthly release, which we will keep accessible via the API with all its features for at least a year. Each release has its own distinct datasetKey in ChecklistBank. There is no fixed day on which we issue the release. Moving to the new infrastructure at the end of last year caused a few teething issues, which made us skip a few releases, for example in January/February. We hope this will not happen again, and you should now see regular monthly updates.
One of these releases will be tagged as an Annual release, which from the API point of view is just the same as a monthly one. But it will not be removed from ChecklistBank, so you have long-term access to it via the API. We plan to issue the next annual release in June.
Once a monthly release is deleted, we will still keep the data in various formats for download. But it will be gone from the database.
The download archive currently contains a DwC archive (YYYY-MM-DD_dwca.zip) and an ACEF archive (YYYY-MM-DD_acef.zip). With the next release in May we will also add a new ColDP archive to the supported formats. A native postgres dump is on our list too, but that is not straightforward, as COL is just a small part of ChecklistBank and we will need to filter out the relevant bits.
Prior to December 2020 we only had DwC archives, but these used some slightly different terms than we use today.
Note that we do not export a flattened classification (family, order, etc.) at this stage, but we plan to add that back in the not too distant future. The same also applies to ColDP.
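As a rough illustration of the naming pattern above, here is a minimal Python sketch that builds the archive URL for a given release date. It assumes the archives sit directly under https://download.catalogueoflife.org/col/monthly/ (the location mentioned in the original question below). The helper name archive_url is hypothetical, and only the two suffixes named above are accepted.

```python
from datetime import date

# Assumption: archives are served directly under this base URL.
BASE_URL = "https://download.catalogueoflife.org/col/monthly/"

# Only the two formats named in this message; other suffixes are not guessed.
KNOWN_FORMATS = ("dwca", "acef")


def archive_url(release_date, fmt):
    """Build the URL following the YYYY-MM-DD_<format>.zip pattern (hypothetical helper)."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown archive format: {fmt!r}")
    return f"{BASE_URL}{release_date.isoformat()}_{fmt}.zip"


url = archive_url(date(2020, 12, 1), "acef")
```

Since there is no fixed release day, the date has to be taken from the directory listing; it cannot be computed.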
The ACEF archive is slightly special. We used it to transfer the data to the old systems, and it uses \N to represent NULL, which is a postgres-specific convention. The files are proper CSV files with a header row and are not tab separated. There is a short SQL script, which also includes the DDL, to load the ACEF files into a postgres database: https://github.com/CatalogueOfLife/backend/blob/master/webservice/src/main/resources/export/acef/load-export.sql
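For anyone loading the ACEF files outside of postgres, the \N convention can be handled while reading. The following is a minimal Python sketch, not the official loader. The function name read_acef_rows and the sample column names are made up for illustration; the real column names come from each file's header row.

```python
import csv
import io

# The NULL marker used in the ACEF files (raw string: backslash followed by N).
NULL_MARKER = r"\N"


def read_acef_rows(handle):
    """Yield one dict per data row, mapping the NULL marker to None."""
    reader = csv.DictReader(handle)  # comma separated, first row is the header
    for row in reader:
        yield {key: (None if value == NULL_MARKER else value)
               for key, value in row.items()}


# Made-up sample data; the column names are illustrative only.
sample = io.StringIO("ID,Genus,Species\n1,Puma,concolor\n2,Puma,\\N\n")
rows = list(read_acef_rows(sample))
```

With a real file, open it with newline="" as the csv module documentation recommends, and pass the file handle instead of the in-memory sample.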
The DwCA and ColDP archives, on the other hand, use just an empty string for NULL and also have a header row with the term names.
Files in both DwCA and ColDP are tab delimited and use the .tsv file extension. For ColDP we maintain Postgres, MS Access and Excel schemas, but these need some small updates as we are about to freeze the format for a final, fixed first version.
Hope that's useful,
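The tab-delimited files can be read in much the same way. This is a minimal Python sketch that treats empty strings as missing values. It assumes the fields are not quoted, which is an assumption and not something stated above. The function name read_tsv_rows and the sample columns are made up for illustration.

```python
import csv
import io


def read_tsv_rows(handle):
    """Yield one dict per data row, mapping empty strings to None."""
    # Assumption: fields are plain tab-separated text without quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (value if value != "" else None)
               for key, value in row.items()}


# Made-up sample data; the header uses illustrative term names.
sample = io.StringIO("taxonID\tscientificName\tauthorship\n1\tPuma concolor\t\n")
tsv_rows = list(read_tsv_rows(sample))
```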
I had been using the downloads available at https://download.catalogueoflife.org/col/monthly/ to construct a SQLite version of the database here https://github.com/sckott/col-sql to make it easier for users to use.
Two questions, the 2nd with many parts:
1. Will https://download.catalogueoflife.org/col/monthly/ continue to be updated every 2 or 3 months with a new database dump?
2. If the answer to (1) is yes: The format changed in the last database dump "2020-12-01_acef.zip".
a. The included file names changed, and file types changed from .tsv to .csv (although the data still appears to be tab-sep). Was it intended to change to comma-sep?
b. Will there be more changes to the monthly dump?
c. Will the release cycle be something predictable? Every 2 or 3 months?
d. The files have a lot of "\N" in them. Is this supposed to be a newline character? I've not seen a newline with a capital N.
e. Any schema to use for these various csv files?