How to check the last date and time a dataset was harvested?
Hi again.
Is there any way to use the GBIF portal API to find out the last date and time a given resource was harvested by GBIF? (Basically, to know whether it has the latest data provided by the original source or could be outdated.)
I suspect the information is at this URL, but I don't know how to interpret all of it:
http://api.gbif.org/v1/dataset/df3eab30-0837-11d9-acb2-b8a03c50a862
I guess this is the IPT updated time:
"modified": "2017-02-16T22:07:41.127+0000"
Perhaps this is what I am looking for?
"machineTags":[{..."name":"crawl_attempt",..."created":"2017-01-05T12:07:12.525+0000"},...]
Thanks a lot, David
Hi David,
The crawl attempt is the start of the crawl with the publisher. Once the status confirms the crawl has finished, you can use that. While it is crawling, the status will be something like “RUNNING” (off the top of my head).
Thanks, Tim
From: API-users <api-users-bounces@lists.gbif.org> on behalf of Herbario SANT <sant.herbarium@gmail.com>
Date: Sunday 19 February 2017 at 14:38
To: "api-users@lists.gbif.org" <api-users@lists.gbif.org>
Subject: [API-users] how to check last date-time dataset was harvested?
-- David García San León Herbario SANT Universidade de Santiago de Compostela
Thanks, Tim, but I don't understand what you meant, because I don't see any "status" in the API response.
Anyway, I have some new API-related questions. See the parts I extracted from this example request:
http://api.gbif.org/v1/dataset/10734a60-7ed1-11df-8c4a-0800200c9a66
{
  "key": "10734a60-7ed1-11df-8c4a-0800200c9a66",
  "installationKey": "86e4d50b-d77c-4731-99fb-b3e2a2a83163",
  "publishingOrganizationKey": "def87a70-0837-11d9-acb2-b8a03c50a862",
  (...)
  "lockedForAutoUpdate": false,
  "createdBy": "registry-migration.gbif.org",
  "modifiedBy": "crawler.gbif.org",
  "created": "2010-05-03T22:02:18.000+0000",
  "modified": "2017-01-19T18:16:28.844+0000",
  (...)
  "machineTags": [
    {
      "key": 606300,
      "namespace": "crawler.gbif.org",
      "name": "crawl_attempt",
      "value": "45",
      "createdBy": "crawler.gbif.org",
      "created": "2017-01-19T18:17:29.063+0000"
    },
    (...)
  ],
  (...)
  "dataLanguage": "eng",
  "pubDate": "2017-01-18T23:00:00.000+0000",
  (...)
}
I want to determine programmatically whether a given published dataset is up to date in the GBIF portal, compared with its current version on the IPT server.
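As a starting point for that kind of script, here is a minimal Python sketch that pulls the most recent crawl_attempt machine tag out of a dataset record shaped like the excerpt above. The helper names (dataset_url, fetch_dataset, latest_crawl_attempt) and the SAMPLE dictionary are illustrative only and not part of any GBIF client. The sketch assumes every crawl_attempt tag uses the same timestamp format and +0000 offset as the excerpt, so the "created" strings can be compared as text. Whether this tag is the right indicator of a completed harvest is exactly what is being asked in this thread.

```python
import json
import urllib.request

API_ROOT = "http://api.gbif.org/v1"  # base URL as quoted in this thread


def dataset_url(dataset_key):
    """Build the dataset URL in the form used in this thread."""
    return f"{API_ROOT}/dataset/{dataset_key}"


def fetch_dataset(dataset_key, timeout=30):
    """Download and decode the dataset record (requires network access)."""
    with urllib.request.urlopen(dataset_url(dataset_key), timeout=timeout) as resp:
        return json.load(resp)


def latest_crawl_attempt(dataset):
    """Return the crawl_attempt machine tag with the latest "created" value, or None.

    Assumption: all tags share the timestamp format and +0000 offset seen in
    the excerpt, so comparing the strings orders them chronologically.
    """
    tags = [
        tag for tag in dataset.get("machineTags", [])
        if tag.get("namespace") == "crawler.gbif.org"
        and tag.get("name") == "crawl_attempt"
    ]
    return max(tags, key=lambda tag: tag["created"], default=None)


# Fields copied from the excerpt above; elided parts are simply left out.
SAMPLE = {
    "key": "10734a60-7ed1-11df-8c4a-0800200c9a66",
    "modified": "2017-01-19T18:16:28.844+0000",
    "machineTags": [
        {
            "key": 606300,
            "namespace": "crawler.gbif.org",
            "name": "crawl_attempt",
            "value": "45",
            "createdBy": "crawler.gbif.org",
            "created": "2017-01-19T18:17:29.063+0000",
        }
    ],
}
```

For a live record, latest_crawl_attempt(fetch_dataset("10734a60-7ed1-11df-8c4a-0800200c9a66")) would do the same against the API; with SAMPLE it returns the single tag shown in the excerpt.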
1) In the example above, there is a one-minute difference between these two values: "modified": "DATETIME" and "machineTags": [{"name": "crawl_attempt", "created": "DATETIME"}].
Does this mean the last harvest of the dataset began immediately after the IPT was updated? (As I said, I don't see the "status" tag you mentioned.)
2) I believe the "modified" datetime comes from the IPT server clock, and the crawl_attempt "created" datetime comes from the crawler machine's clock. Is this correct? If so, and the IPT server clock is slightly ahead of the correct time, a script comparing the two datetimes could wrongly conclude that the GBIF portal data is outdated (because the reported crawl_attempt datetime is slightly earlier than the "modified" datetime).
So my question is: can I somehow request the crawler's current clock time through the API, to compare it with the IPT server clock? And likewise, can I request the current clock time from a given IPT server? Then, if the clocks are not synchronized, the script could take the difference into account when comparing the "modified" and crawl_attempt datetimes.
3) What does "pubDate" represent? It looks odd to me that it always shows the same time (23:00:00).
4) I am curious about the API request /dataset/{UUID}/crawl. The API documentation says it "Schedules a new crawl of the dataset".
In which cases is this request necessary, and which user credentials should be used? Could I see an example curl request? I suppose it is not intended for crawling IPT servers, since I understood that crawls happen immediately when publishing through the IPT (unless there are unexpected delays due to work being done on the data portal).
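To make questions 1 and 2 concrete, here is a small Python sketch of the comparison described there: it parses the two timestamps, computes the gap between them, and only reports the crawl as predating "modified" when the gap exceeds a tolerance. The function names and the five-minute default tolerance are arbitrary choices for illustration. Which clocks actually produce these values is still the open question, so this shows the arithmetic only and is not a confirmed way to detect an outdated dataset.

```python
from datetime import datetime, timedelta

# Matches the timestamps quoted in this thread, e.g. "2017-01-19T18:16:28.844+0000".
GBIF_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_gbif_time(text):
    """Parse a timestamp in the format shown in the API excerpts above."""
    return datetime.strptime(text, GBIF_TIME_FORMAT)


def crawl_lag(modified, crawl_created):
    """Time from "modified" to the crawl_attempt "created" value.

    Positive when the crawl attempt is later than "modified",
    negative when it is earlier.
    """
    return parse_gbif_time(crawl_created) - parse_gbif_time(modified)


def crawl_started_before_modification(modified, crawl_created,
                                      tolerance=timedelta(minutes=5)):
    """True only if the crawl attempt is earlier than "modified" by more than
    the tolerance. The tolerance is a hypothetical allowance for clock offset
    between the machines involved; five minutes is an arbitrary default.
    """
    return crawl_lag(modified, crawl_created) < -tolerance


# Values from the second example request in this thread.
lag = crawl_lag("2017-01-19T18:16:28.844+0000", "2017-01-19T18:17:29.063+0000")
print(lag.total_seconds())  # 60.219
```

With the values from the first message ("modified" on 2017-02-16 and a crawl_attempt created on 2017-01-05), crawl_started_before_modification returns True, while a gap of about a minute in either direction stays within the tolerance and returns False.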
Many thanks in advance
David
On 19 February 2017 at 14:43, Tim Robertson trobertson@gbif.org wrote:
Hi David,
The crawl attempt is the start of the crawl with the publisher. Once the status confirms it is finished, then you could use that. While it is crawling, the status will be something like “RUNNING” (off the top of my head)
Thanks, Tim
Participants (2): Herbario SANT, Tim Robertson