[API-users] how to check last date-time dataset was harvested?

Herbario SANT sant.herbarium at gmail.com
Sun Feb 19 22:11:22 CET 2017


Thanks Tim, but I don't understand what you meant because I don't see any
"status" in the api response.

Anyway, I have new api-related questions.
See the parts I extracted from this example request:

http://api.gbif.org/v1/dataset/10734a60-7ed1-11df-8c4a-0800200c9a66

{
  "key": "10734a60-7ed1-11df-8c4a-0800200c9a66",
  "installationKey": "86e4d50b-d77c-4731-99fb-b3e2a2a83163",
  "publishingOrganizationKey": "def87a70-0837-11d9-acb2-b8a03c50a862",
(...)
  "lockedForAutoUpdate": false,
  "createdBy": "registry-migration.gbif.org",
  "modifiedBy": "crawler.gbif.org",
  "created": "2010-05-03T22:02:18.000+0000",
  *"modified": "2017-01-19T18:16:28.844+0000",*
  (...)
  *"machineTags":* [
    {
      "key": 606300,
      "namespace": "crawler.gbif.org",
      *"name": "crawl_attempt",*
      "value": "45",
      "createdBy": "crawler.gbif.org",
      *"created": "2017-01-19T18:17:29.063+0000"*
    },
    (...)
  ],
  (...)
  "dataLanguage": "eng",
  *"pubDate": "2017-01-18T23:00:00.000+0000",*
  (...)
}

I want to find out programmatically whether a published dataset in the
GBIF portal is up to date with its current version on the IPT server.

1) - In the example above, there is a 1-minute difference between these two
values:
*"modified":"DATETIME"*
"machineTags":[{"name":*"crawl_attempt","created":"DATETIME"*}]

Does this mean the last dataset harvest began immediately after the IPT was
updated?
(as I said, I don't see the *"status"* tag you mentioned).
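
To make the 1-minute difference concrete, here is a minimal Python sketch
(my own, not taken from the GBIF documentation) that parses the two
timestamps from the example response above and computes the gap between
them. The helper names parse_gbif_time and last_crawl_attempt are
hypothetical, and the sample dict contains only the fields quoted above.

```python
from datetime import datetime

# Format of the timestamps seen in the example response,
# e.g. "2017-01-19T18:16:28.844+0000".
GBIF_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_gbif_time(text):
    """Parse a timestamp string into a timezone-aware datetime."""
    return datetime.strptime(text, GBIF_TIME_FORMAT)


def last_crawl_attempt(dataset):
    """Return the newest crawl_attempt tag 'created' time, or None."""
    times = [
        parse_gbif_time(tag["created"])
        for tag in dataset.get("machineTags", [])
        if tag.get("namespace") == "crawler.gbif.org"
        and tag.get("name") == "crawl_attempt"
    ]
    return max(times) if times else None


# Only the fields quoted in the example response above.
dataset = {
    "modified": "2017-01-19T18:16:28.844+0000",
    "machineTags": [
        {
            "namespace": "crawler.gbif.org",
            "name": "crawl_attempt",
            "value": "45",
            "created": "2017-01-19T18:17:29.063+0000",
        }
    ],
}

modified = parse_gbif_time(dataset["modified"])
crawl = last_crawl_attempt(dataset)
gap_seconds = (crawl - modified).total_seconds()
print(gap_seconds)  # seconds between "modified" and the crawl_attempt tag
```

In a real script the dict would come from the JSON returned by the
dataset URL shown above; it is hard-coded here so the sketch runs offline.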

2) - I believe the "modified" datetime comes from the *IPT server clock*,
and the crawl_attempt "created" datetime comes from the *crawler machine
clock*.
Is this correct?
If so, and the IPT server clock is running a bit ahead, a script that
compares the two datetimes could wrongly conclude that the GBIF portal
info is outdated (because the reported crawl_attempt datetime would be a
bit earlier than the modified datetime).

So, my question is: can I somehow request the crawler's current clock time
through the API, to compare it with the IPT server clock?
And likewise: can I somehow request the current clock time from a given IPT
server?
That way, if the clocks are not synchronized, the script can take the
difference into account when comparing the modified and crawl_attempt
datetimes.
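
The comparison I have in mind could be sketched in Python roughly as
follows. This is only an illustration under my own unconfirmed assumption
that "modified" is stamped by the IPT clock; the function name
looks_outdated, the ipt_clock_offset parameter and the tolerance value are
all hypothetical, and the offset would have to come from somewhere I have
not found yet.

```python
from datetime import datetime, timedelta

GBIF_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_gbif_time(text):
    """Parse a timestamp such as '2017-01-19T18:16:28.844+0000'."""
    return datetime.strptime(text, GBIF_TIME_FORMAT)


def looks_outdated(modified, crawl_attempt,
                   ipt_clock_offset=timedelta(0),
                   tolerance=timedelta(minutes=5)):
    """Guess whether the portal copy is older than the IPT version.

    ipt_clock_offset: how far the IPT clock runs ahead of the crawler
    clock (hypothetical input, negative if it runs behind).
    tolerance: slack allowed before the dataset is reported as outdated.
    """
    # Assumption: "modified" was stamped by the IPT clock, so remove
    # the known offset before comparing against the crawler timestamp.
    corrected_modified = modified - ipt_clock_offset
    return corrected_modified > crawl_attempt + tolerance


modified = parse_gbif_time("2017-01-19T18:16:28.844+0000")
crawl = parse_gbif_time("2017-01-19T18:17:29.063+0000")
print(looks_outdated(modified, crawl))
```

The tolerance is there so that small clock differences do not trigger a
false "outdated" result even when the exact offset is unknown.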

3) - What does *"pubDate"* represent? It looks odd to me that it always
shows the same time (23:00:00).

4) - I am curious about the API request */dataset/{UUID}/crawl*
  The API documentation says it "Schedules a new crawl of the dataset".

  In which cases is this request necessary, and which user credentials
should be used?  Can I see an example curl request?
I suppose it is not intended for crawling IPT servers, since I understood
that crawls start immediately when publishing through an IPT (unless there
are unexpected delays caused by necessary work on the data portal).

Many thanks in advance

David


On 19 February 2017 at 14:43, Tim Robertson <trobertson at gbif.org> wrote:

> Hi David,
>
> The crawl attempt is the start of the crawl with the publisher.
> Once the status confirms it is finished, then you could use that.  While
> it is crawling, the status will be something like “RUNNING” (off the top of
> my head)
>
> Thanks,
> Tim
>
>
> From: API-users <api-users-bounces at lists.gbif.org> on behalf of Herbario
> SANT <sant.herbarium at gmail.com>
> Date: Sunday 19 February 2017 at 14:38
> To: "api-users at lists.gbif.org" <api-users at lists.gbif.org>
> Subject: [API-users] how to check last date-time dataset was harvested?
>
> Hi again.
>
> Is there any way to use GBIF portal API to know the last date & time a
> certain resource was harvested by GBIF? (basically, to know if it has the
> latest info provided by the original source, or it could be outdated)
>
> I suspect the info is at this URL, but I don't know how to interpret it
> all:
>
> http://api.gbif.org/v1/dataset/df3eab30-0837-11d9-acb2-b8a03c50a862
>
> I guess this is the IPT updated time:
>
> "modified": "2017-02-16T22:07:41.127+0000"
>
> Perhaps this is what I am looking for?
>
> "machineTags":[{..."name":"crawl_attempt",..."created":"2017-01-05T12:07:12.525+0000"},...]
>
> Thanks a lot
> David
>
> --
> David García San León
> Herbario SANT
> Universidade de Santiago de Compostela
>
>


-- 
David García San León
(dixitalización / control de fondos)
Herbario SANT
Facultade de Farmacia - Laboratorio de Botánica
Universidade de Santiago de Compostela
15782 - Galicia (Spain)
http://www.usc.es/herbario
Tel. +34 881815022
Fax +34 981594912
Skype: herbarium_sant
Twitter: @SANT_Herbarium