[IPT] Publication of big dataset fails, can't find why

Peter Desmet peter.desmet at inbo.be
Wed Mar 2 16:18:49 CET 2016


Thanks Laura,

If we do some shell magic, we see that the first column of
occurrence.txt in the generated dwca-45.4.zip always contains an ID,
so it seems no lines were shifted.

time cut -f 1 occurrence.txt | sort | uniq -c > ids.txt

This command takes the first column of occurrence.txt (i.e. the
occurrenceID), sorts the values, collapses them into a unique list with
a count per value (to discover duplicates) and writes them to ids.txt as:

   1 INBO:FLORA:00000001
   1 INBO:FLORA:00000002
   1 INBO:FLORA:00000003
   1 INBO:FLORA:00000004
   1 INBO:FLORA:00000005

We expect a count of 1 for every record and the same number of lines
in ids.txt as in occurrence.txt. That is the case.
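
The same check can be scripted so that nobody has to eyeball ids.txt.
This is only a minimal sketch, assuming the file is tab-delimited with
the occurrenceID in the first column; check_ids is a hypothetical helper
name, not something the IPT provides. It reports lines with an empty
first column (the symptom of a stray line break shifting a record) and
IDs that occur more than once:

```shell
# Hypothetical helper (not part of the IPT): check_ids FILE
# Assumes FILE is tab-delimited with the ID in the first column.
check_ids() {
    file=$1
    # Lines whose first column is empty, e.g. a record split by a line break.
    empty=$(awk -F '\t' '$1 == "" { n++ } END { print n + 0 }' "$file")
    # Distinct IDs that appear on more than one line.
    dupes=$(cut -f 1 "$file" | sort | uniq -d | wc -l | tr -d '[:space:]')
    echo "empty=$empty duplicates=$dupes"
    # Succeed only when both counts are zero.
    [ "$empty" -eq 0 ] && [ "$dupes" -eq 0 ]
}
```

Calling check_ids occurrence.txt prints both counts and returns a
non-zero exit status if either is above zero.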

Peter

On Wed, Mar 2, 2016 at 4:10 PM, Laura Russell <larussell at vertnet.org> wrote:
> I’ve had publishing fail before when there was a line break (or maybe a
> vertical tab, can’t remember which) in a remarks field that then made a new
> line causing that new line to have a null occurrenceID.  Any chance
> something like that could be causing the failure?
>
> Laura Russell
> VertNet Programmer/iDigBio Data Mobilization Specialist
>
> phone: +01 785 813-1496
> email: larussell at vertnet.org
> Skype: laura.anne.russell
> Hangouts: larussell at vertnet.org
>
> url: www.vertnet.org
> url: www.idigbio.org
>
>
> From: IPT <ipt-bounces at lists.gbif.org> on behalf of Peter Desmet
> <peter.desmet at inbo.be>
> Date: Wednesday, March 2, 2016 at 8:59 AM
> To: <ipt at lists.gbif.org>
> Cc: Stijn Van Hoey <stijn.vanhoey at inbo.be>
> Subject: [IPT] Publication of big dataset fails, can't find why
>
> Hi,
>
> We're trying to publish version 45.4 of this dataset:
> http://data.inbo.be/ipt/resource?r=florabank1-occurrences, but the
> validation seems to fail. Here's the publication log:
> http://data.inbo.be/ipt/publicationlog.do?r=florabank1-occurrences
>
> As far as I can tell, the validation fails on "the core ID field
> occurrenceID is always present and unique", but we have verified this
> in the generated dwca-45.4.zip file, and all records have a unique
> occurrenceID.
>
> Any idea what might be going on? Possible causes:
>
> 1. The dataset is quite big (3.5 million records)
> 2. We've just solved this issue:
> https://github.com/LifeWatchINBO/data-publication/issues/104 by
> following Kyle Braak's instructions. The latest published version is
> now 45.3, the current (to be published) version is 45.4, so everything
> seems fine there.
> 3. Even though the publication failed, the following files are created
> in /resource/florabank1-occurrences:
>
> eml-45.4.xml
> dwca-45.4.zip
> florabank1-occurrences-45.4.rtf
>
> Will those files be overwritten if I try to republish, or might they be
> causing the publication to fail?
>
> Thanks,
>
> Peter
> _______________________________________________
> IPT mailing list
> IPT at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/ipt
>

