Thanks Laura,
If we do some shell magic, we see that the first column of the generated dwca-45.4 always contains an ID, so it seems there was no line shifting.
time cut -f 1 occurrence.txt | sort | uniq -c > ids.txt
This command takes the first column of occurrence.txt (i.e. the occurrenceID), sorts the values, counts how often each one occurs (to discover duplicates) and writes the result as:
1 INBO:FLORA:00000001
1 INBO:FLORA:00000002
1 INBO:FLORA:00000003
1 INBO:FLORA:00000004
1 INBO:FLORA:00000005
We expect a count of 1 for every ID, and ids.txt should have the same number of lines as occurrence.txt. That is the case.
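For anyone who wants to repeat the check in a single pass, here is a rough awk sketch. It is not what I ran above: check_ids is a made-up helper name, and it assumes the file is tab-separated with one header row. Besides empty and duplicated IDs, it also counts lines whose number of fields differs from the header, which is what a stray line break in a remarks field would produce.

```shell
# Hypothetical helper (not part of the IPT): summarise ID problems in a
# tab-separated file. Assumes the first line is a header row.
check_ids() {
  awk -F'\t' '
    NR == 1      { nf = NF; next }      # remember header width, skip header
                 { n++ }                # count data records
    NF != nf     { shifted++ }          # field count differs from header
    $1 == ""     { empty++; next }      # first column (the ID) is empty
    seen[$1]++ == 1 { dup++ }           # count each duplicated ID once
    END {
      printf "records=%d empty=%d duplicated=%d shifted=%d\n",
             n, empty, dup, shifted
    }
  ' "$1"
}
```

Running check_ids occurrence.txt should report zeros for empty, duplicated and shifted if the file is clean. Note that awk keeps every ID in memory, so for a few million records it needs a fair amount of RAM.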
Peter
On Wed, Mar 2, 2016 at 4:10 PM, Laura Russell larussell@vertnet.org wrote:
I’ve had publishing fail before when there was a line break (or maybe a vertical tab, can’t remember which) in a remarks field. It started a new line, and that new line then had a null occurrenceID. Any chance something like that could be causing the failure?
Laura Russell
VertNet Programmer/iDigBio Data Mobilization Specialist
phone: +01 785 813-1496
email: larussell@vertnet.org
Skype: laura.anne.russell
Hangouts: larussell@vertnet.org
url: www.vertnet.org
url: www.idigbio.org
From: IPT ipt-bounces@lists.gbif.org on behalf of Peter Desmet peter.desmet@inbo.be
Date: Wednesday, March 2, 2016 at 8:59 AM
To: ipt@lists.gbif.org
Cc: Stijn Van Hoey stijn.vanhoey@inbo.be
Subject: [IPT] Publication of big dataset fails, can't find why
Hi,
We're trying to publish version 45.4 of this dataset: http://data.inbo.be/ipt/resource?r=florabank1-occurrences, but the validation seems to fail. Here's the publication log: http://data.inbo.be/ipt/publicationlog.do?r=florabank1-occurrences
As far as I can tell, the validation fails on "the core ID field occurrenceID is always present and unique", but we have verified this in the generated dwca-45.4.zip file, and all records have a unique occurrenceID.
Any idea what might be going on? Possible causes:
1. The dataset is quite big (3.5 million records).
2. We've just solved this issue: https://github.com/LifeWatchINBO/data-publication/issues/104 by following Kyle Braak's instructions. The latest published version is now 45.3, the current (to be published) version is 45.4, so everything seems fine there.
3. Even though the publication failed, the following files are created in /resource/florabank1-occurrences:
eml-45.4.xml
dwca-45.4.zip
florabank1-occurrences-45.4.rtf
Will those files be overwritten if I try to republish, or might they be causing the publication to fail?
Thanks,
Peter
_______________________________________________
IPT mailing list
IPT@lists.gbif.org
http://lists.gbif.org/mailman/listinfo/ipt