Publication of big dataset fails, can't find why
Hi,
We're trying to publish version 45.4 of this dataset: http://data.inbo.be/ipt/resource?r=florabank1-occurrences, but the validation seems to fail. Here's the publication log: http://data.inbo.be/ipt/publicationlog.do?r=florabank1-occurrences
As far as I can tell, the validation fails on "the core ID field occurrenceID is always present and unique", but we have verified this in the generated dwca-45.4.zip file, and all records have a unique occurrenceID.
Any idea what might be going on? Possible causes:
1. The dataset is quite big (3.5 million records).
2. We've just solved this issue: https://github.com/LifeWatchINBO/data-publication/issues/104 by following Kyle Braak's instructions. The latest published version is now 45.3, the current (to be published) version is 45.4, so everything seems fine there.
3. Even though the publication failed, the following files are created in /resource/florabank1-occurrences:
   eml-45.4.xml
   dwca-45.4.zip
   florabank1-occurrences-45.4.rtf
Will those files be overwritten if I try to republish, or might they be causing the publication to fail?
Thanks,
Peter
I've had publishing fail before when there was a line break (or maybe a vertical tab, can't remember which) in a remarks field that then made a new line, causing that new line to have a null occurrenceID. Any chance something like that could be causing the failure?
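One way to test for the kind of breakage described here is to compare the number of tab-separated fields on every line against the first line, and to count lines that contain a vertical tab or carriage return. This is only a minimal sketch: the function names are made up for illustration, and it assumes a tab-delimited, unquoted data file such as occurrence.txt whose first line has the expected number of columns.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the IPT.

# Report every line whose tab-separated field count differs from line 1.
# An embedded line break splits one record into two short lines, so the
# broken halves are listed here. Exits non-zero if any line is reported.
check_field_counts() {
    awk -F'\t' '
        NR == 1 { expected = NF; next }
        NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}

# Print the number of lines that contain a vertical tab (octal 013)
# or a carriage return (octal 015).
count_control_chars() {
    LC_ALL=C grep -c "$(printf '[\013\015]')" "$1"
}
```

Calling `check_field_counts occurrence.txt` prints nothing and succeeds when every line has the same number of columns; `count_control_chars occurrence.txt` prints 0 when no such characters are present.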
Laura Russell VertNet Programmer/iDigBio Data Mobilization Specialist
phone: +01 785 813-1496 email: larussell@vertnet.org Skype: laura.anne.russell Hangouts: larussell@vertnet.org
url: www.vertnet.org url: www.idigbio.org
From: IPT ipt-bounces@lists.gbif.org on behalf of Peter Desmet peter.desmet@inbo.be Date: Wednesday, March 2, 2016 at 8:59 AM To: ipt@lists.gbif.org Cc: Stijn Van Hoey stijn.vanhoey@inbo.be Subject: [IPT] Publication of big dataset fails, can't find why
Thanks Laura,
If we do some shell magic, we see that the first column of the generated dwca-45.4 always contains an ID, so it seems there was no line shifting.
time cut -f 1 occurrence.txt | sort | uniq -c > ids.txt
This command takes the first column of occurrence.txt (i.e. the occurrenceID), sorts the values, counts how often each one occurs (to reveal duplicates) and writes the result as:
1 INBO:FLORA:00000001
1 INBO:FLORA:00000002
1 INBO:FLORA:00000003
1 INBO:FLORA:00000004
1 INBO:FLORA:00000005
We expect to have 1 for all records and to have the same number of lines as occurrence.txt. That is the case.
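With 3.5 million lines, scanning the counts by eye is not practical, so the same check can be narrowed to print only the problem cases. The following is a sketch under the same assumptions as the command above (tab-delimited file, ID in the first column); the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the IPT.

# Print the number of lines whose first (ID) column is empty.
count_empty_ids() {
    awk -F'\t' '$1 == "" { n++ } END { print n + 0 }' "$1"
}

# Print each ID that occurs more than once, one ID per line.
# uniq -d only outputs values that are repeated in the sorted input.
list_duplicate_ids() {
    cut -f 1 "$1" | LC_ALL=C sort | uniq -d
}
```

If `count_empty_ids occurrence.txt` prints 0 and `list_duplicate_ids occurrence.txt` prints nothing, every line has an ID and no ID is repeated.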
Peter
On Wed, Mar 2, 2016 at 4:10 PM, Laura Russell larussell@vertnet.org wrote:
I’ve had publishing fail before when there was a line break (or maybe a vertical tab, can’t remember which) in a remarks field that then made a new line causing that new line to have a null occurrenceID. Any chance something like that could be causing the failure?
Please look at the following line in the log which explains why the validation has stopped unexpectedly:
No space left on device
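A quick way to confirm this on the server is to search the log for the message and to look at the free space on the filesystem the IPT writes to. This is a rough sketch only: the function names are hypothetical, and the paths you pass in (the log file and the IPT data directory) depend on your installation and are not given in this thread.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the IPT.

# Succeed if the given log file contains the out-of-space message.
log_reports_full_disk() {
    grep -q 'No space left on device' "$1"
}

# Print the available space, in kilobytes, on the filesystem that holds
# the given path. df -P gives the portable one-line-per-filesystem layout,
# where the fourth column is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For example, `free_kb /path/to/ipt-data` (with your own data directory substituted) shows how much room is left before republishing. Publishing writes temporary copies of the data files next to the archive, so the free space likely needs to be well above the size of the final zip.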
-- Menashè Eliezer
Thanks all! Should have read through the whole exception stack.
On Wed, Mar 2, 2016 at 4:26 PM, Menashe' Eliezer menashe.eliezer@gmail.com wrote:
Please look at the following line in the log which explains why the validation has stopped unexpectedly:
No space left on device
I saw something similar recently when I upgraded a 2.0 IPT instance to 2.3.2. The old IPT instance had resources that were using an old Audubon Core media extension. The issue presented the same way as yours, as an apparent failure at "the core ID field occurrenceID", but with a NullPointerException behind it. Apparently the two versions of the Audubon Core media extension pointed to different columns as the core ID, and the mapping shown in the UI did not correspond to the mapping that was actually being used.
Not the same as what you are seeing (NullPointerException and no stack trace, instead of an IOException and a stack trace), but the initial presentation apparently at validating core ID values was similar.
Here are some of my notes on the issue:
"Process creates temporary file occurrence.txt, then multimedia.txt, then shows publishing log messages about checking for IDs, then generates a sorted_occurrence.txt file, which ends up at the same size as occurrence.txt, eml file and metadata are written to the temporary directory, and then the process fails with a message that mentions a NullPointerException, but doesn't give a source. Tomcat log files don't appear to contain the exception or a stack trace, though tomcat does appear to be configured to log at that level."
"Increased logging through debug mode in IPT is not informative. Sort of occurrence.txt appears to succeed, then exception is thrown some time around generation of eml file, with no message in the log files."
"Cloned IPT source from github, checked out commit for current 2.3.2 release, got the release to build (had to edit the pom for com.lowagie:itext:4.2.2 for maven to build, the pom includes a redirect to a newer source for itext (which doesn't work as rtf generation has been removed...)). Once build was working, located source of the minimal human readable failure message (ResourceManagerImpl.java line 750 "dwca.failed"). Added a log4j Logger to ResourceManagerImpl, then log.error(e.getMessage(),e); to the catch (ExecutionException e) block about line 745. Then tried to republish the main source. This threw the null pointer exception, with an informative stack trace that pointed to the mapping of occurrenceId in the audubon core extension being incorrect (there may be a bug in the update of the extension). Deleted and remapped the audubon core mapping from the [failing] resources, publication of each was then successful. List of installed extensions shows two installed audubon core media extensions [with the same name]"
"Audubon Core media extension update appeared to have same name, but wasn't compatible with the installed version (it doesn't flatten the URIs of different types to allow mapping of both a thumbnail and a best quality image). Unmapped the older one from both resources [that were using it], deleted it, and remapped both resources onto the current audubon core extension"
-Paul
participants (4)
- Laura Russell
- Menashe' Eliezer
- Paul J. Morris
- Peter Desmet