[IPT] Publication of big dataset fails, can't find why

Paul J. Morris mole at morris.net
Wed Mar 2 16:58:28 CET 2016


I saw something similar recently when I upgraded a 2.0 IPT instance to
2.3.2. The old IPT instance had resources that were using an old
Audubon Core media extension. The issue presented as the same apparent
failure at "the core ID field occurrenceID" that you are seeing, with a
NullPointerException behind it. The two versions of the Audubon Core
media extension apparently pointed to different columns as the core ID,
and the mapping shown in the UI did not appear to correspond to the
mapping that was actually being used.

This is not exactly what you are seeing (I had a NullPointerException
and no stack trace, where you have an IOException and a stack trace),
but the initial presentation, an apparent failure while validating core
ID values, was similar.

Here are some of my notes on the issue:

"Process creates temporary file occurrence.txt, then multimedia.txt,
then shows publishing log messages about checking for IDs, then
generates a sorted_occurrence.txt file, which ends up the same size as
occurrence.txt. The EML file and metadata are written to the temporary
directory, and then the process fails with a message that mentions a
NullPointerException but doesn't give a source. Tomcat log files don't
appear to contain the exception or a stack trace, though Tomcat does
appear to be configured to log at that level."

"Increased logging through debug mode in IPT is not informative. The
sort of occurrence.txt appears to succeed, then an exception is thrown
some time around generation of the EML file, with no message in the log
files."

"Cloned IPT source from GitHub, checked out the commit for the current
2.3.2 release, and got the release to build (had to edit the pom for
com.lowagie:itext:4.2.2 for Maven to build; the pom includes a redirect
to a newer source for itext, which doesn't work as RTF generation has
been removed...). Once the build was working, located the source of the
minimal human-readable failure message (ResourceManagerImpl.java line
750 "dwca.failed"). Added a log4j Logger to ResourceManagerImpl, then
log.error(e.getMessage(),e); to the catch (ExecutionException e) block
at about line 745. Then tried to republish the main source. This threw
the NullPointerException, with an informative stack trace that pointed
to the mapping of occurrenceID in the Audubon Core extension being
incorrect (there may be a bug in the update of the extension). Deleted
the Audubon Core mapping from the [failing] resources and remapped it;
publication of each was then successful. The list of installed
extensions shows two installed Audubon Core media extensions [with the
same name]."
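
A minimal sketch of the kind of logging change described in that note,
not the actual IPT code. It assumes a task run through an
ExecutorService whose failure surfaces as an ExecutionException, and it
uses java.util.logging in place of log4j only so the example is
self-contained. The class name PublishFailureLogging and the methods
rootCause and runTask are hypothetical names made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PublishFailureLogging {
    // java.util.logging stands in for the log4j Logger mentioned in the notes.
    private static final Logger LOG =
            Logger.getLogger(PublishFailureLogging.class.getName());

    /** Walks the cause chain and returns the innermost throwable. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Hypothetical stand-in for running a publishing task in an executor. */
    public static String runTask(Callable<String> task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(task);
            return future.get();
        } catch (ExecutionException e) {
            // Passing the exception itself (not just its message) to the
            // logger is what makes the full stack trace, including the
            // wrapped cause, appear in the log.
            LOG.log(Level.SEVERE, e.getMessage(), e);
            return "failed: " + rootCause(e).getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Simulates a task that dereferences a missing (null) column mapping.
        String result = runTask(() -> {
            String mappedColumn = null;
            return mappedColumn.trim();
        });
        System.out.println(result);
    }
}
```

The point of the sketch is that an exception thrown inside an executor
task is wrapped in an ExecutionException, so a catch block that reports
only a generic failure message hides the underlying cause unless the
exception object is handed to the logger.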

"The Audubon Core media extension update appeared to have the same
name, but wasn't compatible with the installed version (it doesn't
flatten the URIs of different types to allow mapping of both a
thumbnail and a best quality image). Unmapped the older one from both
resources [that were using it], deleted it, and remapped both resources
onto the current Audubon Core extension."

-Paul

On Wed, 2 Mar 2016 15:59:39 +0100
Peter Desmet <peter.desmet at inbo.be> wrote:

> Hi,
> 
> We're trying to publish version 45.4 of this dataset:
> http://data.inbo.be/ipt/resource?r=florabank1-occurrences, but the
> validation seems to fail. Here's the publication log:
> http://data.inbo.be/ipt/publicationlog.do?r=florabank1-occurrences
> 
> As far as I can tell, the validation fails on "the core ID field
> occurrenceID is always present and unique", but we have verified this
> in the generated dwca-45.4.zip file, and all records have a unique
> occurrenceID.
> 
> Any idea what might be going on? Possible causes:
> 
> 1. The dataset is quite big (3.5 million records)
> 2. We've just solved this issue:
> https://github.com/LifeWatchINBO/data-publication/issues/104 by
> following Kyle Braak's instructions. The latest published version is
> now 45.3, the current (to be published) version is 45.4, so everything
> seems fine there.
> 3. Even though the publication failed, the following files are created
> in /resource/florabank1-occurrences:
> 
> eml-45.4.xml
> dwca-45.4.zip
> florabank1-occurrences-45.4.rtf
> 
> Will those files be overwritten if I try to republish or might they be
> causing the publication to fail?
> 
> Thanks,
> 
> Peter
> _______________________________________________
> IPT mailing list
> IPT at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/ipt


-- 
Paul J. Morris
Biodiversity Informatics Manager
Museum of Comparative Zoölogy, Harvard University
mole at morris.net  AA3SD  PGP public key available

