coreid (lowercase "i") vs coreId in meta.xml - schema validation
I found some oddities and I am not exactly sure where to go next.
We are noticing the following while processing meta.xml in Darwin Core archives produced by IPT (and other servers):
Schema validation failed, continuing unvalidated XMLSyntaxError: Element '{http://rs.tdwg.org/dwc/text/}coreid': This element is not expected. Expected is ( {http://rs.tdwg.org/dwc/text/}coreId )
It seems like most consumers are not actually validating meta.xml using the schema, and the producers are generating files out of compliance with the schema.
Most of the Darwin Core archives I have manually inspected and tried to validate contain meta.xml with lowercase "i" in coreid despite the Standard indicating capital "I" in coreId.
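To make the manual inspection above repeatable, here is a minimal sketch that reports which spelling a meta.xml uses inside each <extension>. It relies only on the Python standard library. The sample document and the function name core_id_spellings are hypothetical, made up for illustration; only the namespace comes from the error message above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the validation error message.
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"

# Hypothetical meta.xml, trimmed to the parts that matter here.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/Multimedia">
    <files><location>multimedia.txt</location></files>
    <coreid index="0"/>
  </extension>
</archive>
"""


def core_id_spellings(meta_xml):
    """Return the spelling of the core-id element found in each <extension>."""
    root = ET.fromstring(meta_xml)
    spellings = []
    for ext in root.iter(f"{{{DWC_TEXT_NS}}}extension"):
        for child in ext:
            # Tags look like "{namespace}localname"; keep the local part only.
            local = child.tag.rsplit("}", 1)[-1]
            if local.lower() == "coreid":
                spellings.append(local)
    return spellings


print(core_id_spellings(SAMPLE_META))  # prints ['coreid']
```

Running this over a batch of archives gives a quick count of how many use each spelling, without involving the schema at all.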
I poked at the GBIF Darwin Core Validator 3 code repo and found this:
schema.meta=https://raw.githubusercontent.com/tdwg/dwc/master/standard/documents/text/td...
The first link leads to 404, the second leads to an xsd that contains the proper coreId. So maybe the Validator is not being "strict" about validation against the schema?
Digging around, I found some "historic" evidence of changes that occurred when TDWG migrated from Google Code to GitHub and moved their services around, and that GBIF might have "grabbed a copy" from somewhere while waiting for the public site to become available again.
The current "published" schema contains coreId:
http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd
The current "master" branch contains coreId:
https://github.com/tdwg/dwc/blob/master/docs/text/tdwg_dwc_text.xsd
The "gh-pages" branch (one would think this is being used to generate the website, but apparently not) includes the lowercase coreid:
https://github.com/tdwg/dwc/blob/gh-pages/text/tdwg_dwc_text.xsd
So I am wondering why the GBIF Validator isn't noticing this. And why is IPT generating meta.xml that does not agree with the schema?
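The comparison of the three schema copies above can be sketched as a small check of which spelling an XSD declares. This is only a sketch using the Python standard library. The XSD fragment below is hypothetical and much simpler than the real tdwg_dwc_text.xsd, and the function name declared_core_id_names is made up for illustration; in practice the text of each downloaded copy would be passed in.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical XSD fragment, NOT the real tdwg_dwc_text.xsd.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://rs.tdwg.org/dwc/text/"
    elementFormDefault="qualified">
  <xs:complexType name="extensionFileType">
    <xs:sequence>
      <xs:element name="coreId" minOccurs="1" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def declared_core_id_names(xsd_text):
    """Return the sorted set of core-id spellings declared as xs:element names."""
    root = ET.fromstring(xsd_text)
    names = set()
    for element in root.iter(f"{{{XS_NS}}}element"):
        name = element.get("name", "")
        if name.lower() == "coreid":
            names.add(name)
    return sorted(names)


print(declared_core_id_names(SAMPLE_XSD))  # prints ['coreId']
```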
Reference:
2.1.2 Elements

Element: <core>
Description: An <archive> must contain exactly one <core> element, representing the data entity (the actual file and its column header mappings to Darwin Core terms) upon which records are based. If extensions are being used, each record in the core data must have a unique identifier. The field for this identifier must be specified in an explicit <id> field in order to associate extension records with the core record.

Element: <extension>
Description: An <archive> may define zero or more <extension> elements, each representing an individual extension entity directly related to the core. In addition to the general file attributes described below, every extension entity must have an explicit <coreId> field to relate the extension record to a row in the core entity. The extension itself does not have to have a unique ID field and many rows can point to the same core record.
Thanks!
Dan Stoner iDigBio / ACIS Laboratory University of Florida
Hi Dan,
On 12/12/2019 16:30, Stoner, Dan F wrote:
> I found some oddities and I am not exactly sure where to go next.
> We are noticing the following while processing meta.xml in Darwin Core archives produced by IPT (and other servers):
> Schema validation failed, continuing unvalidated XMLSyntaxError: Element '{http://rs.tdwg.org/dwc/text/}coreid': This element is not expected. Expected is ( {http://rs.tdwg.org/dwc/text/}coreId )
There's some background to this on this issue: https://github.com/tdwg/dwc/issues/143
The schema itself and the documentation were conflicting, and this was fixed (in my and Tim's opinion) the wrong way, by changing the schema.
*I've just pushed a commit to fix it the right way,* i.e. reflecting 99% actual usage and leaving the schema as it was for almost a decade.
Although we do accept either, we still see only 31 datasets registered in GBIF with "coreId" rather than "coreid".
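For consumers that want the same tolerance, a minimal sketch of a reader that accepts either spelling might look like the following. This is not GBIF's implementation; the function name extension_core_id_index and the sample element are hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

NS = "{http://rs.tdwg.org/dwc/text/}"


def extension_core_id_index(extension):
    """Return the column index of the core-id field, accepting both spellings.

    Returns None when neither element is present or it has no index attribute.
    """
    for name in ("coreid", "coreId"):
        element = extension.find(NS + name)
        if element is not None:
            index = element.get("index")
            return int(index) if index is not None else None
    return None


# Hypothetical <extension> element using the capitalised spelling.
sample = ET.fromstring(
    '<extension xmlns="http://rs.tdwg.org/dwc/text/"><coreId index="0"/></extension>'
)
print(extension_core_id_index(sample))  # prints 0
```

Trying the lowercase name first reflects the usage figures quoted above, but the order only matters if an extension contains both elements.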
> It seems like most consumers are not actually validating meta.xml using the schema, and the producers are generating files out of compliance with the schema.
> Most of the Darwin Core archives I have manually inspected and tried to validate contain meta.xml with lowercase "i" in coreid despite the Standard indicating capital "I" in coreId.
> I poked at the GBIF Darwin Core Validator 3 code repo and found this:
> schema.meta=https://raw.githubusercontent.com/tdwg/dwc/master/standard/documents/text/td...
> The first link leads to 404, the second leads to an xsd that contains the proper coreId. So maybe the Validator is not being "strict" about validation against the schema?
I suspect it has been running for so long that, when the validator process was originally started, both URLs were valid, and had coreid or one of each.
Cheers,
Matt
Fantastic, I see the updated schema and the meta.xml validation error has gone away.
Thanks!
Dan Stoner iDigBio / ACIS Laboratory University of Florida
________________________________________
From: IPT ipt-bounces@lists.gbif.org on behalf of Matthew Blissett mblissett@gbif.org
Sent: Thursday, December 12, 2019 11:08 AM
To: ipt@lists.gbif.org
Subject: Re: [IPT] coreid (lowercase "i") vs coreId in meta.xml - schema validation