[IPT] [tdwg-content] Reverting the process of DwC standardization

Menashe' Eliezer menashe.eliezer at gmail.com
Thu Oct 29 12:15:01 CET 2015

Please see my updated suggestion at https://github.com/gbif/ipt/issues/1165
IMHO Open Refine is not the right tool. One can simply use org.apache.poi
in a Java application to read all the information from the different
files inside the DwC archive and create an ODS file with the combined matrix,
which also takes any parentEventID into account. I'm sorry I
don't have time to do it myself.
I hope it's clear.
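
A minimal sketch of the kind of join described above, under stated
assumptions. It uses Python's standard csv module rather than
org.apache.poi, and it writes CSV rather than ODS. The sample rows, the
locality values, and the helper names (read_rows, top_level_event,
build_matrix, to_csv) are hypothetical; only the column names are Darwin
Core terms. Counts recorded on a child event are rolled up to the top-level
event via parentEventID.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for the event and occurrence
# files of a Darwin Core Archive (real archives may name them differently).
EVENT_CSV = """eventID,parentEventID,locality
site1,,Forest plot
site1-a,site1,Forest plot subsample A
site2,,Meadow plot
"""

OCCURRENCE_CSV = """occurrenceID,eventID,scientificName,individualCount
o1,site1-a,Apis mellifera,3
o2,site1,Bombus terrestris,1
o3,site2,Apis mellifera,5
o4,site1-a,Bombus terrestris,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def top_level_event(event_id, parent_of):
    """Follow parentEventID links upward until an event has no parent."""
    seen = set()  # guards against accidental cycles in the data
    while parent_of.get(event_id) and event_id not in seen:
        seen.add(event_id)
        event_id = parent_of[event_id]
    return event_id


def build_matrix(events, occurrences):
    """Return a taxa-by-site matrix (list of rows) of summed counts."""
    parent_of = {e["eventID"]: e["parentEventID"] for e in events}
    counts = defaultdict(lambda: defaultdict(int))
    for occ in occurrences:
        site = top_level_event(occ["eventID"], parent_of)
        counts[occ["scientificName"]][site] += int(occ["individualCount"])
    sites = sorted({top_level_event(e["eventID"], parent_of) for e in events})
    header = ["scientificName"] + sites
    rows = [[taxon] + [counts[taxon][s] for s in sites]
            for taxon in sorted(counts)]
    return [header] + rows


def to_csv(matrix):
    """Serialize the matrix as CSV text (one line per row)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(matrix)
    return buffer.getvalue()


matrix = build_matrix(read_rows(EVENT_CSV), read_rows(OCCURRENCE_CSV))
print(to_csv(matrix))
```

Rolling child events up to their top-level parent is only one possible
reading of "takes parentEventID into consideration"; keeping sub-events as
separate columns would be an equally valid choice.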

2015-10-28 18:57 GMT+01:00 Shorthouse, David <david.shorthouse at umontreal.ca>:

> All,
> Is part of the issue being expressed here because the raw ecological data
> sets we're discussing are small-ish matrices rather than occurrences, with
> site codes as columns, taxa as rows and measures of density/abundance as
> cells (and similar for environmental variables)? Such structures are often
> used as input for software that executes, e.g., ordinations, classification &
> regression trees, and species richness estimates. The shortcoming of such a
> structure is the inherent idiosyncratic nature of "site codes", with
> variable numbers of them, i.e. an arbitrary number of columns. I doubt it
> was ever designed for ease of dataset integration, but rather for ease of
> computation. Representing this structure as an Event core requires
> significant transposition, with potential for error if done manually. Open
> Refine is one tool that could permit bi-directional transpositions (DwC ->
> matrix and then matrix -> DwC), but it is still clunky and its accommodation
> of extensions is virtually non-existent. But perhaps Open Refine recipes and
> guides get us one step closer to finding a balance between the need for
> standardized representation & efficient transport (DwC) and end-users who
> want matrices for ease of computation.
> David P. Shorthouse
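
The bi-directional transposition discussed above (matrix -> DwC-style rows
and back) can be sketched in a few lines of Python. This is an illustration
under assumptions, not an Open Refine recipe: the sample matrix, the site
codes, and the function names are hypothetical, and mapping a site code to
eventID and a cell to individualCount is one possible mapping among several.
Zero cells are skipped when going to the long format.

```python
import csv
import io

# Hypothetical ecological matrix: taxa as rows, site codes as columns,
# abundance counts as cells.
MATRIX_CSV = """taxon,S01,S02,S03
Apis mellifera,3,0,5
Bombus terrestris,1,2,0
"""


def matrix_to_long(text):
    """Transpose a taxa-by-site matrix into one record per non-zero cell."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    site_codes = header[1:]  # arbitrary number of site columns
    records = []
    for row in reader:
        taxon, cells = row[0], row[1:]
        for site, cell in zip(site_codes, cells):
            count = int(cell)
            if count > 0:  # assumption: zero means "not recorded"
                records.append({
                    "eventID": site,
                    "scientificName": taxon,
                    "individualCount": count,
                })
    return records


def long_to_matrix(records):
    """Rebuild the taxa-by-site matrix from long-format records."""
    sites = sorted({r["eventID"] for r in records})
    taxa = sorted({r["scientificName"] for r in records})
    lookup = {(r["scientificName"], r["eventID"]): r["individualCount"]
              for r in records}
    header = ["taxon"] + sites
    return [header] + [[t] + [lookup.get((t, s), 0) for s in sites]
                       for t in taxa]


records = matrix_to_long(MATRIX_CSV)
rebuilt = long_to_matrix(records)
```

Because zero cells are dropped, a site or taxon whose cells are all zero
would not survive the round trip; keeping explicit absence records would
avoid that at the cost of many more rows.
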
> On Tue, Oct 27, 2015 at 7:36 AM, David Valentim Dias <dvdias at sibbr.gov.br>
> wrote:
>> Hi again,
>> I think the problem targets both. DwC, because it is a solution to one
>> problem that creates another for researchers less "skilled" in table
>> manipulation. Ecological data with occurrences results in three tables,
>> and manipulating these gets harder with the number of cores or
>> extensions used.
>> Two possible solutions come to mind: create a new term describing the
>> original layout of the columns (so we can use csvjoin, as Menashe suggests),
>> or give the IPT an option to store the original table associated with the
>> resource. We can always use external links in the EML and save the file
>> somewhere, but this means creating another service and managing more logins
>> (i.e. resource cost and new problems).
>> I think any solution will need IPT changes.
>> 2015-10-27 9:08 GMT-02:00 Menashe' Eliezer <menashe.eliezer at gmail.com>:
>>> Hi Tim,
>>> I believe that the IPT feature I've requested long ago could be helpful
>>> for David: https://github.com/gbif/ipt/issues/1165
>>> Neither consumers nor data providers have a DwC-A viewer, so they
>>> need to join the separate CSV files to get one table in a
>>> worksheet.
>>> Web applications like the one on the OBIS website do let end users download
>>> one big table.
>>> Best regards,
>>> Menashè
>>> 2015-10-27 9:53 GMT+01:00 Tim Robertson <trobertson at gbif.org>:
>>>> Hi David
>>>> (CC’ing the IPT list as this might be an IPT specific thread -
>>>> http://lists.gbif.org/mailman/listinfo/ipt)
>>>> For clarification: is your question specific to the DwC-A standard
>>>> (where this is possible, as Alex says), or is it specific to the IPT tool?
>>>> Do you imagine a scenario where you’d effectively map the same
>>>> extension 2 times - once to interpreted and once to verbatim - or do you
>>>> envisage a different data schema for each?
>>>> Thanks,
>>>> Tim
>>>> On 23 Oct 2015, at 16:00, Alex Thompson <godfoder at acis.ufl.edu> wrote:
>>>> David,
>>>> It's certainly possible, within the context of a Darwin Core Archive,
>>>> to include other files within the ZIP file that lie outside the schema of
>>>> the archive. Both GBIF and iDigBio do this when generating downloads for
>>>> various reasons (RIGHTS & LICENSE files, additional EML metadata, etc.).
>>>> However, I do not believe it is possible to do this within IPT. You might
>>>> submit an issue on the IPT issue tracker (
>>>> https://github.com/gbif/ipt/issues) for potential inclusion of this
>>>> feature in a future version of IPT.
>>>> There are workarounds you can use to include additional data in Darwin
>>>> Core archives, but none of them will exactly match your old format. For
>>>> instance, you could include an additional Occurrence file with the values
>>>> as JSON in dynamicProperties, or in some other verbatim format in the
>>>> occurrenceRemarks field. Either would at least give some method of
>>>> single-row access (vs. joining multiple measurementOrFacts to a single
>>>> event ID) if that is the primary concern, even if they would require
>>>> additional parsing steps to be useful.
>>>> Alex Thompson
>>>> iDigBio Infrastructure
>>>> On 10/23/2015 09:40 AM, David Valentim Dias wrote:
>>>> Dear colleagues,
>>>> Here at SiBBr we're using the new eventCore and measurementOrFacts, and
>>>> after the process of standardization to DwC and publishing we think some
>>>> users/researchers will want the "original" table format for multiple
>>>> reasons.
>>>> Is it possible to have a verbatimTable or some place where we can store
>>>> the original table/column format?
>>>> Regards
> _______________________________________________
> IPT mailing list
> IPT at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/ipt