[IPT] How to add his own/custom extension
biology.info at gmail.com
Wed Apr 9 15:28:30 CEST 2014
Sorry for the late reply. I was thinking about your answer ;)
On Fri, Mar 14, 2014 at 6:31 PM, Tim Robertson [GBIF]
<trobertson at gbif.org> wrote:
> Hi Julien,
> If I understand correctly you have a data flow/process with:
> a) multiple input sources:
> - local databases of various schemas
> - remote databases of various schemas
> - CSV files of various flavours
> b) you use an IPT as a staging area, and map the input into DwC
> c) you then harvest datasets from the IPT into a normalised database model
> And you're finding that the intermediary step in b) is causing issues due
> to the limited nature of the format (star schema and only a few extensions).
> Can I ask please - are there other reasons you use the IPT? E.g. are you
> publishing to Canadensys or GBIF for example? Or perhaps you use it for
> easier metadata authoring.
You understood my purpose perfectly; points a, b and c are correct.
About the IPT: indeed, there are other reasons, like metadata authoring, GBIF
publishing and DwC-A file storage, and, yes, I use it as a staging area.
> Some things perhaps worth considering:
> - is the IPT the best tool for this? Defining an intermediate format, and
> scripting this with SQL exports might be a better model. We have used
> Scriptella (http://scriptella.javaforge.com/) for several data migration
> projects and found it very good for transforming across schemas.
Thanks for the advice, but I can't write separate scripts for dozens of DBs;
that would be a crazy job. I can't imagine exploring each schema and
constructing normalized data (e.g. scientificName, date, ... 176 terms).
No, the data providers have to make the effort, and the IPT facilitates the
mapping step. I just need to define the data format so as to avoid multiple
processing methods.
> - If you really do want to create extensions, can you please elaborate on
> what they would be? Extensions are in general intended to be reusable by a
> wider community (e.g. standardised publishing of images) and without
> knowing more, it is not easy to comment if extensions are a good way to
> achieve what you want, or if you are really just looking for general
> purpose "extract transform load" (ETL) tools, for which there are many on
> the web; like scriptella.
I realised an extension is not the solution; maybe I will create some new
terms ;) but for the moment [ Occurrence core with References - History -
Multimedia extensions ] is enough to describe the data collection.
The purpose of my initial question was to find out whether I could easily
create my own extension and make it "IPT understandable": just create an
XML file, put it in the right folder, reload the IPT and see my extension.
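For reference, the extension definitions the IPT consumes are XML files of
roughly this shape (modelled on the schemas published at rs.gbif.org; the
extension name, namespace and term below are purely illustrative, not a real
registered extension):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch of a GBIF-style extension definition;
     attribute set modelled on rs.gbif.org schemas, names are hypothetical -->
<extension xmlns="http://rs.gbif.org/extension/"
           xmlns:dc="http://purl.org/dc/terms/"
           dc:title="My Custom Extension"
           name="MyCustomExtension"
           namespace="http://example.org/terms/"
           rowType="http://example.org/terms/MyCustomExtension"
           dc:description="Illustrative only.">
  <!-- One <property> per term the extension can carry -->
  <property name="measurementValue"
            namespace="http://rs.tdwg.org/dwc/terms/"
            qualName="http://rs.tdwg.org/dwc/terms/measurementValue"
            required="false"/>
</extension>
```

Whether simply dropping such a file into an IPT folder is enough depends on
the IPT version; registered extensions are normally served to the IPT from the
GBIF registry.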
> I hope these comments come across as constructive and not a nuisance.
No fear, it's a pleasure to share with you! Your questions gave me plenty to
think about.
> On 14 Mar 2014, at 17:13, Julien Husson <biology.info at gmail.com> wrote:
> Hi Tim, thanks for your prompt answer.
> It's not really what I expected; to be more specific:
> First, I have to harvest a lot of data from multiple DBs (remote or not)
> with, of course, different structures/models and data formats...
> sometimes no DB at all, just flat CSV/XLS files.
> That's why I use an IPT to create the data mapping and to standardize the
> data stream into the DwC standard.
> (I'm talking about millions of specimens.)
> Secondly, I build specific indexes of the harvested data with a custom
> harvester using the canadensys-harvester lib (thanks to Christian). It's at
> this point that it begins to be difficult, with a denormalized view of the
> data from the IPT DwC-A, because I need to transform this denormalized view,
> or raw model, into a normalized view that matches my big relational database
> model, which becomes the new repository.
> That's why I thought that custom extensions could make my life easier.
> On Fri, Mar 14, 2014 at 5:12 PM, Julien Husson <biology.info at gmail.com>wrote:
>> On Fri, Mar 14, 2014 at 4:32 PM, Tim Robertson [GBIF] <
>> trobertson at gbif.org> wrote:
>>> Hi Julien,
>>> Thanks for your question. It really depends on what you are trying to
>>> publish. We can add extensions of course, but without knowing the
>>> specifics it is difficult to comment.
>>> However, a "specimen, event, location" DB model would typically map to
>>> an Occurrence core with no extensions required - this is the most common
>>> use case of Darwin Core and the IPT. An Occurrence core is basically a
>>> denormalized view of the data.
>>> If I were the data manager, I would probably consider that I was
>>> publishing a "DwC Occurrence view" of my more complex model and as such
>>> would keep a view in the database along the lines of:
>>> CREATE VIEW view_dwc_occurrence AS
>>> SELECT
>>>   specimen.bar_code AS occurrenceID,
>>>   specimen.name AS scientificName,
>>>   location.latitude AS decimalLatitude,
>>>   location.longitude AS decimalLongitude,
>>>   event.year AS year
>>> FROM specimen
>>>   JOIN event ON ...
>>>   JOIN location ON ...
>>> WHERE <insert any conditions for record inclusion, such as non endangered>;
>>> Then in my IPT I would simply do "SELECT * FROM view_dwc_occurrence".
>>> Here you are flattening the normalised model into a denormalized DwC view
>>> of the data.
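[Editor's note: the view idea above, run end-to-end as a self-contained
sketch. This uses SQLite and illustrative table/column names, not Tim's or
Julien's actual schema.]

```python
import sqlite3

# In-memory database with a toy normalised "specimen, event, location" model.
# All table names and sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE specimen (id INTEGER PRIMARY KEY, bar_code TEXT, name TEXT,
                       event_id INTEGER, location_id INTEGER);
CREATE TABLE event    (id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE location (id INTEGER PRIMARY KEY, latitude REAL, longitude REAL);

INSERT INTO event    VALUES (1, 2013);
INSERT INTO location VALUES (1, 48.85, 2.35);
INSERT INTO specimen VALUES (1, 'MNHN-001', 'Puma concolor', 1, 1);

-- The DwC occurrence view: flattens the normalised model, with column
-- aliases that are already recognised Darwin Core terms.
CREATE VIEW view_dwc_occurrence AS
SELECT specimen.bar_code   AS occurrenceID,
       specimen.name       AS scientificName,
       location.latitude   AS decimalLatitude,
       location.longitude  AS decimalLongitude,
       event.year          AS year
FROM specimen
JOIN event    ON event.id    = specimen.event_id
JOIN location ON location.id = specimen.location_id;
""")

# The IPT mapping then reduces to a single statement over the view.
rows = conn.execute("SELECT * FROM view_dwc_occurrence").fetchall()
print(rows)  # [('MNHN-001', 'Puma concolor', 48.85, 2.35, 2013)]
```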
>>> Maintaining a view in the database layer as opposed to a custom mapping
>>> in the IPT benefits you by:
>>> i) catching issues early with database schema changes since the DB
>>> will likely stop you with an error
>>> ii) offering an easy mapping of DB table field names, to DwC terms in
>>> a language I find very familiar (SQL)
>>> iii) a super simple IPT mapping, as all columns will map automatically
>>> in the IPT since they are DwC recognised terms already
>>> Does that help in any way? If not, could you please elaborate on the
>>> model and what you are trying to achieve and we'll do all we can.
>>> On 14 Mar 2014, at 16:06, Julien Husson <biology.info at gmail.com> wrote:
>>> I use DwC-A to feed my DB.
>>> We know the limits of the DwC star schema for representing a relational
>>> model. For example, in the case of 1-n cardinality: specimen --- event/record
>>> --- location.
>>> If I understand the concept, I need to use the Darwin Core Occurrence
>>> extension, denormalize my relational model into a big raw model, and
>>> transform / re-normalize that model to match my DB model.
>>> In order to reduce cost and dev effort and to optimize SQL statements, it
>>> would be greatly appreciated to be able to add custom extensions. In that
>>> case, I could stay very close to my relational database model and avoid
>>> multiple steps of dev.
>>> I discovered this link but the explanation is now deprecated
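[Editor's note: the denormalize-then-re-normalize step Julien describes can be
sketched as follows. The column names are standard DwC terms, but the
tab-separated sample data and the dedup-by-coordinates key are illustrative
assumptions, not Julien's actual harvester logic.]

```python
import csv
import io

# Hypothetical denormalized DwC occurrence rows, as found in a DwC-A data file.
data = io.StringIO(
    "occurrenceID\tscientificName\tdecimalLatitude\tdecimalLongitude\tyear\n"
    "MNHN-001\tPuma concolor\t48.85\t2.35\t2013\n"
    "MNHN-002\tLynx lynx\t48.85\t2.35\t2013\n"
)

locations = {}   # natural key (lat, lon) -> surrogate location id
specimens = []   # re-normalized specimen rows referencing a location id

for row in csv.DictReader(data, delimiter="\t"):
    key = (row["decimalLatitude"], row["decimalLongitude"])
    # Reuse the existing location id if this coordinate pair was seen before.
    loc_id = locations.setdefault(key, len(locations) + 1)
    specimens.append((row["occurrenceID"], row["scientificName"], loc_id))

print(len(locations))  # the two occurrences share one location -> 1
print(specimens)       # [('MNHN-001', 'Puma concolor', 1), ('MNHN-002', 'Lynx lynx', 1)]
```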
>>> IPT mailing list
>>> IPT at lists.gbif.org