[IPT] Functionality request: ADMIN checking data before GBIF registration

Tim Robertson (GBIF) trobertson at gbif.org
Wed Sep 15 14:08:46 CEST 2010


[This is now beyond an IPT discussion]

Ultimately a simple key-value pair (KVP) store handles this nicely, but
the actual implementation is yet to be decided.  By parsing and storing
the individual fields separately, one can easily rewrite a record as
XML, RDF, CSV etc.
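
As a rough illustration of the idea (a sketch only, since the actual
implementation is undecided): the dict below stands in for a KVP store,
and the function names, the "rec-1" identifier and the XML layout are
all made up for the example.  Only the field names are real DwC terms.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a KVP store: record ID -> {field: value}.
store = {}


def put_record(record_id, fields):
    """Store each parsed field separately under the record's key."""
    store[record_id] = dict(fields)


def to_csv(record_ids, columns):
    """Rewrite the stored fields as CSV, one row per record."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id"] + columns, lineterminator="\n")
    writer.writeheader()
    for rid in record_ids:
        row = {"id": rid}
        row.update({c: store[rid].get(c, "") for c in columns})
        writer.writerow(row)
    return out.getvalue()


def to_xml(record_id):
    """Rewrite one stored record as XML (illustrative layout, not a DwC schema)."""
    root = ET.Element("record", id=record_id)
    for name, value in sorted(store[record_id].items()):
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


put_record("rec-1", {"scientificName": "Puma concolor", "country": "Chile"})
print(to_csv(["rec-1"], ["scientificName", "country"]))
print(to_xml("rec-1"))
```

Because the fields are kept apart rather than as one text blob, each
output format is just a different walk over the same stored values.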

For a small DB (50 million records or so) I would probably look to
Berkeley DB to provide this functionality.  For the kind of volume and
throughput we strive for in the global index, we will likely look
beyond that, potentially towards a column-oriented DB (e.g. a
BigTable clone) such as HBase or Cassandra.  I know the Atlas of
Living Australia are making good use of Cassandra.

Thanks,
Tim




On Sep 15, 2010, at 1:59 PM, Holetschek, Jörg wrote:

> Tim,
>
> how will the GBIF Indexer store the original DwC record? Is it kept  
> in a text field in the Index database? In the original DwC archive?
>
> Jörg
>
> From: ipt-bounces at lists.gbif.org [mailto:ipt-bounces at lists.gbif.org]
> On Behalf Of Tim Robertson (GBIF)
> Sent: Wednesday, 15 September 2010 13:29
> To: ipt at lists.gbif.org
> Subject: Re: [IPT] Functionality request: ADMIN checking data before
> GBIF registration
>
> Thanks all for the comments, which I will try and collate here with  
> some resolutions:
>
> Proposed as resolved:
> a) The preference would be to allow the ADMIN to provide  
> Registration privileges to individual MANAGERS
>
> Outstanding issues:
> b) Visualisations are important, and will need discussion and  
> potentially further IPT modules developed, or deployment of external  
> services.
> I know of one group already considering the development of DwC-A
> visualisations, and technologies like Google Fusion Tables
> make this kind of thing trivial.
>
> c) Record resolution is something that has been indicated as
> important.  This can be achieved in one of two ways:
>   > implementation in the IPT, which requires research into
>     technologies that perform satisfactorily
>      Potential technologies might be
>        - Berkeley DB
>        - A relational database, such as MySQL or H2
>        - Lucene indexing
>   > reliance on a "stable cache" for record serving
>
> The first release (for test purposes) of the revised IPT software
> will not have record-level serving, but while this is being
> developed, I would like to ask people to start discussing what kind
> of record-level serving is truly a requirement in the IPT, as
> opposed to a "nice to have".  We support DwC-A in the GBIF portal,
> and the intention is to simply re-serve the record that came from the
> DwC-A directly, unless the record indicates there is further
> information at a URL (e.g. if the record identifier is an LSID).
> Would this strategy not be suitable for the likes of the BioCASe
> portals, as often there is no further information to redirect to?  In
> the case of the IPT, there is no extra information, and I propose
> that, should the source be a DwC-A, individual records be cached in
> the harvesting portal.  With this approach, there would be no
> individual record serving needs in the IPT.
>
> Ultimately, we might consider aiming for data owners offering single  
> records on a resolvable URL, and conforming to Linked/Open Data  
> requirements, along with a DwC-A effectively providing a single  
> "index" view of the dataset.  The DwC-A records would reference the  
> originals by resolvable ID so any search system would always be able  
> to point back to the authoritative source.  This would effectively  
> be distributed indexing, and not dissimilar to the sharing of  
> sitemaps, but with extra information to enable better discovery.
>
> Thank you all for this feedback, and please correct any  
> misunderstandings on my part
>
> Tim
>
> On Sep 15, 2010, at 12:03 PM, Mihail-Constantin Carausu wrote:
>
>> Dear Tim
>>
>> I think the approach proposed by the development team is a workable
>> solution that covers both requirements at this stage.
>> However, I think Hannu (and I) had in mind something like a "basket of
>> approvals" in the Admin's (the provider owner's) administration
>> section. When a Manager publishes a dataset through the IPT, this
>> automatically triggers a request for approval, or submits a yes/no
>> event, in the Admin's administration section. The Admin must then
>> actively intervene and approve the dataset publication (e.g. by
>> checking an "Approved" check box in the basket-of-approvals list of
>> events/datasets in the administration section) at the latest
>> possible stage (e.g. just before GBIF starts to index it, or
>> something like that). Without this final approval the dataset is
>> still published and visible through the IPT, but is not
>> visible/searchable on the GBIF data portal. This approach does not
>> necessarily contradict the Manager's ability to publish datasets
>> autonomously within the IPT; it only places this ability under the
>> control of the central administration section when the dataset has
>> to go to the GBIF data portal.
>> I think both approaches have obvious advantages and disadvantages,
>> and neither provides 100% protection against a (test) user
>> publishing something odd.
>> I have a small question regarding the development team's proposed
>> solution: would it not be possible for the central Admin to enable
>> the publishing ability for some "trusted" managers and disable it
>> for others within the same IPT instance?
>>
>> I have just seen that Hannu's new message has arrived; apologies if
>> our messages have crossed, but I will send this anyhow.
>>
>> Best regards,
>> Mihail
>>
>> ----------------------------
>> Mihail Carausu
>> MSc.Eng., Informatics Manager
>> Danish Biodiversity Information Facility (DanBIF)
>> --------------------------------------------
>>
>>
>> -----Original Message-----
>> From: ipt-bounces at lists.gbif.org
>> [mailto:ipt-bounces at lists.gbif.org] On Behalf Of Tim Robertson (GBIF)
>> Sent: 15 September 2010 09:43
>> To: ipt at lists.gbif.org
>> Subject: [IPT] Functionality request: ADMIN checking data before
>> GBIF registration
>>
>> Hi all,
>>
>> Hannu has raised a request for the following to be satisfied by the  
>> IPT:
>> "- Publishing a resource must be accepted by the owner of the
>> provider.  It has happened that a test user publishes something odd
>> which goes all the way to the data portal without nobody controlling
>> it."
>>
>> This contradicts the requests of others, specifically those
>> wishing to promote basic "data hosting centers", who request
>> that a data MANAGER should be able to work autonomously.
>>
>> After discussion with the developers, the proposal is to implement the
>> following, which we hope satisfies both requirements:
>> In the Administration section, an ADMIN can choose to enable or
>> disable the ability for MANAGERS to register resources with GBIF.  By
>> default MANAGERS can register a resource, but an ADMIN can disable
>> this through a check box.
>>
>> If anyone has any concerns or comments on this approach, please can
>> you raise them on this list?
>>
>> Many thanks,
>> Tim
>>
>> _______________________________________________
>> IPT mailing list
>> IPT at lists.gbif.org
>> http://lists.gbif.org/mailman/listinfo/ipt
>>
>
