[This is now beyond an IPT discussion]
Ultimately a simple key-value pair (KVP) store handles this nicely, but the actual implementation is yet to be decided. By parsing and storing the individual fields separately, one can easily rewrite the record as XML, RDF, CSV, etc.
For a small DB (50 million records or so) I would probably look to Berkeley DB to provide this functionality. For the kind of volume and throughput we strive for at the global index, we will likely look beyond that, potentially towards a column-oriented DB (e.g. a BigTable clone) such as HBase or Cassandra. I know the Atlas of Living Australia are making nice use of Cassandra.
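To make the "store the fields separately, rewrite on demand" idea concrete, here is a minimal Python sketch. It is an illustration only, not the GBIF Indexer's implementation (which, as noted, is undecided): a plain dict stands in for Berkeley DB, HBase or Cassandra, and the `RecordStore` class, the `to_csv` and `to_xml` helpers, the record identifier and the field values are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


class RecordStore:
    """Toy key-value store: record ID -> JSON-encoded dict of parsed fields."""

    def __init__(self):
        # Assumption: a dict stands in for the real backend (Berkeley DB, HBase, ...).
        self._db = {}

    def put(self, record_id, fields):
        self._db[record_id] = json.dumps(fields)

    def get(self, record_id):
        return json.loads(self._db[record_id])


def to_csv(fields):
    """Render one record as a header row followed by a value row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerow(fields)
    return out.getvalue()


def to_xml(fields):
    """Render one record as a flat XML element with one child per field."""
    root = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


store = RecordStore()
store.put("example-1", {"scientificName": "Puma concolor", "countryCode": "DK"})
record = store.get("example-1")
csv_text = to_csv(record)
xml_text = to_xml(record)
```

Because the fields are kept as separate values rather than as one opaque blob, each output format is just another small function over the same stored record.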
Thanks, Tim
On Sep 15, 2010, at 1:59 PM, Holetschek, Jörg wrote:
Tim,
how will the GBIF Indexer store the original DwC record? Is it kept in a text field in the Index database? In the original DwC archive?
Jörg
From: ipt-bounces@lists.gbif.org [mailto:ipt-bounces@lists.gbif.org] On Behalf Of Tim Robertson (GBIF)
Sent: Wednesday, 15 September 2010 13:29
To: ipt@lists.gbif.org
Subject: Re: [IPT] Functionality request: ADMIN checking data before GBIF registration
Thanks all for the comments, which I will try and collate here with some resolutions:
Proposed as resolved: a) The preference would be to allow the ADMIN to grant registration privileges to individual MANAGERS.
Outstanding issues: b) Visualisations are important, and will need discussion and potentially the development of further IPT modules or the deployment of external services. I know of one group already considering the development of DwC-A visualisations, and technologies like Google Fusion Tables make this kind of thing trivial.
c) Record resolution is something that has been indicated as important. This can be achieved in one of two ways:
- Implementation in the IPT, which requires research into technologies that perform satisfactorily. Potential technologies might be:
  - Berkeley DB
  - A relational database, such as MySQL or H2
  - Lucene indexing
- Reliance on a "stable cache" for record serving
The first release (for test purposes) of the revised IPT software will not have record-level serving. While this is being developed, I would like to ask people to start discussing what kind of record-level serving is truly a requirement in the IPT, as opposed to a "nice to have". We support DwC-A in the GBIF portal, and the intention is simply to re-serve the record that came from the DwC-A directly, unless the record indicates that there is further information at a URL (e.g. if the record identifier is an LSID). Would this strategy not be suitable for the likes of the BioCASe portals, as often there is no further information to redirect to? In the case of the IPT, there is no extra information, and I propose that, should the source be a DwC-A, individual records be cached in the harvesting portal. With this approach, there would be no individual record serving needs in the IPT.
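The serving rule described above could be sketched roughly as follows in Python. This is a sketch under assumptions, not portal code: the `resolve` function, the cache layout and the identifiers are hypothetical, and treating any identifier that starts with `urn:lsid:` or `http(s)://` as "further information exists elsewhere" is my simplification of "the record indicates there is further information on a URL".

```python
def resolve(record_id, cache):
    """Decide how the harvesting portal answers a request for one record.

    Returns ("redirect", target) when the identifier itself looks resolvable,
    otherwise ("cached", record) to serve the copy harvested from the DwC-A.
    """
    record = cache[record_id]
    # Assumption: LSIDs and HTTP(S) URLs signal a richer authoritative source.
    if record_id.lower().startswith(("urn:lsid:", "http://", "https://")):
        return ("redirect", record_id)
    return ("cached", record)


# Hypothetical cache built while harvesting a DwC-A.
cache = {
    "urn:lsid:example.org:specimens:42": {"scientificName": "Puma concolor"},
    "local-7": {"scientificName": "Lynx lynx"},
}
lsid_result = resolve("urn:lsid:example.org:specimens:42", cache)
plain_result = resolve("local-7", cache)
```

In a real deployment the redirect target for an LSID would have to come from an LSID resolver rather than being the identifier itself; that step is left out here.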
Ultimately, we might consider aiming for data owners offering single records on a resolvable URL, and conforming to Linked/Open Data requirements, along with a DwC-A effectively providing a single "index" view of the dataset. The DwC-A records would reference the originals by resolvable ID so any search system would always be able to point back to the authoritative source. This would effectively be distributed indexing, and not dissimilar to the sharing of sitemaps, but with extra information to enable better discovery.
Thank you all for this feedback, and please correct any misunderstandings on my part.
Tim
On Sep 15, 2010, at 12:03 PM, Mihail-Constantin Carausu wrote:
Dear Tim
I think the approach proposed by the development team is a workable solution that covers both requirements at this stage. However, I think Hannu and I had in mind a kind of "basket of approvals" functionality in the Admin's (owner of the provider) administration section: when a Manager publishes a dataset through the IPT, this automatically triggers a request for approval, or submits a yes/no event, in the Admin's administration section. The Admin must then actively intervene and approve the dataset publication (e.g. by ticking an "Approved" check box in the basket of approvals, which lists the events/datasets in the administration section) at the latest possible stage (e.g. just before GBIF starts to index it, or something like that). Without this final approval the dataset will still be published and visible through the IPT, but not visible or searchable on the GBIF data portal. This approach does not necessarily contradict the Manager's ability to publish datasets autonomously within the IPT; it only places that ability under the control of the central administration section whenever a dataset has to go to the GBIF data portal. I think both approaches have obvious advantages and disadvantages, and neither provides 100% protection against a (test) user publishing something odd. I have a small question regarding the development team's proposed solution: is it not possible for the central Admin to enable the publishing ability for some "trusted" managers and disable it for others inside the same instance of the IPT?
I see that Hannu's new message has just arrived. Sorry if our messages have crossed, but I will send this anyhow.
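As a rough illustration of the "basket of approvals" idea, here is a minimal Python sketch. Nothing in it reflects existing IPT code: the `ApprovalBasket` class, its method names and the dataset name are all hypothetical, and the two states ("pending", "approved") are my reading of the yes/no event described above.

```python
class ApprovalBasket:
    """Tracks datasets published by Managers until the Admin approves them."""

    def __init__(self):
        self._status = {}  # dataset name -> "pending" or "approved"

    def publish(self, dataset):
        """A Manager publishes: visible in the IPT, queued for Admin approval."""
        self._status.setdefault(dataset, "pending")

    def approve(self, dataset):
        """The Admin ticks the "Approved" check box for one dataset."""
        if dataset not in self._status:
            raise KeyError(f"unknown dataset: {dataset}")
        self._status[dataset] = "approved"

    def pending(self):
        """Datasets still waiting in the Admin's basket."""
        return sorted(d for d, s in self._status.items() if s == "pending")

    def visible_in_ipt(self, dataset):
        return dataset in self._status

    def visible_in_portal(self, dataset):
        # Only approved datasets are exposed to the GBIF data portal.
        return self._status.get(dataset) == "approved"


basket = ApprovalBasket()
basket.publish("test-birds")
before = (basket.visible_in_ipt("test-birds"), basket.visible_in_portal("test-birds"))
basket.approve("test-birds")
after = (basket.visible_in_ipt("test-birds"), basket.visible_in_portal("test-birds"))
```

The point of the sketch is that publishing and portal visibility are two separate checks, so the Manager can still work autonomously inside the IPT.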
Best regards, Mihail
Mihail Carausu MSc.Eng., Informatics Manager Danish Biodiversity Information Facility (DanBIF)
-----Original Message-----
From: ipt-bounces@lists.gbif.org [mailto:ipt-bounces@lists.gbif.org] On Behalf Of Tim Robertson (GBIF)
Sent: 15 September 2010 09:43
To: ipt@lists.gbif.org
Subject: [IPT] Functionality request: ADMIN checking data before GBIF registration
Hi all,
Hannu has raised a request for the following to be satisfied by the IPT: "- Publishing a resource must be accepted by the owner of the provider. It has happened that a test user publishes something odd which goes all the way to the data portal without nobody controlling it."
This contradicts the requests of others, specifically those wishing to promote basic "data hosting centers", who request that a data MANAGER should be able to work autonomously.
After discussion with the developers the proposal is to implement the following, which we hope satisfies both requirements: In the Administration section, an ADMIN can choose to enable or disable the ability for MANAGERS to register resources with GBIF. By default MANAGERS can register a resource, but an ADMIN can disable this through this check box.
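The proposed check box amounts to a single instance-wide setting consulted whenever someone tries to register a resource. A minimal Python sketch follows, with the caveat that the `IptSettings` class, the `can_register` function and the rule that an ADMIN can always register are assumptions for illustration and not taken from the IPT code base.

```python
from dataclasses import dataclass


@dataclass
class IptSettings:
    # Default from the proposal: MANAGERS may register resources with GBIF.
    managers_can_register: bool = True


def can_register(role, settings):
    """Return True if a user with this role may register a resource with GBIF."""
    if role == "ADMIN":
        # Assumption: the ADMIN is never locked out by the check box.
        return True
    if role == "MANAGER":
        return settings.managers_can_register
    return False


default_settings = IptSettings()
locked_settings = IptSettings(managers_can_register=False)
manager_default = can_register("MANAGER", default_settings)
manager_locked = can_register("MANAGER", locked_settings)
admin_locked = can_register("ADMIN", locked_settings)
```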
If anyone has any concerns or comments on this approach, please can you raise them on this list?
Many thanks, Tim
IPT mailing list IPT@lists.gbif.org http://lists.gbif.org/mailman/listinfo/ipt