[IPT] can IPT2 handle big datasets?
Tim Robertson (GBIF)
trobertson at gbif.org
Sun Apr 10 20:10:36 CEST 2011
Hi David,
This test was using a "select * from ... limit..." query from the
MySQL DB that serves the GBIF data portal.
The table queried was as below.
It does not make too much sense to suggest schemas for use with the
IPT as the IPT allows for mapping to the DwC terms from arbitrary
schemas. However, if someone was designing a table for this purpose
then creating a flat table with terms from the DwC would be a good
start. In essence the GBIF portal looks similar to that structure,
and currently holds 270 million + records. If the DB were being used
for multiple collections and normally queried on a single collection
at a time, it would make sense to do range partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html
) although you would have to be into many millions of records, or on
poor hardware to really notice the benefits. If your queries are
spanning all collections, then you would not want to do that either,
as performance would degrade; you may consider a different
partitioning strategy though if your query patterns allow it. The
most important thing I would recommend with databases is to use the
smallest types possible (e.g. UNSIGNED SMALL INT instead of INT) - the
difference in performance is astronomical as the DB grows, and
something we learnt with the GBIF Portal as it grew.
Hope this helps,
Tim
CREATE TABLE `raw_occurrence_record` (
`id` int(11) NOT NULL auto_increment,
`data_provider_id` smallint(6) default NULL,
`data_resource_id` smallint(6) default NULL,
`resource_access_point_id` smallint(6) default NULL,
`institution_code` varchar(255) default NULL,
`collection_code` varchar(255) default NULL,
`catalogue_number` varchar(255) default NULL,
`scientific_name` varchar(255) default NULL,
`author` varchar(255) default NULL,
`rank` varchar(50) default NULL,
`kingdom` varchar(150) default NULL,
`phylum` varchar(150) default NULL,
`class` varchar(250) default NULL,
`order_rank` varchar(50) default NULL,
`family` varchar(250) default NULL,
`genus` varchar(150) default NULL,
`species` varchar(150) default NULL,
`subspecies` varchar(150) default NULL,
`latitude` varchar(50) default NULL,
`longitude` varchar(50) default NULL,
`lat_long_precision` varchar(50) default NULL,
`max_altitude` varchar(50) default NULL,
`min_altitude` varchar(50) default NULL,
`altitude_precision` varchar(50) default NULL,
`min_depth` varchar(50) default NULL,
`max_depth` varchar(50) default NULL,
`depth_precision` varchar(50) default NULL,
`continent_ocean` varchar(100) default NULL,
`country` varchar(100) default NULL,
`state_province` varchar(100) default NULL,
`county` varchar(100) default NULL,
`collector_name` varchar(255) default NULL,
`locality` text,
`year` varchar(50) default NULL,
`month` varchar(50) default NULL,
`day` varchar(50) default NULL,
`basis_of_record` varchar(100) default NULL,
`identifier_name` varchar(255) default NULL,
`identification_date` datetime default NULL,
`unit_qualifier` varchar(255) default NULL,
`created` timestamp NULL default NULL,
`modified` timestamp NULL default NULL,
`deleted` timestamp NULL default NULL,
PRIMARY KEY (`id`),
KEY `created` (`created`,`modified`),
KEY `data_resource_id` (`data_resource_id`,`catalogue_number`),
KEY `resource_access_point_id` (`resource_access_point_id`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=329086455 DEFAULT CHARSET=utf8
On Apr 9, 2011, at 1:47 PM, Herbario SANT wrote:
> Good news
>
> Perhaps it is possible that you publish more detailed info of this
> test as an example. Providers handle they local data in many
> different ways (excel, access, whatever ...), but MySQL is pretty
> commonly available in shared institutional servers.
> So having a good working sample of a well designed database schema
> (properly indexed) for providing massive DarwinCore data would be
> very useful as an example to follow (I mean an sql file with CREATE
> sentences for tables and indexes).
> Just an idea
>
> --
> David García San León
> Herbario SANT
> Facultade de Farmacia - Laboratorio de Botánica
> Universidade de Santiago de Compostela
> 15782 - Galicia (Spain)
> http://www.usc.es/herbario
> Tel. +34 881815022
> Fax +34 981594912
>
> _______________________________________________
> IPT mailing list
> IPT at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/ipt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.gbif.org/pipermail/ipt/attachments/20110410/f905968f/attachment.html
More information about the IPT
mailing list