[IPT] can IPT2 handle big datasets?

Tim Robertson (GBIF) trobertson at gbif.org
Sun Apr 10 20:10:36 CEST 2011


Hi David,

This test was using a "select * from ... limit..." query from the  
MySQL DB that serves the GBIF data portal.
The table queried was as below.

It does not make too much sense to suggest schemas for use with the  
IPT as the IPT allows for mapping to the DwC terms from arbitrary  
schemas.  However, if someone was designing a table for this purpose  
then creating a flat table with terms from the DwC would be a good  
start.  In essence the GBIF portal looks similar to that structure,  
and currently holds 270 million + records.  If the DB were being used  
for multiple collections and normally queried on a single collection  
at a time, it would make sense to do range partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html 
) although you would have to be into many millions of records, or on  
poor hardware to really notice the benefits.  If your queries are  
spanning all collections, then you would not want to do that either,  
as performance would degrade; you may consider a different  
partitioning strategy though if your query patterns allow it.  The  
most important thing I would recommend with databases is to use the  
smallest types possible (e.g. UNSIGNED SMALL INT instead of INT) - the  
difference in performance is astronomical as the DB grows, and  
something we learnt with the GBIF Portal as it grew.

Hope this helps,
Tim

CREATE TABLE `raw_occurrence_record` (
   `id` int(11) NOT NULL auto_increment,
   `data_provider_id` smallint(6) default NULL,
   `data_resource_id` smallint(6) default NULL,
   `resource_access_point_id` smallint(6) default NULL,
   `institution_code` varchar(255) default NULL,
   `collection_code` varchar(255) default NULL,
   `catalogue_number` varchar(255) default NULL,
   `scientific_name` varchar(255) default NULL,
   `author` varchar(255) default NULL,
   `rank` varchar(50) default NULL,
   `kingdom` varchar(150) default NULL,
   `phylum` varchar(150) default NULL,
   `class` varchar(250) default NULL,
   `order_rank` varchar(50) default NULL,
   `family` varchar(250) default NULL,
   `genus` varchar(150) default NULL,
   `species` varchar(150) default NULL,
   `subspecies` varchar(150) default NULL,
   `latitude` varchar(50) default NULL,
   `longitude` varchar(50) default NULL,
   `lat_long_precision` varchar(50) default NULL,
   `max_altitude` varchar(50) default NULL,
   `min_altitude` varchar(50) default NULL,
   `altitude_precision` varchar(50) default NULL,
   `min_depth` varchar(50) default NULL,
   `max_depth` varchar(50) default NULL,
   `depth_precision` varchar(50) default NULL,
   `continent_ocean` varchar(100) default NULL,
   `country` varchar(100) default NULL,
   `state_province` varchar(100) default NULL,
   `county` varchar(100) default NULL,
   `collector_name` varchar(255) default NULL,
   `locality` text,
   `year` varchar(50) default NULL,
   `month` varchar(50) default NULL,
   `day` varchar(50) default NULL,
   `basis_of_record` varchar(100) default NULL,
   `identifier_name` varchar(255) default NULL,
   `identification_date` datetime default NULL,
   `unit_qualifier` varchar(255) default NULL,
   `created` timestamp NULL default NULL,
   `modified` timestamp NULL default NULL,
   `deleted` timestamp NULL default NULL,
   PRIMARY KEY  (`id`),
   KEY `created` (`created`,`modified`),
   KEY `data_resource_id` (`data_resource_id`,`catalogue_number`),
   KEY `resource_access_point_id` (`resource_access_point_id`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=329086455 DEFAULT CHARSET=utf8


On Apr 9, 2011, at 1:47 PM, Herbario SANT wrote:

> Good news
>
> Perhaps it is possible that you publish more detailed info of this  
> test as an example. Providers handle they local data in many  
> different ways (excel, access, whatever ...), but MySQL is pretty  
> commonly available in shared institutional servers.
> So having a good working sample of a well designed database schema  
> (properly indexed) for providing massive DarwinCore data would be  
> very useful as an example to follow (I mean an sql file with CREATE  
> sentences for tables and indexes).
> Just an idea
>
> -- 
> David García San León
> Herbario SANT
> Facultade de Farmacia - Laboratorio de Botánica
> Universidade de Santiago de Compostela
> 15782 - Galicia (Spain)
> http://www.usc.es/herbario
> Tel. +34 881815022
> Fax +34 981594912
>
> _______________________________________________
> IPT mailing list
> IPT at lists.gbif.org
> http://lists.gbif.org/mailman/listinfo/ipt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.gbif.org/pipermail/ipt/attachments/20110410/f905968f/attachment.html 


More information about the IPT mailing list