loading data

Rahul Chopra
I wanted to try out running TPC-H queries.

When loading data, I noticed that things got very slow for the larger files, with heavy CPU usage.

5 rows affected (0.634 seconds) #region
25 rows affected (1.105 seconds) #nation
10,000 rows affected (1.878 seconds) #supplier
200,000 rows affected (5.749 seconds) #part
800,000 rows affected (15.06 seconds) #partsupp
150,000 rows affected (4.728 seconds) #customer
1,500,000 rows affected (111.945 seconds) #orders
6,001,215 rows affected (966.304 seconds) #lineitem

This is for a scale factor 1 (1 GB) database.

I have 8 GB of RAM, and I set the cache to 5 GB.
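
Concretely, something like this (the page count assumes LucidDB's default
32 KiB page size, so the exact number is illustrative):

  -- 5 GiB of cache at 32 KiB per page: 5 * 1024 * 1024 / 32 = 163840 pages
  ALTER SYSTEM SET "cachePagesMax" = 163840;
  ALTER SYSTEM SET "cachePagesInit" = 163840;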

I used the flat file wrapper and then an INSERT ... SELECT to do this. Is that the fastest way to load data? I did not see any mention of a fast bulk load utility. If this is the only way, are there other options I should be using, such as a commit interval?
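
For reference, the load looked roughly like this (the server name, directory
path, and lineitem.bcp control file are illustrative, following the flat
file wrapper documentation):

  -- Foreign server over the directory holding the dbgen *.tbl files
  CREATE SERVER tpch_files
  FOREIGN DATA WRAPPER sys_file_wrapper
  OPTIONS (
      directory '/data/tpch/',
      file_extension 'tbl',
      field_delimiter '|',
      with_header 'no'
  );

  -- Bulk load via INSERT ... SELECT; the bcp pseudo-schema assumes a
  -- lineitem.bcp control file describing the column types
  INSERT INTO "LINEITEM"
  SELECT * FROM tpch_files.bcp."lineitem";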

Thanks.



Re: loading data

Zelaine Fong
Yes, the technique you're using is the fastest way of loading the
database.  LucidDB does not have a separate bulk load utility, as
insert-selects are already done as bulk inserts.

The slowness you're seeing may have something to do with the large cache
size.  We've been doing some TPC-H testing as well and noticed that with
a larger cache size, the table loads were getting slower, instead of
faster.  One hypothesis is that this is a performance issue with the
bitmap index build for the primary key on the table.  With more memory,
the bitmaps are larger, and there may be an issue with building these
large bitmaps.  However, we haven't investigated deeply enough to
confirm this.

-- Zelaine



Re: loading data

Rahul Chopra

Hi Zelaine,

Thanks for the fast response.

Actually, I had not yet created any indexes/PKs on the tables; I did that only after loading the data.
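
(In other words, load first, then build indexes afterwards, along these
lines; table and column names are just for illustration:)

  -- Build the unclustered index only after the bulk load completes
  CREATE INDEX lineitem_orderkey_idx ON "LINEITEM" (l_orderkey);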

While the wrapper concept is kind of nice for ETL, I think it is cumbersome to set this all up if you have a simple delimited file that matches the structure of the target table and you just want a one-line command to load it up.

Anyway, I think the project is interesting and I wish it success.






Re: loading data

John Sichi
Rahul Chopra wrote:
> Actually, I had not yet created any indexes/PKs on the tables; I did
> that only after loading the data.

Sounds like we should run oprofile on the load with a large buffer pool
and see what's going on.

By the way, Zelaine, for the case where the PK is already defined, I'm
thinking the bitmap size is unlikely to be the culprit, since for a PK
all of the entries will be singletons.

> While the wrapper concept is kind of nice for ETL, I think it is
> cumbersome to set this all up if you have a simple delimited file
> that matches the structure of the target table and you just want a
> one-line command to load it up.

It wouldn't be hard to write a SQL-invocable user-defined procedure in
Java which does this (taking the file location and target table name as
parameters) by wrapping the underlying SQL commands (create flatfile
foreign server, run INSERT, drop flatfile server).  The Mondrian
replication procedure does something like that plus a lot more (source
code in dev/luciddb/src/com/lucidera/luciddb/applib/mondrian).
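
For example, the declaration and the resulting one-line load might look
something like this (the procedure name, class, and signature are
hypothetical, not an existing applib routine):

  -- Hypothetical wrapper; the Java body would issue the CREATE SERVER,
  -- INSERT ... SELECT, and DROP SERVER statements through JDBC
  CREATE PROCEDURE applib.load_flat_file(
      IN file_path VARCHAR(1024),
      IN target_table VARCHAR(128))
  LANGUAGE JAVA
  PARAMETER STYLE JAVA
  MODIFIES SQL DATA
  EXTERNAL NAME 'class com.example.FlatFileLoader.load';

  -- One-line load:
  CALL applib.load_flat_file('/data/tpch/lineitem.tbl', 'LINEITEM');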

Adapting open-source ETL tools to understand LucidDB's SQL-everything
worldview would be another way to ease this.  (LucidEra has something
like this internally but it's not currently in a shape suitable for
open-source release.)

JVS

