30 Minutes of playing with LucidDB

30 Minutes of playing with LucidDB

Nicholas Goodman
Greetings LucidDBers.  Looks like we're just getting a start as a community, even though the code for LucidDB has been around for a while.  I look forward to getting to connect with you over time.

I've had very limited time to build LucidDB test cases and implementations, but I did want to share some testing I've done.  First and foremost, this is not a scientific suite of tests, just some simple cases that are showing some encouraging results; in other words, it's what I've been able to do in about 30 minutes of spare time.  My goal is simply to share some information; maybe you'll find it interesting or useful, maybe not.  :)

I've loaded a test table, DIM_CUSTOMER2, with about a million rows.

0: jdbc:luciddb:rmi://localhost> !describe DIM_CUSTOMER2
+------------+----------------+----------------+------------------+------------+------------+--------------+-------------+-------------+-----------------+-----------+----------+-------------+------------+
| TABLE_CAT  |  TABLE_SCHEM   |   TABLE_NAME   |   COLUMN_NAME    | DATA_TYPE  | TYPE_NAME  | COLUMN_SIZE  | BUFFER_LEN  | DEC_DIGITS  | NUM_PREC_RADIX  | NULLABLE  | REMARKS  | COLUMN_DEF  | SQL_DATA_T |
+------------+----------------+----------------+------------------+------------+------------+--------------+-------------+-------------+-----------------+-----------+----------+-------------+------------+
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | DIM_CUSTOMER_ID  | 4          | INTEGER    |              | 0           |             |                 | 0         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | DIM_CUSTOMER_NK  | 4          | INTEGER    |              | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | LASTNAME         | 12         | VARCHAR    | 10           | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | FIRSTNAME        | 12         | VARCHAR    | 15           | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | ZIPCODE          | 4          | INTEGER    |              | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | CITY             | 12         | VARCHAR    | 15           | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | BIRTHDATE        | 91         | DATE       |              | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | STREET           | 12         | VARCHAR    | 15           | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | HOUSENUMBER      | 4          | INTEGER    |              | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | STATE_CODE       | 12         | VARCHAR    | 10           | 0           |             |                 | 1         |          |             | 0          |
| LOCALDB    | TEST_DATAMART  | DIM_CUSTOMER2  | STATE            | 12         | VARCHAR    | 20           | 0           |             |                 | 1         |          |             | 0          |
+------------+----------------+----------------+------------------+------------+------------+--------------+-------------+-------------+-----------------+-----------+----------+-------------+------------+

0: jdbc:luciddb:rmi://localhost> select count(*) from DIM_CUSTOMER;
+----------+
|  EXPR$0  |
+----------+
| 1000000  |
+----------+

The table has some textual values, integers, and a date:

0: jdbc:luciddb:rmi://localhost> select * from DIM_CUSTOMER2 where DIM_CUSTOMER_NK < 5;
+------------------+------------------+-------------+------------------+----------+------------------+-------------+------------------+--------------+-------------+-----------------------+
| DIM_CUSTOMER_ID  | DIM_CUSTOMER_NK  |  LASTNAME   |    FIRSTNAME     | ZIPCODE  |       CITY       |  BIRTHDATE  |      STREET      | HOUSENUMBER  | STATE_CODE  |         STATE         |
+------------------+------------------+-------------+------------------+----------+------------------+-------------+------------------+--------------+-------------+-----------------------+
| 1000000          | 1                | rpyy-name   | rca-firstname    | 13380    | ogc-city         | 1958-10-30  | fenv-street      | 293          | GU          | GUAM                  |
| 1000001          | 2                | vgvx-name   | bwf-firstname    | 11420    | rxl-city         | 1962-10-29  | fmkt-street      | 218          | NY          | NEW YORK              |
| 1000002          | 3                | yxda-name   | efw-firstname    | 14460    | sbv-city         | 1954-04-15  | mptp-street      | 473          | GA          | GEORGIA               |
| 1000003          | 4                | ijmz-name   | vuk-firstname    | 17740    | gjz-city         | 1954-07-30  | fedt-street      | 155          | MS          | MISSISSIPPI           |
+------------------+------------------+-------------+------------------+----------+------------------+-------------+------------------+--------------+-------------+-----------------------+


From a storage perspective, it looks as if the total storage for the 1-million-row table ends up at about 120MB.  I've not added any additional multi-column indexes, so these are just the column stores.  If anyone has additional information on how to interpret this undocumented view, that would certainly be welcome.  NOTE: I've done similar tests with the MySQL Archive engine on this same table, and even at maximum compression it ends up being about 50MB.

0: jdbc:luciddb:rmi://localhost> select index_name, pages, pages * 32 / 1024 as pages_mb from sys_boot.mgmt.page_counts_view where table_name='DIM_CUSTOMER2';                                              
+-----------------------------------------------------+--------+-----------+                                                                      
|                     INDEX_NAME                      | PAGES  | PAGES_MB  |
+-----------------------------------------------------+--------+-----------+
| SYS$CONSTRAINT_INDEX$SYS$PRIMARY_KEY$DIM_CUSTOMER2  | 674    | 21        |
| SYS$CONSTRAINT_INDEX$DIM_CUSTOMER2_NK               | 674    | 21        |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$HOUSENUMBER       | 44     | 1         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$ZIPCODE           | 44     | 1         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$DIM_CUSTOMER_ID   | 125    | 3         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$STREET            | 497    | 15        |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$STATE_CODE        | 26     | 0         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$BIRTHDATE         | 157    | 4         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$LASTNAME          | 341    | 10        |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$DIM_CUSTOMER_NK   | 155    | 4         |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$CITY              | 496    | 15        |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$FIRSTNAME         | 496    | 15        |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER2$STATE             | 27     | 0         |
| SYS$DELETION_INDEX$DIM_CUSTOMER2                    | 1      | 0         |
+-----------------------------------------------------+--------+-----------+
14 rows selected (1.079 seconds)
0: jdbc:luciddb:rmi://localhost> select sum(pages) * 32 / 1024 from sys_boot.mgmt.page_counts_view where table_name='DIM_CUSTOMER2';                         
+---------+
| EXPR$0  |
+---------+
| 117     |
+---------+
1 row selected (1.376 seconds)

Analyzing the table to compute statistics for all columns takes about 50 seconds:
0: jdbc:luciddb:rmi://localhost> analyze table DIM_CUSTOMER2 compute statistics for all columns;
No rows affected (52.956 seconds)

The load was fast too: approximately 60 seconds to do an INSERT INTO DIM_CUSTOMER2 from a SELECT against the flat file adapter.
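
For anyone curious what that kind of load looks like, here is a rough sketch of the pattern using LucidDB's flat-file foreign data wrapper.  The server name, directory, file name, and options below are assumptions for illustration (this is not the actual script used above), so check the flat-file reader documentation before copying:

-- Hedged sketch: names, paths, and options are hypothetical.
create server csv_server
foreign data wrapper sys_file_wrapper
options (
    directory '/tmp/customer_data/',   -- hypothetical directory holding the source file
    file_extension 'csv',
    with_header 'yes'
);

-- Bulk-load the column store straight from the flat file; the exact foreign table
-- reference depends on how the server and any control files are defined.
insert into TEST_DATAMART.DIM_CUSTOMER2
select * from csv_server.bcp."dim_customer2";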

So, how about some interesting queries?
0: jdbc:luciddb:rmi://localhost> select min(zipcode), max(zipcode), avg(zipcode) from DIM_CUSTOMER2;                                                         
+---------+---------+---------+
| EXPR$0  | EXPR$1  | EXPR$2  |
+---------+---------+---------+
| 10000   | 20000   | 2113    |
+---------+---------+---------+
1 row selected (1.074 seconds)

0: jdbc:luciddb:rmi://localhost> select STATE, min(zipcode) minzip, max(zipcode) maxzip, avg(zipcode) avgzip from DIM_CUSTOMER2 group by STATE order by 1;
+-----------------------+---------+---------+---------+
|         STATE         | MINZIP  | MAXZIP  | AVGZIP  |
+-----------------------+---------+---------+---------+
| ALABAMA               | 10000   | 20000   | 15051   |
| ALASKA                | 10000   | 20000   | 15023   |
| AMERICAN SAMOA        | 10000   | 20000   | 15015   |
| ARIZONA               | 10000   | 20000   | 15025   |
| ARKANSAS              | 10000   | 20000   | 15014   |
| CALIFORNIA            | 10000   | 20000   | 15049   |
| COLORADO              | 10000   | 20000   | 14976   |
| CONNECTICUT           | 10000   | 20000   | 15000   |
| DELAWARE              | 10000   | 20000   | 14975   |
| DISTRICT OF COLUMBIA  | 10000   | 20000   | 14978   |
| FEDERATED STATES OF   | 10000   | 20000   | 15017   |
| FLORIDA               | 10000   | 20000   | 14958   |
| GEORGIA               | 10000   | 20000   | 15048   |
| GUAM                  | 10000   | 20000   | 14979   |
| HAWAII                | 10000   | 20000   | 14964   |
| IDAHO                 | 10000   | 20000   | 14999   |
| ILLINOIS              | 10000   | 20000   | 15025   |
| INDIANA               | 10000   | 20000   | 14997   |
| IOWA                  | 10000   | 20000   | 14969   |
| KANSAS                | 10000   | 20000   | 15021   |
| KENTUCKY              | 10000   | 20000   | 15009   |
| LOUISIANA             | 10000   | 20000   | 14995   |
| MAINE                 | 10000   | 20000   | 14980   |
| MARSHALL ISLANDS      | 10000   | 20000   | 14986   |
| MARYLAND              | 10000   | 20000   | 15012   |
| MASSACHUSETTS         | 10000   | 20000   | 15013   |
| MICHIGAN              | 10000   | 20000   | 14977   |
| MINNESOTA             | 10000   | 20000   | 14997   |
| MISSISSIPPI           | 10000   | 20000   | 14999   |
| MISSOURI              | 10000   | 20000   | 15006   |
| MONTANA               | 10000   | 20000   | 15006   |
| NEBRASKA              | 10000   | 20000   | 15017   |
| NEVADA                | 10000   | 20000   | 14971   |
| NEW HAMPSHIRE         | 10000   | 20000   | 14991   |
| NEW JERSEY            | 10000   | 20000   | 14958   |
| NEW MEXICO            | 10000   | 20000   | 14955   |
| NEW YORK              | 10000   | 20000   | 14982   |
| NORTH CAROLINA        | 10000   | 20000   | 15019   |
| NORTH DAKOTA          | 10000   | 20000   | 14992   |
| NORTHERN MARIANA ISL  | 10000   | 20000   | 14998   |
| OHIO                  | 10000   | 20000   | 14995   |
| OKLAHOMA              | 10000   | 20000   | 14987   |
| OREGON                | 10000   | 20000   | 15012   |
| PALAU                 | 10000   | 20000   | 15010   |
| PENNSYLVANIA          | 10000   | 20000   | 15015   |
| PUERTO RICO           | 10000   | 20000   | 15004   |
| RHODE ISLAND          | 10000   | 20000   | 14964   |
| SOUTH CAROLINA        | 10000   | 20000   | 15011   |
| SOUTH DAKOTA          | 10000   | 20000   | 14984   |
| TENNESSEE             | 10000   | 20000   | 15026   |
| TEXAS                 | 10000   | 20000   | 14954   |
| UTAH                  | 10000   | 20000   | 14994   |
| VERMONT               | 10000   | 20000   | 14977   |
| VIRGIN ISLANDS        | 10000   | 20000   | 14976   |
| VIRGINIA              | 10000   | 20000   | 15027   |
| WASHINGTON            | 10000   | 20000   | 15029   |
| WEST VIRGINIA         | 10000   | 20000   | 15002   |
| WISCONSIN             | 10000   | 20000   | 14989   |
| WYOMING               | 10000   | 20000   | 14994   |
| undefined             | 10000   | 20000   | 15019   |
+-----------------------+---------+---------+---------+
60 rows selected (3.107 seconds)

0: jdbc:luciddb:rmi://localhost> select CITY, min(zipcode) minzip, max(zipcode) maxzip, avg(zipcode) avgzip from DIM_CUSTOMER2 group by CITY order by 1;
-- TRUNCATED --
| zzx-city         | 10080   | 19860   | 14690   |
| zzy-city         | 10040   | 19380   | 14732   |
| zzz-city         | 10100   | 20000   | 15101   |
+------------------+---------+---------+---------+
17,576 rows selected (19.271 seconds)

0: jdbc:luciddb:rmi://localhost> select count(*) from DIM_CUSTOMER2 where STATE like '%IN%';
+---------+
| EXPR$0  |
+---------+
| 202825  |
+---------+
1 row selected (4.212 seconds)
0: jdbc:luciddb:rmi://localhost> select sum(zipcode) from DIM_CUSTOMER2 where STATE like '%IN%';
+--------------+
|    EXPR$0    |
+--------------+
| -1251720736  |
+--------------+
1 row selected (4.787 seconds)

I'm not sure if the last result is a LucidDB issue or a sqlline issue, but it's clearly not correct (zipcodes are all positive integers).


Re: 30 Minutes of playing with LucidDB

Zelaine Fong


Nick Goodman wrote:

>
> 0: jdbc:luciddb:rmi://localhost> select sum(zipcode) from DIM_CUSTOMER2
> where STATE like '%IN%';
> +--------------+
> |    EXPR$0    |
> +--------------+
> | -1251720736  |
> +--------------+
> 1 row selected (4.787 seconds)
>
> I'm not sure if the last result is LucidDB or sqlline, but it's clearly
> not correct (zipcode are all positive integers).
>

The sum of your zipcodes is probably overflowing the return type of the
sum(), which is an integer.  You can see this in a simple example like
the following:

create table t(a int);
insert into t values(2147483647);
insert into t values(2147483647);

select sum(a) from t will return a negative number.

You can work around this by doing the following:

select sum(cast(a as bigint)) from t;

It looks like the SQL standard specifies the result type in cases like
this to be "implementation-dependent".
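
Applied to the original query from this thread, the workaround would look something like:

select sum(cast(zipcode as bigint)) from DIM_CUSTOMER2 where STATE like '%IN%';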

-- Zelaine



Re: 30 Minutes of playing with LucidDB

John Sichi
Zelaine Fong wrote:
> select sum(a) from t will return a negative number.
>
> You can workaround this by doing the following:
>
> select sum(cast(a as bigint)) from t;
>
> It looks like the SQL standard specifies the result type in cases like
> this to be "implementation-dependent".

Of course, SQL:2003 also specifies that overflow should be detected at
execution time as an error condition:

http://issues.eigenbase.org/browse/LDB-21

JVS



Re: 30 Minutes of playing with LucidDB

John Sichi
In reply to this post by Nicholas Goodman
Nick Goodman wrote:
>  From a storage perspective, it looks as if the total storage for the 1
> million row table ends up at about 120MB.  I've not added additional
> multiple column indexes, so these are just the column stores.  If anyone
> has some additional information on how to interpret this undocumented
> view that would certainly be welcome.  NOTE: I've done similar tests on
> MySQL Archive engine on this same table, and even at a very maximum
> compression it ends up being about 50MB.  

Note that if you leave off the storage for the unclustered indexes, the
column-stores by themselves add up to only 75MB.  The unclustered
indexes are on  columns with all distinct values; if you create new
single-column indexes on the other columns, you should see good
compression from bitmap indexing.
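
For example (an untested sketch, mirroring the index syntax used later in this thread), bitmap indexes on the low-cardinality columns would be created like:

create index DIM_CUSTOMER2_STATE_IDX on DIM_CUSTOMER2(STATE);
create index DIM_CUSTOMER2_STATE_CODE_IDX on DIM_CUSTOMER2(STATE_CODE);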

The distributions for the CITY/FIRSTNAME/LASTNAME/STREET columns appear to be
synthetic-uniform rather than real-world, since usually you'd expect a
lot more duplicates for these.  MySQL's Archive engine probably uses
compression similar to zip, which is nice for sequential access since it
can compress tokens within values (e.g. "city" or "firstname").

> 0: jdbc:luciddb:rmi://localhost> select CITY, min(zipcode) minzip,
> max(zipcode) maxzip, avg(zipcode) avgzip from DIM_CUSTOMER2 group by
> CITY order by 1;
> -- TRUNCATED --
> | zzx-city         | 10080   | 19860   | 14690   |
> | zzy-city         | 10040   | 19380   | 14732   |
> | zzz-city         | 10100   | 20000   | 15101   |
> +------------------+---------+---------+---------+
> 17,576 rows selected (19.271 seconds)

One thing to watch out for with sqlline is that it has a lot of overhead
for fetching and rendering big result sets.  For example, by default it
buffers up the whole thing and does lots of string manipulation to
figure out good display widths for each column.
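
If memory serves, sqlline also has settings that reduce this overhead when you just want to time a query; the exact setting names below are from memory, so double-check them against your sqlline version:

!set incremental true     -- stream rows as they arrive instead of buffering the whole result
!set outputformat csv     -- skip the column-width calculations of the table format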

I think someone is looking into Apache JMeter to see if it can be used
as a good JDBC perf-testing framework.

JVS



Re: 30 Minutes of playing with LucidDB

Nicholas Goodman

On May 22, 2007, at 6:55 PM, John V. Sichi wrote:

> Note that if you leave off the storage for the unclustered indexes,  
> the column-stores by themselves add up to only 75MB.  The  
> unclustered indexes are on  columns with all distinct values; if  
> you create new single-column indexes on the other columns, you  
> should see good compression from bitmap indexing.
I'll try that and see what kind of compression I see - I'll report  
back the next time I get 30 minutes to play again.

> The distributions for the CITY/FIRSTNAME/LASTNAME/STREET appear to  
> be synthetic-uniform rather than real-world, since usually you'd  
> expect a lot more duplicates for these.  MySQL's Archive engine  
> probably uses compression similar to zip, which is nice for  
> sequential access since it can compress tokens within values (e.g.  
> "city" or "firstname").

Absolutely.  I was trying to suggest that the maximum compression one can get on the raw bits is not that much less.  That is, zipping the records in a row store (MySQL Archive) gets the table down to about 50 MB at best, which leads me to believe that storage compression in LucidDB is pretty damn good.  I'll play around with some of the multi-column indexes as well.  I wonder, though: will a multi-column index provide that much benefit in a column-store database?  Some of mine were small anyhow - a few MB and limited I/O.  I can do some testing on this as well, but I wouldn't expect to see as large an improvement in a column store as in a row store.  Anyone care to comment on this line of thinking?  :)
>
> One thing to watch out for with sqlline is that it has a lot of  
> overhead for fetching and rendering big result sets.  For example,  
> by default it buffers up the whole thing and does lots of string  
> manipulation to figure out good display widths for each column.
Good point - can LucidDB do an inline table?

select count(*) from (select my_original_query_columns from original_query_table group by original_grouping) t

I shouldn't ask, I should just test, but I don't have a server up and running.  That reminds me: at some point I need to try to compile LucidDB (+farrago/fennel) on OS X.  :)

Thanks for all the comments.

Nick



Re: 30 Minutes of playing with LucidDB

Zelaine Fong
Nick Goodman wrote:

> Good point - Can LucidDB do an inline table:
> select count(*) from (select my_original_query_columns from  
> original_query_table group by original_grouping) t
> ?
> I shouldn't ask, I should just test but I don't have a server up and  
> running.  That reminds me, at some point I need to try and compile  
> LucidDB (+farrago/fennel) on OS X.  :)
>

Yes, LucidDB allows you to specify a select statement in the FROM clause.
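
For example, something along these lines should work (a small sketch against the table from this thread):

select count(*) from (select STATE from DIM_CUSTOMER2 group by STATE) t;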

-- Zelaine



Re: 30 Minutes of playing with LucidDB

John Sichi
In reply to this post by Nicholas Goodman
Nick Goodman wrote:

> Absolutely.  I was trying to suggest that maximum compression that one
> can get with the raw bits is not that much less.  ie, Zipping the
> records in a row store yields (mysql archive) an absolute maximum 50 MB
> which leads me to believe that storage compression in LucidDB is pretty
> damn good.  I'll play around with some of the multi column indexes as
> well.  I wonder though?  Will having a multi column index provide that
> much benefit in a column store database?  Some of mine were small anyhow
> - a few MB and limited IO.  I can do some testing on this as well, but I
> wouldn't expect to see as large of improvements in a column store as a
> row store.  Anyone care to comment on this line of thinking?  :)

For query purposes, indexing is mostly independent of column-store vs.
row-store; the index still maps a key to a set of rows, so if a
multi-column index would be beneficial for row-store, it may still be
beneficial for column-store.  For update purposes, there is a
distinction; LucidDB's index creation has a special optimization for
converting compressed column-store representation into bitmap
representation in the case of a single-column index; for a multi-column
index, it can't use this trick.

Of course, in a properly designed star schema (ignoring complications
such as dimensions with history), multi-column indexes are the
exception; most indexes of interest for query processing are single-column:

- dimension table surrogate keys
- fact-table foreign keys referencing dimension table surrogate keys
- dimension attribute indexes
- fact attribute or measure indexes

On the ETL side, life is a lot messier because of the need to deal with
source-system keys, lookup tables, cross-references, all that jazz, so
multi-column indexes are more common there.
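
To make that concrete, here is a sketch with a hypothetical FACT_ORDERS fact table (the table and index names are made up for illustration):

-- fact-table foreign key referencing the dimension surrogate key
create index FACT_ORDERS_CUSTOMER_FK_IDX on FACT_ORDERS(DIM_CUSTOMER_ID);

-- dimension attribute index
create index DIM_CUSTOMER_STATE_IDX on DIM_CUSTOMER(STATE);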

> Good point - Can LucidDB do an inline table:
> select count(*) from (select my_original_query_columns from
> original_query_table group by original_grouping) t
> ?

Be careful with this one.  In this case, it has the desired effect, but
in other cases, the optimizer will push the projection all the way down,
eliminating a lot of the work supposedly being tested.

For example, if you do

select count(*) from (select i+j from t);

the optimizer will recognize that it's pointless to compute the sum, so
it will just count the rows.  Use

explain plan for select count(*) from (select i+j from t);
explain plan for select i+j from t;

to see the details (more than you want to know).

> I shouldn't ask, I should just test but I don't have a server up and
> running.  That reminds me, at some point I need to try and compile
> LucidDB (+farrago/fennel) on OS X.  :)

A few years ago someone almost got it working, but ran into a roadblock
with STLport not building on OS X.  Most likely by now STLport itself is
building fine there, but some tinkering with Fennel's build scripts
would almost certainly be required.  The Java side is always easy once
the native code builds.  (Farrago has a pure-Java build mode, leaving
out Fennel, but it's not very useful in the LucidDB context, since other
than the optimizer, most of the heavy lifting there is done in the
native code.)

JVS



Re: 30 Minutes of playing with LucidDB

Nicholas Goodman
In reply to this post by Nicholas Goodman

Again, just sharing some informal results.  If you're not interested, hit delete now.  :)

0: jdbc:luciddb:rmi://localhost> select CITY, min(zipcode) minzip, max(zipcode) maxzip, avg(zipcode) avgzip from DIM_CUSTOMER2 group by CITY order by 1;
-- TRUNCATED --
| zzx-city         | 10080   | 19860   | 14690   |
| zzy-city         | 10040   | 19380   | 14732   |
| zzz-city         | 10100   | 20000   | 15101   |
+------------------+---------+---------+---------+
17,576 rows selected (19.271 seconds)

Per the other emails in this thread (and the caution about pushdown optimization when testing), it does appear that sqlline was chewing up a bunch of time retrieving and formatting results:

0: jdbc:luciddb:rmi://localhost> select count(c1) from (select CITY c1, min(zipcode) minzip, max(zipcode) maxzip, avg(zipcode) avgzip from DIM_CUSTOMER group by CITY);
+---------+
| EXPR$0  |
+---------+
| 17576   |
+---------+
1 row selected (2.89 seconds)

I've also created some single-column indexes on columns that should be well suited for compression, and I have some storage information for them.

select 'State' "Column Name", count (distinct state) "Distinct Values" from DIM_CUSTOMER UNION select 'State Code' "Column Name", count (distinct state_code) "Distinct Values" from DIM_CUSTOMER;
+--------------+------------------+
| Column Name  | Distinct Values  |
+--------------+------------------+
| State        | 60               |
| State Code   | 60               |
+--------------+------------------+
2 rows selected (5.337 seconds)

create index DIM_CUSTOMER_S_IDX on DIM_CUSTOMER(STATE);
create index DIM_CUSTOMER_SC_IDX on DIM_CUSTOMER(STATE_CODE); 
analyze table DIM_CUSTOMER compute statistics for columns (ZIPCODE, STATE);  -- Needed to show up in next query
0: jdbc:luciddb:rmi://localhost> select index_name, pages, pages * 32 / 1024 as pages_mb from sys_boot.mgmt.page_counts_view where table_name='DIM_CUSTOMER';
+-------------------------------------------------------------------+--------+------------+
|                            INDEX_NAME                             | PAGES  |  PAGES_MB  |
+-------------------------------------------------------------------+--------+------------+
| SYS$CONSTRAINT_INDEX$SYS$PRIMARY_KEY$DIM_CUSTOMER                 | 674    | 21.062500  |
| SYS$CONSTRAINT_INDEX$SYS$UNIQUE_KEY$DIM_CUSTOMER$DIM_CUSTOMER_NK  | 674    | 21.062500  |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$DIM_CUSTOMER_ID                  | 125    | 3.906250   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$CITY                             | 496    | 15.500000  |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$STATE_CODE                       | 26     | 0.812500   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$ZIPCODE                          | 44     | 1.375000   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$DIM_CUSTOMER_NK                  | 125    | 3.906250   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$STREET                           | 497    | 15.531250  |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$BIRTHDATE                        | 157    | 4.906250   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$FIRSTNAME                        | 496    | 15.500000  |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$LASTNAME                         | 341    | 10.656250  |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$STATE                            | 27     | 0.843750   |
| SYS$CLUSTERED_INDEX$DIM_CUSTOMER$HOUSENUMBER                      | 44     | 1.375000   |
| SYS$DELETION_INDEX$DIM_CUSTOMER                                   | 1      | 0.031250   |
| DIM_CUSTOMER_SC_IDX                                               | 65     | 2.031250   |
| DIM_CUSTOMER_S_IDX                                                | 66     | 2.062500   |
+-------------------------------------------------------------------+--------+------------+

The index contains more pages than the column store (26 pages for the STATE_CODE column store, 65 pages for its index)?

Nick

Re: 30 Minutes of playing with LucidDB

John Sichi
Nick Goodman wrote:
> I've also created some single column indexes and have some storage
> information on single column indexes on columns that should be well
> suited for compression.
> [snip]
>
> The index contains more pages than the column store (26 pages for
> STATE_CODE, 65 pages for its index)?

That sounds right; here's how the math works out.

For the column-store, each entry is a 6-bit value ID.  (50 states means
6 bits, since 2^6=64 is the power-of-2 ceiling for 50.)  So total
storage should be on the order of 6 million bits for 1 million rows.  6
million bits = 6/8 million bytes = .75MB, which is close to your actual
result (.81MB).  Chalk up the difference to the per-page overhead for
storing the mapping from system-generated value ID to user-level state code.

For the unclustered index, let's start by supposing an unintelligent
bitmap index representation, where for each state, we store a bitmap of
1 million bits, with each bit indicating yea or nay for whether that row
ID (position within bitmap) identifies a row containing that state's
code.  There are 1 million rows in the table, so the bitmap for each
state will be 1 million bits.  50 states, 50 million bits; much worse
than the corresponding column-store (8.3x).

Of course, the actual bitmap representation is compressed, and since the
state distribution is non-uniform (lots of zeros for Rhode Island!), the
compression works well, so that the storage ends up only 2.5x over the
column-store.
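
In round numbers, recapping the figures above:

  column store (estimated):   1,000,000 rows x 6 bits    =  6,000,000 bits  ~ 0.75 MB  (actual: ~0.81 MB)
  naive bitmaps (estimated):  50 states x 1,000,000 bits = 50,000,000 bits  ~ 6.25 MB  (~8.3x the column store)
  compressed bitmap index (actual): ~2.0 MB  (~2.5x the column store)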

JVS

