Quantcast

ETL - key strategies

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

ETL - key strategies

Francisco Reyes-3
Current setup is Postgres table

transactions_table
natural_key text <-- data is actually char(19)
data1
data2

On Lucid I am making it more structured and getting the natural_key to it's
own table.

keys_table
lucid_key integer
natural_key char(19) <-- exactly same as postgres

fact_table
fact_key integer
lucid_key integer
data1
data2

On the postgres side there isn't a table with a unique list of all the
natural_key values.

Currently I connect to postgres from lucid, create a uniqe list of
natural_key insert it into a table on lucid then select into keys_table any
rows that do not exist in keys_table by natural_key.

Is that the most efficient way or would it be more efficient to do a
postgres->lucid query like
select into lucid select from postgres where natural_key not in lucid keys
table

How does lucid handles postgres to lucid queries where I am doing something
more complex than just a select into .. select ?

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
luciddb-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/luciddb-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ETL - key strategies

John Sichi
Administrator
Francisco Reyes wrote:
> Currently I connect to postgres from lucid, create a uniqe list of
> natural_key insert it into a table on lucid then select into keys_table any
> rows that do not exist in keys_table by natural_key.

> Is that the most efficient way or would it be more efficient to do a
> postgres->lucid query like
> select into lucid select from postgres where natural_key not in lucid keys
> table

A more efficient way is to use upsert to do it all in one shot, without
any intermediate table (here d.t is the LucidDB table and d.emp is an
hsqldb external table):

merge into d.t using (select distinct deptno from d.emp) s on
t.deptno=s.deptno when not matched then insert (deptno) values (deptno);

At the bottom of my reply, I've included the EXPLAIN PLAN I get for
this.  You should confirm that you get something similar with PostgreSQL.

Besides skipping the intermediate table, the upsert approach also
eliminates a superfluous MIN aggregation added by the NOT IN rewrite.
Note that it is pushing the DISTINCT aggregation down into the remote
source (the GROUP BY inside MedJdbcQueryRel foreignSql), so only the
distinct key list is being pulled over from PostgreSQL, which is what we
want.

It's possible to skip the intermediate table with the INSERT approach
also, but when I tried that, LucidDB's optimizer did not push down the
DISTINCT.

> How does lucid handles postgres to lucid queries where I am doing something
> more complex than just a select into .. select ?

Good question.  Currently, LucidDB lacks stats about foreign tables, so
the optimizer doesn't get the information it normally would about row
counts, distinct value counts, histograms, etc.  For complex SQL, this
can cause trouble, in which case the solution is currently to
materialize intermediate results inside LucidDB (as you were doing), and
then possibly run ANALYZE on them before continuing with the ETL
workflow which processes them.  For simple SQL like the upsert above,
LucidDB can do fine without the advanced optimizer info, so you can skip
the materialization.  It's always a good idea to use ANALYZE to verify
that the optimizer is doing what you're hoping it's doing.

There are some well-known remote SQL optimization techniques which
LucidDB doesn't have yet, such as pushing lists of join keys down into
the foreign server (either via a temp table or big IN list) to perform
prefiltering.  This could be implemented via a semijoin technique
similar to what LucidDB already uses for local star joins.

JVS

0: jdbc:luciddb:> explain plan for merge into d.t using (select distinct
deptno from d.emp) s on t.deptno=s.deptno when not matched then insert
(deptno) values (deptno);
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
      column0
                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FennelToIteratorConverter
 
                  |
|   LcsTableMergeRel(table=[[LOCALDB, D, T]])
 
                  |
|     IteratorToFennelConverter
 
                  |
|       IterCalcRel(expr#0..3=[{inputs}], expr#4=[CAST($t3):INTEGER NOT
NULL], expr#5=[IS NULL($t1)], $f0=[$t4], $condition=[$t5])
                 |
|         FennelToIteratorConverter
 
                  |
|           LhxJoinRel(leftKeys=[[2]], rightKeys=[[0]],
joinType=[RIGHT])
                                 |
|             FennelReshapeRel(projection=[[0, 1, 0]],
outputRowType=[RecordType(INTEGER NOT NULL DEPTNO, BIGINT LCS_RID,
INTEGER CAST($0):INTEGER) NOT NULL])  |
|               LcsRowScanRel(table=[[LOCALDB, D, T]], projection=[[0,
LCS_RID]], clustered indexes=[[SYS$CLUSTERED_INDEX$T$DEPTNO]])
                  |
|             IteratorToFennelConverter
 
                  |
|               ResultSetToFarragoIteratorConverter
 
                  |
|                 MedJdbcQueryRel(foreignSql=[SELECT "DEPTNO"
 
                  |
| FROM (SELECT "DEPTNO"
 
                  |
| FROM "SALES"."EMP")
 
                  |
| GROUP BY "DEPTNO"])
 
                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
luciddb-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/luciddb-users
Loading...