User-defined aggregate functions in LucidDB -- a straw-man design

User-defined aggregate functions in LucidDB -- a straw-man design

Julian Hyde
Nick Goodman and I discussed a couple of weeks ago how we would add user-defined aggregate functions to Eigenbase (and therefore to LucidDB and DynamoDB).
 
I've come up with a rough design for how these might look to the DBA (installing the user-defined aggregates), SQL developer (writing the queries) and java developer (implementing the user-defined aggregates as java classes). I'd like your feedback.
 
We could have implemented user-defined aggregates in C or C++, but we discounted that possibility for now. Even though C/C++ would be a bit faster, and it would be easier to integrate into existing XOs, it is painful to write C/C++ code, compile it, link it, integrate it into LucidDB, and debug the inevitable crashes and memory leaks. To use aggregations implemented in Java, we will need to write some new execution objects (XOs) that can do aggregation in Java, but that's a topic for another email thread.
 
As often happens, there are simple core requirements that could be satisfied with a simple design, and there are more complex requirements that require a more powerful design. The complex requirements lead us to create an 'Aggregate' interface that can carry on a dialog with the query validator, telling the validator what the aggregate function is capable of, and producing an implementation that meets the query's needs.
 
The core requirements:
  1. User-defined aggregate functions are not really 'functions'. You could say that they are functions from a set of values to a scalar result, but it's not practical to build up a huge set. Real-world aggregate functions have an accumulator, an 'add' operator that gets applied for each incoming row, and a 'result' operator to get the final result.
  2. Validator needs to be able to deduce the argument types taken by the aggregate function, and the return type.
  3. Need a way to register aggregate functions. The SQL standard does not cover user-defined aggregate functions in the way that it does user-defined functions, but we propose a similar mechanism: the CREATE AGGREGATE FUNCTION command.
The complex requirements:
  1. Operator overloading. (More than one aggregate function with the same name but differently typed parameters.)
  2. Aggregate functions that take more than one parameter.
  3. Aggregate functions that return multiple outputs.
  4. Windowed aggregation. Windowed aggregates need to remove values as they leave the window. An aggregate function is suitable for windowed aggregation if it supports the 'remove' operation. But sometimes remove is difficult to implement, so we don't require that all aggregate functions support it.
  5. Optional 'merge' function. (Useful for merging partial results from different partitions using hybrid hash aggregation, or for rolling up rows in aggregate tables.)
  6. Rewrite rules. For example, AVG(x) can be rewritten to SUM(x) / COUNT(x). If SUM(x) and COUNT(x) are also being computed in the query, it is more efficient to compute the constituent parts just once.
  7. NULL handling. The usual SQL behavior is not to call an aggregate function if its parameter is NULL. (In the case of a multi-argument aggregate function, the function is not called if ANY of its parameters is NULL.)
The complex requirements make user-defined aggregates powerful, but they start to make the specification more difficult to understand. They are also more complicated to implement, and might have a small impact on performance. In other words, the complex requirements strain the 'keep simple things simple' mantra.
 
In my opinion, all of the complex requirements are worth having, and the SPI doesn't seem too complicated to me. I'd like to hear what y'all think, so I created this straw man design for you to shoot down. I've created a couple of detailed examples so you can see how it looks.
 
---------------------
 
// 1. The SPI.
 
// Definition of an aggregate function.
interface Aggregate {
    // Initializes the aggregate with the argument types of this particular
    // overloading, and the return types.
    //
    // If the aggregate has only a single return value, resultTypes will
    // have a single field called "RESULT".
    //
    // The types are exactly as specified in the SQL 'CREATE FUNCTION'
    // command. It is possible to declare the same aggregate class using
    // multiple CREATE FUNCTION commands, and thereby achieve
    // overloaded aggregate calls.
    void init(List<Field> argTypes, List<Field> resultTypes);
 
    // Returns whether this aggregate can support the optional 'remove' operation.
    boolean supportsRemove();
 
    // Sets whether created accumulators should support the 'remove' option.
    // Throws if remove is not supported.
    void setRemove(boolean remove);
 
    // Returns whether this aggregate can support the optional 'merge' operation.
    boolean supportsMerge();
 
    // Sets whether created accumulators should support the 'merge(P)' option.
    // Throws if merge is not supported.
    void setMerge(boolean merge);
 
    // Called by the system to create a new accumulator, one for each distinct
    // value of the GROUP BY key. The accumulator should have the capabilities
    // required by setRemove and setMerge. If remove or merge are not required,
    // the aggregate may choose to use a cheaper implementation of the accumulator.
    Accumulator createAccumulator();   
}
 
// Accumulator to build up the value of the aggregate for a particular key.
interface Accumulator {
    // Merges this accumulator with one or more others, and returns
    // the result. Often an accumulator implementation will return
    // this accumulator.
    Accumulator merge(Accumulator... accumulators);
 
    // Must also have the following methods:
    // void add(<<Parameter Type(s)>>);
    // void remove(<<Parameter Type(s)>>);
    // <<Result Type>> getResult();
    //
    // or if there are multiple results:
    // <<Result X Type>> getX();
    // <<Result Y Type>> getY();
}
 
class Field {
    public final String name;
    public final int type; // a value from SqlTypes
    public final int precision;
    public final int scale;
    public final Class javaType;
}
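
// For instance, an aggregate's init could sanity-check the overloading it
// was registered with. The following is a rough sketch only, extending the
// SumAggregate example from section 4 below; it assumes that SqlTypes
// mirrors the java.sql.Types constants, which is an assumption of the
// sketch, not part of the proposal.

import java.util.List;

public class StrictSumAggregate extends SumAggregate {
    public void init(List<Field> argTypes, List<Field> resultTypes) {
        if (argTypes.size() != 1
            || argTypes.get(0).type != java.sql.Types.INTEGER)
        {
            throw new IllegalArgumentException(
                "MySum requires a single INTEGER argument");
        }
    }
}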
 
 
// 2. Registering the aggregate functions.
 
-- Simple 'MySum' aggregate, equivalent to the builtin 'Sum' aggregate.
create aggregate function MySum(
    integer n)
returns integer
language java
parameter style system defined java
no sql
external name 'class com.acme.SumAggregate';
 
-- More complex 'LinearRegress' aggregate that takes (x, y) arguments
-- and returns (slope, intercept, count).
create aggregate function LinearRegress(
    double x,
    double y)
returns row (
    double slope,
    double intercept,
    integer count)
language java
parameter style system defined java
no sql
external name 'class com.acme.LinearRegressAggregate';
 
// 3. Using the aggregate functions in SQL queries.
 
-- Simple GROUP BY
select deptno, MySum(sal)
from emp
group by deptno;
 
-- Windowed aggregation
select ename,
    deptno,
    hiredate,
    MySum(sal) over (
        partition by deptno
        order by hiredate
        range interval '1' year preceding)
from emp;
 
-- Linear regress takes multiple arguments and returns a record
-- with multiple fields.
select LinearRegress(salary, salesRevenue).slope,
    LinearRegress(salary, salesRevenue).intercept
from emp
where deptno in (10, 40);
 
// 4. Supporting classes.
 
public class SumAggregate implements Aggregate {
    // all implementations of Aggregate must have a public default constructor
    public SumAggregate() {
    }
    public void init(
        List<Field> argTypes,
        List<Field> returnTypes)
    {
        // Assume that argTypes is {n integer}
        // and returnTypes is {result integer}.
    }
    public boolean supportsRemove() {
        return true;
    }
    public void setRemove(boolean remove) {
        // nothing to do; we always support remove
    }
    public boolean supportsMerge() {
        return true;
    }
    public void setMerge(boolean merge) {
        // nothing to do; we always support merge
    }
    public Accumulator createAccumulator() {
        return new SumAccumulator();
    }
}
 
public class SumAccumulator implements Accumulator {
    private int total;
 
    // called by SumAggregate
    SumAccumulator() {
        total = 0;
    }
    public void add(int n) {
        total += n;
    }
    public void remove(int n) {
        total -= n;
    }
    public Accumulator merge(Accumulator... accumulators) {
        for (Accumulator accumulator : accumulators) {
            total += ((SumAccumulator) accumulator).total;
        }
        return this;
    }
    public int getResult() {
        return total;
    }
}
 
public class LinearRegressAggregate implements Aggregate {
    public LinearRegressAggregate() {
    }
    public void init(
        List<Field> argTypes,
        List<Field> returnTypes)
    {
        // Assume that argTypes is {x double, y double}
        // and returnTypes is {intercept double, slope double, count integer}.
    }
    public boolean supportsRemove() {
        return true;
    }
    public void setRemove(boolean remove) {
        // nothing to do; we always support remove
    }
    public boolean supportsMerge() {
        return true;
    }
    public void setMerge(boolean merge) {
        // nothing to do; we always support merge
    }
    public Accumulator createAccumulator() {
        return new LinearRegressAccumulator();
    }
}
 
class LinearRegressAccumulator implements Accumulator {
    double sumX, sumY, sumXX, sumXY;
    int n;
 
    LinearRegressAccumulator() {
        sumX = sumY = sumXX = sumXY = 0d;
        n = 0;
    }
    public void add(double x, double y) {
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
        ++n;
    }
    public void remove(double x, double y) {
        sumX -= x;
        sumY -= y;
        sumXX -= x * x;
        sumXY -= x * y;
        --n;
    }
    public Accumulator merge(Accumulator... accumulators) {
        for (Accumulator accumulator : accumulators) {
            LinearRegressAccumulator lra =
                (LinearRegressAccumulator) accumulator;
            sumX += lra.sumX;
            sumY += lra.sumY;
            sumXX += lra.sumXX;
            sumXY += lra.sumXY;
            n += lra.n;
        }
        return this;
    }
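    // Ordinary least-squares fit:
    // slope = (n * sum(xy) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
    // intercept = mean(y) - slope * mean(x)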
    public double getSlope() {
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }
    public double getIntercept() {
        double xMean = sumX / n;
        double yMean = sumY / n;
        return yMean - getSlope() * xMean;
    }
    public int getCount() {
        return n;
    }
}
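
// 5. To make the lifecycle concrete, here's a sketch (illustrative only,
// not part of the proposal; the Row type and its columns are invented)
// of how a GROUP BY XO might drive the SPI for the
// "select deptno, MySum(sal) from emp group by deptno" query above.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByDriverSketch {
    // Invented stand-in for a row of the emp table.
    static class Row {
        final int deptno;
        final int sal;
        Row(int deptno, int sal) {
            this.deptno = deptno;
            this.sal = sal;
        }
    }

    public static void run(List<Row> rows) {
        Aggregate agg = new SumAggregate();
        // init would be called with the types declared in CREATE AGGREGATE
        // FUNCTION; SumAggregate ignores them, so this sketch skips it.
        agg.setRemove(false); // plain GROUP BY: no windowing
        agg.setMerge(false);  // single partition: no merging
        Map<Integer, SumAccumulator> accumulators =
            new HashMap<Integer, SumAccumulator>();
        for (Row row : rows) {
            // one accumulator per distinct GROUP BY key
            SumAccumulator acc = accumulators.get(row.deptno);
            if (acc == null) {
                acc = (SumAccumulator) agg.createAccumulator();
                accumulators.put(row.deptno, acc);
            }
            acc.add(row.sal);
        }
        for (Map.Entry<Integer, SumAccumulator> e : accumulators.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().getResult());
        }
    }
}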
 
Julian


Re: User-defined aggregate functions in LucidDB -- a straw-man design

John Sichi
Thanks Julian; let's get the next draft into the wiki to make it easier
to collect review comments.

Just for fun, here's an example UDAF from Hive, which uses a similar but
not identical interface:

http://svn.apache.org/repos/asf/hadoop/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udaf/example/UDAFExampleAvg.java

Julian and I talked offline about a mechanism for reusing the existing
UDF method resolution mechanism:  the user can provide a static factory
method which is capable of instantiating the Aggregate interface; this
static method entry point is what will get referenced in the usual
EXTERNAL NAME clause of CREATE AGGREGATE FUNCTION.  We'll invoke this
via reflection during query preparation (for type derivation), and again
later during query execution (as we initialize for processing rows).
The codegen for execution is going to have to take care of calling the
various getters for multi-valued return.

For the interface names/packaging and Field type specifications, we'll
want to be careful in defining and exposing them, since they could be
reused later as the basis for the kind of programmatic metadata
derivation facility Julian has requested for UDX's in the past (as
opposed to today's purely declarative facility).

JVS


Re: User-defined aggregate functions in LucidDB -- a straw-man design

Julian Hyde


> John V. Sichi wrote:
>
> Julian and I talked offline about a mechanism for reusing the existing
> UDF method resolution mechanism: the user can provide a static factory
> method which is capable of instantiating the Aggregate interface; this
> static method entry point is what will get referenced in the usual
> EXTERNAL NAME clause of CREATE AGGREGATE FUNCTION. We'll invoke this
> via reflection during query preparation (for type derivation), and again
> later during query execution (as we initialize for processing rows).
> The codegen for execution is going to have to take care of calling the
> various getters for multi-valued return.

For UDFs, there's no question that a static function is a nice, concise way
to express the parameter types and return types. The value of having a dummy
UDA function is less, because it has to return an aggregate, and therefore
we can only glean the parameter types, not the return type.

I'm still open to allowing dummy functions if you think they would make some
important use cases simpler.

> For the interface names/packaging and Field type
> specifications, we'll
> want to be careful in defining and exposing them, since they could be
> reused later as the basis for the kind of programmatic metadata
> derivation facility Julian has requested for UDX's in the past (as
> opposed to today's purely declarative facility).

I was a bit tentative about introducing Field. I just couldn't see a simpler
way to communicate the necessary metadata. I tried to tie it as closely as
possible to JDBC's type system. The important attributes are {type,
precision, scale} -- see e.g.
http://java.sun.com/javase/6/docs/api/java/sql/ResultSetMetaData.html -- but
JDBC never actually puts them together in one struct.

Julian



Re: User-defined aggregate functions in LucidDB -- a straw-man design

John Sichi
Julian Hyde wrote:

>
>> John V. Sichi wrote:
>>
>> Julian and I talked offline about a mechanism for reusing the existing
>> UDF method resolution mechanism: the user can provide a static factory
>> method which is capable of instantiating the Aggregate interface; this
>> static method entry point is what will get referenced in the usual
>> EXTERNAL NAME clause of CREATE AGGREGATE FUNCTION. We'll invoke this
>> via reflection during query preparation (for type derivation), and again
>> later during query execution (as we initialize for processing rows).
>> The codegen for execution is going to have to take care of calling the
>> various getters for multi-valued return.
>
> For UDFs, there's no question that a static function is a nice, concise
> way to express the parameter types and return types. The value of having
> a dummy UDA function is less, because it has to return an aggregate, and
> therefore we can only glean the parameter types, not the return type.
>
> I'm still open to allowing dummy functions if you think they would make
> some important use cases simpler.

I meant that we would use the static method as a factory method,
responsible for returning an instance of Aggregate. This allows us to
avoid inventing a new syntax for referencing a class instead of a method
(since all of the existing catalog/DDL infrastructure for UDR's is in
terms of methods rather than classes).

>> For the interface names/packaging and Field type specifications, we'll
>> want to be careful in defining and exposing them, since they could be
>> reused later as the basis for the kind of programmatic metadata
>> derivation facility Julian has requested for UDX's in the past (as
>> opposed to today's purely declarative facility).
>
> I was a bit tentative about introducing Field. I just couldn't see a
> simpler way to communicate the necessary metadata. I tried to tie it as
> closely as possible to JDBC's type system. The important attributes are
> {type, precision, scale} -- see e.g.
> http://java.sun.com/javase/6/docs/api/java/sql/ResultSetMetaData.html --
> but JDBC never actually puts them together in one struct.

Perhaps we could instead do it as a type string, e.g. 'VARCHAR(7)'?

JVS


Re: [Farrago-developers] User-defined aggregate functions in LucidDB -- a straw-man design

Julian Hyde


> -----Original Message-----
> From: John V. Sichi [mailto:[hidden email]]
> Sent: Saturday, January 02, 2010 10:52 PM
> To: Mailing list for users of LucidDB
> Cc: [hidden email]
> Subject: Re: [Farrago-developers] [luciddb-users]
> User-defined aggregate functions in LucidDB -- a straw-man design
>
> Julian Hyde wrote:
> > [...]
>
> I meant that we would use the static method as a factory method,
> responsible for returning an instance of Aggregate. This allows us to
> avoid inventing a new syntax for referencing a class instead of a method
> (since all of the existing catalog/DDL infrastructure for UDR's is in
> terms of methods rather than classes).

So the factory method would take no arguments, and the aggregate would be
responsible for fully describing its parameters & return type? I'm fine with
that.

I'd be happiest if we allowed both factory method and class name. So, given
class

package com.acme;

class MyUda {
    public static Aggregate createSumAgg() {
        return new SumAggregate();
    }

    public static class SumAggregate implements Aggregate {
        public SumAggregate() {
        }
        ...
    }
}

we should allow external names that reference either a class or a factory
method:

-- Aggregate function based on factory method.
-- (Factory method must be public, static, take zero parameters,
-- and the return type must implement the Aggregate interface.)
CREATE AGGREGATE FUNCTION ...
EXTERNAL NAME 'class com.acme.MyUda.createSumAgg';

-- Aggregate function based on a class.
-- (Class must be public, either top-level or a static inner class,
-- have a public constructor with zero parameters, and implement
-- the Aggregate interface.)
CREATE AGGREGATE FUNCTION ...
EXTERNAL NAME 'class com.acme.MyUda.SumAggregate';


> > jhyde:
> > I was a bit tentative about introducing Field. I just couldn't see a
> > simpler way to communicate the necessary metadata.
>
> jvs:
> Perhaps we could instead do it as a type string, e.g. 'VARCHAR(7)'?

The implementor of the UDA would have to parse the type string. Is that
really easier than taking a {type, precision, scale} record?

Julian



Re: [Farrago-developers] User-defined aggregate functions in LucidDB -- a straw-man design

John Sichi
Julian Hyde wrote:
> So the factory method would take no arguments, and the aggregate would be
> responsible for fully describing its parameters & return type? I'm fine with
> that.
>
> I'd be happiest if we allowed both factory method and class name.

OK, we'll leave that up to the person doing the implementation.  Note
that EXTERNAL NAME can reference a method name without
parentheses-enclosed parameters, meaning there'll be ambiguity between
class vs method name, but this can be resolved during reflection.  See
class net.sf.farrago.query.FarragoUserDefinedRoutine for where this
currently happens, along with the corresponding OJRex codegen.

>>> jhyde:
>>> I was a bit tentative about introducing Field. I just couldn't see a
>>> simpler way to communicate the necessary metadata.
>> jvs:
>> Perhaps we could instead do it as a type string, e.g. 'VARCHAR(7)'?
>
> The implementor of the UDA would have to parse the type string. Is that
> really easier than taking a {type, precision, scale} record?

Hmmm, true.  It would be nice to avoid creating Yet Another Type
Descriptor, but I suppose exposing internal interfaces such as
RelDataType wouldn't be a good idea either since we want UDAF's to be
approachable without arcane knowledge.  However, there are a few more
points to consider for whatever we come up with:

* how extensible does it need to be?  I.e. mightn't there be a desire to
allow aggregation over UDT's in the future?

* since it's a Java interface, should we allow for any use of type
families instead of specific types (e.g. "give me any VARCHAR as input
and I'll produce a VARCHAR of the same precision as output"); currently
we don't have this, so a lot of applib UDX's declare VARCHAR(65535) for
both input and output instead

JVS


Re: User-defined aggregate functions in LucidDB -- a straw-man design

ngoodman
Julian,

Thanks a bunch for tackling the initial design!

re: 7. NULL Handling.

I'd think that the function should be skipped only if all the inputs are NULL.  If a scalar is NULL, one can say it is NULL.  If a complex type has a NULL field, but others are defined, I don't think we can refer to the complex type as NULL, as it does have some defined parts.

Personally, I'd like UDAFs to be called in all circumstances (including NULLs) so that the UDAF can determine how it wants to handle them.  I realize that since SQL:2003 doesn't actually specify UDAFs this wouldn't be divergence, but it would be further from the SQL standard's precepts for UDFs.  Postgres apparently has a "strict" qualifier for a transition function (what we're calling the 'add' on the accumulator) that lets the UDAF author determine the behavior.

Calling in all circumstances (even w/ NULLs) also simplifies the extensions for batching I discuss below.

I have two more "complex" requirements that I'll put out for discussion.  Unlike the complex requirements you highlight, which you are certain are worth the additional complexity, I'm not convinced that these two are worth it.  Worth discussing for sure; implementing, I'm not convinced.  :)

8. Accumulator State Serialization.

An additional option on the interface, supportsSwapping(), would allow the state of the accumulator to be serialized (using standard Java serialization or another appropriate method).  This could enable a swapping hash aggregation operator for long-running hash aggregations.  For instance, if we are aggregating across 100k groups and the dataset is not ordered, we'd have to keep 100k accumulator objects in memory.  If we had a method to swap, we could ask the accumulator to swap out when necessary or desired, release the object, and then reinstantiate it when swapping back in.  I don't think the initial implementation of the UDAF XO for aggs would have this capability, but I can imagine use cases where it could be desirable.  Also, thinking about eventually keeping some up-to-date aggregates, or materialized views, I can imagine use cases where keeping a persistent "point in time" accumulator stored, then running new rows through it, then swapping it out again could be interesting.  However, I'm not by any means close to having a design for materialized views etc. in mind, so I can't say whether this would be useful or not.

9. Batch Processing.

An additional option to give the UDAF writer a chance to optimize the code to run in batches.  Add the option supportsBatching(), which, if it returns true, would require the UDAF writer to implement both scalar "add"/"remove" methods and array-based "add"/"remove" methods.  In your example:

public void add(int n) {
    total += n;
}
public void remove(int n) {
    total -= n;
}
if supportsBatching() returns true then the UDAF would also have to implement:
public void add(int[] n) {
    for (int i = 0; i < n.length; i++)
        total += n[i];
}
public void remove(int[] n) {
    for (int i = 0; i < n.length; i++)
        total -= n[i];
}

I'm not sure what the multi-input add functions would look like.  Perhaps:

public void add(double[] x, double[] y) {
    // x and y must be the same length
    for (int i = 0; i < x.length; i++) {
        // process x[i] and y[i]
    }
}


For simple arithmetic I don't see this as a huge boon, but I can see some expensive GIS and other calculations that could gain efficiency by processing in batch.  In terms of efficiency, would making a method call in a loop a few million times be as efficient as a for loop inside a method (in terms of JVM/CPU)?  I'm not sure, myself.

Thanks again for starting the discussion.


Re: User-defined aggregate functions in LucidDB -- a straw-man design

Julian Hyde
 


From: Nicholas Goodman [mailto:[hidden email]]
Sent: Tuesday, January 05, 2010 9:59 AM
To: Mailing list for users of LucidDB
Cc: [hidden email]
Subject: Re: [luciddb-users] User-defined aggregate functions in LucidDB -- a straw-man design

> Julian,
>
> Thanks a bunch for tackling the initial design!
>
> re: 7. NULL Handling.
>
> I'd think that the function should be skipped only if all the inputs are
> NULL.  If a scalar is NULL, one can say it is NULL.  If a complex type
> has a NULL field, but others are defined, I don't think we can refer to
> the complex type as NULL, as it does have some defined parts.
>
> Personally, I'd like UDAFs to be called in all circumstances (including
> NULLs) so that the UDAF can determine how it wants to handle them.  I
> realize that since SQL:2003 doesn't actually specify UDAFs this wouldn't
> be divergence, but it would be further from the SQL standard's precepts
> for UDFs.  Postgres apparently has a "strict" qualifier for a transition
> function (what we're calling the 'add' on the accumulator) that lets the
> UDAF author determine the behavior.
>
> Calling in all circumstances (even w/ NULLs) also simplifies the
> extensions for batching I discuss below.
This is a case of reason vs. the SQL standard. The SQL standard sends a row to an n-argument agg function only if all n arguments are not null. For example, this is what it says about binary aggregate functions (SQL:2008, ch 2, 4.15.4 Aggregate Functions):

The binary aggregate functions take a pair of arguments, the <dependent variable expression> and the <independent variable expression>, which are both <numeric value expression>s. Any row in which either argument evaluates to the null value is removed from the group. If there are no rows remaining in the group, then the result of REGR_COUNT is 0 (zero), and the other binary aggregate functions result in the null value.

Let's make the default behavior consistent with the standard. That is, don't pass in rows where any of the args is null. (See my last paragraph for another benefit of this behavior: performance.)

The standard provides a flag 'CALLED ON NULL INPUT' on UDFs. We can have a similar flag for UDAs.

> I have two more "complex" requirements that I'll put out for discussion.
> Unlike the complex requirements you highlight, which you are certain are
> worth the additional complexity, I'm not convinced that these two are
> worth it.  Worth discussing for sure; implementing, I'm not convinced. :)

> 8. Accumulator State Serialization.
>
> An additional option on the interface, supportsSwapping(), would allow
> the state of the accumulator to be serialized (using standard Java
> serialization or another appropriate method).  This could enable a
> swapping hash aggregation operator for long-running hash aggregations.
> For instance, if we are aggregating across 100k groups and the dataset
> is not ordered, we'd have to keep 100k accumulator objects in memory.
> If we had a method to swap, we could ask the accumulator to swap out
> when necessary or desired, release the object, and then reinstantiate
> it when swapping back in.  I don't think the initial implementation of
> the UDAF XO for aggs would have this capability, but I can imagine use
> cases where it could be desirable.  Also, thinking about eventually
> keeping some up-to-date aggregates, or materialized views, I can imagine
> use cases where keeping a persistent "point in time" accumulator stored,
> then running new rows through it, then swapping it out again could be
> interesting.  However, I'm not by any means close to having a design for
> materialized views etc. in mind, so I can't say whether this would be
> useful or not.
Definitely very useful.
 
Thinking from the perspective of the server generating the wrapper code, I would want to know whether the aggregate supports swapping at query preparation time (i.e. before I generate the wrapper code). 'boolean supportsSwapping()' would therefore be a method on the Aggregate interface, because there are no instances of Accumulator until query execution time.
 
I'd build swapping using the Java concept of externalization. This means that an Accumulator would implement methods 'void writeExternal(ObjectOutput out)' and 'void readExternal(ObjectInput in)'. And the method on Aggregate would be called 'boolean supportsExternalization()'.
 
We can defer this feature to a future version.
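
For the Sum example, an externalizable accumulator might look something like this (just a sketch; the class name is invented):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Sketch only: a swappable version of SumAccumulator. The engine could
// write the accumulator out, discard it, and later read it back in
// before resuming aggregation.
public class SwappableSumAccumulator implements Accumulator, Externalizable {
    private int total;

    // Externalizable requires a public default constructor.
    public SwappableSumAccumulator() {
    }

    public void add(int n) {
        total += n;
    }

    public Accumulator merge(Accumulator... accumulators) {
        for (Accumulator accumulator : accumulators) {
            total += ((SwappableSumAccumulator) accumulator).total;
        }
        return this;
    }

    public int getResult() {
        return total;
    }

    // Called when the engine swaps this accumulator out.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(total);
    }

    // Called when the engine swaps this accumulator back in.
    public void readExternal(ObjectInput in)
        throws IOException, ClassNotFoundException
    {
        total = in.readInt();
    }
}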
> 9. Batch Processing.
>
> An additional option to give the UDAF writer a chance to optimize the
> code to run in batches.  Add the option supportsBatching(), which, if it
> returns true, would require the UDAF writer to implement both scalar
> "add"/"remove" methods and array-based "add"/"remove" methods.  In your
> example:
>
> public void add(int n) {
>     total += n;
> }
> public void remove(int n) {
>     total -= n;
> }
>
> if supportsBatching() returns true then the UDAF would also have to
> implement:
>
> public void add(int[] n) {
>     for (int i = 0; i < n.length; i++)
>         total += n[i];
> }
> public void remove(int[] n) {
>     for (int i = 0; i < n.length; i++)
>         total -= n[i];
> }
>
> I'm not sure what the multi-input add functions would look like.
> Perhaps:
>
> public void add(double[] x, double[] y) {
>     // x and y must be the same length
>     for (int i = 0; i < x.length; i++) {
>         // process x[i] and y[i]
>     }
> }
>
> For simple arithmetic I don't see this as a huge boon, but I can see
> some expensive GIS and other calculations that could gain efficiency by
> processing in batch.  In terms of efficiency, would making a method call
> in a loop a few million times be as efficient as a for loop inside a
> method (in terms of JVM/CPU)?  I'm not sure, myself.
I don't see a massive value to batching.  There would be less time spent in the user's code, but the server would have to spend extra effort building the big array. Extra complexity with not much gain.
 
Let's not kid ourselves. If you need searing performance, implement your aggregate as a C++ built-in. This is open source, after all.
 
Note that our examples have used native types as arguments. If you allowed nulls, we'd have to change my
 
void add(int n) {
    total += n;
}
 
method to
 
void add(Integer n) {
    if (n != null) {
        total += n.intValue();
    }
}
 
My hunch is that native types are an area where saving the pennies will actually pay off. So, the SQL standard's behavior of not passing in null or partially null values is actually right, albeit for the wrong reason.
 
Julian


Re: [Farrago-developers] User-defined aggregate functions in LucidDB -- a straw-man design

Nicholas Goodman

On Jan 5, 2010, at 11:05 AM, Julian Hyde wrote:

> Let's make the default behavior consistent with the standard. That is,
> don't pass in rows where any of the args is null. (See my last paragraph
> for another benefit of this behavior: performance.)
>
> The standard provides a flag 'CALLED ON NULL INPUT' on UDFs. We can have
> a similar flag for UDAs.


I assumed that since UDAs aren't in the standard, there wouldn't be any collateral for us to evaluate for our use case.  I agree that the standard does give some guidance there (with a precedent: multi-input aggregate calls are not made if any argument is null).

Like Postgres (with its "strict" qualifier), I do like the flag, which can be done as a follow-on at some point for UDFs and UDAs.  The default, until such time as we can implement the flag, should be as you originally proposed: not called if ANY args are NULL.

The only thing that seems kind of weird is that the behavior the Aggregate/Accumulator supports is now being defined in two places: the interface (supportsRemove(), etc.) and declaratively (CALLED ON NULL INPUT).  I think this, while a bit confusing for new developers wishing to write new functions, is worthwhile in order to maintain as close as possible compatibility with the SQL standard.
> I'd build swapping using the Java concept of externalization. This means
> that an Accumulator would implement methods 'void writeExternal(ObjectOutput
> out)' and 'void readExternal(ObjectInput in)'. And the method on Aggregate
> would be called 'boolean supportsExternalization()'.
>
> We can defer this feature to a future version.
Should we add 'boolean supportsSwapping()' to the interface now and just defer the feature implementation?  I'm trying to think of the impact of adding an additional method to an interface which we hope will have a widely cast set of authors (inside and outside of Eigenbase).

> I don't see a massive value to batching.  There would be less time spent
> in the user's code, but the server would have to spend extra effort
> building the big array. Extra complexity with not much gain.
>
> Let's not kid ourselves. If you need searing performance, implement your
> aggregate as a C++ built-in. This is open source, after all.

Sure, let's forego any type of batching at the interface/XO level.

Since UDAs are classes with their own internal implementation/state anyhow, the UDA developer is free to do their own internal queuing/batching as well.  In other words, the add() function can simply append to an internal structure and, every 1000 or so values, do an intermediate computation.  In fact, some aggregates may need the entire data set, and may do NO intermediate computation up until the moment that one of the getters, getSlope()/getIntercept() etc., is called.
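
For example, here's a rough sketch of that idea (MedianAccumulator is invented for illustration; median is a stand-in for an aggregate that needs the whole data set):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: an accumulator that buffers every incoming value and does
// no intermediate computation until the result getter is called.
public class MedianAccumulator implements Accumulator {
    private final List<Double> values = new ArrayList<Double>();

    public void add(double x) {
        values.add(x); // just queue the value
    }

    public Accumulator merge(Accumulator... accumulators) {
        for (Accumulator accumulator : accumulators) {
            values.addAll(((MedianAccumulator) accumulator).values);
        }
        return this;
    }

    // All the work happens here (assumes at least one value was added).
    public double getResult() {
        List<Double> sorted = new ArrayList<Double>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return (n % 2 == 1)
            ? sorted.get(n / 2)
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2;
    }
}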

Nick


Re: [Farrago-developers] User-defined aggregate functions in LucidDB -- a straw-man design

John Sichi
Nicholas Goodman wrote:

>
> On Jan 5, 2010, at 11:05 AM, Julian Hyde wrote:
>
>> The standard provides a flag 'CALLED ON NULL INPUT' on UDFs. We can
>> have a similar flag for UDAs
>>
>
> Like Postgres (with its "strict" qualifier) I do like the flag which
> can be done as a follow on at some point, for UDF and UDAs.  The
> default, until such time as we can implement the flag above, should be
> as you originally proposed: not called if ANY args are NULL.

I'm fine with whatever is decided here; I'll just mention that we do
already implement the CALLED ON NULL INPUT attribute for normal UDF's,
and the standard-defined default is that the function is called (you
have to specify RETURNS NULL ON NULL INPUT in order to make it skip).

JVS
