Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable custom aggregate functions (take 2) #529

Merged
merged 36 commits into from
Sep 8, 2022

Conversation

llimllib
Copy link
Contributor

@llimllib llimllib commented Sep 5, 2022

This PR builds on #407, updating it to current HEAD, adding more tests, and documenting it.

This PR closes #204, and is composed mostly of the work that @AnyhowStep did so long ago in that thread.

#407 notes that aggregate window functions do not work, but I think that's a different issue that should be tackled in another PR.

The full list of changes follows:

  • Update README.md to document create_aggregate
  • extract parseFunctionArguments into a function that can be used by both create_aggregate and create_function
    • Please verify that nothing substantial has changed here - I'm worried I missed some small divergence between Add: aggregate functions. #407 and current HEAD and this function should only be an extraction, not changing any behavior
  • extract setFunctionResults into a function
    • same caveats apply
  • export the sqlite3_aggregate_context function
  • Add the create_aggregate function
    • The signature: create_aggregate(function_name, initial_value, { step: (state, ...value) =>, finalize: (state) =>})
    • create_aggregate calls sqlite3_aggregate_context to allocate 1 byte of memory per instance of the aggregate function, which it uses as a key for its state array. sqlite3_aggregate_context documentation
  • Add tests

@llimllib llimllib marked this pull request as ready for review September 5, 2022 01:43
@llimllib
Copy link
Contributor Author

llimllib commented Sep 5, 2022

cc @lovasoa

Copy link
Member

@lovasoa lovasoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thank you !

A few things:

  • Can we avoid calling init in create_aggregate itself ? I think this is a very surprising behavior for the programmer. init should be called every time the aggregate function is called.
  • Can we make the step function return the new state, like the standard reduce ? This would avoid the awkward function init() { return {sum: 0} }
  • Taking three functions as arguments makes the code using create_aggregate a little bit hard to read. Can we take an object instead ?
db.create_aggregate("json_array", {
    init: () => [],
    step: (state, value) => [...state, value],
    finalize: state => JSON.stringify(state)
})
  • Can we make init and finalize optional ? init should default to () => null and finalize to state => state. In the end we should be able to call
db.create_aggregate("js_sum", { step: (a,b) => a+b })
  • Can we add memory leak tests like the ones we have in test/test_functions_recreate.js

@llimllib
Copy link
Contributor Author

llimllib commented Sep 5, 2022

Can we avoid calling init in create_aggregate itself ? I think this is a very surprising behavior for the programmer. init should be called every time the aggregate function is called.

Yes, thank you! Bad mental model striking there.

I like all your proposed changes, will do.

@llimllib
Copy link
Contributor Author

llimllib commented Sep 6, 2022

A question: when an instance of create_aggregate is returned, do I have to worry about it being thread-safe?

That is, if I say: x = create_aggregate_func("x", ...), can I rely on that function only being called by one caller from init to finalize?

I'm assuming I can? If not, storing the state is tricky, because I need to account for this scenario:

  • x = create_aggregate_func("x", ...)
  • On thread Foo, SELECT x(...) from y runs, and sets a state var defined in the create_aggregate scope to the value of init (like the code I'm starting from does)
  • Before this finishes, thread Bar runs SELECT x(...) from z
  • it sets state to its own init value - but unless we're getting a copy of the create_aggregate function (are we?), this is going to overwrite the state of the previous function
  • Foo finishes, and returns an invalid value because its state got overwritten

I would try to answer this for myself but the web platform is a really tricky runtime so I'm asking in case you know off hand?

@llimllib
Copy link
Contributor Author

llimllib commented Sep 6, 2022

The way that proper sqlite3 extensions do it is by hanging their state off the execution context like so.

Currently, sqlite3_aggregate_context isn't exposed and using it seems like it would be tricky, but doable

@llimllib
Copy link
Contributor Author

llimllib commented Sep 6, 2022

Can we make the step function return the new state, like the standard reduce?

I don't think so? We're not the ones calling the step function, that's sqlite doing it

Can we make init and finalize optional

We can make init optional (non-existant actually) if we can figure out how to store state on the sqlite context between steps - otherwise callers will need to have a place to accumulate state between steps

We can't make finalize optional without getting pretty aggressive in overriding sqlite, because otherwise the aggregate function has no return value - finalize is where we call setFunctionResult with the reduced value of the accumulated state.

(Maybe I'm not being clever enough and there is a way for both!)

Basically it seems that the sqlite extension pattern of 'allocate
a struct and stick it in the context pointer' is not going to work
for us here. I wonder if using the id of the pointer returned by
sqlite3_aggregate_context would be enough? Since no two functions
could use the same pointer, per https://www.sqlite.org/c3ref/aggregate_context.html ?
@llimllib
Copy link
Contributor Author

llimllib commented Sep 6, 2022

I managed to make init optional at least, and I think the solution I gave here to use the value of the pointer returned by sqlite3_aggregate_context is clever at least?

This branch is WIP status, and I apologize for the noise on it.

I will look into getting it down to a single function argument tomorrow.

@lovasoa
Copy link
Member

lovasoa commented Sep 6, 2022

About defaults for init and finalize: Can't we just have create_aggregate_func start like that :

function create_aggregate_func<T>(
   name: string,
   methods: {
       init?: () => T,
       step: (state: T, value: any) => T,
       finalize?: (state: T) => any 
   }
) {
    let init = methods.init || (() => null);
    let step = (state: T, value: any) => { 
        state = methods.step(state, value);
        __update_sqlite_internal_state(state);
    };
    let finalize = methods.finalize || (state => state);
    ...
}

About thread-safety: there is no such thing as multi-threading with concurrent access in javascript. The javascript runtime guarantees that there is a single execution thread that can access mutable javascript objects. That said, even within a single thread, there can be multiple aggregate functions that are at various stages of there execution simultaneously. The following has to work:

SELECT js_agg_fn(x),  js_agg_fn(y) FROM t;

A test case like this one should probably be added to the tests too.

@llimllib
Copy link
Contributor Author

llimllib commented Sep 6, 2022

Having to return values from step makes it more awkward in the very common "accumulate values" case. Here's json_agg with returning values (and a default state of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state = (state || []); state.push(val); return state; },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

And with modifying state (and with state defaulting to [] instead of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state.push(val); },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

I think we might even want to make step default to pushing the values into an array? That way many functions could be written as just a finalizer - you get the list of values and pull the median, or serialize them, or calculate the standard deviation.

sum could be implemented this way, but it would be very memory inefficient, so it's important to give functions the ability at least to avoid creating an array of all values.

(I also wonder if there's efficency trickery we could do to accumulate values in a typed array?)

edit: standard deviation is another example of an aggregation that can be completed without accumulating all values

@lovasoa
Copy link
Member

lovasoa commented Sep 6, 2022

Having to return values from step makes it more awkward in the very common "accumulate values" case. Here's json_agg with returning values (and a default state of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state = (state || []); state.push(val); return state; },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

No, that's ugly, you wouldn't write it like that ! 🙂 This is a very common pattern in javascript, in many state-management libraries. You would write it as (state, val) => [...state, val]

And with modifying state (and with state defaulting to [] instead of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state.push(val); },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

I think we might even want to make step default to pushing the values into an array? That way many functions could be written as just a finalizer - you get the list of values and pull the median, or serialize them, or calculate the standard deviation.

sum could be implemented this way, but it would be very memory inefficient, so it's important to give functions the ability at least to avoid creating an array of all values.

(I also wonder if there's efficency trickery we could do to accumulate values in a typed array?)

edit: standard deviation is another example of an aggregation that can be completed without accumulating all values

Let's just stick to what developers are used to. Reduce (sometimes called fold) is a very common pattern, and users will appreciate it if you don't try to reinvent the wheel here. undefined as the default initial value and no default reducer.

@lovasoa
Copy link
Member

lovasoa commented Sep 6, 2022

Can we also add a test with

  • an init function that throws
  • a step function that throws
  • a finalize function that throws
  • many redefinitions of the same function in a loop

@llimllib
Copy link
Contributor Author

llimllib commented Sep 7, 2022

init is a function in C because you have to allocate memory, but in almost every other reduce interface, it's provided as a value, so I changed the signature of the function to be create_aggregate(name, initial_value, aggregateFunctions) and eliminated the init function from the interface entirely.

@llimllib
Copy link
Contributor Author

llimllib commented Sep 7, 2022

You can see from the changes in ac548d4 how much that simplifies the step functions

@llimllib
Copy link
Contributor Author

llimllib commented Sep 7, 2022

OK, I think this is ready for re-review now. Thanks for working through this with me!

src/api.js Outdated Show resolved Hide resolved
src/api.js Outdated Show resolved Hide resolved
test/test_aggregate_functions.js Outdated Show resolved Hide resolved
Copy link
Member

@lovasoa lovasoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me ! Can you just fix the documentation, I'll read it one last time, and we'll merge :)

README.md Outdated Show resolved Hide resolved
llimllib and others added 2 commits September 7, 2022 12:39
Co-authored-by: Ophir LOJKINE <contact@ophir.dev>
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@lovasoa lovasoa merged commit 5e5b063 into sql-js:master Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for creating SQLite aggregate functions
2 participants