Expression eval should probably return a lazy iterator over engine data #389

Open
nicklan opened this issue Oct 10, 2024 · 0 comments

Much as we allow the reading of a single parquet file to produce multiple EngineDatas in an iterator, we should also allow expression eval on a single EngineData to result in a lazy iterator over multiple EngineDatas.

This requires an API change from:

    fn evaluate(&self, batch: &dyn EngineData) -> DeltaResult<Box<dyn EngineData>>;

to:

    fn evaluate(&self, batch: &dyn EngineData) -> Box<dyn Iterator<Item = DeltaResult<Box<dyn EngineData>>>>;

That's almost the same type as the one returned by read_[json/parquet]_files, which return Box<dyn Iterator<Item = DeltaResult<Box<dyn EngineData>>> + Send>, aliased to FileDataReadResultIterator. So we should also factor out the Send requirement and give the type a better name.

The reason is that some expressions can expand their input data significantly, which could cause OOMs, mess up block sizing, etc. As a concrete example, consider a hypothetical future table feature that supports non-materialized generated columns and/or default values. Given enough such columns, a single block read from parquet could easily expand to several times its original size.
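To make the proposed shape concrete, here is a rough, self-contained sketch. It is not the kernel's actual code: `EngineData`, `DeltaResult`, and `ExpressionEvaluator` below are simplified stand-ins, and the names `EngineDataResultIterator`, `SendEngineDataResultIterator`, `VecData`, `ExpandingEvaluator`, and `without_send` are hypothetical, chosen only for illustration. The toy evaluator repeats each input row `factor` times and yields the expanded output lazily in chunks of at most `max_rows` rows, so the fully expanded data is never held in memory at once.

    // Simplified stand-ins for the kernel types (assumptions for this sketch).
    type DeltaResult<T> = Result<T, String>;

    trait EngineData {
        // Hypothetical accessor; the real EngineData trait looks different.
        fn values(&self) -> &[i64];
    }

    struct VecData(Vec<i64>);

    impl EngineData for VecData {
        fn values(&self) -> &[i64] {
            &self.0
        }
    }

    // Hypothetical alias with the Send requirement factored out.
    type EngineDataResultIterator = Box<dyn Iterator<Item = DeltaResult<Box<dyn EngineData>>>>;

    // Readers that need Send can add it on top of the same item type.
    type SendEngineDataResultIterator =
        Box<dyn Iterator<Item = DeltaResult<Box<dyn EngineData>>> + Send>;

    // A Send iterator can be used wherever the plain one is expected.
    fn without_send(iter: SendEngineDataResultIterator) -> EngineDataResultIterator {
        iter
    }

    trait ExpressionEvaluator {
        // The signature proposed in this issue, written with the alias.
        fn evaluate(&self, batch: &dyn EngineData) -> EngineDataResultIterator;
    }

    // Toy evaluator: repeats every input row `factor` times.
    struct ExpandingEvaluator {
        factor: usize,
        max_rows: usize,
    }

    impl ExpressionEvaluator for ExpandingEvaluator {
        fn evaluate(&self, batch: &dyn EngineData) -> EngineDataResultIterator {
            // Copy the (small) input so the returned iterator owns its state;
            // only the expanded output is produced lazily.
            let input: Vec<i64> = batch.values().to_vec();
            let factor = self.factor;
            let max_rows = self.max_rows.max(1);
            let total = input.len() * factor;
            let mut start = 0usize;
            Box::new(std::iter::from_fn(
                move || -> Option<DeltaResult<Box<dyn EngineData>>> {
                    if start >= total {
                        return None;
                    }
                    let end = (start + max_rows).min(total);
                    // Output row i comes from input row i / factor.
                    let rows: Vec<i64> = (start..end).map(|i| input[i / factor]).collect();
                    start = end;
                    Some(Ok(Box::new(VecData(rows)) as Box<dyn EngineData>))
                },
            ))
        }
    }

A caller would consume the result one chunk at a time, for example `for chunk in evaluator.evaluate(&batch) { let data = chunk?; /* process data */ }`, which bounds peak memory by `max_rows` rather than by the fully expanded size. Copying the input here is just a shortcut to satisfy the `'static` bound implied by the boxed iterator; a real implementation might instead share the input via reference counting or add a lifetime to the alias.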
