chore: check builds for the-epic-split after rebase #8178

Closed
wants to merge 157 commits into from

Conversation

kszucs
Member

@kszucs kszucs commented Feb 1, 2024

  • feat(duckdb-geospatial): enable use of literals
  • test(duckdb-geospatial): add test for literals support
  • chore: fix commit hash for git-blame-ignore-rev
  • feat(flink): add new temporal operators
  • feat(flink): implement struct field, clean up literal, and adjust timecontext test markers (#7997)
  • ci: wire up new cache via setup-python
  • ci: use new ci-doctest to run doctests in the poetry environment
  • ci: properly install sqlalchemy2 deps; avoid relocking as a dependency
  • test(duckdb-geospatial): fix incomplete skipping
  • fix(api): deferred values are not truthy
  • docs(meta): add goatcounter to header of all quarto pages
  • docs(blog): redux array blog with equivalent duckdb and bq expressions
  • fix(mutate/select): ensure that unsplatted dictionaries work in mutate and select APIs (#8014)
  • fix(docs): surround executable code blocks with interactive mode on/off
  • chore(flake/nixpkgs): cc3ab0e4 -> 0799f514
  • chore(flake/nixpkgs): 0799f514 -> b4ee3c3c
  • test(flink): deep dive on the tests marked for Flink in test_json.py (#7908)
  • fix(deps): update dependency pyarrow to v15
  • chore: bump requirements-dev.txt
  • fix(duckdb): ensure that casting to floating point values produces valid types in generated sql
  • fix(datatypes): ensure that array construction supports literals and infers their shape from its inputs (#8049)
  • feat(examples): add zones geojson example (#8040)
  • docs: fix rolling date on bigquery/duckdb array blog (#8059)
  • docs: blog for the 1 billion row challenge (#8004)
  • docs: add quotes around install in 1brc post (#8065)
  • chore(deps): update bitnami/minio docker tag to v2024.1.18
  • docs(pandas): fix format for kwarg warning callout
  • docs(blog): show how to install geospatial dependencies
  • chore(flake/nixpkgs): b4ee3c3c -> 15ff1758
  • docs: ibis-analytics blog post (#7990)
  • chore(deps): update flink docker tag to v1.18.1 (#8084)
  • docs(blog): update geospatial - no need to_array()
  • fix(pandas): support non-string categorical columns
  • ci(docker): yes -> "yes" to avoid yaml parsing weirdness
  • fix(examples): use anonymous access when reading example data from GCS
  • docs: document possible range of seed values to Table.sample
  • chore(deps): update peter-evans/create-issue-from-file action to v5
  • chore(deps): update peter-evans/create-or-update-comment action to v4
  • refactor(flink): expose raw_sql over _exec_sql
  • chore(deps): update peter-evans/find-comment action to v3
  • docs(style): add style guide to contributing (#8092)
  • chore(deps): update trinodb/trino docker tag to v437
  • chore(flake/nixpkgs): 15ff1758 -> 7ac72b3e
  • chore: escape "." in re_split() docstring
  • fix(deps): update dependency black to v24
  • chore(deps): relock
  • fix(polars): avoid using unnecessary subquery for schema inference
  • build(docker): clean up minio setup
  • build(docker): remove extraneous druid docker volumes
  • chore(deps): update dependency pytest to v8
  • test(postgres): remove usefixtures mark from fixture
  • test(tpch): remove _fixtureinfo.argnames hack
  • test: fix and enable hypothesis profiles
  • ci: run setuptools check tests in parallel
  • ci: cache based on requirements-dev.txt for setuptools ci check
  • chore(deps): remove undefined exasol extra from sqlalchemy-exasol dependency
  • feat(flink): export result to pyarrow
  • chore: handle the no-batches case in to_pyarrow()
  • feat(risingwave): init impl for Risingwave (#7954)
  • build(docker): simplify risingwave docker setup (#8126)
  • docs(random): document behavior of repeated use of ibis.random() instance
  • test(doctest): add range builtin to doctest namespace
  • feat(mssql): add hashbytes and test for binary output hash fns (#8107)
  • chore: disable renovate dependency dashboard (#8139)
  • test(risingwave): add notimpl markers for new hexdigest tests (#8140)
  • fix(sqlglot): stop using removed singletons for true, false, null
  • chore(flake/nixpkgs): 7ac72b3e -> d3c09ae0
  • chore(deps): update clickhouse/clickhouse-server docker tag to v24
  • build(deps): bump aiohttp from 3.9.1 to 3.9.2
  • chore(deps): relock
  • chore(deps): update bitnami/minio docker tag to v2024.1.29
  • revert: chore(deps): update clickhouse/clickhouse-server docker tag to v24
  • docs: kedro blog post link (#8150)
  • fix(repr): force exception message to console in IPython in interactive mode
  • docs(geospatial): add examples for duckdb supported methods (#8128)
  • feat(flink): implement array operators (#7951)
  • test(ir): reorganize ibis/tests/expr to enable running the tests without a functional backend
  • test(sql): move sql tests requiring a functional backend from ibis/tests/sql to ibis/backends/tests/sql
  • test(backends): move backends dependent benchmarks to ibis/backends/tests/
  • test(ir): ensure that no backends are required to run the core tests
  • chore(ci): skip running backend tests on the-epic-split branch
  • chore(ci): change the core testing command since the core marker is completely broken without the backend tests
  • chore(ci): temporarily disable test_doctests job
  • chore(ci): add todo note about restoring the previous ci-check command
  • test(ir): ensure that no backends are required to run the core tests
  • refactor(ir): split the relational operations
  • refactor(ir): wrap JoinChain.first in ops.SelfReference similar to the rest of the join tables
  • test(ir): cover constructing reductions in the core test suite
  • refactor(ir): add JoinTable operation unique to JoinChain instead of using the globally unique SelfReference
  • fix(decompile): ensure that SelfReference is decompiled with a call to .view()
  • refactor(ir): support join of joins while avoiding nesting
  • feat(sql): lower expressions to SQL-like relational operations
  • refactor(duckdb): initial cut of sqlglot DuckDB compiler
  • refactor(duckdb/clickhouse): implement sqlglot backends and re-enable ci
  • feat(datafusion): port to new sqlglot backend
  • refactor(compilers): consolidate StringJoin impl
  • feat(common): add Dispatched base class for convenient visitor pattern implementation
  • refactor(duckdb): remove the need for a specialized _to_geodataframe method
  • fix(duckdb): ensure that create_schema and create_database are actually tested
  • refactor(ir): stricter scalar subquery integrity checks
  • feat(common): add a memory efficient Node.map() implementation
  • fix(common): intermediate result removal fails if there are duplicated dependencies
  • refactor(api): revamp asof join predicates
  • fix(ir): self reference fields were incorrectly dereferenced to the parent relation
  • fix(rewrites): add missing filter arguments for node.replace() calls
  • refactor(snowflake): use sqlglot for the snowflake backend
  • refactor(common): support union types as well as forward references in the dispatch utilities
  • ci(snowflake): enable for the-epic-split branch
  • fix(ir): asof join tolerance parameter should post-filter and post-join instead of adding a predicate
  • feat(duckdb): support asof joins including tolerance parameter
  • ci: remove merge_group (#7899)
  • refactor(pandas): port the pandas backend with an improved execution model (#7797)
  • chore(deps): relock
  • ci: comment out sys-deps step
  • chore: remove duplicate distinct decompile rule
  • ci: install decompiler as extra not as its own dependency (#7901)
  • fix(conversion): convert decimals to the exact precision and scale requested by the input type
  • test(snowflake): fix expected decimal results
  • test(duckdb): relax exact type check in decimal literal assertion
  • refactor(sqlglot): various sqlglot compiler and backend clean ups (#7904)
  • refactor(polars): update the polars backend to use the new relational abstractions (#7868)
  • feat(trino): port to sqlglot
  • fix(polars): force null sorting to match the rest of ibis
  • test(pandas): ignore array size warning
  • refactor(postgres): port to sqlglot (#7877)
  • refactor(mysql): port to sqlglot (#7926)
  • refactor(sqlglot): remove duplicated simple compilation rules and sort
  • chore(deps): bump sqlglot and regen sql
  • fix(duckdb): add flip_coordinates translation to sqlglot duckdb backend
  • fix(snowflake): use _safe_raw_sql for insert implementation
  • fix(mysql): remove not-allowed frame clause from rank window function
  • test(postgres): use DBAPI instead of sqlalchemy apis in timezone test
  • test(postgres): remove test that no longer works
  • test(pandas): use the correct error type when xfailing for compound-sort-key rank
  • refactor(ir): give unbound tables namespaces
  • chore(duckdb/mysql): remove dead code and comment
  • refactor(sqlglot): remove duplicate StringAscii definitions
  • chore(sqlglot): deduplicate pad functions
  • refactor(sqlglot): make anonymous functions easier to use and remove array_func hack
  • test(backends): make null results try_cast test agnostic to nan vs None
  • refactor(sqlglot): use a more backend-agnostic expression for non-finite constants
  • refactor(sqlglot): clean up explode usage
  • refactor(pyspark): reimplement the backend using the new relational operations and Spark SQL
  • feat(pyspark): add support for PySpark 3.5
  • refactor(pyspark): remove sqlalchemy dependency from pyspark
  • chore(deps): bump pyspark to 3.5 in poetry lock file
  • fix(ir): only dereference comparisons not generic binary operations
  • chore: rename to dereference_comparison
  • chore(deps): relock
  • test(generic): clean up try_cast to null test
  • fix(polars): use newer drop API to avoid deprecation warning
  • fix(polars): use newer drop API in asof join implementation
  • refactor(druid): port to sqlglot
  • refactor(impala): port to sqlglot
  • refactor(pandas): simplify pandas helpers
  • chore(impala): remove unused imports
  • refactor(bigquery): port to sqlglot
  • chore: add docstring for null ordering transform
  • chore(bigquery-datatypes): fix type annotations and raise uniform error types for datatype conversion
  • ci(bigquery): install geospatial extra
  • test: remove unused spread_type function
  • test: account for new error type
  • test(bigquery): skip geospatial execution test when geopandas not installed
  • fix(snowflake): handle udf function naming
  • refactor(exasol): port to sqlglot (#8032)
  • ci(exasol): run ci serially (#8042)
  • feat(sql): extract common table expressions
  • chore(sql): regenerate snapshots for clickhouse, duckdb and postgres
  • chore(sql): regenerate snapshots for snowflake
  • chore(sql): regenerate snapshots for bigquery
  • chore(sql): regenerate snapshots for pyspark
  • chore(sql): regenerate snapshots for trino
  • chore(impala): regen snapshots
  • fix(trino): compile property literal values directly instead of going through the pipeline
  • ci(impala): run tests in series
  • chore(exasol): avoid complex websocket callback for inserting memtables
  • test(exasol): account for unordered results in window function tests
  • test(duckdb): test that column name case is preserved when inserting
  • refactor(oracle): port to sqlglot (#8020)
  • chore(deps): remove sqlalchemy dependencies from oracle extra
  • refactor(mssql): port to sqlglot
  • fix(sql): don't generate table aliases for ops.JoinLink
  • chore(impala): regen snapshots
  • test(markers): add tests for custom markers
  • fix(duckdb): allow passing both overwrite and temp to create_table
  • refactor(polars): allow passing temp=False to polars create_table
  • refactor(exasol): add temp kwarg to create_table for api consistency
  • test(backends): add test for overwrite and temp intersection in create_table
  • fix(oracle): enable dropping temporary tables
  • fix(oracle): clean up memtables at exit
  • fix(oracle): allow passing both overwrite and temp to create_table
  • refactor(oracle): simplify oracle timestamp overrides
  • fix(api): forbid using asc/desc in selections
  • fix(api): support passing literal booleans to filter
  • test(api): add union aliasing test
  • fix(polars): reference the correct field in the ops.SelfReference rule
  • test(polars): enable xpassing test
  • test(duckdb): move tests to specific backend test suites
  • test(duckdb): run test in subprocess to avoid setting the default backend
  • refactor(sqlite): port to SQLGlot (#8154)
  • refactor(sql): remove temporary table creation when using inline sql (#8149)
  • refactor(sql): reorganize sqlglot rewrites

@kszucs kszucs changed the base branch from the-epic-split to main February 1, 2024 00:30
@cpcloud
Member

cpcloud commented Feb 1, 2024

The failing nix builds will not pass until the RisingWave sqlglot port (#8171) is merged.

@cpcloud
Member

cpcloud commented Feb 1, 2024

This should be good to go to force push to the-epic-split.

@kszucs kszucs marked this pull request as ready for review February 2, 2024 09:29
kszucs and others added 22 commits February 2, 2024 11:26
Rationale and history
---------------------
In the last couple of years we have been constantly refactoring the
internals to make them easier to work with. Although we have made great
progress, the current codebase is still hard to maintain and extend.
One example of that complexity is the attempt to remove the `Projector`
class in ibis-project#7430. I came to realize that we are unable to
improve the internals in small incremental steps; we need to make a big
leap forward to make the codebase maintainable in the long run.

One of the hotspots of problems is the `analysis.py` module which tries
to bridge the gap between the user-facing API and the internal
representation. Part of its complexity is caused by loose integrity
checks in the internal representation, allowing various ways to
represent the same operation. This makes it hard to inspect, reason
about and optimize the relational operations. It also makes the
backends much harder to implement, since more branching is required to
cover all the variations.

We have always been aware of these problems, and we made several
attempts to solve them the same way this PR does. However, we never
managed to actually split the relational operations, because we always
hit roadblocks in maintaining compatibility with the current test suite.
We were unable to even understand those issues because of the
complexity of the codebase and the number of indirections between the
API, analysis functions and the internal representation.

But(!) finally we managed to prototype a new IR in ibis-project#7580 along with
implementations for the majority of the backends, including various SQL
backends and `pandas`. After successfully validating the viability of
the new IR, we split the PR into smaller pieces which can be
individually reviewed. This PR is the first step of that process: it
introduces the new IR and the new API. The next steps will be to
implement the remaining backends on top of the new IR.

Changes in this commit
----------------------
- Split the `ops.Selection` and `ops.Aggregation` nodes into proper
  relational algebra operations.
- Almost entirely remove `analysis.py` with the technical debt
  accumulated over the years.
- More flexible window frame binding: if an unbound analytical function
  is used with a window containing references to a relation then
  `.over()` is now able to bind the window frame to the relation.
- Introduce a new API-level technique to dereference columns to the
  target relation(s).
- Revamp the subquery handling to be more robust and to support more
  use cases with strict validation. We now have `ScalarSubquery`,
  `ExistsSubquery`, and `InSubquery` nodes which can only be used in
  the appropriate context.
- Use much stricter integrity checks for all the relational operations,
  most of the time enforcing that all the value inputs of the node must
  originate from the parent relation the node depends on.
- Introduce a new `JoinChain` operation to represent multiple joins in
  a single operation followed by a projection attached to the same
  relation. This made it possible to solve several outstanding issues
  with the join handling (including the notorious chain join issue).
- Use straightforward rewrite rules collected in `rewrites.py` to
  reinterpret user input so that the new operations can be constructed,
  even with the strict integrity checks.
- Provide a set of simplification rules to reorder and squash the
  relational operations into a more compact form.
- Use mappings to represent projections, eliminating the need to store
  `ops.Alias` nodes internally. In addition, table nodes are no longer
  allowed in projections; their columns are expanded into the same
  mapping, making the semantics clear.
- Uniform handling of the various kinds of inputs for all the API
  methods using a generic `bind()` function.
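
As a rough illustration of two of the points above (projections stored
as mappings, and the check that every value input originates from the
parent relation), here is a minimal toy sketch. The `Table`, `Field`,
`Project`, and `IntegrityError` classes below are hypothetical
stand-ins written for this example; they are not the actual ibis
operation classes and do not reflect their real signatures.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


class IntegrityError(Exception):
    """Raised when a node is constructed from inconsistent inputs (toy)."""


@dataclass(frozen=True)
class Table:
    """Toy relation: a name plus its column names."""
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Field:
    """Toy column reference: a named column of a specific relation."""
    rel: Table
    name: str


@dataclass(frozen=True)
class Project:
    """Toy projection: maps each output name to a value of `parent`.

    The mapping key carries the output name, so no separate alias
    node is needed to rename a column.
    """
    parent: Table
    values: Mapping[str, Field]

    def __post_init__(self):
        # Strict integrity check: every value must come from `parent`.
        for out_name, value in self.values.items():
            if value.rel != self.parent:
                raise IntegrityError(
                    f"{out_name!r} does not originate from {self.parent.name!r}"
                )


t = Table("t", ("a", "b"))
other = Table("other", ("a",))

# Renaming is expressed by the mapping key alone.
proj = Project(t, {"a": Field(t, "a"), "renamed_b": Field(t, "b")})
print(list(proj.values))

# A value bound to a different relation is rejected at construction time.
try:
    Project(t, {"a": Field(other, "a")})
except IntegrityError as exc:
    print("rejected:", exc)
```

In this toy model the validation runs in the constructor, so an invalid
node can never exist; the rewrite rules mentioned above would be the
place where user input gets reshaped until it passes such a check.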

Advantages of the new IR
------------------------
- The operations are much simpler with clear semantics.
- The operations are easier to reason about and to optimize.
- The backends can easily lower the internal representation to a
  backend-specific form before compilation/execution, so the lowered
  form can be easily inspected, debugged, and optimized.
- The API is much closer to the users' mental model, thanks to the
  dereferencing technique.
- The backend implementation can be greatly simplified due to the
  simpler internal representation and strict integrity checks. As an
  example the pandas backend can be slimmed down by 4k lines of code
  while being more robust and easier to maintain.

Disadvantages of the new IR
---------------------------
- The backends must be rewritten to support the new internal
  representation.
… of using the globally unique `SelfReference`

This enables us to maintain join expression equality:
`a.join(b).equals(a.join(b))`

So far we have been using SelfReference to make join tables unique, but
it was globally unique, which broke the equality check above. Therefore
we need to restrict the uniqueness to the scope of the join chain. The
simplest solution is to enumerate the join tables in the join chain,
hence all join participants must now be `ops.JoinTable(rel, index)`
instances.

`ops.SelfReference` is still required to distinguish between two
identical tables at the API level, but it is now decoupled from the
join internal representation.
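
The difference between globally unique references and references
enumerated within a join chain can be sketched with a small toy model.
Everything below is hypothetical and written only for illustration: the
`GloballyUniqueRef` class and both helper functions are invented, and
the `JoinTable` dataclass merely mirrors the `(rel, index)` shape named
above rather than the real ibis class.

```python
import itertools
from dataclasses import dataclass, field

# Global counter: a stand-in for "globally unique" identifiers.
_counter = itertools.count()


@dataclass(frozen=True)
class Table:
    """Toy relation identified by name only."""
    name: str


@dataclass(frozen=True)
class GloballyUniqueRef:
    """Toy version of the old approach: each reference gets a fresh id."""
    rel: Table
    ident: int = field(default_factory=lambda: next(_counter))


@dataclass(frozen=True)
class JoinTable:
    """Toy version of the new approach: unique only within one chain."""
    rel: Table
    index: int


def join_with_global_refs(*tables):
    # Every call draws new identifiers from the global counter.
    return tuple(GloballyUniqueRef(rel) for rel in tables)


def join_with_indices(*tables):
    # Participants are simply enumerated in the order they are joined.
    return tuple(JoinTable(rel, i) for i, rel in enumerate(tables))


a, b = Table("a"), Table("b")

# Globally unique ids make two structurally identical joins unequal.
print(join_with_global_refs(a, b) == join_with_global_refs(a, b))

# Chain-scoped indices keep identical joins equal...
print(join_with_indices(a, b) == join_with_indices(a, b))

# ...while still telling apart both sides of a self-join.
left, right = join_with_indices(a, a)
print(left != right)
```

The design point the sketch tries to capture is that the index only has
to disambiguate participants inside one chain, so building the same
chain twice yields equal values.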
it's alive!

tests run (and fail)

chore(duckdb): naive port of clickhouse compiler

fix(duckdb): hacky fix for output shape

feat(duckdb): bitwise ops (most of them)

feat(duckdb): handle pandas dtype mapping in execute

feat(duckdb): handle decimal types

feat(duckdb): add euler's number

test(duckdb): remove duckdb from alchemycon

feat(duckdb): get _most_ of string ops working

still some failures in re_extract

feat(duckdb): add hash

feat(duckdb): add CAST

feat(duckdb): add cot and strright

chore(duckdb): mark all the targets that still need attention (at least)

feat(duckdb): combine binary bitwise ops

chore(datestuff): some datetime ops

feat(duckdb): add levenshtein, use op.dtype instead of output_dtype

feat(duckdb): add blank list_schemas, use old current_database for now

feat(duckdb): basic interval ops

feat(duckdb): timestamp and temporal ops

feat(duckdb): use pyarrow for fetching execute results

feat(duckdb): handle interval casts, broken for columns

feat(duckdb): shove literal handling up top

feat(duckdb): more timestamp ops

feat(duckdb): back to pandas output in execute

feat(duckdb): timezone handling in cast

feat(duckdb): ms and us epoch timestamp support

chore(duckdb): misc cleanup

feat(duckdb): initial create table

feat(duckdb): add _from_url

feat(duckdb): add read_parquet

feat(duckdb): add persistent cache

fix(duckdb): actually insert data if present in create_table

feat(duckdb): use duckdb API read_parquet

feat(duckdb): add read_csv

This, frustratingly, cannot use the Python API for `read_csv` since that
does not support a list of files, for some reason.

fix(duckdb): dont fully qualify the table names

chore(duckdb): cleanup

chore(duckdb): mark broken test broken

fix(duckdb): fix read_parquet so it works

feat(duckdb): add to_pyarrow, to_pyarrow_batches, sql()

feat(duckdb): null checking

feat(duckdb): translate uints

fix(duckdb): fix file outputs and torch output

fix(duckdb): add rest of integer types

fix(duckdb): ops.InValues

feat(duckdb): use sqlglot expressions (maybe a big mistake)

fix(duckdb): don't stringify strings

feat(duckdb): use sqlglot expr instead of strings for count

fix(duckdb): fix isin

fix(duckdb): fix some agg variance functions

fix(duckdb): for logical equals, use sqlglot not operator

fix(duckdb): struct not tuple for struct type
kszucs and others added 25 commits February 2, 2024 11:27
Co-authored-by: Kexiang Wang <kx.wang@hotmail.com>
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
…s-project#8005)

Reimplementation of the dask backend on top of the new pandas executor.
I had to adjust the pandas backend to support being extended. This way
the new dask implementation turned out to be pretty tidy.

There are a couple of features which are not implemented using proper
dask constructs, but rather have a fallback to local execution using
pandas. The most notable are the window functions. The previous dask
implementation supported just a couple of window cases, but this way we
have full coverage at least.

Thanks to the new pandas base we have a wider feature coverage, see the
removed xfails in the test suite.
@kszucs
Member Author

kszucs commented Feb 2, 2024

Pushed to the-epic-split branch, closing.

@kszucs kszucs closed this Feb 2, 2024
4 participants