feat(all): enable passing in-memory data to create_table #9251

Merged: 24 commits, May 29, 2024

@gforsyth gforsyth commented May 24, 2024

This PR adds/codifies support for passing in-memory data to create_table.

The default behavior for most backends is to first create a memtable with
whatever obj is passed to create_table, then we create a table based on that
memtable -- because of this, semantics around temp tables and
catalog.database locations are handled correctly.

After the new table (that the user has provided a name for) is created, we
drop the intermediate memtable so we don't add two tables for every in-memory
object passed to create_table.
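
The stage, create, and drop flow described above can be illustrated with a toy model. This is a sketch using the standard library's `sqlite3`, not ibis's implementation; the function `create_table_from_rows` and the `_memtable_` naming scheme are hypothetical.

```python
import sqlite3
import uuid


def create_table_from_rows(con, name, columns, rows, overwrite=False):
    """Toy model of the stage -> CTAS -> drop flow (not ibis's code)."""
    # Hypothetical naming scheme for the intermediate table.
    staging = f"_memtable_{uuid.uuid4().hex}"
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)

    # 1. Stage the in-memory rows in an intermediate temp table.
    con.execute(f'CREATE TEMP TABLE "{staging}" ({col_list})')
    con.executemany(f'INSERT INTO "{staging}" VALUES ({placeholders})', rows)
    try:
        # 2. Create the user-named table from the staged data.
        if overwrite:
            con.execute(f'DROP TABLE IF EXISTS "{name}"')
        con.execute(f'CREATE TABLE "{name}" AS SELECT * FROM "{staging}"')
    finally:
        # 3. Drop the intermediate table so only one table remains.
        con.execute(f'DROP TABLE "{staging}"')


con = sqlite3.connect(":memory:")
create_table_from_rows(con, "users", ["id", "name"], [(1, "a"), (2, "b")])
```

Because the final table is created by an ordinary `CREATE TABLE ... AS SELECT`, options such as overwrite or the target location only need to be handled in that one statement.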

Currently most backends fail when passed RecordBatchReaders, or a single
RecordBatch, or a pyarrow.Dataset -- if we add support for these to
memtable, all of those backends would start working, so I've marked those
xfails as notimpl for now.

A few backends don't work this way:

polars reads the object in directly using its fast path for local in-memory data.

datafusion uses a fast-path read, then creates a table from the table that is
created by the fast-path -- this is because the datafusion dataframe API has
no way to specify things like overwrite, or table location, but the CTAS from
already present tables is very quick (and possibly zero-copy?) so no issue
there.

duckdb has a refactored read_in_memory (which we should deprecate), but it
isn't fully hooked up inside create_table yet, so some paths may still go
through memtable creation. Memtable creation on DuckDB is especially fast,
though, and I'm all for fixing this up eventually.

pyspark works with the intermediate memtable -- there are possibly
fast-paths available, but they aren't currently implemented.

pandas and dask have a custom _convert_object path.

TODO:

  • Flink (Flink can't create tables from in-memory data?)
  • Impala
  • BigQuery
  • Remove read_in_memory from datafusion and polars

Resolves #6593
xref #8863

Signed-off-by: Gil Forsyth <gil@forsyth.dev>

  • refactor(duckdb): add polars df as option, move test to backend suite
  • feat(polars): enable passing in-memory data to create_table
  • feat(datafusion): enable passing in-memory data to create_table
  • feat(datafusion): use info_schema for list_tables
  • feat(duckdb): enable passing in-memory data to create_table
  • feat(postgres): allow passing in-memory data to create_table
  • feat(trino): allow passing in-memory data to create_table
  • feat(mysql): allow passing in-memory data to create_table
  • feat(mssql): allow passing in-memory data to create_table
  • feat(exasol): allow passing in-memory data to create_table
  • feat(risingwave): allow passing in-memory data to create_table
  • feat(sqlite): allow passing in-memory data to create_table
  • feat(clickhouse): enable passing in-memory data to create_table
  • feat(oracle): enable passing in-memory data to create_table
  • feat(snowflake): allow passing in-memory data to create_table
  • feat(pyspark): enable passing in-memory data to create_table
  • feat(pandas,dask): allow passing in-memory data to create_table

@cpcloud cpcloud left a comment

Nittiest of nits. LGTM overall!

ibis/backends/datafusion/__init__.py (two review threads, outdated and resolved)
@cpcloud cpcloud added this to the 9.1 milestone May 25, 2024


@lazy_singledispatch
def _read_in_memory(

Ideally this could be ibis.memtable()


Yeah, and that would unify the implementations across the backends, too. I'll open a follow-up to make use of the lazy single-dispatching for memtable insertion for the in-process backends.
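
For readers unfamiliar with the idea, lazy single dispatch can be illustrated with a toy dispatcher that registers handlers by dotted type name, so an optional library is never imported just to register its handler. This is a sketch, not ibis's `lazy_singledispatch`; the `LazyDispatcher` class and the handler names are hypothetical, and the dotted path used for the polars DataFrame is an assumption about where polars defines that class.

```python
from collections import OrderedDict


class LazyDispatcher:
    """Toy dispatcher keyed on "module.QualName" strings (hypothetical)."""

    def __init__(self, default):
        self._default = default
        self._handlers = {}

    def register(self, dotted_name):
        def decorator(func):
            self._handlers[dotted_name] = func
            return func

        return decorator

    def __call__(self, obj, *args, **kwargs):
        # Walk the MRO so subclasses reuse a handler registered for a parent.
        for cls in type(obj).__mro__:
            key = f"{cls.__module__}.{cls.__qualname__}"
            if key in self._handlers:
                return self._handlers[key](obj, *args, **kwargs)
        return self._default(obj, *args, **kwargs)


def _unsupported(obj):
    raise NotImplementedError(f"no in-memory reader for {type(obj).__name__}")


read_in_memory = LazyDispatcher(_unsupported)


@read_in_memory.register("builtins.dict")
def _from_dict(obj):
    # A stdlib type stands in for a real table-like object here.
    return ("dict", sorted(obj))


# Registered by name only: polars is not imported unless a polars
# DataFrame is actually passed in. The dotted path is an assumption.
@read_in_memory.register("polars.dataframe.frame.DataFrame")
def _from_polars(obj):
    return ("polars", obj.to_arrow())


result = read_in_memory(OrderedDict(b=1, a=2))
```

Here `OrderedDict` is handled by the `dict` handler through its MRO, while unregistered types fall through to the default and raise `NotImplementedError`.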

@gforsyth

Question: would folks rather I add polars as an extra to many CI jobs to test polars inputs to create_table or put an importorskip around it?
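
The second option, a skip guard around the polars import, could look roughly like the following. `pytest.importorskip` is the helper the question refers to; this sketch uses a stdlib `unittest` analogue instead so that it runs without pytest, and the `import_or_skip` helper and test names are hypothetical.

```python
import importlib
import unittest


def import_or_skip(module_name):
    """Import a module, or skip the calling test if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise unittest.SkipTest(f"{module_name} is not installed")


class CreateTableFromPolars(unittest.TestCase):
    def test_polars_input(self):
        # Skipped (not failed) on CI jobs that do not install polars.
        pl = import_or_skip("polars")
        df = pl.DataFrame({"id": [1, 2, 3]})
        # A real test would pass `df` to create_table; this only checks
        # that the input was built.
        self.assertEqual(df.height, 3)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CreateTableFromPolars)
result = unittest.TestResult()
suite.run(result)
```

The trade-off is that a skip guard keeps every job green but only exercises polars inputs where polars happens to be installed.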

gforsyth and others added 22 commits May 28, 2024 11:21
supports pandas, polars, and pyarrow tablelikes
This means you can actually select a database with `list_tables`
I don't know that we can unregister `_clean_up_tmp_table` for specific
tables, so Exasol might throw some atexit errors (which are ignored) at
shutdown, because it's attempting to drop tables that have already been
dropped (also not sure why Exasol complains about this with
`force=True`).

Still, I think it's better to not pollute the table-space with copies of
memtables for every table we create.
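
The double-drop situation described above can be sketched as follows. This is a toy model using `sqlite3` and the standard `atexit` module, not the Exasol backend's code; `drop_if_exists` and the table name are hypothetical. Note that `atexit.unregister` removes every registration of a function, so it cannot target a single table when one function is registered many times.

```python
import atexit
import sqlite3

con = sqlite3.connect(":memory:")


def drop_if_exists(con, name):
    """Cleanup hook that tolerates tables that were already dropped."""
    con.execute(f'DROP TABLE IF EXISTS "{name}"')


con.execute('CREATE TEMP TABLE "_memtable_tmp" (id)')

# Register the shutdown hook when the intermediate table is created...
atexit.register(drop_if_exists, con, "_memtable_tmp")

# ...then drop it eagerly once the real table exists.
drop_if_exists(con, "_memtable_tmp")

# At interpreter exit the hook runs again; with IF EXISTS it is a no-op.
remaining = list(
    con.execute("SELECT name FROM sqlite_temp_master WHERE type='table'")
)
```

Whether a second drop is silent depends on the database; in this sketch `IF EXISTS` makes it harmless.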
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
This branch initially started with my adding `read_in_memory` everywhere
before we settled on making this functionality part of `create_table`
instead.  This hasn't landed in a release, so I'm removing it.
@gforsyth gforsyth force-pushed the ibis-create-table-in-memory branch from a5d34a3 to 1bfbdb8 Compare May 28, 2024 15:25
@gforsyth gforsyth marked this pull request as ready for review May 28, 2024 15:26
@gforsyth gforsyth force-pushed the ibis-create-table-in-memory branch from 1bfbdb8 to 4e45c10 Compare May 28, 2024 16:08

cpcloud commented May 28, 2024

@gforsyth Can we add polars to one or two backends instead of all of them?

And we also have it installed already in the postgres torch build and
DuckDB.
@gforsyth gforsyth force-pushed the ibis-create-table-in-memory branch from 1c5abe5 to 9a3a698 Compare May 28, 2024 22:00
@gforsyth

> @gforsyth Can we add polars to one or two backends instead of all of them?

Yep -- added it explicitly to MySQL, MSSQL, and Oracle. And we already have it available on the DuckDB jobs, and the Postgres torch job (and obviously on the polars jobs) -- seems like reasonably good coverage?

@gforsyth gforsyth merged commit fa15c7d into ibis-project:main May 29, 2024
74 checks passed
@gforsyth gforsyth deleted the ibis-create-table-in-memory branch May 29, 2024 01:06
Successfully merging this pull request may close these issues.

feat: allow registering all in-memory table types via create_table
3 participants