-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(all): enable passing in-memory data to create_table #9251
feat(all): enable passing in-memory data to create_table #9251
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitiest of nits. LGTM overall!
|
||
|
||
@lazy_singledispatch | ||
def _read_in_memory( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this could be ibis.memtable()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and that would unify the implementations across the backends, too. I'll open an follow-up to make use of the lazy single-dispatching for memtable insertion for the in-process backends.
Question: would folks rather I add |
Signed-off-by: Gil Forsyth <gil@forsyth.dev>
supports pandas, polars, and pyarrow tablelikes
This means you can actually select a database with `list_tables`
I don't know that we can unregister `_clean_up_tmp_table` for specific tables, so Exasol might throw some atexit errors (which are ignored) at shutdown, because it's attempting to drop tables that have already been dropped (also not sure why Exasol complains about this with `force=True`). Still, I think it's better to not pollute the table-space with copies of memtables for every table we create.
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
This branch initially started with my adding `read_in_memory` everywhere before we settled on making this functionality part of `create_table` instead. This hasn't landed in a release, so I'm removing it.
a5d34a3
to
1bfbdb8
Compare
1bfbdb8
to
4e45c10
Compare
@gforsyth Can we add polars to one or two backends instead of all of them? |
And we also have it installed already in the postgres torch build and DuckDB.
1c5abe5
to
9a3a698
Compare
Yep -- added it explicitly to MySQL, MSSQL, and Oracle. And we already have it available on the DuckDB jobs, and the Postgres torch job (and obviously on the polars jobs) -- seems like reasonably good coverage? |
This PR adds/codifies support for passing in-memory data to
create_table
.The default behavior for most backends is to first create a
memtable
withwhatever
obj
is passed tocreate_table
, then we create a table based on thatmemtable
-- because of this, semantics aroundtemp
tables andcatalog.database
locations are handled correctly.After the new table (that the user has provided a name for) is created, we
drop the intermediate
memtable
so we don't add two tables for every in-memoryobject passed to
create_table
.Currently most backends fail when passed
RecordBatchReaders
, or a singleRecordBatch
, or apyarrow.Dataset
-- if we add support for these tomemtable
, all of those backends would start working, so I've marked thosexfails as
notimpl
for now.A few backends don't work this way:
polars
reads in the table directly using their fast-path local-memory reading stuff.datafusion
uses a fast-path read, then creates a table from the table that iscreated by the fast-path -- this is because the
datafusion
dataframe API hasno way to specify things like
overwrite
, or table location, but the CTAS fromalready present tables is very quick (and possibly zero-copy?) so no issue
there.
duckdb
has a refactoredread_in_memory
(which we should deprecate), but itisn't entirely hooked up inside of
create_table
yet, so some paths may go viamemtable
creation, butmemtable
creation on DuckDB is especially fast, soI'm all for fixing this up eventually.
pyspark
works with the intermediatememtable
-- there are possiblyfast-paths available, but they aren't currently implemented.
pandas
anddask
have a custom_convert_object
pathTODO:
[ ] FlinkFlink can't create tables from in-memory data?read_in_memory
from datafusion and polarsResolves #6593
xref #8863
Signed-off-by: Gil Forsyth gil@forsyth.dev