Proposal: Create dfdb, a new CLI different than datafusion-cli with pre-built integrations #11979
Comments
I had a related problem a couple of weeks ago: I wanted to distribute a CLI that had some additional UDFs bundled with it. I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb |
That would be awesome. Another thing that comes to mind might be for you to fork this datafusion-cli code and bundle your UDFs with it. Or we could publish most of the code for this CLI to crates.io with the idea that people could make their own CLI mashups with a little bit of configuration setup. The ideas are endless! |
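To make the "bundle your UDFs at build time" idea concrete, here is a minimal sketch of compile-time registration. Everything in it is an assumption for illustration: `CliBuilder`, `Cli`, `with_function` and the `ScalarFn` type are hypothetical names, not part of datafusion-cli, and a real DataFusion UDF works on Arrow arrays rather than a single `f64`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a scalar UDF: one f64 in, one f64 out.
/// (Assumption: simplified on purpose; not the DataFusion UDF API.)
type ScalarFn = fn(f64) -> f64;

/// Hypothetical builder a custom CLI distribution could use to bundle extras.
#[derive(Default)]
struct CliBuilder {
    functions: HashMap<String, ScalarFn>,
}

impl CliBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Bundle an extra function under a case-insensitive name.
    /// A later registration with the same name replaces the earlier one.
    fn with_function(mut self, name: &str, f: ScalarFn) -> Self {
        self.functions.insert(name.to_lowercase(), f);
        self
    }

    fn build(self) -> Cli {
        Cli { functions: self.functions }
    }
}

/// The assembled CLI, holding whatever functions were compiled in.
struct Cli {
    functions: HashMap<String, ScalarFn>,
}

impl Cli {
    /// Look up a bundled function by name and apply it to one argument.
    fn call(&self, name: &str, arg: f64) -> Option<f64> {
        self.functions.get(&name.to_lowercase()).map(|f| f(arg))
    }
}

fn double(x: f64) -> f64 {
    x * 2.0
}

fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    // A "mashup" binary picks its extras at compile time, with no dynamic loading.
    let cli = CliBuilder::new()
        .with_function("double", double)
        .with_function("SQUARE", square)
        .build();

    println!("{:?}", cli.call("Double", 21.0));
    println!("{:?}", cli.call("missing", 1.0));
}
```

Because the extras are linked in when the binary is built, this shape needs no stable ABI; the cost is that every distribution is its own build.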
Love this idea. I had been working on something similar in the past but unfortunately life got in the way so I wasn't able to push it as far as I wanted. I am currently building a new terminal app with DataFusion at the core, but it is more domain specific than general purpose like I think the new proposed tool is expected to be (and if I wasn't working on that, I would probably volunteer to push this forward). That being said, and depending on the direction of this new tool, there could be some potential overlap / common functionality with what I'm working on. If that's the case then I would be very happy to collaborate on those pieces. I hope to open source what I'm working on in the fall around the time of v0.1 release. I'll keep my eye on this as it progresses though. |
Making a generic plugin system would be great, but may be a bit of a challenge given the lack of a stable Rust ABI. Making some kind of |
I think this is a great idea. |
I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc). In other words, it could be something like squirrel / dbeaver for analytics. |
I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have. |
Will do some homework. Maybe we only need a datafusion-cli repo that one can fork stand-alone to create a proprietary CLI distribution, and the plugin system is effectively overkill. How should we go about dfdb? Do we think we should have "flavors", maybe like the following?
Or should we aim at having a single "flavor" only? |
I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more. |
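As a rough sketch of what "query engine agnostic" could look like in code, the frontend below only talks to a trait. All names (`QueryBackend`, `Frontend`, `EchoBackend`, `CountingBackend`, `Rows`) are hypothetical, and the toy backends do not run SQL; a real tool would more likely exchange Arrow record batches, for example over Flight SQL or ADBC.

```rust
use std::collections::HashMap;

/// Hypothetical result type: rows of string cells.
/// (Assumption: kept trivial so the sketch has no dependencies.)
type Rows = Vec<Vec<String>>;

/// The only interface the frontend knows about.
trait QueryBackend {
    fn name(&self) -> &str;
    fn execute(&mut self, sql: &str) -> Result<Rows, String>;
}

/// Toy backend that returns the trimmed query text as a single cell.
struct EchoBackend;

impl QueryBackend for EchoBackend {
    fn name(&self) -> &str {
        "echo"
    }

    fn execute(&mut self, sql: &str) -> Result<Rows, String> {
        let sql = sql.trim();
        if sql.is_empty() {
            return Err("empty query".to_string());
        }
        Ok(vec![vec![sql.to_string()]])
    }
}

/// Toy backend that returns how many queries it has received so far.
struct CountingBackend {
    seen: usize,
}

impl QueryBackend for CountingBackend {
    fn name(&self) -> &str {
        "counting"
    }

    fn execute(&mut self, _sql: &str) -> Result<Rows, String> {
        self.seen += 1;
        Ok(vec![vec![self.seen.to_string()]])
    }
}

/// The frontend routes each query to a backend chosen by name.
struct Frontend {
    backends: HashMap<String, Box<dyn QueryBackend>>,
}

impl Frontend {
    fn new() -> Self {
        Frontend { backends: HashMap::new() }
    }

    fn register(&mut self, backend: Box<dyn QueryBackend>) {
        let name = backend.name().to_string();
        self.backends.insert(name, backend);
    }

    fn run(&mut self, backend: &str, sql: &str) -> Result<Rows, String> {
        self.backends
            .get_mut(backend)
            .ok_or_else(|| format!("unknown backend: {}", backend))?
            .execute(sql)
    }
}

fn main() {
    let mut frontend = Frontend::new();
    frontend.register(Box::new(EchoBackend));
    frontend.register(Box::new(CountingBackend { seen: 0 }));

    println!("{:?}", frontend.run("echo", "SELECT 1"));
    println!("{:?}", frontend.run("counting", "SELECT 1"));
    println!("{:?}", frontend.run("nope", "SELECT 1"));
}
```

History, completion and output formatting would all live in `Frontend`, which is why those features can be shared across engines.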
This sounds like "Ibis" for CLIs and I think is a neat idea. However, I am personally not likely to work on such a thing -- I am far more interested in showing off / using the DataFusion execution engine more broadly, in part as I think it will drive more use and thus more contributions back to DataFusion. So therefore I am not opposed to creating some sort of pluggable backend query engine system, but I think I would like to focus on the DataFusion one. I am thinking what I will try to do if no one beats me to it is to sketch out what a
|
Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client. That sounds interesting, but I agree with @alamb that that's a different question. I like the broad idea here, while I like Where I disagree (I think) with @alamb is around what it lacks — my main problem with
There's one more thing that I've wanted from a database CLI many times and nothing (AFAIK) supports:
Anyway, that's my wish list, as you might guess I've thought about this a fair bit! One last thing I'll say: virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it) and make it pluggable into any query engine that speaks Arrow Flight SQL, whether that be over a network, or just as an API within a single binary or between dynamically linked libraries. |
this seems to be the goal of the https://github.com/xo/usql project. |
indeed, if the goal is to make a However, as my personal goal is not (at least not yet 😆 ) to make a |
Another possibility might be to leverage ADBC (which has a flight SQL driver). DuckDB implements the ADBC interface, so between that and the other available drivers for ADBC, it might be more beneficial than just using flight SQL? |
Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts. |
I agree -- @matthewmturner what do you think about rekindling the 🔥 around I am going to be on vacation next week so may have some time to play around with this idea -- perhaps I can make some PRs to that repo? |
I would definitely be happy to have work pick back up there. I do believe that its original goal is generally aligned with the above comments / wish list. Based on my current work on the other TUI I am working on, which has similar functionality, I think there's some cleanup that will need to be done (updating dependencies and perhaps refactoring the event loop / handler). I would be happy to drive the efforts on that to get everything back up to date (as well as clean up the issue backlog). Separately, I have some code in my current project that helps converting record batches to a One other note, I had initially made @alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and, to the extent you or anyone else has time, if we could get datafusion / arrow / object store updated (it's still on the old datafusion-contrib object store). I'm currently on vacation but would be able to start on this next week if this is ultimately the direction that we decide to go. |
Sounds good - I'll try and update the dependencies as a way to get familiar with the project. This is going to be great. I spent some time poking around with it and it has some quite cool UI -- I think what I am most passionate about is setting up the integrations with iceberg-rs, s3, delta-rs, etc. I am not very good at UIs / interactions. As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me. |
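For the "run a file of sql commands" use case, the first step is splitting the script into statements. Below is a minimal sketch of such a splitter. It is not how dft or datafusion-cli do it (a real tool would lean on a SQL parser); `split_statements` and `push_statement` are hypothetical helpers that only understand single-quoted strings and `--` line comments.

```rust
/// Split a SQL script into statements on `;`, ignoring semicolons inside
/// single-quoted strings and skipping `--` line comments.
/// Assumption: simplified on purpose; no block comments, dollar quoting,
/// or quoted identifiers.
fn split_statements(script: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut in_string = false;
    let mut chars = script.chars().peekable();

    while let Some(c) = chars.next() {
        if in_string {
            current.push(c);
            if c == '\'' {
                if chars.peek() == Some(&'\'') {
                    // A doubled quote ('') is an escaped quote, not the end.
                    current.push(chars.next().unwrap());
                } else {
                    in_string = false;
                }
            }
        } else if c == '\'' {
            in_string = true;
            current.push(c);
        } else if c == '-' && chars.peek() == Some(&'-') {
            // Line comment: drop everything up to, but not including, the newline.
            while let Some(&next) = chars.peek() {
                if next == '\n' {
                    break;
                }
                chars.next();
            }
        } else if c == ';' {
            push_statement(&mut statements, &mut current);
        } else {
            current.push(c);
        }
    }

    // A final statement without a trailing semicolon still counts.
    push_statement(&mut statements, &mut current);
    statements
}

/// Store the trimmed statement if it is non-empty, then reset the buffer.
fn push_statement(statements: &mut Vec<String>, current: &mut String) {
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        statements.push(trimmed.to_string());
    }
    current.clear();
}

fn main() {
    let script = "CREATE TABLE t AS SELECT 'a;b' AS x; -- setup; done\nSELECT x FROM t;\nSELECT 'it''s'";
    for (i, statement) in split_statements(script).iter().enumerate() {
        // A real runner would hand each statement to the query engine here.
        println!("[{}] {}", i + 1, statement);
    }
}
```

Running statements one at a time keeps error reporting simple: the runner can say which statement in the file failed.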
Yes, it can already do that. I am definitely also passionate about the integrations point (I had some ideas for that in mind from the beginning, such as delta, flightsql, and excel). |
I'd like to highlight the idea of having table providers in |
I agree this sounds like the ideal setup to me -- having them in a separate repo would help keep the boundaries clear as well |
@alamb When did you plan on starting to work on this? On my flight home I managed to get a good chunk through a clean rewrite leveraging the setup from my other app, which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done. I also may try to pause my other work for a few weeks to focus on dft to see if it can get in a decent shape for the upcoming NYC meetup. |
This is the branch with the rewrite. |
I keep telling myself "tomorrow" but then I end up getting carried away reviewing all the other good stuff going on (e.g. #12044 #12095 etc 🤣 ). This week I am on vacation, so I have a bit more time for some fun projects (at least that is what I am telling myself)
Nice! I'll go check it out now |
@alamb I still need to add back the query history / context info and I am improving the ergonomics of navigating query results now. |
For anyone interested, I have been on a bit of a fast and furious dev sprint making updates to datafusion-tui. I've basically done a clean rewrite to modernize it and I've gotten the following features to work
I plan on adding one more feature (new tab for storing query execution results / stats) and then will pause feature development to focus on cleanup / usability improvements / testing / docs / etc. If anyone wants to help test it would be appreciated, else I'll just keep plugging away :) If you want all those features you can run |
Somewhat related to a great CLI experience, "leading FROM" duckdb syntax would make any autocomplete much more useful — apache/datafusion-sqlparser-rs#1400. |
@samuelcolvin autocomplete and syntax highlighting are features I would like to support, but I don't expect to get around to them in the short term |
I think |
I think dft is pretty close, so I am claiming this is done |
TLDR
datafusion-cli in the apache/datafusion repo
dfdb (or datafusion-cli++ or dfcli), which is purposely designed for running queries against a wide variety of pre-integrated sources
Problem Statement
As of today, datafusion-cli (docs) serves two roles:
duckdb CLI tool
It is really sweet to have a CLI that lets you query a directory of parquet files.
However, similarly to the discussion we have had with datafusion-python, this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience.
Examples of Friction
I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example
I realize I have been partly responsible for this mess and for that I apologize.
Proposal
I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"
We would keep datafusion-cli as it is, a relatively small and thin wrapper around the core engine. I don't think we should remove features, but we also wouldn't add them (other than what was added to the engine by default).
We would add many new features / capabilities to this dfdb tool.
Examples of new features
There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):
@philippemnoel actually refers to the lack of built-in Apache Iceberg support in his blog about switching to using duckdb. This is sad given all the code to use datafusion and delta exists; there just isn't a pre-integrated binary that shows how to hook it up and makes it easy to get up and running.
Other cool features
There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:
CREATE EXTERNAL TABLE definitions in a file somewhere (.open <filename> style)
Open questions
Should the new tool be in the datafusion-contrib organization or the apache organization?
The tradeoffs are that datafusion-contrib could move faster / has less governance overhead, but would also lose the apache community.
I personally suggest we start with this tool in the datafusion-contrib organization, and if there is interest we can discuss bringing it back to the apache organization.