-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792
base: main
Are you sure you want to change the base?
Conversation
Hi @alamb. I am still working on this for completing UTs, and E2Es and fixing bad code styles... Could you please help do a pre-quick review of the ideas behind... I did not find a simple way to implement this other than creating a wrapper |
b797d28
to
4591734
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, can you add a section here ? https://github.com/apache/datafusion/blob/main/docs/source/user-guide/cli/datasources.md
This is now able to be reviewed now. I completed most of the refining of the initial code. |
) | ||
STORED AS parquet | ||
LOCATION "hf://datasets/cais/mmlu/astronomy/"; | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently not working due to:
- The hugging face user access token is case-sensitive.
- a previous change enforces every option value in lower case. https://github.com/apache/datafusion/pull/9723/files.
I will figure out the history to see whether this will be feasible.
I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml) |
Ya, the first time I found this on my timeline on LinkedIn, and am glad to be part of this awesome project. I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations. |
If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 ) |
@@ -0,0 +1,9 @@ | |||
select count(*) from "hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 -- tests too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw duckdb build their organizations and datasets, instead of depending on the random existing ones, to make their CI safe. It would be too early for this PR to do so... It is still in an earlier stage.
Ya. re-implementing the datastore and associated facilities is code-consuming. Do you think the |
Sorry for the delay -- I paln to review this tomorrow |
Sorry, I lost Github connections for a couple of days and just returned. Also Thanks. please take your time. |
This is still on my list, but I am behind in my reviews |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @xinlifoobar
I am sorry for the delay in responding to this PR. This is an amazing piece of software engineering. Very nice 🎩 👌
As you have noted, the challege here is that the hs_store
is non trivial and yet somewhat specialized for HuggingFace. It is a really neat feature but somewhat hard to justify adding to the datafusion-cli in the DataFusion repo.
I feel like there is a tension between making datafusion-cli easy to use with many built in integrations (e.g. hugging face, delta-rs, etc) and keeping the dependencies manageable
What would you think about somehow moving this hugging face integration into another repo and making some version of datafusion-cli that had a bunch of pre-defined integrations?
For example, maybe put it in https://github.com/datafusion-contrib/datafusion-cli-plus or something
That could be like the power user version of datafusion-cli that we could use all the fun table providers (like what is in the connector libraries, etc), delta-rust, iceberg-rust, etc
If you think this is a reasonable idea, I will file a ticket for larger discussion
Hey @alamb, thanks for taking the time to review this PR. Great honor to me on this. I am glad to have a repo like As you may have already observed, the datafusion-cli currently supports the Datafusion SQL interface. Are you considering expanding its capabilities to encompass protocols such as MySQL client, Arrow Flight SQL, and others? This expansion would entail naming it act to a full-fledged server. I've observed similar approaches being implemented in downstream projects like InfluxDB. |
Apologies for missing this PR. I wanted to share that OpenDAL has native support for Huggingface. I'm considering whether integrating with OpenDAL would be beneficial, making it easier, more extensible, and maintainable to cool features like this one (as a response to #12357). |
I think using the connectivity offered by openDAL in a tool such as #11979 would be very interesting. Basically I think it would be valuable to distinguish between the query processing engine in DataFusion and the larger integrated applications that are built with DataFusion (and other components) |
marking as draft as I don't think this is waiting on review -- more like waiting on porting to another repo (like dft) |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
I plan to look into integrating hugging face into dft in the coming weeks. First I will check if it can be done with opendal and if not will try to leverage the implementation in this pr. |
Which issue does this PR close?
Closes #10720
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?