[Data] Dataset.unique() raises error in case of any null values #42142
Comments
Hello burton, I'd like to work on this issue! TIA.
hi @Akshi22 , don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a pr to fix this issue. maybe you can help there?
For what it's worth, I just ran into this issue again, only this time in the context of
Hi, is this issue still open?
I believe the right way to fix this is going to require the underlying Merge operations to be PyArrow-based instead of Python-based. The Python path currently uses a heapq iterator, which doesn't compare NaNs well.
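To illustrate the comparison problem mentioned here, the following is a minimal standard-library sketch. It does not use Ray's actual merge code; `try_merge` is a hypothetical helper written only for this illustration, and the assumption is simply that sorted runs are merged with `heapq.merge`.

```python
import heapq


def try_merge(*runs):
    """Merge already-sorted runs with heapq.merge.

    Hypothetical helper for illustration only: returns the merged list,
    or the TypeError raised while comparing elements.
    """
    try:
        return list(heapq.merge(*runs))
    except TypeError as exc:
        return exc


# Runs without nulls merge as expected.
ok = try_merge([1, 3], [2, 4])

# A run containing None fails as soon as None is compared with an int.
bad = try_merge([1, None], [2, 3])

# NaN does not raise, but every ordering comparison involving it is False,
# so a comparison-based merge cannot place it consistently.
nan = float("nan")
nan_is_less = nan < 1.0
nan_is_greater = 1.0 < nan

print(ok)
print(type(bad).__name__)
print(nan_is_less, nan_is_greater)
```

The failure comes from Python's `<` operator rather than from anything specific to the dataset, which is why moving the comparison into a null-aware implementation is one plausible direction.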
…ay-project#48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented.

## Related issue number

This is related to ray-project#42776 and ray-project#42142

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
## Why are these changes needed?

Adds a Sentinel value for making it possible to sort.

Fixes ray-project#42142

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: hjiang <dentinyhao@gmail.com>
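The commit message does not show how the sentinel is implemented, so the following is only a sketch of the general idea in plain Python, not the code from that change. The names `_NullLast`, `NULL_LAST`, and `null_safe_key` are hypothetical, and sorting nulls last is an assumption made for the example.

```python
import heapq


class _NullLast:
    """Hypothetical sentinel that orders after every non-null value."""

    def __lt__(self, other):
        # The sentinel is never smaller than anything.
        return False

    def __gt__(self, other):
        # Greater than any value that is not itself the sentinel.
        return not isinstance(other, _NullLast)

    def __eq__(self, other):
        return isinstance(other, _NullLast)

    def __hash__(self):
        return hash(_NullLast)


NULL_LAST = _NullLast()


def null_safe_key(value):
    """Replace None with the sentinel so that comparisons never fail."""
    return NULL_LAST if value is None else value


# Sorting a column that contains nulls no longer raises a TypeError.
sorted_values = sorted([3, None, 1], key=null_safe_key)

# Merging sorted runs works as well, with nulls kept at the end.
merged = list(heapq.merge([1, None], [2, 3, None], key=null_safe_key))

# Unique values can then be read off the merged, ordered stream.
unique_values = []
for v in merged:
    if not unique_values or unique_values[-1] != v:
        unique_values.append(v)

print(sorted_values)
print(merged)
print(unique_values)
```

Substituting the sentinel only inside the sort key leaves the original values untouched, so `None` still appears in the output.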
What happened + What you expected to happen
I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling `Dataset.unique(colname)` on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a `pandas.Series` works just fine, as does getting unique values via Python built-ins.

Here are two versions of the type error I got, seemingly from the same line of code:
and
Versions / Dependencies
macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.