Add python serialization API's for ivf-pq and ivf_flat #186

Merged
4 commits merged on Jun 14, 2024

Conversation

benfred
Member

@benfred benfred commented Jun 12, 2024

No description provided.

@benfred benfred requested review from a team as code owners June 12, 2024 22:32
@benfred benfred added the improvement and non-breaking labels Jun 12, 2024
@benfred benfred self-assigned this Jun 12, 2024
Member

@dantegd dantegd left a comment

Just a few comments on a first pass; the PR looks great.

cpp/src/neighbors/ivf_flat_c.cpp (outdated, resolved)
@@ -28,7 +28,7 @@ def get_last_error_text():
     if c_err is NULL:
         return
     cdef bytes err = c_err
-    return err.decode("utf8")
+    return err.decode("utf8", "ignore")
Member

Curious: why was this change needed?

Member Author

That's a great question!

I originally had a bug in the ivf_pq deserialization code that passed some bad data to the mdspan serializer, which then threw an error whose message wasn't valid UTF-8 text.

Specifically, this line https://github.com/rapidsai/raft/blob/b66b269ab6dcda48aef3a6ed9e7f604e99471d72/cpp/include/raft/core/detail/mdspan_numpy_serializer.hpp#L293 was writing out the error message "unrecognized byteorder %c", where the %c was filled from some random data (\x93 in my case) that couldn't be decoded as UTF-8. This led to a Unicode decode error being thrown rather than the actual error.

I think we might want to consider always having error messages that are valid UTF-8 text (for example, by converting that %c format to something like %x). However, I don't think we can guarantee that the exception text from our C++ layer is valid UTF-8, and ignoring any conversion errors gives a better error message than failing on the UTF-8 decode.
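
To make the failure mode concrete, here is a minimal pure-Python sketch of the behaviour described above. The byte string is an illustrative stand-in for the C++ error text (only the message and the \x93 byte come from the case above), and decode_error_text is a hypothetical helper name, not the actual cuvs function.

```python
# Illustrative stand-in for the raw error text coming back from the C++ layer:
# the message ends with a stray 0x93 byte, which is not valid UTF-8 on its own.
raw = b"unrecognized byteorder \x93"


def decode_error_text(raw_bytes: bytes) -> str:
    """Hypothetical helper: decode error text, dropping undecodable bytes."""
    return raw_bytes.decode("utf8", "ignore")


# Strict decoding raises, which would hide the real error message.
try:
    raw.decode("utf8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding keeps the readable part of the message.
lenient = decode_error_text(raw)

# The %x idea from above: report the offending byte as hex so the
# message itself is always plain ASCII.
hex_message = "unrecognized byteorder 0x%02x" % raw[-1]

print(strict_failed)  # True
print(repr(lenient))  # 'unrecognized byteorder '
print(hex_message)    # unrecognized byteorder 0x93
```

Dropping the bad byte loses a little information, but the caller still sees the real error instead of a UnicodeDecodeError raised by the error-reporting path itself.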

Member

That makes a lot of sense, and it's great to know! Thanks for the detailed explanation :)

@@ -166,3 +166,45 @@ def test_ivf_pq_search_params(params):
lut_dtype=params["lut"],
internal_distance_dtype=params["idd"],
)


def test_save_load():
Member

A significant portion of this test code will be the same for all indices. I wonder if we should refactor the common code?

Member Author

I've refactored to reduce the duplicate code in the last commit. We now have a test_serialization.py script that tests each of cagra, ivf_flat, and ivf_pq using the same common function.
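
As a rough sketch of what such a shared helper could look like (not the actual contents of test_serialization.py): the function below takes any object exposing build, search, save and load, round-trips the index through a file, and checks that search results are unchanged. The helper name, the four-function interface, and the pickle-backed toy index are all assumptions made for illustration; no cuvs calls are used.

```python
import os
import pickle
import tempfile
import types


def run_save_load_test(index_module, dataset, queries, k=1):
    """Build an index, round-trip it through a file, and compare search results."""
    index = index_module.build(dataset)
    expected = index_module.search(index, queries, k)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.bin")
        index_module.save(path, index)
        reloaded = index_module.load(path)

    actual = index_module.search(reloaded, queries, k)
    assert actual == expected, "search results changed after save/load"
    return actual


# Toy brute-force "index" standing in for cagra / ivf_flat / ivf_pq (hypothetical).
def _build(dataset):
    return [list(row) for row in dataset]


def _search(index, queries, k):
    results = []
    for query in queries:
        # Squared Euclidean distance to every row, paired with the row id.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, query)), i)
            for i, row in enumerate(index)
        )
        results.append([i for _, i in dists[:k]])
    return results


def _save(path, index):
    with open(path, "wb") as f:
        pickle.dump(index, f)


def _load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


toy_index = types.SimpleNamespace(build=_build, search=_search, save=_save, load=_load)

neighbors = run_save_load_test(
    toy_index,
    dataset=[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    queries=[[0.9, 1.2], [4.0, 4.0]],
)
print(neighbors)  # [[1], [2]]
```

With this shape, each index type only needs a thin per-index test that passes its own module (or an adapter around it) to the common function.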

Member

@dantegd dantegd left a comment

PR looks great!

@benfred
Member Author

benfred commented Jun 14, 2024

/merge

@rapids-bot rapids-bot bot merged commit 9dc3a4d into rapidsai:branch-24.08 Jun 14, 2024
57 checks passed
@benfred benfred deleted the ivf_python_serialize branch June 14, 2024 20:13
Labels
cpp, improvement, non-breaking, Python
2 participants