Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: gRPC handler is reraising Exceptions in a way that breaks low-level Python Exception tracking #2221

Open
1 task done
bcalvert-graft opened this issue Aug 8, 2024 · 6 comments
Assignees
Labels
kind/bug Something isn't working

Comments

@bcalvert-graft
Copy link

bcalvert-graft commented Aug 8, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Hello Milvus maintainers,

Thank you in advance for reading my bug-report.

There are three spots in grpc_handler.py (first spot, second spot and third spot) that use the following pattern to reraise an Exception

try:
    <some code>
except Exception as err:
    <do some other stuff>
    raise err from err
...

Respectfully, this raise err from err is not a great way to reraise exceptions; instead, as mentioned in the Python docs it should be

try:
    <some code>
except Exception as err:
    <do some other stuff>
    raise
...

Why is the first one not great? It attaches the err Exception as the __cause__ to itself. As I understand it, this is not the desired semantics for __cause__ and this breaks any tools that recursively unroll chained Exceptions (e.g. for processing stack traces). As a concrete example of such a tool, we actually uncovered this seeming bug in grpc_handler.py when using Milvus as part of a larger Dask computation; specifically, as part of Dask's error-handling, the library uses a utility called collect_causes,

def collect_causes(e: BaseException) -> list[BaseException]:
    causes = []
    while e.__cause__ is not None:
        causes.append(e.__cause__)
        e = e.__cause__
    return causes

When collect_causes encounters an Exception raised from one of the linked spots in grpc_handler.py it triggers an infinite loop of unrolling e.__cause__, consequently leaking memory until the process is killed.

Assuming you agree with the quick fix, I'm happy to submit a PR, but wanted to run it by you first to make sure I'm not missing something.

Cheers,
Brian

Expected Behavior

When handling Exceptions raised by the linked methods in grpc_handler.py, the Exceptions should not be attached as their own __cause__

Steps/Code To Reproduce behavior

In one process, run

In [1]: from milvus_lite.server import Server

In [2]: s = Server('test.db', '127.0.0.1:9999')

In [3]: s.start()
Out[3]: True

In [4]: s._p.wait()

In another process, run

In [1]: import pymilvus

In [2]: client = pymilvus.MilvusClient("http://localhost:9999")
   ...: collection_name = "fake_embeddings"
   ...: if client.has_collection(collection_name=collection_name):
   ...:     client.drop_collection(collection_name=collection_name)
   ...: 

In [3]: vector_dim = 512
   ...: client.create_collection(
   ...:     collection_name=collection_name,
   ...:     vector_field_name="vector",
   ...:     dimension=vector_dim,
   ...:     auto_id=True,
   ...:     enable_dynamic_field=True,
   ...:     metric_type="COSINE",
   ...: )

In [4]: import numpy as np

In [5]: vectors = np.random.normal(loc=0, scale=1, size=(vector_dim + 2,))

In [6]: try:
   ...:     client.insert(collection_name, {"vector": vectors})
   ...: except pymilvus.MilvusException as exc:
   ...:     assert exc.__cause__ is not exc, "Exception is its own cause"

The In [6]: step will throw an AssertionError

Environment details

- No hardware/software conditions of note
- Installed pymilvus with pip (version 2.4.4)
- Can reproduce the error with milvus_lite module, so I don't think server config has an impact?

Anything else?

No response

@bcalvert-graft bcalvert-graft added the kind/bug Something isn't working label Aug 8, 2024
@XuanYang-cn XuanYang-cn self-assigned this Aug 9, 2024
@XuanYang-cn
Copy link
Contributor

Hi @bcalvert-graft,

Thanks for the detailed report! Your insights are valuable. We appreciate you considering a quick fix PR.

You're absolutely right that the current behavior is counterproductive. We originally opted for raise e from e to minimize error stack depth in our catch-retry logic for network fluctuations and rate limits. However, it seems the trade-off is not worth the current issue.

@XuanYang-cn
Copy link
Contributor

/unassign
/assign @bcalvert-graft

@bcalvert-graft
Copy link
Author

Thanks @XuanYang-cn for the fast reply. I'll open up a PR as soon as I can

@bcalvert-graft
Copy link
Author

We originally opted for raise e from e to minimize error stack depth in our catch-retry logic for network fluctuations and rate limits.

Also, thank you for this clarification on the motivation for the raise err from err choice, I hadn't considered that aspect of it

@bcalvert-graft
Copy link
Author

@XuanYang-cn and whoever else I should ping, I've opened up #2225 from a fork of the pymilvus repo. The scope of the changes in that PR ended up being larger than the three spots I flagged in the issue description above, as I wrote grep to find instances of the raise <e> from <e> pattern and found more. I had to tweak those as well to get the test-case written above to pass.

@bcalvert-graft
Copy link
Author

Hi all,

Gentle bump on this thread. As mentioned above, I've opened up #2225. Are there additional things I should do to move this bugfix forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants