-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle invalid metadata for SQLDocumentStore
#2868
Handle invalid metadata for SQLDocumentStore
#2868
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks promising! That's a nice default behavior for the SQLDocumentStore
for now. Left a few comments for improvements
haystack/document_stores/sql.py
Outdated
if value is None or isinstance(value, self.valid_metadata_types): | ||
valid_meta_orms.append(MetaDocumentORM(name=name, value=value)) | ||
else: | ||
logger.warning( | ||
f"Metadata '{name}' skipped for document {doc.id}, since it has invalid type: {type(value).__name__}.\n" | ||
f"SQLDocumentStore accepts only the following types: {', '.join([el.__name__ for el in self.valid_metadata_types])}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For speed reason I'd use a try-catch
in here. It also saves you from specifying the allowed types in line 112.
In addition, this is quite a bad error imho, so I'd raise the log to ERROR
or even EXCEPTION
. Personally, if this is an issue that requires action from the users in almost all cases, I'd go for EXCEPTION
. If you see any situation in which the users might want to ignore that for a good reason, then ERROR
is better (the only difference between the two is that exception
shows the stack trace while error
doesn't).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Agree on logging an
ERROR
. I think that there are cases when the user simply wants to ignore this error (see SQL based Datastores fail when document metadata has a list #2792 initial issue). -
About
try-catch
: the problem is that the error only shows up, during the first writing operation of the whole record, eg here:haystack/haystack/document_stores/sql.py
Line 401 in 8ee2b6b
self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
IMHO, we could properly usetry-catch
, if the exception was raised during the creation ofMetaDocumentORM
.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, I see the point. There are two ways to go:
- This one (just fine), or
- See if SQLAlchemy offers any hook to deal with unsupported types, like
json
and Pydantic do. Do you mind having a look? If the library offers no help there, we can stick with your original solution.
@@ -453,6 +453,32 @@ def test_write_document_meta(document_store: BaseDocumentStore): | |||
assert document_store.get_document_by_id("4").meta["meta_field"] == "test4" | |||
|
|||
|
|||
@pytest.mark.parametrize("document_store", ["sql", "faiss", "milvus1", "milvus"], indirect=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would test all docstores here, so we're sure there is a consistent behavior across all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to test only docstores, whose metadata are written in SQL databases.
For other docstores, I think that we don't have this error...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're right, because the other docstores would be happy to deal with lists in the meta, so they won't show this error.
Then we might even want to test only sql
. I got thinking here because, for example. pinecone
uses SQLDocumentStore
as well, and keeping track of all subclasses would be prone to errors. Let's just test sql
, it should be enough.
assert not "invalid_meta_field" in document_store.get_document_by_id("1").meta | ||
assert document_store.get_document_by_id("1").meta["valid_meta_field"] == "test1" | ||
assert not "invalid_meta_field" in document_store.get_document_by_id("2").meta | ||
assert document_store.get_document_by_id("2").meta["valid_meta_field"] == "test2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make it slightly stricter, how about testing the whole dictionary?
assert not "invalid_meta_field" in document_store.get_document_by_id("1").meta | |
assert document_store.get_document_by_id("1").meta["valid_meta_field"] == "test1" | |
assert not "invalid_meta_field" in document_store.get_document_by_id("2").meta | |
assert document_store.get_document_by_id("2").meta["valid_meta_field"] == "test2" | |
assert document_store.get_document_by_id("1").meta == {"valid_meta_field": "test1"} | |
assert document_store.get_document_by_id("2").meta == {"valid_meta_field": "test2"} |
…ndle_nostring_metadata_sqlstore
…e' into handle_nostring_metadata_sqlstore
@ZanSara Generally, SQLAlchemy prefers to let these checks to the DBs, but I found simple validators! Maybe the error handling and logging can be improved, even if now I think it's not so bad.
I also improved the test... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, good to go! The validation hooks for SQLAlchemy are really neat. I agree the database should always have the last word, but a bit of proactive validation won't hurt and saves a scary error to the users, so I really like it.
Thank you very much @anakin87 😊
* modify notebook * skip invalid metadata * Update Documentation & Code Style * fix nonetype * fix nonetype * drop nonetype from valid types * drop nonetype from valid types * fix * Update sql.py * sqlalchemy validation * removed newlines * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Related Issue(s): #2792
Proposed changes:
Valid types are
str, int, float, bool, bytes, bytearray, NoneType
Pre-flight checklist
In #2792, we seem to agree that it would be good to have a more expressive format for metadata value (not simply string).
We also understand that it would be a complex change that could break several things.
So to avoid blocking errors, for now, the best solution seems to skip writing metadata that is incorrectly managed by SqlAlchemy.
And I've started this implementation...
(Another possibility is to write a string representation in the DB even for complex metadata.)
I make the PR a draft, to run all the tests...