-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bail out when chain detection is expensive + edge case fix #1173
Conversation
When multiple Blank Nodes referenced each other, they would erroneously be marked as being part of a cycle and hence not a chain, leading to the chain detection algorithm not working and the blank nodes being listed at the top level of the graph with a non-deterministic identifier. This fixes that; if those Blank Nodes for a chain without cycles, they will now be represented as a tree without needing to generate identifiers for them.
The chain detection algorithm is relatively expensive. Since it is only an optimisation to make our Dataset data structure deterministic in more cases, it makes sense to avoid running it when it takes a lot of time — the different representations are equivalent.
This pull request is automatically built and testable in CodeSandbox. To see build info of the built libraries, click here or the icon next to each commit SHA. Latest deployment of this branch, based on commit 6c519a6:
|
// we consider the detection of chains (O(n^2), I think) to be too | ||
// expensive, and just incorporate them as regular Blank Nodes with | ||
// non-deterministic, ad-hoc identifiers into the SolidDataset: | ||
const maxBlankNodesToDetectChainsFor = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a constant in the constants.ts
file ? So far it's only RDF constants, but it feels weird to have a hard-coded limit defined so locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, since it's really only used locally, i.e. there's just a single mention of this variable - I mostly extracted it to be able to give it a name that explained its purpose, rather than for it to be a magic number. I'm afraid that pulling it out of its context will make it harder to understand what it does, and it's not meant to be re-used elsewhere in the codebase either.
This PR fixes #1158. Working on that also helped me discover that there was a (fairly common) case in which we did not correctly detect non-cyclic chains and thus generated non-deterministic identifiers for blank nodes in cases where that was not necessary - this is fixed in the first commit c0960e2.
The second commit (61d0c89) bails out of chain detection when a Resource contains more than 20 Blank Nodes. Chain detection is really an optimisation: when a Resource contains non-cyclic Blank Nodes (e.g. when using RDF lists, or mashlib's default representation of trusted apps), it ensures that the resulting SolidDataset will always have the exact same properties whenever it's parsed. This helps e.g. tools like SWR use something like dequal to verify that the same Resource fetched twice was unchanged. However, there will always be cases when we cannot do this, e.g. when there's a cycle of Blank Nodes. Therefore, it's not a problem to simply skip that work when it would take a lot of time.