-
-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Validator Layout and Resolver #247
Conversation
Another important change I forgot to mention is |
Amazing work! Thank you for sharing it! :)
I cherry-picked some changes there to reduce the size of the diff :) Feel free to rebase (the changes are almost identical, so you can choose your branch to resolve conflicts). Hopefully, there is not much trouble.
I like the idea and feel like it is a step in the right direction. To elaborate on the changes, I'd like to share my understanding of the status quo and the proposed changes (I'll omit the trait bounds for brevity). The current implementation looks like this (please, correct me if I am wrong):
So:
Let's take this schema -
+ dynamic dispatch costs With the proposed changes:
So, there are still vectors inside the concrete validator types, but each box will lead to the same slice of trait objects. I.e. the example above will have 1 allocation less ([1] and [3] will lead to the same slice). If my understanding is correct then, I think it is a really great step forward :) Please, let me know if I am missing anything there. Btw, I was thinking whether it could be even more compact like this: I.e. if each concrete validator type will hold
My first instinct is to have a generic there, but I need to check the code in more detail to have a better feeling on this :)
Sounds good to me
I'll work on splitting the benchmarks (I saw a nice example of that in PyO3 recently), so it is easier to assess the changes.
Interesting idea! But yeah, the presence of paths limits this option.
Cool! I didn't think about it. I believe it could be factored out and merged separately. Generally, I think it will worth it to split the changes into separate PRs starting with breaking the cycles & resolving everything during the compilation phase, then go to resolvers. |
Thanks, yes. This is the gist of it. The main goal is that if we encounter validators for schemas that already exist (e.g. ones with $id or resolved through $ref) in The main issue I've encountered with this is the handling of paths. What happens to paths after switching schema roots via $refs? Are they indefinitely appended, or is the root reset? If the root is always calculated from the schema with $id, then this isn't an issue, but if we want to keep the entire path starting from the first schema, this can get tricky. Although I don't recommend the latter as paths could get extremely long for recursive validations. Either way I would consider moving the schema path from the validator into the tree somehow, so validators would look something like |
Yes, the plan was creating separate PR-s that are properly tested and documented, but I don't have time for it right now, so I decided to submit this as is for discussions. Feel free to pick changes you like and continue what I started and create the said PR-s, as I have very limited time during weekdays. |
@tamasfe
That's a really cool idea! :)
In the current implementation, paths are joined together, so recursion is reflected in resulting errors (not in validator structs itself). So, the root is reset and is local in new validators, but in the end, all chunks are joined together. It is close to the Keyword Relative Location definition from Draft 2020-12 but does not include
Agree, probably to have proper relative locations we still need some validation-time path building.
Yep, good idea! |
@Stranger6667 hey, no problem :) Unfortunately I won't have time to continue this in the foreseeable future as even my weekends are chaotic right now. I'll eventually make some time to pitch in though! I think the next thing I'd do is factoring out schema paths from validators into the tree so that validators are really just context-independent reusable functions and only then work on the arena-like allocations and async boilerplates. |
I will close this one, as I work on a rewrite where some of these ideas are incorporated, so in the end, all refs are resolved during the compilation phase. The graph itself is in the CSR format, which should be very cache-friendly & should be traversed fast enough. It is not finished but I plan to share a WIP PR here in a week or so. |
Hi @Stranger6667, I've done some experiments I'd like to share.
This PR touches on the topics of #246 and #212. This isn't intended to be merged as is (or at all), I'm just sharing my thoughts.
Validator graph
I made a validator layout based on the arena+index/reference approach to support cycles, for this I made two opaque types:
ValidatorBuf
andValidatorRef
, the idea is that we collect allValidatorBuf
s first in theCompilationContext
then finally store them inJSONSchema
, and the validator tree itself only contains references.In order to be able to do this, the only way to obtain a
ValidatorRef
during compilation is through theCompilationContext
here. Currently the only thing it does is adding the validator to the pool, and returning a reference that can can be used in the tree.This method also enables further operations/optimizations of validators during compilation since we have all of them at one place no matter how deep the tree is.
The reason why
ValidatorBuf
andValidatorRef
are wrapper types is that we can freely change the underlying implementation, currently I usedArc
for storing the validators andWeak
insideValidatorRef
in debug mode to make sure that everything works, then finallyBox
and raw pointers in release mode. I'm confident about the safety of the latter as we always keep all validators inJSONSchema
, and everything currently is synchronous.Resolver
I started implementing an external resolver, while turning the current
Resolver
to anInternalResolver
, that will solely resolve JSON pointers (and fragments). I've figured that no matter what we will have to carry it around in the context due to the architecture, and I still have no idea whether it would be better to just box it or carry the generic type around everywhere.I also haven't figured out how to support both sync and async compilations as it will involve a lot of boilerplate either way, the best idea I have so far is just a macro that wraps compile functions that can recurse into
compile_validators
to be both sync and async.Other thoughts
It would be nice to split the benchmarks up into parts, currently it takes ages especially because it contains benchmarks not related to
jsonschema
, I had unfortunately no time to run them and do this refactor, so currently I can only guess the performance implications of the changes I made.I've experimented with reusing validators, meaning that if more validators would do the same thing, the references in the tree would simply point to the same instance, making the pool more compact and cache-friendly. This is currently not possible because the validators also contain their path inside the schema.
edit: I've also changed all ToString implementations to Display, when I was halfway into it I understood why it hasn't been done, but decided to finish it it anyway ...