
WIP: Validator Layout and Resolver #247

Closed
wants to merge 4 commits

Conversation

tamasfe (Contributor) commented on Jul 12, 2021:

Hi @Stranger6667, I've done some experiments I'd like to share.

This PR touches on the topics of #246 and #212. It isn't intended to be merged as is (or at all); I'm just sharing my thoughts.

Validator graph

I made a validator layout based on the arena + index/reference approach to support cycles. For this I introduced two opaque types, ValidatorBuf and ValidatorRef: the idea is that we collect all ValidatorBufs in the CompilationContext first and finally store them in JSONSchema, while the validator tree itself only contains references.

In order to be able to do this, the only way to obtain a ValidatorRef during compilation is through the CompilationContext here. Currently the only thing it does is add the validator to the pool and return a reference that can be used in the tree.

This method also enables further operations/optimizations on validators during compilation, since we have all of them in one place no matter how deep the tree is.

The reason ValidatorBuf and ValidatorRef are wrapper types is so that we can freely change the underlying implementation: currently I use Arc for storing the validators and Weak inside ValidatorRef in debug mode to make sure everything works, and Box with raw pointers in release mode. I'm confident about the safety of the latter because we always keep all validators in JSONSchema, and everything is currently synchronous.
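To make the layout concrete, here is a rough sketch of how I picture the pieces fitting together. The field names and method bodies are illustrative only, not the exact code in this branch, and the debug/release switch is left out (this version just uses Arc/Weak everywhere):

```rust
use std::sync::{Arc, Weak};

use serde_json::Value;

/// Stand-in for the crate's `Validate` trait.
trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

/// Owning handle, kept in the pool.
struct ValidatorBuf(Arc<dyn Validate>);

/// Non-owning handle, stored inside the validator tree.
#[derive(Clone)]
struct ValidatorRef(Weak<dyn Validate>);

/// The compilation context owns the pool and is the only way to get a ref.
#[derive(Default)]
struct CompilationContext {
    pool: Vec<ValidatorBuf>,
}

impl CompilationContext {
    /// Add a validator to the pool and hand back a reference for the tree.
    fn add_validator(&mut self, validator: Arc<dyn Validate>) -> ValidatorRef {
        let r = ValidatorRef(Arc::downgrade(&validator));
        self.pool.push(ValidatorBuf(validator));
        r
    }
}

/// The compiled schema keeps all `ValidatorBuf`s alive, so every
/// `ValidatorRef` in the tree stays valid for the schema's lifetime.
struct JSONSchema {
    pool: Vec<ValidatorBuf>,
    root: Vec<ValidatorRef>,
}
```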

Resolver

I started implementing an external resolver, while turning the current Resolver into an InternalResolver that will solely resolve JSON pointers (and fragments). I've realized that, due to the architecture, we will have to carry it around in the context no matter what, and I still have no idea whether it would be better to just box it or carry the generic type around everywhere.
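Roughly, the two options I keep going back and forth between look like this (trait name and signature are illustrative, not what is in this branch):

```rust
use serde_json::Value;

/// Hypothetical external resolver: only fetches whole remote documents;
/// JSON pointers and fragments stay with the InternalResolver.
trait SchemaResolver {
    fn resolve(&self, url: &str) -> Result<Value, Box<dyn std::error::Error>>;
}

/// Option 1: box it and carry `Box<dyn SchemaResolver>` in the context.
struct BoxedContext {
    resolver: Box<dyn SchemaResolver>,
}

/// Option 2: make the context (and everything holding it) generic.
struct GenericContext<R: SchemaResolver> {
    resolver: R,
}
```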

I also haven't figured out how to support both sync and async compilation, as it will involve a lot of boilerplate either way; the best idea I have so far is a macro that wraps the compile functions that can recurse into compile_validators, so that they can be both sync and async.
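To illustrate what I mean by boilerplate (hand-written stubs with hypothetical names, not code from this branch): the sync and async variants end up identical except for the `.await` points, which is exactly what a wrapping macro could generate from a single body:

```rust
use serde_json::Value;

struct Context; // placeholder for the real compilation context
struct Node;    // placeholder for a compiled validator
struct Error;   // placeholder error type

fn resolve(_ctx: &Context, _schema: &Value) -> Result<Value, Error> {
    Ok(Value::Null) // pretend this is a blocking fetch
}

async fn resolve_async(_ctx: &Context, _schema: &Value) -> Result<Value, Error> {
    Ok(Value::Null) // pretend this is a non-blocking fetch
}

fn compile(_resolved: &Value, _ctx: &mut Context) -> Result<Node, Error> {
    Ok(Node) // pretend this compiles keywords and may recurse
}

// Sync variant.
fn compile_validators(schema: &Value, ctx: &mut Context) -> Result<Node, Error> {
    let resolved = resolve(ctx, schema)?;
    compile(&resolved, ctx)
}

// Async variant: the same logic with `async`/`.await` sprinkled in,
// which is the duplication the macro would have to paper over.
async fn compile_validators_async(schema: &Value, ctx: &mut Context) -> Result<Node, Error> {
    let resolved = resolve_async(ctx, schema).await?;
    compile(&resolved, ctx)
}
```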

Other thoughts

It would be nice to split the benchmarks up into parts; currently they take ages, especially because they contain benchmarks not related to jsonschema. Unfortunately I had no time to run them and do this refactor, so currently I can only guess the performance implications of the changes I made.

I've experimented with reusing validators: if multiple validators do the same thing, the references in the tree would simply point to the same instance, making the pool more compact and cache-friendly. This is currently not possible because the validators also contain their path inside the schema.
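A sketch of what the reuse could look like once the path is out of the validators (hypothetical types; the cache key here is simply the serialized sub-schema):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

#[derive(Default)]
struct ValidatorPool {
    // Keyed by the serialized sub-schema, so identical keywords such as
    // `{"type": "integer"}` map to a single shared validator instance.
    cache: HashMap<String, Arc<dyn Validate>>,
}

impl ValidatorPool {
    fn get_or_compile(
        &mut self,
        sub_schema: &Value,
        compile: impl FnOnce(&Value) -> Arc<dyn Validate>,
    ) -> Arc<dyn Validate> {
        let key = sub_schema.to_string();
        self.cache
            .entry(key)
            .or_insert_with(|| compile(sub_schema))
            .clone()
    }
}
```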

edit: I've also changed all ToString implementations to Display; when I was halfway into it I understood why it hadn't been done, but decided to finish it anyway ...

tamasfe (Contributor, Author) commented on Jul 12, 2021:

Another important change I forgot to mention is #[non_exhaustive] on the errors. This is a breaking API change now, but new validators and their error types can then be added in any release, so it would prevent further breaking changes.
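For reference, a minimal example of what #[non_exhaustive] buys us (a simplified error kind, not the actual one in the crate): downstream crates are forced to keep a wildcard arm, so adding variants later is not a breaking change for them:

```rust
#[non_exhaustive]
pub enum ValidationErrorKind {
    Type,
    Required,
    // New keywords can add variants here in any release.
}

fn describe(kind: &ValidationErrorKind) -> &'static str {
    match kind {
        ValidationErrorKind::Type => "type mismatch",
        // Outside the defining crate, the compiler requires a catch-all arm,
        // so future variants cannot break downstream matches.
        _ => "other validation error",
    }
}
```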

Stranger6667 (Owner) commented:

@tamasfe

Amazing work! Thank you for sharing it! :)

> edit: I've also changed all ToString implementations to Display; when I was halfway into it I understood why it hadn't been done, but decided to finish it anyway ...

I cherry-picked some changes there to reduce the size of the diff :) Feel free to rebase (the changes are almost identical, so you can choose your branch to resolve conflicts). Hopefully, there is not much trouble.

Validator graph

I like the idea and feel like it is a step in the right direction.

To elaborate on the changes, I'd like to share my understanding of the status quo and the proposed changes (I'll omit the trait bounds for brevity). The current implementation looks like this (please, correct me if I am wrong):

[diagram: JSONSchema-validators-current]

  1. JSONSchema.validators is a vector of Box<dyn Validate>, i.e. a pointer to a heap allocation of other pointers (B_11, B_12, ...)
  2. Each box there leads to a validator that may have a Vec<Box<dyn Validate>> with similar pointers to other heap allocations (B_21, B_22, ...)

So:

  • a separate vector for each level;
  • each Box<dyn Validate> goes to a different allocation.

Let's take this schema: {"items": {"type": "string"}}. References are followed roughly like this:

  1. Follow the reference to the slice of Box<dyn Validate> [1] (iterate over self.validators)
  2. Deref Box that wraps dyn Validate leading to ItemsObjectValidator [2]
  3. Follow ItemsObjectValidator.validators to another slice of Box<dyn Validate> [3]
  4. Deref Box and get StringTypeValidator [4]

+ dynamic dispatch costs
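In code, I picture the current shape roughly like this (a simplified sketch, not the literal definitions in the crate):

```rust
use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

// Current layout (simplified): every level owns its own vector of boxes,
// and every `Box<dyn Validate>` is a separate heap allocation.
struct JSONSchema {
    validators: Vec<Box<dyn Validate>>, // [1]: B_11, B_12, ...
}

struct ItemsObjectValidator {
    validators: Vec<Box<dyn Validate>>, // [3]: B_21, B_22, ...
}

struct StringTypeValidator; // [4]: a leaf, no children
```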

With the proposed changes:

  1. JSONSchema.validators is effectively a vector of *const (dyn Validate), which seems to involve the same number of indirections as before (a pointer to a slice of pointers to trait objects).
  2. On this level, validators may have a vector of pointers that lead to the same allocation of trait objects.

So, there are still vectors inside the concrete validator types, but each box will lead to the same slice of trait objects, i.e. the example above will have one allocation less ([1] and [3] will lead to the same slice).
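And the proposed shape, as I understand it (again a simplified sketch; the branch hides the raw pointer behind ValidatorRef):

```rust
use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

// Proposed layout (simplified): one pool owns every validator; the root level
// and the container validators only store pointers into that pool.
struct JSONSchema {
    pool: Vec<Box<dyn Validate>>,         // single owner of all validators
    validators: Vec<*const dyn Validate>, // root level: pointers into `pool`
}

struct ItemsObjectValidator {
    validators: Vec<*const dyn Validate>, // children: pointers into the same `pool`
}
```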

If my understanding is correct, then I think it is a really great step forward :) Please, let me know if I am missing anything there.

Btw, I was thinking whether it could be even more compact like this:

[diagram: JSONSchema-validators-maybe]

I.e. each concrete validator type would hold &[&dyn Validate] (or &[Box<dyn Validate>], or &[*const dyn Validate]) that leads into the same allocation. But it could probably be discussed later.
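Something like this, as a sketch (glossing over the self-referential borrows this would require in safe Rust, which is presumably part of why the PR reaches for Arc/raw pointers instead):

```rust
use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

// One flat arena of trait objects plus one flat arena of references into it;
// each concrete validator then stores only a window into the reference arena.
struct Arena<'a> {
    nodes: Vec<Box<dyn Validate + 'a>>, // owns every validator
    edges: Vec<&'a dyn Validate>,       // flat list of all child references
}

struct ItemsObjectValidator<'a> {
    children: &'a [&'a dyn Validate],   // sub-slice of `Arena::edges`
}
```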

Resolver

> I started implementing an external resolver, while turning the current Resolver into an InternalResolver that will solely resolve JSON pointers (and fragments). I've realized that, due to the architecture, we will have to carry it around in the context no matter what, and I still have no idea whether it would be better to just box it or carry the generic type around everywhere.

My first instinct is to have a generic there, but I need to check the code in more detail to get a better feel for this :)

> I also haven't figured out how to support both sync and async compilation, as it will involve a lot of boilerplate either way; the best idea I have so far is a macro that wraps the compile functions that can recurse into compile_validators, so that they can be both sync and async.

Sounds good to me

> It would be nice to split the benchmarks up into parts; currently they take ages, especially because they contain benchmarks not related to jsonschema. Unfortunately I had no time to run them and do this refactor, so currently I can only guess the performance implications of the changes I made.

I'll work on splitting the benchmarks (I saw a nice example of that in PyO3 recently), so it is easier to assess the changes.

> I've experimented with reusing validators: if multiple validators do the same thing, the references in the tree would simply point to the same instance, making the pool more compact and cache-friendly. This is currently not possible because the validators also contain their path inside the schema.

Interesting idea! But yeah, the presence of paths limits this option.

> Another important change I forgot to mention is #[non_exhaustive] on the errors. This is a breaking API change now, but new validators and their error types can then be added in any release, so it would prevent further breaking changes.

Cool! I didn't think about it. I believe it could be factored out and merged separately.

Generally, I think it will be worth it to split the changes into separate PRs, starting with breaking the cycles & resolving everything during the compilation phase, then moving on to resolvers.

Stranger6667 mentioned this pull request on Jul 12, 2021.
tamasfe (Contributor, Author) commented on Jul 12, 2021:

> If my understanding is correct, then I think it is a really great step forward :)

Thanks, yes. This is the gist of it.

The main goal is that if we encounter validators for schemas that already exist (e.g. ones with $id or resolved through $ref) in compile_validators, we can simply return references to them instead of calling compile, so a validator can even contain itself as a child (this logic is not in this PR).
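A sketch of the lookup step I have in mind (hypothetical names and a plain pool index as the reference type): the important detail is reserving the reference before recursing, which is what lets cycles terminate:

```rust
use std::collections::HashMap;

/// Stand-in for the real opaque reference type (here just a pool index).
#[derive(Clone)]
struct ValidatorRef(usize);

#[derive(Default)]
struct CompilationContext {
    seen: HashMap<String, ValidatorRef>,
    pool_len: usize,
}

impl CompilationContext {
    fn compile_validators(&mut self, uri: &str) -> ValidatorRef {
        if let Some(existing) = self.seen.get(uri) {
            // Already compiled (or currently being compiled): reuse it.
            return existing.clone();
        }
        // Reserve the reference *before* recursing into children, so a schema
        // that references itself (directly or through a $ref chain) finds it
        // in `seen` instead of recursing forever.
        let r = self.reserve();
        self.seen.insert(uri.to_string(), r.clone());
        // ... compile the keywords here; child sub-schemas recurse back into
        // `compile_validators`, and cycles resolve to the reserved `r`.
        r
    }

    fn reserve(&mut self) -> ValidatorRef {
        let r = ValidatorRef(self.pool_len);
        self.pool_len += 1;
        r
    }
}
```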

The main issue I've encountered with this is the handling of paths. What happens to paths after switching schema roots via $refs? Are they indefinitely appended, or is the root reset?

If the root is always calculated from the schema with $id, then this isn't an issue, but if we want to keep the entire path starting from the first schema, this can get tricky. Although I don't recommend the latter as paths could get extremely long for recursive validations.

Either way I would consider moving the schema path from the validator into the tree somehow, so validators would look something like Vec<(SchemaPath, ValidatorRef)>. Currently validators cannot be reused because they contain the schema path, so the array gets blown up with validators like type: integer, which is pretty much just the same function.
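Roughly like this (hypothetical stand-in types): the path lives on the edge of the tree rather than in the validator, so many edges can share one validator instance while errors still get their own schema path:

```rust
use std::sync::Arc;

use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

/// Stand-ins for the real types.
#[derive(Clone)]
struct SchemaPath(Vec<String>);
#[derive(Clone)]
struct ValidatorRef(Arc<dyn Validate>);

/// The schema path is attached to the edge, not to the validator, so every
/// `{"type": "integer"}` in the schema can share one validator instance while
/// each occurrence still reports its own path in errors.
struct Node {
    children: Vec<(SchemaPath, ValidatorRef)>,
}
```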

tamasfe (Contributor, Author) commented on Jul 12, 2021:

> Generally, I think it will be worth it to split the changes into separate PRs, starting with breaking the cycles & resolving everything during the compilation phase, then moving on to resolvers.

Yes, the plan was to create separate PRs that are properly tested and documented, but I don't have time for it right now, so I decided to submit this as is for discussion. Feel free to pick the changes you like, continue what I started, and create said PRs, as I have very limited time during weekdays.

Stranger6667 mentioned this pull request on Aug 15, 2021.
Stranger6667 (Owner) commented:

@tamasfe
I am so sorry, I completely missed your previous message :(

> The main goal is that if we encounter validators for schemas that already exist (e.g. ones with $id or resolved through $ref) in compile_validators, we can simply return references to them instead of calling compile, so a validator can even contain itself as a child (this logic is not in this PR).

That's a really cool idea! :)

> The main issue I've encountered with this is the handling of paths. What happens to paths after switching schema roots via $refs? Are they indefinitely appended, or is the root reset?

In the current implementation, paths are joined together, so recursion is reflected in the resulting errors (not in the validator structs themselves). So, the root is reset and is local to new validators, but in the end all chunks are joined together.

It is close to the Keyword Relative Location definition from Draft 2020-12 but does not include $ref in the resulting paths. Here I followed the way it is implemented in Python's jsonschema library.

> If the root is always calculated from the schema with $id, then this isn't an issue, but if we want to keep the entire path starting from the first schema, this can get tricky. Although I don't recommend the latter as paths could get extremely long for recursive validations.

Agreed; to have proper relative locations we probably still need some validation-time path building.

> Either way I would consider moving the schema path from the validator into the tree somehow, so validators would look something like Vec<(SchemaPath, ValidatorRef)>. Currently validators cannot be reused because they contain the schema path, so the array gets blown up with validators like type: integer, which is pretty much just the same function.

Yep, good idea!

tamasfe (Contributor, Author) commented on Aug 15, 2021:

@Stranger6667 hey, no problem :)

Unfortunately I won't have time to continue this in the foreseeable future as even my weekends are chaotic right now. I'll eventually make some time to pitch in though!

I think the next thing I'd do is factor out schema paths from validators into the tree, so that validators are really just context-independent reusable functions, and only then work on the arena-like allocations and the async boilerplate.

Stranger6667 (Owner) commented:

I will close this one, as I am working on a rewrite that incorporates some of these ideas, so in the end all refs are resolved during the compilation phase. The graph itself is in the CSR format, which should be very cache-friendly and fast enough to traverse. It is not finished, but I plan to share a WIP PR here in a week or so.
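For context, a generic sketch of what a CSR-style validator graph looks like (illustrative only, not the actual types from the rewrite): all nodes in one flat vector, all child edges in another, and per-node offsets into the edge list, so traversal is just index arithmetic over two allocations:

```rust
use serde_json::Value;

trait Validate {
    fn is_valid(&self, instance: &Value) -> bool;
}

/// CSR-style graph: two flat allocations, traversed by index.
struct ValidatorGraph {
    nodes: Vec<Box<dyn Validate>>, // node `i` is `nodes[i]`
    offsets: Vec<usize>,           // children of node `i`: edges[offsets[i]..offsets[i + 1]]
    edges: Vec<usize>,             // child indices back into `nodes`
}

impl ValidatorGraph {
    fn children(&self, node: usize) -> &[usize] {
        &self.edges[self.offsets[node]..self.offsets[node + 1]]
    }
}
```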

tamasfe deleted the feat-schema-type branch on May 12, 2022.