performance regression from 3.2.0 to 4.0.1 #853
I noticed this too: for an application that uses jsonschema heavily, my test suite now takes 60 minutes vs. 14 before.
It will take some digging, but there is indeed a "known" performance change in 4.0.0: https://github.com/Julian/jsonschema/blob/main/CHANGELOG.rst#v400 specifically:
If someone could confirm (or deny) whether that's the culprit it'd be helpful, but yeah, this may need some investigating (though I won't know if I have time for at least a few days). CC @skamensky as well, who I know was interested in doing some performance optimization -- here's another benchmark we may want to adopt or use to drive any change.
I was also looking into upgrading from jsonschema 3.2.0 to >4.0.0 in order to use the uuid string validation that came with the 2019-09 draft. We also noticed around a 4.5x slower runtime when running our JSON schema validation tests. I'd need to confirm, but I don't think we are using the
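For context on what the "uuid" format check mentioned above amounts to: a format checker for uuid strings essentially just tries to parse the value. This is a stdlib-only sketch for illustration, not jsonschema's actual checker (which may differ in details):

```python
# Hypothetical illustration of a "uuid" format check, using only the
# standard library. jsonschema's real format checker may behave differently.
from uuid import UUID

def is_valid_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID."""
    try:
        UUID(value)
    except ValueError:
        return False
    return True

print(is_valid_uuid("12345678-1234-5678-1234-567812345678"))  # True
print(is_valid_uuid("not-a-uuid"))  # False
```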
In case it's helpful, here's a benchmark related to the Altair package (uses the vega-lite schema):

```python
from urllib.request import urlopen
import json
import jsonschema

with urlopen('https://raw.githubusercontent.com/vega/schema/master/vega-lite/v4.8.1.json') as f:
    schema = json.load(f)
with urlopen('https://raw.githubusercontent.com/vega/vega-lite/master/examples/specs/bar.vl.json') as f:
    instance = json.load(f)

print(jsonschema.__version__)
%timeit jsonschema.validate(instance, schema)
```

On a Google Colab CPU runtime, I get the following:
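A note on reproducing this: `%timeit` is an IPython magic and won't run in a plain Python script. A rough stdlib equivalent looks like the harness below; it's shown with a stand-in callable so it runs without network access, and you would substitute the real `jsonschema.validate(instance, schema)` call when benchmarking locally.

```python
import timeit

def validate_stub():
    # Stand-in for jsonschema.validate(instance, schema); replace with
    # the real call when reproducing the benchmark above.
    sum(range(1000))

n = 100
total = timeit.timeit(validate_stub, number=n)
print(f"{total / n * 1000:.3f} ms per call")
```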
I tried this same schema + instance with the usual optimization of creating the Validator instance only once (i.e. doing what `jsonschema.validate()` does: validate the schema itself once, create the validator instance, ...), and then calling it in a loop. Short test on Python 3.9.x, forcing use of Draft7Validator (as spec'd in the schema):

...which makes the impact even bigger: 45x slower. Update: cProfile output (with 100 calls to

Might create a flame graph later, but no promises... hope this helps to narrow it down. At the moment this blocks updating a few projects to the 4.x series (and to draft 2020-12).
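For readers unfamiliar with the validator-reuse optimization described above, it looks roughly like this (a sketch assuming the public `Draft7Validator` API; the schema here is a made-up minimal one, not the vega-lite schema from the benchmark):

```python
import jsonschema

schema = {"type": "object", "properties": {"n": {"type": "integer"}}}

# Do the one-time work up front, instead of letting jsonschema.validate()
# redo it on every call:
jsonschema.Draft7Validator.check_schema(schema)  # validate the schema once
validator = jsonschema.Draft7Validator(schema)   # build the validator once

# Per-instance validation is then comparatively cheap:
for instance in ({"n": 1}, {"n": 2}, {"n": 3}):
    validator.validate(instance)
print("all instances valid")
```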
Thanks! That does help a bit, and possibly points to some code that was merged to support location-independent identifiers. But specifically, my broad guess from just seeing the profile is that each time a ref is resolved, we're repeatedly walking over schemas looking for location-independent identifiers, when instead we should be doing that 0 or 1 times if a location-independent identifier is being dereferenced. Will require a bit more investigation, but yeah, that's at least a lead I suspect. (Certainly PRs are welcome to help if someone gets to it first.)
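The repeated walk described above, scanning the whole schema for subschemas that carry an `$id`, is essentially a recursive search like the following (a stdlib-only sketch for illustration; `find_ids` is a hypothetical name, not jsonschema's actual `_finditem`):

```python
def find_ids(schema):
    """Yield every subschema carrying an "$id" key, recursing through
    dicts and lists. Doing this on every $ref resolution costs
    O(schema size) per ref, which is the suspected source of the slowdown."""
    if isinstance(schema, dict):
        if "$id" in schema:
            yield schema
        for value in schema.values():
            yield from find_ids(value)
    elif isinstance(schema, list):
        for item in schema:
            yield from find_ids(item)

schema = {
    "$id": "root",
    "definitions": {"item": {"$id": "#item", "type": "string"}},
}
print([s["$id"] for s in find_ids(schema)])  # ['root', '#item']
```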
Another quick note here about the impact of this: in altair_viewer, with jsonschema<4.0, the test suite runs in 30 seconds. With jsonschema 4.0 or newer, the test suite times out after 6 hours. We fixed this in altair-viz/altair_viewer#44 by pinning to jsonschema<4.0.
My company also suffered a 19x slowdown when going from 3.2 to 4.1, which caused all sorts of issues. With this much evidence of a serious performance regression, it might be wise to add a warning about it in the README or CHANGELOG.
Indeed, quite a lot of time is spent on finding subschemas inside:

```python
class RefResolver:
    ...

    @lru_cache()
    def _find_ids(self):
        return list(self._finditem(self.referrer, "$id"))

    def resolve(self, ref):
        """
        Resolve the given reference.
        """
        url = self._urljoin_cache(self.resolution_scope, ref).rstrip("/")
        uri, fragment = urldefrag(url)
        # Use `_find_ids` below
        for subschema in self._find_ids():
            target_uri = self._urljoin_cache(
                self.resolution_scope, subschema["$id"],
            )
            if target_uri.rstrip("/") == uri.rstrip("/"):
                if fragment:
                    subschema = self.resolve_fragment(subschema, fragment)
                return url, subschema
        return url, self._remote_cache(url)
```

All tests pass. Running against the code from the first message in this issue:
This is definitely a reasonable assumption, so it sounds like such a patch would be very much appreciated. Thanks for the investigation.
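The gain from the `@lru_cache()` on `_find_ids` in the patch above is plain memoization: the expensive schema scan runs once, and every later call returns the cached list. A minimal stdlib illustration (the `expensive_scan` function is a made-up stand-in, not jsonschema code):

```python
from functools import lru_cache

calls = 0

@lru_cache()
def expensive_scan():
    """Stand-in for the costly walk over the schema for "$id" keys."""
    global calls
    calls += 1
    return ["result"]

expensive_scan()
expensive_scan()
expensive_scan()
print(calls)  # 1: the scan ran only once; later calls hit the cache
```

One known caveat of `lru_cache` on instance methods (as in the actual patch) is that the cache holds a reference to `self`, keeping the resolver alive for the cache's lifetime.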
Great! Happy to help :) After some more adjustments:

```python
class RefResolver:
    ...

    @lru_cache()
    def _find_ids(self):
        return list(self._finditem(self.referrer, "$id"))

    @lru_cache()
    def _find_in_subschema(self, url):
        uri, fragment = urldefrag(url)
        for subschema in self._find_ids():
            target_uri = self._urljoin_cache(
                self.resolution_scope, subschema["$id"],
            )
            if target_uri.rstrip("/") == uri.rstrip("/"):
                if fragment:
                    subschema = self.resolve_fragment(subschema, fragment)
                return url, subschema
        return None

    def resolve(self, ref):
        """
        Resolve the given reference.
        """
        url = self._urljoin_cache(self.resolution_scope, ref).rstrip("/")
        match = self._find_in_subschema(url)
        if match is not None:
            return match
        return url, self._remote_cache(url)
```

Cuts the execution time to 11.305032968521118 sec. I'll submit a patch shortly.
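Both versions of `resolve` above split the reference with `urllib.parse.urldefrag`, which separates a URL from its fragment; the fragment is what `resolve_fragment` then walks. A quick stdlib demonstration:

```python
from urllib.parse import urldefrag

# urldefrag splits a URL at the "#", dropping the "#" itself.
uri, fragment = urldefrag("http://example.com/schema.json#/definitions/item")
print(uri)       # http://example.com/schema.json
print(fragment)  # /definitions/item

# A URL with no fragment yields an empty fragment string.
uri, fragment = urldefrag("http://example.com/schema.json")
print(repr(fragment))  # ''
```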
v4.3.1 is out, with great thanks to @Stranger6667 for putting in the time to make the fix. I haven't fully tested it myself against the examples above, but please do share feedback.
Updated my measurements in the OP. Thank you guys for putting in the work! ❤️ It is a 3x improvement, but still a little bit slower than the 3.2 version. I think this is a manageable slowdown now, and from my side the ticket could be closed.
I just tried the altair_viewer test suite with jsonschema 4.3; it's back to executing in under 30 seconds (altair-viz/altair_viewer#46). Thanks so much for the fix!
Awesome! :) There is a small follow-up patch that reduces the number of calls to the subschema search routine; not a dramatic improvement, but it seems like a nice-to-have :) In my use case, I'd be happy to upgrade. Thanks to everybody involved for such detailed reports & for reviewing patches :)
Thanks all for the feedback and a big thanks again to @Stranger6667. Sounds like we can close this for now. |
Hi @Julian, first of all thanks for this library and all the hard work. We rely heavily on it and are delighted to see Draft 2020-12 support, as it is required for OpenAPI 3.1.0. When updating, I noticed our test suite ran noticeably slower. I compiled a quick reproduction for you. On my machine I get a 5x slowdown with the new version. Is this performance hit to be expected due to the new capabilities, or is it a regression?

EDIT: included measurement for the 4.2.1 release

EDIT: included measurement for the 4.3.1 release