Creating lots of validator classes uses lots of memory #868

privatwolke · 2021-10-28T09:34:21Z

We're using jsonschema to run validation on payloads coming into our web services and we've been noticing a steady memory increase in our services since upgrading to version 4+ of jsonschema. There also seems to be a performance drain, but it seems that this is already reported in #853. Further investigation revealed that the problem seems to stem from a piece of code that is heavily based on jsonschema documentation. I've narrowed the memory leak down to this snippet that reproduces the problem:

from time import time
import jsonschema

def extend_with_default(validator_class):
    # ref: https://python-jsonschema.readthedocs.io/en/stable/faq/#why-doesn-t-my-schema-s-default-property-set-the-default-on-my-instance
    validate_properties = validator_class.VALIDATORS["properties"]

    def set_defaults(validator, properties, instance, schema):
        for property, subschema in properties.items():
            if "default" in subschema:
                instance.setdefault(property, subschema["default"])

        for error in validate_properties(
            validator, properties, instance, schema,
        ):
            yield error

    return jsonschema.validators.extend(
        validator_class, {"properties" : set_defaults},
    )

i = 0
start = time()

while True:
    i += 1
    if time() - start > 120:
        break
    extend_with_default(jsonschema.Draft4Validator)({
        'properties': {
            'foo': {
                'type': 'string'
            }
        }
    }).validate({'foo': 'bar'})

print(f'completed {i} iterations')

With jsonschema==3.2.0, this manages to do 1436757 iterations on my machine and shows this memory consumption:

However, when updating to 4.1.2 we only get 6763 iterations done and have steadily rising memory:

I've been getting similiar results on Python 3.7, 3.8, 3.9 and 3.10 as well as when using Draft7Validator instead of Draft4Validator.

When we don't wrap the validator with extend_with_default the memory consumption remains steady also on 4.1.2 and we get 2324680 iterations done (that function seems to be a performance drain also in 3.2.0, but memory-wise it was alright):

The text was updated successfully, but these errors were encountered:

Julian · 2021-10-30T18:38:08Z

Hi. Is your actual code doing what your timing example does? It's unclear why there are different memory characteristics, so perhaps there's something to look into regardless, but it'd be quite odd to call extend_with_default in a loop, that's a function that creates a class, so you wouldn't do that over and over again. Realistic code would firstly do that only once:

ValidatorThatSetsDefaults = extend_with_default(jsonschema.Draft4Validator)

and if you always validate with the same schema (which seems to be the case from the example, but unclear whether that matches your real code), then creating a validator instance only once as well:

validator = ValidatorThatSetsDefaults({'properties': {'foo': {'type': 'string'}}})

i = 0
start = time()

while True:
    i += 1
    if time() - start > 120:
        break
    validator.validate({'foo': 'bar'})

of course with whatever your real schema is.

With that, on my macbook air, I get around 28 million iterations on 4.1.2 (I didn't try yet with 3.2).

On PyPy without warming the JIT that number goes up to ~140 million iterations.

Happy to look into performance issues for pathological cases as well, but want to make sure you observe the issue in a real scenario as well?

privatwolke · 2021-11-03T13:16:48Z

Hey! Thanks for your answer. It was actually close to what we had in our codebase 🙈. Creating the class once indeed gets rid of the memory issue. It's certainly more efficient to do it this way, but in theory I don't think that creating a class that shouldn't have any references to it after the line is executed should leave anything behind memory-wise. So as you said, there may still be something that is worth checking out :-)

Iterations on my machine:

3.2.0: 3089741
4.1.2: 2303999

Julian · 2021-11-03T13:18:11Z

Yep indeed, it still may be worth looking into, just way lower priority :)

salty-horse · 2021-12-29T12:53:10Z

This is a bug in attrs in a caching mechanism that runs when classes of the same name are created, like the closure in validators.create(). See bug python-attrs/attrs#826

Upgrade to attrs 21.3.0 to fix this.

Consider making the next jsonschema release depend on this attrs version, or exclude the bad versions.

Julian · 2021-12-29T12:57:16Z

A PR would certainly be welcome adding the pin from below.

Though I'm still unconvinced any reasonable code should encounter this issue, so if you have some that does I'd love to see that too.

salty-horse · 2021-12-29T13:13:56Z

Before pinning the version I think we should wait for confirmation by the OP. I also wonder if it fixes #853 (I had both issues open and accidentally replied there at first. Sorry.)

Here's my use-case. I admit it's probably not reasonable :) I'm already working on fixes.

In my codebase, APIs have schemas of different versions, as I slowly moved up from draft 3. The requests JSON is wrapped in case-insensitive dict, so in the old jsonschema that supported types= this code executed for every request:

api_schema_json.setdefault('$schema', 'http://json-schema.org/draft-03/schema#')

# This one-liner works for any schema version, and has no equivalent now that `types=` is removed.
jsonschema.validate(request_json, api_schema_json, types={'object': (dict, CaseInsensitiveDict)})

When migrating to the new TypeChecker, I wrote this code instead:

# Find the correct version to use
original_validator = jsonschema.validators.validator_for(api_schema_json, default=jsonschema.validators.Draft3Validator)

# Extend the type checker
Validator = jsonschema.validators.extend(
        original_validator,
        type_checker=original_validator.TYPE_CHECKER.redefine('object', lambda _, x: isinstance(x, (dict, CaseInsensitiveDict))))

# Run validation
Validator(schema=api_schema_json).validate(reques_json)

Regardless of the attrs bug, I will now make sure to only create one "type-checker-patched" validator per version, and will also call Validator(schema=api_schema_json) once per API.

Julian · 2021-12-29T13:34:07Z

I see. validator_for could possibly interact better with these kinds of attempts to "override" a draft's validator, but probably in the interim it's yeah easy enough just for you to have a dict mapping original validator classes to your overridden ones and use that.

Julian · 2021-12-29T13:34:42Z

(I strongly doubt this issue has anything to do with #853, as I said somewhere above -- #853 should now be fixed after some unrelated caching.)

salty-horse · 2021-12-29T13:42:35Z

On a related note, I wonder if the docs should strongly encourage creating and storing validator objects per object when used in production code. I suspect it's not uncommon for users to execute Validator(schema).validate(input) per input.

Julian · 2021-12-29T13:54:05Z

I wouldn't be so sure, I'd suspect users who have found the Validator classes will know, just like in normal Python code, that if you want to do things many times, you should create instances and reuse them, not repeatedly create them. Doing so isn't really anything specific to jsonschema, but I have no evidence one way or the other.

Way more likely are users who just use jsonschema.validate everywhere I think, but I don't know how to solve that, as nearly the first thing in the docstring for jsonschema.validate says:

If you know you have a valid schema already, especially if you intend to validate multiple instances with the same schema, you likely would prefer using the Validator.validate method directly on a specific validator (e.g. Draft7Validator.validate).

(though it didn't always say that IIRC.)

But I'm certainly open to adding it regardless of what I think if you can come up with some short wording in a PR, specifically pointing it out at worst can't hurt.

privatwolke · 2021-12-29T22:38:39Z

I can have a look at this with my original snippet early in the new year when I'm back from vacation. Thank you for looking at this even though it stemmed from an unreasonable usage 😊

Julian · 2023-02-23T09:57:06Z

Hi there!

As part of a major upgrade in $ref behavior in #1049, I've already bumped the attrs lower-pin mentioned above

I've also just attempted to run the initial example, and here at least don't observe any memory growth after ~5 minutes of runtime.

Going to close this as hopefully fixed either directly via the attrs bump or via some other change since this was filed (sorry it took so long) -- but if you observe this behavior again feel free to follow up with a new reproducer.

Julian changed the title ~~Memory Leak in 4.0.0+ when using extend_with_default from documentation~~ Creating lots of validator classes uses lots of memory Nov 3, 2021

Julian added Bug Something doesn't work the way it should. Help Wanted An enhancement or bug for which a pull request is welcomed and which should have a clear definition. labels Nov 3, 2021

Zsailer mentioned this issue Jun 23, 2022

Meeting Notes 2022 jupyter-server/team-compass#15

Closed

Julian closed this as completed Feb 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Creating lots of validator classes uses lots of memory #868

Creating lots of validator classes uses lots of memory #868

privatwolke commented Oct 28, 2021 •

edited

Loading

Julian commented Oct 30, 2021

privatwolke commented Nov 3, 2021 •

edited

Loading

Julian commented Nov 3, 2021

salty-horse commented Dec 29, 2021 •

edited

Loading

Julian commented Dec 29, 2021

salty-horse commented Dec 29, 2021 •

edited

Loading

Julian commented Dec 29, 2021

Julian commented Dec 29, 2021

salty-horse commented Dec 29, 2021

Julian commented Dec 29, 2021

privatwolke commented Dec 29, 2021

Julian commented Feb 23, 2023

Creating lots of validator classes uses lots of memory #868

Creating lots of validator classes uses lots of memory #868

Comments

privatwolke commented Oct 28, 2021 • edited Loading

Julian commented Oct 30, 2021

privatwolke commented Nov 3, 2021 • edited Loading

Julian commented Nov 3, 2021

salty-horse commented Dec 29, 2021 • edited Loading

Julian commented Dec 29, 2021

salty-horse commented Dec 29, 2021 • edited Loading

Julian commented Dec 29, 2021

Julian commented Dec 29, 2021

salty-horse commented Dec 29, 2021

Julian commented Dec 29, 2021

privatwolke commented Dec 29, 2021

Julian commented Feb 23, 2023

privatwolke commented Oct 28, 2021 •

edited

Loading

privatwolke commented Nov 3, 2021 •

edited

Loading

salty-horse commented Dec 29, 2021 •

edited

Loading

salty-horse commented Dec 29, 2021 •

edited

Loading