Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating lots of validator classes uses lots of memory #868

Closed
privatwolke opened this issue Oct 28, 2021 · 12 comments
Closed

Creating lots of validator classes uses lots of memory #868

privatwolke opened this issue Oct 28, 2021 · 12 comments
Labels
Bug Something doesn't work the way it should. Help Wanted An enhancement or bug for which a pull request is welcomed and which should have a clear definition.

Comments

@privatwolke
Copy link

privatwolke commented Oct 28, 2021

We're using jsonschema to run validation on payloads coming into our web services and we've been noticing a steady memory increase in our services since upgrading to version 4+ of jsonschema. There also seems to be a performance drain, but it seems that this is already reported in #853. Further investigation revealed that the problem seems to stem from a piece of code that is heavily based on jsonschema documentation. I've narrowed the memory leak down to this snippet that reproduces the problem:

from time import time
import jsonschema

def extend_with_default(validator_class):
    # ref: https://python-jsonschema.readthedocs.io/en/stable/faq/#why-doesn-t-my-schema-s-default-property-set-the-default-on-my-instance
    validate_properties = validator_class.VALIDATORS["properties"]

    def set_defaults(validator, properties, instance, schema):
        for property, subschema in properties.items():
            if "default" in subschema:
                instance.setdefault(property, subschema["default"])

        for error in validate_properties(
            validator, properties, instance, schema,
        ):
            yield error

    return jsonschema.validators.extend(
        validator_class, {"properties" : set_defaults},
    )

i = 0
start = time()

while True:
    i += 1
    if time() - start > 120:
        break
    extend_with_default(jsonschema.Draft4Validator)({
        'properties': {
            'foo': {
                'type': 'string'
            }
        }
    }).validate({'foo': 'bar'})

print(f'completed {i} iterations')

With jsonschema==3.2.0, this manages to do 1436757 iterations on my machine and shows this memory consumption:
320

However, when updating to 4.1.2 we only get 6763 iterations done and have steadily rising memory:
412

I've been getting similiar results on Python 3.7, 3.8, 3.9 and 3.10 as well as when using Draft7Validator instead of Draft4Validator.

When we don't wrap the validator with extend_with_default the memory consumption remains steady also on 4.1.2 and we get 2324680 iterations done (that function seems to be a performance drain also in 3.2.0, but memory-wise it was alright):
412-undecorated

@Julian
Copy link
Member

Julian commented Oct 30, 2021

Hi. Is your actual code doing what your timing example does? It's unclear why there are different memory characteristics, so perhaps there's something to look into regardless, but it'd be quite odd to call extend_with_default in a loop, that's a function that creates a class, so you wouldn't do that over and over again. Realistic code would firstly do that only once:

ValidatorThatSetsDefaults = extend_with_default(jsonschema.Draft4Validator)

and if you always validate with the same schema (which seems to be the case from the example, but unclear whether that matches your real code), then creating a validator instance only once as well:

validator = ValidatorThatSetsDefaults({'properties': {'foo': {'type': 'string'}}})

i = 0
start = time()

while True:
    i += 1
    if time() - start > 120:
        break
    validator.validate({'foo': 'bar'})

of course with whatever your real schema is.

With that, on my macbook air, I get around 28 million iterations on 4.1.2 (I didn't try yet with 3.2).

On PyPy without warming the JIT that number goes up to ~140 million iterations.

Happy to look into performance issues for pathological cases as well, but want to make sure you observe the issue in a real scenario as well?

@privatwolke
Copy link
Author

privatwolke commented Nov 3, 2021

Hey! Thanks for your answer. It was actually close to what we had in our codebase 🙈. Creating the class once indeed gets rid of the memory issue. It's certainly more efficient to do it this way, but in theory I don't think that creating a class that shouldn't have any references to it after the line is executed should leave anything behind memory-wise. So as you said, there may still be something that is worth checking out :-)

Iterations on my machine:

  • 3.2.0: 3089741
  • 4.1.2: 2303999

@Julian
Copy link
Member

Julian commented Nov 3, 2021

Yep indeed, it still may be worth looking into, just way lower priority :)

@Julian Julian changed the title Memory Leak in 4.0.0+ when using extend_with_default from documentation Creating lots of validator classes uses lots of memory Nov 3, 2021
@Julian Julian added Bug Something doesn't work the way it should. Help Wanted An enhancement or bug for which a pull request is welcomed and which should have a clear definition. labels Nov 3, 2021
@salty-horse
Copy link

salty-horse commented Dec 29, 2021

This is a bug in attrs in a caching mechanism that runs when classes of the same name are created, like the closure in validators.create(). See bug python-attrs/attrs#826

Upgrade to attrs 21.3.0 to fix this.

Consider making the next jsonschema release depend on this attrs version, or exclude the bad versions.

@Julian
Copy link
Member

Julian commented Dec 29, 2021

A PR would certainly be welcome adding the pin from below.

Though I'm still unconvinced any reasonable code should encounter this issue, so if you have some that does I'd love to see that too.

@salty-horse
Copy link

salty-horse commented Dec 29, 2021

Before pinning the version I think we should wait for confirmation by the OP. I also wonder if it fixes #853 (I had both issues open and accidentally replied there at first. Sorry.)

Here's my use-case. I admit it's probably not reasonable :) I'm already working on fixes.

In my codebase, APIs have schemas of different versions, as I slowly moved up from draft 3. The requests JSON is wrapped in case-insensitive dict, so in the old jsonschema that supported types= this code executed for every request:

api_schema_json.setdefault('$schema', 'http://json-schema.org/draft-03/schema#')

# This one-liner works for any schema version, and has no equivalent now that `types=` is removed.
jsonschema.validate(request_json, api_schema_json, types={'object': (dict, CaseInsensitiveDict)})

When migrating to the new TypeChecker, I wrote this code instead:

# Find the correct version to use
original_validator = jsonschema.validators.validator_for(api_schema_json, default=jsonschema.validators.Draft3Validator)

# Extend the type checker
Validator = jsonschema.validators.extend(
        original_validator,
        type_checker=original_validator.TYPE_CHECKER.redefine('object', lambda _, x: isinstance(x, (dict, CaseInsensitiveDict))))

# Run validation
Validator(schema=api_schema_json).validate(reques_json)

Regardless of the attrs bug, I will now make sure to only create one "type-checker-patched" validator per version, and will also call Validator(schema=api_schema_json) once per API.

@Julian
Copy link
Member

Julian commented Dec 29, 2021

I see. validator_for could possibly interact better with these kinds of attempts to "override" a draft's validator, but probably in the interim it's yeah easy enough just for you to have a dict mapping original validator classes to your overridden ones and use that.

@Julian
Copy link
Member

Julian commented Dec 29, 2021

(I strongly doubt this issue has anything to do with #853, as I said somewhere above -- #853 should now be fixed after some unrelated caching.)

@salty-horse
Copy link

On a related note, I wonder if the docs should strongly encourage creating and storing validator objects per object when used in production code. I suspect it's not uncommon for users to execute Validator(schema).validate(input) per input.

@Julian
Copy link
Member

Julian commented Dec 29, 2021

I wouldn't be so sure, I'd suspect users who have found the Validator classes will know, just like in normal Python code, that if you want to do things many times, you should create instances and reuse them, not repeatedly create them. Doing so isn't really anything specific to jsonschema, but I have no evidence one way or the other.

Way more likely are users who just use jsonschema.validate everywhere I think, but I don't know how to solve that, as nearly the first thing in the docstring for jsonschema.validate says:

If you know you have a valid schema already, especially if you intend to validate multiple instances with the same schema, you likely would prefer using the Validator.validate method directly on a specific validator (e.g. Draft7Validator.validate).

(though it didn't always say that IIRC.)

But I'm certainly open to adding it regardless of what I think if you can come up with some short wording in a PR, specifically pointing it out at worst can't hurt.

@privatwolke
Copy link
Author

I can have a look at this with my original snippet early in the new year when I'm back from vacation. Thank you for looking at this even though it stemmed from an unreasonable usage 😊

@Julian
Copy link
Member

Julian commented Feb 23, 2023

Hi there!

As part of a major upgrade in $ref behavior in #1049, I've already bumped the attrs lower-pin mentioned above

I've also just attempted to run the initial example, and here at least don't observe any memory growth after ~5 minutes of runtime.

Going to close this as hopefully fixed either directly via the attrs bump or via some other change since this was filed (sorry it took so long) -- but if you observe this behavior again feel free to follow up with a new reproducer.

@Julian Julian closed this as completed Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something doesn't work the way it should. Help Wanted An enhancement or bug for which a pull request is welcomed and which should have a clear definition.
Projects
None yet
Development

No branches or pull requests

3 participants