Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parameter to limit depth of recursion across relationships #97

Open
grant-humphries opened this issue Apr 6, 2017 · 15 comments
Open

Comments

@grant-humphries
Copy link

Hi folks, I'm attempting to upgrade from version 0.3.1 to 0.3.3. The recursion across all relationships to create schema nodes for more distantly related tables (as opposed to just primary relationships) is one of the features that is leading me to upgrade, but the ORM that I'm working with is pretty massive. When calling SQLAlchemySchemaNode against one of our tables I let the code run for a minute or so and add_nodes method got called just under 41,000 times. I've let it run for several minutes longer than that, but the task I was attempting never completed and python was chewing up a pretty good chunk of memory during that time.

In light of this I propose that a recursion depth parameter be added that limits how many tables away from the root an relationship can be and still be added to the schema. It may be a good idea to have a default value in place for this as well to avoid situations like the one I describe.

@tisdall
Copy link
Collaborator

tisdall commented Apr 6, 2017

Ideally the best solution would be for it stop if it reaches a node it's already visited. However, that might be complicated to implement and requires keeping track of all visited nodes.

A recursion depth count would stop it running infinitely, but the end result coming from the method would probably be unusable. It'd probably be good to have it throw an exception if the recursion goes beyond a given depth. (this would probably good to have either way)

I haven't worked on this project in years, though, so I'm not likely the best person to implement this. A recursion depth counter shouldn't be too hard to implement and I'd accept a PR for that if it had tests and docs.

@grant-humphries
Copy link
Author

Thanks for taking the time to respond to this @tisdall. This package is pretty useful to my company so I might be able to convince them to allow me some time to create a pull request for this. A couple of responses to your comments:

Ideally the best solution would be for it stop if it reaches a node it's already visited. However, that might be complicated to implement and requires keeping track of all visited nodes.

It seems like a straight forward solution to this would be add the sqlalchemy class names to a set after each node visit. Since sets use hash tables the look-ups should still be pretty quick even if the ORM is really large. Does anything come to mind off hand that would circumvent this approach? It looks like something similar has been done with in prevent circular dependencies with parent_ parameter.

A recursion depth count would stop it running infinitely, but the end result coming from the method would probably be unusable.

Can you elaborate on why the schema may be unusable in those cases? It seems like it would be possible to stop adding relationships once they are more than x levels away from the root table.

In our case I can't imagine a scenario where we want tables more the three or four levels from the root added to the schema so it would still be nice to be able prevent the creation of large parts of the schema that are never intended to be used.

@tisdall
Copy link
Collaborator

tisdall commented Apr 16, 2017

@grant-humphries - What I meant was that if you had data/schema that went 8 layers deep but returned only 4, the result wouldn't be very usable as it's missing data and if you used that to edit the data the end result would save only 4 layers and truncate the rest. So, I think it should be an exception and leave it up to the implementer to deal with it as they want. In other words, it should always be the case that objectify(dictify(x)) should always return x if there's no exception.

In your specific example, you'd specify the limit to recursion to be 4 since you expect nothing to be deeper than that. Anything that went deeper would be an exception.

@tisdall
Copy link
Collaborator

tisdall commented Apr 16, 2017

It's been a while since I used SQLAlchemy so I can't remember how it handles some things... I don't think you can just look at class names as you could have a table that has foreign keys to itself. I think you'd have to look at the table and the primary keys to identify if it's the exact same row you've already visited.

@maparent
Copy link

maparent commented Apr 24, 2017

This may sound simple-minded, but what about immediately setting the __colanderalchemy__ property of the class in the SQLAlchemySchemaNode initializer? That node is not complete yet, but it is already a valid reference.
Then, when we get the relationship _class in get_schema_from_relationship (schema.py:460), we can check if the attribute is set and return it directly if so.
The parents_ attribute will not be valid with this approach, but it does not seem to be used except to (unsuccessfully) avoid the loop.

@tisdall
Copy link
Collaborator

tisdall commented Apr 25, 2017

@grant-humphries - can you provide a failing example that we can run tests against? Take a look at the 'tests' directory here for how the tests work and see if you can modify one to mimic your data structure and hopefully replicate the problem.

@Joishi
Copy link

Joishi commented Nov 2, 2017

@tisdall - I'll take a look, as I am encountering the same issue

@Joishi
Copy link

Joishi commented Nov 6, 2017

@tisdall - Actually, looking at my models and your tests and doing some debugging, I'm not noticing any sort of recursive circle issues. What I am noticing is that it's recursing as deeply down the stack as necessary to create a full schema. My ORM contains about 400 models, many of which are intertwined. To create a full schema for 1 model ends up creating over 250k schemas (many of which are duplicates, just at different levels in different relationships).

I have added some additional models to the tests and added one more test to allow for includes to support dot notation. I will likely submit a pull request this week for review.

@tisdall
Copy link
Collaborator

tisdall commented Nov 6, 2017

@Joishi - How are you planning on handling the issue in the PR?

@Joishi
Copy link

Joishi commented Nov 6, 2017

I'm not sure what you mean? This is the first time I've worked on a shared project in Github, so I'm not familiar with the expected workflow. I have forked the repo and made some local changes (not pushed yet) .. was going to submit a pull request from my fork. Or maybe I'm not understanding what you are asking correctly. :)

@tisdall
Copy link
Collaborator

tisdall commented Nov 6, 2017

@Joishi - well, are you adding a parameter to limit the recursion, or are you implementing something else?

@Joishi
Copy link

Joishi commented Nov 6, 2017

@tisdall - I'm going to add support for dot notation of includes.. For example:

includes = [
'col1',
'rel1',
'rel1.col1',
]

This will generate a schema that contains what you specify in the includes, and only those values. If you don't specify an includes, it will get everything by default (as it does now).

The above example would generate a schema that includes the relationship rel1, but only the col1 of that relationship. Any relationships THAT object has will not be traversed.

@tisdall
Copy link
Collaborator

tisdall commented Nov 6, 2017

@Joishi - Just be aware that I haven't touched anything related to this library for quite some time, so I can't guarantee that it'll be merged any time soon... I'll likely need some input from some other users before going ahead with a merge.

Joishi pushed a commit to Joishi/ColanderAlchemy that referenced this issue Nov 7, 2017
If test is ran to generate full Route schema, it creates 113 total
SQLAlchemySchemaNode objects as it traverses all relationships. For
small orms this may not be an issue, but the problem gets exponentially
worse as the size of the orm grows. This commit will enable users to
explicitly state which relationships and columns to include in subnodes
by allowing for dot-notation syntax in the includes keyword arg.

Resolves: stefanofontanelli#97
@Joishi
Copy link

Joishi commented Nov 7, 2017

@tisdall - I submitted the pull request but it failed on 2.6 support .. I'm guessing it has to do with the list comprehension I used. I will review what 2.6 support will require.

Oh ... or not. This is in the build log:

Exception: SQLAlchemy requires Python 2.7 or higher.

@tisdall
Copy link
Collaborator

tisdall commented Nov 8, 2017

Yeah, SQLA dropped support for Python 2.6 so we probably should too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants