Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change: Replace Peewee with SQLAlchemy/Alembic #1417

Merged
merged 80 commits into from
Dec 11, 2016

Conversation

washort
Copy link

@washort washort commented Nov 21, 2016

  • Describe current DB schema using SQLAlchemy's ORM
  • Switch test setup to SQLAlchemy
  • Rewrite all queries using SQLAlchemy query API
    • Rewrite queries in CLI commands
    • Rewrite queries in HTTP handlers
  • Fix admin views to work with SQLAlchemy
  • Set up Alembic for schema migrations
    • Add to the project (using Flask-Migrate)
    • Transfer migrations from v1 to Alembic (only the is_draft one)
    • Rename old migrations folder to old_migrations (might just delete it)
    • Add some code in startup script to check if we're at a recent version of the DB, and if not stop execution (with the option to disable this with a configuration override).
    • Add "protection" script in the new migrations folder that prints a message to those who try to run migration "old style".
    • Check if we need to call alembic stamp head when creating the DB at the first time.
  • Replace MeteredModel with SQLAlchemy timing events

Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting 😃

return fn(*args, **kwargs)
except DoesNotExist:
rv = fn(*args, **kwargs)
if rv is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLA will never raise an exception for missing rows?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its .get() and .first() methods return None when there's no entry, so I believe that's correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can get rid of this helper, as it seems that Flask-SQLAlchemy has its own: first_or_404 and get_or_404.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLA will never raise an exception for missing rows?

See also .one() and .one_or_none()

@@ -4,6 +4,7 @@ Flask-Admin==1.1.0
Flask-RESTful==0.3.5
Flask-Login==0.3.2
Flask-OAuthLib==0.9.2
Flask-SQLAlchemy==2.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another project I used alchy which also has Flask-Alchy which is a drop-in replacement for it. The benefit is that we won't need Flask session whenever we're using the DB. The downside is that alchy never seemed to gain mind share unlike Flask-SQLAlchemy.

TL;DR: we can keep Flask-SQLA and see if it adds too much boilerplate to the jobs code. If it isn't, keep it. Otherwise consider alchy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suspicion is that it won't add boilerplate, calling create_app() should be most of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: alchy, it seems the author's mostly moved onto https://github.com/dgilland/sqlservice#history

data_source_id = Column(db.Integer, db.ForeignKey("data_sources.id"), nullable=True)
data_source = db.relationship(DataSource)
latest_query_data_id = Column(db.Integer, db.ForeignKey("query_results.id"), nullable=True)
latest_query_data = db.relationship(QueryResult)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both latest_query_data_id and latest_query_data? (applies to all similar fields) SQLA doesn't have a convenience method to get the object id instead of loading the object itself otherwise?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLA is a bit more explicit about this stuff; the foo_id field is the actual db column, and foo is the attribute for the related ORM object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jeffwidman jeffwidman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I added a few comments, hopefully they're helpful... didn't have time to review it all.

init_db()
db.session.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need the db.session.commit()?

I assume create_db or init_db calls Flask-SQLAlchemy's create_all(), which will implicitly call a commit:
http://stackoverflow.com/questions/34410091/flask-sqlalchemy-how-can-i-call-db-create-all-and-db-drop-all-without-trigg

class TimestampMixin(object):
updated_at = Column(db.DateTime(True), default=db.func.now(),
onupdate=db.func.now(), nullable=False)
created_at = Column(db.DateTime(True), default=db.func.now(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should use server_default?

http://stackoverflow.com/a/33532154/770425

class BelongsToOrgMixin(object):
@classmethod
def get_by_id_and_org(cls, object_id, org):
return cls.get(cls.id == object_id, cls.org == org)
return cls.query.filter(cls.id == object_id, cls.org == org).first()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this query ever return more than one result? If not, should probably use one() or one_or_none()

@classmethod
def get_by_slug(cls, slug):
return cls.get(cls.slug == slug)
return cls.query.filter(cls.slug == slug).first()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this should also be a one() or one_or_none() as I think you want errors if more than one result is returned for a given slug.

name = Column(db.String(100))
permissions = Column(postgresql.ARRAY(db.String(255)),
default=DEFAULT_PERMISSIONS)
created_at = Column(db.DateTime(True), default=db.func.now())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want server_default()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but I don't want to change the schema until after I get things working as-is. (Hi Jeff! It's been a while! Never expected to see someone from SFSH commenting on my code :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm curios what's SFSH :-)

@jeffwidman
Copy link
Contributor

Fixes #1124

@arikfr
Copy link
Member

arikfr commented Nov 28, 2016

@washort I finished with my work on the frontend (for now) and want to give you a hand here. I rebased you branch with the latest master & fixed an issue with the settings/DATABASE_URL. Do you have some unpushed work or can I do a force push with these changes?

@arikfr
Copy link
Member

arikfr commented Nov 28, 2016

I updated the tests code and now we get real failures, but still many tests fail just because the database runs out of connections. I tried to compare how we manage the connection/session in setUp/tearDown with other projects and couldn't spot an obvious issue. :-(

Calling engine#dispose (9f43542) seems to fix this, but is it the right usage? Why I haven't seen this in any other example?

@arikfr
Copy link
Member

arikfr commented Nov 28, 2016

Another SQLA question:

    @classmethod
    def get_by_id_and_org(cls, visualization_id, org):
        return cls.query.join(Query).filter(cls.id == visualization_id, Query.org == org).one()

With peewee I could pass to such method either Organization object or just an Organization id. But in SQLA it seems that for Query.org I can only use Organization object and for Query.org_id I can use only integers.

Any middle ground?

@washort
Copy link
Author

washort commented Nov 28, 2016

No, you have to match up the right value with the right attribute. This is only really an issue in unit tests though, I think, because as far as I can tell, in the rest of the code you should only be using object ids when they're in query parameters; the rest of the time you can just pass around objects.

@arikfr
Copy link
Member

arikfr commented Nov 28, 2016

Working today on this branch was a reminder why I never liked SQLAlchemy in the first place :-\ It's very powerful, but why the simple stuff are so hard and verbose?

@arikfr
Copy link
Member

arikfr commented Nov 29, 2016

@washort
When I created the getredash version of your branch, it somehow got messed up and included lots of unrelated commits. I tried rebasing, which resulted in a lot of wasted time and weird result. Eventually I just cherry-picked our commits on top of the latest master and it worked.

But I had to force push the result over your branch... I hope you didn't have anything uncommitted.

@washort
Copy link
Author

washort commented Dec 1, 2016

Looks like we're experiencing test failures due to webpack not running in CircleCI.

@arikfr
Copy link
Member

arikfr commented Dec 1, 2016

Replace GFKBase usage with bridge tables

I'm not 100% sure about this. How will it work? Any reference implementation/documentation?

@arikfr
Copy link
Member

arikfr commented Dec 1, 2016

All tests pass now (I changed configuration to run Webpack) 💯

But I greped the code for things like update_instance and save and still found usage of it. Mainly in the CLI but also in places where we were too lazy to write the tests ;-)

@washort
Copy link
Author

washort commented Dec 1, 2016

That's why I fixed the tests first -- to see where it'd be profitable to write more tests :-)

Re bridge tables - this is what I was looking at: http://docs.sqlalchemy.org/en/latest/_modules/examples/generic_associations/table_per_related.html

@arikfr
Copy link
Member

arikfr commented Dec 4, 2016

Re bridge tables - this is what I was looking at: http://docs.sqlalchemy.org/en/latest/_modules/examples/generic_associations/table_per_related.html

This will work for AccessPermission, but won't work for ApiKey. For ApiKey we need to do the opposite lookup - given an api key, find the object associated with it.

updated_at = Column(db.DateTime(True), default=db.func.now(),
onupdate=db.func.now(), nullable=False)
created_at = Column(db.DateTime(True), default=db.func.now(),
nullable=False)


class ChangeTrackingMixin(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about implementing this with a an before_update or after_update event?

It seems that SQLA has the tools to determine if something was changed in the event. I tried to experiment with it, but couldn't get the event to trigger. :\

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out why the event didn't trigger and implemented it: e8739b3

If you have no comments, I will push this change to your branch.

The part I'm not happy about is how we deduce the user who changed the object, but I'm not sure it's that bad. Eventually this code is only relevant for the API, which is Flask based... and I added some safeguards to make sure it doesn't cause harm outside of Flask context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue with the way I set who changed the object is that it can't be changed :-\ Not a huge deal, mainly an issue in tests at the moment but feels wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look at this next. What was the issue with the way it's done now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly the fact that you need to call record_changes and the manual "calculation" of what changed.

But apparently doing it in the after_insert/after_update events is wrong (SQLA complains about using Session.add there) and using before_flush also introduces its own challenges.

If I won't find a solution for this today, I will revert back to your version, apply record_changes where needed and revisit this in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Yeah I didn't want to try to get too magic at this point, might be interesting to investigate later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always try to maintain balance between "magic" and "hassle" :-) At first it seemed like a good balance point here, but as this starts to become too complex, I think I will revert to the explicit version you had.

@arikfr
Copy link
Member

arikfr commented Dec 7, 2016

@washort I did some updates to the CLI tests:

  • Switch to using the "main" manager instance (the FlaskGroup) object, to ensure that the Flask app is set (there were tests failing because it was expecting the FLASK_APP env variable).
  • Added missing Session.commit calls in the CLI commands.
  • Switch to using flask.cli.AppGroup to create the subcommands so we don't have to wrap everything with with_app_context decorator.

There are still some (4?) tests failing because the CLI creates its own app_context/db session. I'm not sure how to solve this :-( One option is to create our own with_app_context decorator that will reuse an existing app context if one is available, but it feels wrong to change production code to accommodate tests...

Also moved old migrations to old_migrations folder (before deleting them entirely).
@arikfr
Copy link
Member

arikfr commented Dec 7, 2016

Added Alembic (with Flask-Migrate). Updated the tasks to reflect integration status.


from redash import models
from redash.query_runner import query_runners
from redash.query_runner import get_configuration_schema_for_query_runner_type
from redash.utils.configuration import ConfigurationContainer

manager = click.Group(help="Data sources management commands.")
manager = AppGroup(help="Data sources management commands.")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice, I didn't see that.

@arikfr
Copy link
Member

arikfr commented Dec 8, 2016

Aside from more fixes to broken functionality (if there is anything left) the only thing I want to add in this branch before merging is "Replace MeteredModel with SQLAlchemy timing events". All the rest can be a follow up IMO.

@washort
Copy link
Author

washort commented Dec 9, 2016

Had a look at the query-timing stuff available in SQLAlchemy - the main difficulty with replicating the current behavior is that by the time queries are executed, the only information available is the query text and its parameters; model class, method name etc. aren't accessible.

Any thoughts on how you want to handle this? Obviously we can parse the SQL to retrieve table names and operation type, if that's the path you prefer.

Never mind, I was looking at after_cursor_execute, but after_execute gives access to the query before it's rendered to a string.

@arikfr
Copy link
Member

arikfr commented Dec 9, 2016 via email

@washort
Copy link
Author

washort commented Dec 10, 2016

In that case I think we're done. I still want to change the schema a bit but that can happen in a new branch.

@arikfr arikfr changed the title WIP: Replace Peewee with SQLAlchemy/Alembic Change: Replace Peewee with SQLAlchemy/Alembic Dec 11, 2016
@arikfr arikfr merged commit 19960ee into getredash:master Dec 11, 2016
@arikfr
Copy link
Member

arikfr commented Dec 11, 2016

I've added some more metrics and .... it's merged! :)

@jeffwidman
Copy link
Contributor

jeffwidman commented Dec 12, 2016

Major congrats you two, this was a lot of work!

PS: @washort indeed surprised to see you on here. Hope you and fam are well. This is kinda like the inverse of when a IRL coworker told me he was googling something and found what he needed on SO, then realized I'd written the answer.

@arikfr SFSH is a mailing list of a loosely affiliated group of folks, many (but not all) of whom attend gracepres.com.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants