perf: Implement model specific lookups by id to improve performance #20974

bkyryliuk · 2022-08-04T15:58:43Z

SUMMARY

Currently the access checks are happening 2ce in the explore flow, once during the object lookup and later doing an explicit check. Existing implementation was spending > 500 ms in our deployment on doing object by id lookup and that contributed to ~1 - 1.5 second slowdown of the explore endpoint.

The proposed solution skips the filter on the chart / datasource lookup and let's explore to handle the access checks and return 403 if it was denied. Prior behavior was to return 404.

Out of scope

superset/superset/views/utils.py

Line 104 in f0ca158

def get_permissions(

function takes > 1 second in our deployment, it is part of the explore flow while bootsrapping user data

superset/superset/views/utils.py

Line 72 in f0ca158

    
           def bootstrap_user_data(user: User, include_perms: bool = False) -> Dict[str, Any]:

I will have separate PR to optimize the permissions lookup.

Before

After

TESTING INSTRUCTIONS

[x] Deployed and tested on dropbox staging

ADDITIONAL INFORMATION

More granular breakdown

@classmethod
def find_by_id(
    cls, model_id: Union[str, int], session: Session = None
) -> Optional[Model]:
    """
    Find a model by id, if defined applies `base_filter`
    """
    with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.session", group="DAO"):
        session = session or db.session
    with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.query_contruction", group="DAO"):
        query = session.query(cls.model_cls)
        if cls.base_filter:
            data_model = SQLAInterface(cls.model_cls, session)
            query = cls.base_filter(  # pylint: disable=not-callable
                cls.id_column_name, data_model
            ).apply(query, None)
        id_filter = {cls.id_column_name: model_id}
    try:
        with newrelic.agent.FunctionTrace("superset.dao.base.find_by_id.query_execution", group="DAO"):
            return query.filter_by(**id_filter).one_or_none()
    except StatementError:
        # can happen if int is passed instead of a string or similar
        return None

bkyryliuk · 2022-08-04T16:02:54Z

superset/common/query_context_processor.py

@@ -491,6 +491,8 @@ def get_viz_annotation_data(
        chart = ChartDAO.find_by_id(annotation_layer["value"])
        if not chart:
            raise QueryObjectValidationError(_("The chart does not exist"))
+        if not chart.datasource:


addresses mypy issue in

viz_obj = get_viz( datasource_type=chart.datasource.type, datasource_id=chart.datasource.id, form_data=form_data, force=force, )

codecov · 2022-08-05T17:07:06Z

Codecov Report

Merging #20974 (ab59aa4) into master (eb5369f) will decrease coverage by 0.03%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master   #20974      +/-   ##
==========================================
- Coverage   66.38%   66.34%   -0.04%     
==========================================
  Files        1767     1767              
  Lines       67232    67317      +85     
  Branches     7138     7138              
==========================================
+ Hits        44633    44664      +31     
- Misses      20773    20827      +54     
  Partials     1826     1826

Flag	Coverage Δ
hive	`53.15% <16.66%> (-0.05%)`	⬇️
mysql	`80.92% <66.66%> (-0.11%)`	⬇️
postgres	`80.98% <66.66%> (-0.12%)`	⬇️
presto	`53.05% <16.66%> (-0.05%)`	⬇️
python	`81.46% <66.66%> (-0.12%)`	⬇️
sqlite	`79.58% <66.66%> (-0.11%)`	⬇️
unit	`50.49% <66.66%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
superset/common/query_context_processor.py	`88.49% <0.00%> (-1.51%)`	⬇️
superset/dao/base.py	`95.40% <100.00%> (ø)`
superset/explore/utils.py	`100.00% <100.00%> (ø)`
superset/explore/form_data/commands/utils.py	`92.30% <0.00%> (-7.70%)`	⬇️
superset/explore/form_data/api.py	`89.24% <0.00%> (-6.46%)`	⬇️
superset/views/datasource/utils.py	`93.33% <0.00%> (-2.02%)`	⬇️
superset/models/sql_lab.py	`77.01% <0.00%> (-1.74%)`	⬇️
superset/explore/permalink/api.py	`91.37% <0.00%> (-1.73%)`	⬇️
superset/models/helpers.py	`39.40% <0.00%> (-1.38%)`	⬇️
... and 8 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

villebro

Thanks for looking into this perf issue!

villebro · 2022-08-08T07:15:25Z

superset/dao/base.py

+        cls,
+        model_id: Union[str, int],
+        session: Session = None,
+        no_filter: bool = False,


nit: no_filter feels slightly ambiguous. Would skip_filter or ignore_filter be more explicit?

Agreed, let's go with skip_filter

nit nit: can we maybe call this skip_base_filter to make it even more explicit? I was confused for a while why get_by_id would need still need a filter.

tests/unit_tests/charts/dao/dao_tests.py

villebro · 2022-08-08T07:24:30Z

tests/unit_tests/datasets/dao/dao_tests.py

+    from superset.connectors.sqla.models import SqlaTable
+    from superset.models.core import Database
+
+    engine = session.get_bind()
+    SqlaTable.metadata.create_all(engine)  # pylint: disable=no-member
+
+    db = Database(database_name="my_database", sqlalchemy_uri="sqlite://")
+    sqla_table = SqlaTable(
+        table_name="my_sqla_table",
+        columns=[],
+        metrics=[],
+        database=db,
+    )
+
+    session.add(db)
+    session.add(sqla_table)
+    session.flush()
+    yield session


Same note about missing cleanup as previously.

ktmud · 2022-08-08T17:23:02Z

I wonder if the slowness in your check permission query is because postgres did not automatically create indexes on the sqlatable_user table (ref).

superset_test=> \d sqlatable_user;
                             Table "public.sqlatable_user"
  Column  |  Type   | Collation | Nullable |                  Default                   
----------+---------+-----------+----------+--------------------------------------------
 id       | integer |           | not null | nextval('sqlatable_user_id_seq'::regclass)
 user_id  | integer |           |          | 
 table_id | integer |           |          | 
Indexes:
    "sqlatable_user_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sqlatable_user_table_id_fkey" FOREIGN KEY (table_id) REFERENCES tables(id)
    "sqlatable_user_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

villebro

LGTM

bkyryliuk · 2022-08-08T20:31:05Z

I wonder if the slowness in your check permission query is because postgres did not automatically create indexes on the sqlatable_user table (ref).

superset_test=> \d sqlatable_user;
                             Table "public.sqlatable_user"
  Column  |  Type   | Collation | Nullable |                  Default                   
----------+---------+-----------+----------+--------------------------------------------
 id       | integer |           | not null | nextval('sqlatable_user_id_seq'::regclass)
 user_id  | integer |           |          | 
 table_id | integer |           |          | 
Indexes:
    "sqlatable_user_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sqlatable_user_table_id_fkey" FOREIGN KEY (table_id) REFERENCES tables(id)
    "sqlatable_user_user_id_fkey" FOREIGN KEY (user_id) REFERENCES ab_user(id)

sqlatable_user

Yeah - we have both indexes:

new relic states that a lot of time is spent in python code, not sql execution. I have 2 more places I want to optimized for better perf: https://apache-superset.slack.com/archives/G013HAE6Y0K/p1659722615103529?thread_ts=1659395755.971919&cid=G013HAE6Y0K and will have PRs coming up soon

villebro

Nice polishing in the last commit 👍

…20974) * Implement model specific lookups by id to improve performance * Address comments e.g. better variable names and test cleanup * commit after cleanup * even better name and test cleanup via rollback Co-authored-by: Bogdan Kyryliuk <bogdankyryliuk@dropbox.com>

pull-request-size bot added the size/M label Aug 4, 2022

bkyryliuk commented Aug 4, 2022

View reviewed changes

bkyryliuk added the v1.5 label Aug 4, 2022

bkyryliuk changed the title ~~[WIP] Implement model specific lookups by id to improve performance~~ perf: Implement model specific lookups by id to improve performance [WIP] Aug 4, 2022

bkyryliuk force-pushed the bogdan/dao_query branch 2 times, most recently from ce579da to b5da0be Compare August 4, 2022 17:57

pull-request-size bot added size/L and removed size/M labels Aug 4, 2022

bkyryliuk force-pushed the bogdan/dao_query branch 9 times, most recently from a8b2d63 to 28db76e Compare August 5, 2022 16:40

apache deleted a comment from codecov bot Aug 5, 2022

bkyryliuk force-pushed the bogdan/dao_query branch 2 times, most recently from 9f56905 to eeb4fa5 Compare August 5, 2022 16:58

Implement model specific lookups by id to improve performance

a695fc6

bkyryliuk force-pushed the bogdan/dao_query branch from eeb4fa5 to a695fc6 Compare August 5, 2022 17:23

bkyryliuk requested a review from a team August 5, 2022 19:23

bkyryliuk changed the title ~~perf: Implement model specific lookups by id to improve performance [WIP]~~ perf: Implement model specific lookups by id to improve performance Aug 5, 2022

villebro reviewed Aug 8, 2022

View reviewed changes

Address comments e.g. better variable names and test cleanup

0ffe859

rusackas requested review from michael-s-molina, stephenLYZ and kgabryje August 8, 2022 17:10

commit after cleanup

fc63ff2

villebro approved these changes Aug 8, 2022

View reviewed changes

even better name and test cleanup via rollback

ab59aa4

bkyryliuk requested a review from villebro August 8, 2022 23:40

villebro approved these changes Aug 9, 2022

View reviewed changes

bkyryliuk merged commit 17b5803 into apache:master Aug 9, 2022

bkyryliuk deleted the bogdan/dao_query branch August 9, 2022 18:15

michael-s-molina removed the v1.5 label Jan 3, 2023

mistercrunch added 🍒 1.5.2 🍒 1.5.3 🚢 2.1.3 🚢 2.1.0 and removed 🍒 1.5.3 🍒 1.5.2 🚢 2.1.3 🚢 2.1.0 labels Feb 18, 2024

mistercrunch added 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.1.0 and removed 🚢 2.1.3 labels Mar 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Implement model specific lookups by id to improve performance #20974

perf: Implement model specific lookups by id to improve performance #20974

bkyryliuk commented Aug 4, 2022 •

edited

Loading

bkyryliuk Aug 4, 2022 •

edited

Loading

codecov bot commented Aug 5, 2022 •

edited

Loading

villebro left a comment

villebro Aug 8, 2022

bkyryliuk Aug 8, 2022

ktmud Aug 8, 2022

villebro Aug 8, 2022

bkyryliuk Aug 8, 2022

ktmud commented Aug 8, 2022

villebro left a comment

bkyryliuk commented Aug 8, 2022 •

edited

Loading

villebro left a comment

perf: Implement model specific lookups by id to improve performance #20974

perf: Implement model specific lookups by id to improve performance #20974

Conversation

bkyryliuk commented Aug 4, 2022 • edited Loading

SUMMARY

Out of scope

Before

After

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

bkyryliuk Aug 4, 2022 • edited Loading

Choose a reason for hiding this comment

codecov bot commented Aug 5, 2022 • edited Loading

Codecov Report

villebro left a comment

Choose a reason for hiding this comment

villebro Aug 8, 2022

Choose a reason for hiding this comment

bkyryliuk Aug 8, 2022

Choose a reason for hiding this comment

ktmud Aug 8, 2022

Choose a reason for hiding this comment

villebro Aug 8, 2022

Choose a reason for hiding this comment

bkyryliuk Aug 8, 2022

Choose a reason for hiding this comment

ktmud commented Aug 8, 2022

villebro left a comment

Choose a reason for hiding this comment

bkyryliuk commented Aug 8, 2022 • edited Loading

villebro left a comment

Choose a reason for hiding this comment

bkyryliuk commented Aug 4, 2022 •

edited

Loading

bkyryliuk Aug 4, 2022 •

edited

Loading

codecov bot commented Aug 5, 2022 •

edited

Loading

bkyryliuk commented Aug 8, 2022 •

edited

Loading