refactor: Minor optimization for updating DatasetDAO columns and metrics #20473
Conversation
-            properties["columns"] = cls.update_columns(
-                model, properties.get("columns", []), commit=commit
-            )
+            properties["columns"] = cls.update_columns(model, properties["columns"])
get(...) is not needed here per the check on line #165.
            for column in property_columns:
                column_id = column.get("id")

                if column_id:
                    column_obj = db.session.query(TableColumn).get(column_id)
Leverage the columns defined by the model rather than re-querying for each column individually.
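The suggestion can be sketched roughly as follows. This is a minimal standalone sketch, not the actual Superset code; TableColumn here is a hypothetical stand-in dataclass for the SQLAlchemy model. The idea is to build an id-to-column mapping once from the columns the model has already loaded, then resolve columns in memory instead of issuing a db.session.query(TableColumn).get(column_id) round trip per column.

```python
# Sketch: build a lookup from the model's already-loaded columns.
# TableColumn is a stand-in for the real SQLAlchemy model.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TableColumn:
    id: int
    column_name: str


def build_column_lookup(model_columns: List[TableColumn]) -> Dict[int, TableColumn]:
    # One pass over the relationship the model has already loaded; no extra queries.
    return {column.id: column for column in model_columns}


columns = [TableColumn(1, "ds"), TableColumn(2, "gender")]
column_by_id = build_column_lookup(columns)
assert column_by_id[2].column_name == "gender"
assert column_by_id.get(99) is None  # same "may be missing" contract as .get()
```

The dict costs one pass up front, after which every lookup is O(1) against objects the session already tracks.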
superset/datasets/dao.py
-                    column_obj = DatasetDAO.update_column(column_obj, column, commit=commit)
+                    column_obj = DatasetDAO.update_column(
+                        column_by_id.get(column_id),
+                        commit=False,
No need to commit when updating the column. It can all be committed when the dataset is updated.
superset/datasets/dao.py
                 else:
-                    column_obj = DatasetDAO.create_column(column, commit=commit)
+                    column_obj = DatasetDAO.create_column(column, commit=False)
No need to commit when creating the column. Also, the TableColumn object has no table_id at this time and thus the record needs to be updated when the entire dataset is committed. Additionally, autoflush=True by default, so the new column should have a primary key associated with it.
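As a rough illustration of why deferring to a single final commit is safe, here is a stdlib sqlite3 sketch. The PR itself goes through SQLAlchemy, where the session's autoflush plays the role of the explicit statements below, and the table and column names here are invented for the example. Within one open transaction a newly inserted row already has a primary key, so later statements can wire up table_id before anything is committed.

```python
# Sketch with stdlib sqlite3 (hypothetical schema, not Superset's):
# inside one transaction, the new row's primary key is available
# immediately, and a single commit covers the whole dataset update.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_column (id INTEGER PRIMARY KEY, name TEXT, table_id INTEGER)"
)
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT)")

cur = conn.execute("INSERT INTO table_column (name) VALUES (?)", ("new_col",))
new_id = cur.lastrowid  # primary key is known before any commit

conn.execute("INSERT INTO dataset (name) VALUES (?)", ("my_dataset",))
dataset_id = conn.execute("SELECT id FROM dataset").fetchone()[0]

# Backfill the foreign key the column lacked at creation time.
conn.execute("UPDATE table_column SET table_id = ? WHERE id = ?", (dataset_id, new_id))
conn.commit()  # one commit for the entire update
```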
superset/datasets/dao.py
-                    column_obj = db.session.query(TableColumn).get(column_id)
-                    column_obj = DatasetDAO.update_column(column_obj, column, commit=commit)
+                    column_obj = DatasetDAO.update_column(
+                        column_by_id.get(column_id),
Same behavior as before, i.e., the input column_obj is None if for some reason it doesn't exist, which shouldn't happen.
Codecov Report
@@ Coverage Diff @@
## master #20473 +/- ##
==========================================
- Coverage 66.75% 66.75% -0.01%
==========================================
Files 1740 1740
Lines 65149 65143 -6
Branches 6898 6898
==========================================
- Hits 43492 43486 -6
Misses 19908 19908
Partials 1749 1749
            # Note for new columns the primary key is undefined sans a commit/flush.
            columns.append(DatasetDAO.create_column(properties, commit=False))

        for id_ in {obj.id for obj in model.columns} - {obj.id for obj in columns}:
Optimized looping.
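The set-difference bookkeeping can be sketched in isolation: compute the ids that exist on the model but are absent from the updated payload, and those are exactly the rows to delete, in one pass rather than scanning the old list for every updated column.

```python
# Sketch of the optimized deletion loop: one set difference instead of
# a nested scan. The ids and names are illustrative.
existing_ids = {1, 2, 3, 4}   # ids currently on the model
updated_ids = {2, 4, 5}       # ids surviving the update (5 is newly created)

ids_to_delete = existing_ids - updated_ids
assert ids_to_delete == {1, 3}

# Paired with the lookup dict, deletion is O(1) per orphaned id.
column_by_id = {i: f"col_{i}" for i in existing_ids}
deleted = [column_by_id[i] for i in sorted(ids_to_delete)]
assert deleted == ["col_1", "col_3"]
```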
                DatasetDAO.update_column(
                    column_by_id[properties["id"]],
                    properties,
                    commit=False,
Commit once for the dataset.
        model: TableColumn,
        properties: Dict[str, Any],
        commit: bool = True,
    ) -> TableColumn:
The return is not Optional[...] per the superset/dao/base.py methods.
@@ -287,24 +296,27 @@ def find_dataset_metric(
         return db.session.query(SqlMetric).get(metric_id)

     @classmethod
     def delete_metric(
         cls, model: SqlMetric, commit: bool = True
-    ) -> Optional[TableColumn]:
Should have been SqlMetric and not TableColumn.
Why does the command still return the metric when it is deleted?
yeah, prob not needed
            metrics.append(DatasetDAO.create_metric(properties, commit=False))

        for id_ in {obj.id for obj in model.metrics} - {obj.id for obj in metrics}:
            DatasetDAO.delete_column(metric_by_id[id_], commit=False)
Wouldn't SQLA automatically delete orphaned metrics once they are not associated with a dataset?
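Whether SQLAlchemy does this automatically depends on the relationship being configured with delete-orphan cascade. A self-contained sketch with hypothetical models (not Superset's actual mappings) shows the mechanism the question is alluding to: with cascade="all, delete-orphan" on the parent side, removing a metric from the collection is enough, and SQLAlchemy issues the DELETE on flush with no explicit delete call.

```python
# Hypothetical models demonstrating delete-orphan cascade; the names
# mirror Superset's but the mappings here are invented for the sketch.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class SqlaTable(Base):
    __tablename__ = "tables"
    id = Column(Integer, primary_key=True)
    # delete-orphan: a metric detached from its table is deleted on flush.
    metrics = relationship("SqlMetric", cascade="all, delete-orphan")


class SqlMetric(Base):
    __tablename__ = "sql_metrics"
    id = Column(Integer, primary_key=True)
    metric_name = Column(String)
    table_id = Column(Integer, ForeignKey("tables.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    table = SqlaTable(
        metrics=[SqlMetric(metric_name="count"), SqlMetric(metric_name="sum")]
    )
    session.add(table)
    session.commit()

    # Drop "sum" from the collection only; no explicit delete call.
    table.metrics = [m for m in table.metrics if m.metric_name != "sum"]
    session.commit()
```

Without the cascade option, the detached metric would merely have its table_id nulled out rather than being deleted, which is why the explicit delete loop in the PR is the safe choice absent that configuration.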
I think it's a good idea to always skip auto-commit.
…ics (apache#20473) Co-authored-by: John Bodley <john.bodley@airbnb.com>
SUMMARY
This PR makes the smallest possible updates to the DatasetDAO update logic to improve performance, as the request was timing out when we (Airbnb) were trying to update a very, very large dataset. If this is not sufficient, the next, more aggressive step would be to add bulk updating.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
CI.
ADDITIONAL INFORMATION