Scalability: incorporate early pruning optimizations #368

Merged: 7 commits, Apr 28, 2021

dorisjlee
Member

Overview

This PR incorporates the early pruning optimizations described in our recent paper to speed up cases where the visualization search space is large (e.g., the VisList contains more than the top K visualizations and the dataframe has a large number of rows).

Changes

  • Add execute_approx_sample, which computes an approximate sample of the data for early pruning
  • Apply early pruning based on empirically tested width and length criteria
  • Add tests and a warning message for when the early pruning conditions are met
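
To make the idea concrete, here is a minimal, standard-library-only sketch of early pruning. It is not the code in this PR: the thresholds, the `margin` parameter, the toy variance score, and every function name are hypothetical, and "width" and "length" are assumed to mean the number of candidate visualizations and the number of rows. Candidates are ranked on a small sample, only the best few are kept, and exact scores are computed on the full data for those survivors.

```python
import random
import warnings

# Hypothetical thresholds; the empirically tested values used in the PR are not stated here.
MIN_CANDIDATES = 20  # assumed "width" criterion: number of candidate visualizations
MIN_ROWS = 10_000    # assumed "length" criterion: number of rows


def variance_score(rows, column):
    """Toy interestingness score (an assumption): population variance of one column."""
    values = [row[column] for row in rows]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def top_k_with_early_pruning(rows, columns, k, margin=2, sample_size=1_000, seed=0):
    """Return the k best columns, pruning on a sample when the search space is large."""
    candidates = list(columns)
    if len(candidates) > MIN_CANDIDATES and len(rows) > MIN_ROWS:
        warnings.warn(
            f"Large search space ({len(candidates)} candidates, {len(rows)} rows): "
            "ranking on an approximate sample before computing exact scores."
        )
        sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
        # Cheap pass: rank every candidate on the sample only.
        approx = sorted(candidates, key=lambda c: variance_score(sample, c), reverse=True)
        # Keep slightly more than k so near-ties on the sample are not lost.
        candidates = approx[: k + margin]
    # Expensive pass: exact scores on the full data, only for the survivors.
    exact = sorted(candidates, key=lambda c: variance_score(rows, c), reverse=True)
    return exact[:k]
```

The exact pass touches at most `k + margin` candidates when pruning applies, which is why the cost of the approximate pass pays off only when both the candidate count and the row count are large.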

Others:

  • Increase the sampling start and cap for overall sampling
  • Add a config flag to turn lazy maintenance on/off
  • Add a cached method for df.unique()
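
As a rough illustration of how these three pieces could fit together, the sketch below uses only the standard library. The class names, default values, and the 0.75 sampling fraction are assumptions made for the example, not the values or API introduced by this PR. It also assumes that "lazy maintenance" means cached results are reused until the data changes.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config object; all defaults are illustrative."""
    sampling_start: int = 100_000  # below this row count, no sampling
    sampling_cap: int = 1_000_000  # never work with more rows than this
    lazy_maintain: bool = True     # reuse cached results until the data changes


def sample_size(n_rows, config, frac=0.75):
    """Decide how many rows to keep for overall sampling (assumed policy)."""
    if n_rows > config.sampling_cap:
        return config.sampling_cap
    if n_rows > config.sampling_start:
        return int(n_rows * frac)
    return n_rows


class CachedColumn:
    """Column wrapper whose unique() result is cached and invalidated on change."""

    def __init__(self, values, config):
        self._values = list(values)
        self._config = config
        self._unique = None
        self.unique_computations = 0  # counter kept only to make the caching visible

    def unique(self):
        # With lazy maintenance off, recompute on every call.
        if self._unique is None or not self._config.lazy_maintain:
            self._unique = list(dict.fromkeys(self._values))  # preserves first-seen order
            self.unique_computations += 1
        return self._unique

    def append(self, value):
        self._values.append(value)
        self._unique = None  # the data changed, so the cache is stale
```

Turning `lazy_maintain` off in this sketch trades speed for simplicity: every call recomputes, which can be handy when debugging stale results.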

Example Output

Here are some results from preliminary experiments on the 1M Airbnb dataset. We duplicate a single visualization N times and measure the performance. The reported time is the sum of the time it takes to go through the search space and the time it takes to recompute the top K visualizations when the VisList is displayed. The significant speedup in the former outweighs the extra time spent on the recompute, which covers around 17 visualizations, typically no more than a few more than 15 (where k=15).

Zooming out, the speedup can be significant when searching through more than 100 visualizations.
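
For anyone who wants to reproduce this kind of measurement, here is a small timing harness sketch. It is not the script used for the experiments above. The `search` and `recompute` callables, the candidate naming, and the returned keys are all hypothetical; the harness only shows how the two phases can be timed separately and summed.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(search, recompute, n_copies, k=15):
    """Time the search phase and the top-K recompute phase for N duplicated candidates."""
    # Stand-in for "duplicate a single visualization N times".
    candidates = [f"vis_{i}" for i in range(n_copies)]
    shortlisted, search_s = timed(search, candidates, k)
    _top_k, recompute_s = timed(recompute, shortlisted, k)
    return {
        "n": n_copies,
        "search_s": search_s,
        "recompute_s": recompute_s,
        "total_s": search_s + recompute_s,
        "recomputed": len(shortlisted),
    }
```

Passing in the pruned and unpruned search functions in turn, for increasing `n_copies`, gives the two curves to compare.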

@dorisjlee dorisjlee requested a review from thyneb19 April 27, 2021 20:35
codecov bot commented Apr 28, 2021

Codecov Report

Merging #368 (1cf6439) into master (1dbbcb9) will increase coverage by 0.20%.
The diff coverage is 97.93%.

@@            Coverage Diff             @@
##           master     #368      +/-   ##
==========================================
+ Coverage   84.46%   84.67%   +0.20%     
==========================================
  Files          51       51              
  Lines        3902     3948      +46     
==========================================
+ Hits         3296     3343      +47     
+ Misses        606      605       -1     

Impacted Files | Coverage Δ
lux/executor/Executor.py | 80.85% <88.88%> (+1.36%) ⬆️
lux/executor/PandasExecutor.py | 95.93% <95.00%> (-0.18%) ⬇️
lux/_config/config.py | 86.86% <100.00%> (+0.60%) ⬆️
lux/core/frame.py | 82.06% <100.00%> (+0.32%) ⬆️
lux/core/series.py | 55.55% <100.00%> (+1.70%) ⬆️
lux/executor/SQLExecutor.py | 84.45% <100.00%> (ø)
lux/interestingness/interestingness.py | 90.90% <100.00%> (+2.95%) ⬆️
lux/processor/Compiler.py | 98.01% <100.00%> (+0.04%) ⬆️
lux/vis/Vis.py | 75.73% <100.00%> (+0.14%) ⬆️
lux/vis/VisList.py | 52.84% <100.00%> (+1.51%) ⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1dbbcb9...1cf6439.

Contributor

@thyneb19 thyneb19 left a comment

Hey Doris, the changes look good overall. Thanks also for the quick fix for the SQLExecutor. I think more work is needed to make the pruning actually compatible with the SQLExecutor case, but maybe we can merge in the PandasExecutor implementation first.

@dorisjlee
Member Author

Thanks @thyneb19! There definitely needs to be more work to support pruning in SQLExecutor in the future, which would require adding an implementation for execute_approx_sample. Will merge this in for now!
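
One possible shape for that follow-up, sketched with plain Python classes rather than the real Lux executors: a base class provides a default execute_approx_sample that raises, an in-memory executor overrides it, and callers check for support before attempting early pruning. Only the method name comes from this PR; the signature, the class names, and the `supports_early_pruning` helper are assumptions.

```python
import random


class BaseExecutor:
    """Hypothetical base class; the default signals that approximate sampling is unsupported."""

    def execute_approx_sample(self, rows, size, seed=0):
        raise NotImplementedError(
            f"{type(self).__name__} does not implement execute_approx_sample"
        )


class InMemoryExecutor(BaseExecutor):
    """Stand-in for an executor that holds the data locally."""

    def execute_approx_sample(self, rows, size, seed=0):
        if len(rows) <= size:
            return list(rows)
        return random.Random(seed).sample(rows, size)


class SqlLikeExecutor(BaseExecutor):
    """Stand-in for an executor with no approximate-sample support yet."""


def supports_early_pruning(executor):
    """True if the executor overrides the base execute_approx_sample."""
    return type(executor).execute_approx_sample is not BaseExecutor.execute_approx_sample
```

With a check like this, the pruning path can be skipped cleanly for executors that have not implemented the method yet, instead of failing at display time.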

@dorisjlee dorisjlee merged commit a0cb921 into lux-org:master Apr 28, 2021
dorisjlee added a commit that referenced this pull request Apr 30, 2021
…ales (#262)

* Add support to improve temporal action to display different timescales

* Resolve PR comments

* Add support to improve temporal action to display different timescales

* Resolve PR comments

* Reformat files using black

* "All-column" vis when only few columns in dataframe #199 (#336)

Co-authored-by: Caitlyn Chen <caitlynachen@berkeley.edu>
Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>

* documentation and cleaning
* added notebook gallery
* update README
* removed scatterplot message in SQLExecutor
* fixed typo in SQL documentation

* update README and bump version

* bump version

* clear propagated vis data intent after PandasExecutor completes execute (#297)

* fix black to stable version

* Scalability: incorporate early pruning optimizations (#368)

* changes from perf branch to config
* added flag for turning on/off lazy maintain optimization

* merged in approx early pruning code

* increase overall sampling start and cap

* Adjust width and length criteria for early pruning vislist based on experiment results; Add warning message and test for early pruning

* black version update

* version lock on black

* * fixed sql tests (added approx to execute constructor)
* fixed sampling config test
* improved Executor documentation

* timescale feature
* adding weekday
* adding docs
* bugfix for y axis line chart export
* fixing temporal axis by adding timescale variable in Clause

Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>
Co-authored-by: Caitlyn Chen <caitlynachen@gmail.com>
Co-authored-by: Caitlyn Chen <caitlynachen@berkeley.edu>