Improve automatic bin determination for histograms via start, end, and step attributes #285

Merged on Mar 3, 2021 (29 commits)


@micahtyong
Member

micahtyong commented Feb 22, 2021

In this PR

Closes #265 and #217 by modifying Histogram.py to pass explicit bin start, end, and step values to the Altair renderer. The bin width (step) is stored in the Vis object during PandasExecutor.py#execute_binning().

Changes

  • Modifications in vislib/altair/Histogram.py and executor/PandasExecutor.py
  • Additional instance variable, .__bin_size, in Vis.py
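As a rough illustration of the values involved, the sketch below computes equal-width bins and derives start, end, and step from them. It is pure Python with hypothetical names (`equal_width_bins`, `bin_count`) and is not the code in PandasExecutor.py, which is not shown in this conversation.

```python
def equal_width_bins(values, bin_count=10):
    """Return start, end, step, and per-bin counts for equal-width bins.

    Hypothetical helper for illustration only; not Lux's implementation.
    """
    if not values:
        raise ValueError("values must be non-empty")
    start, end = min(values), max(values)
    if end == start:
        # Guard against a zero-width range (all values identical).
        end = start + 1
    step = (end - start) / bin_count
    counts = [0] * bin_count
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - start) / step), bin_count - 1)
        counts[idx] += 1
    return {"start": start, "end": end, "step": step, "counts": counts}
```

Carrying `step` alongside the counts is what lets a renderer size bars from the data instead of guessing a pixel width.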

Example Output

Here is an updated screenshot of histogram binning along the x-axis, using the following example:

df = pd.read_csv("https://github.com/covidvis/covid19-vis/blob/master/data/interventionFootprintByState.csv?raw=True",index_col=0)
df['dateBefore'] = pd.to_datetime(df['dateBefore'], format='%Y-%m-%d')  # %m is month; %M would parse minutes
df

This yields the following output for Distribution histograms:
[Screenshot: Output]

Then we can specify an intent to view Filter histograms.

df.intent = ["severityScore"]
df

[Screenshot, 2021-03-02]

Here is another screenshot to show histogram binning along the y-axis, which remains unchanged.
[Screenshot, 2021-02-22]

@micahtyong
Member Author

Hi @domoritz, thank you for your feedback so far! In my most recent commit (9216dba), I explore the latter approach from an earlier discussion, where markbar is no longer needed. Feel free to comment with your thoughts!

@domoritz
Contributor

You're very welcome.

Your latest changes definitely produce better results but require Vega-Lite to do the data transformation. It should work well for small to medium-sized data.

@domoritz
Contributor

Just to note, the last commit reverts the improvement (for potentially better performance).

@dorisjlee
Member

Thanks @micahtyong! There is some custom bar size determination that we had to do since Lux delegates the binning to the executor side. The new implementation looks good for Filter-type actions, where the xmin/xmax extent is fixed and the filter changes. However, it causes some issues when the extent range is very different, as in the case of the Distribution action. Our earlier implementation worked better in this case.
[Image]

We could look into reimplementing something along the lines of Vega-Lite's bin.js. Thanks for pointing this out @domoritz!
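For context, choosing a "nice" bin step can be sketched as below. This is a simplified illustration in the spirit of that idea, not a port of Vega-Lite's bin.js; the function names and the `max_bins` parameter are assumptions made for the example.

```python
import math

def nice_step(lo, hi, max_bins=10):
    """Pick a step from {1, 2, 5} x 10^k so the raw span needs at most max_bins bins.

    Hypothetical helper; a simplified heuristic, not Vega-Lite's algorithm.
    """
    span = hi - lo
    if span <= 0:
        return 1.0
    raw = span / max_bins
    magnitude = 10 ** math.floor(math.log10(raw))
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if span / step <= max_bins:
            return step
    return 10 * magnitude

def nice_extent(lo, hi, step):
    """Snap the extent outward to multiples of step."""
    start = math.floor(lo / step) * step
    end = math.ceil(hi / step) * step
    return start, end
```

Snapping the extent outward can add a bin beyond `max_bins`, so the limit here is approximate. The benefit is that bin edges land on round values regardless of how the data range shifts between charts.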

@domoritz
Contributor

> We could look into reimplementing something along the lines of Vega-Lite's bin.js. Thanks for pointing this out @domoritz!

As a first step, I would suggest passing the step size explicitly to Vega-Lite so that the bars are sized correctly. Then as a next step, you could improve how Lux determines the bin offset and step size.
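A minimal sketch of that first step, assuming the data has already been binned elsewhere: the function below builds a plain Vega-Lite spec dictionary that marks the x channel as pre-binned and states the step explicitly. The function name, the field names (`bin_start`, `bin_end`, `count`), and the schema version are illustrative assumptions, not what Histogram.py emits.

```python
def binned_histogram_spec(bin_starts, counts, step, title="value"):
    """Build a Vega-Lite spec dict for pre-binned data with an explicit step.

    Hypothetical helper for illustration; field names are assumptions.
    """
    rows = [
        {"bin_start": s, "bin_end": s + step, "count": c}
        for s, c in zip(bin_starts, counts)
    ]
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            # binned=True tells Vega-Lite not to re-bin; step sizes the axis.
            "x": {
                "field": "bin_start",
                "type": "quantitative",
                "bin": {"binned": True, "step": step},
                "title": title,
            },
            # x2 marks the right edge of each bar.
            "x2": {"field": "bin_end"},
            "y": {"field": "count", "type": "quantitative"},
        },
    }
```

With both `x` and `x2` supplied, each bar spans exactly one bin, so no separate pixel-based bar size is needed.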

@micahtyong
Member Author

Example Test Suite

The following are examples for reviewers to try. There are also additional tests in test_vis.py that validate the bin width. To set up, be sure to include:

import lux
import pandas as pd
lux.config.default_display = "lux"
lux.config.plotting_backend = "vegalite"
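For a sense of what a bin-width check can look like, here is a small standalone sketch that verifies bin starts are evenly spaced by a given step. The helper name and tolerance are assumptions; the actual assertions in test_vis.py are not shown in this conversation.

```python
import math

def has_uniform_step(bin_starts, step, rel_tol=1e-9):
    """Return True if consecutive bin starts differ by step (within tolerance).

    Hypothetical helper for illustration only.
    """
    return all(
        math.isclose(b - a, step, rel_tol=rel_tol)
        for a, b in zip(bin_starts, bin_starts[1:])
    )
```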

Olympic Dataset

df = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/olympic.csv?raw=True")
df

[Image: olympic_dist]

Now, let's add a filter.

df.intent=["Height"]
df

[Image: olympic_filter]

Car Dataset

df = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/car.csv?raw=True")
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
df

[Image: car_dist]

df.intent=["Acceleration"]
df

[Image: car_filter]

COVID Dataset

df = pd.read_csv("https://github.com/covidvis/covid19-vis/blob/master/data/interventionFootprintByState.csv?raw=True",index_col=0)
df['dateBefore'] = pd.to_datetime(df['dateBefore'], format='%Y-%m-%d')  # %m is month; %M would parse minutes
df

[Image: covid_dist]

df.intent = ["severityScore"]
df

[Image: covid_filter]

@domoritz
Contributor

domoritz left a comment

The results look good to me. I left a few small comments for improvements.

@dorisjlee
Member

Thanks @micahtyong! The results look much better now for histograms in both Distribution and Filter.

@dorisjlee dorisjlee merged commit 952b642 into lux-org:master Mar 3, 2021
@dorisjlee
Member

Thanks @micahtyong, the new histograms look great! I think the cutoff that you included for the bin width really helped with the filter examples that have few data points.
Special thanks @domoritz for all the helpful feedback!

@domoritz
Contributor

domoritz commented Mar 3, 2021

Thank you @micahtyong for the improvements to binning.

@micahtyong micahtyong deleted the binned-charts branch March 3, 2021 10:04
@micahtyong
Member Author

Thank you @domoritz for all the helpful feedback, both technical and from your HCI expertise! I really enjoyed the conversations this PR initiated.

Thank you @dorisjlee for the feedback, review, and pointer to the executor! I learned a lot from this enhancement.

Development

Successfully merging this pull request may close these issues.

Binned charts should provide start, end, and step
3 participants