-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support to improve temporal action to display different timescales #262
Conversation
lux/action/temporal.py
Outdated
return recommendation | ||
|
||
|
||
def create_vis(ldf, col): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename create_vis
to be specific to temporal, since it is easily mistaken as the other create_vis
inside the renderer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch! I'll update the function name.
lux/action/temporal.py
Outdated
""" | ||
visuals = [] | ||
converted_string = ldf[col].astype(str) | ||
if converted_string.str.contains("-").any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be done without the regex string by simply extracting the timescale values from the datetime attribute via Pandas, such as .dt.year
, .dt.month
, etc. We should avoid having to do astype
since this can be an expensive operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll update the action to directly extract the different timescales from the datetime attribute.
] | ||
} | ||
) | ||
test_data = [airbnb_df, flights_df, date_df, pytest.car_df, pytest.olympic] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the flights dataset, the results look a bit weird since the different timescales are created even for attributes that only encompass a single timescale.
df = pd.read_csv("../../lux-datasets/data/flights.csv")
df['year'] = pd.to_datetime(df['year'], format='%Y')
df['month'] = pd.to_datetime(df['month'], format='%M')
df['day'] = pd.to_datetime(df['day'], format='%d')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a closer look at these results. The fixes/changes to utilize the datetime attribute and discarding timescales that are not applicable should end up resolving this issue. I'll make the changes and ensure that these visualizations are proper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in your updated results that the range for the y-axis doesn't fit the data well (e.g. the minimum of the data points are far higher than the minimum of the y-range). Do you know why that might be?
lux/action/temporal.py
Outdated
parsed_date = converted_string.str.extract(r"([0-9]{4})?-?([0-9]{2})?-?([0-9]{1,2})?") | ||
valid_year, valid_month, valid_day = parsed_date.apply(lambda x: not x.isnull().all()).values | ||
|
||
date_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll update the logic in the temporal action to use compute_date_granularity()
in order to discard time scales that aren't applicable.
lux/action/temporal.py
Outdated
day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1) | ||
visuals.append(day_vis) | ||
else: | ||
single_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this scoring behave when there are more than two temporal attributes? Do we show the unseparated overall first, then move onto all (year)
, (month )
, etc? Perhaps add an example of this in the use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer to this can be included into the long description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scoring will result in the overall visualizations first, then all the (year)
visualizations, all the (month)
visualizations, etc. Is there an alternate ordering that should be considered? I'll also add an example for this situation to the testing and use cases and update the long description with the scoring behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the current ordering can probably be improved. We have a suite of functions in interestingness.py
that give you a sense of how visualizations are ranked. Perhaps weighted_correlation
is a good place to start? This doesn't have to be in this PR, but just something to think about.
Thanks @mantejpanesar! This looks really great as a first issue, I left some comments in the code review for you to address. |
lux/action/default.py
Outdated
@@ -19,8 +19,7 @@ def register_default_actions(): | |||
lux.config.register_action("correlation", correlation, no_vis) | |||
lux.config.register_action("distribution", univariate, no_vis, "quantitative") | |||
lux.config.register_action("occurrence", univariate, no_vis, "nominal") | |||
lux.config.register_action("temporal", univariate, no_vis, "temporal") | |||
lux.config.register_action("geographical", geomap, no_vis, "geographical") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@micahtyong I'm not sure if you are depending on this, but if so, let's not remove this import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am, thanks for the catch! @jerrysong1324
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch!
pass | ||
recommendation = { | ||
"action": "Temporal", | ||
"description": "Show trends over <p class='highlight-descriptor'>time-related</p> attributes.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to try adding in a long description field as well? Our long descriptions include the description and also how the graphs are sorted and other additional info for the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying what the long descriptions are used for. I'll add how the graphs are sorted in the long description.
lux/action/temporal.py
Outdated
day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1) | ||
visuals.append(day_vis) | ||
else: | ||
single_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer to this can be included into the long description.
Hi @dorisjlee, As requested, I've updated the temporal action to parse the temporal fields using the datetime attribute via Pandas and also handled discarding timescales that are not applicable. I have provided an example output for the stocks dataset, the flights dataset, and an example output using fabricated data (check-in and check-out dates) to highlight the scoring behavior for multiple temporal columns with multiple timescales. Let me know if there's any other changes I should make! Stocks Dataset VisualizationThis demonstrates that timescales which are not applicable are no longer presented in the visualizations. Flights Dataset VisualizationThis demonstrates the removal of erroneous visualizations for attributes that only encompass a single timescale. Scoring BehaviorThis demonstrates the scoring behavior for the temporal action. When visualizations are generated via parsing a temporal column with the datetime attribute, all the overall visualizations are shown first, followed by all the (year) visualizations, followed by all the (month) visualizations, and finally all the (day) visualizations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mantejpanesar, just stopping by. I left a few comments on this PR, but overall, the results look great for a first issue (adding or modifying data types can be tricky!).
] | ||
} | ||
) | ||
test_data = [airbnb_df, flights_df, date_df, pytest.car_df, pytest.olympic] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in your updated results that the range for the y-axis doesn't fit the data well (e.g. the minimum of the data points are far higher than the minimum of the y-range). Do you know why that might be?
lux/action/temporal.py
Outdated
|
||
day_col = col + " (day)" | ||
day_df = LuxDataFrame({day_col: pd.to_datetime(formatted_date.dt.day, format="%d")}) | ||
day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it odd that we're building the data frame and vis directly from the action class rather than using classes like VisList
and PandasExecutor
. At the same time, I see where you were coming from (we're essentially extending making copies of the same data frame just using different time scales). I'll leave my comment here in case Doris has any input on that!
Hi @micahtyong, thanks for the review! I'm not sure why the y-axis seems to have a minimum far below the data points. The fix in this PR relies on the existing functionality to populate the visualizations; it appears that the same y-axis range is being produced even when the changes from this PR are not added -- for instance, the flights dataset -- but I can take a closer look at it! |
…org#336) Co-authored-by: Caitlyn Chen <caitlynachen@berkeley.edu> Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>
* added notebook gallery * update README * removed scatterplot message in SQLExecutor * fixed typo in SQL documentation
* changes from perf branch to config * added flag for turning on/off lazy maintain optimization * merged in approx early pruning code * increase overall sampling start and cap * Adjust width and length criteria for early pruning vislist based on experiment results; Add warning message and test for early pruning * black version update * version lock on black * * fixed sql tests (added approx to execute constructor) * fixed sampling config test * improved Executor documentation
* adding weekday * adding docs * bugfix for y axis line chart export * fixing temporal axis by adding timescale variable in Clause
Codecov Report
@@ Coverage Diff @@
## master #262 +/- ##
==========================================
+ Coverage 79.94% 84.85% +4.90%
==========================================
Files 50 52 +2
Lines 3615 4007 +392
==========================================
+ Hits 2890 3400 +510
+ Misses 725 607 -118
Continue to review full report at Codecov.
|
Thanks @mantejpanesar, @micahtyong! I have made some final fixes and tests for this feature. A future to-do would be to extend the temporal action to support the different timescales on temporal attributes more generically, beyond just the Record-based line charts, but also when a measure value is present (e.g., in the case of Enhance). This would make sense for use cases like this, where the measure is not counts. But this is great as a first start! |
In this PR
This PR addresses #243 by adding a separate
temporal
action, which generates visualizations at different timescales for temporal columns. Supports separate visualizations for the following formats:%Y-%m-%d
,%Y-%m
, and single component temporal columns (%Y
,%m
,%d
).Changes
temporal
action to handle the creation of visualizations for temporal related columns.test/test_action.py
to verify that the correct type and number of visualizations are being generated.Visualizations
The following images showcases the new functionality for visualizing temporal columns tested on the Airbnb dataset.
Temporal Visualization Before
Temporal Visualization After