
Adding support to improve temporal action to display different timescales #262

Merged · 18 commits · Apr 30, 2021

Conversation

mantejpanesar (Contributor):

In this PR

This PR addresses #243 by adding a separate temporal action that generates visualizations at different timescales for temporal columns. It supports separate visualizations for the following formats: %Y-%m-%d, %Y-%m, and single-component temporal columns (%Y, %m, %d).
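The three supported formats parse cleanly with pandas; a minimal sketch (the sample values below are hypothetical, not from the PR):

```python
import pandas as pd

# Hypothetical sample values for each supported format.
full = pd.Series(["2021-02-11", "2021-03-05"])   # %Y-%m-%d
year_month = pd.Series(["2021-02", "2021-03"])   # %Y-%m
year_only = pd.Series(["2019", "2020"])          # %Y

# Each format parses into a datetime64 column, which the action can
# then split into per-timescale visualizations.
print(pd.to_datetime(full, format="%Y-%m-%d").dt.month.tolist())    # [2, 3]
print(pd.to_datetime(year_month, format="%Y-%m").dt.year.tolist())  # [2021, 2021]
print(pd.to_datetime(year_only, format="%Y").dt.year.tolist())      # [2019, 2020]
```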

Changes

  • Add a new temporal action that handles creating visualizations for temporal columns.
  • Tests in test/test_action.py to verify that the correct type and number of visualizations are being generated.

Visualizations

The following images showcase the new functionality for visualizing temporal columns, tested on the Airbnb dataset.

Temporal Visualization Before

[Screenshot: Airbnb temporal visualization before]

Temporal Visualization After

[Screenshot: Airbnb temporal visualization after]

return recommendation


def create_vis(ldf, col):
Member:

Rename create_vis to be specific to temporal, since it is easily mistaken for the other create_vis inside the renderer.

Contributor Author:

Thanks for the catch! I'll update the function name.

"""
visuals = []
converted_string = ldf[col].astype(str)
if converted_string.str.contains("-").any():
Member:

I think this can be done without the regex string by simply extracting the timescale values from the datetime attribute via Pandas, such as .dt.year, .dt.month, etc. We should avoid having to do astype since this can be an expensive operation.

Contributor Author:

Sounds good, I'll update the action to directly extract the different timescales from the datetime attribute.
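The suggested approach can be sketched as follows (hypothetical data; assumes the column is already a datetime64 dtype):

```python
import pandas as pd

# Sketch of the suggested approach: pull timescale components straight
# from a datetime64 column via the .dt accessor, avoiding the expensive
# astype(str) round-trip and string matching.
dates = pd.to_datetime(pd.Series(["2021-02-11", "2020-07-04"]))

years = dates.dt.year
months = dates.dt.month
days = dates.dt.day

print(years.tolist())   # [2021, 2020]
print(months.tolist())  # [2, 7]
print(days.tolist())    # [11, 4]
```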

]
}
)
test_data = [airbnb_df, flights_df, date_df, pytest.car_df, pytest.olympic]
Member:

For the flights dataset, the results look a bit weird since the different timescales are created even for attributes that only encompass a single timescale.

df = pd.read_csv("../../lux-datasets/data/flights.csv")
df['year'] = pd.to_datetime(df['year'], format='%Y')
df['month'] = pd.to_datetime(df['month'], format='%m')
df['day'] = pd.to_datetime(df['day'], format='%d')

[Screenshot: flights dataset recommendations]

Contributor Author:

I'll take a closer look at these results. The changes to use the datetime attribute and to discard timescales that are not applicable should resolve this issue. I'll make the changes and ensure that these visualizations are correct.

Member:

I noticed in your updated results that the range for the y-axis doesn't fit the data well (e.g. the minimum of the data points are far higher than the minimum of the y-range). Do you know why that might be?

parsed_date = converted_string.str.extract(r"([0-9]{4})?-?([0-9]{2})?-?([0-9]{1,2})?")
valid_year, valid_month, valid_day = parsed_date.apply(lambda x: not x.isnull().all()).values

date_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=4)
Member:

We should discard timescales that are not applicable; for example, the day component will show up as all *-*-01 if the column only has year and month information.

df = pd.read_csv("../../lux-datasets/data/stocks.csv")
df

[Screenshot: stocks dataset recommendations]

Contributor Author:

Sounds good, I'll update the logic in the temporal action to use compute_date_granularity() in order to discard time scales that aren't applicable.
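The thread names compute_date_granularity(), whose signature isn't shown here; the function below is a hypothetical stand-in sketching the idea of discarding timescales that carry no information:

```python
import pandas as pd

def applicable_timescales(dates: pd.Series) -> list:
    """Hypothetical stand-in for compute_date_granularity(): keep only
    components that actually vary. Year-month strings parse with
    day == 1 everywhere, so a constant day component is discarded."""
    scales = []
    if dates.dt.year.nunique() > 1:
        scales.append("year")
    if dates.dt.month.nunique() > 1:
        scales.append("month")
    if dates.dt.day.nunique() > 1:
        scales.append("day")
    return scales

# Year-month data: the defaulted "day" timescale is dropped.
ym = pd.to_datetime(pd.Series(["2020-11", "2021-02"]), format="%Y-%m")
print(applicable_timescales(ym))  # ['year', 'month']
```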

day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1)
visuals.append(day_vis)
else:
single_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=5)
Member:

How does this scoring behave when there are more than two temporal attributes? Do we show the unseparated overall visualizations first, then move on to all (year), (month), etc.? Perhaps add an example of this to the use cases.

Contributor:

The answer to this can be included in the long description.

Contributor Author:

The scoring results in the overall visualizations first, then all the (year) visualizations, all the (month) visualizations, etc. Is there an alternative ordering that should be considered? I'll also add an example of this situation to the tests and use cases, and update the long description with the scoring behavior.

micahtyong (Member), Mar 6, 2021:

In my opinion, the current ordering can probably be improved. We have a suite of functions in interestingness.py that give you a sense of how visualizations are ranked. Perhaps weighted_correlation is a good place to start? This doesn't have to be in this PR, but just something to think about.
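As a generic illustration (not Lux's actual interestingness implementation), a correlation-based score would rank a trending series above near-flat noise:

```python
import numpy as np

def interestingness(x, y):
    """Generic illustration (not Lux's implementation): score a chart
    by the absolute Pearson correlation between its two axes, so a
    strong trend ranks above near-flat noise."""
    return abs(np.corrcoef(x, y)[0, 1])

months = np.arange(1, 13, dtype=float)
trend = 3.0 * months + 5.0                  # perfectly linear trend
noise = np.array([4, 9, 2, 7, 5, 8, 3, 6, 5, 7, 4, 6], dtype=float)

# Rank candidate charts by descending interestingness.
ranked = sorted(
    [("trend", interestingness(months, trend)),
     ("noise", interestingness(months, noise))],
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranked[0][0])  # trend
```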

dorisjlee (Member):

Thanks @mantejpanesar! This looks really great as a first issue, I left some comments in the code review for you to address.

@@ -19,8 +19,7 @@ def register_default_actions():
lux.config.register_action("correlation", correlation, no_vis)
lux.config.register_action("distribution", univariate, no_vis, "quantitative")
lux.config.register_action("occurrence", univariate, no_vis, "nominal")
lux.config.register_action("temporal", univariate, no_vis, "temporal")
lux.config.register_action("geographical", geomap, no_vis, "geographical")
Contributor:

@micahtyong I'm not sure if you are depending on this, but if so, let's not remove this import.

micahtyong (Member):

Yeah, I am, thanks for the catch! @jerrysong1324

Contributor Author:

Thanks for the catch!

pass
recommendation = {
"action": "Temporal",
"description": "Show trends over <p class='highlight-descriptor'>time-related</p> attributes.",
Contributor:

Want to try adding a long description field as well? Our long descriptions include the description along with how the graphs are sorted and other additional info for the user.

Contributor Author:

Thanks for clarifying what the long descriptions are used for. I'll add how the graphs are sorted in the long description.

mantejpanesar (Contributor Author):

Hi @dorisjlee, as requested I've updated the temporal action to parse temporal fields using the datetime attribute via pandas, and it now discards timescales that are not applicable. I have provided example outputs for the stocks dataset, the flights dataset, and fabricated data (check-in and check-out dates) to highlight the scoring behavior for multiple temporal columns with multiple timescales. Let me know if there are any other changes I should make!

Stocks Dataset Visualization

This demonstrates that timescales which are not applicable are no longer presented in the visualizations.
[Screenshot: stocks dataset]

Flights Dataset Visualization

This demonstrates the removal of erroneous visualizations for attributes that only encompass a single timescale.
[Screenshot: flights dataset]

Scoring Behavior

This demonstrates the scoring behavior for the temporal action. When visualizations are generated via parsing a temporal column with the datetime attribute, all the overall visualizations are shown first, followed by all the (year) visualizations, followed by all the (month) visualizations, and finally all the (day) visualizations.
[Screenshot: scoring behavior]
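The described display order falls out of sorting candidates by score; a minimal sketch with hypothetical column names (only the overall and day scores appear in the diff context above; the year and month values are assumed for illustration):

```python
# Minimal sketch of the described display order: each visualization
# carries a score and candidates are shown in descending score order.
# Column names and the year/month scores are hypothetical.
candidate_vis = [
    ("checkin (day)", 1), ("checkout (day)", 1),
    ("checkin (month)", 2), ("checkout (month)", 2),
    ("checkin (year)", 3), ("checkout (year)", 3),
    ("checkin", 5), ("checkout", 5),
]
ordered = [name for name, score in sorted(candidate_vis, key=lambda v: -v[1])]
# Overall visualizations first, then all (year), all (month), all (day).
print(ordered)
```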

micahtyong (Member) left a comment:

Hi @mantejpanesar, just stopping by. I left a few comments on this PR, but overall, the results look great for a first issue (adding or modifying data types can be tricky!).



day_col = col + " (day)"
day_df = LuxDataFrame({day_col: pd.to_datetime(formatted_date.dt.day, format="%d")})
day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1)
Member:

I find it odd that we're building the data frame and vis directly from the action class rather than using classes like VisList and PandasExecutor. At the same time, I see where you were coming from (we're essentially making copies of the same data frame using different timescales). I'll leave my comment here in case Doris has any input on that!

mantejpanesar (Contributor Author):

Hi @micahtyong, thanks for the review! I'm not sure why the y-axis minimum sits far below the data points. This PR relies on the existing functionality to populate the visualizations, and the same y-axis range is produced even without the changes from this PR (for instance, on the flights dataset), but I can take a closer look at it!

mantejpanesar and others added 13 commits March 25, 2021 22:13

…org#336)

Co-authored-by: Caitlyn Chen <caitlynachen@berkeley.edu>
Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>

* added notebook gallery
* update README
* removed scatterplot message in SQLExecutor
* fixed typo in SQL documentation
* changes from perf branch to config
* added flag for turning on/off lazy maintain optimization
* merged in approx early pruning code
* increase overall sampling start and cap
* adjust width and length criteria for early pruning vislist based on experiment results; add warning message and test for early pruning
* black version update
* version lock on black
* fixed SQL tests (added approx to execute constructor)
* fixed sampling config test
* improved Executor documentation
* adding weekday
* adding docs
* bugfix for y-axis line chart export
* fixing temporal axis by adding timescale variable in Clause
codecov bot commented Apr 30, 2021

Codecov Report

Merging #262 (2de3e6b) into master (b8b64bf) will increase coverage by 4.90%.
The diff coverage is 87.19%.


@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   79.94%   84.85%   +4.90%     
==========================================
  Files          50       52       +2     
  Lines        3615     4007     +392     
==========================================
+ Hits         2890     3400     +510     
+ Misses        725      607     -118     
Impacted Files Coverage Δ
lux/action/correlation.py 86.66% <ø> (ø)
lux/action/enhance.py 100.00% <ø> (ø)
lux/action/univariate.py 91.30% <ø> (+0.91%) ⬆️
lux/vislib/altair/Choropleth.py 94.20% <0.00%> (ø)
lux/vislib/altair/LineChart.py 83.33% <57.14%> (+1.51%) ⬆️
lux/vislib/matplotlib/ScatterChart.py 75.29% <60.00%> (ø)
lux/core/sqltable.py 73.11% <73.11%> (ø)
lux/executor/SQLExecutor.py 84.45% <82.22%> (+70.91%) ⬆️
lux/executor/Executor.py 80.85% <88.88%> (+1.36%) ⬆️
lux/vis/Vis.py 75.73% <90.00%> (+0.88%) ⬆️
... and 34 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8b64bf...2de3e6b.

@dorisjlee dorisjlee merged commit ada6173 into lux-org:master Apr 30, 2021
dorisjlee (Member):

Thanks @mantejpanesar, @micahtyong! I have made some final fixes and tests for this feature. A future to-do would be to extend the temporal action to support different timescales on temporal attributes more generically: beyond just the Record-based line charts, also when a measure value is present (e.g., in the case of Enhance). This would make sense for use cases like this one, where the measure is not a count. But this is a great first start!
