
Adding support to improve temporal action to display different timescales #262

Merged · 18 commits · Apr 30, 2021

Conversation

mantejpanesar (Contributor):

In this PR

This PR addresses #243 by adding a separate temporal action that generates visualizations at different timescales for temporal columns. It supports separate visualizations for the following formats: %Y-%m-%d, %Y-%m, and single-component temporal columns (%Y, %m, %d).
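The three supported formats parse cleanly with pandas; a minimal sketch (the sample values below are hypothetical, not from the PR):

```python
import pandas as pd

# Hypothetical sample values for each supported format.
full = pd.Series(["2021-02-11", "2021-03-05"])   # %Y-%m-%d
year_month = pd.Series(["2021-02", "2021-03"])   # %Y-%m
year_only = pd.Series(["2019", "2020"])          # %Y

# Each format parses into a datetime64 column, which the action can
# then split into per-timescale visualizations.
print(pd.to_datetime(full, format="%Y-%m-%d").dt.month.tolist())    # [2, 3]
print(pd.to_datetime(year_month, format="%Y-%m").dt.year.tolist())  # [2021, 2021]
print(pd.to_datetime(year_only, format="%Y").dt.year.tolist())      # [2019, 2020]
```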

Changes

  • Add a new temporal action that handles creating visualizations for temporal columns.
  • Tests in test/test_action.py to verify that the correct type and number of visualizations are being generated.

Visualizations

The following images showcase the new functionality for visualizing temporal columns, tested on the Airbnb dataset.

Temporal Visualization Before

[Screenshot: Airbnb temporal visualization before]

Temporal Visualization After

[Screenshot: Airbnb temporal visualization after]

return recommendation


def create_vis(ldf, col):
Member:

Rename create_vis to be specific to temporal, since it is easily mistaken for the other create_vis inside the renderer.

Contributor Author:

Thanks for the catch! I'll update the function name.

"""
visuals = []
converted_string = ldf[col].astype(str)
if converted_string.str.contains("-").any():
Member:

I think this can be done without the regex string by simply extracting the timescale values from the datetime attribute via Pandas, such as .dt.year, .dt.month, etc. We should avoid having to do astype since this can be an expensive operation.

Contributor Author:

Sounds good, I'll update the action to directly extract the different timescales from the datetime attribute.
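The suggested approach can be sketched as follows (hypothetical data; assumes the column is already a datetime64 dtype):

```python
import pandas as pd

# Sketch of the suggested approach: pull timescale components straight
# from a datetime64 column via the .dt accessor, avoiding the expensive
# astype(str) round-trip and string matching.
dates = pd.to_datetime(pd.Series(["2021-02-11", "2020-07-04"]))

years = dates.dt.year
months = dates.dt.month
days = dates.dt.day

print(years.tolist())   # [2021, 2020]
print(months.tolist())  # [2, 7]
print(days.tolist())    # [11, 4]
```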

]
}
)
test_data = [airbnb_df, flights_df, date_df, pytest.car_df, pytest.olympic]
Member:

For the flights dataset, the results look a bit weird since the different timescales are created even for attributes that only encompass a single timescale.

df = pd.read_csv("../../lux-datasets/data/flights.csv")
df['year'] = pd.to_datetime(df['year'], format='%Y')
df['month'] = pd.to_datetime(df['month'], format='%m')
df['day'] = pd.to_datetime(df['day'], format='%d')

[Screenshot: flights dataset recommendations]

Contributor Author:

I'll take a closer look at these results. The changes to use the datetime attribute and to discard timescales that are not applicable should resolve this issue. I'll make the changes and ensure that these visualizations are correct.

Member:

I noticed in your updated results that the range for the y-axis doesn't fit the data well (e.g. the minimum of the data points are far higher than the minimum of the y-range). Do you know why that might be?

parsed_date = converted_string.str.extract(r"([0-9]{4})?-?([0-9]{2})?-?([0-9]{1,2})?")
valid_year, valid_month, valid_day = parsed_date.apply(lambda x: not x.isnull().all()).values

date_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=4)
Member:

We should discard timescales that are not applicable; for example, the day component will show up as all *-*-01 if the column only has year and month information.

df = pd.read_csv("../../lux-datasets/data/stocks.csv")
df

[Screenshot: stocks dataset recommendations]

Contributor Author:

Sounds good, I'll update the logic in the temporal action to use compute_date_granularity() in order to discard time scales that aren't applicable.
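The thread names compute_date_granularity(), whose signature isn't shown here; the function below is a hypothetical stand-in sketching the idea of discarding timescales that carry no information:

```python
import pandas as pd

def applicable_timescales(dates: pd.Series) -> list:
    """Hypothetical stand-in for compute_date_granularity(): keep only
    components that actually vary. Year-month strings parse with
    day == 1 everywhere, so a constant day component is discarded."""
    scales = []
    if dates.dt.year.nunique() > 1:
        scales.append("year")
    if dates.dt.month.nunique() > 1:
        scales.append("month")
    if dates.dt.day.nunique() > 1:
        scales.append("day")
    return scales

# Year-month data: the defaulted "day" timescale is dropped.
ym = pd.to_datetime(pd.Series(["2020-11", "2021-02"]), format="%Y-%m")
print(applicable_timescales(ym))  # ['year', 'month']
```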

day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1)
visuals.append(day_vis)
else:
single_vis = Vis([lux.Clause(col, data_type="temporal")], source=ldf, score=5)
Member:

How does this scoring behave when there are more than two temporal attributes? Do we show the unseparated overall visualizations first, then move on to all (year), (month), etc.? Perhaps add an example of this to the use cases.

Contributor:

The answer to this can be included in the long description.

Contributor Author:

The scoring results in the overall visualizations first, then all the (year) visualizations, all the (month) visualizations, etc. Is there an alternative ordering that should be considered? I'll also add an example of this situation to the tests and use cases, and update the long description with the scoring behavior.

micahtyong (Member), Mar 6, 2021:

In my opinion, the current ordering can probably be improved. We have a suite of functions in interestingness.py that give you a sense of how visualizations are ranked. Perhaps weighted_correlation is a good place to start? This doesn't have to be in this PR, but just something to think about.
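As a generic illustration (not Lux's actual interestingness implementation), a correlation-based score would rank a trending series above near-flat noise:

```python
import numpy as np

def interestingness(x, y):
    """Generic illustration (not Lux's implementation): score a chart
    by the absolute Pearson correlation between its two axes, so a
    strong trend ranks above near-flat noise."""
    return abs(np.corrcoef(x, y)[0, 1])

months = np.arange(1, 13, dtype=float)
trend = 3.0 * months + 5.0                  # perfectly linear trend
noise = np.array([4, 9, 2, 7, 5, 8, 3, 6, 5, 7, 4, 6], dtype=float)

# Rank candidate charts by descending interestingness.
ranked = sorted(
    [("trend", interestingness(months, trend)),
     ("noise", interestingness(months, noise))],
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranked[0][0])  # trend
```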

dorisjlee (Member):

Thanks @mantejpanesar! This looks really great as a first issue, I left some comments in the code review for you to address.

@@ -19,8 +19,7 @@ def register_default_actions():
lux.config.register_action("correlation", correlation, no_vis)
lux.config.register_action("distribution", univariate, no_vis, "quantitative")
lux.config.register_action("occurrence", univariate, no_vis, "nominal")
lux.config.register_action("temporal", univariate, no_vis, "temporal")
lux.config.register_action("geographical", geomap, no_vis, "geographical")
Contributor:

@micahtyong I'm not sure if you are depending on this, but if so, let's not remove this import.

micahtyong (Member):

Yeah, I am, thanks for the catch! @jerrysong1324

Contributor Author:

Thanks for the catch!

pass
recommendation = {
"action": "Temporal",
"description": "Show trends over <p class='highlight-descriptor'>time-related</p> attributes.",
Contributor:

Want to try adding a long description field as well? Our long descriptions include the description along with how the graphs are sorted and other additional info for the user.

Contributor Author:

Thanks for clarifying what the long descriptions are used for. I'll add how the graphs are sorted in the long description.

mantejpanesar (Contributor Author):

Hi @dorisjlee, as requested I've updated the temporal action to parse temporal fields using the datetime attribute via pandas, and it now discards timescales that are not applicable. I have provided example outputs for the stocks dataset, the flights dataset, and fabricated data (check-in and check-out dates) to highlight the scoring behavior for multiple temporal columns with multiple timescales. Let me know if there are any other changes I should make!

Stocks Dataset Visualization

This demonstrates that timescales which are not applicable are no longer presented in the visualizations.
[Screenshot: stocks dataset]

Flights Dataset Visualization

This demonstrates the removal of erroneous visualizations for attributes that only encompass a single timescale.
[Screenshot: flights dataset]

Scoring Behavior

This demonstrates the scoring behavior for the temporal action. When visualizations are generated via parsing a temporal column with the datetime attribute, all the overall visualizations are shown first, followed by all the (year) visualizations, followed by all the (month) visualizations, and finally all the (day) visualizations.
[Screenshot: scoring behavior]
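The described display order falls out of sorting candidates by score; a minimal sketch with hypothetical column names (only the overall and day scores appear in the diff context above; the year and month values are assumed for illustration):

```python
# Minimal sketch of the described display order: each visualization
# carries a score and candidates are shown in descending score order.
# Column names and the year/month scores are hypothetical.
candidate_vis = [
    ("checkin (day)", 1), ("checkout (day)", 1),
    ("checkin (month)", 2), ("checkout (month)", 2),
    ("checkin (year)", 3), ("checkout (year)", 3),
    ("checkin", 5), ("checkout", 5),
]
ordered = [name for name, score in sorted(candidate_vis, key=lambda v: -v[1])]
# Overall visualizations first, then all (year), all (month), all (day).
print(ordered)
```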

micahtyong (Member) left a comment:

Hi @mantejpanesar, just stopping by. I left a few comments on this PR, but overall, the results look great for a first issue (adding or modifying data types can be tricky!).



day_col = col + " (day)"
day_df = LuxDataFrame({day_col: pd.to_datetime(formatted_date.dt.day, format="%d")})
day_vis = Vis([lux.Clause(day_col, data_type="temporal")], source=day_df, score=1)
Member:

I find it odd that we're building the data frame and vis directly from the action class rather than using classes like VisList and PandasExecutor. At the same time, I see where you were coming from (we're essentially making copies of the same data frame using different timescales). I'll leave my comment here in case Doris has any input on that!

mantejpanesar (Contributor Author):

Hi @micahtyong, thanks for the review! I'm not sure why the y-axis minimum sits far below the data points. This PR relies on the existing functionality to populate the visualizations, and the same y-axis range is produced even without the changes from this PR (for instance, on the flights dataset), but I can take a closer look at it!

mantejpanesar and others added 13 commits March 25, 2021 22:13

…org#336)

Co-authored-by: Caitlyn Chen <caitlynachen@berkeley.edu>
Co-authored-by: Doris Lee <dorisjunglinlee@gmail.com>

* added notebook gallery
* update README
* removed scatterplot message in SQLExecutor
* fixed typo in SQL documentation
* changes from perf branch to config
* added flag for turning on/off lazy maintain optimization
* merged in approx early pruning code
* increase overall sampling start and cap
* adjust width and length criteria for early pruning vislist based on experiment results; add warning message and test for early pruning
* black version update
* version lock on black
* fixed SQL tests (added approx to execute constructor)
* fixed sampling config test
* improved Executor documentation
* adding weekday
* adding docs
* bugfix for y-axis line chart export
* fixing temporal axis by adding timescale variable in Clause
codecov bot commented Apr 30, 2021

Codecov Report

Merging #262 (2de3e6b) into master (b8b64bf) will increase coverage by 4.90%.
The diff coverage is 87.19%.


@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   79.94%   84.85%   +4.90%     
==========================================
  Files          50       52       +2     
  Lines        3615     4007     +392     
==========================================
+ Hits         2890     3400     +510     
+ Misses        725      607     -118     
Impacted Files Coverage Δ
lux/action/correlation.py 86.66% <ø> (ø)
lux/action/enhance.py 100.00% <ø> (ø)
lux/action/univariate.py 91.30% <ø> (+0.91%) ⬆️
lux/vislib/altair/Choropleth.py 94.20% <0.00%> (ø)
lux/vislib/altair/LineChart.py 83.33% <57.14%> (+1.51%) ⬆️
lux/vislib/matplotlib/ScatterChart.py 75.29% <60.00%> (ø)
lux/core/sqltable.py 73.11% <73.11%> (ø)
lux/executor/SQLExecutor.py 84.45% <82.22%> (+70.91%) ⬆️
lux/executor/Executor.py 80.85% <88.88%> (+1.36%) ⬆️
lux/vis/Vis.py 75.73% <90.00%> (+0.88%) ⬆️
... and 34 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8b64bf...2de3e6b.

@dorisjlee dorisjlee merged commit ada6173 into lux-org:master Apr 30, 2021
dorisjlee (Member):

Thanks @mantejpanesar, @micahtyong! I have made some final fixes and tests for this feature. A future to-do would be to extend the temporal action to support different timescales on temporal attributes more generically: beyond just the Record-based line charts, also when a measure value is present (e.g., in the case of Enhance). This would make sense for use cases like this one, where the measure is not a count. But this is a great first start!
