[BUG] px.scatter "ols" not producing linear trend line #3683

Closed
jconoranderson opened this issue Apr 18, 2022 · 13 comments

@jconoranderson

jconoranderson commented Apr 18, 2022

Describe your context
Please provide us your environment, so we can easily reproduce the issue.

  • replace the result of pip list | grep dash below
dash                               2.3.1
dash-auth                          1.4.1
dash-bootstrap-components          1.0.2
dash-core-components               2.0.0
dash-html-components               2.0.0
dash-table                         5.0.0
  • if frontend related, tell us your Browser, Version and OS

    • OS: macOS Version 12.2.1 (21D62)
    • Browser: Chrome
    • Version: 100.0.4896.88 (Official Build) (x86_64)

Describe the bug

The "ols" (ordinary least squares) trendline option, which should add a linear trend line, is not producing a straight regression line. The result instead looks closer to a polynomial fit.

The code I'm using to create the graph is as follows:

elif beh_gph == 'ols':
    dfg[date_frmt] = pd.to_datetime(dfg[date_frmt])
    print(dfg)
    fig = px.scatter(dfg, x=date_frmt, y="Episode_Count", color="Target",
                     labels={"Episode_Count": tally + " per Shift",
                             "Target": "Target",
                             "Yr_Mnth": "Date"},
                     trendline="ols", title="Aggregate Behavior Data: " + patient + " - " + today)
    fig.update_xaxes(tickangle=45)
    fig.update_layout(template='plotly_white', hovermode="x unified")

Instead of a linear regression line like the one in the example here - https://plotly.com/python/linear-fits/

I'm getting this:

[screenshot of the resulting plot]

The x and y values are just date values and floating-point numbers, respectively.

The Plotly version is 5.7.0

Expected behavior

Linear regression line.

@jconoranderson
Author

I updated to the latest dash (2.3.1) and the problem still persists...

@alexcjohnson
Collaborator

@nicolaskruchten I can't quite tell what we're falling back on here but I'm guessing this just means ols trendlines don't support dates? Any hunch how hard this would be?

@jconoranderson
Author

Possibly. It worked as expected when I ran the same code on a Mac. This is currently being run on Windows 10. I can provide my entire codebase if that would help.

@alexcjohnson
Collaborator

A full reproducible example would be great, yes. Simplified to the minimal case if you can. My hunch about what's happening here: we're not able to use dates as the x data in the curve fitting algorithm, so it's using row indices as the x data during fitting, but somehow the indices used are out of order on Windows. If that's the case, then even on Mac where the indices are ordered correctly, the fit looks right only because your dates happen to be evenly spaced.

nicolaskruchten transferred this issue from plotly/dash on Apr 19, 2022
@nicolaskruchten
Contributor

nicolaskruchten commented Apr 19, 2022

This is meant to work even on non-evenly-spaced dates: dates are converted to floats and the regression happens there, then the original X values are provided to Plotly.js. I'll take a look sometime this week. The relevant code starts here: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/express/_core.py#L331
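
As a rough illustration of the approach described in this comment, the sketch below converts dates to Unix epoch seconds and fits a straight line on those floats. It is a stand-in under stated assumptions, not the plotly.py code: fit_line_on_dates is a hypothetical helper, and np.polyfit is used here in place of whatever fitting backend Plotly Express actually calls.

```python
import numpy as np


def fit_line_on_dates(dates, y):
    """Fit y = slope * t + intercept, with t in Unix epoch seconds (hypothetical helper)."""
    # Explicit 64-bit cast: nanosecond timestamps do not fit in 32 bits
    t = dates.astype("datetime64[ns]").astype(np.int64) / 10**9
    # Center t before fitting to keep the least-squares problem well conditioned
    t0 = t.mean()
    slope, intercept_centered = np.polyfit(t - t0, y, deg=1)
    return slope, intercept_centered - slope * t0


# Unevenly spaced dates (first of each month) with an exactly linear y
dates = np.array(["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"],
                 dtype="datetime64[D]")
days = (dates - dates[0]).astype(np.int64)  # days elapsed since the first date
y = 2.0 * days + 5.0

slope, intercept = fit_line_on_dates(dates, y)
slope_per_day = slope * 86400  # convert from "per second" to "per day"
print(slope_per_day)
```

Because the fit happens on real timestamps rather than row indices, the uneven month lengths should not bend the line.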

@jconoranderson
Author

Hi @nicolaskruchten... was there any update on this?

@nicolaskruchten
Contributor

No update yet, no. Can you provide a fully runnable example including data please?

The standard test case I use for OLS with dates on the X axis is px.scatter(px.data.stocks(indexed=True, datetimes=True), trendline="ols"), and it looks as expected to me.

@nicolaskruchten
Contributor

Also can you confirm the version of Plotly you are using? The latest is 5.7.0

@sdelu

sdelu commented Aug 18, 2022

Hello,

I'm not sure if this has been resolved but I am seeing a similar issue when trying to plot data that only has date values on the first of each month (though this is in a Jupyter environment and not Dash so I'm not sure if it is exactly the same case).

I've included the code below. This code does work properly when I run it in Google Colab, and there is a specific difference in the data that I don't understand (more below).

OS: Windows 10 v 20H2 build 19042.1826

  • plotly 5.9.0
  • jupyterlab 3.3.2

import datetime

import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01',
             '2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01', '2018-10-01'],
    'Units': [36.044379, 31.036306, 34.354977, 33.189577, 32.906101,
              35.679296, 48.577445, 53.967781, 51.684226, 32.638374],
})

df['Date'] = pd.to_datetime(df['Date'])
# Days since the Unix epoch, used as a plain-integer x axis for comparison
df['Date_serial'] = [(d - datetime.datetime(1970, 1, 1)).days for d in df['Date']]
# Raw datetime64[ns] values cast with the platform-default int
df['Datevalue'] = df['Date'].values.astype(int)

# Datetime x axis
fig = px.scatter(df, x='Date', y='Units', trendline='ols', trendline_color_override='red')

# Serialized-date x axis
fig2 = px.scatter(df, x='Date_serial', y='Units', trendline='ols', trendline_color_override='red')

fig.show()
fig2.show()

This produces two plots. The first uses the datetime column and has a non-linear trendline. For the second, I converted the dates into a serial day count, and the trendline is linear.

[image: plotly_graph_example_08182022]

But as I noted, the plots render as expected when I run them in Google Colab. The major difference between the results in my environment and those in Google Colab is in the values of the Datevalue column.

Colab results:

Date | Units | Date_serial | Datevalue
-- | -- | -- | --
2018-01-01 | 36.044379 | 17532 | 1514764800000000000
2018-02-01 | 31.036306 | 17563 | 1517443200000000000
2018-03-01 | 34.354977 | 17591 | 1519862400000000000
2018-04-01 | 33.189577 | 17622 | 1522540800000000000
2018-05-01 | 32.906101 | 17652 | 1525132800000000000


My results:

Date | Units | Date_serial | Datevalue
-- | -- | -- | --
2018-01-01 | 36.044379 | 17532 | 1581514752
2018-02-01 | 31.036306 | 17563 | -153812992
2018-03-01 | 34.354977 | 17591 | -612827136
2018-04-01 | 33.189577 | 17622 | 1946812416
2018-05-01 | 32.906101 | 17652 | 2068578304

I have no idea why the Datevalue numbers are so different, but I imagine the values being out of order is part of (or the entire) issue.

EDIT -- If I convert to int64 instead of int I get the same values as I see in Colab. It looks like this line in the Plotly code linked above converts to int, which I suspect produces the negative values for me:

if x.dtype.type == np.datetime64:
    x = x.astype(int) / 10**9  # convert to unix epoch seconds
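
The wraparound suspected here can be reproduced on any platform by forcing a 32-bit cast. The sketch below is only an illustration: it assumes the affected machine treats int as a 32-bit integer, and it uses np.int32 explicitly to mimic that. The first two wrapped values reproduce the ones in the "My results" table.

```python
import numpy as np

# First three dates from the example, as nanoseconds since the Unix epoch
dates = np.array(["2018-01-01", "2018-02-01", "2018-03-01"], dtype="datetime64[ns]")
ns_64 = dates.astype(np.int64)

# Force a 32-bit cast to mimic a platform whose default int is 32 bits wide;
# only the low 32 bits of each value survive
ns_32 = ns_64.astype(np.int32)

print(ns_64)
print(ns_32)

# The 64-bit values keep the chronological order, the wrapped ones do not
ordered_64 = bool(np.all(np.diff(ns_64) > 0))
ordered_32 = bool(np.all(np.diff(ns_32) > 0))
print(ordered_64, ordered_32)
```

Once the x values lose their order and spacing like this, a straight-line fit in the wrapped space no longer looks straight when drawn against the real dates.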

Colab environment:
Python 3.7.13
Plotly 5.5.0
Pandas 1.3.5

My current environment:
Python 3.10.0
Plotly 5.9.0
Pandas 1.4.2

I also replicated my error on an older environment:
Python 3.8.3
Plotly 5.6.0
Pandas 1.0.5

Let me know if any other details would be helpful.

@m-ad
Contributor

m-ad commented Jan 9, 2023

I can replicate this with current plotly 5.11.0, pandas 1.4.1, statsmodels 0.13.5 and Python 3.8.15 on Win10 Enterprise 64bit 22H2.

I ran the example code from @sdelu and got the same plots.

Likewise, the example px.scatter(px.data.stocks(indexed=True, datetimes=True), trendline="ols") looks like this:

[image]

@m-ad
Contributor

m-ad commented Jan 9, 2023

By the way, this issue is not limited to "ols", so maybe the issue can be renamed to something along the lines of "broken trendlines with datetime x-axis". Here is how the second "lowess" example from the documentation looks on my system:

import plotly.express as px

df = px.data.stocks(datetimes=True)
fig = px.scatter(df, x="date", y="GOOG", trendline="lowess", trendline_options=dict(frac=0.1))
fig.show()

[image]

@m-ad
Contributor

m-ad commented Jan 9, 2023

Hm... it seems that @sdelu was on the right track. I changed this line from

x = x.astype(int) / 10**9  # convert to unix epoch seconds

to

x = x.astype(np.int64) / 10**9  # convert to unix epoch seconds

Now the examples all work just fine.
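
For anyone who wants to check whether their own machine is affected, a small diagnostic along these lines may help. It is a sketch, not part of the patch: it prints the width of NumPy's default integer (NumPy releases before 2.0 use a 32-bit default on Windows) and shows that the explicit np.int64 cast gives correct epoch seconds either way.

```python
import numpy as np

# Width in bytes of NumPy's default integer on this machine;
# 4 means astype(int) is a 32-bit cast and the original line would wrap around
default_int_bytes = np.dtype(int).itemsize

dates = np.array(["2018-01-01", "2018-02-01", "2018-03-01"], dtype="datetime64[ns]")
# Explicit 64-bit cast, as in the changed line, independent of the platform default
seconds = dates.astype(np.int64) / 10**9

print(default_int_bytes)
print(seconds)
```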

m-ad added a commit to m-ad/plotly.py that referenced this issue Jan 9, 2023
nicolaskruchten added a commit that referenced this issue Jan 10, 2023
@nicolaskruchten
Contributor

Fixed in 5.12!
