Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Superset 4.1 with Pinot not aggregating timestamps correctly #30219

Open
3 tasks done
anthony-df-nguyen opened this issue Sep 10, 2024 · 5 comments · Fixed by #31341
Open
3 tasks done

Superset 4.1 with Pinot not aggregating timestamps correctly #30219

anthony-df-nguyen opened this issue Sep 10, 2024 · 5 comments · Fixed by #31341
Labels
data:connect:pinot Related to Pinot viz:charts:line Related to the Line chart

Comments

@anthony-df-nguyen
Copy link

Bug description

Issue

It appears that when using Pinot (1.2.0) with Superset 4.1, columns that are epoch_ms formatted fields are not aggregating correctly without a lot of workarounds. It worked okay for us back when we were on Superset 2.0.

On Superset 2.0, when we used the old Time Series Line Chart, if we use an epoch_ms column as the time column, the query being made was the following which was working.

SELECT DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:DAYS'), count(DISTINCT report_id) AS count_1 FROM "default".incident_reports GROUP BY DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:DAYS') ORDER BY count(DISTINCT report_id) DESC LIMIT 10000;

Example of previously working Time Series Line Chart
image

Now on Superset 4.0, we use the newer Line Chart but using the same time column uses this query which appears to throw an error

SELECT CAST(DATE_TRUNC( 'QUARTER', CAST(DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MILLISECONDS') AS TIMESTAMP) ) AS TIMESTAMP) AS incident_date_epoch, COUNT(DISTINCT report_id) AS "COUNT_DISTINCT(report_id)" FROM "default".incident_reports GROUP BY CAST(DATE_TRUNC( 'QUARTER', CAST(DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MILLISECONDS') AS TIMESTAMP) ) AS TIMESTAMP) ORDER BY COUNT(DISTINCT report_id) DESC LIMIT 10000;

Error
image

Workarounds

I was able to workaround this doing 2 things, which are still not ideal

Workaround 1

I can make a calculated column in Superset for each epoch_ms field im using, in which i set Data Type to DATETIME and Date Time Format as %Y-%m-%d %H:%M:%S.%f and have the SQL expression ToDateTime(my_epoch_ms_column, 'yyyy-MM-dd')
image

Which will finally successfully plot on the Line Chart
image

which produces this SQL query
SELECT CAST(DATE_TRUNC('DAY', CAST(TODATETIME(incident_date_epoch, 'yyyy-MM-dd') AS TIMESTAMP)) AS TIMESTAMP) AS parsed_incident_date_epoch, COUNT(DISTINCT report_id) AS "COUNT_DISTINCT(report_id)" FROM "default".incident_reports GROUP BY CAST(DATE_TRUNC('DAY', CAST(TODATETIME(incident_date_epoch, 'yyyy-MM-dd') AS TIMESTAMP)) AS TIMESTAMP) ORDER BY COUNT(DISTINCT report_id) DESC LIMIT 10000;

Workaround 2

It appears that simply renaming the time column in the chart itself will finally get the chart to plot, although the format still looks ugly

image

Which produces SQL query
SELECT CAST(DATE_TRUNC( 'DAY', CAST(DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MILLISECONDS') AS TIMESTAMP) ) AS TIMESTAMP) AS "Incident Date Epoch", COUNT(DISTINCT report_id) AS "COUNT_DISTINCT(report_id)" FROM "default".incident_reports GROUP BY CAST(DATE_TRUNC( 'DAY', CAST(DATETIMECONVERT(incident_date_epoch, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MILLISECONDS') AS TIMESTAMP) ) AS TIMESTAMP) ORDER BY COUNT(DISTINCT report_id) DESC LIMIT 10000;

Other Threads

I've seen these other issues which dont appear to have a clear resolution to the issue im describing

How to reproduce the bug

  1. Have a Pinot dataset in which you have a time column formatted in epoch_ms
  2. Make sure that in your Dataset, you've set the column as temporal and formatted as epoch_ms
  3. Build a Line Chart in which you select the epoch_ms time column as the time column
  4. Pick any other field as your metric just to get something to plot
  5. You should see the error

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.10

Node version

I don't know

Browser

Chrome

Additional context

This is the logs I see in superset when trying the line chart presented in my issues

Traceback (most recent call last):
File "/app/superset/connectors/sqla/models.py", line 1724, in query
df = self.database.get_df(
File "/app/superset/models/core.py", line 677, in get_df
self.db_engine_spec.execute(cursor, sql_, self)
File "/app/superset/db_engine_specs/base.py", line 1828, in execute
raise cls.get_dbapi_mapped_exception(ex) from ex
File "/app/superset/db_engine_specs/base.py", line 1824, in execute
cursor.execute(query)
File "/usr/local/lib/python3.10/site-packages/pinotdb/db.py", line 44, in g
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pinotdb/db.py", line 312, in execute
raise exceptions.DatabaseError(msg)
pinotdb.exceptions.DatabaseError: {'errorCode': 150,
'message': 'SQLParsingError:\n'
'org.apache.pinot.sql.parsers.SqlCompilationException: '
"'as(cast(datetrunc('DAY', "
"cast(datetimeconvert(incident_date_epoch, '1:MILLISECONDS:EPOCH', "
"'1:MILLISECONDS:EPOCH', '1:MILLISECONDS'), 'TIMESTAMP')), "
"'TIMESTAMP'), incident_date_epoch)' should appear in GROUP BY "
'clause.\n'
'\tat '
'org.apache.pinot.sql.parsers.CalciteSqlParser.validateGroupByClause(CalciteSqlParser.java:191)\n'
'\tat '
'org.apache.pinot.sql.parsers.CalciteSqlParser.validate(CalciteSqlParser.java:174)\n'
'\tat '
'org.apache.pinot.sql.parsers.CalciteSqlParser.queryRewrite(CalciteSqlParser.java:550)\n'
'\tat '
'org.apache.pinot.sql.parsers.CalciteSqlParser.compileSqlNodeToPinotQuery(CalciteSqlParser.java:484)'}

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
@dosubot dosubot bot added data:connect:pinot Related to Pinot viz:charts:line Related to the Line chart labels Sep 10, 2024
Copy link

dosubot bot commented Sep 10, 2024

Hey @anthony-df-nguyen! I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those bugs together! 😊

I found a similar closed issue that might be relevant to your problem: refactor(pinot) regression on area chart (DB engine error). This issue was resolved by updating the pinotdb library to version 5.1.0 and creating a calculated column with the expression cast(dateHeureDebut as timestamp) in Superset [1].

The root cause of the problem you're experiencing is that the epoch_ms_to_dttm method in the PinotEngineSpec class is converting the timestamps using the same epoch format for both input and output, which does not change the granularity of the timestamps. This results in incorrect aggregation. The current implementation of epoch_ms_to_dttm is:

@classmethod
def epoch_ms_to_dttm(cls) -> str:
    return (
        "DATETIMECONVERT({col}, '1:MILLISECONDS:EPOCH', "
        + "'1:MILLISECONDS:EPOCH', '1:MILLISECONDS')"
    )

Workarounds:

  1. Modify the epoch_ms_to_dttm method to convert the milliseconds epoch to a more appropriate format for aggregation, such as seconds or a specific date-time format. For example:

    @classmethod
    def epoch_ms_to_dttm(cls) -> str:
        return (
            "DATETIMECONVERT({col}, '1:MILLISECONDS:EPOCH', "
            + "'1:SECONDS:EPOCH', '1:SECONDS')"
        )
  2. Use a custom SQL expression in Superset to handle the conversion and aggregation correctly.

These changes should help ensure that the timestamps are converted from milliseconds to a more appropriate format for aggregation, thus resolving the issue [2].

To continue talking to Dosu, mention @dosu.

@anthony-df-nguyen anthony-df-nguyen changed the title Superset 4.1 with Pinot not aggregating epoch_ms timestamps correctly Superset 4.1 with Pinot not aggregating timestamps correctly Sep 18, 2024
@anthony-df-nguyen
Copy link
Author

anthony-df-nguyen commented Sep 18, 2024

Update to this thread is that it appears to happen with other normal timestamp fields as well, not just epoch_ms

Pic 1: My day field is a string in format %Y-%m-%d
image

Pic 2: Bar chart cant aggregate the day field properly
image

I tried both Workaround 1 and 2 in my original message and this fixed it.

Workaround 1. Forcing the X-axis to have another title got it to plot
image

Workaround 2. Making a calculated column without any transformations fixes it
Calculated Column
image

Working Chart
image

@anthony-df-nguyen
Copy link
Author

@villebro Hello, im looking to see if you have any realistic estimation of when this can be looked at. Sorry if im tagging the wrong person, I just saw you on #26906 so I figured you might be the best person to ask.

@villebro
Copy link
Member

Hello @anthony-df-nguyen , this should be fixed as of #31341. I believe it'll be included in the next patch release of 4.0 and 4.1.

@villebro
Copy link
Member

Ping @michael-s-molina @sadpandajoe as you're the release managers for 4.0 and 4.1 respectively

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data:connect:pinot Related to Pinot viz:charts:line Related to the Line chart
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants