Google Analytics source: expose “isDataGolden” flag #12013

Closed
Tracked by #10938
ThaliaBarrera opened this issue Apr 14, 2022 · 9 comments · Fixed by #12426

Comments

@ThaliaBarrera
Contributor

Tell us about the problem you're trying to solve

Google Analytics Reporting API v4 may return provisional or incomplete data – usually when it’s fresh. When this occurs, the returned data will set the flag “isDataGolden” to false, and the connector will log a warning to the sync log.

A warning in the logs acts as a heads-up, but it doesn't help filter non-golden data out of an analysis. Having the flag replicated to the destination would be more helpful.

Describe the solution you’d like

I'd like the “isDataGolden” flag to be replicated to the destination by the GA connector.
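A minimal sketch of what replicating the flag could look like, assuming the standard Reporting API v4 `reports.batchGet` response layout. The function name `records_with_golden_flag` is hypothetical and is not the connector's actual code. A missing flag is treated as false, which is an assumption on the cautious side.

```python
from typing import Any, Dict, Iterator


def records_with_golden_flag(response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Flatten a batchGet response into records that carry isDataGolden.

    Hypothetical helper for illustration only.
    """
    for report in response.get("reports", []):
        header = report.get("columnHeader", {})
        dimension_names = header.get("dimensions", [])
        metric_entries = header.get("metricHeader", {}).get("metricHeaderEntries", [])
        metric_names = [entry["name"] for entry in metric_entries]

        data = report.get("data", {})
        # Assumption: an absent flag means the data is not golden yet.
        is_golden = bool(data.get("isDataGolden", False))

        for row in data.get("rows", []):
            record = dict(zip(dimension_names, row.get("dimensions", [])))
            # Only the first date range's metric values are used in this sketch.
            metric_sets = row.get("metrics") or [{}]
            record.update(zip(metric_names, metric_sets[0].get("values", [])))
            record["isDataGolden"] = is_golden
            yield record
```

Every row of a report gets the same value, because the flag is reported once per response rather than per row.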

Are you willing to submit a PR?

Yes

@ChristopheDuong
Contributor

ChristopheDuong commented Apr 19, 2022

In "incremental" sync mode, isn't this attribute a make-or-break input that the connector's logic should use alongside its cursor field?

isDataGolden (boolean): Indicates if response to this request is golden or not. Data is golden when the exact same request will not produce any new results if asked at a later point in time.

from https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportData.FIELDS.is_data_golden

If the Google API has not yet declared the report's data golden, the associated rows should keep being replicated in subsequent incremental syncs until the data is flagged "golden" (no further updates expected).

Exposing isDataGolden is also required so that it can be used downstream as one of the fields for proper dedupe logic.
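As an illustration of the downstream dedupe mentioned above, here is a rough sketch. It assumes rows arrive in replication order and that each row carries the `isDataGolden` field. The function name `dedupe_prefer_golden` and the key fields are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def dedupe_prefer_golden(
    rows: Iterable[Dict[str, Any]], key_fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Keep one row per key: the latest one, unless a golden row is already stored."""
    best: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        current = best.get(key)
        # Replace when nothing is stored, when the stored row is still
        # provisional, or when the incoming row is golden.
        if (
            current is None
            or not current.get("isDataGolden", False)
            or row.get("isDataGolden", False)
        ):
            best[key] = row
    return list(best.values())
```

A later provisional row supersedes an earlier provisional one, but never overwrites a row that was already golden.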

@davydov-d
Collaborator

@ChristopheDuong just to clarify - would it be enough to extend the schemas and records with a new field within this issue? Or should we go further and make a compound cursor field? Or a compound primary key? or both?

@ChristopheDuong
Contributor

  • Extending the schemas and records with a new field would be a nice improvement for the Full Refresh - Append sync mode, but it is not sufficient.
  • Making a compound cursor field would be ideal for improving the Incremental - X sync modes.

@sherifnada
Contributor

@davydov-d to echo Chris's point, when isDataGolden=false we need to keep resyncing this data until the isDataGolden flag is true.

The reason relates to the definition of that flag. From the Google docs:

isDataGolden: Indicates if response to this request is golden or not. Data is golden when the exact same request will not produce any new results if asked at a later point in time.

If the value is false for a report, this means the result of the report will be updated in the future, and we should therefore resync this date period.

@davydov-d davydov-d self-assigned this Apr 26, 2022
@davydov-d
Collaborator

@sherifnada I got the point, thanks

The problem is that isDataGolden indicates whether the response to a request (not a report) is golden. What if we get true, false, true, true across a sequence of requests? Should we resync all the date periods after the first isDataGolden=false (even though some of them are golden)?
In the docs, I cannot find any guarantee that non-golden data cannot precede golden data.

@davydov-d
Collaborator

never mind, that's not a problem since we can have a more complex structure of the stream state
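The thread does not spell out what that more complex state would look like. One possible shape, purely as an assumption, keeps the usual cursor plus a list of date chunks whose responses were not golden. The names `update_state`, `cursor`, and `pending` below are hypothetical.

```python
from typing import Any, Dict


def update_state(state: Dict[str, Any], start: str, end: str, is_golden: bool) -> Dict[str, Any]:
    """Return a new state after syncing the chunk [start, end] (ISO dates).

    Assumed state shape: {"cursor": "YYYY-MM-DD", "pending": [[start, end], ...]}
    where "pending" lists chunks that must be requested again.
    """
    pending = {tuple(chunk) for chunk in state.get("pending", [])}
    if is_golden:
        pending.discard((start, end))  # chunk is final, stop resyncing it
    else:
        pending.add((start, end))  # chunk is provisional, resync it later
    # ISO date strings sort chronologically, so max() picks the latest date.
    cursor = max(state.get("cursor", end), end)
    return {"cursor": cursor, "pending": sorted(list(chunk) for chunk in pending)}
```

With the true, false, true, true sequence from the previous comment, only the second chunk would remain in `pending`, so the golden chunks around it would not be resynced.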

@sherifnada
Contributor

@ThaliaBarrera do you know how long the isDataGolden flag takes to flip to true? For example, if it takes at most 2 days, then it would be a lot simpler to just sync data from the past 2 days. Otherwise we'd need to keep track of the exact date chunks for which isDataGolden has been false, which is a more complicated implementation.

@davydov-d
Collaborator

I've already started implementing the second option, and I want to add that it's going to be a much more complicated solution, since the Python CDK does not support compound cursor fields, only nested ones. That's challenging.

At the same time it looks like syncing data for 2 previous days does not require manipulations with the cursor field

@sherifnada
Contributor

@davydov-d I think 2 days is probably the right lookback window. These docs shared by Thalia indicate data processing time is 24-48 hours. So we should probably go for the following solution:

  1. always sync data from 2 days ago
  2. include isDataGolden flag in output
  3. write a section in the docs which explains that the connector always syncs data from 2 days ago because Google Analytics updates data up to 48 hours after the fact, and that the isDataGolden flag should be used to determine whether data has finished processing
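Step 1 of the list above could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: the function name `sync_start`, its parameters, and the `LOOKBACK` constant are hypothetical, and the 2-day window comes from the 24-48 hour processing time mentioned above.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed lookback window, based on the 24-48 hour processing time.
LOOKBACK = timedelta(days=2)


def sync_start(cursor: Optional[date], config_start: date, today: date) -> date:
    """Pick the first date to request in an incremental sync."""
    if cursor is None:
        # First sync: nothing saved yet, start from the configured date.
        return config_start
    # Re-read at least the last two days, even if the cursor is newer,
    # but never go earlier than the configured start date.
    return max(config_start, min(cursor, today - LOOKBACK))
```

For example, with a saved cursor of 2022-04-28 and a sync running on 2022-04-28, the request would start at 2022-04-26, so the two most recent days are fetched again and can replace provisional rows.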

davydov-d added a commit that referenced this issue Apr 28, 2022
davydov-d added a commit that referenced this issue May 3, 2022
* #12013 source GA to Beta: always sync data from two days ago

* #12013 GA to Beta: fix changelog

* #12013 source GA to Beta: rm odd file

* #12013 Source GA to Beta: comment out integration tests

* #12013 expose isDataGolden field, assume missing field equals False

* #12013 expose isDataGOlden flag: reword docs

* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
suhomud pushed a commit that referenced this issue May 23, 2022