
Google Search Console (GSC) Incremental stream isn't incremental #7479

Closed

danieldiamond opened this issue Oct 29, 2021 · 18 comments
Assignees
Labels
area/connectors Connector related issues community priority/high High priority type/bug Something isn't working

Comments

@danieldiamond
Contributor

Environment

  • Is this your first time deploying Airbyte: no
  • OS Version / Instance: Linux EC2 m5.2xlarge (AWS)
  • Deployment: Docker
  • Airbyte Version: 0.30.20-alpha
  • Source Connector and version: GSC 0.1.6
  • Destination Connector and version: Redshift 0.3.19
  • Severity: High
  • Step where error happened: Sync Job

Description: Finally able to try out the GSC connector. Unfortunately, incremental syncs are not working properly. I think it could be linked to an improper setting of state.

Current Behavior

Full refresh on streams configured to be incremental

Expected Behavior

Incremental sync

Logs


logs are too large. stored them here: https://airbytehq.slack.com/archives/C01MFR03D5W/p1635482528310300

Steps to Reproduce

Are you willing to submit a PR?


@danieldiamond danieldiamond added the type/bug Something isn't working label Oct 29, 2021
@sherifnada sherifnada added the priority/high High priority label Oct 29, 2021
@sherifnada sherifnada added this to the Connectors 2021-11-12 milestone Oct 29, 2021
@harshithmullapudi harshithmullapudi added the area/connectors Connector related issues label Nov 1, 2021
@harshithmullapudi
Contributor

Thanks for creating this issue; it helps us track these problems. The team will look into it soon.

@yevhenii-ldv yevhenii-ldv self-assigned this Nov 3, 2021
@yevhenii-ldv
Contributor

yevhenii-ldv commented Nov 4, 2021

Hi, @danieldiamond.
I investigated your request. According to the logs you provided, this looks like the first launch, which works as a Full Refresh (it collects all information starting from the date specified in the start_date configuration). The next launch will work as Incremental: reading will start from the data saved in the state file.
I checked the code, and the Incremental logic is in good condition.
To verify, I created a connector through the UI and started a synchronization for GSC (attached image):
image

  1. The first launch synchronizes all data (in my case, starting from January 2021) and creates a state file at the end of synchronization.
  2. Subsequent launches synchronize data based on the state file. Perhaps you are confused by the large number of records in the result. In my case the connector starts reading from 04-11-2021 inclusive, and since this connector provides statistics for every page of the site, there can be quite a lot of records for a single day. If you synchronize every 5 minutes, there may also be many duplicate records for the current day.
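The expected behavior described above can be sketched in plain Python. This is an illustrative example only, not the actual GSC connector code: it assumes a string `date` cursor and shows how a cursor-based incremental read should skip everything older than the saved state, re-reading the cursor day inclusively.

```python
def read_incremental(records, state, cursor_field="date"):
    """Yield only records at or past the saved cursor, then persist the new cursor.

    Illustrative sketch of cursor-based incremental sync behavior,
    not Airbyte's real implementation.
    """
    cursor = state.get(cursor_field)
    for record in records:
        if cursor is None or record[cursor_field] >= cursor:
            yield record
            # advance the cursor to the newest value seen so far
            if cursor is None or record[cursor_field] > cursor:
                cursor = record[cursor_field]
    state[cursor_field] = cursor  # saved at the end of the sync


records = [
    {"date": "2021-01-01", "page": "/a"},
    {"date": "2021-11-03", "page": "/b"},
    {"date": "2021-11-04", "page": "/c"},
]

state = {}  # first sync: no state yet, so everything is read (Full Refresh)
first = list(read_incremental(records, state))

# second sync: only records from the saved cursor date onward are read
second = list(read_incremental(records, state))
```

On the first run all three records are read and the state is set to the newest date; on the second run only the record on the cursor date is re-read. The bug reported in this issue is that every sync behaves like the first run.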

Could you please clarify what exactly you are having a problem with, so that I can take a deeper look at the connector code? At the moment I am unable to reproduce your problem.

@danieldiamond
Contributor Author

hi @yevhenii-ldv thanks for investigating this! to confirm though: yes, one day might contain a lot of records, but (as shown in the screenshots below) the 8M records span roughly a year's worth of page analytics. The subsequent sync reads all 8M records again.

Screen Shot 2021-11-05 at 12 11 31 am

Screen Shot 2021-11-05 at 12 12 07 am

@yevhenii-ldv
Contributor

yevhenii-ldv commented Nov 4, 2021

May I ask you to check whether it was simply a transient failure of the connector? Please create a new GSC Source and try to read the data incrementally 2 or 3 times (setting the start date to, for example, the beginning of autumn, so that it does not take too much of your time), and let me know what happens with the connector.
Unfortunately, I am unable to reproduce this error in any way 😞

@danieldiamond
Contributor Author

created a new connector. it does the same thing.
i wonder if it's because i'm using service account key authentication rather than OAuth. i've historically had issues with airbyte and this type of authentication.

Screen Shot 2021-11-07 at 9 37 01 am

Screen Shot 2021-11-07 at 9 37 35 am

@danieldiamond
Contributor Author

normalization - 2021-11-06 15:10:09 INFO () LineGobbler(voidCall):82 - 15:10:09 | 1 of 2 START table model google_search_console.sites......................................................... [RUN]
normalization - 2021-11-06 15:10:10 INFO () LineGobbler(voidCall):82 - 15:10:10 | 2 of 2 START incremental model google_search_console.search_analytics_by_page................................ [RUN]
normalization - 2021-11-06 15:10:11 INFO () LineGobbler(voidCall):82 - 15:10:11 | 1 of 2 OK created table model google_search_console.sites.................................................... [SELECT in 1.47s]
normalization - 2021-11-06 15:10:17 INFO () LineGobbler(voidCall):82 - 15:10:17 | 2 of 2 OK created incremental model google_search_console.search_analytics_by_page........................... [INSERT 0 0 in 7.42s]

there's now ~30M records in the _airbyte_raw_search_analytics_by_page and zero records in the search_analytics_by_page (normalized) table

@danieldiamond
Contributor Author

regarding this separate issue of no records in the final table: @ChristopheDuong, i wonder if this is due to the new incremental normalization process

@yevhenii-ldv
Contributor

Hello @danieldiamond,
I researched this problem thoroughly.
The GSC connector supports incremental synchronization, and I have used it successfully (all tests that cover this functionality also pass without any complaints).
I also tried various destinations to check the incremental behavior:

  • CSV and JSON for local data storage: everything worked fine.
  • BigQuery with normalization: the data is saved correctly and Incremental works properly.

The last thing left was to reproduce your configuration, so I created a connection with GSC as the source and Redshift as the destination.
Here is a screenshot of my results.
image
Everything works correctly.

With that in mind, could you make sure that the source and destination connector versions are updated to the latest, and re-create the connection? At the moment I am simply unable to reproduce your problem.

@danieldiamond
Contributor Author

Hi @yevhenii-ldv, thank you so much for the further investigation. Can I please confirm that you checked this using the service account configuration? Or can you at least confirm 💯, without any doubt, that there is no difference between the two authentication methods when syncing?
Thanks again. I'll try to sit with this and investigate further on my end.

@yevhenii-ldv
Contributor

@danieldiamond
I can assure you that I have not changed the connector configuration in any way between the two syncs.

I created a connection -> started syncing -> waited 1 minute after it finished -> started a second sync.

@danieldiamond
Contributor Author

@yevhenii-ldv looking at your screenshot, i see incremental model.search_analytics_by_page, which leads me to believe you're not testing this on the airbyte version listed in the description, but rather on one of the later versions that use the new normalization process, which performs incremental in a different way. is that correct?

Airbyte Version: 0.30.20-alpha (or 0.30.23-alpha i.e. pre-normalization revamp)
Source name: GSC 0.1.6
destination: Redshift 0.3.19

also posting the connection schema below to confirm it's configured correctly

{
  "connectionId": "409bd766-2f56-4be7-95b5-c96a160f624b",
  "name": "default",
  "namespaceDefinition": "customformat",
  "namespaceFormat": "google_search_console",
  "prefix": "",
  "sourceId": "f1063ef2-af11-412c-8158-538e7154f755",
  "destinationId": "744cf008-8d04-4d35-b4b8-daa5e2f4130b",
  "operationIds": [
    "c123bff8-264e-4074-9707-a7f406385fef"
  ],
  "syncCatalog": {
    "streams": [
      {
        "stream": {
          "name": "search_analytics_by_page",
          "jsonSchema": {
            "type": "object",
            "$schema": "http://json-schema.org/draft-07/schema#",
            "properties": {
              "ctr": {
                "type": [
                  "null",
                  "number"
                ],
                "multipleOf": 1e-25
              },
              "date": {
                "type": [
                  "null",
                  "string"
                ],
                "format": "date"
              },
              "page": {
                "type": [
                  "null",
                  "string"
                ]
              },
              "clicks": {
                "type": [
                  "null",
                  "integer"
                ]
              },
              "position": {
                "type": [
                  "null",
                  "number"
                ],
                "multipleOf": 1e-25
              },
              "site_url": {
                "type": [
                  "null",
                  "string"
                ]
              },
              "impressions": {
                "type": [
                  "null",
                  "integer"
                ]
              },
              "search_type": {
                "type": [
                  "null",
                  "string"
                ]
              }
            }
          },
          "supportedSyncModes": [
            "full_refresh",
            "incremental"
          ],
          "sourceDefinedCursor": true,
          "defaultCursorField": [
            "date"
          ],
          "sourceDefinedPrimaryKey": [],
          "namespace": null
        },
        "config": {
          "syncMode": "incremental",
          "cursorField": [
            "date"
          ],
          "destinationSyncMode": "append",
          "primaryKey": [],
          "aliasName": "search_analytics_by_page",
          "selected": true
        }
      },
      {
        "stream": {
          "name": "sites",
          "jsonSchema": {
            "type": "object",
            "$schema": "http://json-schema.org/draft-07/schema#",
            "properties": {
              "siteUrl": {
                "type": [
                  "null",
                  "string"
                ]
              },
              "permissionLevel": {
                "type": [
                  "null",
                  "string"
                ]
              }
            }
          },
          "supportedSyncModes": [
            "full_refresh"
          ],
          "sourceDefinedCursor": null,
          "defaultCursorField": [],
          "sourceDefinedPrimaryKey": [],
          "namespace": null
        },
        "config": {
          "syncMode": "full_refresh",
          "cursorField": [],
          "destinationSyncMode": "overwrite",
          "primaryKey": [],
          "aliasName": "sites",
          "selected": true
        }
      }
    ]
  },
  "schedule": null,
  "status": "active",
  "resourceRequirements": {
    "cpu_request": null,
    "cpu_limit": null,
    "memory_request": null,
    "memory_limit": null
  }
}

@yevhenii-ldv
Contributor

@danieldiamond In my last testing, I used the following versions:

Airbyte Version: 0.30.39-alpha
Source name: GSC 0.1.6
destination: Redshift 0.3.20

@danieldiamond
Contributor Author

  1. In your logs, how many records were read in the first sync and in the second sync?
  2. Can you please check this on version 0.30.23?

I'm thinking it's possible that the connector is still reading all the records (not doing an incremental load), but then processing only the latest records via the new normalisation incremental process and the "_airbyte_ab_id" column, which would mask the underlying issue.

@ChristopheDuong
Contributor

ChristopheDuong commented Nov 17, 2021

a new value for _airbyte_ab_id is randomly generated for each row every time the destination writes it. This doesn't depend on the record, so it wouldn't re-use the same _airbyte_ab_id as the last sync.

Thus, I don't think incremental normalization would be able to mask these records based on the _airbyte_ab_id.
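The point above can be illustrated with a small sketch (illustrative only, not Airbyte's actual destination code): if an id is freshly generated on every write, the same source record receives a different id in each sync, so the id sets from two syncs never overlap and cannot be used to match or deduplicate rows.

```python
import uuid


def write_sync(records):
    """Simulate a destination write: each row gets a fresh random id,
    independent of the record's content (illustrative sketch)."""
    return {str(uuid.uuid4()): record for record in records}


records = [
    {"date": "2021-11-04", "page": "/a"},
    {"date": "2021-11-04", "page": "/b"},
]

first_sync = write_sync(records)
second_sync = write_sync(records)  # the exact same records, re-synced

# Even though the underlying records are identical, the assigned ids
# from the two syncs are disjoint, so they can't link rows across syncs.
overlap = set(first_sync) & set(second_sync)
```

This is why a content-independent random id cannot silently hide duplicate rows: deduplication would need a key derived from the record itself.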

@yevhenii-ldv
Contributor

yevhenii-ldv commented Nov 19, 2021

Hi @danieldiamond!
We are already on version 0.32.0; I don't see much point in starting a synchronization on a much earlier version when no errors were found in the current one.

@danieldiamond
Contributor Author

danieldiamond commented Nov 21, 2021

okay, thanks @yevhenii-ldv, but to be sure: this wasn't troubleshot on pre-0.30.24 versions, and i would imagine all source connectors should be able to function on any airbyte version unless noted otherwise.

on not seeing much point in trying to get this resolved on previous versions: a quick search for normalization in the slack group will surface various issues associated with the incremental normalization introduced in 0.30.24. some users, including myself, have not been able to upgrade to 0.32.0 due to the introduced performance issues.

@ChristopheDuong am i able to upgrade to 0.32.0+ without initiating the new incremental normalization process?

@yevhenii-ldv
Contributor

Cannot reproduce the issue using Airbyte version 0.32.0.

@danieldiamond
Contributor Author

danieldiamond commented Dec 14, 2021

@yevhenii-ldv i'm hoping i'm just doing something silly here that you can help with.
this is a brand new instance, 0.33.11-alpha. incremental doesn't appear to be working (GSC 0.1.7)

  • start_date: 2020-01-01 (even though data starts from 2020-08-01)
  • end_date: left empty
  • schema:
    • search_analytics_by_page: incremental + append
    • sites: full_refresh + overwrite

i checked the destination and max(date) is 2021-12-11, which has 26724 records

first_sync.txt
second_sync.txt

Screen Shot 2021-12-14 at 10 36 52 pm

Screen Shot 2021-12-15 at 8 47 09 am
