Destination Postgres: Very poor performance #21557

Closed · michelgalle opened this issue Jan 18, 2023 · 2 comments
Labels
area/connectors (Connector related issues), community, connectors/destinations-database, connectors/source/postgres, frozen (Not being actively worked on), releaseStage/alpha, team/destinations (Destinations team's backlog), type/bug (Something isn't working)

Comments

@michelgalle

Environment

  • Airbyte version: 0.40.26
  • OS Version / Instance: AMI 2, AWS EC2
  • Deployment: Docker Compose
  • Source Connector and version: Postgres 1.0.37
  • Destination Connector and version: Postgres 0.3.26
  • Step where error happened: Normalization

Current Behavior

For big tables it is impossible to run the normalization jobs. After an initial backfill that took 33 hours, the next sync, which added another ~200,000 rows, has been running normalization for over 12 hours now.
I am trying with append mode only, but deduped + history is even worse.

Expected Behavior

I would like the correct indexes to be created both by the ingestion step and by the dbt normalization step.

Steps to Reproduce

  1. Have a very big table (200,000,000 rows)
  2. Full sync first
  3. Synchronize it for the second time (incremental)

Looking at the generated tables and the compiled dbt code, it is clear that there is room for improvement.

  1. The raw table, which I believe is generated by the destination connector, has only an index on the primary key _airbyte_ab_id, but the subsequent model that runs the incremental step queries the table with a filter on _airbyte_emitted_at, which means it will always run a seq scan of the whole table (see the index sketch after this list).
  2. The final table has only an index on _airbyte_emitted_at, which is good if we want to run any subsequent incremental models using that column, but it does not have an index on _airbyte_ab_id, which is the unique key dbt uses to update the table. The code dbt generates for Postgres incremental models runs a delete and then an insert. The delete snippet is below:
    delete from "defaultdb"._heroku_db_test."measurements"
    where (_airbyte_ab_id) in (
        select (_airbyte_ab_id)
        from "measurements__dbt_tmp153913835609"
    );

Without an index on _airbyte_ab_id, this too will run a seq scan on the table.
3. Finally, the incremental code generated for Postgres might not use an index even if one exists, because there is extra logic in the query (not sure about this, I did not test it):

and coalesce(
      cast(_airbyte_emitted_at as timestamp with time zone ) >=
      (select max(cast(_airbyte_emitted_at as timestamp with time zone)) from "defaultdb"._heroku_db_test."measurements"),    
      true)

I am not sure how good Postgres is at figuring out at run time whether the first part of the coalesce is true or not; if it can't determine that during planning, it will not use an index either.
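
For reference, here is a minimal sketch of the manual indexes that would address points 1 and 2. The raw table name (_airbyte_raw_measurements) and its schema are my assumptions, so adjust names to your setup; whether the connector / normalization should create these automatically is the point of this issue.

    -- Manual workaround sketch; these indexes are not created by Airbyte or dbt today.
    -- Point 1: index the column the incremental dbt model filters on in the raw table
    -- (raw table name is assumed).
    create index if not exists ix_airbyte_raw_measurements_emitted_at
        on "defaultdb"._heroku_db_test."_airbyte_raw_measurements" (_airbyte_emitted_at);

    -- Point 2: index the unique key dbt uses for its delete + insert on the final table.
    create index if not exists ix_measurements_airbyte_ab_id
        on "defaultdb"._heroku_db_test."measurements" (_airbyte_ab_id);

    -- For point 3, EXPLAIN shows whether the planner actually uses an index once it exists.
    explain
    select max(cast(_airbyte_emitted_at as timestamp with time zone))
    from "defaultdb"._heroku_db_test."measurements";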

Are you willing to submit a PR?

No.

@gvillafanetapia (Contributor)

Having the same issue. We are manually adding an index on the raw table's _airbyte_emitted_at, but we also found that this column has very high cardinality because the timestamp is the exact moment at which the data point was retrieved from the API.

Downstream we have an incremental dbt model which filters on _airbyte_emitted_at.

It would be great to implement partitioning by this date and/or indexing at the raw level. Something similar to what was implemented here would be great.
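
For illustration only, here is a rough sketch of what a range-partitioned raw table could look like. This is not how the connector creates the table today; the column list follows the standard Airbyte raw columns, and the table name and partition bounds are hypothetical.

    -- Illustrative sketch: a raw table partitioned by range on _airbyte_emitted_at,
    -- so incremental models filtering on that column can prune whole partitions.
    create table _airbyte_raw_measurements (
        _airbyte_ab_id       varchar(36) not null,
        _airbyte_data        jsonb not null,
        _airbyte_emitted_at  timestamp with time zone not null
    ) partition by range (_airbyte_emitted_at);

    -- Partitions (e.g. monthly) would have to be created up front or by a maintenance job.
    create table _airbyte_raw_measurements_2023_01
        partition of _airbyte_raw_measurements
        for values from ('2023-01-01') to ('2023-02-01');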

igrankova changed the title from "Very poor performance on Postgres Destination" to "Destination Postgres: Very poor performance" on Jun 6, 2023
bleonard added the "frozen (Not being actively worked on)" label on Mar 22, 2024
@cgardens (Contributor) commented Apr 3, 2024

We have just certified our Postgres Destination (which includes reviewing a lot of performance dimensions). This should be resolved. Please reopen if you run into this again.

cgardens closed this as completed on Apr 3, 2024