Enable Merge Upsert for existing Glue Tables on Primary Keys #503

Closed
4 tasks
jiteshsoni opened this issue Jan 4, 2021 · 4 comments
Labels: feature; minor release (Will be addressed in the next minor release); ready to release

Comments

@jiteshsoni
Contributor

Is your feature request related to a problem? Please describe.
Currently, we can only overwrite a table or drop/add partitions. We should add functionality to merge (upsert) data into an existing table.

Describe the solution you'd like
We will follow the below steps:

  • Read the existing table using Athena.
  • Check whether the existing table has duplicates.
  • Merge the new pandas DataFrame into the existing table on the primary key (the index of both DataFrames):
    merged_df = pd.concat([existing_df[~existing_df.index.isin(delta_df.index)], delta_df])
  • Overwrite the existing Athena table and write to the Glue catalog:
    res = wr.s3.to_parquet(
        df=merged_df,
        path=s3_path_prefix,
        dataset=True,
        database=database_name,
        table=table_name,
        mode="overwrite",
    )
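
The merge step in the list above can be sketched with pandas alone. This is a minimal illustration, not the implementation that was eventually released: the helper name `merge_upsert`, the `primary_key` parameter, and the sample data are assumptions made for the example.

```python
import pandas as pd


def merge_upsert(existing_df: pd.DataFrame, delta_df: pd.DataFrame, primary_key: list) -> pd.DataFrame:
    """Hypothetical helper: update or append rows of delta_df into existing_df by primary key."""
    existing = existing_df.set_index(primary_key)
    delta = delta_df.set_index(primary_key)

    # Duplicate check from the proposal: a non-unique key makes the upsert ambiguous.
    if existing.index.has_duplicates:
        raise ValueError("existing table has duplicate primary keys")

    # Keep existing rows whose key is absent from the delta, then append the delta rows.
    merged = pd.concat([existing[~existing.index.isin(delta.index)], delta])
    return merged.reset_index()


# Made-up sample data: id 2 is updated, id 4 is inserted.
existing_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
delta_df = pd.DataFrame({"id": [2, 4], "value": ["B", "d"]})
merged_df = merge_upsert(existing_df, delta_df, primary_key=["id"])
```

The untouched rows come first and the delta rows follow, so row order is not preserved; sort on the key afterwards if order matters.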

P.S. Please don't attach files; add code snippets directly in the message body instead.

@jiteshsoni self-assigned this on Jan 4, 2021
@jiteshsoni
Contributor Author

@igorborgest Please review the approach.

@igorborgest
Contributor

Hi @jiteshsoni,

This approach seems good to me. My only recommendation in advance is to read the data directly from S3 using wr.s3.read_parquet_table() instead of fetching it from Athena. If we are not going to process or filter the data in the Athena query, I think we can save time and money by skipping it.

What do you think?
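
For illustration, the two read paths being compared might look like the sketch below. The wrapper `read_existing_table` and its `via_athena` flag are hypothetical; only `wr.athena.read_sql_query` and `wr.s3.read_parquet_table` are library calls, and running either requires AWS credentials and an existing Glue table.

```python
import pandas as pd


def read_existing_table(database: str, table: str, via_athena: bool = False) -> pd.DataFrame:
    """Hypothetical wrapper contrasting the Athena read with the direct S3 read."""
    # Imported lazily so the sketch can be loaded without awswrangler installed.
    import awswrangler as wr

    if via_athena:
        # Runs an Athena query, which adds query latency and per-scan cost.
        return wr.athena.read_sql_query(sql=f'SELECT * FROM "{table}"', database=database)
    # Looks up the table location in the Glue catalog and reads the Parquet files directly.
    return wr.s3.read_parquet_table(table=table, database=database)
```

Since the upsert needs the whole table and applies no filter, the direct read returns the same rows without going through the query engine.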

@jiteshsoni
Contributor Author

@igorborgest Good callout. I will incorporate your recommendation.

@jiteshsoni changed the title from "Enable Merge Upsert for existing Athena Table on Primary Keys" to "Enable Merge Upsert for existing Glue Table on Primary Keys" on Jan 6, 2021
@jiteshsoni changed the title from "Enable Merge Upsert for existing Glue Table on Primary Keys" to "Enable Merge Upsert for existing Glue Tables on Primary Keys" on Jan 6, 2021
@igorborgest added this to the 2.4.0 milestone on Jan 18, 2021
@igorborgest added the "minor release" (Will be addressed in the next minor release) and "ready to release" labels on Jan 18, 2021
@igorborgest
Contributor

Released in version 2.4.0.
