
Unexpected high costs on Google Cloud Storage #2085

Closed
gregorp90 opened this issue Jan 16, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@gregorp90

Environment

Delta-rs version: 0.10.2

Environment:

  • Cloud provider: Google Cloud Storage
  • Other: Python 3.10

Bug

What happened:
Not sure if this is a bug, but it was recommended on Stack Overflow that I post the issue here: https://stackoverflow.com/questions/77639348/delta-rs-package-incurs-high-costs-on-gcs/77681169#77681169.

I'm using the package to store files in a Google Cloud Storage dual-region bucket. I use the following code to store the data:

from typing import Any, Generator

import pyarrow as pa
from deltalake import write_deltalake

def save_data(self, df: Generator[pa.RecordBatch, Any, None]):
    # df_schema (a pyarrow.Schema) and self.max_rows_per_file are defined elsewhere
    write_deltalake(
        "gs://<my-bucket-name>",
        df,
        schema=df_schema,
        partition_by="my_id",
        mode="append",
        max_rows_per_file=self.max_rows_per_file,
        max_rows_per_group=self.max_rows_per_file,
        min_rows_per_group=int(self.max_rows_per_file / 2),
    )

The input data is a generator since I'm taking the data from a Postgres database in batches. I am saving similar data into two different tables and I'm also saving a SUCCESS file for each uploaded partition.
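For context, a minimal sketch of what such a RecordBatch generator over Postgres might look like (this is not the original code; the table name, column names, batch size, and the psycopg2 usage are all assumptions for illustration):

from typing import Any, Generator

import psycopg2
import pyarrow as pa

# Hypothetical schema; df_schema in the snippet above would be built similarly.
df_schema = pa.schema([("my_id", pa.int64()), ("value", pa.float64())])

def read_batches(conn_str: str, batch_size: int = 100_000) -> Generator[pa.RecordBatch, Any, None]:
    with psycopg2.connect(conn_str) as conn:
        # A named (server-side) cursor so Postgres streams rows in batches
        # instead of materialising the whole result set at once.
        with conn.cursor(name="export_cursor") as cur:
            cur.execute("SELECT my_id, value FROM my_table")
            while True:
                rows = cur.fetchmany(batch_size)
                if not rows:
                    break
                ids, values = zip(*rows)
                yield pa.RecordBatch.from_arrays(
                    [pa.array(ids, pa.int64()), pa.array(values, pa.float64())],
                    schema=df_schema,
                )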

I have around 25,000 partitions and most of them only have a single parquet file in them. The total number of rows that I've inserted is around 700,000,000. This incurred the following costs:

  • Class A operations: 127,000
  • Class B operations: 109,856,507
  • Download Worldwide Destinations: 300 GiB

The number of Class A operations makes sense to me when accounting for two writes per partition plus an additional SUCCESS file (these are inserts). Some partitions probably have more than one file, so the number is a bit higher than 25,000 (the number of partitions) × 3.
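As a rough sanity check on that estimate, using only the numbers quoted above:

# Expected Class A operations, assuming ~3 object creations per partition
# (two table writes plus one SUCCESS file); actual file counts vary.
partitions = 25_000
operations_per_partition = 3
expected_class_a = partitions * operations_per_partition
print(expected_class_a)  # 75,000 -- observed ~127,000, consistent with some partitions having extra files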

I can't figure out where so many Class B operations and so much Download Worldwide Destinations traffic come from. Is this expected, or could it be a bug?

Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?

What you expected to happen:
Much lower costs for Class B operations on GCS.

@gregorp90 gregorp90 added the bug Something isn't working label Jan 16, 2024
@ion-elgreco
Collaborator

I wouldn't have the slightest clue what Class B operations even are; I don't use GCP myself.

If you can break it down into terms for non-GCP users, that would help.

@roeap
Collaborator

roeap commented Jan 16, 2024

In order to write a delta table, we always need to know the latest state of the table. As such, every write also requires us to read all relevant log files at least once. Usually there are one or more list operations as well.

Are you creating checkpoints? If not, we have to read at least one commit file for every transaction that was created on the table, which can become very sizeable.

We have a PR in flight that will allow us to be more economical in terms of reads, especially in append-only scenarios, where we can disregard a lot of the log - again, given there are checkpoints.
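For reference, a minimal sketch of creating a checkpoint manually with the Python deltalake bindings (the table path is a placeholder):

from deltalake import DeltaTable

# Once a checkpoint exists, readers only need the checkpoint parquet plus
# the JSON commits written after it, rather than every commit since version 0.
dt = DeltaTable("gs://<my-bucket-name>")
dt.create_checkpoint()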

@gregorp90
Author

@ion-elgreco Class B operations are mainly for reading objects from Google Cloud Storage.

@roeap I'm not sure about checkpoints. I haven't defined any myself, so if write_deltalake does not create them by default, I assume I was not using them. Based on the numbers I provided, does it make sense to get so many read/list operations?
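One way to check (a sketch assuming the gcsfs package and a placeholder bucket path) is to look for checkpoint files under _delta_log/; single-part checkpoints end in .checkpoint.parquet and come with a _last_checkpoint pointer file:

import gcsfs

fs = gcsfs.GCSFileSystem()
log_files = fs.ls("<my-bucket-name>/_delta_log")
checkpoints = [f for f in log_files if f.endswith(".checkpoint.parquet")]
print(f"{len(checkpoints)} checkpoint file(s) out of {len(log_files)} log entries")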

Note: I later changed the implementation to simply add new parquet files, as I figured out I don't really need the functionality of Delta Lake. I just wanted to point this out in case anyone else runs into a similar problem, especially since it can incur unexpectedly high costs on cloud providers.

@cpbritton

cpbritton commented Jan 17, 2024

I fixed this by adding a regular checkpoint creation function. This reduces the number of file operations.

PR around auto checkpoints is #913

@scheduler_fn.on_schedule(schedule="*/5 * * * *", memory=options.MemoryOption.GB_1, timeout_sec=1000)
def checkpointdb(request):
    dt = DeltaTable(
        "gs://bucket/deltalake/feed",
        storage_options={"google_service_account_key": json.dumps(google_service_account_key)},
    )
    dt.create_checkpoint()
    print("DB Checkpoint", flush=True)

@rtyler
Member

rtyler commented Jan 20, 2024

I'm going to close this; I don't believe there is anything actionable for the delta-rs project here.

@rtyler rtyler closed this as completed Jan 20, 2024