
[BUG] autoCompact not working #1427

Open
0xdarkman opened this issue Oct 11, 2022 · 6 comments
Labels: bug (Something isn't working)

Comments


0xdarkman commented Oct 11, 2022

I have a Delta table at the location below, with the following properties:

tablePath abfss://CONTAINER@SA.dfs.core.windows.net/TABLE

delta.autoOptimize.autoCompact true
delta.autoOptimize.optimizeWrite true
delta.deletedFileRetentionDuration interval 168 hours
delta.logRetentionDuration interval 168 hours
delta.minReaderVersion 1
delta.minWriterVersion 2
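
(For reference, a minimal sketch of how such properties are typically set on a path-based Delta table from the Scala job; this is an assumption about how the table was configured, not taken from the report:)

// Sketch: setting the table properties listed above via Spark SQL.
// Assumes an active SparkSession `spark` with Delta Lake configured.
spark.sql("""
  ALTER TABLE delta.`abfss://CONTAINER@SA.dfs.core.windows.net/TABLE`
  SET TBLPROPERTIES (
    'delta.autoOptimize.autoCompact'     = 'true',
    'delta.autoOptimize.optimizeWrite'   = 'true',
    'delta.deletedFileRetentionDuration' = 'interval 168 hours',
    'delta.logRetentionDuration'         = 'interval 168 hours'
  )
""")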

I run the following daily from Databricks:

VACUUM delta.`abfss://CONTAINER@SA.dfs.core.windows.net/TABLE`

The table is partitioned by date, some_col.
I writeStream to the above table in append mode using Spark Structured Streaming.
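
(A minimal sketch of that streaming write; the source, the derived partition columns, and the checkpoint location are hypothetical placeholders, since the report does not include the job code:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("stream-to-delta").getOrCreate()
import spark.implicits._

// Hypothetical source; the real job reads from its actual input stream.
val source = spark.readStream.format("rate").load()
  .withColumn("date", to_date($"timestamp"))  // partition column, as described
  .withColumn("some_col", $"value" % 10)      // partition column, as described

source.writeStream
  .format("delta")
  .outputMode("append")
  .partitionBy("date", "some_col")
  .option("checkpointLocation",
    "abfss://CONTAINER@SA.dfs.core.windows.net/TABLE/_checkpoint") // hypothetical
  .start("abfss://CONTAINER@SA.dfs.core.windows.net/TABLE")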

Spark version 3.2.1
delta.io library version 1.2.1
Hadoop version 3.3.0

If I check some old location:

abfss://CONTAINER@SA.dfs.core.windows.net/TABLE/date=2022-08-20/hour=7/some_col=xxx/

or some latest written location:

abfss://CONTAINER@SA.dfs.core.windows.net/TABLE/date=2022-10-11//hour=5/some_col=xxx/

I see lots of small files still.

I thought autoCompact would force the Delta table to compact small files into larger ones (~128 MB).

However, that is not the case. Why?

0xdarkman added the bug label Oct 11, 2022
0xdarkman changed the title from [BUG] OPTIMIZE not working to [BUG] autoCompact not working Oct 11, 2022
0xdarkman (Author) commented:

%sql
DESCRIBE DETAIL delta.`abfss://CONTAINER@SA.dfs.core.windows.net/TABLE`

returns, among others:

delta.deletedFileRetentionDuration: "interval 168 hours"
delta.autoOptimize.autoCompact: "true"
delta.logRetentionDuration: "interval 168 hours"
delta.autoOptimize.optimizeWrite: "true"

tdas (Contributor) commented Oct 11, 2022

Are you running this on Databricks? If so, please contact Databricks support if auto compaction is not working.
If you are running this with Apache Spark: auto compaction is being built right now, so it's not yet available in Delta OSS.

0xdarkman (Author) commented:

@tdas I am running a Spark + Scala job deployed with the Spark Operator on Kubernetes.
So the answer is: yes, I use Spark to writeStream, but I run VACUUM from Databricks.

Should I expect this to work?

Q1: I do not need to run VACUUM from Spark, do I? Running VACUUM from Databricks should be OK.
Q2: autoCompact is a table property, so I thought the Delta table would handle auto compaction for me. What difference does Spark vs. Databricks make here?
Q3: If autoCompact does not work, shall I use OPTIMIZE instead? I would like to run OPTIMIZE from Databricks daily while keeping the stream running in Spark on Kubernetes.

0xdarkman (Author) commented Oct 11, 2022

https://docs.databricks.com/optimizations/auto-optimize.html#if-i-have-auto-optimize-enabled-on-a-table-that-im-streaming-into-and-a-concurrent-transaction-conflicts-with-the-optimize-will-my-job-fail

"By default, auto optimize does not begin compacting until it finds more than 50 small files in a directory"

I have more than 50 small files in the partitioned directories: 168 files per partition.
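
(Side note: per the Databricks doc linked above, that 50-file threshold is a session config; the sketch below only applies where auto compaction actually runs, i.e. on Databricks, and the value 30 is illustrative:)

// Sketch (Databricks only): lower the small-file threshold for auto compaction.
// The default is 50 per the documentation quoted above.
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "30")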

zsxwing (Member) commented Oct 12, 2022

> Q1: I do not need to run VACUUM from Spark, do I? Running VACUUM from Databricks should be OK.

You can run VACUUM anywhere.

> Q2: autoCompact is a table property, so I thought the Delta table would handle auto compaction for me. What difference does Spark vs. Databricks make here?

Auto optimize is still being built (#1156). Currently the table property is respected only if you run your queries in Databricks.

> Q3: If autoCompact does not work, shall I use OPTIMIZE instead? I would like to run OPTIMIZE from Databricks daily while keeping the stream running in Spark on Kubernetes.

Yep, you can use OPTIMIZE. It can be run in Spark or Databricks.
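
For example, a daily compaction job could be as small as this sketch (table path as above; the scheduling is up to the reader, and note that on OSS Spark the SQL OPTIMIZE command needs Delta Lake 2.0+, newer than the 1.2.1 in this report):

// Sketch: compact the whole table...
spark.sql("OPTIMIZE delta.`abfss://CONTAINER@SA.dfs.core.windows.net/TABLE`")

// ...or bound the work to recent partitions (illustrative literal date):
spark.sql("""
  OPTIMIZE delta.`abfss://CONTAINER@SA.dfs.core.windows.net/TABLE`
  WHERE date >= '2022-10-10'
""")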

"By default, auto optimize does not begin compacting until it finds more than 50 small files in a directory"

This is expected right now if the write is not happening in Databricks. As I mentioned above, it's being built.

Another suggestion, if you run your code outside Databricks, it's better to read https://docs.delta.io/latest/index.html instead.

bqiang-stackadapt commented:

@zsxwing I think this should be available now, since it's documented here.
