Proper pipeline refresh with delta table format #1777

Open
jorritsandbrink opened this issue Sep 2, 2024 · 0 comments
Labels
enhancement New feature or request

Feature description

Pipeline refresh has two flaws when the delta table format is used on the filesystem destination:

  1. Incomplete "DROP" ➜ with refresh="drop_sources", an empty table folder is left behind.
  2. "TRUNCATE" behaves like "DROP" ➜ with refresh="drop_data", the table schema and history are deleted along with the data.

Repro (1):

import dlt
from dlt.destinations import filesystem
from tests.pipeline.utils import airtable_emojis

source = airtable_emojis().with_resources("📆 Schedule", "🦚Peacock")
for resource in source.selected_resources.values():
    resource.apply_hints(table_format="delta")

pipe = dlt.pipeline(
    pipeline_name="refresh_repro",
    pipelines_dir="_storage",
    destination=filesystem("_storage")
)

pipe.run(source)
pipe.run(source.with_resources("🦚Peacock"), refresh="drop_sources")
# actual: empty folder `/_schedule/_delta_log` remains
# expected: `/_schedule/_delta_log` no longer exists

Repro (2):

import dlt
from dlt.destinations import filesystem
from tests.pipeline.utils import airtable_emojis

source = airtable_emojis().with_resources("📆 Schedule", "🦚Peacock")
for resource in source.selected_resources.values():
    resource.apply_hints(table_format="delta")

pipe = dlt.pipeline(
    pipeline_name="refresh_repro",
    pipelines_dir="_storage",
    destination=filesystem("_storage")
)

pipe.run(source)
pipe.run(source.with_resources("📆 Schedule"), refresh="drop_data")
# actual: _schedule table has single commit (/_schedule/_delta_log/00000000000000000000.json) (in SQL terms: table got DROPped)
# expected: _schedule table has two commits (in SQL terms: table got TRUNCATEd)
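
To check the "actual" versus "expected" outcome of Repro (2), the commits in a table's `_delta_log` folder can be counted. The following is a minimal sketch, not part of dlt: `count_delta_commits` is a hypothetical helper name, and it assumes the table lives on the local filesystem and that commit files use the 20-digit, zero-padded naming seen above.

```python
import re
from pathlib import Path

# Commit files in `_delta_log` are named with a 20-digit, zero-padded version,
# e.g. 00000000000000000000.json. Checkpoints and other files do not match.
COMMIT_FILE = re.compile(r"^\d{20}\.json$")


def count_delta_commits(table_dir: str) -> int:
    """Return the number of commit files in `<table_dir>/_delta_log`.

    Hypothetical helper for local paths only. Returns 0 if the log folder
    does not exist.
    """
    log_dir = Path(table_dir) / "_delta_log"
    if not log_dir.is_dir():
        return 0
    return sum(1 for entry in log_dir.iterdir() if COMMIT_FILE.match(entry.name))
```

Pointed at the `_schedule` table folder after the second run, this returns 1 for the behavior reported here, whereas a truncate that preserves history would leave more than one commit.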

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

No response

Proposed solution

Add delta-specific implementations of drop_tables and truncate_tables. Currently, the generic filesystem implementations are applied to delta tables as well.
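
As a rough illustration of the intended semantics, here is a sketch that assumes a table on the local filesystem and the `deltalake` Python package. The function names `drop_delta_table` and `truncate_delta_table` are hypothetical and are not dlt's actual `drop_tables` / `truncate_tables` methods. A real implementation would need to go through the destination's filesystem abstraction instead of `shutil`.

```python
import shutil
from pathlib import Path


def drop_delta_table(table_dir: str) -> None:
    """Remove the whole table folder, including `_delta_log`.

    Hypothetical helper; assumes `table_dir` is a local path.
    """
    path = Path(table_dir)
    if path.exists():
        shutil.rmtree(path)


def truncate_delta_table(table_dir: str) -> int:
    """Delete all rows while keeping the table's schema and history.

    Hypothetical helper; returns the table version after the delete.
    """
    # Imported lazily so that `drop_delta_table` works without `deltalake`.
    from deltalake import DeltaTable

    # Without a predicate, `delete` removes all records by writing a new
    # commit to the log rather than deleting the log itself.
    DeltaTable(table_dir).delete()
    # Reload the table to read the version from the log on disk.
    return DeltaTable(table_dir).version()
```

The difference from the generic filesystem behavior is that the drop removes the table folder itself rather than only its files, and the truncate is expressed as a delta operation instead of file deletion.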

Related issues

#1742 (comment)

@jorritsandbrink jorritsandbrink added the enhancement New feature or request label Sep 2, 2024
Projects
Status: Todo