Support a glob resolver for listing and partitioning files in an object store #5305

Merged
begelundmuller merged 16 commits into main from begelundmuller/object-store-list
Jul 23, 2024

@begelundmuller (Contributor) commented Jul 17, 2024

This PR introduces a glob resolver that lists and optionally partitions files in an object store connector. In connection with that, it also:

  • Adds a ListObjects function on drivers.ObjectStore
  • Adds a mock_object_store driver for writing tests against object store drivers
  • Refactors runtime.Resolve to support efficient iteration over a ResolverResult

This resolver will be used in a follow-up PR that adds incremental ingestion from object stores: the follow-up will call the glob resolver to obtain the values by which ingestion is split.

Example usage for listing out individual files:

# Example row:
#   {"path": "data/file.parquet", "uri": "s3://bucket/data/file.parquet", "updated_on": "2024-07-18T12:00:00Z"}
glob:
  connector: s3
  path: s3://bucket/**/*.parquet

Example usage for listing out files partitioned by directory:

# Example row:
#   {
#     "path": "data",
#     "uri": "s3://bucket/data", 
#     "files": ["data/file1.parquet", ...],
#     "updated_on": "2024-07-18T12:00:00Z"
#   } 
glob:
  connector: s3
  path: s3://bucket/**/*.parquet
  partition: directory
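
To make the shape of the directory-partitioned rows concrete, here is a minimal Go sketch of grouping matched object paths by their parent directory. It is not the code in this PR: the `partition` type and `partitionByDirectory` function are hypothetical names for illustration, and the sketch leaves out the `uri` and `updated_on` fields shown in the example row.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// partition is a hypothetical stand-in for one output row
// (only the "path" and "files" fields from the example row).
type partition struct {
	Path  string
	Files []string
}

// partitionByDirectory groups object paths by their parent directory
// and returns the groups sorted by directory name.
func partitionByDirectory(paths []string) []partition {
	byDir := make(map[string][]string)
	for _, p := range paths {
		dir := path.Dir(p) // "data/file1.parquet" -> "data"
		byDir[dir] = append(byDir[dir], p)
	}

	dirs := make([]string, 0, len(byDir))
	for d := range byDir {
		dirs = append(dirs, d)
	}
	sort.Strings(dirs)

	out := make([]partition, 0, len(dirs))
	for _, d := range dirs {
		files := byDir[d]
		sort.Strings(files)
		out = append(out, partition{Path: d, Files: files})
	}
	return out
}

func main() {
	parts := partitionByDirectory([]string{
		"data/file2.parquet",
		"data/file1.parquet",
		"logs/a.parquet",
	})
	for _, p := range parts {
		fmt.Println(p.Path, p.Files)
	}
}
```

Note that `path.Dir` returns `"."` for a file at the bucket root; how the actual resolver represents that case is not stated in this description.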

Example usage for listing out files partitioned using Hive partitioning:

# Example row: 
#   {
#      "path": "year=2024/month=07",
#      "uri": "s3://bucket/year=2024/month=07", 
#      "files": ["year=2024/month=07/file1.parquet", ...],
#      "updated_on": "2024-07-18T12:00:00Z",
#      "year": "2024",
#      "month": "07"
#   }
glob:
  connector: s3
  path: s3://bucket/**/*.parquet
  partition: hive
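
The extra `year` and `month` columns in the example row come from the `key=value` directory segments. A rough Go sketch of that extraction follows; `hiveValues` is a hypothetical helper written for illustration, and the resolver's real parsing rules (for example, how it treats segments without `=`) may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// hiveValues extracts key=value pairs from the segments of a directory path.
// Segments without "=" or with an empty key are skipped (an assumption).
func hiveValues(dir string) map[string]string {
	vals := make(map[string]string)
	for _, seg := range strings.Split(dir, "/") {
		k, v, ok := strings.Cut(seg, "=")
		if ok && k != "" {
			vals[k] = v
		}
	}
	return vals
}

func main() {
	// For the example row's path, this yields year=2024 and month=07.
	fmt.Println(hiveValues("year=2024/month=07"))
}
```

Values stay strings here, matching the quoted `"2024"` and `"07"` in the example row.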

Example usage for doing post-processing with DuckDB (note this is an advanced feature needed to address some complex partition processing use cases):

# Example row: 
#   {
#      "path": "2024/07/18.parquet",
#      "previous_path": "2024/07/17.parquet"
#   }
glob:
  connector: s3
  path: s3://bucket/**/*.parquet
  transform_sql: SELECT path, lag(path) OVER (ORDER BY path) AS previous_path FROM {{ .table }}
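
For readers less familiar with window functions, the `lag(path) OVER (ORDER BY path)` expression pairs each path with the one before it in sorted order, and the first row gets NULL. The Go sketch below mirrors that behavior outside of SQL; the `row` type and `withPreviousPath` function are hypothetical, and plain byte-wise string sorting is assumed to match the ordering DuckDB would apply.

```go
package main

import (
	"fmt"
	"sort"
)

// row mirrors the two columns selected by the transform_sql example.
type row struct {
	Path         string
	PreviousPath *string // nil plays the role of SQL NULL for the first row
}

// withPreviousPath sorts the paths and attaches the preceding path to each,
// which is what lag(path) OVER (ORDER BY path) computes.
func withPreviousPath(paths []string) []row {
	sorted := append([]string(nil), paths...) // copy so the input is not mutated
	sort.Strings(sorted)

	rows := make([]row, len(sorted))
	for i, p := range sorted {
		rows[i].Path = p
		if i > 0 {
			prev := sorted[i-1]
			rows[i].PreviousPath = &prev
		}
	}
	return rows
}

func main() {
	for _, r := range withPreviousPath([]string{"2024/07/18.parquet", "2024/07/17.parquet"}) {
		if r.PreviousPath == nil {
			fmt.Println(r.Path, "NULL")
		} else {
			fmt.Println(r.Path, *r.PreviousPath)
		}
	}
}
```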

@begelundmuller begelundmuller self-assigned this Jul 18, 2024
@begelundmuller begelundmuller changed the title from "Runtime: Support listing files in an ObjectStore" to "Support a glob resolver for listing and partitioning files in an object store" Jul 18, 2024
@begelundmuller begelundmuller marked this pull request as ready for review July 19, 2024 11:22
Review thread on runtime/resolvers/legacy_metrics.go (outdated, resolved)
Review thread on runtime/resolvers/glob.go (resolved)
@begelundmuller begelundmuller merged commit 831ebd0 into main Jul 23, 2024
4 checks passed
@begelundmuller begelundmuller deleted the begelundmuller/object-store-list branch July 23, 2024 13:21