[FEATURE] Exclude files in s3 data source #2525

Open
kaituo opened this issue Feb 22, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

kaituo (Contributor) commented Feb 22, 2024

Is your feature request related to a problem?
A user may have a messy S3 bucket and want to exclude certain unstructured log types stored in it. Glue offers a way to exclude files, but that is Hive functionality. I suspect we will need to upgrade our SQL grammar to support advanced filtering.

What solution would you like?
Possible approaches:

@kaituo kaituo added enhancement New feature or request untriaged labels Feb 22, 2024
penghuo (Collaborator) commented Feb 22, 2024

The file metadata path is an existing Spark SQL feature, so no extra grammar change is required. For example:

SELECT _metadata.file_name, count(*) 
FROM alb_logs 
WHERE _metadata.file_path like '%2023/11/09%' 
GROUP BY _metadata.file_name
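The query above selects files by path. The same hidden `_metadata` column could presumably be used the other way round, to exclude files, which is what the issue asks for. The following is only a sketch: the `/tmp/` prefix and the `.txt` suffix are hypothetical patterns chosen for illustration, and it assumes `alb_logs` is a file-based table that exposes `_metadata`, as in the example above.

```sql
-- Sketch: exclude unwanted objects by negating a match on the file metadata.
-- The '/tmp/' prefix and '.txt' suffix are hypothetical exclusion patterns.
SELECT _metadata.file_name, count(*)
FROM alb_logs
WHERE _metadata.file_path NOT LIKE '%/tmp/%'
  AND _metadata.file_name NOT LIKE '%.txt'
GROUP BY _metadata.file_name
```

Whether such a predicate keeps the excluded objects from being read at all, or only filters their rows out afterwards, likely depends on the Spark version and data source and has not been verified here.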

@penghuo penghuo removed the untriaged label Feb 22, 2024