-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingestion/s3): groupby group-splitting issue #12254
fix(ingestion/s3): groupby group-splitting issue #12254
Conversation
files = list( | ||
bucket.objects.filter(Prefix=f"{prefix_to_list}").page_size(PAGE_SIZE) | ||
) | ||
files = sorted(files, key=lambda a: a.last_modified) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need to sort the s3_objects by last_modified
, as the logic below iterates through all files within the same folder and determines the maximum value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fair to me. @treff7es , since you were involved in the development here, are you okay with this?
@@ -139,6 +139,14 @@ def partitioned_folder_comparator(folder1: str, folder2: str) -> int: | |||
return 1 if folder1 > folder2 else -1 | |||
|
|||
|
|||
def _group_s3_objects_by_dirname(s3_objects: Any) -> Dict[str, List[Any]]: | |||
grouped_objects = defaultdict(list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used defaultdict
for grouping to eliminate the need to consider the order of the collection.
cc6f3b8
to
57d76cb
Compare
57d76cb
to
52bb264
Compare
52bb264
to
bef800f
Compare
bef800f
to
4019f41
Compare
4019f41
to
f12a066
Compare
assert len(res) == 2 | ||
assert res[0].sample_file == "s3://my-bucket/my-folder/dir1/0002.csv" | ||
assert res[1].sample_file == "s3://my-bucket/my-folder/dir2/0001.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfixed code returns
len(res): 3
res[0].sample_file: s3://my-bucket/my-folder/dir1/0001.csv
res[1].sample_file: s3://my-bucket/my-folder/dir2/0001.csv
res[2].sample_file: s3://my-bucket/my-folder/dir1/0002.csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, nice catch!
f12a066
to
e411090
Compare
e411090
to
b8586ae
Compare
b8586ae
to
5b04411
Compare
5b04411
to
da7c1f3
Compare
def _group_s3_objects_by_dirname(s3_objects: Any) -> Dict[str, List[Any]]: | ||
grouped_objects = defaultdict(list) | ||
for obj in s3_objects: | ||
dirname = obj.key.rsplit("/", 1)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _group_s3_objects_by_dirname
function could be placed in either s3_util.py
or s3_boto_utils.py
within metadata-ingestion/src/datahub/ingestion/source/aws
.
To keep things as generic as possible, we should consider what happens if there's no directory (i.e., no /
) in the key. In such cases, we could include the file in the group None
or /
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, regarding s3_objects: Any
, could you specify a more precise type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar functionality existed in the past
https://github.com/datahub-project/datahub/pull/4011/files
def groupby_unsorted(
iterable: Iterable[T], key: Callable[[T], K]
) -> Iterable[Tuple[K, Iterable[T]]]:
Yet another example to keep this logic as generic as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed review. I’ll make the changes you suggested and request your feedback again.
Regarding your question, here’s my response:
To keep things as generic as possible, we should consider what happens if there's no directory (i.e., no /) in the key. In such cases, we could include the file in the group None or /. WDYT?
-> I think /
is the better choice because it intuitively represents the root directory, whereas it's harder to understand the meaning of None
in this context. I'll add the necessary logic and tests for this case. Thanks for pointing this out. 👍
The _group_s3_objects_by_dirname function could be placed in either s3_util.py or s3_boto_utils.py within metadata-ingestion/src/datahub/ingestion/source/aws.
-> Sounds good. I'll move it to s3_util.py
Also, regarding s3_objects: Any, could you specify a more precise type?
-> Okay, I'll specify the types with mypy_boto3_s3
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Let's double confirm with @treff7es 's review, if possible
assert len(res) == 2 | ||
assert res[0].sample_file == "s3://my-bucket/my-folder/dir1/0002.csv" | ||
assert res[1].sample_file == "s3://my-bucket/my-folder/dir2/0001.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, nice catch!
20c8acc
to
b323dc1
Compare
b323dc1
to
96bb991
Compare
Changes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the contrib!
96bb991
to
61ffdcb
Compare
61ffdcb
to
732fa19
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
... and 636 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
@sgomezvillamor Could you please merge this PR? 🙏 |
732fa19
to
ad69996
Compare
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Summary
Fix the group-splitting issue caused by
itertools.groupby
when the input data is not ordered by the group key.Changes
As-is
itertools.groupby
for grouping. (group_key:s3_object.key
)s3_object.last_modified
)To-be
defaultdict
for grouping without pre-sorting.Background
Behavior of itertools.groupby
itertools.groupby()
generates a new group every time the value of the key function changes, which requires the input data to be sorted.This means the input data should be sorted and the group key must be equal to the sort key for correct grouping.
Problem part
file.last_modified
even though the group key isfile.key
. This mismatch causes group-splitting issues.Checklist