[REVIEW] Speedup Connected Components #302
Conversation
LGTM, would wait for @ayushdg to also TAL!
Great speedup 🎊
@ayushdg, please take a look. Let's land this soon. I have addressed all the changes.
@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path):
             transform_divisions=False,
             align_dataframes=False,
         )
-        ddf.to_parquet(output_path, write_index=False)
+        ddf.to_parquet(output_path, write_index=False, overwrite=True)
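For context, `overwrite=True` in Dask's `to_parquet` clears the output directory before writing, so a rerun does not mix fresh files with stale ones from a previous run. A minimal stdlib sketch of that clear-then-write behavior (function and file names here are illustrative, not from the PR):

```python
import os
import shutil
import tempfile

def write_files_with_overwrite(output_path, files):
    """Sketch of overwrite=True semantics: wipe stale output, then write fresh files."""
    if os.path.isdir(output_path):
        shutil.rmtree(output_path)  # drop leftovers from a previous run
    os.makedirs(output_path)
    for name, data in files.items():
        with open(os.path.join(output_path, name), "w") as f:
            f.write(data)

out = os.path.join(tempfile.mkdtemp(), "dedup_output")
write_files_with_overwrite(out, {"part.0.parquet": "old"})
write_files_with_overwrite(out, {"part.1.parquet": "new"})
# only the second run's file remains
print(sorted(os.listdir(out)))  # -> ['part.1.parquet']
```

Without the wipe, a second run with fewer partitions would leave the old `part.0.parquet` sitting next to the new output.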
So the only place we should not add this is when the false_positive check is False and we write approximately num_input_text_files // num_workers * num_output_buckets files, which is what overwhelms the OS when removing a bunch of small files. Everywhere else is fair game and we should do it.
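The file-count concern above can be made concrete with back-of-envelope arithmetic (the numbers below are illustrative, not from the PR):

```python
# Hypothetical numbers to illustrate the file-count blowup described above.
num_input_text_files = 10_000
num_workers = 16
num_output_buckets = 128

# Approximate number of output files in the path where false_positive check is False:
approx_files = num_input_text_files // num_workers * num_output_buckets
print(approx_files)  # -> 80000
```

Deleting tens of thousands of small files is slow at the filesystem level, which is why adding `overwrite=True` is skipped for that one path.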
NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py
Lines 1176 to 1192 in 1cc6d7a
written_files = output_df.map_partitions(
    write_partitioned_file,
    output_path,
    partition_on,
    batch_label,
    meta=cudf.Series([True]),
)
written_files = written_files.compute()
update_restart_offsets(output_path, bucket_part_offset, end_text_offset)
del output_df
print(
    "Text-df partition ",
    f"{end_text_offset}/{text_part_end_offset} "
    f"completed in {time.time()-st_text}",
    flush=True,
)
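The `update_restart_offsets` call in this snippet is the checkpointing step: after each batch is written, the current offsets are persisted so an interrupted run can resume instead of starting over. A hypothetical stdlib sketch of that pattern (the real helper in NeMo-Curator may store state differently):

```python
import json
import os
import tempfile

def update_restart_offsets(output_path, bucket_offset, text_offset):
    """Persist progress so a restarted run can skip completed batches (sketch)."""
    with open(os.path.join(output_path, "_restart_offsets.json"), "w") as f:
        json.dump({"bucket_offset": bucket_offset, "text_offset": text_offset}, f)

def read_restart_offsets(output_path):
    """Return saved offsets, or (0, 0) when starting fresh (sketch)."""
    path = os.path.join(output_path, "_restart_offsets.json")
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        state = json.load(f)
    return state["bucket_offset"], state["text_offset"]

out = tempfile.mkdtemp()
update_restart_offsets(out, bucket_offset=3, text_offset=1200)
print(read_restart_offsets(out))  # -> (3, 1200)
```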
Thanks!
@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path): | |||
transform_divisions=False, | |||
align_dataframes=False, | |||
) | |||
ddf.to_parquet(output_path, write_index=False) | |||
ddf.to_parquet(output_path, write_index=False, overwrite=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged commit message:

* Speedup fuzzy dedup by avoiding merge
* Remove unused function
* Clean up PR based on Praateek's reviews
* style fixes
* style fixes
* Remove dangling print
* Add handling for multiple columns
* Nuking convert to strings
* Nuking convert to strings
* Verify it works on exp-01
* Add dask profile options and add overwrite

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
This pull request includes several changes to the nemo_curator/modules/fuzzy_dedup.py file, focusing on removing the convert_str_ids functionality, optimizing performance, and improving logging. The most important changes are:

Removal of convert_str_ids functionality:
Removed the convert_str_ids parameter and its associated logic from the __init__ method and other methods in nemo_curator/modules/fuzzy_dedup.py. [1] [2] [3] [4] [5] [6] This is done because cuDF now supports long strings, so we no longer need to convert string ids to integer ids.

Performance optimizations:
- Decreased the block size for reading parquet files in _write_dedup_parsed_id [1] to allow drop_duplicates (which has a large memory overhead, 16x+) to scale without OOMs. This lets us run CC at larger scales without requiring more hardware.
- Increased the chunk size in the _write_encoded_jaccard_pair method to improve merge performance: with larger base chunks we get bigger transfers, so transfer throughput over TCP is better. [2]
- Updated the _run_connected_components method to initialize Comms with p2p=False.
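The block-size reasoning can be sketched numerically: if drop_duplicates can transiently need on the order of 16x the input block in memory, the safe block size is bounded by the memory budget divided by that factor (the numbers below are illustrative, not from the PR):

```python
# Illustrative sizing for the drop_duplicates memory overhead mentioned above.
gpu_memory_gb = 80          # hypothetical GPU memory budget
peak_overhead_factor = 16   # the "16x+" transient overhead of drop_duplicates
headroom = 0.5              # keep half the memory free for other buffers

# Largest parquet block that keeps the transient peak within budget:
max_block_gb = gpu_memory_gb * headroom / peak_overhead_factor
print(max_block_gb)  # -> 2.5
```

This is why shrinking the read block size lets the same hardware handle larger total inputs: the per-block peak, not the dataset size, is what must fit in memory.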
Merge improvements:
ddf_id column.

Timing comparison:
Main: 22m 10s
PR: 444.85 s

Dask profiles:
cc_profiles.zip

Logging improvements:
Added logging to the cc_workflow method and end-to-end time logging for the workflow. [1] [2]

Verify equal results:
Row counts: 376321911 vs 376321911
Check same ids: 376321911
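The equal-results check amounts to comparing row counts and id sets between the main-branch output and the PR output. A hedged sketch of such a check with toy data (the real verification ran over the full 376M-row outputs):

```python
def results_match(main_ids, pr_ids):
    """Compare two dedup outputs: same row count and same set of ids (sketch)."""
    if len(main_ids) != len(pr_ids):
        return False
    return set(main_ids) == set(pr_ids)

# Toy stand-ins for the id columns produced by the two runs.
main_ids = [101, 102, 103, 104]
pr_ids = [104, 103, 102, 101]  # row order may differ between runs
print(results_match(main_ids, pr_ids))  # -> True
```

Comparing sets rather than ordered lists matters because distributed writes give no ordering guarantee across partitions.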
CC: @ayushdg
Checklist