[REVIEW] Speedup Connected Components #302

VibhuJawa · 2024-10-15T01:45:50Z

This pull request includes several changes to the nemo_curator/modules/fuzzy_dedup.py file, focusing on removing the convert_str_ids functionality, optimizing performance, and improving logging.

The most important changes are:

Removal of `convert_str_ids` functionality:

Removed the convert_str_ids parameter and its associated logic from the __init__ method and other methods in nemo_curator/modules/fuzzy_dedup.py. [1] [2] [3] [4] [5] [6]

This is done because now we have longstrings support in cuDF so we no longer need to convert string to int ids

Performance optimizations:

Decreased the block size for reading parquet files in _write_dedup_parsed_id [1] to a lesser value to allow scaling of drop_duplicates (which has a big memory overhead 16x+ ) to prevent OOMs, this will allow us to run CC at larger scales without requiring more hardware.
Increased the chuck size in _write_encoded_jaccard_pair methods to improve merge performance, as with large base chunks, we have bigger transfers so the throughput of transfer is better on TCP [2]
Updated the _run_connected_components method to initialize Comms with p2p=False

Merge Improvements:

This PR optimizes the merge process by using an index-based approach instead of the previous batched method, while maintaining the broadcast merge.
The new method reduces shuffles to 2*num_batches - 1 through indexing.
The only additional operation is setting the index on the ddf_id column.

Main: 22m 10s
PR: 444.85 s

Dask Profiles:
cc_profiles.zip

Logging improvements:

Added start time logging in the cc_workflow method and end-to-end time logging for the workflow. [1] [2]

Verify Equal Results:

ddf_1 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache_pull_302/connected_components.parquet")
ddf_1 = ddf_1.repartition(npartitions=4).sort_values(by=['id', 'group'])
len(ddf_1)

376321911

ddf_2 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache/connected_components.parquet")
ddf_2 = ddf_2.repartition(npartitions=4).sort_values(by=['id', 'group'])

len(ddf_2)

376321911

Check same ids

merged_df = ddf_1[['id']].merge(ddf_2[['id']], on='id', how='inner')
len(merged_df)

376321911

CC: @ayushdg

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

nemo_curator/modules/fuzzy_dedup.py

nemo_curator/scripts/fuzzy_deduplication/connected_components.py

praateekmahajan

LGTM, would wait for @ayushdg to also TAL!
Great speedup 🎊

VibhuJawa · 2024-10-25T03:34:47Z

@ayushdg , Please take a look. Lets land this in soon. Have addressed all the changes

praateekmahajan · 2024-10-28T00:14:01Z

nemo_curator/modules/fuzzy_dedup.py

@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path):
                transform_divisions=False,
                align_dataframes=False,
            )
-            ddf.to_parquet(output_path, write_index=False)
+            ddf.to_parquet(output_path, write_index=False, overwrite=True)


Nice to see we're adding overwrites, though I thought we discussed not adding them sicne the fs removal might cause a slowdown. See this and hence this

cc @ayushdg (in case we're adding overwrites) I'll also add some in #321

So the only place we should not add this is the place is when false_positive check is False and we write approximately num_input_text_files//num_workers * num_output_buckets files which is what overwhelms removing a bunch of small files in OS.

Everywhere else if fair game and we should do it.

NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py

Lines 1176 to 1192 in 1cc6d7a

written_files = output_df.map_partitions(

write_partitioned_file,

output_path,

partition_on,

batch_label,

meta=cudf.Series([True]),

)

written_files = written_files.compute()

update_restart_offsets(output_path, bucket_part_offset, end_text_offset)

del output_df

print(

"Text-df partition ",

f"{end_text_offset}/{text_part_end_offset} "

f"completed in {time.time()-st_text}",

flush=True,

)

I think we're aligned here based on the updates in #321/#330.

Side note: The dynamics of overwriting/removing files might change when we start testing with cloud paths rather than local filesystems, but will leave that discussion for later.

ayushdg

Thanks!

nemo_curator/modules/fuzzy_dedup.py

ayushdg · 2024-10-29T19:56:58Z

nemo_curator/modules/fuzzy_dedup.py

@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path):
                transform_divisions=False,
                align_dataframes=False,
            )
-            ddf.to_parquet(output_path, write_index=False)
+            ddf.to_parquet(output_path, write_index=False, overwrite=True)


I think we're aligned here based on the updates in #321/#330.

Side note: The dynamics of overwriting/removing files might change when we start testing with cloud paths rather than local filesystems, but will leave that discussion for later.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Speedup fuzzy dedup by avoiding merge Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Remove unused function Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Clean up PR based on Praateeks reviews Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * style fixes Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * style fixes Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Remove dangling print Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Add handling for multiple columns Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Nuking convert to strings Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com> * Nuking convert to strings Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Verify it works on exp-01 Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> * Add dask profile options and add overwrite Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> --------- Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

VibhuJawa commented Oct 15, 2024

View reviewed changes

nemo_curator/modules/fuzzy_dedup.py Outdated Show resolved Hide resolved

VibhuJawa requested a review from ayushdg October 15, 2024 07:43

VibhuJawa marked this pull request as ready for review October 15, 2024 07:44

praateekmahajan reviewed Oct 15, 2024

View reviewed changes