Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Speedup Connected Components #302

Merged
merged 11 commits into from
Oct 29, 2024

Conversation

VibhuJawa
Copy link
Collaborator

@VibhuJawa VibhuJawa commented Oct 15, 2024

This pull request includes several changes to the nemo_curator/modules/fuzzy_dedup.py file, focusing on removing the convert_str_ids functionality, optimizing performance, and improving logging.

The most important changes are:

Removal of convert_str_ids functionality:

  • Removed the convert_str_ids parameter and its associated logic from the __init__ method and other methods in nemo_curator/modules/fuzzy_dedup.py. [1] [2] [3] [4] [5] [6]

This is done because now we have longstrings support in cuDF so we no longer need to convert string to int ids

Performance optimizations:

  • Decreased the block size for reading parquet files in _write_dedup_parsed_id [1] to a lesser value to allow scaling of drop_duplicates (which has a big memory overhead 16x+ ) to prevent OOMs, this will allow us to run CC at larger scales without requiring more hardware.

  • Increased the chuck size in _write_encoded_jaccard_pair methods to improve merge performance, as with large base chunks, we have bigger transfers so the throughput of transfer is better on TCP [2]

  • Updated the _run_connected_components method to initialize Comms with p2p=False

Merge Improvements:

  • This PR optimizes the merge process by using an index-based approach instead of the previous batched method, while maintaining the broadcast merge.
  • The new method reduces shuffles to 2*num_batches - 1 through indexing.
  • The only additional operation is setting the index on the ddf_id column.

Main: 22m 10s
PR: 444.85 s

image

Dask Profiles:
cc_profiles.zip

Logging improvements:

  • Added start time logging in the cc_workflow method and end-to-end time logging for the workflow. [1] [2]

Verify Equal Results:

ddf_1 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache_pull_302/connected_components.parquet")
ddf_1 = ddf_1.repartition(npartitions=4).sort_values(by=['id', 'group'])
len(ddf_1)

376321911

ddf_2 = dask_cudf.read_parquet("/raid/vjawa/rpv2_debug_cache/connected_components.parquet")
ddf_2 = ddf_2.repartition(npartitions=4).sort_values(by=['id', 'group'])

len(ddf_2)

376321911

Check same ids

merged_df = ddf_1[['id']].merge(ddf_2[['id']], on='id', how='inner')
len(merged_df)

376321911

CC: @ayushdg

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@VibhuJawa VibhuJawa marked this pull request as ready for review October 15, 2024 07:44
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from f424de2 to ea65bad Compare October 16, 2024 02:35
@VibhuJawa VibhuJawa changed the base branch from main to r0.3.0 October 16, 2024 02:37
@VibhuJawa VibhuJawa changed the base branch from r0.3.0 to main October 16, 2024 02:37
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from ea65bad to 5caa34a Compare October 16, 2024 02:46
@VibhuJawa VibhuJawa changed the title [WIP] Speedup Connected Components [REVIEW] Speedup Connected Components Oct 22, 2024
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from df62a1f to 8396237 Compare October 22, 2024 22:42
Copy link
Collaborator

@praateekmahajan praateekmahajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, would wait for @ayushdg to also TAL!
Great speedup 🎊

@VibhuJawa
Copy link
Collaborator Author

@ayushdg , Please take a look. Lets land this in soon. Have addressed all the changes

@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path):
transform_divisions=False,
align_dataframes=False,
)
ddf.to_parquet(output_path, write_index=False)
ddf.to_parquet(output_path, write_index=False, overwrite=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see we're adding overwrites, though I thought we discussed not adding them sicne the fs removal might cause a slowdown. See this and hence this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ayushdg (in case we're adding overwrites) I'll also add some in #321

Copy link
Collaborator Author

@VibhuJawa VibhuJawa Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only place we should not add this is the place is when false_positive check is False and we write approximately num_input_text_files//num_workers * num_output_buckets files which is what overwhelms removing a bunch of small files in OS.

Everywhere else if fair game and we should do it.

written_files = output_df.map_partitions(
write_partitioned_file,
output_path,
partition_on,
batch_label,
meta=cudf.Series([True]),
)
written_files = written_files.compute()
update_restart_offsets(output_path, bucket_part_offset, end_text_offset)
del output_df
print(
"Text-df partition ",
f"{end_text_offset}/{text_part_end_offset} "
f"completed in {time.time()-st_text}",
flush=True,
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're aligned here based on the updates in #321/#330.

Side note: The dynamics of overwriting/removing files might change when we start testing with cloud paths rather than local filesystems, but will leave that discussion for later.

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

nemo_curator/modules/fuzzy_dedup.py Show resolved Hide resolved
@@ -1566,20 +1561,12 @@ def _write_dedup_encoded_jaccard_pair(self, encoded_jaccard_pair_path):
transform_divisions=False,
align_dataframes=False,
)
ddf.to_parquet(output_path, write_index=False)
ddf.to_parquet(output_path, write_index=False, overwrite=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're aligned here based on the updates in #321/#330.

Side note: The dynamics of overwriting/removing files might change when we start testing with cloud paths rather than local filesystems, but will leave that discussion for later.

@ayushdg ayushdg added enhancement New feature or request gpuci Run GPU CI/CD on PR labels Oct 29, 2024
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch 2 times, most recently from 701da07 to 1384e17 Compare October 29, 2024 21:43
VibhuJawa and others added 9 commits October 29, 2024 16:19
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from 1384e17 to c122248 Compare October 29, 2024 23:24
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the vjawa/speedup_cc_in_fuzzy_dedup branch from c122248 to c7f822b Compare October 29, 2024 23:27
@VibhuJawa VibhuJawa merged commit 36fcf50 into NVIDIA:main Oct 29, 2024
3 checks passed
ayushdg pushed a commit to ayushdg/NeMo-Curator that referenced this pull request Oct 30, 2024
* Speedup fuzzy dedup by avoiding merge

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Remove unused function

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Clean up PR based on Praateeks reviews

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* style fixes

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* style fixes

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Remove dangling print

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Add handling for multiple columns

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Nuking convert to strings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Nuking convert to strings

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Verify it works on exp-01

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

* Add dask profile options and add overwrite

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

---------

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request gpuci Run GPU CI/CD on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants