Run MinHash dedup on Multi-Nodes #92

Closed
alielfilali01 opened this issue Jun 18, 2024 · 5 comments

Comments

@alielfilali01

Hello there,

First, I would like to extend my many thanks to you for setting up this amazing repo!

I'm currently working on a project that aims to release the largest clean Arabic text dataset; we now have about 313 billion tokens (Qwen tokenizer) gathered from multiple clean sources.
The quality itself (toxicity, URL cleaning, short sentences, ...) is not in question, so the major step we still have to perform is deduplication, and for that we chose your tool, which is both clean and straightforward!
One problem arises: because of the large size of the dataset, we cannot perform dedup on a single node (52 AMD CPUs) in a reasonable time! We therefore want to explore running the script in a multi-node setting, with 52 CPUs per node, on about 50 nodes.

Do you think we can do that with what is currently available in this repo? If so, I would appreciate your guidance on this matter.

Thank you again, and looking forward to hearing from you soon.

@ChenghaoMou
Owner

Hi @alielfilali01

Thanks for reaching out.

313 billion tokens sounds doable with a decent cluster, based on my experience. For reference, the Spark script was tested on a TB-level dataset with fewer than 20 nodes. Can I ask whether your hardware is on a commercial cloud platform or a local HPC? I am more than happy to jump on a call to discuss more details if you want.

Best,
Chenghao

@alielfilali01
Author

Hi dear @ChenghaoMou

Thank you so much for your quick response and your openness 🤗

Personally, I know very little about text dedup, so all I did was copy-paste the command from the main README file, pointing it at the directory of my dataset ... I have no idea how I could do it using Spark!
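
For reference, the command I copy-pasted has roughly this shape (the dataset path and output directory below are placeholders for my local data, and the exact flags may differ from the current README):

```bash
# Single-node MinHash dedup as shown in the repo README (flags approximate).
# "./my_arabic_dataset" and "./output/minhash" are placeholders.
python -m text_dedup.minhash \
  --path "./my_arabic_dataset" \
  --split "train" \
  --cache_dir "./cache" \
  --output "./output/minhash" \
  --column "text" \
  --batch_size 10000
```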
As for the hardware, it is a local HPC located on the UM6P campus in Morocco.

If you could walk me through the process of running MinHash dedup using Spark in a multi-node setting here, that would be great! This way I won't take too much of your time. Otherwise, if you believe a call would be better, I would be happy and honored to do it.

Thank you so much again 🤗

@ChenghaoMou
Owner

Thanks for the details. In this case, you might have at least two options:

  1. Try datatrove with its Slurm pipeline executor for deduplication with minimal HPC configuration and knowledge (see the sketch after this list).
  2. Set up Spark on your HPC by following the Spark cluster documentation (make sure you double-check the Spark script, because it may use different settings than the normal script). You should treat setting up the Spark cluster on the HPC and running the script (spark-submit) as separate steps; the latter is only one command once the cluster is ready (see the spark-submit sketch further below).
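
For option 1, a minimal sketch of the first stage (signature computation) with datatrove's Slurm executor could look something like the following. The folder paths, partition name, and task count are placeholders, and the API may have changed, so compare it against datatrove's own MinHash deduplication example before running it:

```python
# Sketch of stage 1 (MinHash signatures) of datatrove's multi-stage MinHash dedup,
# driven by its Slurm executor. Paths, partition, and task counts are placeholders.
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.dedup import MinhashDedupSignature

stage1 = SlurmPipelineExecutor(
    job_name="minhash_signatures",
    pipeline=[
        JsonlReader("/data/arabic_corpus/", text_key="text"),        # placeholder input folder
        MinhashDedupSignature(output_folder="/data/minhash/signatures"),
    ],
    tasks=500,                      # number of parallel Slurm tasks; tune to your cluster
    time="24:00:00",
    partition="cpu",                # placeholder partition name
    logging_dir="/data/minhash/logs/signatures",
)
stage1.run()

# The remaining stages (bucketing, clustering, filtering) follow the same pattern;
# see datatrove's minhash deduplication example for the full pipeline.
```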

There is some learning and tweaking involved in either method, but it is a one-time investment that you will find useful for future experiments. I suggest starting with a small cluster and a small dataset to make sure everything runs before scaling up. Unfortunately, for an HPC with limited access, I won't be able to help directly. Still, feel free to post any issues or questions, and I will do my best to answer them.
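
For option 2, once a standalone Spark cluster is running across your nodes, the submission step is roughly one command like the one below. The master URL, memory/core numbers, and script flags are placeholders; tune the resources to your 52-CPU nodes and double-check the arguments against minhash_spark.py in this repo:

```bash
# Sketch of submitting the MinHash Spark script to a standalone cluster.
# The master URL, resource numbers, and script arguments are placeholders;
# verify the exact flags against minhash_spark.py and the README.
spark-submit \
  --master spark://master-node:7077 \
  --num-executors 50 \
  --executor-cores 50 \
  --executor-memory 200g \
  --driver-memory 32g \
  text_dedup/minhash_spark.py \
  --input "/data/arabic_corpus" \
  --output "/data/arabic_corpus_dedup" \
  --column "text" \
  --threshold 0.7
```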

Best,
Chenghao

@alielfilali01
Author

Thank you so much dear @ChenghaoMou
I'll get back to you if I have any more questions.

@github-actions

Stale issue message

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 28, 2024