Intro example 1 #718
Signed-off-by: Sujee Maniyam <sujee@sujee.net>
@sujee This is a nice introductory example. I was able to run the Python version of this on Colab, but because fuzzy dedup is only available with Ray, I cannot see whether fuzzy dedup has a positive effect on reducing the number of chunks. On the other hand, testing the Ray version gives a "ray job failed" error in pdf2parquet (before getting to the doc id error in issue #719), so let's wait and see if PR #721 fixes the Ray issues.
Thanks for reviewing @shahrokhDaijavad
1 - The Ray version erroring on the pdf2pq step is due to downloaded model cleanup, I believe: #667
2 - Are we good on the location of this example:
3 - Yes, fuzzy dedupe will remove a similar chunk. So that's nice to see :-)
Confirming:
1A. [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667
1B. [Bug] one of the created Ray actors die during docid transform #722
1C. Related to #722 above.
Yes, completes on local dev env ✅
Updated using DPK release 0.2.1.
Note: Once merged, I will do a follow-up PR to update the URLs to reflect the main repo.
@sujee As we discussed, this reversion from release 0.2.2 to 0.2.1 is just for the AI summit demo, and we should go and solve the Ray issues with 0.2.2 after the demo.
Co-authored-by: Maroun Touma <touma@us.ibm.com>
pip install in 2 lines
Python only needs data-prep-toolkit
We still need data-prep-toolkit, and the Ray version of transforms
We need transforms only for the Ray version
Looks good to me! Thanks
Why are these changes needed?
This example showcases some of the useful transforms of DPK.
PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings
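To make the two dedupe stages of the pipeline above concrete, here is a minimal sketch in plain Python. It does not use the DPK transforms or their API; the function names, the sample chunks, the 3-word shingles, and the 0.5 Jaccard threshold are all assumptions chosen for illustration. PDF extraction and embeddings are left out.

```python
import hashlib


def exact_dedupe(chunks):
    """Drop chunks whose text is byte-for-byte identical to an earlier chunk."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedupe(chunks, threshold=0.5):
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept = []
    kept_shingles = []
    for chunk in chunks:
        current = shingles(chunk)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(current)
    return kept


# Hypothetical chunks, as if already extracted from PDFs and split.
chunks = [
    "The Sun is the star at the center of the solar system",
    "The Sun is the star at the center of the solar system",  # exact duplicate
    "The Sun is the star at the center of our solar system",  # near duplicate
    "Mars is the fourth planet from the Sun",
]

after_exact = exact_dedupe(chunks)
after_fuzzy = fuzzy_dedupe(after_exact)
print(len(chunks), len(after_exact), len(after_fuzzy))
```

Exact dedupe only catches the identical copy; the near duplicate survives until the fuzzy stage compares shingle overlap. A pairwise comparison like this is quadratic in the number of chunks, so it is only suitable for small demos.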
For reviewers
examples/notebooks/intro as of now
input/solar-system. I hope this is ok
Related issue number (if any).