Intro example 1 #718

Merged
merged 10 commits into from
Oct 21, 2024

sujee
Contributor

@sujee sujee commented Oct 17, 2024

Why are these changes needed?

This example showcases some of the useful transforms of DPK.

PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings

  • Code runs on Colab (single click, no setup required) and in a local Python dev environment
  • Includes both Python and Ray notebooks
  • Works on simple PDF input, so the user can track the transformations along the way
  • I created these so I could understand the transformations. I also used them during a workshop, where they were well received
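
To make the middle of the pipeline concrete, here is a minimal plain-Python sketch of the chunk, exact dedupe, and fuzzy dedupe steps. It does not use the DPK transforms or their API. The function names, the paragraph-based chunking, and the 0.8 threshold are all assumptions made for illustration, and plain word-set Jaccard similarity stands in for whatever the toolkit's fuzzy dedupe does internally.

```python
import hashlib
import re


def split_into_chunks(text):
    """Treat each blank-line-separated paragraph as one chunk (illustrative choice)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def exact_dedupe(chunks):
    """Keep the first occurrence of each chunk, comparing by SHA-256 of the text."""
    seen, kept = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def jaccard(a, b):
    """Jaccard similarity between the lowercase alphanumeric word sets of two strings."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def fuzzy_dedupe(chunks, threshold=0.8):
    """Drop a chunk if it is at least `threshold` similar to a chunk already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept


# Hypothetical sample text; not taken from the input/solar-system files.
text = (
    "The Sun is the star at the center of the solar system.\n\n"
    "Earth is the third planet from the Sun.\n\n"
    "The Sun is the star at the center of the solar system.\n\n"
    "Earth is the third planet from the Sun!\n\n"
    "Mars is the fourth planet from the Sun."
)

chunks = split_into_chunks(text)
after_exact = exact_dedupe(chunks)
after_fuzzy = fuzzy_dedupe(after_exact)
print(len(chunks), len(after_exact), len(after_fuzzy))  # prints: 5 4 3
```

Exact dedupe removes the repeated Sun paragraph, and the fuzzy step then removes the Earth paragraph that differs only in punctuation. The pairwise comparison here is quadratic in the number of chunks, which is fine for a handful of chunks but would not scale to a large corpus.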

For reviewers

  • Current URLs (Colab links, image links, etc.) point to my fork of data prep kit, so the code works on Colab and can be reviewed. Once the review is concluded, I will update the URLs to point to the main repo.
  • Current location is examples/notebooks/intro
  • The input is synthetic (generated) data, checked into input/solar-system. I hope this is OK

Related issue number (if any).

sujee added 4 commits October 15, 2024 23:19
Signed-off-by: Sujee Maniyam <sujee@sujee.net>
Signed-off-by: Sujee Maniyam <sujee@sujee.net>
Signed-off-by: Sujee Maniyam <sujee@sujee.net>
@shahrokhDaijavad
Member

@sujee This is a nice introductory example. I was able to run the Python version of this on Colab, but because fuzzy dedup is only available with Ray, I cannot see whether fuzzy dedup actually reduces the number of chunks. The Ray version, on the other hand, fails with a "ray job failed" error in pdf2parquet (before reaching the doc id error in issue #719), so let's wait and see if PR #721 fixes the Ray issues.

@sujee
Contributor Author

sujee commented Oct 17, 2024

Thanks for reviewing @shahrokhDaijavad

1 - The Ray version erroring on the pdf2pq step is due to downloaded model cleanup, I believe: #667
Is there a fix in the works for this?

2 - Are we good on the location of this example: examples/notebooks/intro?

3 - Yes, fuzzy dedupe will remove a similar chunk. So that's nice to see :-)

@shahrokhDaijavad
Member

@sujee

  1. For the ray version, I see 3 errors:
     • [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
     • ERROR - Exception during execution out of 2 created actors only 1 alive
     • ERROR - Exception during execution 'processing_time'
  2. The location of this intro example is good.
  3. Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

@sujee
Contributor Author

sujee commented Oct 17, 2024

  1. For the ray version, I see 3 errors. One is [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno

confirming:

1A. [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667
Possible fix : add MultiLock class #693

1B. [Bug] one of the created Ray actors die during docid transform #722

1C. Related to #722 above.
[Bug] docid ray transformation errors when running on colab (release 0.2.2dev1) #719
possible fix: Fix metadata logging even when actors crash #721
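
As a rough illustration of the kind of guard that 1A points toward, the sketch below serializes a first-time model download behind an inter-process file lock. It is not the MultiLock class from #693, and it assumes the #667 failure comes from several workers downloading or cleaning up the same model directory at once, which this thread suggests but does not confirm. `file_lock`, `ensure_model`, and `fake_download` are hypothetical names, and `fcntl` makes this Unix-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def ensure_model(model_dir, download):
    """Run download(model_dir) at most once, even if several workers call this together."""
    os.makedirs(model_dir, exist_ok=True)
    marker = os.path.join(model_dir, ".download_complete")
    with file_lock(os.path.join(model_dir, ".lock")):
        # Only the first caller to get the lock sees a missing marker and downloads.
        if not os.path.exists(marker):
            download(model_dir)
            open(marker, "w").close()
    return model_dir


calls = []


def fake_download(target_dir):
    """Stand-in for a real model download (hypothetical)."""
    calls.append(target_dir)
    with open(os.path.join(target_dir, "model.bin"), "w") as f:
        f.write("weights")


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    ensure_model(model_dir, fake_download)
    ensure_model(model_dir, fake_download)  # second call finds the marker and skips
    download_count = len(calls)

print(download_count)  # prints: 1
```

The marker file is written only after the download finishes, so a worker that acquires the lock later either sees a complete model or performs the download itself.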

  1. Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

Yes, completes on local dev env ✅

Signed-off-by: Sujee Maniyam <sujee@sujee.net>
@sujee
Contributor Author

sujee commented Oct 18, 2024

Updated using DPK release 0.2.1

Note: Once merged, I will do a follow-up PR to update the URLs to reflect the main repo.

Member

@shahrokhDaijavad shahrokhDaijavad left a comment

@sujee As we discussed, this reversion from release 0.2.2 to 0.2.1 is just for the AI summit demo, and we should solve the Ray issues with 0.2.2 after the demo.

shahrokhDaijavad and others added 5 commits October 21, 2024 08:02
Co-authored-by: Maroun Touma <touma@us.ibm.com>
pip install in 2 lines
Python only needs data-prep-toolkit
We still need data-prep-toolkit, and the ray version of transforms
We need transforms only for ray version
@matouma matouma left a comment

Looks good to me! Thanks

@touma-I touma-I merged commit c90017a into IBM:dev Oct 21, 2024
1 check passed
@sujee sujee deleted the intro-example1 branch November 5, 2024 06:37
4 participants