Intro example 1 #718

sujee · 2024-10-17T06:35:28Z

Why are these changes needed?

This example showcases some of the useful transforms of DPK.

PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings

Code runs on Colab (single click, no setup required) and in a local Python dev environment.
Includes both Python and Ray notebooks.
Works on simple PDF input, so the user can track the transformations along the way.
I created these so I could understand the transformations. I also used them during a workshop, where they were well received.
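
For readers skimming this PR, the chunk and exact-dedupe stages of the pipeline can be illustrated with a toy sketch. This is not the DPK transform code: `chunk_text`, `exact_dedupe`, the word-count chunking rule, and the sample sentences are all hypothetical, chosen only to show the idea of dropping chunks whose text is byte-for-byte identical.

```python
import hashlib


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Toy chunker (assumption): split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def exact_dedupe(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing SHA-256 digests of the text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Hypothetical stand-ins for text already extracted from PDFs.
docs = [
    "Earth is the third planet from the Sun.",
    "Earth is the third planet from the Sun.",  # exact duplicate
    "Mars is the fourth planet from the Sun.",
]
chunks = [c for d in docs for c in chunk_text(d, max_words=50)]
unique_chunks = exact_dedupe(chunks)
print(len(chunks), "->", len(unique_chunks))
```

Hashing keeps memory proportional to the number of distinct chunks rather than their total size, which is why exact dedupe is usually the cheap first pass before any similarity-based step.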

For reviewers

Current URLs (Colab links, image links, etc.) point to my fork of data prep kit. This way the code will work on Colab and can be reviewed. Once the review is concluded, I will update the URLs to point to the main repo.
The current location is examples/notebooks/intro.
The input is synthetic (generated) data, checked into input/solar-system. I hope this is OK.

Related issue number (if any).

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

shahrokhDaijavad · 2024-10-17T17:17:52Z

@sujee This is a nice introductory example. I was able to run the Python version of this on Colab, but because fuzzy dedup is only available with Ray, I cannot see whether fuzzy dedup actually reduces the number of chunks. On the other hand, testing the Ray version gives a "ray job failed" error in pdf2parquet (before getting to the doc id error in issue #719), so let's wait and see whether PR #721 fixes the Ray issues.

sujee · 2024-10-17T18:03:08Z

Thanks for reviewing @shahrokhDaijavad

1 - The Ray version erroring on the pdf2pq step is due to downloaded model cleanup, I believe: #667
Is there a fix in the works for this?

2 - Are we good on the location of this example: examples/notebooks/intro?

3 - Yes, fuzzy dedupe will remove a similar chunk. So that's nice to see :-)
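
To make point 3 concrete, here is a minimal sketch of why a near-identical chunk gets removed. It is an assumption-laden illustration, not the DPK fuzzy dedup transform: it uses brute-force Jaccard similarity over word 3-grams, whereas production fuzzy dedup is typically built on MinHash and LSH. The function names, the 0.7 threshold, and the sample chunks are hypothetical.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word k-grams in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedupe(chunks: list[str], threshold: float = 0.7, k: int = 3) -> list[str]:
    """Keep a chunk only if it is below the threshold against every chunk kept so far."""
    kept = []
    kept_shingles = []
    for chunk in chunks:
        current = shingles(chunk, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(current)
    return kept


# Hypothetical chunks: the second differs from the first only in its last word.
sample_chunks = [
    "The Sun is the star at the center of the Solar System and it provides light and heat",
    "The Sun is the star at the center of the Solar System and it provides light and warmth",
    "Mars is known as the red planet",
]
deduped = fuzzy_dedupe(sample_chunks)
print(len(sample_chunks), "->", len(deduped))
```

Exact dedupe would keep both Sun chunks because their hashes differ; the similarity check is what catches the near-duplicate.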

shahrokhDaijavad · 2024-10-17T18:33:22Z

@sujee

For the ray version, I see 3 errors.
One is [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
The second one is this: ERROR - Exception during execution out of 2 created actors only 1 alive
The third one is: ERROR - Exception during execution 'processing_time'
The location of this intro example is good.
Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

sujee · 2024-10-17T19:03:22Z

For the ray version, I see 3 errors. One is [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno

confirming:

1A. [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667
Possible fix : add MultiLock class #693

1B. [Bug] one of the created Ray actors die during docid transform #722

1C. Related to #722 above.
[Bug] docid ray transformation errors when running on colab (release 0.2.2dev1) #719
possible fix: Fix metadata logging even when actors crash #721
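
On 1A, the general failure mode (several workers downloading the same model into a shared cache for the first time and tripping over each other's temporary files) can be avoided by serializing the first download. The sketch below is only an illustration of that idea using a lock file created with `O_CREAT | O_EXCL`; it is not the MultiLock class from #693, and `download_once`, `fake_fetch`, and the file names are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path


def download_once(target: Path, fetch, poll_seconds: float = 0.05, timeout: float = 30.0) -> Path:
    """Run fetch(tmp_path) at most once for target, even if several processes call this."""
    lock_path = target.with_name(target.name + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        if target.exists():
            return target
        try:
            # Atomic create: exactly one caller succeeds while the lock file exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {lock_path}")
            time.sleep(poll_seconds)
            continue
        try:
            if not target.exists():
                # Write to a per-process temp name, then rename into place atomically.
                tmp = target.with_name(f"{target.name}.{os.getpid()}.tmp")
                fetch(tmp)
                os.replace(tmp, target)
            return target
        finally:
            os.close(fd)
            os.unlink(lock_path)


calls = []


def fake_fetch(path: Path) -> None:
    """Stand-in for a real model download (assumption)."""
    calls.append(path)
    Path(path).write_text("model-bytes")


with tempfile.TemporaryDirectory() as cache_dir:
    model_path = Path(cache_dir) / "model.bin"
    download_once(model_path, fake_fetch)
    download_once(model_path, fake_fetch)  # second call finds the file and skips fetch
    content = model_path.read_text()
    lock_left_behind = (Path(cache_dir) / "model.bin.lock").exists()

print(len(calls), content, lock_left_behind)
```

One known limitation of this approach: if the process holding the lock crashes, the lock file stays behind and other callers wait until the timeout, so a real fix would need stale-lock handling.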

Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

Yes, completes on local dev env ✅

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

sujee · 2024-10-18T20:37:42Z

Updated using DPK release 0.2.1

Note: Once merged, I will do a follow-up PR to update the URLs to point to the main repo.

shahrokhDaijavad

@sujee As we discussed, this reversion from release 0.2.2 to 0.2.1 is just for the AI summit demo, and we should solve the Ray issues with 0.2.2 after the demo.

examples/notebooks/intro/README.md

Co-authored-by: Maroun Touma <touma@us.ibm.com>

pip install in 2 lines

Python only needs data-prep-toolkit

We still need data-prep-toolkit, and the ray version of transforms

We need transforms only for ray version

matouma

Looks good to me! Thanks

sujee added 4 commits October 15, 2024 23:19

DPK intro example v1

bd72423

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

DPK intro example v2

41e1d52

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

Fixing URLs

96d6808

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

fix colab url

970c22a

intro examples using DPK release 0.2.1

469a90e

Signed-off-by: Sujee Maniyam <sujee@sujee.net>

shahrokhDaijavad approved these changes Oct 18, 2024


touma-I reviewed Oct 21, 2024

examples/notebooks/intro/README.md

shahrokhDaijavad and others added 5 commits October 21, 2024 08:02

Update examples/notebooks/intro/README.md

27e7134

Co-authored-by: Maroun Touma <touma@us.ibm.com>

Update README.md

71e0dc2

pip install in 2 lines

Update dpk_intro_1_python.ipynb

b3acad2

Python only needs data-prep-toolkit

Update dpk_intro_1_ray.ipynb

b236dc0

We still need data-prep-toolkit, and the ray version of transforms

Update dpk_intro_1_ray.ipynb

4d070ca

We need transforms only for ray version

matouma approved these changes Oct 21, 2024


touma-I merged commit c90017a into IBM:dev Oct 21, 2024
1 check passed

sujee deleted the intro-example1 branch November 5, 2024 06:37
