Add example extractors #145

Closed
yenicelik wanted to merge 32 commits

Contributor

@yenicelik yenicelik commented Nov 21, 2023

Implemented and tested these extractors:

  • Invoice Extractor (using Donut)
  • Language Extractor (given the full text, splits it into chunks and predicts the language of each chunk)
  • Identity Extractor (hashes the input bytes into a vector using SHA-256; useful for quick duplicate checks)
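
To make the Identity Extractor idea concrete, here is a minimal sketch of hashing input bytes into a fixed-size vector. It is not the code from extractors/identity_hash_embedding.py; the function names and the byte-to-float scaling are assumptions chosen for illustration.

```python
import hashlib
from typing import List


def identity_hash_embedding(data: bytes) -> List[float]:
    """Map raw bytes to a fixed 32-dimensional vector via SHA-256.

    Hypothetical helper: the name and the [0, 1] scaling are illustrative
    assumptions, not taken from the PR.
    """
    digest = hashlib.sha256(data).digest()  # a SHA-256 digest is always 32 bytes
    # Scale each digest byte from 0..255 into the range [0, 1].
    return [byte / 255.0 for byte in digest]


def is_duplicate(a: bytes, b: bytes) -> bool:
    """Treat two inputs as duplicates when their hash vectors are identical."""
    return identity_hash_embedding(a) == identity_hash_embedding(b)
```

Because the vector is derived from a cryptographic hash, it only detects byte-for-byte duplicates: two inputs that differ by a single byte get unrelated vectors, so it is not a similarity embedding.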

How to reproduce:

(1) Package the extractor into a Docker image:

cargo run extractor package --dev -v --config-path extractors/simple_invoice_parser.yaml

(2) Run the extractor, mounting any files it requires:

docker run \
--mount type=bind,source="$(pwd)/data",target=/indexify/data \
-it \
--rm \
-e RUST_LOG=debug \
yenicelik/simple-invoice-parser extractor extract \
--file data/...

cargo run can be replaced by ./target/debug/indexify, or by indexify if the binary is on your PATH.

For maintainers & contributors:

These steps are usually not needed when you are working only on an extractor! Currently, the extractor-base image is pulled from DockerHub because of how BuildKit works (it does not use a local registry). Unfortunately, this seems tedious to resolve (see moby/buildkit#2343). If you need to modify the Rust code to run the extractors, please follow these steps:

  • create a DockerHub account
  • modify the credentials at the top of the Makefile
  • run make build-base-extractor-push; this builds the image and pushes it to DockerHub
  • modify dockerfiles/Dockerfiles.extractor to use your own base image in the FROM line (e.g. FROM yenicelik/indexify-extractor-base instead of FROM diptanu/indexify-extractor-base)
  • then continue with the "How to reproduce" section above

…er even for the tests (not most efficient, but this is what we want to test!)
Collaborator

@diptanu diptanu left a comment

Looks great overall! Thanks David <3
The main comment is to break down the PR into many commits, and move some code around.

.devcontainer/Dockerfile (outdated)
docs/docs/develop.md (outdated)
extractors/identity_hash_embedding.py
extractors/identity_hash_embedding.yaml
extractors/language_extractor.py (outdated)
setup.py (outdated)
src/extractors.rs (outdated)

@yenicelik
Contributor Author

Please make sure to merge #146 before this one.

@yenicelik
Contributor Author

I will move the extractors to https://github.com/tensorlakeai/indexify-extractors and remove them from diptanu/indexify.

@yenicelik yenicelik closed this Nov 23, 2023