diff --git a/README.md b/README.md
index 9b3b9fa..af2b663 100644
--- a/README.md
+++ b/README.md
@@ -31,10 +31,8 @@ img.show()
 ## Examples
-

 Image 1 Image 2
-

 See `main.py` or `ex.ipynb` for examples on how to draw the images.
@@ -50,12 +48,18 @@ pip install -r requirements.txt

 This algorithm works particularly well with documents that have a lot of diagrams and that are well spaced. It performs poorly on documents that are purely text-based (but there is usually no need to segment documents that are completely text-based just throw it into RAG directly). It could be interesting to detect situations like this and skip the segmentation step entirely for these sorts of pages.

-At the moment, I am looking to build out an ML model to determine when to split chunks in the page. The main principle would be to train a seq2seq model that outputs a binary sequence. The sequence input is the slices of the image and the output is a binary sequence where a 1 represents a split in the image and 0 otherwise.
+At the moment, I am looking to build out an ML model to determine when to split chunks in the page. The main principle would be to train a seq2seq model that outputs a binary sequence. The sequence input is the slices of the image and the output is a binary sequence where a 1 represents a split in the image and 0 otherwise. Basic training code setup can be found on my other [branch](https://github.com/johnathanchiu/recursive-segmentation/tree/jchiu/model-training-code/model).

 ### Limitations

 Like any bounding box segmentation algorithm, the main limitation is the shape of the segmentation. Edge cases arise when the input image is not necessarily framed in a grid-shape. Take an example where an image contains "L" shaped objects. This makes it impossible to segment out the "L" shaped object defined by a bounding box. If anyone has any ideas on how to improve this, please feel free to suggest!

+For largely text-based PDFs, the results can look like this.
+
+Image 3
+
+I'm still looking for a solution so feel free to suggest any if you have ideas.
+
 ## Contributing

 Feel free to contribute to this repository through Pull Requests and Issues. Reach out to me if you have any ideas surrounding this that you want to discuss!
diff --git a/examples/outputs/somato_output.jpg b/examples/outputs/somato_output.jpg
new file mode 100644
index 0000000..c8bdfd7
Binary files /dev/null and b/examples/outputs/somato_output.jpg differ
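
The README text in this diff describes the planned model's output as a binary sequence over image slices, where a 1 represents a split and 0 otherwise. As a rough illustration of how such a sequence might be decoded into chunk ranges, here is a minimal sketch. The function name `split_flags_to_chunks` and the convention that a 1 at index `i` starts a new chunk at slice `i` are assumptions made for this example; neither comes from the repository.

```python
from typing import List, Tuple


def split_flags_to_chunks(flags: List[int]) -> List[Tuple[int, int]]:
    """Turn a binary split sequence into half-open (start, end) slice ranges.

    Assumed convention (hypothetical, not taken from the repository):
    flags[i] == 1 means a new chunk starts at slice i.
    """
    if not flags:
        return []

    chunks: List[Tuple[int, int]] = []
    start = 0
    for i, flag in enumerate(flags):
        # Close the current chunk when a split is flagged, but never
        # emit an empty chunk (e.g. a 1 at index 0).
        if flag == 1 and i > start:
            chunks.append((start, i))
            start = i

    # The final chunk always runs to the end of the sequence.
    chunks.append((start, len(flags)))
    return chunks


if __name__ == "__main__":
    # Seven slices with splits flagged at slices 2 and 4.
    print(split_flags_to_chunks([0, 0, 1, 0, 1, 0, 0]))
```

Each returned pair could then be mapped back to pixel rows by multiplying by the slice height, which depends on how the page image is sliced in the first place.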