bert-chunker: efficient and trained chunking for documents of any size

bert-chunker is a text chunker based on BERT with a classifier head that predicts the start token of each chunk (for use in RAG and similar pipelines). Using a sliding window, it cuts documents of any size into chunks. It is fine-tuned from nreimers/MiniLM-L6-H384-uncased; the whole training run took 10 minutes on an Nvidia P40 GPU with a 50 MB synthetic dataset. This repo includes code for model definition, dataset generation, training, and testing.
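The sliding-window idea can be illustrated with a minimal sketch. This is not the code in modeling_bertchunker.py: the function names, the `threshold` parameter, and the toy scorer that stands in for the BERT classifier head are all assumptions made for illustration, and the real implementation may pick cut points differently.

```python
from typing import Callable, List, Sequence


def chunk_with_sliding_window(
    tokens: Sequence[str],
    score_window: Callable[[Sequence[str]], List[float]],
    window_size: int = 8,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Split tokens into chunks using per-token chunk-start scores.

    score_window is a hypothetical stand-in for the classifier head: it
    returns one score per token in the window, where a high score means
    "this token starts a new chunk".
    """
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + window_size]
        scores = score_window(window)
        # Skip index 0: it is already the start of the current chunk.
        cut = next(
            (i for i in range(1, len(window)) if scores[i] > threshold),
            None,
        )
        if cut is None:
            # No boundary predicted in this window: emit it whole.
            cut = len(window)
        chunks.append(list(window[:cut]))
        # The next window begins at the predicted chunk start.
        start += cut
    return chunks


def toy_scores(window: Sequence[str]) -> List[float]:
    # Toy scorer (assumption): a token starts a chunk if the previous
    # token ends a sentence. A trained model would replace this.
    return [
        1.0 if i > 0 and window[i - 1].endswith(".") else 0.0
        for i in range(len(window))
    ]


tokens = "The cat sat. It purred. Then it slept on the warm mat all day long.".split()
chunks = chunk_with_sliding_window(tokens, toy_scores, window_size=8)
for chunk in chunks:
    print(" ".join(chunk))
```

Because each window starts where the previous chunk ended, the scorer only ever sees `window_size` tokens at a time, which is why the input document can be arbitrarily long.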

Generate dataset

See generate_dataset.ipynb

Train from the base model all-MiniLM-L6-v2

Run

bash train.sh

Inference

See test.py

Citation

If you find this work helpful, please cite it as:

@article{BertChunker,
  title={BertChunker: Efficient and Trained Chunking for Unstructured Documents}, 
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/BertChunker}
}

