Document segmentation #199
Hi @yixuan-qiao
which index are you looking at?
see this repo? https://github.com/castorini/docTTTTTquery
in the docTTTTTquery repo
basically we use the spaCy sentencizer; the spaCy version should be 2.1.6 IIRC
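Something along these lines, as a minimal sketch assuming the spaCy 2.x API (the exact pipeline options used in the conversion script may differ):

```python
# Minimal sketch of rule-based sentence splitting with spaCy (2.x API, e.g. 2.1.6).
# The actual options in convert_msmarco_doc_to_t5_format.py may differ.
import spacy

nlp = spacy.blank("en")                       # no tagger/parser needed for sentence splitting
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # rule-based sentence boundary detection

doc = nlp("MS MARCO documents can be very long. They are first split into sentences.")
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```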
I found the data processing script and will try it immediately. Awesome memory, thanks!
Carefully reading the script convert_msmarco_doc_to_t5_format.py, I found a constant of 10000: 10,000 characters, not tokens, which is small relative to the length of the documents (median: 584, max: 333,757). Maybe this is for time efficiency?
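For reference, this is how I read that truncation (the constant and function names below are mine, not the script's):

```python
# How I read the character-level truncation (names are mine, not from the script).
MAX_DOC_CHARS = 10000  # characters, not tokens

def truncate_doc(doc_text):
    # Keep only the first 10,000 characters of each document before sentence splitting.
    return doc_text[:MAX_DOC_CHARS]
```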
Before segmenting each document into passages by applying a sliding window of 10 sentences with a stride of 5, is there maybe some extra pre-processing with regular expressions? We simply use the NLTK package to split sentences, and the passages we obtain differ from the ones in the released index. In some cases a semicolon is used for splitting, and in other cases sentences with a high proportion of numbers appear to have been removed, but I feel there is a lot more I haven't taken into account.
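Roughly, this is what we do on our side, as a sketch; the function and parameter names are my own, not taken from your scripts:

```python
# Rough sketch of our current NLTK-based segmentation (names are ours, not docTTTTTquery's).
from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')

def segment_document(doc_text, window=10, stride=5, max_chars=10000):
    """Split a document into overlapping passages of `window` sentences, sliding by `stride`."""
    sentences = sent_tokenize(doc_text[:max_chars])
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return passages
```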
Would you mind sharing your data processing script? Many thanks!