Document segmentation #199

Open
yixuan-qiao opened this issue Jul 6, 2021 · 6 comments

@yixuan-qiao commented Jul 6, 2021

Before segmenting each document into passages by applying a sliding window of 10 sentences with a stride of five, is there some extra pre-processing step, perhaps using regular expressions? We simply use the NLTK package to split sentences, and the passages we obtain are different from those in the released index. In some cases a semicolon is used as a split point, and in other cases sentences with a high ratio of digits appear to have been removed, but I suspect there is more that I haven't accounted for.
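
Roughly what we are doing at the moment (a minimal sketch of our own approach, not the released script; the passage-id format is just a guess):

import nltk

# Our current segmentation sketch: NLTK sentence splitting, then a sliding
# window of 10 sentences with a stride of 5.
def segment(doc_id, doc_text, window=10, stride=5):
    sentences = nltk.sent_tokenize(doc_text)  # requires the 'punkt' tokenizer data
    passages = []
    for i in range(0, len(sentences), stride):
        passage_id = f'{doc_id}#{len(passages)}'  # hypothetical id format
        passages.append((passage_id, ' '.join(sentences[i:i + window])))
        if i + window >= len(sentences):
            break
    return passages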

Would you mind sharing the data-processing script? Many thanks!

@MXueguang
Copy link
Member

Hi @yixuan-qiao

"the obtained passages is different from the one in the released index."

Which index are you looking at?

@yixuan-qiao (Author)

The index we use is msmarco-doc-per-passage; the command is:
searcher = SimpleSearcher.from_prebuilt_index('msmarco-doc-per-passage')

[screenshot: side-by-side comparison of the two passage texts]

The top one is ours, and the bottom one is extracted directly from the index.
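
For reference, the bottom passage was pulled out of the prebuilt index along these lines (a sketch; 'D1555982#0' is only a placeholder passage id, not the one in the screenshot):

from pyserini.search import SimpleSearcher

# Look up a stored passage directly in the prebuilt per-passage index.
searcher = SimpleSearcher.from_prebuilt_index('msmarco-doc-per-passage')
hit = searcher.doc('D1555982#0')  # placeholder docid
if hit is not None:
    print(hit.raw())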

@MXueguang (Member) commented Jul 6, 2021

Have you seen this repo? https://github.com/castorini/docTTTTTquery

"In comparison with per-passage expansion, we will use per passage without expansion as the baseline. In this method, we will not append the predicted queries to the passages."

in the docTTTTTquery repo.

@MXueguang (Member)

Basically we use the spaCy sentencizer; the spaCy version should be 2.1.6 IIRC.
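
Something along these lines (a rough sketch against the spaCy 2.1.x API, not the exact code in the conversion script):

import spacy

# Rule-based sentence splitting with the sentencizer (spaCy 2.1.x API);
# the actual script may set up the pipeline differently.
nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'))

doc = nlp('First sentence. Second one; still the second. Third sentence.')
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)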

@yixuan-qiao (Author)

I found the data-processing script and will try it right away. Awesome memory, thanks!

@yixuan-qiao (Author)

After carefully reading the script convert_msmarco_doc_to_t5_format.py, I found a constant of 10000: that is 10,000 characters, not tokens, which is small relative to document length (median: 584, max: 333757). Maybe the truncation is there for time efficiency?

# Excerpt from convert_msmarco_doc_to_t5_format.py: each document is truncated
# to its first 10,000 characters before sentence splitting with spaCy.
for doc_id, (doc_title, doc_text) in tqdm(corpus.items(), total=len(corpus)):
    doc = nlp(doc_text[:10000])
    sentences = [sent.string.strip() for sent in doc.sents]
