Document segmentation #199
Hi @yixuan-qiao
which index are you looking at?
see this repo? https://github.com/castorini/docTTTTTquery
in the docTTTTTquery repo
basically we use the spaCy sentencizer; the spaCy version should be 2.1.6 IIRC
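Something along these lines, as a minimal sketch assuming the spaCy 2.x API (the exact pipeline options used in the conversion script may differ):

```python
# Minimal sketch of rule-based sentence splitting with spaCy (2.x API, e.g. 2.1.6).
# The actual options in convert_msmarco_doc_to_t5_format.py may differ.
import spacy

nlp = spacy.blank("en")                       # no tagger/parser needed for sentence splitting
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # rule-based sentence boundary detection

doc = nlp("MS MARCO documents can be very long. They are first split into sentences.")
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```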
I found the data processing script and will try it immediately. Awesome memory, thanks!
Carefully reading the script convert_msmarco_doc_to_t5_format.py, I found a constant of 10000: 10,000 characters, not tokens, which is small relative to the length of the documents (median: 584, max: 333,757). Maybe this is for time efficiency?
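For reference, this is how I read that truncation (the constant and function names below are mine, not the script's):

```python
# How I read the character-level truncation (names are mine, not from the script).
MAX_DOC_CHARS = 10000  # characters, not tokens

def truncate_doc(doc_text):
    # Keep only the first 10,000 characters of each document before sentence splitting.
    return doc_text[:MAX_DOC_CHARS]
```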
Before segmenting each document into passages by applying a sliding window of 10 sentences with a stride of 5, is there maybe some extra pre-processing with regular expressions? We simply use the NLTK package to split sentences, and the passages we obtain differ from the ones in the released index. In some cases a semicolon is used for splitting, and in other cases sentences with a high proportion of numbers appear to have been removed, but I feel there is a lot more I haven't taken into account.
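Roughly, this is what we do on our side, as a sketch; the function and parameter names are my own, not taken from your scripts:

```python
# Rough sketch of our current NLTK-based segmentation (names are ours, not docTTTTTquery's).
from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')

def segment_document(doc_text, window=10, stride=5, max_chars=10000):
    """Split a document into overlapping passages of `window` sentences, sliding by `stride`."""
    sentences = sent_tokenize(doc_text[:max_chars])
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return passages
```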
Would you mind sharing your data processing script? Many thanks!