For example:
6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college
We also have terms with double underscores that appear to be some kind of metadata?
868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__
As well as weird terms that have even more underscores:
b614cd7474e25139,1,1,5.76591e-07,f___
22ed3514efd6df2a,1,1,4.61273e-07,o___y
3ea21b09f892bac0,1,1,4.61273e-07,mother______
c4767d3137687cf6,1,1,9.6262e-07,i_______________________________________
If you want examples from chunked1, it has:
3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__
"______next" is actually on the page: https://en.wikipedia.org/?curid=11139
"x__x" is in page https://en.wikipedia.org/?curid=24782
"20__" is in page https://en.wikipedia.org/?curid=21481
"f___" is in page https://en.wikipedia.org/?curid=83530
The above examples seem to be correctly extracted.
"2c_thrissur" seems to come from a "%2c" (a percent-encoded comma) in a URL. See, for example, https://en.wikipedia.org/wiki/File:GovemedcollegethrissurDistricthosp.JPG. The question here is whether URLs should be indexed. It seems that the word breaker split the word at the "%", leaving the "2c" attached to the word that follows.
"noeditsection" matches documents 49483 and 72692 (I only searched the first two chunks). The term seems to be Wikipedia markup. See, for example, the source code for https://en.wikipedia.org/?curid=49483 at https://en.wikipedia.org/w/index.php?title=Wikipedia:Ignore_all_rules&action=edit
In this case, it is debatable whether this Wikipedia markup should be extracted.
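To make the two observations above concrete, here is a minimal Python sketch. It is not the project's word breaker: `break_words`, `is_behavior_switch`, and the input string `raw` are all hypothetical, and the tokenization rule (lowercase, then keep runs of letters, digits, and underscores) is only an assumption that happens to reproduce the terms seen in the table. It shows how a percent-encoded comma could yield "2c_thrissur", and how terms shaped like MediaWiki behavior switches (`__NOTOC__` and friends) could be told apart from the legitimate underscore terms.

```python
import re
from urllib.parse import unquote

# Assumed word-breaking rule: lowercase, then keep runs of letters, digits, "_".
WORD = re.compile(r"[a-z0-9_]+")

# Shape of a lowercased MediaWiki behavior switch, e.g. __notoc__.
BEHAVIOR_SWITCH = re.compile(r"__[a-z]+__")


def break_words(text: str) -> list[str]:
    """Split text into terms under the assumed rule above."""
    return WORD.findall(text.lower())


def is_behavior_switch(term: str) -> bool:
    """True if the whole term looks like __word__."""
    return BEHAVIOR_SWITCH.fullmatch(term) is not None


# Hypothetical URL fragment containing a percent-encoded comma.
raw = "Government_Medical_College%2C_Thrissur"

# "%" is not a word character, so the term breaks there and "2c" survives.
print(break_words(raw))           # ['government_medical_college', '2c_thrissur']

# Decoding first turns "%2C" into ",", so no stray "2c" is produced.
print(break_words(unquote(raw)))  # ['government_medical_college', '_thrissur']

terms = ["__noeditsection__", "__notoc__", "__nogallery__",
         "______next", "x__x", "20__", "f___"]
kept = [t for t in terms if not is_behavior_switch(t)]
print(kept)                       # ['______next', 'x__x', '20__', 'f___']
```

Whether to percent-decode URLs before indexing, or to drop behavior switches at all, is still the open question here; the sketch only shows that both cases are easy to detect if we decide to treat them specially.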