Many terms have underscores in them #9

danluu · 2016-12-06T20:19:29Z

For example:

6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college

We also have terms with double underscores that appear to be some kind of metadata?

868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__

As well as weird terms that have even more underscores:

b614cd7474e25139,1,1,5.76591e-07,f___
22ed3514efd6df2a,1,1,4.61273e-07,o___y
3ea21b09f892bac0,1,1,4.61273e-07,mother______
c4767d3137687cf6,1,1,9.6262e-07,i_______________________________________
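
The three kinds of terms above can be told apart mechanically. The following is a minimal sketch, not part of BitFunnel: it assumes only that the term is the last comma-separated field of each row, and the bucket names (`wrapped`, `run`, `joined`) and the `classify` and `bucket` helpers are hypothetical.

```python
import re

# Sample rows copied from this issue; the last comma-separated field is the term.
ROWS = """\
6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college
868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
b614cd7474e25139,1,1,5.76591e-07,f___
3ea21b09f892bac0,1,1,4.61273e-07,mother______
""".splitlines()


def classify(term):
    """Hypothetical bucketing of a term by its underscore pattern."""
    if re.fullmatch(r"__[a-z]+__", term):
        return "wrapped"  # letters wrapped in double underscores, e.g. __notoc__
    if re.search(r"_{2,}", term):
        return "run"      # any other run of two or more underscores
    if "_" in term:
        return "joined"   # single underscores between word parts
    return "plain"


def bucket(rows):
    """Group the terms of the given rows by their classify() bucket."""
    buckets = {}
    for row in rows:
        term = row.rsplit(",", 1)[1]
        buckets.setdefault(classify(term), []).append(term)
    return buckets


if __name__ == "__main__":
    for name, terms in bucket(ROWS).items():
        print(name, terms)
```

Counting terms per bucket over a full term table would show which pattern dominates before deciding what, if anything, to change in extraction.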


danluu · 2016-12-08T02:58:21Z

If you want examples from chunked1, that has:

3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__

MikeHopcroft · 2016-12-08T06:23:02Z

"______next" is actually on the page: https://en.wikipedia.org/?curid=11139
"x__x" is in page https://en.wikipedia.org/?curid=24782
"20__" is in page https://en.wikipedia.org/?curid=21481
"f___" is in page https://en.wikipedia.org/?curid=83530

The above examples seem to be correctly extracted.

"2c_thrissur" seems to come from a "%2c" (percent-encoded comma) in a URL. See, for example, https://en.wikipedia.org/wiki/File:GovemedcollegethrissurDistricthosp.JPG. The question here is whether URLs should be indexed. It seems that the word breaker split the word before the "%2c".
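
To illustrate how a term like "2c_thrissur" could arise, here is a small sketch. The URL fragment and the `naive_break` function are hypothetical: `naive_break` is a stand-in that splits on anything other than lowercase letters, digits, and underscores, and is not BitFunnel's actual word breaker.

```python
import re
from urllib.parse import unquote

# Hypothetical URL fragment containing a percent-encoded comma.
ENCODED = "Government_Medical_College%2C_Thrissur"


def naive_break(text):
    """Stand-in word breaker: lowercase, then split on any character
    that is not a lowercase letter, a digit, or an underscore."""
    return [t for t in re.split(r"[^0-9a-z_]+", text.lower()) if t]


if __name__ == "__main__":
    # Breaking the raw fragment splits at "%" and leaves "2c" stuck to the next word.
    print(naive_break(ENCODED))
    # Decoding first turns "%2C" into "," so the split falls on the comma instead.
    print(unquote(ENCODED))
    print(naive_break(unquote(ENCODED)))
```

Under these assumptions, percent-decoding before word breaking would avoid the stray "2c" prefix, although it does not settle whether URLs should be indexed at all.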

"noeditsection" matches documents 49483, 72692 (I only searched the first two chunks). The term seems to be Wikipedia markup. See, for example, the source code for https://en.wikipedia.org/?curid=49483 at https://en.wikipedia.org/w/index.php?title=Wikipedia:Ignore_all_rules&action=edit

In this case, it is debatable whether this Wikipedia markup should be extracted.
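
If the decision were to drop this markup, one option is to strip it before word breaking. The sketch below assumes that approach and covers only the three switches seen in this issue; `SWITCHES` and `strip_switches` are hypothetical names, and matching is case-insensitive only because the indexed terms appear in lowercase.

```python
import re

# Assumption: only the three switches that show up in this issue.
SWITCHES = ("NOTOC", "NOEDITSECTION", "NOGALLERY")
SWITCH_RE = re.compile(r"__(?:%s)__" % "|".join(SWITCHES), re.IGNORECASE)


def strip_switches(wikitext):
    """Replace each listed switch with a space; leave all other text alone."""
    return SWITCH_RE.sub(" ", wikitext)


if __name__ == "__main__":
    print(strip_switches("__NOTOC__ Intro text __NOEDITSECTION__").split())
    # Terms that really are on the page, such as "______next", are untouched.
    print(strip_switches("______next x__x 20__"))
```

An explicit list keeps terms like "x__x" and "______next", which were correctly extracted from page text, out of the filter.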
