Many terms have underscores in them #9

Open
danluu opened this issue Dec 6, 2016 · 2 comments

danluu commented Dec 6, 2016

For example:

6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur
3e8f9e5769458e9f,1,1,4.61273e-07,government_medical_college

We also have terms with double underscores that appear to be some kind of metadata?

868661c0426526a7,1,1,0.000557102,__noeditsection__
a135c90cbb896da0,1,1,2.97521e-05,__notoc__
14a64ebade034c85,1,1,3.11359e-06,__nogallery__

As well as weird terms that have even more underscores:

b614cd7474e25139,1,1,5.76591e-07,f___
22ed3514efd6df2a,1,1,4.61273e-07,o___y
3ea21b09f892bac0,1,1,4.61273e-07,mother______
c4767d3137687cf6,1,1,9.6262e-07,i_______________________________________

danluu commented Dec 8, 2016

If you want examples from chunked1, it has:

3b15cf09a2fde054,1,1,6.59631e-05,______next
664c5c0a691d85f4,1,1,6.59631e-05,x__x
b1863a3a0b343641,1,1,6.59631e-05,20__

@MikeHopcroft

"______next" is actually on the page: https://en.wikipedia.org/?curid=11139
"x__x" is in page https://en.wikipedia.org/?curid=24782
"20__" is in page https://en.wikipedia.org/?curid=21481
"f___" is in page https://en.wikipedia.org/?curid=83530

The above examples seem to be correctly extracted.

"2c_thrissur" seems to come from a "%2c" (a percent-encoded comma) in a URL. See, for example, https://en.wikipedia.org/wiki/File:GovemedcollegethrissurDistricthosp.JPG. The question here is whether URLs should be indexed at all. It seems that the word breaker split the word just before the "%2c".
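
To illustrate how a split around "%2c" could yield these two terms, here is a minimal Python sketch. The `naive_break` function is a hypothetical stand-in, not the word breaker actually used here, and the `raw` string is an invented example of percent-encoded link text, not text copied from the page.

```python
import re
from urllib.parse import unquote


def naive_break(text):
    """Stand-in word breaker: split on anything other than letters, digits, or '_'."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z_]+", text) if t]


# Hypothetical link text, chosen to resemble a percent-encoded title.
raw = "Government_Medical_College%2C_Thrissur"

# Without decoding, the '%' acts as a separator and '2C' stays glued to the next word.
print(naive_break(raw))

# Decoding first turns '%2C' into ',', so the split happens at the comma instead.
print(naive_break(unquote(raw)))
```

Under these assumptions, the undecoded input gives "government_medical_college" and "2c_thrissur", matching the terms reported above. Decoding first removes the "2c" prefix but still leaves a leading underscore, so underscores would need their own handling.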

"noeditsection" matches documents 49483 and 72692 (I only searched the first two chunks). The term seems to be Wikipedia markup. See, for example, the source code for https://en.wikipedia.org/?curid=49483 at https://en.wikipedia.org/w/index.php?title=Wikipedia:Ignore_all_rules&action=edit

In this case, it is debatable whether this Wikipedia markup should be extracted.
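
The three kinds of terms discussed in this thread can be separated with a rough triage script. This is only a sketch: the category names, the regular expressions, and the assumption that the term is the last comma-separated field of each line are all mine, not part of the term table format.

```python
import re

# Lowercased MediaWiki behavior switches such as __notoc__ (assumed pattern).
MAGIC_WORD = re.compile(r"^__[a-z]+__$")
# Two hex digits followed by '_', as left behind by a split at '%' (heuristic).
PERCENT_RESIDUE = re.compile(r"^[0-9a-f]{2}_")


def term_of(line):
    """Return the term, assuming it is the last comma-separated field."""
    return line.rsplit(",", 1)[1].strip()


def classify(term):
    """Bucket a term into one of four hypothetical categories."""
    if MAGIC_WORD.match(term):
        return "magic_word"
    if PERCENT_RESIDUE.match(term):
        return "percent_residue"
    if "_" in term:
        return "underscore"
    return "plain"


lines = [
    "868661c0426526a7,1,1,0.000557102,__noeditsection__",
    "6d6b8015505c7099,1,1,4.61273e-07,2c_thrissur",
    "b614cd7474e25139,1,1,5.76591e-07,f___",
]
for line in lines:
    print(term_of(line), classify(term_of(line)))
```

The percent-residue check is only a heuristic. It also flags "20__", which per the comment above appears literally on the page, so flagged terms would still need checking against the source text.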
