UyghurTextResource

Uyghur text resources crawled from websites. Each root folder is named after the domain of the crawled website and contains three subfolders and one txt file, described below:

### data folder:

The original text content crawled from each web page (warning: this is raw, unprocessed text from the website).

### content folder:

The original Uyghur text from each web page (lines of text, split by spaces).

### dic folder:

The word list of each original web page, produced by word tokenization.
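As a rough illustration of how a per-page word list could be derived from space-separated text, here is a minimal Python sketch. It assumes plain whitespace splitting; the tokenizer actually used to build the dic folder is not documented here, and the function names `tokenize` and `page_word_list` are hypothetical.

```python
def tokenize(text):
    """Split text into words on any run of whitespace (assumed tokenization)."""
    return text.split()


def page_word_list(lines):
    """Return the words of a page in order of appearance, duplicates included."""
    words = []
    for line in lines:
        words.extend(tokenize(line))
    return words


if __name__ == "__main__":
    # Two sample lines of space-separated Uyghur text.
    sample = ["بۇ بىر سىناق", "بۇ بىر مىسال"]
    for word in page_word_list(sample):
        print(word)
```

Real tokenization of Uyghur text may also need to handle punctuation and mixed scripts, which this sketch does not attempt.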

### unique.txt file:

The list of unique words crawled from the entire website.
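The layout described above can be checked with a short Python sketch. It assumes that `unique.txt` is UTF-8 encoded with one word per line, which this README does not state; the function name `describe_site` and the example path are hypothetical.

```python
from pathlib import Path

# Subfolder names taken from the sections above.
EXPECTED_DIRS = ("data", "content", "dic")


def describe_site(site_dir):
    """Report missing subfolders and the unique-word count for one site folder."""
    site = Path(site_dir)
    missing = [name for name in EXPECTED_DIRS if not (site / name).is_dir()]
    unique_file = site / "unique.txt"
    words = []
    if unique_file.is_file():
        # Assumption: UTF-8 text, one word per line; blank lines are skipped.
        text = unique_file.read_text(encoding="utf-8")
        words = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "domain": site.name,
        "missing_dirs": missing,
        "unique_words": len(words),
    }


if __name__ == "__main__":
    # "example.com" stands in for any root folder of this repository.
    print(describe_site("example.com"))
```

Running it on a root folder gives a quick sanity check that the three subfolders exist before processing their contents.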