GitHub - liuenda/preprocess-Reuter: financial text mining: pre-processing with Reuter news in English and Japanese

README.txt
clean2_tag-EN.py
clean2_tag-JP.py
clean_tag-EN.py
cmd_tagging_stanford.bat
old-clean_tag-EN.py
preclean1.py
preclean2-EN.py
preclean2-JP.py
shell_tagging_stanford.sh
tagging_mecab-JP.py
tagging_nltk-EN.py

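2016/8/12
Open questions and points to handle later:
. How is phrase detection handled during preprocessing? No file for detecting and merging phrases was found, for either Japanese or English; please check.
.. Phrase detection will be done entirely in phrase_det.py in the modeling module; see the modeling module's readme file for details.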
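
For readers unfamiliar with the term, the following is a minimal sketch of what "detecting and merging phrases" can mean. It is NOT the content of phrase_det.py, which lives in the modeling module and is not described here. The function names (find_phrases, merge_phrases), the min_count threshold, and the "_" separator are all hypothetical, and the simple frequency rule is only one of many possible criteria.

```python
from collections import Counter


def find_phrases(sentences, min_count=2):
    """Return the set of adjacent word pairs seen at least min_count times.

    Hypothetical helper: a plain frequency threshold, chosen for illustration.
    """
    counts = Counter()
    for tokens in sentences:
        # zip(tokens, tokens[1:]) yields every adjacent pair in the sentence.
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_phrases(tokens, phrases, sep="_"):
    """Greedily join detected pairs into single tokens, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    # Toy tokenized sentences, invented for this example.
    corpus = [
        ["central", "bank", "raises", "rates"],
        ["the", "central", "bank", "holds", "rates"],
    ]
    phrases = find_phrases(corpus)
    print([merge_phrases(s, phrases) for s in corpus])
```

Keeping detection and merging as separate steps means the phrase set can be computed once over the whole corpus and then applied to each document.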

Procedures for preprocessing
1. For English:
	preclean1.py
		IN:
			"en_jp_text_2014_64079.csv"
		OUT:
			output1 "removed_en.csv" 
			output2 "removed_jp.csv"
	preclean2-EN.py
		IN:
			"removed_en.csv"
		OUT:
			"removed2_en.csv"
	tagging_nltk-EN.py
		IN:
			"removed2_en.csv"
		OUT: 
			"tag_nltk_en.csv"
	clean_tag-EN.py (No JP version)
		IN:
			"tag_nltk_en.csv"
		OUT:
			"cleaned_tag_en.txt"
	clean2_tag-EN.py
		IN:
			"cleaned_tag_en.txt"
		OUT:
			"cleaned2_tag_en.txt"
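
The English steps above form a chain in which each script reads a file written by an earlier one. A minimal sketch of that chain as data, with a check that no step reads a file nothing produced, might look like this. The script and file names are copied from the list above; EN_PIPELINE and missing_inputs are hypothetical names introduced only for this illustration, and the sketch does not run the scripts themselves.

```python
# Each stage: (script, input files, output files), copied from the list above.
EN_PIPELINE = [
    ("preclean1.py", ["en_jp_text_2014_64079.csv"], ["removed_en.csv", "removed_jp.csv"]),
    ("preclean2-EN.py", ["removed_en.csv"], ["removed2_en.csv"]),
    ("tagging_nltk-EN.py", ["removed2_en.csv"], ["tag_nltk_en.csv"]),
    ("clean_tag-EN.py", ["tag_nltk_en.csv"], ["cleaned_tag_en.txt"]),
    ("clean2_tag-EN.py", ["cleaned_tag_en.txt"], ["cleaned2_tag_en.txt"]),
]


def missing_inputs(pipeline, initial_files):
    """Return (script, file) pairs whose input is not available when the stage runs.

    A file is available if it is in initial_files or was written by an earlier stage.
    """
    available = set(initial_files)
    missing = []
    for script, inputs, outputs in pipeline:
        for name in inputs:
            if name not in available:
                missing.append((script, name))
        available.update(outputs)
    return missing


if __name__ == "__main__":
    problems = missing_inputs(EN_PIPELINE, ["en_jp_text_2014_64079.csv"])
    print(problems or "every stage reads a file produced earlier")
```

The same structure could describe the Japanese steps once the inputs and outputs of tagging_mecab-JP.py and clean2_tag-JP.py are filled in.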

2. For Japanese:
	preclean1.py
		IN:
			"en_jp_text_2014_64079.csv"
		OUT:
			output1 "removed_en.csv" 
			output2 "removed_jp.csv"
	preclean2-JP.py
		IN:
			"removed_jp.csv"
		OUT:
			"removed2_jp.csv"
	tagging_mecab-JP.py
		IN:
		OUT:
	clean2_tag-JP.py
		IN:
		OUT: