Note on training, testing, and evaluation with the Maximum Entropy approach.
I assume that you have already installed the Maximum Entropy Modeling Toolkit for Python and C++.
Download link: https://github.com/lzhang10/maxent
My installation note for maxent: maxent-install-note.txt
Prepare the data folders t1/ .. t10/ and the open test data in one folder.
In my case, I prepared them under ~/experiment/maxent/mypos/demo/.
The ls output is as follows:
lar@lar-air:~/experiment/maxent/mypos/demo$ ls
evaluate-all.sh evaluate.py otest otest.nopipe otest.nopipe.word t1 t10 t2 t3 t4 t5 t6 t7 t8 t9 test-all.sh train-all.sh
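How the t1/ .. t10/ folds were built is not shown in this note. Judging from the training log below (1000 lines for t1, 10000 lines for t10), they appear to be incremental slices of one tagged corpus. A minimal sketch under that assumption, using corpus.nopipe as a hypothetical name for the combined one-sentence-per-line tagged file:
#!/bin/bash
# Hypothetical fold-preparation sketch (not the script used for this note).
# corpus.nopipe is an assumed file name; train${i}.nopipe follows the log paths below.
for i in $(seq 1 10); do
    mkdir -p "t${i}"
    # fold ti = the first i*1000 tagged sentences
    head -n $(( i * 1000 )) corpus.nopipe > "t${i}/train${i}.nopipe"
done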
###############
Step1: Training
###############
lar@lar-air:~/experiment/maxent/mypos/demo$ time ./train-all.sh | tee maxent-training.log
LBFGS module not compiled in, use GIS insteadFirst pass: gather word frequency information
1000 lines
4020 words found in training data
Saving word frequence information to ./t1/train1.nopipe.wordfreq
Second pass: gather features and tag dict to be used in tagger
feature cutoff:10
rare word freq:5
1000 lines
34748 features found
600 words found in pos dict
Applying cutoff 10 to features
2531 features remained after cutoff
saving features to file ./t1/train1.nopipe.model.features
Saving tag dict object to ./t1/train1.nopipe.model.tagdict done
Third pass:training ME model...
1000 lines
training finished
saving tagger model to ./t1/train1.nopipe.model done
...
...
...
LBFGS module not compiled in, use GIS insteadFirst pass: gather word frequency information
1000 lines
2000 lines
3000 lines
4000 lines
5000 lines
6000 lines
7000 lines
8000 lines
9000 lines
10000 lines
15048 words found in training data
Saving word frequence information to ./t10/train10.nopipe.wordfreq
Second pass: gather features and tag dict to be used in tagger
feature cutoff:10
rare word freq:5
1000 lines
2000 lines
3000 lines
4000 lines
5000 lines
6000 lines
7000 lines
8000 lines
9000 lines
10000 lines
144072 features found
3318 words found in pos dict
Applying cutoff 10 to features
15096 features remained after cutoff
saving features to file ./t10/train10.nopipe.model.features
Saving tag dict object to ./t10/train10.nopipe.model.tagdict done
Third pass:training ME model...
1000 lines
2000 lines
3000 lines
4000 lines
5000 lines
6000 lines
7000 lines
8000 lines
9000 lines
10000 lines
training finished
saving tagger model to ./t10/train10.nopipe.model done
real 29m51.086s
user 29m50.148s
sys 0m1.036s
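train-all.sh itself is not reproduced in this note. A minimal sketch of such a loop, with TRAINER as a placeholder for the maxent POS-tagger training command from your own installation, and file names taken from the log paths above:
#!/bin/bash
# Hypothetical train-all.sh sketch; TRAINER and the argument order are
# placeholders, adjust them to the trainer shipped with your maxent install.
TRAINER="path/to/your/postagger-trainer"
for i in $(seq 1 10); do
    echo "Training fold t${i} ..."
    $TRAINER "./t${i}/train${i}.nopipe" "./t${i}/train${i}.nopipe.model"
done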
###############
Step2: Testing
###############
lar@lar-air:~/experiment/maxent/mypos/demo$ time ./test-all.sh | tee ./maxent-test-GIS.log
Start closed testing with ./t1/ctest1.nopipe.word ...
Finished!
Start open testing with ./otest.nopipe.word ...
Finished!
...
...
...
Start closed testing with ./t10/ctest10.nopipe.word ...
Finished!
Start open testing with ./otest.nopipe.word ...
Finished!
real 0m33.602s
user 0m33.444s
sys 0m0.116s
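test-all.sh is likewise not reproduced here. A minimal sketch, with TAGGER as a placeholder for the maxent POS-tagging command and output names matching the *.word.Tagged files checked in Step3:
#!/bin/bash
# Hypothetical test-all.sh sketch; TAGGER and its argument order are
# placeholders, adjust them to the tagger shipped with your maxent install.
TAGGER="path/to/your/postagger"
for i in $(seq 1 10); do
    echo "Start closed testing with ./t${i}/ctest${i}.nopipe.word ..."
    $TAGGER "./t${i}/train${i}.nopipe.model" "./t${i}/ctest${i}.nopipe.word" \
        > "./t${i}/ctest${i}.nopipe.word.Tagged"
    echo "Finished!"
    echo "Start open testing with ./otest.nopipe.word ..."
    $TAGGER "./t${i}/train${i}.nopipe.model" ./otest.nopipe.word \
        > "./t${i}/otest${i}.nopipe.word.Tagged"
    echo "Finished!"
done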
##########################
Step3: Remove blank lines
##########################
When I checked the output of the tagged files, I found blank lines between the tagged lines, as follows:
lar@lar-air:~/experiment/maxent/mypos/demo/t1$ head ctest1.nopipe.word.Tagged
ဂျပန်/n စကား/n ပြော/v လမ်းညွှန်/v သူ/n ရှိ/v ပါ/part သလား/part ။/punc
တံခါး/n က/ppm မ/part ပိတ်/v ဘူး/part ။/punc
ထို/adj အစည်းအရုံး/n သည်/ppm အမျိုးသား/n ရေး/part စိတ်ဓာတ်/n ပြင်းပြ/v ၍/conj နုပျို/n တက်ကြွ/v သည့်/part အခြေခံ/n လက္ခဏာ/n ရှိ/v ပြီး/conj အစည်းအရုံး/n ခေါင်းဆောင်/n များ/part သည်/ppm သက်ကြီး/n နိုင်ငံရေးသမား/n များ/part နှင့်/ppm မ/part တူ/v တမူထူးခြား/v ကြ/part ပါ/part သည်/ppm ။/punc
ဓာတုဗေဒ/n ပညာရပ်/n ပိုင်း/part က/ppm သုံးသပ်/n ရင်/conj လည်း/part ဓာတု/n ပညာရပ်/n ကို/ppm စ/v ခဲ့/part တဲ့/part ရှေး/adj ခေတ်/n အဂ္ဂိရတ်/n ဆရာကြီး/n တွေ/part ရဲ့/ppm ယမ်းငရဲမီး/n ၊/punc ကန့်ငရဲမီး/n ၊/punc ဆားငရဲမီး/n နဲ့/ppm ရွှေစားငရဲမီး/v ထုတ်လုပ်/v ပုံ/part နည်းစနစ်/n တွေ/part နဲ့/ppm ပုံစံ/n တူ/v တာ/part တွေ့/v ရ/part သည်/ppm ။/punc
နာရီ/n ဘယ်/pron မှာ/ppm ဝယ်/part လို့/part ရ/part နိုင်/part မလဲ/part ။/punc
Thus, we have to remove the blank lines before evaluation.
Run the bash shell script that I prepared, rm-blank-lines.sh:
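The script itself is not included in this note; a minimal sketch that strips the blank lines from every *.Tagged file and writes a *.Tagged.clean copy, matching the file names checked below, could look like this:
#!/bin/bash
# Hypothetical rm-blank-lines.sh sketch: drop blank (or whitespace-only)
# lines from each tagged output file and save the result with a .clean suffix.
for f in ./t*/*.nopipe.word.Tagged; do
    sed '/^[[:space:]]*$/d' "$f" > "${f}.clean"
    echo "Cleaned ${f}"
done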
lar@lar-air:~/experiment/maxent/mypos/demo$ ./rm-blank-lines.sh | tee clean.log
Check with head -n 3 ./t1/ctest1.nopipe.word.Tagged.clean
ဂျပန်/n စကား/n ပြော/v လမ်းညွှန်/v သူ/n ရှိ/v ပါ/part သလား/part ။/punc
တံခါး/n က/ppm မ/part ပိတ်/v ဘူး/part ။/punc
ထို/adj အစည်းအရုံး/n သည်/ppm အမျိုးသား/n ရေး/part စိတ်ဓာတ်/n ပြင်းပြ/v ၍/conj နုပျို/n တက်ကြွ/v သည့်/part အခြေခံ/n လက္ခဏာ/n ရှိ/v ပြီး/conj အစည်းအရုံး/n ခေါင်းဆောင်/n များ/part သည်/ppm သက်ကြီး/n နိုင်ငံရေးသမား/n များ/part နှင့်/ppm မ/part တူ/v တမူထူးခြား/v ကြ/part ပါ/part သည်/ppm ။/punc
Check with head -n 3 ./t1/otest1.nopipe.word.Tagged.clean
ဆယ်/n ရာခိုင်နှုန်း/n ဈေး/n လျှော့/v ပေး/part ရင်/conj ဝယ်/v မယ်/ppm ။/punc
ယခု/n လ/n ၏/ppm အထိမ်းအမှတ်/n ပန်း/n မှာ/ppm မြတ်လေးပန်း/n Pomeacoccinea/fw ဖြစ်/v သည်/ppm ။/punc
ကရင်/n ဗမာ/n အဓိကရုဏ်း/n သည်/ppm သူ့/pron အား/ppm များ/adj စွာ/part ဒေါမနဿ/n ဖြစ်/v စေ/part ပါ/part သည်/ppm ။/punc
==========
...
...
...
#################
Step4: Evaluation
#################
lar@lar-air:~/experiment/maxent/mypos/demo$ time ./evaluate-all.sh | tee evaluation1.log
### Evaluation for Maximum Entrophy train1 model ###
-with closed test data
Tag precision: 0.928698752228
-with open test data
Tag precision: 0.91736854086
### Evaluation for Maximum Entrophy train2 model ###
-with closed test data
Tag precision: 0.949188727583
-with open test data
Tag precision: 0.937445936718
### Evaluation for Maximum Entrophy train3 model ###
-with closed test data
Tag precision: 0.959365861167
-with open test data
Tag precision: 0.945686319144
### Evaluation for Maximum Entrophy train4 model ###
-with closed test data
Tag precision: 0.95881290905
-with open test data
Tag precision: 0.950876394264
### Evaluation for Maximum Entrophy train5 model ###
-with closed test data
Tag precision: 0.964024885042
-with open test data
Tag precision: 0.952651946278
### Evaluation for Maximum Entrophy train6 model ###
-with closed test data
Tag precision: 0.964196807732
-with open test data
Tag precision: 0.954154336444
### Evaluation for Maximum Entrophy train7 model ###
-with closed test data
Tag precision: 0.965183127705
-with open test data
Tag precision: 0.955338037787
### Evaluation for Maximum Entrophy train8 model ###
-with closed test data
Tag precision: 0.965802935943
-with open test data
Tag precision: 0.956203050307
### Evaluation for Maximum Entrophy train9 model ###
-with closed test data
Tag precision: 0.969310272094
-with open test data
Tag precision: 0.957933075347
### Evaluation for Maximum Entrophy train10 model ###
-with closed test data
Tag precision: 0.971141723091
-with open test data
Tag precision: 0.959298884589
real 0m1.310s
user 0m1.204s
sys 0m0.060s
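For reference, the tag precision reported above is presumably the fraction of tokens whose predicted tag matches the reference tag. evaluate.py is not reproduced in this note; a minimal sketch of that comparison with awk, assuming a hypothetical gold file name ./t1/ctest1.nopipe, identical tokenization in both files, and the tag being everything after the last '/':
#!/bin/bash
# Hypothetical precision check, NOT the actual evaluate.py.
GOLD="./t1/ctest1.nopipe"                        # assumed gold file name
TAGGED="./t1/ctest1.nopipe.word.Tagged.clean"
awk -v gold="$GOLD" '
{
    getline gline < gold                         # matching reference line
    n = split($0, hyp, " "); split(gline, ref, " ")
    for (i = 1; i <= n; i++) {
        total++
        ht = hyp[i]; sub(/.*\//, "", ht)         # predicted tag
        rt = ref[i]; sub(/.*\//, "", rt)         # reference tag
        if (ht == rt) correct++
    }
}
END { printf "Tag precision: %.6f\n", correct / total }
' "$TAGGED"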
FIN!