-
Notifications
You must be signed in to change notification settings - Fork 3
/
README
239 lines (180 loc) · 8.04 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
$Id$
stem-php -- a PHP extension that provides word stemming.
Version 1.5.0-dev.
23 Jan 2006
J Smith <jay@php.net>
Olivier Hill <ohill@php.net>
NOTE: There have been changes to some of the stemming algorithms,
so they are not 100% compatible with older versions of stem-php.
Everything works the same, but words stemmed with older versions
may not match with the current version. Be careful.
0. What is it?
Basically, a word stemmer takes a word and strips it of its suffix.
(Or suffixes.) Words like "assassination", "assassinations",
"assassinate" and "assassinated" all have the same base -- "assassin".
A word stemmer breaks each of the aforementioned words down to their
common base.
The stem extension for PHP provides this capability for a variety of
languages using Dr. M.F. Porter's Snowball API, which can be found at
http://snowball.tartarus.org
1. What is it good for?
The idea behind stemming is to increase the efficiency of information
retrieval by combining similar words. This has a couple of advantages
in, say, a typical search engine:
* You can reduce the size of your database/keyword list. There's no
reason to store "physics", "physicist" and "physical" when you
can just store "physic" and provide more or less the same results.
This can decrease the search time as well as decrease the physical
size of the database.
* Provide more generalized results. If a user searches on "assassins",
it's easy to provide them results with words like "assassinated",
"assassinations" and "assassin", since they all work out to the same
thing after stemming.
There are lots of uses for stemming in the IR world. You can read up
on them at Dr. Porter's Snowball site.
2. Why?
When I was first hired for my current job, I had to quickly pound
together a dynamic web site/application combined with a search engine.
I had about a month to do it. So I hammered it all together in PHP.
The search engine was pretty terrible at first, but eventually I
prettied it up, adding a lot of features like Boolean syntax, the ability
to run as a "daemon"-type service, and the ability to serve XML documents
for results, thus freeing it from it's web interface and making it
implementation independent.
Another feature I added after reading Dr. Porter's "An algorithm for
suffix stemming" article, first featured in "Program". (Although I read
it in "Readings in Information Retrieval", a collection of IR works by
K.S. Jones and P. Willett.)
My own implementation of Porter's stemmer went through several variations,
none of which were 100% correct, which wasn't a total surprise. (Porter's
site recants many malformed implementations.)
I finally hit it dead on after re-writing my crappy little stemmer in C
as a PHP module using standard POSIX regexes. After pumping some 50,000
words through the module and comparing it with Porter's own stemmer
(also written in C), I came up with identical results. (His was a bit
faster, though, which is to be expected considering the extra layers
of abstration PHP and Zend required.)
Eventually I realized we'd probably need a French stemmer, seeing as
I work for a Canadian company. So back to the drawing board I went.
By this time, Dr. Porter's Snowball project was well underway. After
discovering it, I decided to pack a bunch of languages into one PHP
extension -- stem.
3. How do you use it?
There are two basic prototypes for the stem extension:
string stem(string word [, int language])
string stem_LANGUAGE(string word)
Both prototypes have the same return values: the stemmed word on success,
and false should some kind of Snowball error occur. (An E_NOTICE is
also raised on various errors.)
In the first prototype, language can be one of the following:
STEM_DANISH
STEM_DUTCH
STEM_ENGLISH -- enhanced Porter stemming algorithm
STEM_FINNISH
STEM_FRENCH
STEM_GERMAN
STEM_HUNGARIAN
STEM_ITALIAN
STEM_NORWEGIAN
STEM_PORTER -- the original Porter stemming algorithm
STEM_PORTUGUESE
STEM_ROMANIAN
STEM_RUSSIAN
STEM_RUSSIAN
STEM_SPANISH
STEM_SWEDISH
STEM_TURKISH
By default, STEM_PORTER is used.
In the second prototype variation, you can use the following forms:
stem_danish
stem_dutch
stem_english
stem_finnish
stem_french
stem_german
stem_hungarian
stem_italian
stem_norwegian
stem_porter
stem_portuguese
stem_romanian
stem_russian
stem_russian
stem_spanish
stem_swedish
stem_turkish
Each function will accept and return strings encoded in UTF-8.
More languages may follow, and perhaps some aliases (like stem_francais()
will be equivalent to stem_french(). Who knows.)
A quick example:
<?php
print "Original porter (default): assassinations -> " . stem("assassinations") . "\n";
print "English: devestating -> " . stem("devestating", STEM_ENGLISH) . "\n";
print "Finnish: innostuneissa -> " . stem("innostuneissa", STEM_FINNISH) . "\n";
print "French: majestueusement -> " . stem("majestueusement", STEM_FRENCH) . "\n";
print "Spanish: chicharrones -> " . stem("chicharrones", STEM_SPANISH) . "\n";
print "Dutch: lichamelijkheden -> " . stem("lichamelijkheden", STEM_DUTCH) . "\n";
print "Danish: undersøgelsen -> " . stem_danish("undersøgelsen") . "\n";
print "German: aufeinanderschlügen -> " . stem_german("aufeinanderschlügen") . "\n";
print "Italian: pronunciamento -> " . stem_italian("pronunciamento") . "\n";
print "Norwegian: havnemyndighetene -> " . stem_norwegian("havnemyndighetene") . "\n";
print "Portuguese: quilométricas -> " . stem_portuguese("quilométricas") . "\n";
print "Swedish: klostergården -> " . stem_swedish("klostergården") . "\n";
?>
Output:
Original porter (default): assassinations -> assassin
English: devestating -> devest
Finnish: innostuneissa -> innostun
French: majestueusement -> majestu
Spanish: chicharrones -> chicharron
Dutch: lichamelijkheden -> licham
Danish: undersøgelsen -> undersøg
German: aufeinanderschlügen -> aufeinanderschlüg
Italian: pronunciamento -> pronunc
Norwegian: havnemyndighetene -> havnemyndighet
Portuguese: quilométricas -> quilométr
Swedish: klostergården -> klostergård
4. How do you compile it?
You compile it as you would any other PHP extension.
On UNIX-like systems:
0. Unpack the source.
1. Run phpize.
2. You can optionally enable/disable languages using
--enable-LANGUAGE=(yes|no) style arugments.
3. Run "make" and "make install".
Alternatively, you can install using the pecl command thusly
as root:
# pecl install stem-x.x.x.tar.gz
Or using the PECL channel, which will pull the latest version:
# pecl install stem
On Windows systems it's a bit trickier, so you might want to
just download the extension from http://pecl4win.php.net/.
If you must build it, you've going to have to do some work to
get things running.
For PHP 4.x:
0. Unpack the source. It's probably best to unpack it under
c:\path\to\php-src\ext, as the paths for the headers and
libraries are set up to look for them in that general area.
1. Open up the VC++ 6 workspace provided. You will likely have
to adjust the project settings to point to the proper
directories for the PHP includes and library files. Look
under Project->Settings for those paths.
2. Build the DLL and move it to the proper extension directory
for your system.
3. To use the extension, either use the dl() function to load
the DLL or use the extension=php_stem.dll directive in your
php.ini file.
For PHP 5.x and above:
0. Unpack the source into c:\path\to\php-src\ext.
1. Open up a console window and head to the directory you just
unpacked to.
2. Run the buildconf.bat script. This will build configure.js
with the new build information for the stem extension.
3. Build your PHP tree as you normally would and add --with-stem
to build the stem module.
4. To use the extension, either use the dl() function to load
the DLL or use the extension=php_stem.dll directive in your
php.ini file.
You might be able to just do all of this using the pecl/pear
command on Windows. Honestly, I have no idea, I'm just glad
I'm able to compile this thing under Windows to begin with.