-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathTechnReport
214 lines (156 loc) · 7.67 KB
/
TechnReport
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
CONCISE REPORT ON THE ALGORITHMS IN SIM 970623
INTRODUCTION
The general outline of the similarity checker is as follows:
1. the files are read in (pass 1)
2. a forward-reference table is prepared
3. the set of interesting runs is determined
4. the line numbers of the runs are determined (pass 2)
5. the contents of the runs are printed in order (pass 3)
To keep the memory requirements (relatively) small, the exact positions
of the tokens are not recorded. This necessitates pass 2. See, however,
the pertinent chapter.
READING THE FILES
Each file is tokenized using an lex-generated scanner appropriate for
the input. Each token fits in one byte, possibly using all 8 bits. The
tokens are stored in the array Token_Array[], which is extended by
reallocation if it overflows. See tokenarray.c.
Also, to optimize away pass 2, an attempt is made to remember the token
positions of all beginnings of lines. The token-positions at BOL are
stored in the array nl_buff[], which is also extended by reallocation,
if needed. If the attempt fails due to lack of memory, nl_buff[] is
abandoned, and pass2 will read the files instead.
PREPARING THE FORWARD-REFERENCE TABLE
Text is compared by comparing every substring to all substrings
to the right of it; this process is in essence quadratic. However,
only substrings of length at least 'MinRunSize' are of interest,
which gives us the possibility to speed up this process by using
a hash table.
Once the entire text has been read in, a forward-reference table
forward_references[] is made (see hash.c).
For every position in the text, we construct an index which gives
the next position in the text where a run of MinRunSize tokens
starts that has the same hash code. If there is no such run, the
index is 0.
To fill in this array, we use a hash table last_index[], such that
last_index[i] is the index of the latest token with hash_code i, or 0 if
there is none. If at a given position p, we find that the text ahead of
us has hash code i, last_index[i] tells us which position in
forward_references[] will have to be updated to p.
See MakeForward_References().
For long text sequences (say hundreds of thousands of tokens), the
hashing is not really efficient any more since too many spurious matches
occur. Therefore, the forward reference table is scanned a second time,
eliminating from any chain all references to runs that do not start with
and end in the same token (actually this is a second hash code).
For the UNIX manuals this reduced the number of matches from 91.9% to 1.9%
(of which 0.06% was genuine).
DETERMINING THE SET OF INTERESTING RUNS
The overall structure of the routine Compare_Files() (see compare.c) is:
for all new files
for all texts it must be compared to
for all positions in the new file
for all positions in the text
for ever increasing sizes
try to match and keep the best
If for a given position in the new file a good run (i.e. on of at least
minimum length) has been found, the run is registered using a call of
add_run(), the run is skipped in the new file and searching continues at
the position after it. This prevents duplicate reports of runs.
Add_run() allocates a struct run for the run (see sim.h)
which contains two struct chunks and a quality description. It fills
in the two chunks with the pertinent info, one for the first file and
one for the second (which may be the same, if the run relates two chunks
in the same file).
The run is then entered into the arbitrary-in-sorted-out store AISO (see
aiso.spc and aiso.bdy, a genuine generic abstract data type in C!), in
which it is inserted according to its quality. Both positions
(struct position) in both chunks in the run (so four in total) are each
entered in a linked list starting at the tx_pos field in the struct text
of the appropriate file.
When this is finished, the forward reference table can be deleted.
So the final results of this phase are visible both through the tx_pos
fields and through the aiso interface.
DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2)
The purpose of this pass is to find for each chunk, which up to now is
known by token position only, its starting and ending line number (which
cannot be easily derived from the token position).
For each file that has a non-zero tx_pos field, ie. that has some
interesting chunks, the positions in the tx_pos list are sorted on
ascending line number (they have been found in essentially arbitrary
order) by sort_pos() in pass2.c.
Next we scan the pos list and the file in parallel, updating the info in
a position when we meet it. A position carries an indication whether it
is a starting or an ending position, since slightly differing
calculations have to be done in each case.
Actually, if the nl_buff[] data structure still exists, the file is not
accessed at all and the data from nl_buff[] is used instead. This is
done transparently in buff.c.
PRINTING THE CONTENTS OF THE RUNS (PASS 3)
Since each struct run has now been completely filled in, this is simple;
the hard work is calculating the page layout.
Pass3() accesses the aiso store and retrieves from it the runs in
descending order of importance. Show_run() opens both files, positions
them using the line numbers and prints the runs.
================================================================
CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (980222)
sim:
get command line options
check the options
init language, to precompute tables
pass1, read the files
# there is an array Token_Array[] that holds all input tokens
make forward reference table
# there is an array forward_references[], with one entry for
# each token in the input; forward_references[i] gives the
# token number where a token sequence starts with the same
# hash value as the one starting at i
compare various files to find runs
delete forward reference table
pass2, find newline positions of found similarities
pass3, print the similarities
pass1, read the files:
for each file
divide the text into tokens
store all tokens except newlines in Token_Array and try to
keep a record of the newline positions
make forward reference table:
# there are two independent hash functions, hash1() and hash2().
# hash1(i) gives the hash value of the token sequence starting at i
# likewise for hash2(i)
set up the forward references using the last_index table:
# there is an array last_index[], with one entry for each
# possible hash value; last_index[i] gives the position in
# forward_references[] at which i was most recently
# encountered as a hash value
for each file
for all positions in file except the last MinRunSize
set forward_references[] and update last_index[]
use hash2() to clean out matches:
for all tokens
find first token in chain with same hash2 code
short-circuit forward reference to it
compare:
for all new files
for all texts it must be compared to
for all positions in the new file
for all positions in the text
for ever increasing sizes
try to match and keep the best
try to match and keep the best:
# using forward_references[], we find a list of positions in
# which a matching token sequence will start;
# scanning this list, we measure the maximum length of the
# match and add the longest match to the run collection
pass2, find positions of found runs:
for all files:
sort the positions in the runs
# we scan the pos list and the file in parallel
for all positions inside this file
if it matches a token position in a run
record line number
pass3, print the similarities:
for all runs
# a run consists of two chunks
open the files that hold the chunks and position them
at the beginning of the chunk
display the chunks