Changelog
Version: 2.11.0 (2022.09.08)
----------------------------
Surrogate pairs in RTF input are now also handled correctly if the two halves
are separated by white space characters, as in
    \u55357 \u56911
as opposed to
    \u55357\u56911
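For illustration, this is roughly how the two halves of a surrogate pair combine into one code point. It is a minimal sketch using the standard UTF-16 arithmetic; the function name combineSurrogates is hypothetical and not taken from the program's sources.

```cpp
#include <cstdint>

// Hypothetical helper (assumption, not the program's own code).
// Combines a UTF-16 high and low surrogate into one code point.
// Returns false if the two values do not form a valid pair.
bool combineSurrogates(std::uint32_t high, std::uint32_t low, char32_t& codePoint)
{
    if (high < 0xD800 || high > 0xDBFF)
        return false; // not a high (leading) surrogate
    if (low < 0xDC00 || low > 0xDFFF)
        return false; // not a low (trailing) surrogate
    codePoint = static_cast<char32_t>(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00));
    return true;
}
```

With the values from the example above, 55357 and 56911, the result is U+1F64F.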
Version: 2.10.1 (2022.04.07)
----------------------------
Vertical tab characters are retained in the output.
In a white space sequence consisting only of any number of space characters
0x20 and new line characters 0x0A (\n), and one or more vertical tab characters
0x0B (\v), only the vertical tab characters and the first new line character
are kept. The rest is discarded.
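The rule can be sketched as follows. This is an illustration under assumptions, not the program's implementation: the name collapseRun is hypothetical, the run is assumed to contain at least one retained character, and the kept characters are assumed to stay in their input order, which the changelog does not specify.

```cpp
#include <string>

// Hypothetical helper (assumption, not the program's own code).
// Reduces one white space run to the retained characters (default: vertical
// tab) plus the first new line character. Spaces and later new lines are dropped.
std::string collapseRun(const std::string& run, char keep = '\v')
{
    std::string out;
    bool seenNewline = false;
    for (char c : run) {
        if (c == keep) {
            out += c;               // every retained character survives
        } else if (c == '\n' && !seenNewline) {
            out += c;               // only the first new line survives
            seenNewline = true;
        }
    }
    return out;
}
```

Passing '\f' as the second argument gives the analogous behaviour for form feed characters.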
Version: 2.10.0 (2022.04.03)
----------------------------
Form feed characters are retained in the output.
In a white space sequence consisting only of any number of space characters
0x20 and new line characters 0x0A (\n), and one or more form feed characters
0x0C (\f), only the form feed characters and the first new line character
are kept. The rest is discarded.
Version: 2.9.1 (2021.09.02)
---------------------------
Fixed: lines with trailing spaces confused the decision whether two segments
belong to the same paragraph. What was needed was a look-ahead to see whether
a non-space follows a given space character.
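A look-ahead of this kind could look like the sketch below. The function name nonSpaceFollows is hypothetical, and the sketch assumes the look-ahead works on one line held in a std::string, which may differ from how the program reads its input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (assumption, not the program's own code).
// Returns true if any character other than a space occurs at or after pos,
// that is, if the space at pos is not part of the line's trailing spaces.
bool nonSpaceFollows(const std::string& line, std::size_t pos)
{
    return line.find_first_not_of(' ', pos) != std::string::npos;
}
```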
Version: 2.9 (2021.09.01)
-------------------------
Added option -p. Output an empty line if a segment in the input ends where the
line ends. Can be used to retain information about paragraphs.
Version: 2.8 (2019.05.27)
-------------------------
Improved handling of XML(-like) tags. Previously, text preceding a 'tag'
was not properly segmented. Note that if a part of text can be interpreted as
valid XML or HTML, then that part of the text is copied to the output without
changes. For example, the text '<press escape>' is NOT tokenised to
'< press escape >'.
Version: 2.7 (2019.01.19)
-------------------------
Initialise optionStruct::tokenSplit with ' '.
Version: 2.6 (2019.01.02)
-------------------------
Option -D inserts ASCII 127 (DEL) in place of the default ASCII 32 (space) to
separate tokens that are not already separated in the input, for example
between the last word in a sentence and the full stop, or between 'got' and
'ta' if the input is the English word 'gotta'. This option makes it possible to
go back to an untokenized, but still segmentized, version of the text.
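As an illustration of going back to the untokenized text, the sketch below removes every DEL character from output produced with -D. The function name untokenize is hypothetical, and the sketch assumes that DEL does not otherwise occur in the text.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (assumption, not part of the program).
// Removes all ASCII 127 (DEL) characters, which -D inserts between tokens
// that were not separated in the input. Ordinary spaces are left alone.
std::string untokenize(std::string text)
{
    text.erase(std::remove(text.begin(), text.end(), '\177'), text.end());
    return text;
}
```

For example, "I got<DEL>ta go<DEL>." becomes "I gotta go." again.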
Version: 2.5 (2017.10.28)
-------------------------
Makefile: now creates the executable in the rtfreader folder, not /usr/local/bin.
README.md: Updated link to workflow manager.
Version: 2.4 (2017.03.29)
-------------------------
Added to the lists SEP[] and QUOTE[]:
U+02B9 ʹ 697 Modifier Letter Prime
U+02BA ʺ 698 Modifier Letter Double Prime
U+02BB ʻ 699 Modifier Letter Turned Comma
U+02BC ʼ 700 Modifier Letter Apostrophe
U+02BD ʽ 701 Modifier Letter Reversed Comma
Version: 2.4 (2017.03.29)
-------------------------
Added [] after delete in main.cpp.
Removed static linked version, because Mac OS X users do not have that option.
Version: 2.3 (2016.12.22)
-------------------------
Added newline at end of some header files that lacked one.
Made destructors virtual in some classes.
Version: 2.2 (2016.05.03)
-------------------------
Typecast char to unsigned char when calling isspace and isUpper.
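The reason for the typecast: the <cctype> functions require an argument that is representable as unsigned char (or EOF), so passing a negative char value is undefined behaviour. A minimal sketch, with the hypothetical wrapper name isSpaceChar:

```cpp
#include <cctype>

// Hypothetical wrapper (assumption, not the program's own code).
// On platforms where char is signed, bytes above 0x7F are negative.
// Casting to unsigned char first keeps the call to std::isspace well defined.
bool isSpaceChar(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```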
Version: 2.1 (2016.01.15)
-------------------------
Corrected interpretation of unicode characters \uN.
1) Absence of \ucM now means M == 1, as the RTF standard prescribes. This
means that the unicode character is assumed to be followed by a single
character that serves as stand-in in case the unicode character cannot be
handled.
2) Parsing \u51234anumber stops when the number after \u is as big as it can
be in a short integer or when a non-digit is seen after having seen a
number after \u.
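Point 2 can be sketched as below. This is an illustration under assumptions, not the program's parser: the names UnicodeNumber and parseUnicodeNumber are hypothetical, "as big as it can be in a short integer" is read as the unsigned 16-bit limit 65535, and negative numbers, which RTF also allows after \u, are not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type (assumption): the parsed value and the number of
// digits consumed.
struct UnicodeNumber {
    unsigned value;
    std::size_t length;
};

// Hypothetical helper (assumption, not the program's own code).
// Reads decimal digits starting at pos. Stops at the first non-digit, or
// before a digit that would push the value past the assumed limit of 65535.
UnicodeNumber parseUnicodeNumber(const std::string& s, std::size_t pos = 0)
{
    const unsigned limit = 65535;
    unsigned value = 0;
    std::size_t i = pos;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
        unsigned digit = static_cast<unsigned>(s[i] - '0');
        if (value * 10 + digit > limit)
            break;              // one more digit would overflow 16 bits
        value = value * 10 + digit;
        ++i;
    }
    return {value, i - pos};
}
```

Called on the text after \u, both "51234anumber" and "512345" give the value 51234 with five digits consumed; the remainder is left for the caller, which then skips the stand-in character(s) described in point 1.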
Version: 2.03 (2015.10.30)
--------------------------
More subtle improvements to segmentation.
Version: 2.02 (2015.10.28)
--------------------------
A line in which most words begin with a capital letter is not always a
headline. The previous line must not seem unfinished. Signs of an unfinished
line: no punctuation, a comma or colon as last punctuation, or a hyphen as if
a word is wrapped.
Version: 2.0 (2015.10.13)
-------------------------
(Changelog starts here.)
Better handling of RTF coming from sources other than Microsoft software.
Better handling of periods and other punctuation.
Linebreaks in more appropriate places. (Because of previous point.)
Parsing of Roman numerals.
Restructured the program: Still spaghetti, but now with tomato sauce (more
source files, more classes).
Version 1.5
-----------
The previous version.