forked from goodmami/rumi-jawi-web
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="static/styles.css">
<link rel="icon" href="favicon.png">
</head>
<body>
<div class="top">
<header>
<h1>Malay Rumi-Jawi Converter</h1>
</header>
<nav>
<a href="index.html">Home</a>
•
<a href="about.html">About Rumi and Jawi</a>
•
<a href="qa.html">Q & A</a>
•
<a class="at" href="details.html">Technical Details</a>
</nav>
</div>
<main>
<h1>Technical Details of the Converter</h1>
<article>
<h2>Algorithmic Details</h2>
<p class="justify">The Rumi–Jawi converter is at its core a
dictionary-based method: each Rumi or Jawi word is looked up in
the dictionary and, if it exists, mapped to one or more forms in
the other script. The results are concatenated into the output
box. The full process has a few more steps, which are detailed
below.</p>
<h3>Step 1: Tokenization</h3>
<p class="justify">A computer does not know what a word is, so,
from the perspective of the converter, the input box only contains
a sequence of characters. <em>Tokenization</em> is the process of
finding word boundaries (or token boundaries, more generally) in
this character sequence. Unlike, say, Chinese or Japanese, Malay
is written in both Rumi and Jawi with spaces between words, which
makes the task considerably easier. There are still some
challenges, however, such as punctuation. If words were only ever
split at spaces, the converter would treat "<em>makan,</em>"
(including the comma) as a word and fail to find an entry in the
dictionary, whereas it would have found "<em>makan</em>" (without
the comma).</p>
<p class="justify">In the converter, sequences of letter
characters (Latin or Arabic) are grouped into tokens that get
converted, while whitespace and most punctuation are ignored.
Hyphens (-), however, are used frequently in both Rumi and Jawi
for reduplication, so they are included in word tokens.
Commas, semicolons, and question marks appear differently in Rumi
and Jawi and are therefore handled specially.
</p>
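<p class="justify">The tokenization rules above can be sketched in
a few lines of JavaScript. This is only an illustration under
stated assumptions, not the converter's actual source: the names
<code>TOKEN_PATTERN</code> and <code>tokenize</code> are
hypothetical, and the sketch treats any Unicode letter as part of
a word rather than checking specifically for Latin or Arabic
letters.</p>
<pre>
```javascript
// Hypothetical sketch; names and pattern are assumptions for illustration.
// A word token is a run of letters, optionally joined by internal hyphens
// (so reduplicated forms such as "anak-anak" stay together). Everything
// else (whitespace, punctuation) forms a non-word token.
const TOKEN_PATTERN = /\p{L}+(?:-\p{L}+)*|[^\p{L}]+/gu;
const STARTS_WITH_LETTER = /^\p{L}/u;

function tokenize(text) {
  const tokens = [];
  for (const match of text.matchAll(TOKEN_PATTERN)) {
    const value = match[0];
    // Only word tokens are later sent to the conversion function.
    tokens.push({ text: value, isWord: STARTS_WITH_LETTER.test(value) });
  }
  return tokens;
}
```
</pre>
<p class="justify">Because every character lands in exactly one
token, joining the token texts back together reproduces the
input, which makes it easy to leave non-word tokens in place while
converting the words.</p>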
<h3>Step 2: Normalization</h3>
<p class="justify">Once words have been found by tokenization,
they are passed to a conversion function depending on the
direction of conversion (Rumi-to-Jawi or Jawi-to-Rumi), and the
first step of conversion is normalization. This step reduces
variation that would make conversion difficult.</p>
<p>In Rumi-to-Jawi conversion, the only word normalization that is
done is downcasing. Computers see <em>A</em> and <em>a</em>
as different letters, so if the dictionary contained the word
<em>kuning</em>, a lookup for <em>Kuning</em> would fail unless
the input were downcased first.</p>
<p class="justify">In Jawi-to-Rumi conversion there is no
downcasing as Jawi does not have upper and lower case, but there
are other reasons for variation. In Jawi, the preferred letter for
/k/ sounds is ک rather than the Arabic kaf ك, but some documents
nevertheless use the latter, so these are normalized to the
former. Similarly, Jawi uses ݢ for the /g/ sound, but sometimes
people use گ or ڬ instead, so these latter two get normalized to
the first one.</p>
<p class="justify">For conversion in both directions, commas,
semicolons, and question marks are replaced with the form used in
the target script.</p>
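<p class="justify">The normalization rules above amount to a few
character substitutions, sketched here in JavaScript. All names
are hypothetical. The letter substitutions follow the text above;
the punctuation table is an assumption that the Jawi side uses the
standard Arabic comma, semicolon, and question mark.</p>
<pre>
```javascript
// Hypothetical sketch; table and function names are assumptions.
// Variant Jawi letters mapped to the preferred letter.
const JAWI_LETTER_VARIANTS = new Map([
  ["\u0643", "\u06A9"], // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
  ["\u06AF", "\u0762"], // ARABIC LETTER GAF -> KEHEH WITH DOT ABOVE
  ["\u06AC", "\u0762"], // KAF WITH DOT ABOVE -> KEHEH WITH DOT ABOVE
]);

// Assumed punctuation pairs (Rumi form, Jawi form).
const RUMI_TO_JAWI_PUNCTUATION = new Map([
  [",", "\u060C"], // ARABIC COMMA
  [";", "\u061B"], // ARABIC SEMICOLON
  ["?", "\u061F"], // ARABIC QUESTION MARK
]);
const JAWI_TO_RUMI_PUNCTUATION = new Map(
  Array.from(RUMI_TO_JAWI_PUNCTUATION, ([rumi, jawi]) => [jawi, rumi])
);

// Replace each character that has an entry in the table; keep the rest.
function replaceCharacters(text, table) {
  return Array.from(text, (ch) => (table.has(ch) ? table.get(ch) : ch)).join("");
}

function normalizeRumiWord(word) {
  return word.toLowerCase();
}

function normalizeJawiWord(word) {
  return replaceCharacters(word, JAWI_LETTER_VARIANTS);
}

function convertPunctuation(text, direction) {
  const table = direction === "rumi-to-jawi"
    ? RUMI_TO_JAWI_PUNCTUATION
    : JAWI_TO_RUMI_PUNCTUATION;
  return replaceCharacters(text, table);
}
```
</pre>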
<h3>Step 3: Morphological Analysis</h3>
<p class="justify">Malay has a rich morphological system of
prefixes and suffixes which, for instance, change the root
<em>ajar</em> ("teach"/"learn") to <em>pelajar</em> ("student"),
<em>ajaran</em> ("precept"/"lesson"), <em>pelajaran</em>
("education"), <em>belajar</em> ("to learn"), <em>mengajar</em>
("to teach"), etc. Each of these words needs to be in the
dictionary to be converted, but the system would be more robust if
it could detect these affixes and convert them separately, because
the dictionary would then only need to contain the roots and the
affixes.</p>
<p>The converter does not yet do this level of morphological
analysis, but it does handle two common cases. If a word is not
found in the dictionary, it ends in <em>lah</em>/له (a common
discourse suffix), and the word without the suffix is in the
dictionary, then the word and the suffix are converted separately.
For Jawi-to-Rumi conversion, a word beginning with د (<em>di</em>)
that is not in the dictionary is handled similarly: the prefix and
the rest of the word are converted separately. This is because the
preposition <em>di</em> ("in"/"at") sometimes appears attached to
words in Jawi where it would be a separate word in Rumi.</p>
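<p class="justify">The two fallback rules above might look roughly
like the following JavaScript. This is a sketch under several
assumptions rather than the converter's code: the function and
table names are hypothetical, the toy dictionaries hold a single
entry taken from the examples on this page, the first candidate is
always chosen, and the sketch assumes that a detached <em>di</em>
is followed by a space and that the rest of the word must itself
be in the dictionary.</p>
<pre>
```javascript
// Hypothetical sketch; names, toy data, and joining rules are assumptions.
const TOY_RUMI_TO_JAWI = new Map([["sembilan", ["سمبيلن"]]]);
const TOY_JAWI_TO_RUMI = new Map([["سمبيلن", ["sembilan", "sambilan"]]]);

// The discourse suffix as [source form, converted form] per direction.
const LAH_SUFFIX = {
  "rumi-to-jawi": ["lah", "له"],
  "jawi-to-rumi": ["له", "lah"],
};
const JAWI_DI_PREFIX = "د";

function convertWithFallback(word, dictionary, direction) {
  // Whole-word lookup always comes first.
  if (dictionary.has(word)) {
    return dictionary.get(word)[0];
  }
  // Rule 1: strip the suffix and retry with the remaining stem.
  const [suffix, convertedSuffix] = LAH_SUFFIX[direction];
  if (word.endsWith(suffix)) {
    const stem = word.slice(0, word.length - suffix.length);
    if (dictionary.has(stem)) {
      return dictionary.get(stem)[0] + convertedSuffix;
    }
  }
  // Rule 2 (Jawi-to-Rumi only): detach an attached "di" prefix.
  if (direction === "jawi-to-rumi") {
    if (word.startsWith(JAWI_DI_PREFIX)) {
      const rest = word.slice(JAWI_DI_PREFIX.length);
      if (dictionary.has(rest)) {
        return "di " + dictionary.get(rest)[0];
      }
    }
  }
  // Nothing matched: keep the original word.
  return word;
}
```
</pre>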
<h3>Step 4: Dictionary Lookup</h3>
<p class="justify">Once all the tokenization, normalization, and
morphological analysis are complete, the actual dictionary lookup
step is trivial: if the word is in the dictionary, the mapped form
is used; if not, the original word is retained.</p>
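<p class="justify">As a minimal sketch of this final step, the
JavaScript here looks up each word token and concatenates the
results. The names, the token shape (<code>text</code> and
<code>isWord</code>), and the one-entry dictionary are assumptions
made for illustration.</p>
<pre>
```javascript
// Hypothetical sketch; names and the token shape are assumptions.
// Each dictionary entry maps a word to one or more forms in the other script.
const TOY_LOOKUP_TABLE = new Map([["dan", ["دن", "دان"]]]);

// Return every candidate form, or the original word if there is no entry.
function lookUp(word, dictionary) {
  const candidates = dictionary.has(word) ? dictionary.get(word) : [word];
  return { original: word, candidates: candidates };
}

// Convert word tokens with the first candidate and copy other tokens as-is.
function convertTokens(tokens, dictionary) {
  return tokens
    .map((token) => (token.isWord ? lookUp(token.text, dictionary).candidates[0] : token.text))
    .join("");
}
```
</pre>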
</article>
<article>
<h2>Display Details</h2>
<p class="justify">Aside from the algorithmic details of
conversion, there are some additional technicalities in the way
the results are presented.</p>
<h3>Font</h3>
<p class="justify">Jawi is an Arabic-based script, but it uses
letters that are not used to write the Arabic language, so it is
important to choose a font containing glyphs for these letters.
This site uses Google's <a
href="https://www.google.com/get/noto/#naskh-arab">Noto Naskh
Arabic</a> font as it contains these glyphs and is a neutral
typeface without many embellishments.</p>
<h3>Writing direction</h3>
<p class="justify">As Arabic scripts like Jawi are written from
right to left and Latin scripts like Rumi are written from left to
right, the input and output boxes of the converter are specified
with these directionalities so the text displays appropriately.
When the "Switch Direction" button is clicked, these
directionalities are reversed.</p>
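<p class="justify">One way to picture the direction switch is as
an exchange of two values, sketched here in JavaScript. The names
are hypothetical, and the sketch assumes the two boxes are objects
with a <code>dir</code> property (as HTML elements have), which
accepts <code>"ltr"</code> or <code>"rtl"</code>.</p>
<pre>
```javascript
// Hypothetical sketch; names are assumptions for illustration.
// "ltr" suits Rumi (Latin script) and "rtl" suits Jawi (Arabic script).
const RUMI_TO_JAWI_LAYOUT = { inputDir: "ltr", outputDir: "rtl" };

// Return a new layout with the two directions exchanged.
function switchLayout(layout) {
  return { inputDir: layout.outputDir, outputDir: layout.inputDir };
}

// Copy the directions onto the two boxes. In a browser these would be the
// input and output elements; here any object with a `dir` property works.
function applyLayout(layout, inputBox, outputBox) {
  inputBox.dir = layout.inputDir;
  outputBox.dir = layout.outputDir;
}
```
</pre>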
<h3>Alternatives</h3>
<p class="justify">Sometimes a word, whether in Rumi or Jawi, has
multiple candidates for conversion. For instance, <em>dan</em>
("and") can be written both as دن and دان in Jawi, and سمبيلن can
be either <em>sembilan</em> ("nine") or <em>sambilan</em>
("casual"). The converter tries to assist users in selecting the
correct form by highlighting ambiguous conversions in blue.
Clicking a blue word shows its list of candidate conversions at
the top of the output box, along with the original form. The user
may then select the preferred conversion, which is then used in
the output. <strong>Be careful</strong>: any change to the content
of the input box, such as typing a character or switching the
direction of conversion, wipes out all selections, so only make
selections once the input will no longer change.</p>
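<p class="justify">The selection behavior described above can be
modeled with a small piece of state, sketched here in JavaScript.
This is a hypothetical model, not the site's code: the class name,
its methods, and the shape of a conversion (an
<code>original</code> word plus a list of <code>candidates</code>)
are assumptions, and words are joined with spaces only to keep the
example short.</p>
<pre>
```javascript
// Hypothetical sketch; the class and its methods are assumptions.
// A conversion is ambiguous (shown in blue) when it has several candidates.
function isAmbiguous(conversion) {
  return conversion.candidates.length > 1;
}

class SelectionState {
  constructor() {
    this.input = "";
    this.selections = new Map(); // word position -> chosen candidate
  }

  // Remember the user's choice for the word at the given position.
  select(position, candidate) {
    this.selections.set(position, candidate);
  }

  // Any change to the input discards all earlier choices.
  setInput(text) {
    if (text !== this.input) {
      this.input = text;
      this.selections.clear();
    }
  }

  // Use the chosen candidate where there is one, else the first candidate.
  render(conversions) {
    return conversions
      .map((c, i) => (this.selections.has(i) ? this.selections.get(i) : c.candidates[0]))
      .join(" ");
  }
}
```
</pre>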
</article>
</main>
</body>
</html>