Datasets for ancient written text recognition are of fundamental interest both for training statistics-based recognition methods and for benchmarking existing recognition systems. SleukRith Set is the first dataset created specifically for Khmer palm leaf manuscripts. It consists of annotated data from 657 pages of digitized palm leaf manuscripts, selected arbitrarily from our digitized palm leaf image corpus.
SleukRith Set is composed of three types of data:
- Isolated Characters: The isolated character dataset is the most important data type in SleukRith Set, since the other two types of data are produced from it. To segment and annotate a manuscript page into small image patches, each representing an individual character, a polygon boundary enclosing each character has to be drawn manually. The ground truther places the vertices of the polygon one by one until a proper boundary is formed, and is then prompted to enter the correct Unicode code point or Unicode sequence as the label for that character. Some samples of character image patches extracted from the annotated character dataset of SleukRith Set are shown below.
- Words: After all characters in the page have been manually annotated, they can be combined into words. To form a word, its character components are selected one by one. The selection order matters, since a Khmer Unicode sequence does not follow the left-to-right position of the characters but instead follows a consonant-first, vowel-second order. The ground truther is then prompted again to enter a Unicode sequence as the label of the formed word. By default, the word label is generated by concatenating the labels of the word's component characters. A second label should also be provided by the ground truther when either the current spelling of the word is found to be erroneous or the equivalent word in the modern Khmer language has a different spelling. The image below illustrates some samples of word patch images extracted from the annotated word dataset of SleukRith Set.
- Lines: Similarly, annotated characters may be grouped into lines. To do this efficiently, the ground truther left-clicks and drags over the characters belonging to the same line, and is then asked either to create a new line from the selected characters or to add them to an existing line.
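The role of the selection order in the default word label can be illustrated with a minimal Python sketch. It is not the annotation tool's code: the `CharAnnotation` class, its `x_left` field, and the coordinates are hypothetical, and the example assumes that a consonant and a vowel sign drawn to its left were annotated as two separate characters.

```python
from dataclasses import dataclass


@dataclass
class CharAnnotation:
    """Hypothetical, simplified view of one annotated character."""
    char_id: int
    label: str   # Unicode code point or sequence entered by the ground truther
    x_left: int  # made-up leftmost x coordinate of the polygon boundary


def default_word_label(selected_chars):
    """Concatenate character labels in the order the characters were selected."""
    return "".join(c.label for c in selected_chars)


KA = "\u1780"       # KHMER LETTER KA (consonant)
VOWEL_E = "\u17C1"  # KHMER VOWEL SIGN E, rendered to the left of its consonant

# On the page the vowel sign sits to the left of the consonant.
vowel = CharAnnotation(char_id=0, label=VOWEL_E, x_left=100)
consonant = CharAnnotation(char_id=1, label=KA, x_left=130)

# Selecting consonant first, then vowel, yields a valid Unicode sequence.
selection_order = [consonant, vowel]
word_label = default_word_label(selection_order)

# Ordering by horizontal position would put the vowel sign first instead.
visual_order = sorted(selection_order, key=lambda c: c.x_left)
naive_label = default_word_label(visual_order)

print(word_label == KA + VOWEL_E)   # True
print(naive_label == word_label)    # False
```

Sorting by position alone gives a different code point order, which is why the tool relies on the order in which the ground truther selects the characters.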
After all steps in the annotation scheme are complete, an XML file containing all the information for the three types of data can be exported for each manuscript page. The XML file is divided into two sections. The first section, under the tag name “CharAnno”, is dedicated to the annotation at the character level. It contains one child block per annotated character, which holds the coordinates of the character's polygon boundary and additional attributes: the character id, its label, and the id of the line to which the character belongs. The second section, under the tag name “WordAnno”, describes the annotation at the word level. Since a word is a combination of characters, only the ids of the annotated characters defined in the first section are stored, along with the id of the annotated word and its two labels.
<CharAnno>
    <Char id="0" label="យ" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        ...
    </Char>
    ...
</CharAnno>
<WordAnno>
    <Word id="0" label="កំលាំង" label2="កម្លាំង">
        <CharInWord id="329"/>
        <CharInWord id="330"/>
        ...
    </Word>
    ...
</WordAnno>
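As a rough illustration of how such a file could be read, the following Python sketch uses the standard `xml.etree.ElementTree` module. Several things here are assumptions rather than facts about the released files: the wrapping `<Page>` root element, the helper names `parse_page`, `bounding_box` and `group_by_line`, and every value in the small `SAMPLE` string other than the tag and attribute names taken from the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample. The <Page> root element is an assumption; the excerpt
# above does not show how the two sections are wrapped in the real files.
SAMPLE = """<Page>
  <CharAnno>
    <Char id="0" label="យ" lineid="0">
      <poly x="406" y="100"/>
      <poly x="406" y="87"/>
      <poly x="420" y="87"/>
      <poly x="420" y="100"/>
    </Char>
    <Char id="1" label="ក" lineid="0">
      <poly x="425" y="100"/>
      <poly x="425" y="85"/>
      <poly x="441" y="85"/>
      <poly x="441" y="100"/>
    </Char>
  </CharAnno>
  <WordAnno>
    <Word id="0" label="យក" label2="យក">
      <CharInWord id="0"/>
      <CharInWord id="1"/>
    </Word>
  </WordAnno>
</Page>"""


def parse_page(xml_text):
    """Return (chars, words): chars keyed by id, words as a list of dicts."""
    root = ET.fromstring(xml_text)
    chars = {}
    for char in root.iter("Char"):
        polygon = [(int(p.get("x")), int(p.get("y"))) for p in char.findall("poly")]
        chars[char.get("id")] = {
            "label": char.get("label"),
            "lineid": char.get("lineid"),
            "polygon": polygon,
        }
    words = []
    for word in root.iter("Word"):
        words.append({
            "id": word.get("id"),
            "label": word.get("label"),
            "label2": word.get("label2"),
            # Words only reference characters by id, in stored order.
            "char_ids": [c.get("id") for c in word.findall("CharInWord")],
        })
    return chars, words


def bounding_box(polygon):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def group_by_line(chars):
    """Map each line id to the ids of the characters assigned to it."""
    lines = defaultdict(list)
    for char_id, info in chars.items():
        lines[info["lineid"]].append(char_id)
    return dict(lines)


chars, words = parse_page(SAMPLE)
first_box = bounding_box(chars["0"]["polygon"])
lines = group_by_line(chars)
rebuilt = "".join(chars[i]["label"] for i in words[0]["char_ids"])
```

Looking up each `CharInWord` id in the character table recovers the component labels of a word, and the bounding box of a polygon gives a simple rectangle from which a character patch could be cropped from the page image.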
- beta (110 pages): images/annotated data
- version 1.00 (657 pages): images/annotated data
- isolated characters: image train/label train/image test/label test (data format)
For more information about SleukRith Set, please refer to our paper: Valy, D., Verleysen, M., Chhun, S., & Burie, J. C. (2017). A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition - SleukRith Set. In 4th International Workshop on Historical Document Imaging and Processing (HIP).
We would like to thank the National Library of Cambodia, the EFEO team, and the Buddhist Institute for providing their digital images of palm leaf manuscripts. We would also like to thank the volunteer students from the Institute of Technology of Cambodia (ITC) and the National Institute of Posts, Telecommunications, and ICT (NIPTICT) for their help with the annotation of our dataset.
This research study is supported by ARES-CCD (program AI 2014-2019) under the funding of Belgian university cooperation and the STIC Asia program implemented by the French Ministry of Foreign Affairs and International Development (MAEDI).