Language Technology Research Laboratory
University of Colombo School of Computing

Publications & Technical Reports
Implementation of Internet Domain Names in Sinhala (248 KB) pdf
Harsha Wijayawardana, Asanka Wasala, Ruvan Weerasinghe and Chamila Liyanage, Language Technology Research Laboratory, University of Colombo School of Computing

Proceedings of the International Symposium on Country Domain Governance. Nov. 20-22, 2008, Nagaoka, Japan. (2008)

This paper presents the process undertaken to study and resolve the issues surrounding the localization of Domain Names, primarily for Sinhala; the methodology adopted in translating Top Level Domains (TLDs) to Sinhala; and the technical issues related to implementing a robust localized domain name system for Sinhala.
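As a rough illustration of one technical issue that any localized domain name system has to handle, the sketch below converts a Unicode label to an ASCII-compatible form and back using Python's standard `punycode` codec. It is not taken from the paper: the helper names `to_ascii_label` and `from_ascii_label` are hypothetical, the sample label is only an example, and the normalization and validation steps that full IDNA processing requires are deliberately omitted.

```python
# Minimal sketch, NOT the paper's implementation.
# Assumption: helper names are hypothetical; IDNA normalization and
# validation steps are omitted for brevity.

ACE_PREFIX = "xn--"  # prefix that marks an ASCII-compatible encoded label


def to_ascii_label(label: str) -> str:
    """Encode one Unicode domain label with Punycode and add the ACE prefix."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ascii_label(ascii_label: str) -> str:
    """Decode an ACE-prefixed label back to its Unicode form."""
    if not ascii_label.startswith(ACE_PREFIX):
        raise ValueError("label does not carry the ACE prefix")
    return ascii_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    sinhala_label = "ලංකා"  # example label only
    encoded = to_ascii_label(sinhala_label)
    print(encoded)
    print(from_ascii_label(encoded) == sinhala_label)
```

The round trip shows that the Sinhala label survives conversion to the ASCII form that existing DNS infrastructure can carry.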

Festival-si: A Sinhala Text-to-Speech System
Ruvan Weerasinghe, Asanka Wasala, Viraj Welgama and Kumudu Gamage, Language Technology Research Laboratory, University of Colombo School of Computing

Proceedings of Text, Speech and Dialogue, 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007. (2007) 472-479

This paper describes the development of the first Text-to-Speech (TTS) system for Sinhala using the Festival framework, together with its practical applications. Construction of a diphone database and implementation of the natural language processing modules are described. The paper also presents the methodology for supporting direct Sinhala Unicode text input by rewriting letter-to-sound rules in Festival’s context-sensitive rule format, and the implementation of a Sinhala syllabification algorithm. A Modified Rhyme Test (MRT) was conducted to evaluate the intelligibility of the synthesized speech and yielded a score of 71.5% for the TTS system described.

Facilitating Information Accessibility for the Print Disabled (382 KB) pdf
Ruvan Weerasinghe, Asanka Wasala and Samantha Mathara Arachchi

Diriya 2007 - a conference on "Mainstreaming Disability into Development". Colombo, Sri Lanka (2007)

Information accessibility is becoming a key to personal, professional and national development in this increasingly connected world. While cost of access is still an issue in developing countries such as Sri Lanka, global trends indicate that this barrier will be removed sooner rather than later. Another significant impediment, the Language barrier, is currently being addressed in the region through various Localization initiatives.

In this paper however, we focus on a particularly disadvantaged community which is to a large extent shut out from the enormous global information resource made conveniently accessible to the rest of us through the Internet and the World Wide Web: this is the Print Disabled community. With a global percentage of 0.57% and a Sri Lankan estimate of 0.36%, this community forms a significant minority, who are in many other ways well equipped to benefit most from information technology. While the language literacy rate of this community is usually very high in countries like Sri Lanka, their access to information is greatly hampered owing to their relatively low IT literacy. One of the key reasons for this is the mismatch between their physical faculties and the most prevalent user interface in the computer – its screen.

Work undertaken at the Language Technology Research Lab (LTRL) at the University of Colombo School of Computing (UCSC) has focused on some of the key assistive technologies for enabling the active participation of this community in the emerging knowledge society. Several of these, such as the development of non-proprietary fonts and input methods, have a broader relevance in facilitating local language support for all. Others, such as Text to Speech, Optical Character Recognition and Talking Book technology, are more specifically supportive of the print disabled community. Still others, such as web accessibility support, are fast becoming global best practice to assist various differently-abled communities.

This paper will describe each of the key technologies relating to web content accessibility in local languages with a focus on the Sinhala language support tools currently available through the LTRL. It will also outline preliminary results and findings of testing and deploying such technology in general.

A KNN based Algorithm for Printed Sinhala Character Recognition
A.R.Weerasinghe, D.L. Herath, N.P.K. Medagoda

Proceedings of 8th International Information Technology Conference. Colombo, Sri Lanka (2006)

In this paper we discuss the application of the K-nearest neighbor algorithm to the recognition of printed Sinhala characters. The approach attains an overall accuracy of approximately 85%. Recognition is font-specific, and four popular fonts were considered.
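To make the K-nearest neighbor idea concrete, here is a minimal sketch of majority-vote classification over feature vectors. It is an illustration under stated assumptions, not the method from the paper: the function name `knn_classify`, the Euclidean distance, the value of `k`, and the toy 4-pixel "glyphs" with labels "A" and "B" are all hypothetical, and the paper's actual features and fonts are not reproduced.

```python
# Minimal sketch, NOT the paper's implementation.
# Assumption: features, labels, distance metric and k are hypothetical.
import math
from collections import Counter


def knn_classify(query, training_set, k=3):
    """Return the majority label among the k training vectors nearest to query.

    training_set is a list of (feature_vector, label) pairs.
    """
    # Sort training samples by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(training_set, key=lambda item: math.dist(query, item[0]))[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data: 2x2 binary "glyph bitmaps" flattened to 4 pixels (hypothetical).
TRAINING = [
    ((1, 1, 0, 0), "A"),
    ((1, 1, 0, 1), "A"),
    ((1, 0, 0, 0), "A"),
    ((0, 0, 1, 1), "B"),
    ((0, 1, 1, 1), "B"),
    ((0, 0, 0, 1), "B"),
]

if __name__ == "__main__":
    print(knn_classify((1, 1, 0, 0), TRAINING))
    print(knn_classify((0, 0, 1, 1), TRAINING))
```

Because the classifier only compares a query against stored samples, a font-specific system like the one described would need training samples for each font it is expected to recognize.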

Sinhala Grapheme to Phoneme Conversion and Rules for Schwa Epenthesis (271 KB) pdf
Asanka Wasala, Ruvan Weerasinghe and Kumudu Gamage, Language Technology Research Laboratory, University of Colombo School of Computing

Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Sydney, Australia (2006) 890-897

This paper describes an architecture for converting Sinhala Unicode text into a phonemic specification of pronunciation. The study focused mainly on disambiguating schwa /ə/ and /a/ vowel epenthesis for consonants, which is one of the significant problems found in Sinhala. This problem has been addressed by formulating a set of rules. The proposed set of rules was tested using 30,000 distinct words obtained from a corpus and compared with the same words manually transcribed to phonemes by an expert. The Grapheme-to-Phoneme (G2P) conversion model achieves 98% accuracy.

The Sinhala Collation Sequence and its Representation in UNICODE (2.11 MB) pdf
Weerasinghe A. R., Herath D. L., Gamage K.

Localisation Focus - The International Journal for Localisation. March 2006. 13-19

The alphabet of a language is perhaps the first thing we learn as users. The alphabet of our mother tongue would be the first alphabet we ever learn. And yet, a closer look reveals that there is much about such an alphabet that we have not explicitly specified anywhere. The Sinhala alphabet order is a prime example. We use it, recite it and yet would be hard pressed to define it explicitly. Sinhala is spoken in all parts of Sri Lanka except some districts in the north, east and centre by approximately 20 million people. It is spoken by an additional 30,000 (1993) people in Canada, Maldives, Singapore, Thailand and United Arab Emirates. Sinhala is classified as an Indo-European language and used as an official language. The UNICODE Collation Algorithm (UCA) is an attempt to make explicit the collation sequence of any language expressed in the UNICODE (or any other) coding system. In order to express the Sinhala collation sequence (alphabetical order) using UCA, the authors undertook the task of identifying unresolved issues facing the unambiguous definition of the order. This paper first describes the issues identified through this study, suggesting alternate solutions and recommending one of them. Finally, it sets out the recommended collation sequence for Sinhala in the form of the UNICODE collation specification. The outcome of this process is a unique and unambiguous expression of the Sinhala collation sequence which could be tested using existing tools and software environments.

A Rule Based Syllabification Algorithm for Sinhala
Ruvan Weerasinghe, Asanka Wasala and Kumudu Gamage, Language Technology Research Laboratory, University of Colombo School of Computing

Proceedings of 2nd International Joint Conference on Natural Language Processing (IJCNLP-05). Jeju Island, Korea (2005) 438-449

This paper presents a study of Sinhala syllable structure and an algorithm for identifying syllables in Sinhala words. After a thorough study of the syllable structure and linguistic rules for syllabification of Sinhala words, and a survey of the relevant literature, a set of rules was identified and implemented as a simple algorithm. The algorithm was tested using 30,000 distinct words obtained from a corpus and compared with the same words manually syllabified. The algorithm achieves 99.95% accuracy.

A Stochastic Part of Speech Tagger for Sinhala
Dulip Lakmal Herath, A.R. Weerasinghe

Proceedings of 6th International Information Technology Conference. Colombo, Sri Lanka (2004)

This paper presents the results of an experiment on part of speech (POS) tagging for Sinhala using Hidden Markov Models (HMMs) based on bi-gram probabilities. POS tagging is needed to resolve the syntactic ambiguities that exist in natural language texts. Two kinds of ambiguity have been handled in the present work: Known Word Ambiguity and Unknown Word Ambiguity. An annotated corpus of Sinhala was built for HMM parameter estimation. A comprehensive POS tag set has been designed and used for corpus annotation. The paper describes the process of developing the POS tagger and its related issues in the context of Sinhala. The tagger has shown an interesting performance even under several constraints with respect to training data. The paper also makes important suggestions on further improvements to achieve a higher level of accuracy.
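The bi-gram HMM approach can be sketched with a small Viterbi decoder, which picks the most probable tag sequence given start, tag-to-tag transition, and word emission probabilities. This is an illustration only, not the tagger from the paper: the two-tag set, the English toy words, all probability values, and the fixed `unknown_p` floor used for unseen words are assumptions made for the example.

```python
# Minimal sketch, NOT the paper's tagger.
# Assumption: tag set, vocabulary, probabilities and the unknown-word
# floor (unknown_p) are all hypothetical toy values.

def viterbi(words, tags, start_p, trans_p, emit_p, unknown_p=1e-6):
    """Return the most probable tag sequence for words under a bi-gram HMM."""
    if not words:
        return []
    # best[i][t] = (probability of the best path ending in tag t, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unknown_p), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            emission = emit_p[t].get(word, unknown_p)  # floor for unseen words
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][t] * emission, p) for p in tags
            )
            column[t] = (prob, prev)
        best.append(column)
    # Backtrace from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


TAGS = ["N", "V"]
START_P = {"N": 0.8, "V": 0.2}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT_P = {
    "N": {"dogs": 0.6, "bark": 0.1},   # "bark" is ambiguous: noun or verb
    "V": {"dogs": 0.05, "bark": 0.5},
}

if __name__ == "__main__":
    print(viterbi(["dogs", "bark"], TAGS, START_P, TRANS_P, EMIT_P))
```

In this toy setup the known-word ambiguity of "bark" is resolved by the transition probabilities, while an unseen word falls back to the flat floor and is tagged from context alone.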

An Introduction to UNICODE for Sinhala Characters (337 KB) pdf
Samaranayake, V. K., Nandasara, S. T., Dissanayake, J. B.*, Weerasinghe, A.R., Wijayawardhana, H. University of Colombo School of Computing * Sinhala Department, University of Colombo.

UCSC Technical Report 03/01

This paper introduces the background, steps taken and eventual adoption of a Standard Code for the Sinhala Character set and the UNICODE/ISO10646 standard for Sinhala together with clarifications on some of the technical and linguistic issues involved in using the code for implementation.

Rendering of Unicode Sinhala Characters (151 KB) pdf
Wijayawardhana, H. University of Colombo School of Computing

This document is reference material distributed at a workshop by Mr. H. Wijayawardhana as a supplement to his speech.

© Language Technology Research Laboratory, 2011 Last updated on 09 April 2015