The Language Technology Research Laboratory (LTRL) was established in 2004 to address the growing need for local language computing in Sri Lanka through research and development in localization and language processing. LTRL maintains strong relationships with all related parties, collaborating with researchers, practitioners, linguists and policy makers on the island. The research group has a diverse range of activities: Research & Development, Consultancy, Training and Services. LTRL conducts research in the field of computational linguistics, covering the phonetics/phonology, morphology, and syntax of local languages. This research is mainly focused on mathematical and statistical modeling of the local languages, namely Sinhala and Tamil. The research programs are strengthened by the supervision and consultation of a panel of senior scholars from various linguistic traditions.
These themes are pursued in the following research, development and commercial projects:
- The compilation of a large Sinhala corpus.
- The construction of a Sinhala lexicon with some translations to Tamil and English.
- The production of a commercial quality Sinhala text-to-speech system.
- The building of a Sinhala OCR system.
The corpus compilation activity is concerned with building up a valuable resource for various kinds of language processing tasks. While the initial aim is to collect large amounts of electronic Sinhala text from wherever it can be found, in the latter stages some effort will be made to ‘balance’ the corpus produced. The final goal is to compile a corpus of 1 million words. As part of the strategy for collecting large quantities of text material, there will be some work on producing tools for mapping between various ‘fonts’ and the Sinhala UNICODE standard. The corpus itself will be completely in UNICODE.
The Corpus compilation task will employ the following methodology:
For non-electronic content:
- collating government documents
- negotiating publisher content
- keyboarding non-electronic content from above in UNICODE
For electronic content:
- identifying high content ‘font encodings’
- building mappers for converting these to UNICODE
- collecting archived web content in these encodings
- converting all electronic content to UNICODE
Mark up the entire corpus
Some effort will be expended towards the end of the project to balance the corpus.
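The ‘font encoding’ mappers mentioned above can be illustrated with a minimal sketch. It assumes a legacy font that stores Sinhala glyphs on Latin key positions and writes the kombuva vowel sign before its consonant (visual order), whereas UNICODE stores it after the consonant (logical order). The mapping table and the function name are hypothetical and chosen only for illustration; a real mapper would need a complete, verified table for each font.

```python
# Hypothetical, deliberately tiny mapping table: legacy key -> Unicode code point.
# A real font mapper needs a full table verified against the actual font.
LEGACY_TO_UNICODE = {
    "w": "\u0D85",  # SINHALA LETTER AYANNA
    "l": "\u0D9A",  # SINHALA LETTER ALPAPRAANA KAYANNA
    "u": "\u0DB8",  # SINHALA LETTER MAYANNA
    "f": "\u0DD9",  # SINHALA VOWEL SIGN KOMBUVA
}

# Vowel signs that legacy fonts place before the consonant (visual order).
PREBASE_SIGNS = {"\u0DD9"}


def is_sinhala_consonant(ch):
    """True if ch lies in the Sinhala consonant range U+0D9A..U+0DC6."""
    return len(ch) == 1 and "\u0D9A" <= ch <= "\u0DC6"


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map legacy characters to Unicode, then move pre-base signs after their consonant."""
    mapped = [table.get(ch, ch) for ch in text]  # unknown characters pass through
    out = []
    i = 0
    while i < len(mapped):
        cur = mapped[i]
        nxt = mapped[i + 1] if i + 1 < len(mapped) else ""
        if cur in PREBASE_SIGNS and is_sinhala_consonant(nxt):
            out.extend([nxt, cur])  # logical order: consonant first, then sign
            i += 2
        else:
            out.append(cur)
            i += 1
    return "".join(out)
```

With this toy table, `legacy_to_unicode("ful")` yields the sequence MAYANNA, KOMBUVA, KAYANNA in logical order. Handling conjuncts, multi-part vowel signs and ambiguous glyphs would require considerably more rules than this sketch shows.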
A lexicon is an important resource for most other language processing work, including the work undertaken in this project. As such, this activity will precede the following activities and will feed directly into them.
The primary aim of this activity is to create a list of 30 thousand Sinhala words together with some grammatical features. The features identified currently are part of speech, number, gender and morphology, but the set may be extended as other priority features are uncovered. Apart from this, a portion of at least 10 thousand common words in this list will also have Tamil and English translations in the lexicon, providing a resource for language translation work.
The methodology adopted for the construction of the lexicon will include:
- buying rights to a good Sinhala dictionary
- manual building of the electronic version
- augmenting each lexical entry with morphology, POS tags, number and gender (and possibly other) information
- including Tamil and English equivalents for a subset of the lexicon
In addition to this, a (multi-level) morphological parser for Sinhala will be constructed in this sub-task.
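As a rough illustration of what a lexical entry with these features might look like, the sketch below models an entry and a small lookup structure. The class names, field names and tag values are assumptions made for this example and do not describe the actual LTRL lexicon format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexicalEntry:
    """One lexicon entry; all field names here are illustrative assumptions."""
    headword: str                       # Sinhala word in Unicode
    pos: str                            # part-of-speech tag
    number: Optional[str] = None        # e.g. "singular" or "plural"
    gender: Optional[str] = None        # left unset when not applicable
    morphology: dict = field(default_factory=dict)  # free-form morphological features
    tamil: list = field(default_factory=list)       # Tamil equivalents, if any
    english: list = field(default_factory=list)     # English equivalents, if any


class Lexicon:
    """Minimal in-memory lexicon keyed by headword."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        # A headword may have several entries (for example, different POS tags).
        self._entries.setdefault(entry.headword, []).append(entry)

    def lookup(self, headword):
        return self._entries.get(headword, [])

    def with_translations(self):
        """Entries that carry both Tamil and English equivalents."""
        return [e for entries in self._entries.values()
                for e in entries if e.tamil and e.english]


# Toy usage with a single illustrative entry.
lexicon = Lexicon()
lexicon.add(LexicalEntry(headword="ගස", pos="noun", number="singular",
                         tamil=["மரம்"], english=["tree"]))
```

Keeping translations as optional lists reflects the plan that only a subset of the entries will carry Tamil and English equivalents.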
Text To Speech (TTS) System
While some experimental TTS systems for Sinhala are being developed by the UCSC, the aim under this project is to produce one of commercial quality. To this end, considerable effort is being devoted to the quality of every aspect of this activity. The fundamental approach will continue to be diphone concatenation. The work includes identifying the phonetic alphabet of the language, recording the relevant words and sentences for the database, and building a text analysis component. The project will also produce a synthesis engine that supports modeling of the prosodic features of the language, so that the final output is a natural-sounding Sinhala voice.
The basic methodology to be adopted is based on the diphone concatenation approach to TTS and will involve:
A text analysis component
- study types of non-textual content and how to convert them to text
- define text analysis interface
- build text analysis components
A phonetic component
- study the phonology and phonetics of Sinhala
- identify the phonetic vocabulary
- construct words and sentences for recording the most common diphones
- define phonetic processor components
- build diphone database
- build phonetic processor
- build prosody model
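The core idea of diphone concatenation can be sketched as follows: a phoneme sequence is padded with silence, split into overlapping pairs (diphones), and the recorded unit for each pair is looked up and joined. The function names, the `"_"` silence symbol and the toy database below are assumptions for illustration only; a real engine would also smooth the joins and apply the prosody model.

```python
from collections import Counter


def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names such as 'a-m', padded with silence."""
    padded = [silence, *phonemes, silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def diphone_counts(utterances):
    """Count diphones over many phoneme sequences, e.g. to pick recording material."""
    counts = Counter()
    for phonemes in utterances:
        counts.update(to_diphones(phonemes))
    return counts


def synthesize(phonemes, diphone_db):
    """Concatenate the stored samples of each diphone (no smoothing or prosody)."""
    samples = []
    for name in to_diphones(phonemes):
        if name not in diphone_db:
            raise KeyError(f"diphone not recorded: {name}")
        samples.extend(diphone_db[name])
    return samples


# Toy database: each diphone maps to a short list of made-up sample values.
toy_db = {"_-a": [0, 1], "a-m": [2, 3], "m-a": [4, 5], "a-_": [6, 0]}
waveform = synthesize(["a", "m", "a"], toy_db)
```

Counting diphones over a phonetically transcribed corpus is one plausible way to choose which words and sentences to record first, since the most frequent diphones contribute most to coverage.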
Optical Character Recognition System
Current OCR work at the UCSC has concentrated on developing a technique well suited to detecting printed Sinhala characters. This component of work will focus on converting that research into a real product by making it robust to variations in font size, particularly those found in material commonly used by the majority of people, such as newspaper print and government publications.
The methodology adopted for the construction of the OCR system will consist of the following activities:
- scanning documents and skew detection with a view to discarding
- noise detection and removal
- extraction of text characteristics and individual characters
- identification of representative texts
- separation of training, validation and testing sets
- feature extraction and pattern matching
- testing of competing algorithms
- optimization of algorithms
- application development
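Two of the steps above, separating the data into training, validation and testing sets and matching patterns, can be sketched in a few lines. The split percentages, the function names and the Hamming-distance template matcher are assumptions made for this example; they stand in for whatever features and algorithms the project actually evaluates.

```python
import random


def split_dataset(samples, train_pct=70, validation_pct=15, seed=0):
    """Shuffle reproducibly and split into training, validation and testing lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100       # integer arithmetic avoids rounding surprises
    n_val = len(items) * validation_pct // 100
    train = items[:n_train]
    validation = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]                # remainder goes to the testing set
    return train, validation, test


def hamming(a, b):
    """Number of positions at which two equal-length binary bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(bitmap, templates):
    """Return the label of the template closest to the bitmap (nearest template)."""
    return min(templates, key=lambda label: hamming(bitmap, templates[label]))


# Toy usage: 20 sample identifiers and two made-up 4-pixel templates.
train, validation, test = split_dataset(range(20))
templates = {"A": (1, 0, 1, 0), "B": (0, 1, 0, 1)}
label = classify((1, 0, 1, 1), templates)
```

Fixing the random seed keeps the three sets stable between runs, which matters when competing algorithms are compared on the same testing set.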
On behalf of the Language Technology Research Laboratory – University of Colombo School of Computing, we would like to thank the following contributors for their enormous support in our various projects.
Buddhist Cultural Centre
Godage Book Emporium
Associated Newspapers of Ceylon Limited
Martin Wickremasinghe Trust
Social Scientists Association
Prof. Sucharitha Gamlath
Dr. Jayadewa Uyangoda
- Dr. A.R. Weerasinghe – BSc(Col), MSc(Cardiff), PhD(Cardiff) (Head – LTRL)
- Mr. Viraj Welgama – BSc(Col), MA(Kelaniya), MPhil(Col)
- Mr. Chamila Liyanage – BA(Kelaniya), (Senior Research Assistant)
- Mr. Namal Udalamatta – BA(Col), MPhil(Kelaniya), (Senior Research Assistant)
- Mr. Randil Pushpananda – BSc(Col), BIT(Col), MSc(Col), (Research Associate)
- Mrs. Thilini Nadungodage – BSc(Col), (Research Associate)
- Mrs. Dilhani Samaranayake – BIT(Col)UG, (NCICT), CAA (NAITA) (Research Assistant)
- Prof. J.B. Disanayaka (Professor of Sinhala – University of Colombo)
- Prof. Sarmad Hussain (Head – Center for Language Engineering, University of Engineering and Technology, Pakistan)
- Prof. W.M. Wijeratne – BA(Sri Lanka), MA(Sri Lanka), PhD(Ed.) (Professor of Linguistics – University of Kelaniya)
- Dr. Lalith Premarante – BSc(SL), PGDip(Col), MSc(Col), Ph.Lic(Chalmers), PhD(Chalmers) (Senior Lecturer – University of Colombo School of Computing)
- Mr. S.T. Nandasara – BDev(SL), MCS(SL), MACS, MBCS (Senior Lecturer – University of Colombo School of Computing)
- Mr. Harsha Wijewardhane – BSc(Miami), (Head – Software Development Unit, University of Colombo School of Computing)
- Dr. Sandagomi Coperahewa – BA(Col), MA(Lancaster), MPhil(Peradeniya), PhD(Cambridge) (Senior Lecturer – Department of Sinhala, University of Colombo)
- Mr. Harshula Jayasuriya (Visiting Research Associate)
- Late Prof. Tissa Jayawardana (Professor of Sinhala – University of Colombo)
- Mr. Vincent Halahakone – (Corpus Linguist)
- Mr. Dulip Herath – BSc(Col), MA(Kelaniya), MPhil(Cambridge), (Senior Research Assistant)
- Mr. Asanka Wasala – BSc(Col), PhD(Limerick), (Senior Research Assistant)
- Mr. Eranga Jayalatharachchi – BSc(Peradeniya), MSc(Col), (Research Assistant)
- Mr. Nishantha Madegoda – BSc(Col), MA(Kelaniya), MPhil(Col), (Senior Research Assistant)
- Mr. Rajathurai Premkumar – BSc(Col), (Research Assistant)
- Miss. Kumudu Gamage – BA(Kelaniya), (Linguist)
- Mrs. Ridma Ranasinge – BA(Kelaniya), (Project Assistant)
- Miss. Jeevanthi Liyanapathirana – BSc(Col), MPhil(Cambridge), (Research Assistant)
- Ms. Vindya Widanagamaachchi – BSc(Col), (Research Assistant)
- Mr. Asiri Ranasinghe – (Research Assistant)
- Mr. Prabudda Kalanameth – BA(Col), (Project Assistant)
- Mrs. Chathumini Ranasinghe – BA(Kelaniya), (Project Assistant)
- Miss. Chathuri Jayawardhane – BA(Col), (Project Assistant)
- Miss. Sajini Caldera – BA(Kelaniya), (Project Assistant)
- Miss. Sumithra Kanapathi – (Project Assistant)
- Mr. Pubudu Tharaka Viswakula – BSc(Col), (Trainee)
- Mr. Madhura Anushanga – (Trainee)
- Miss. Lakshika Nanayakkara – (Trainee)
- Mr. Mohamed Nowsad – BICT(Col), (Trainee)
- Mr. Tharindu Danushka – BSc(Col), (Trainee)
- Ms. Sudeshika Madhushani – BSc(Col), (Trainee)
- Mr. M.I. Mohomad Waseem – (Trainee)
- Mr. Ranil Ashoka – (Trainee)