Pashto making inroads into modern computational linguistics

Prof Omar Usman Marwat

Languages around the world were being integrated into technology to serve communities better, and Pashto was fast heading towards that integration, said experts.

They said that it would open up new avenues for scholars and linguists worldwide, and that the work would help to expand the analysis of Pashto and its dialects, develop new software tools and pave the way for standardised scripts.

The most recent achievements in this regard were the activities carried out under an MoU between Pashto Academy, University of Peshawar, and the FAST-NUCES Centre for Computational Linguistics (CoCL), inked earlier this year.

Prof Omar Usman Marwat, an expert on computational linguistics, told this scribe that the collaboration had already resulted in the preparation of an online dictionary and thesaurus of Pashto, Pashto part-of-speech taggers, and a Pashto Treebank for grammar checking. He said that the work would eventually contribute to the fields of Pashto dialectology, machine translation, lexicology, morphology and phonology, and would be available online under open licences.

Expert says 20 per cent vocabulary of online dictionary already uploaded on Pashto Academy’s website

Prof Omar said that the ‘Pashto to Pashto’ dictionary contained a total of 98,000 words and terms, compiled painstakingly over many decades by the Pashto Academy, and that efforts were underway to make that wealth of information available online within the current year.

He said that about 20 per cent of the vocabulary had already been uploaded and could be searched on the Pashto Academy’s website.

The expert stated that field visits for the preparation of an online speech corpus had also been carried out. Hundreds of speech samples of the Afridi, Shinwari, Malagori, Khattak, Yousafzai, Peshawar, Charsadda, Swabi, Mardan, Swat, Shangla, Wazir, Mehsud, Marwat, Betani, Kakar, Banusi and Dawar dialects had been acquired and were being transcribed according to the International Phonetic Association (IPA) standard.

“Dialect analysis by linguists and their findings have already been presented at a dialect conference in Baragali earlier this month. The speech corpus will be made available online and will pave the way for Pashto dialect analysis and development of speech recognition software,” said Prof Omar.

Experts said that the provision of those basic datasets was important for the shift towards Natural Language Processing (NLP). They said that NLP was an interdisciplinary field which made use of artificial intelligence techniques to help computers read, decipher and understand human languages in a manner that was valuable to the community.

Prof Omar said that the provision of basic tools and datasets would elevate the status of Pashto from a resource-limited to a well-resourced language, because those developments could open opportunities for the research community to create software that could convert text to speech, speech to text, images to text, and electro-medical signals to various actions.

Published in Dawn, July 24th, 2019