Title: Somali Corpus – towards 5 million tagged words: challenges and opportunities
Date: 24 February 2017
Venue: Addis Ababa University, OCR 119
Hosted by the College of Humanities, Language Studies, Communication and Journalism, Department of Linguistics.
The development of IT resources for languages focuses mainly on well-described languages with long-standing written traditions and large numbers of speakers. One of the main challenges for languages with more recent written traditions is the lack of sufficient data for successful statistical approaches. This descriptive talk presents the current state of the construction of the Redsea Cultural Foundation’s Somali Corpus (RCF-SC), in collaboration with the Oriental University of Naples, and the development of a series of computer programs for analyzing the corpus data for various purposes. The core of RCF-SC is unique in Somali-speaking countries and aims to be, for Somali, a resource equivalent in quality to the British National Corpus. The first edition of the corpus, containing almost 5 million tagged and grammatically annotated words, is online at www.somalicorpus.com.