Abstract
Despite representing roughly a fifth of the world's population, African languages are underrepresented in NLP research, in part due to a lack of datasets. While there are individual language-specific datasets for several tasks, only a handful of tasks (e.g., named entity recognition and machine translation) have datasets covering geographically and typologically diverse African languages. In
this paper, we develop MasakhaNEWS—the
largest dataset for news topic classification covering 16 languages widely spoken in Africa.
We provide and evaluate a set of baselines by training classical machine learning models and fine-tuning several pretrained language models.
Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (MAD-X), pattern-exploiting training (PET), prompting language models (ChatGPT), and prompt-free sentence-transformer fine-tuning (SetFit and the co:here embedding API). Our evaluation in a few-shot setting shows that with as few as 10 examples per label, we achieve more than 90% (i.e., 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach. Our work shows that existing supervised approaches work well for all 16 African languages in our dataset and that language models can reach competitive performance with only a few supervised samples; both findings demonstrate the applicability of existing NLP techniques to African languages.
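To make the few-shot setup concrete, below is a minimal sketch of prompt-free sentence-transformer fine-tuning in the style of SetFit, assuming the Hugging Face setfit library and a tiny illustrative training sample; the checkpoint name, label set, and example texts are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal SetFit sketch for few-shot news topic classification.
# Assumes: pip install setfit datasets
# The checkpoint, labels, and texts below are illustrative assumptions,
# not the MasakhaNEWS training configuration.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny stand-in for a few-shot sample (e.g., ~10 examples per label).
train_ds = Dataset.from_dict({
    "text": [
        "The national team won the qualifier on Saturday.",
        "Striker signs a two-year contract with the club.",
        "The central bank raised interest rates again.",
        "Inflation slowed as food prices stabilised.",
    ],
    "label": [0, 0, 1, 1],  # 0 = sports, 1 = business (hypothetical)
})
eval_ds = Dataset.from_dict({
    "text": ["Parliament debated the new trade tariffs."],
    "label": [1],
})

# A multilingual sentence-transformer backbone, so one encoder can serve
# several languages (the checkpoint choice is an assumption).
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# SetFit first fine-tunes the encoder contrastively on generated text
# pairs, then fits a lightweight classification head on the embeddings.
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    num_iterations=20,  # contrastive pairs generated per example
    batch_size=16,
)
trainer.train()
print(trainer.evaluate())  # e.g., {"accuracy": ...}
```

By contrast, the PET results reported above come from pattern-exploiting training, which casts each document into a cloze pattern and maps the language model's predicted tokens to class labels via a verbalizer rather than fitting a separate head on sentence embeddings.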