Somali-language AI and Innovation Lab — Pioneering the digital frontier for Somali language through cutting-edge AI research and innovation.

Jamhuriya University of Science and Technology
Mogadishu, Somalia
sail@just.edu.so
+252-61-2223999

2026 SAIL - Somali-language AI and Innovation Lab. All rights reserved.

NLP · Completed

MasakhaNEWS: News Topic Classification for African Languages

March 7, 2026
SAIL Team

Abstract

Despite being spoken by roughly a fifth of the world's population, African languages are underrepresented in NLP research, in part due to a lack of datasets. While there are individual language-specific datasets for several tasks, only a handful of tasks (e.g. named entity recognition and machine translation) have datasets covering geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, the largest dataset for news topic classification, covering 16 languages widely spoken in Africa. We provide and evaluate a set of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (MAD-X), pattern exploiting training (PET), prompting language models (ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the co:here embedding API). Our evaluation in a few-shot setting shows that with as few as 10 examples per label, the PET approach achieves 86.0 F1 points, which is more than 90% of the performance of fully supervised training (92.6 F1 points). Our work shows that existing supervised approaches work well for all of the African languages evaluated and that language models with only a few supervised samples can reach competitive performance. Both findings demonstrate the applicability of existing NLP techniques to African languages.
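
The few-shot setting described in the abstract can be illustrated with a minimal sketch, which is not the authors' code. It assumes the data is available as (text, label) pairs; the function names `sample_few_shot` and `relative_performance` and the toy labels are hypothetical and chosen only for illustration. The sketch draws a fixed number of examples per label and then checks the "more than 90%" figure from the two F1 scores quoted in the abstract.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k, seed=0):
    """Return up to k (text, label) pairs per label (hypothetical helper)."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


def relative_performance(few_shot_f1, full_f1):
    """Few-shot F1 expressed as a percentage of the fully supervised F1."""
    return 100.0 * few_shot_f1 / full_f1


# Toy stand-in for a news dataset; the labels here are illustrative only.
toy = [
    (f"article {i}", label)
    for label in ("health", "politics", "sports")
    for i in range(25)
]

subset = sample_few_shot(toy, k=10)
print(len(subset))  # 30 (10 examples for each of the 3 labels)

# F1 scores quoted in the abstract: PET with 10 shots vs. full supervision.
print(round(relative_performance(86.0, 92.6), 1))  # 92.9
```

Sampling per label, rather than from the pooled data, keeps the small training set balanced across topics, which matters when only a handful of examples are available.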

Related Projects

Research Paper
NLP

Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform

Somali NLP Engine
AI/NLP

Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Large Language Models

OCR System
NLP

CIRAL: A Test Collection for CLIR Evaluation in African Languages