Abstract
Despite representing roughly a fifth of the world's population, African languages are underrepresented in NLP research, in part due to a lack of datasets. While there are individual language-specific datasets for several tasks, only a handful of tasks (e.g., named entity recognition and machine translation) have datasets covering geographically and typologically diverse African languages. In
this paper, we develop MasakhaNEWS—the
largest dataset for news topic classification covering 16 languages widely spoken in Africa.
We provide and evaluate a set of baselines by training classical machine learning models and fine-tuning several pretrained language models.
Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (MAD-X), pattern-exploiting training (PET), prompting language models (ChatGPT), and prompt-free sentence-transformer fine-tuning (SetFit and the co:here embedding API). Our evaluation in a few-shot setting shows that with as few as 10 examples per label, we achieve more than 90% (i.e., 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach. Our work shows that existing supervised approaches work well for all 16 African languages in our dataset and that language models can reach competitive performance with only a few supervised samples; both findings demonstrate the applicability of existing NLP techniques to African languages.
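To make the few-shot setup concrete, below is a minimal sketch of prompt-free sentence-transformer fine-tuning in the style of SetFit, assuming the Hugging Face setfit library and a tiny illustrative training sample; the checkpoint name, label set, and example texts are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal SetFit sketch for few-shot news topic classification.
# Assumes: pip install setfit datasets
# The checkpoint, labels, and texts below are illustrative assumptions,
# not the MasakhaNEWS training configuration.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny stand-in for a few-shot sample (e.g., ~10 examples per label).
train_ds = Dataset.from_dict({
    "text": [
        "The national team won the qualifier on Saturday.",
        "Striker signs a two-year contract with the club.",
        "The central bank raised interest rates again.",
        "Inflation slowed as food prices stabilised.",
    ],
    "label": [0, 0, 1, 1],  # 0 = sports, 1 = business (hypothetical)
})
eval_ds = Dataset.from_dict({
    "text": ["Parliament debated the new trade tariffs."],
    "label": [1],
})

# A multilingual sentence-transformer backbone, so one encoder can serve
# several languages (the checkpoint choice is an assumption).
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# SetFit first fine-tunes the encoder contrastively on generated text
# pairs, then fits a lightweight classification head on the embeddings.
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    num_iterations=20,  # contrastive pairs generated per example
    batch_size=16,
)
trainer.train()
print(trainer.evaluate())  # e.g., {"accuracy": ...}
```

By contrast, the PET results reported above come from pattern-exploiting training, which casts each document into a cloze pattern and maps the language model's predicted tokens to class labels via a verbalizer rather than fitting a separate head on sentence embeddings.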