SomaliNLP Resources

A collection of high-quality datasets, pre-trained models, and specialized linguistic tools for the Somali language.

Datasets

Directly created by our team

Somali LLM Fine-tuning Dataset

Fine-tuning, LLM, Task-specific

Curated dataset for fine-tuning Somali language models on specific tasks.

Somali LLM Pretraining Dataset

Pretraining, LLM, Language Modeling

Large-scale pretraining dataset for Somali language models.

Somali Lemmatization Corpus

Lemmatization, Corpus, Morphology

A comprehensive corpus for Somali lemmatization, built through crowdsourcing.

Contributed Resources

African languages collaboration

Somali Question Answering Dataset (Multimodal)

Question Answering, Multimodal, QA

Multimodal question answering dataset for Somali language understanding.

Somali Machine Translation Dataset

Machine Translation, Parallel Data, Masakhane

Parallel corpus for machine translation involving Somali and other African languages.

Somali News Classification Dataset

News Classification, Masakhane, Text Classification

News articles dataset for classification tasks, part of the Masakhane African languages initiative.

Models

Pre-trained and fine-tuned models for Somali language understanding and generation.

SomBERTb

BERT, Transformers, Coming Soon

Enhanced BERT model for Somali language tasks (Coming soon).

SomT5

T5, Generative AI, Seq2Seq

T5-based model fine-tuned for Somali language generation and understanding tasks.

SomBERTa

BERT, Transformers, Fake News Detection

BERT-based model specifically trained for Somali language understanding and fake news detection.

Other Tools

Specialized tools and databases for Somali language processing.

Somali Lexical Database

Lexicon, Database, Linguistics

Comprehensive lexical database for Somali language resources and linguistic analysis.