
A research and innovation center advancing Somali-language AI technologies and SomaliNLP research at Jamhuriya University of Science and Technology.
Jamhuriya University of Science and Technology (JUST) established the Somali-language AI and Innovation Lab (SAIL) based on a strong belief in AI's transformative potential to enhance healthcare, agriculture, operational efficiency, and service accessibility.
The lab advances Somali-language AI technologies and innovation while laying a strong foundation for Somali Natural Language Processing (SomaliNLP) research, building on years of work that brought Somali-language technology to regional and international research venues.
More than 22 million people are estimated to speak Somali across Somalia, Djibouti, Kenya, Ethiopia, and diaspora communities. Despite this wide usage, Somali remains severely under-resourced in AI research because datasets, annotated corpora, and language models are scarce.
A future where Somali thrives in the digital age, enabling equitable access to AI-powered technologies while preserving the language's richness for generations to come.
To become a leading research hub for Somali-language AI by creating high-quality datasets, developing state-of-the-art models, training the next generation of researchers, and collaborating with local and international institutions.
The recent shift toward Large Language Models (LLMs) offers unprecedented promise for low-resource languages. LLMs learn from large volumes of raw text through self-supervised learning, so they depend far less on the massive labeled datasets that traditional supervised methods require.

Somali now has a growing digital footprint, from news portals and blogs to social media and text corpora. This data provides the raw material needed to train modern AI models and to move Somali from extremely low-resourced to moderately resourced.

A growing cohort of native Somali researchers and engineers with NLP and AI expertise provides the linguistic intuition and cultural context necessary to build effective, Somali-centric language technologies.

SAIL is a timely response to this unique opportunity. By leveraging LLMs, existing digital data, and Somali-centric expertise, we aim to develop essential data and models that make AI benefits accessible to the Somali community.

Systematically crawl, clean, and combine large-scale Somali text corpora from diverse online and offline sources, following best practices for ethical data collection.
Manually annotate gold-standard datasets for core NLP tasks (e.g., text classification and named entity recognition) to serve as evaluation benchmarks for the community.
Train, fine-tune, and release the first dedicated Somali LLMs, based on modern transformer architectures (e.g., BERT, GPT, and T5), using the collected corpus.
Host outreach activities such as workshops and seminars to foster knowledge exchange and identify priority research challenges.
To develop a suite of production-ready core tools for Somali, including but not limited to spelling error correction, orthographic normalization, text classification, and machine translation.
To build and deploy foundational large-scale monolingual Somali LLMs and support multilingual models that include Somali as a core language.
To create a comprehensive, open-source speech technology stack for Somali to underpin robust Automatic Speech Recognition (ASR) and expressive Text-to-Speech (TTS) systems.
Research support, feasibility studies, AI strategy development, and technical consultancy services for public and private institutions.
Workshops, training programs, and certification courses in artificial intelligence, machine learning, data science, and software engineering.
Structured Somali-language datasets and professional data labeling and annotation services for research and commercial projects.
We promote creative thinking and advanced technological solutions.
We maintain strong research standards and evidence-based development.
We believe in partnerships between universities, government, private sector, and international institutions.