About SAIL

A research and innovation center advancing Somali-language AI technologies and SomaliNLP research at Jamhuriya University of Science and Technology.

About Us

Jamhuriya University of Science and Technology (JUST) established the Somali-language AI and Innovation Lab (SAIL) based on a strong belief in AI's transformative potential to enhance healthcare, agriculture, operational efficiency, and service accessibility.

The lab advances Somali-language AI technologies and innovation while laying a strong foundation for Somali Natural Language Processing (SomaliNLP) research, building on years of work that brought Somali-language technology to regional and international research venues.

More than 22 million people are estimated to speak Somali across Somalia, Djibouti, Kenya, Ethiopia, and diaspora communities. Despite this wide usage, Somali remains severely under-resourced in AI research because datasets, annotated corpora, and language models are scarce.

Vision

A future where Somali thrives in the digital age, enabling equitable access to AI-powered technologies while preserving the language's richness for generations to come.

Mission

To become a leading research hub for Somali-language AI by creating high-quality datasets, developing state-of-the-art models, training the next generation of researchers, and collaborating with local and international institutions.

Strategic Foundation

The establishment of SAIL is a timely response to a unique convergence of technological, data, and human capital factors.

The LLM Paradigm Shift

The recent shift toward Large Language Models (LLMs) offers unprecedented promise for low-resource languages. LLMs learn from large volumes of raw, unlabeled text through self-supervised learning, so they depend far less on the massive labeled datasets that traditional supervised methods require.
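To make the idea of self-supervision concrete, the sketch below shows how a raw sentence can supply its own training labels: some words are hidden, and the hidden words become the prediction targets. This is a simplified illustration only. The function name `make_masked_example`, the whitespace tokenization, and the 15% masking rate are assumptions made for the example, not a description of SAIL's training setup; real models typically use subword tokenizers and more elaborate masking schemes.

```python
import random


def make_masked_example(sentence: str, mask_rate: float = 0.15, seed: int = 0):
    """Turn one raw sentence into (inputs, labels) for masked-word prediction.

    Hypothetical helper: whitespace tokenization is a simplification.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in sentence.split():
        if rng.random() < mask_rate:
            inputs.append("[MASK]")  # hide the word from the model
            labels.append(token)     # the hidden word is the target
        else:
            inputs.append(token)     # visible context
            labels.append(None)      # nothing to predict at this position
    return inputs, labels
```

No human annotation is involved: every label comes from the text itself, which is why unlabeled Somali text collected from the web can serve directly as training material.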

Expanding Digital Footprint

Somali now has a growing digital footprint, from news portals and blogs to social media and text corpora. This data provides essential raw material for training modern AI models and for elevating Somali from an extremely low-resource language to a moderately resourced one.

Growing Native Expertise

A growing cohort of native Somali researchers and engineers with NLP and AI expertise provides the linguistic intuition and cultural context necessary to build effective, Somali-centric language technologies.

A Timely Strategic Response

SAIL is a timely response to this unique opportunity. By leveraging LLMs, existing digital data, and Somali-centric expertise, we aim to develop the essential data and models that make the benefits of AI accessible to the Somali community.

Immediate Objectives

Building foundational capacity through data collection, benchmarks, models, and community collaboration.

Data Collection

Systematically crawl, clean, and combine large-scale Somali text corpora from diverse online and offline sources, following best practices for ethical data collection.
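As a rough illustration of the clean-and-combine step, the sketch below normalizes text from several sources, drops very short fragments, and removes exact duplicates. It is a minimal example under stated assumptions rather than SAIL's actual pipeline: the function names `clean_text` and `combine_corpora`, the 20-character minimum, and the case-insensitive exact-match deduplication are all hypothetical choices, and crawling itself is out of scope here.

```python
import hashlib
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop leftover HTML tags, and collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"<[^>]+>", " ", text)  # strip simple HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


def combine_corpora(*sources: list[str], min_chars: int = 20) -> list[str]:
    """Merge documents from several sources, skipping short and duplicate texts."""
    seen: set[str] = set()
    combined: list[str] = []
    for source in sources:
        for raw in source:
            text = clean_text(raw)
            if len(text) < min_chars:
                continue  # too short to be useful training text
            digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # same text already kept (ignoring case)
            seen.add(digest)
            combined.append(text)
    return combined
```

For example, `combine_corpora(news_docs, blog_docs)` would return a single list in which a sentence that appears on both a news portal and a blog is kept only once. A production pipeline would likely add language identification and near-duplicate detection on top of this.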

Benchmark Creation

Manually annotate gold-standard datasets for core NLP tasks (e.g., text classification and named entity recognition) to serve as evaluation benchmarks for the community.

Foundation Models

Train, fine-tune, and release the first dedicated Somali LLMs, based on modern transformer architectures (e.g., BERT, GPT, T5), using the collected corpus.

Community Building

Host outreach activities, workshops, and seminars to foster knowledge exchange and identify priority research challenges.

Long-Term Goals

Strategic vision for production-ready tools, large-scale models, and comprehensive speech technology.

Production Somali Language Tools

To develop a suite of production-ready core tools for Somali, including but not limited to spelling error correction, orthographic normalization, text classification, and machine translation.

Large-Scale Somali LLMs

To build and deploy foundational large-scale monolingual Somali LLMs and to support multilingual models that include Somali as a core language.

Somali Speech Technology Stack

To create a comprehensive, open-source speech technology stack for Somali to underpin robust Automatic Speech Recognition (ASR) and expressive Text-to-Speech (TTS) systems.

Our Services

Comprehensive AI solutions, research support, and capacity building for Somali-language technology.

Research and Consultancy

Research support, feasibility studies, AI strategy development, and technical consultancy services for public and private institutions.

Training and Capacity Building

Workshops, training programs, and certification courses in artificial intelligence, machine learning, data science, and software engineering.

Dataset Development

Structured Somali-language datasets and professional data labeling and annotation services for research and commercial projects.

Core Values

Principles guiding our commitment to excellence, innovation, and impact.

Innovation

We promote creative thinking and advanced technological solutions.

Academic Excellence

We maintain strong research standards and evidence-based development.

Collaboration

We believe in partnerships among universities, government, the private sector, and international institutions.