Inception, an AI-native products company and subsidiary of G42, in collaboration with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), has introduced the AraGen Leaderboard, a framework designed to reshape how Arabic Large Language Models (LLMs) are evaluated. The framework is built around 3C3H, a newly developed metric that offers a transparent, holistic, and robust approach to assessing LLMs. Unlike traditional evaluation systems, AraGen balances factual accuracy with usability, setting new benchmarks for Arabic Natural Language Processing (NLP) and addressing the unique challenges posed by Arabic language models.
With over 400 million Arabic speakers worldwide, the AraGen Leaderboard arrives as a critical development in the field of AI. The framework is designed around the linguistic and cultural intricacies of the Arabic language. Before AraGen, benchmarks for evaluating AI models were tailored primarily to English and other widely spoken languages, leaving a significant gap in how Arabic LLMs were assessed. AraGen bridges this gap with a carefully constructed evaluation dataset that accounts for the subtleties of Arabic, including its grammar, syntax, and regional variations. The platform also tackles broader problems in AI evaluation, such as benchmark leakage, reproducibility challenges, and the absence of a single metric that captures both core knowledge and practical utility.
A key innovation introduced by the AraGen Leaderboard is the incorporation of generative tasks in its evaluation process. This is a significant departure from traditional leaderboards, which have typically relied on static benchmarks scored by likelihood accuracy, that is, by how much probability a model assigns to a predefined answer (as in multiple-choice tasks) rather than by the quality of the text it actually generates. Such methods do not adequately reflect real-world performance. By introducing generative tasks, AraGen adds a new layer to the evaluation process, allowing for a more dynamic and realistic assessment of how Arabic LLMs perform in real-world applications. This shift encourages the development of models that are better equipped to handle the complex, generative nature of language use.
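To make the contrast concrete, here is a toy sketch of the two scoring styles: likelihood-based multiple-choice accuracy versus judging a free-form generation. Everything below is a hard-coded stand-in for illustration, not AraGen's actual pipeline; `toy_logprob` and `toy_judge` are hypothetical.

```python
# Toy stand-in for a model's log-probability of each candidate answer.
# A real benchmark queries an actual LLM; values here are hard-coded
# purely to illustrate the scoring mechanics.
def toy_logprob(question: str, candidate: str) -> float:
    scores = {"Cairo": -0.2, "Riyadh": -2.1, "Amman": -3.0}
    return scores.get(candidate, -10.0)

def likelihood_accuracy(question: str, choices: list[str], gold: str) -> float:
    """Static scoring: pick the candidate with the highest model
    log-probability and compare it to the gold answer."""
    prediction = max(choices, key=lambda c: toy_logprob(question, c))
    return float(prediction == gold)

def generative_score(response: str, judge) -> float:
    """Generative scoring: a judge rates the model's free-form answer
    (AraGen's judge scores six dimensions; this toy judge returns one)."""
    return judge(response)

q = "What is the capital of Egypt?"
toy_judge = lambda r: 1.0 if "Cairo" in r else 0.0  # hypothetical judge
print(likelihood_accuracy(q, ["Cairo", "Riyadh", "Amman"], "Cairo"))  # 1.0
print(generative_score("The capital of Egypt is Cairo.", toy_judge))  # 1.0
```

The multiple-choice path never looks at the model's own wording, which is exactly the limitation generative evaluation addresses.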
The AraGen Leaderboard evaluates models across six critical dimensions: correctness, completeness, conciseness, helpfulness, honesty, and harmlessness (the three C's and three H's that give the 3C3H metric its name). These dimensions ensure that models are not only accurate but also useful and reliable in real-world applications. The evaluation set comprises 279 questions spanning areas such as Arabic grammar, general Q&A, reasoning, and safety, covering the diverse challenges faced by Arabic-speaking users. The leaderboard is also updated quarterly, introducing new challenges and tasks to keep the evaluation process relevant and to encourage continuous improvement in model performance.
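As a minimal sketch of how six per-dimension scores might roll up into a single leaderboard number, the snippet below normalizes each dimension to [0, 1] and takes an equal-weight mean. This is an assumption for illustration only; AraGen's published 3C3H formula may weight or gate the dimensions differently.

```python
from statistics import mean

# The six dimensions named by the 3C3H metric.
DIMENSIONS = ("correctness", "completeness", "conciseness",
              "helpfulness", "honesty", "harmlessness")

def aggregate_3c3h(scores: dict[str, float]) -> float:
    """Illustrative composite: equal-weight mean of per-dimension
    scores in [0, 1]. Not AraGen's official formula."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return mean(scores[name] for name in DIMENSIONS)

example = {
    "correctness": 1.0, "completeness": 0.8, "conciseness": 0.6,
    "helpfulness": 0.9, "honesty": 1.0, "harmlessness": 1.0,
}
print(round(aggregate_3c3h(example), 3))  # 0.883
```

An equal-weight mean is the simplest defensible aggregate; a production metric could instead gate on correctness (a wrong answer scoring zero regardless of style), which is one plausible reading of "balancing factual accuracy with usability."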
The dynamic nature of the AraGen Leaderboard is one of its defining features. Unlike traditional benchmarks that often remain static, AraGen evolves with the field of AI. The quarterly updates and the invitation for public submissions make it much harder for organizations to “game” the system, ensuring that the leaderboard remains a true reflection of a model’s capabilities. This dynamic approach also helps maintain the relevance of the leaderboard as the AI landscape continues to evolve.
Professor Preslav Nakov, Department Chair of Natural Language Processing at MBZUAI, emphasized the significance of the AraGen Leaderboard, saying, “AraGen is a major step towards open, collaborative, and reproducible evaluation of large language models for Arabic, with a focus on their text generation capabilities. This contrasts with popular leaderboards, which rely primarily on multiple-choice questions. Moreover, AraGen is a dynamic board with new questions every three months, which makes it much harder to game compared to existing leaderboards.” His remarks underscore the reproducibility and collaborative evaluation that are central to the AraGen framework’s design.
Ali El Filali, a Machine Learning Engineer at Inception and the lead author of the project, explained the motivation behind AraGen’s development: “Our aim was to create a benchmark that introduces generative task evaluation with a strong emphasis on transparency, reproducibility, and rigorous performance measurement. By evaluating models across multiple dimensions, we assess both factuality and usability, providing actionable insights for diverse NLP tasks. AraGen empowers the Arabic AI community to develop safe, high-performing models that meet real-world needs while prioritizing equity and inclusion for underrepresented languages.”
One of the key benefits of the AraGen Leaderboard is the detailed performance insights it provides. This enables organizations to make informed decisions when selecting AI models that align with their specific requirements. For organizations that do not have the resources for extensive internal testing, AraGen offers a cost-effective alternative by providing a clear, relevant metric for evaluating LLMs. Additionally, the transparent and reproducible nature of the platform builds trust within the Arabic AI community, ensuring that the models developed are both reliable and culturally sensitive.
The AraGen Leaderboard sets a new precedent for AI benchmarks, not just for Arabic, but for underrepresented languages in general. By focusing on fairness, inclusivity, and the unique characteristics of Arabic, AraGen offers a roadmap for how AI benchmarks can evolve to prioritize equity and inclusivity. The leaderboard’s dynamic approach and comprehensive evaluation criteria are a model for how AI evaluation can better serve the diverse needs of the global AI community. This initiative positions the Arabic-speaking world at the forefront of the AI revolution, ensuring that no language or culture is left behind as AI continues to evolve.