ImageEval 2026

Cultural Grounding in Arabic Multimodal Generation and Understanding

Task 2: CRAI-Bench - Cultural Accuracy Evaluation for Arabic Text-to-Image Generation

CRAI-Bench evaluates whether AI-generated images faithfully represent Qatari and Arab cultural scenes. The task introduces the Cultural Representation Accuracy Index (CRAI), a five-dimensional scoring framework for measuring cultural accuracy in text-to-image generation.

Task Description

Objective: Given a reference image of a Qatari cultural scene, a culturally grounded image caption, and an AI-generated image produced from that caption, participants produce a CRAI score in the range [0, 1] for each of five dimensions, together with a weighted composite score.

Dataset: The dataset consists of triples, each containing a reference image, a caption, and a generated image. All triples are grounded in Qatari culture and fall into five categories: people and traditional attire; objects and natural elements; architecture and built environment; cultural activities and practices; and historical and cultural context.

Submission format: Submit a TSV (tab-separated values) file with one scored instance per line. Each line contains the following fields:

  • id
  • CRAI_CEA — Cultural Element Accuracy, 0–1
  • CRAI_CC — Contextual Coherence, 0–1
  • CRAI_CS — Cultural Specificity, 0–1
  • CRAI_CI — Cultural Integrity, 0–1
  • CRAI_HP — Hallucination Penalty, 0–1
  • CRAI_composite — weighted composite score, 0–1
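
The field list above could be written out with a short script like the following. This is a minimal sketch, not an official submission tool: the header row, the four-decimal formatting, the function name write_submission, and the sample id are all assumptions made for illustration.

```python
import csv
import io

# Column order follows the field list above.
COLUMNS = ["id", "CRAI_CEA", "CRAI_CC", "CRAI_CS", "CRAI_CI", "CRAI_HP", "CRAI_composite"]


def write_submission(rows, handle):
    """Write scored instances as tab-separated lines, one instance per line.

    `rows` is an iterable of dicts keyed by COLUMNS. Emitting a header row
    and rounding to four decimals are assumptions, not stated requirements.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        for name in COLUMNS[1:]:
            # Every score field is documented as lying in the range 0-1.
            if not 0.0 <= row[name] <= 1.0:
                raise ValueError(f"{name} out of range for id {row['id']}: {row[name]}")
        writer.writerow([row["id"]] + [f"{row[name]:.4f}" for name in COLUMNS[1:]])


# Usage with one made-up instance (the id and scores are illustrative only).
example = {
    "id": "sample_001",
    "CRAI_CEA": 0.8, "CRAI_CC": 0.7, "CRAI_CS": 0.6,
    "CRAI_CI": 0.9, "CRAI_HP": 0.1, "CRAI_composite": 0.67,
}
buffer = io.StringIO()
write_submission([example], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="", encoding="utf-8") in place of the in-memory buffer.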

Evaluation: Systems are ranked by Spearman correlation and mean absolute error against human-annotated gold CRAI scores, reported both overall and per dimension.
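
For local sanity checks, the two metrics named above can be approximated with the standard library alone. The sketch below is not the official scorer: it computes Spearman correlation as the Pearson correlation of average ranks (ties share the mean of their positions) and MAE as the mean of absolute differences. How the organizers combine the two metrics into a final ranking is not specified here, so the code reports them separately. All function names and the sample scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(pred, gold):
    """Spearman rank correlation between predicted and gold scores."""
    return pearson(average_ranks(pred), average_ranks(gold))


def mean_absolute_error(pred, gold):
    """Mean of |prediction - gold| over all instances."""
    return mean(abs(p - g) for p, g in zip(pred, gold))


# Made-up scores for a single dimension, for illustration only.
pred = [0.10, 0.40, 0.35, 0.80]
gold = [0.20, 0.50, 0.30, 0.90]
print(spearman(pred, gold), mean_absolute_error(pred, gold))
```

Running the same two functions once per dimension column, and once on the composite, mirrors the "overall and per dimension" reporting described above.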

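Composite score: The weighted composite (CRAI_composite) combines the five dimension scores, with the Hallucination Penalty subtracted:

CRAI = 0.30 CEA + 0.20 CC + 0.20 CS + 0.20 CI - 0.10 HP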
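
The formula above translates directly into code. The sketch below uses the stated weights as-is. With all inputs in [0, 1], the expression ranges from -0.10 to 0.90, so it can fall slightly below 0 when HP is high and the other scores are low. The optional clamp flag is an assumption added for illustration, not part of the task definition.

```python
def crai_composite(cea, cc, cs, ci, hp, clamp=False):
    """Weighted composite: 0.30 CEA + 0.20 CC + 0.20 CS + 0.20 CI - 0.10 HP.

    All inputs are expected in [0, 1]. `clamp` (an assumption, not part of
    the stated formula) limits the result to [0, 1].
    """
    for name, value in (("CEA", cea), ("CC", cc), ("CS", cs), ("CI", ci), ("HP", hp)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    score = 0.30 * cea + 0.20 * cc + 0.20 * cs + 0.20 * ci - 0.10 * hp
    if clamp:
        score = min(1.0, max(0.0, score))
    return score


# Worked example with made-up dimension scores:
# 0.30*0.8 + 0.20*0.7 + 0.20*0.6 + 0.20*0.9 - 0.10*0.1
#   = 0.24 + 0.14 + 0.12 + 0.18 - 0.01 = 0.67 (up to floating-point rounding)
print(round(crai_composite(0.8, 0.7, 0.6, 0.9, 0.1), 4))
```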

Score bands: 0.85 or higher is highly accurate; 0.70–0.84 is mostly accurate; 0.50–0.69 is moderate; 0.30–0.49 is weak; below 0.30 is poor or misleading.
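
The bands above can be applied with a simple threshold lookup. The bands are stated to two decimals, so the sketch below assumes each band's lower bound is inclusive and that a value such as 0.845 falls into the lower band. The function name crai_band is hypothetical.

```python
def crai_band(score):
    """Map a composite CRAI score to its descriptive band.

    Assumption: thresholds are inclusive lower bounds, so scores between
    the printed band edges (e.g. 0.845) fall into the lower band.
    """
    if score >= 0.85:
        return "highly accurate"
    if score >= 0.70:
        return "mostly accurate"
    if score >= 0.50:
        return "moderate"
    if score >= 0.30:
        return "weak"
    return "poor or misleading"


# Usage with made-up scores.
for value in (0.91, 0.72, 0.55, 0.31, 0.12):
    print(value, crai_band(value))
```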