Task 2: CRAI-Bench - Cultural Accuracy Evaluation for Arabic Text-to-Image Generation
CRAI-Bench evaluates whether AI-generated images faithfully represent Qatari and Arab cultural scenes. The task introduces the Cultural Representation Accuracy Index (CRAI), a five-dimensional scoring framework for measuring cultural accuracy in text-to-image generation.
Task Description
Objective: Given a reference image of a Qatari cultural scene, a culturally grounded image caption, and an AI-generated image produced from that caption, participants produce a CRAI score in the range [0, 1] for each of five dimensions.
Dataset: The dataset consists of (reference image, caption, generated image) triples grounded in Qatari culture and spanning five categories: people and traditional attire; objects and natural elements; architecture and built environment; cultural activities and practices; and historical and cultural context.
Submission format: Submit a TSV file with one scored instance per line:
id
CRAI_CEA: Cultural Element Accuracy, 0–1
CRAI_CC: Contextual Coherence, 0–1
CRAI_CS: Cultural Specificity, 0–1
CRAI_CI: Cultural Integrity, 0–1
CRAI_HP: Hallucination Penalty, 0–1
CRAI_composite: weighted composite score, 0–1
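The submission layout above can be sketched as a small writer. This is an illustration only: the function name `write_submission`, the presence of a header row, the four-decimal number formatting, and the sample id `sample_001` are all assumptions, not requirements stated by the task.

```python
import csv
import io

# Column order taken from the field list above.
COLUMNS = ["id", "CRAI_CEA", "CRAI_CC", "CRAI_CS", "CRAI_CI", "CRAI_HP", "CRAI_composite"]


def write_submission(rows, handle, header=True):
    """Write one scored instance per line as tab-separated values.

    `rows` is an iterable of dicts keyed by COLUMNS. Whether the organizers
    expect a header row is an assumption; pass header=False to omit it.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(COLUMNS)
    for row in rows:
        scores = [row[name] for name in COLUMNS[1:]]
        # Every score field is documented as lying in the range 0-1.
        for name, value in zip(COLUMNS[1:], scores):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range for {row['id']}: {value}")
        writer.writerow([row["id"]] + [f"{value:.4f}" for value in scores])


# Hypothetical instance; the id and the scores are made up for illustration.
example = [{
    "id": "sample_001",
    "CRAI_CEA": 0.8, "CRAI_CC": 0.7, "CRAI_CS": 0.6,
    "CRAI_CI": 0.9, "CRAI_HP": 0.1, "CRAI_composite": 0.67,
}]
buffer = io.StringIO()
write_submission(example, buffer)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as `handle`.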
Evaluation: Systems are ranked by Spearman correlation and mean absolute error against human-annotated gold CRAI scores, reported both overall and per dimension.
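For local validation, the two ranking metrics can be approximated as follows. This is a minimal pure-Python sketch, not the official scorer: the helper names are hypothetical, and tie handling by average ranks is an assumption about how the organizers compute Spearman correlation.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(pred, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(gold))


def mean_absolute_error(pred, gold):
    """Mean of the absolute differences between predictions and gold scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


# Made-up scores for one dimension, for illustration only.
gold_scores = [0.9, 0.4, 0.7, 0.2]
pred_scores = [0.8, 0.5, 0.6, 0.3]
rho = spearman(pred_scores, gold_scores)
mae = mean_absolute_error(pred_scores, gold_scores)
```

Running the same two functions once per dimension and once on the composite column would mirror the "overall and per dimension" reporting described above.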
Composite score: CRAI = 0.30 × CEA + 0.20 × CC + 0.20 × CS + 0.20 × CI − 0.10 × HP
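The composite formula can be written out directly. The function name `crai_composite` is hypothetical. Note that with these weights the raw value ranges from -0.10 (all dimensions 0, HP 1) to 0.90 (all dimensions 1, HP 0); the source does not say whether the result is clipped or rescaled to 0–1, so this sketch leaves it untouched.

```python
def crai_composite(cea, cc, cs, ci, hp):
    """Weighted composite from the five dimension scores, each in [0, 1].

    Four dimensions contribute positively; the Hallucination Penalty (HP)
    is subtracted. No clipping is applied, since the source does not
    specify any.
    """
    return 0.30 * cea + 0.20 * cc + 0.20 * cs + 0.20 * ci - 0.10 * hp


# Made-up dimension scores for illustration.
composite = crai_composite(cea=0.8, cc=0.7, cs=0.6, ci=0.9, hp=0.1)
```

Here the contributions are 0.24 + 0.14 + 0.12 + 0.18 - 0.01, giving a composite of about 0.67.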
Score bands: 0.85 or higher is highly accurate; 0.70–0.84 is mostly accurate; 0.50–0.69 is moderate; 0.30–0.49 is weak; below 0.30 is poor or misleading.
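A score can be mapped to its band with a simple threshold chain. The function name `crai_band` is hypothetical, and treating the bands as half-open intervals (so that a value such as 0.845 falls under "mostly accurate") is an assumption, since the listed ranges are given to two decimals only.

```python
def crai_band(score):
    """Return the descriptive band for a CRAI score.

    Thresholds follow the listed bands; values between two-decimal
    boundaries are assigned to the lower band (an assumption).
    """
    if score >= 0.85:
        return "highly accurate"
    if score >= 0.70:
        return "mostly accurate"
    if score >= 0.50:
        return "moderate"
    if score >= 0.30:
        return "weak"
    return "poor or misleading"


# Made-up scores for illustration.
bands = [crai_band(s) for s in (0.91, 0.72, 0.67, 0.35, 0.10)]
```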