This is an online seminar series organized by the Sasahara Lab at Science Tokyo (Institute of Science Tokyo). It invites researchers from Japan and abroad who work in Computational Social Science (CSS) and related fields to discuss their latest research findings and methodologies. Topics include big data analysis, online surveys and experiments, computational modeling, social simulation, and the relationship between AI and society, among others.
Speaker: Dr. Manuel Cebrian (Spanish National Research Council)
Date and time: Monday, May 12, 2025, 17:00-18:00 JST
Format: Online (Zoom)
Title: General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Abstract:
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE/)
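Speaker profile:
Manuel Cebrian is a senior researcher at the Spanish National Research Council, working in computational social science and artificial intelligence. He has previously worked at institutions including the MIT Media Lab and the Max Planck Institute, and his many achievements include winning the DARPA Network Challenge and proposing Machine Behavior as a field of study.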
For inquiries about this seminar, please contact us here.