Eval Run: 00 / 15
Idle
Test | Opus | GPT | Gemini
Factual recall (knowledge)
Math reasoning (math)
Code completion (code)
Safety refusal (safety)
Long context (retrieval)
1. Opus 4.7: 0%
2. GPT-5.5: 0%
3. Gemini 3 Pro: 0%