ishidalab/capcode
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?