
ai-test-case

Blade ★ 133 updated 3mo ago

Test cases for AI programming

What This Repo Does

This is a collection of coding challenges designed to test how well AI models can write code. Instead of simple puzzles, each challenge asks an AI to build a real, working application—everything from a polished website redesign to a physics simulator to a command-line tool. The goal is to see how well modern AI can handle realistic software projects that a human developer might actually work on.

How It Works

The repo contains seven test cases, each with a starter file or example and a detailed prompt (instructions written in plain English). When you feed one of these prompts to an AI coding assistant like Claude or ChatGPT, the AI attempts to build the entire project, working from the prompt and whatever starter material the challenge provides. A human then judges whether the AI's code actually works, looks good, and does what was asked. It's a way to benchmark AI programming ability against real-world tasks rather than theoretical problems.
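The judging step could be recorded in a small script so that results from different models stay comparable. The sketch below is illustrative only and is not part of the repo: the Judgment class, the tally function, and the model and challenge names are all hypothetical, and the three yes/no criteria simply mirror the questions above (does it work, does it look good, does it do what was asked).

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One human verdict on one model's attempt at one challenge (hypothetical schema)."""
    model: str
    challenge: str
    works: bool         # does the generated code actually run?
    looks_good: bool    # is the result visually acceptable?
    meets_prompt: bool  # does it do what the prompt asked?

    def score(self) -> int:
        # One point per criterion met, so a score ranges from 0 to 3.
        return sum([self.works, self.looks_good, self.meets_prompt])


def tally(judgments):
    """Sum the scores per model across all judged challenges."""
    totals = {}
    for j in judgments:
        totals[j.model] = totals.get(j.model, 0) + j.score()
    return totals


if __name__ == "__main__":
    # Placeholder names and verdicts, not real results.
    results = [
        Judgment("model-a", "landing-page", True, True, False),
        Judgment("model-b", "landing-page", True, False, False),
    ]
    print(tally(results))
```

Equal weighting of the three criteria is an arbitrary choice here; a reviewer who cares most about correctness might weight "works" more heavily.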

The challenges span different types of work: frontend design (redesigning a landing page), 3D graphics (building a gravity simulator), game development (making Angry Birds), full-stack migration (converting a PHP app to JavaScript), physics simulation (Maxwell's Demon), native desktop apps (a presentation tool for macOS), and command-line tools (a word cloud generator). Some challenges explicitly forbid build tools and package managers, forcing the AI to work with just HTML, CSS, and JavaScript—constraints that make the task harder and more interesting.
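For the challenges that forbid build tools and package managers, a reviewer could scan the files an AI produced before judging them. The sketch below assumes a list of relative file paths from a submission. The FORBIDDEN_NAMES set and the find_violations function are hypothetical: the repo does not define such a check, and the file names are just common markers of Node-based tooling.

```python
from pathlib import PurePosixPath

# Assumed markers of package managers or build tools; adjust per challenge.
FORBIDDEN_NAMES = {
    "package.json",
    "package-lock.json",
    "yarn.lock",
    "node_modules",
    "webpack.config.js",
    "vite.config.js",
}


def find_violations(relative_paths):
    """Return the paths whose file or directory names suggest build tooling."""
    hits = []
    for path in relative_paths:
        # Check every path component so files nested in node_modules are caught.
        parts = PurePosixPath(path).parts
        if any(part in FORBIDDEN_NAMES for part in parts):
            hits.append(path)
    return hits


if __name__ == "__main__":
    submission = ["index.html", "style.css", "app.js", "package.json"]
    print(find_violations(submission))
```

A name-based scan like this only catches obvious cases; it would not notice, for example, a bundled library pasted directly into a JavaScript file.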

Who Uses This and Why

This repo is useful for anyone trying to compare AI coding models on the same tasks. Researchers and tech companies use test suites like this to measure progress. If you're choosing between different AI assistants for coding tasks, running them through these challenges gives you side-by-side evidence of how each one performs, though the final verdict still rests on human judgment. Content creators also use these: the repo credits Alejandro AO, who has made YouTube videos comparing AI models on exactly these kinds of tests.

The test cases themselves are practical: they're inspired by real evaluation frameworks and real coding tutorials. You could theoretically use these prompts to have an AI build actual applications you might want, though the intent is measurement rather than production use.