Overview
App-Bench evaluates how well AI-driven tools can automatically build modern web applications. The benchmark consists of full-stack app-building tasks from economically important domains—healthcare, real estate, finance, legal services, and education.
Each task exercises core features found in real-world software: integrated AI assistants, real-time synchronization, multi-role logic, automated triggers, and robust authentication flows.
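To make these requirements concrete, here is a minimal sketch of the kind of role check the multi-role tasks demand, assuming an Express-style backend; the routes, role names, and middleware are illustrative assumptions, not part of the benchmark harness.

```ts
import express from "express";

type Role = "admin" | "nurse" | "physician";

// Shape of a request after some upstream auth layer has attached the user.
interface AuthedRequest extends express.Request {
  user?: { id: string; role: Role };
}

// Middleware factory: reject requests whose user lacks an allowed role.
function requireRole(...allowed: Role[]) {
  return (req: express.Request, res: express.Response, next: express.NextFunction) => {
    const user = (req as AuthedRequest).user;
    if (!user) {
      res.status(401).json({ error: "not authenticated" });
      return;
    }
    if (!allowed.includes(user.role)) {
      res.status(403).json({ error: "insufficient role" });
      return;
    }
    next();
  };
}

const app = express();

// Nurses and physicians may update bed status; only admins manage wards.
app.patch("/beds/:id", requireRole("nurse", "physician"), (_req, res) => {
  res.json({ ok: true });
});
app.post("/wards", requireRole("admin"), (_req, res) => {
  res.json({ ok: true });
});

app.listen(3000);
```

Passing the multi-role tasks means enforcing checks like this on every role-gated endpoint, not merely hiding controls in the client UI.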
Web App Tasks
- Financial Dashboard: Bloomberg terminal-style dashboard with real-time stock prices, interactive charts with date-range selection, live financial news, AI chat with web search, and a public real-time forum.
- Hospital Dashboard: Multi-role patient tracking board for admins, nurses, and physicians with real-time bed status updates, presence indicators, emergency alert banners, and a shared forum chat.
- Legal Assistant: AI legal assistant with RAG over an updatable document knowledge base, voice dictation, @-referencing of documents, and integrated web search.
- Pharmacy System: Multi-user pharmacy platform where patients browse and order medications while pharmacists manage inventory, process orders, and communicate via private real-time messaging.
- Drawing Game: Multiplayer vocabulary-drawing game with turn-taking, real-time canvas sync (see the sketch after this list), public room lobbies, score tracking, and a drawing replay system.
- Rental Booking: Airbnb-style rental marketplace with property browsing, search filters, booking flow with payments, and a media upload system for photos and videos.
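Four of the six tasks hinge on real-time fan-out of small events: stock ticks, bed status changes, chat messages, and canvas strokes. The sketch below shows the core pattern using the `ws` package for Node; broadcasting every message to every peer is a deliberate simplification, since the actual tasks also require rooms, authentication, and presence tracking.

```ts
import { WebSocketServer, WebSocket } from "ws";

// Relay server: every message a client sends is rebroadcast to all
// other open connections, so peers see updates as they happen.
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data.toString());
      }
    }
  });
});
```

A full implementation would scope broadcasts to a room and attach user identity, which is what the presence indicators and private messaging above build on.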
Key Findings
Overall, the results showed substantial room for improvement. No tool solved every task perfectly; each approach left significant gaps. On several challenging apps, only 2–3 builders implemented more than 50% of the required features, and the best-performing builder implemented roughly 77% of required features.
Web-based prompt-to-app builders achieved slightly higher average scores, reliably handling boilerplate features such as authentication and standard UI scaffolding. They struggled, however, with custom integrations their platforms do not natively support.
CLI code agents showed greater variance, both across tools and within a single tool from task to task. Web builders scored more consistently, reliably building certain features while repeatedly failing on others, and produced fewer failed deployments overall.
Common Failures
Analysis of 150+ generation attempts revealed several failure modes: missing or misimplemented features, multi-role flow breakdowns, runtime errors from outdated API usage, and UI/UX glitches affecting usability.
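As one illustration of the outdated-API failure mode (the analysis does not single out a specific SDK, so the example below is an assumption): a builder that learned the v3 OpenAI Node SDK can emit code that crashes against the current v4 package, which replaced the Configuration/OpenAIApi client with a default-exported OpenAI class.

```ts
// Hypothetical example of the stale-SDK failure mode. The v3-style
// call below throws at runtime against the v4 "openai" package:
//
//   import { Configuration, OpenAIApi } from "openai";  // v3 exports, removed in v4
//   const api = new OpenAIApi(new Configuration({ apiKey }));
//   const res = await api.createChatCompletion({ /* ... */ });
//
// The v4 equivalent:
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini", // model name is illustrative
  messages: [{ role: "user", content: "Summarize this contract." }],
});
console.log(completion.choices[0].message.content);
```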
Tasks combining complex multi-step logic with rich UI updates and multi-user roles tripped up the majority of tools. For detailed breakdowns of error categories, task-by-task results, and methodology, read the full analysis.