Overview
Using AI coding agents to generate web applications is one of the most common use cases for code generation models. App-Bench is a benchmark designed to evaluate how well AI-driven tools can automatically build modern web applications. The benchmark consists of 6 full-stack app-building tasks inspired by workflows in economically important domains—healthcare, real estate, finance, legal services, and education.
Each task exercises core features found in real-world software projects:
- Integrated AI assistants with RAG and web search
- Real-time synchronization across multiple clients
- Database persistence and CRUD operations
- Third-party integrations like Stripe payments and Resend emails
- User authentication and role-based access control
All tasks share a consistent structure and evaluation criteria, enabling fair comparison across tools. The benchmark is designed to test real-world complexity rather than isolated coding exercises.
Methodology
Tools Evaluated
App-Bench tests two broad categories of AI development tools:
Generative app builder platforms (aka "web builders"): No-code or low-code web platforms where the user provides a description and the platform assembles the app from high-level components.
CLI/IDE tools (aka "coding assistants"): AI systems accessed via command line or integrated development environment that generate code given a prompt or instructions.
In total, 10 tools were evaluated—five web builders and five coding assistants. Each tool was given identical task descriptions with only minor adjustments for platform-specific conventions (details below). No task-specific training or prior examples were provided.
All tools were tested using default settings on their respective Pro plans. For CLI/IDE tools, the underlying models used were: Claude Code (Opus 4.5), Codex (gpt-5.1-codex-max), Cursor (composer 1), Gemini CLI (Gemini 2.5 Pro), and Google AI Studio (Gemini 3 Pro Preview).
Prompts
Each task prompt consisted of a short description of the app and a statement that API keys would be provided if needed. Prompts also included an instruction to make all technical decisions autonomously and a numbered list of functional requirements—each requirement corresponding to a point on the rubric.
An excerpt from the Financial Dashboard prompt:
"Create a fully functional Bloomberg terminal style dashboard... If API keys are necessary, please request them from me and I will provide. You are not allowed to ask the user any follow up questions. Select all technical, architectural, and service-level details yourself..."
Feature requirements were explicit and objectively testable. Here are 5 of the 23 requirements from the Financial Dashboard task:
"Application provides full user authentication, including registration, login, and logout flows."
"Application displays real-time stock prices with continuous updates and no manual refresh required."
"Application allows users to select a date range directly on the stock chart."
"Application automatically sends the chart-selected date range into the AI chat as contextual input."
"Application updates public chat messages in real time for all connected users with no manual page reload."
A full example prompt for the Financial Dashboard can be found here.
Standardized Project Setup
For coding assistants, a pre-initialized Next.js project template was provided with basic file structure. Supabase API URL and API keys were provided. Tools were run in isolated Docker containers with Node.js to ensure code could be installed and executed in a clean environment.
For web builders, prompt instructions were identical to those given to coding assistants, but no starter template was provided. Each builder deployed apps on its own cloud infrastructure; the Supabase API URL and keys were supplied if requested.
For both tool types, API keys, authentication credentials, and external service credentials (Stripe, Resend, etc.) were provided upon request—this was the only human intervention during generation.
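For illustration, the credential hand-off can be pictured as in the sketch below: the provided Supabase URL and key are exposed as environment variables and the generated code builds a client from them. The variable names follow the common Next.js + Supabase convention and are an assumption, not part of the benchmark specification.

```typescript
// Illustrative only: how the provided Supabase credentials are typically consumed.
// The env var names are the usual Next.js + Supabase convention (assumed here).
import { createClient } from "@supabase/supabase-js";

export const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,     // Supabase API URL provided per task
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY! // Supabase anon key provided per task
);
```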
Generation Procedure
Each tool was given three attempts per task, with the best-performing run used for final scoring. Each run was a one-shot generation: the tool received the task description and produced an app with no interactive debugging, step-by-step assistance, or follow-up prompts to fix errors.
After generation, we provided environment variables and set up any required external services, then deployed. No task-specific hints or coding were given.
All tools were tested over a short timeframe under equivalent conditions, using default settings with Pro plans as of December 2025.
Scoring Rubric
Each task comes with a detailed rubric covering functional features. Every app had 20–40 feature requirements, each corresponding to a single point on the rubric. Examples include:
"Application filters all news items to include only finance- or stock-related articles."
"Application displays a persistent, scrollable conversation history for the AI assistant chat."
"Application allows authenticated users to post public messages that include sender identification and a timestamp."
"Application persists all user accounts and information with database-backed durable storage."
A full example rubric for the Financial Dashboard can be found here.
Each rubric item was scored as 1 (pass) or 0 (fail) based on observed behavior—no partial credit was awarded. Points were summed across all 6 apps to produce a total score. Under the best-of-3 generation scheme, if a tool scored 25/40 on attempt 1, 0/40 on attempt 2 (failed deployment), and 30/40 on attempt 3, the 30 points from the best run contribute to its total. The final percentage score is total points earned divided by total possible points across all tasks.
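The aggregation arithmetic can be expressed as a short sketch; the types and function names below are illustrative, not the benchmark's actual scoring tooling.

```typescript
// Best-of-3 per task, binary rubric points summed, final score as a percentage.
interface TaskResult {
  attempts: number[]; // points earned on each of the three attempts
  maxPoints: number;  // number of rubric items for the task (20-40)
}

function bestAttempt(attempts: number[]): number {
  return Math.max(...attempts); // only the best run counts
}

function finalPercentage(tasks: TaskResult[]): number {
  const earned = tasks.reduce((sum, t) => sum + bestAttempt(t.attempts), 0);
  const possible = tasks.reduce((sum, t) => sum + t.maxPoints, 0);
  return (100 * earned) / possible;
}

// Example from above: attempts of 25, 0 (failed deployment), and 30 out of 40
// contribute 30 points to the tool's total.
bestAttempt([25, 0, 30]); // 30
```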
Every generated app was manually tested and inspected against the rubric. The evaluation focused on functionality rather than UI aesthetics, keeping the assessment objective. UI considerations only factored into scoring when they directly prevented a feature from functioning correctly.
Grading Process
Two experienced full-stack developers independently graded each run against the rubric. Graders tested each deployed application by exercising all features specified in the rubric, verifying database persistence, and testing multi-user flows where applicable.
Any criteria with contradicting grades between the two graders were flagged and re-evaluated jointly to reach consensus. This dual-grading approach, combined with the binary pass/fail scoring per rubric item, helped minimize subjective bias and ensured consistent evaluation across all tools and tasks.
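The reconciliation step can be pictured with a small sketch (illustrative only, not the graders' actual tooling): compare the two graders' binary scores item by item and flag mismatches for joint review.

```typescript
// Illustrative: flag rubric items where the two graders disagree.
type Grade = 0 | 1;

function flagDisagreements(graderA: Grade[], graderB: Grade[]): number[] {
  return graderA
    .map((score, i) => (score !== graderB[i] ? i : -1))
    .filter((i) => i !== -1); // indices needing joint re-evaluation
}

// Items 1 and 3 would be re-graded jointly to reach consensus.
flagDisagreements([1, 0, 1, 1], [1, 1, 1, 0]); // [1, 3]
```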
Results
Overall Performance
The best-performing builder successfully implemented roughly 77% of required features. No tool managed to perfectly solve all tasks—every approach left significant gaps.
Mid-tier tools scored in the 50–70% range. Several trailed below 40%. One coding assistant produced no usable output across all tasks, ending with 0 points.
Score Interpretation
Scores reflect the degree of feature completion. A score of 77% means roughly three-quarters of the required functionality was delivered; a score around 50% means only about half of the features were correctly implemented.
Each tool demonstrated distinct strengths and weaknesses. Web builders excelled at user authentication, basic CRUD functionality, and standard UI scaffolding. They struggled with complex integrations like RAG retrieval, real-time synchronization, third-party services, and multi-user roles. Coding assistants showed high variance—some produced near-complete apps while others failed to deploy entirely.
Task Difficulty
Some tasks proved near-impossible for multiple tools. The hospital management dashboard—involving multi-role accounts, real-time bed status updates, presence indicators, emergency alert banners, and a shared forum chat—saw several tools fail all three attempts, while others produced no functionality beyond basic user authentication. The average score was 27.6%.
The pharmacy system was similarly challenging, requiring patient-pharmacist role separation, inventory management, order status tracking, private real-time messaging between pharmacists, and automated email notifications. Most tools failed to implement the core multi-role workflow.
The simplest task was the financial dashboard, with an average score of 65.4%. Tasks combining complex multi-step logic with rich UI updates, multi-user roles, or real-time interactions tripped up the majority of tools.
Web Builders vs. Coding Assistants
Noticeable performance patterns emerged between web builders and coding assistants.
Web builders achieved slightly higher average scores. They reliably handled boilerplate features—user authentication, database hookups, standard UI scaffolding—and produced consistent results. Their advantage lies in built-in templates and components. However, when required features weren't natively supported (e.g., custom AI chatbot, uncommon integrations), these tools either provided superficial placeholders or failed entirely.
Coding assistants showed more variance in performance, both across tools and within individual tools on different tasks. Web builders scored more consistently—reliably building certain features while consistently failing on others—and had fewer failed deployments overall. Coding assistants had the flexibility to write arbitrary code, and occasionally produced non-trivial algorithms or integration code, but this freedom came at a cost in reliability.
Error Analysis
Analysis of failures across 180 generation attempts revealed several common failure modes:
Missing or Misimplemented Features
The most prevalent issue was leaving out required functionality or implementing it incorrectly. Many outputs had placeholder elements or stubs where key features should be. Several apps lacked search and filter capability despite explicit requirements, or had non-working search bars. In the rental booking app, multiple tools failed to include calendar date-pickers, either omitting date selection entirely or using plain text fields.
One builder consistently failed to implement database-backed storage of user accounts despite it being listed in the requirements.
Multi-Role Flow Breakdowns
Applications requiring distinct user roles exposed significant logic gaps. In the hospital dashboard scenario, several apps offered no role selection during signup and no role-based UI once signed in. For tools whose signup pages did not allow choosing a role, all accounts were functionally identical, so no cross-role features or admin functionality could be tested.
Other tools included role options at signup but failed to route users to the correct homepage after login. One app had two different home pages for different roles but switched between them endlessly, about every half second, regardless of which role was used to sign in, rendering the app unusable. Some apps created separate front-end pages but did not properly set user roles in the database, so every login defaulted to a null/guest role.
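For context, the missing logic is not elaborate. A minimal sketch of role-based routing after login might look like the following, assuming a hypothetical `profiles` table with a `role` column written at signup; the table, column, and route names are placeholders, not taken from any benchmark app.

```typescript
// Hypothetical sketch: route a signed-in user to a role-specific home page.
// Table, column, and route names are assumptions for illustration.
import { redirect } from "next/navigation";
import type { SupabaseClient } from "@supabase/supabase-js";

const HOME_BY_ROLE: Record<string, string> = {
  admin: "/admin",
  doctor: "/doctor",
  patient: "/patient",
};

export async function routeAfterLogin(supabase: SupabaseClient, userId: string) {
  const { data, error } = await supabase
    .from("profiles")   // assumed table holding the role set at signup
    .select("role")
    .eq("id", userId)
    .single();

  // A missing row or null role is the failure mode described above: every
  // account behaves identically and cross-role features cannot work.
  const role = !error && data?.role ? data.role : null;
  redirect(role ? HOME_BY_ROLE[role] ?? "/" : "/onboarding/select-role");
}
```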
Runtime Errors and API Misuse
Another frequent failure mode was code throwing exceptions at runtime. A common theme was outdated usage of Next.js and Supabase libraries: multiple coding assistants used the Supabase Next.js helpers incorrectly, producing TypeErrors from calls to nonexistent functions.
Cookie handling in newer Next.js versions also tripped up several models, with logs showing TypeErrors related to cookieStore methods. These integration errors prevented many generated apps from running without manual fixes.
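As a point of reference, the pattern these apps needed looks roughly like the sketch below, assuming the current `@supabase/ssr` package and Next.js 15, where `cookies()` is async. Treat it as an illustration of where the reported TypeErrors arise, not a drop-in fix for any particular app.

```typescript
// Sketch of a server-side Supabase client using @supabase/ssr with Next.js 15.
// cookies() must be awaited; calling cookieStore methods on the unawaited
// promise is one plausible source of the TypeErrors seen in the logs.
import { cookies } from "next/headers";
import { createServerClient } from "@supabase/ssr";

export async function createSupabaseServerClient() {
  const cookieStore = await cookies();

  return createServerClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    {
      cookies: {
        getAll() {
          return cookieStore.getAll();
        },
        setAll(cookiesToSet) {
          try {
            cookiesToSet.forEach(({ name, value, options }) =>
              cookieStore.set(name, value, options)
            );
          } catch {
            // set() throws when called from a Server Component, where cookies
            // are read-only; safe to ignore if sessions refresh in middleware.
          }
        },
      },
    }
  );
}
```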
UI/UX Glitches
Even running apps often had interface bugs or design failures. Some were minor—having to zoom out to view a stock price chart, or canvas drawing strokes appearing offset from the cursor.
Others significantly affected usability: one interface was almost completely covered by another element and thus unusable; dropdown menus for profile management did not appear; buttons, search inputs, and chat input areas were mistakenly disabled. Almost none of the apps had mobile-friendly layouts despite this being a common implicit expectation.
Conclusion
App-Bench reveals that current AI-powered development tools still have significant room to grow. While the best performers delivered roughly three-quarters of required functionality, no tool achieved complete coverage across all tasks. Complex features—multi-role workflows, real-time synchronization, and third-party integrations—remain challenging for even the top-ranked builders.
To learn more, visit AfterQuery.