Chris Sheehan

AI projects stall as testing lags behind deployment

Thu, 16th Apr 2026

Applause has published research showing that more than half of organisations have released AI-powered applications and features, but many AI projects still fail to reach full production.

The findings are based on surveys of more than 1,000 developers and quality assurance professionals, alongside more than 4,000 consumers. They point to a widening gap between the pace of AI deployment and testing teams' ability to assess the results.

Among organisations surveyed, 55% said they had shipped AI-powered applications or features. Yet 52% said fewer than half of their AI projects progress from proof of concept to full production, suggesting many initiatives stall before wider rollout.

Consumer responses indicate that quality problems remain common. Some 40% of users said they had experienced hallucinations, up from 32% a year earlier, while 46% said AI had misunderstood their prompts and 41% said responses lacked enough detail.

The pattern comes as businesses push AI into more customer-facing roles. Chatbots and customer service tools were among the most common uses, though scaling those systems remains difficult.

Multimodal AI is adding to the pressure on testing teams. The report found that 84% of generative AI users see the ability to process and generate text, images, audio and video as critical, increasing the range of outputs that must be checked.

Hybrid Testing

Many organisations are combining automated methods with human review rather than relying on a single approach. Human input remains the most common way to validate AI performance, with 61% of organisations using people in the evaluation process.

At the same time, 33% said they use LLM-as-judge methods, in which multiple models assess AI outputs in parallel. The figures suggest businesses are experimenting with newer forms of automated review while still relying heavily on human assessment.
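To make the approach concrete, the following is a minimal sketch of an LLM-as-judge workflow of the kind the report describes, in which several judge models score the same output and disagreement or low scores trigger human review. The model names, rubric and the query_judge helper are illustrative placeholders, not details from the research.

# Minimal LLM-as-judge sketch: several judge models score one output in
# parallel, and low or disagreeing scores are escalated to a human reviewer.
# query_judge is a placeholder for whatever model API a team actually uses.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev

JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # hypothetical model names

RUBRIC = (
    "Rate the response from 1 (poor) to 5 (excellent) for relevance, "
    "accuracy and clarity. Reply with a single number."
)

def query_judge(model: str, prompt: str, response: str) -> float:
    """Placeholder: send the rubric, prompt and response to one judge model
    and parse its numeric score. Stubbed with a fixed value here."""
    return 4.0  # replace with a real model call

def evaluate(prompt: str, response: str) -> dict:
    # Ask every judge in parallel, then aggregate the scores.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(
            lambda m: query_judge(m, prompt, response), JUDGE_MODELS))
    return {
        "scores": scores,
        "mean": mean(scores),
        # Flag for human review when judges disagree or score the output low.
        "needs_human_review": mean(scores) < 3.5 or pstdev(scores) > 1.0,
    }

if __name__ == "__main__":
    print(evaluate("Summarise our Q3 results", "Revenue rose 8% in Q3..."))

In practice the escalation thresholds and rubric would be tuned per application, which is where the human oversight the report emphasises comes back in.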

The same mix appears in broader testing practices. The research found that 54% of organisations use human-generated data for fine-tuning, while 29% use synthetic data. It also found that 39% conduct human-led red teaming, 23% use automated red teaming, 30% deploy AI-first testing agents and 31% use human-in-the-loop monitoring.

Despite the spread of these methods, respondents indicated that technical tests alone do not determine whether a system is ready for release. Nearly half, or 46%, said human sentiment and usability are the main factors in deciding whether an AI feature is ready for production.

Testing Strain

The findings highlight a broader challenge for software quality teams as AI systems behave less predictably than traditional applications. Conventional quality assurance techniques were designed for deterministic software, where the same input reliably produces the same output.

By contrast, AI systems can generate varied responses, making it harder to define pass/fail conditions at scale. That creates extra work for organisations trying to assess not just technical correctness, but also relevance, clarity and user trust.
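The contrast can be illustrated with a small example (the functions and criteria below are assumptions for illustration, not drawn from the report): deterministic code can be checked with an exact assertion, while a generated response has to be judged against looser properties.

# A deterministic function: the same input always gives the same output,
# so an exact-equality test is enough.
def vat_amount(net: float, rate: float = 0.23) -> float:
    return round(net * rate, 2)

def test_vat_amount():
    assert vat_amount(100.0) == 23.00

# A generated summary varies between runs, so checks assert properties
# (required facts present, reasonable length) rather than one exact string.
def check_ai_summary(summary: str, required_facts: list[str]) -> dict:
    return {
        "covers_facts": all(f.lower() in summary.lower() for f in required_facts),
        "within_length": len(summary.split()) <= 120,
    }

if __name__ == "__main__":
    test_vat_amount()
    print(check_ai_summary(
        "Revenue rose 8% in Q3 while operating costs fell.",
        ["revenue", "Q3"],
    ))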

Chris Munroe, Vice President of AI Programs at Applause, said testing now extends beyond basic accuracy checks. "Testing AI isn't just about accuracy - it's about evaluating complex, multimodal outputs at scale," he said.

Munroe said newer model-based evaluation methods still require human supervision. "LLM-as-judge systems are becoming an important part of that process, but they can't operate in isolation. Without human oversight, you risk reinforcing the same blind spots you're trying to detect. In addition to human-led evals and fine-tuning, structured red teaming by both domain experts and generalists is essential," he said.

The report suggests businesses are under pressure to release AI tools even as testing processes continue to evolve. That tension appears in both user sentiment and internal project outcomes, with organisations reporting strong productivity gains alongside persistent quality concerns.

According to the study, 40% of users said AI tools boost productivity by more than 75%, even as many also reported recurring errors and weak responses. The data points to a trade-off for companies trying to deploy AI quickly without exposing customers to unreliable outputs.

Chris Sheehan, Executive Vice President of High Tech and AI at Applause, said organisations are moving faster than their testing methods.

"AI development isn't slowing down, but quality is falling behind," he said. "Teams are pushing AI into production before they've figured out how to properly test it. That's why we're seeing more failures and more risk reaching users. AI adds speed and scale, but human evaluation is what earns trust - you need both. The companies getting it right combine AI with domain expertise to evaluate and fine-tune their systems, ensuring outputs are more relevant, accurate and inclusive."