Why Do AI Evaluation Startups Keep Failing?

AI Summary

AI evaluation startups fail for three reasons: large research labs are reluctant to hand control of core evaluations to third parties, outsourced evaluation is slow, and clients can build their own evaluation systems.

Imagine this: you have spent hundreds of millions of won hiring the best chef for your new restaurant. But what if, every time you wanted to know how well the chef was cooking, you had to call an outside agency and ask, “Is our chef doing well right now?” By the time the answer arrived, the customers would already have left, and you would have missed the chance to improve the recipes.

The AI industry is grappling with a similar dilemma. As more companies develop AI models, ‘AI evaluation startups’ that measure how capable these models are have sprung up alongside them. Surprisingly, however, many of these companies fail to gain a foothold and eventually disappear. Why is this? Is it simply bad luck, or is there a structural problem with the business of AI evaluation itself?

Why does this matter?

As AI technology advances, the ‘accuracy’ of an AI’s answers is now directly tied to a company’s survival. If an AI provides false information or biased answers, it can seriously damage a company’s image. In this context, AI evaluation services looked like a godsend. However, the fact that evaluation startups keep failing suggests that the ‘AI quality control’ we hoped for cannot be solved simply by adopting a single service. It also means that ordinary companies that want to use AI services must build technical capabilities of their own.

Understanding it simply

Simply put, the difficulty faced by AI evaluation startups can be compared to the problem of ‘compass sovereignty.’

For research labs building AI models (such as big tech companies), ‘evaluation’ is not just a matter of scoring. It serves as a ‘compass’ that determines which direction the lab’s AI should take. According to Thomas I. Liao’s “Why are there so few independent eval startups?”, giant research labs do not want to hand control over their core research direction to external companies.

‘Speed’ is also a major issue. AI model development moves at an incredibly fast pace, but when evaluation is outsourced, developers must wait for the results to come back. This ‘latency’ is hard to accept for developers who consider development speed vital. As “Why are there so few independent eval startups?” points out, it is a critical obstacle that slows down model development.

Finally, there is the ‘expertise’ gap. Nathan Lambert, an expert in the field of artificial intelligence, advises on X (formerly Twitter) that an outstanding evaluation professional creates more value by working on ‘post-training’ (a training stage that optimizes specific performance after the initial model is developed) and directly improving an AI’s capabilities than by scoring models at an evaluation company.

Current situation

The current AI evaluation market is very unstable. According to John Hwang’s analysis, many evaluation startups focus on making the UI (user interface) look pretty rather than on solving the ‘upstream’ (foundational) problems that require real technical depth, such as assembling representative test datasets or designing complex evaluation logic. Because they nevertheless try to charge high prices, customers are turning them away.

Moreover, client companies that develop or operate AI themselves learn quickly and build their own evaluation systems. As Nathan Lambert points out, customers soon graduate to internal evaluation systems, which makes it very hard for startups to earn steady revenue.

The statistics on startups in general make the picture even bleaker. Studies show that the 10-year survival rate of startups is less than 10%, and that startups that fail without even recovering the invested capital account for three-quarters of the total. For UK startups in particular, the probability of failing within 3 years is reported to reach 50-60% (Statistics on Startup Failure Rates (2025)).

What will happen in the future?

Experts advise that, to survive, evaluation startups must break out of the narrow frame of ‘evaluation services.’ As suggested in a discussion on Hacker News, instead of simply saying “entrust your evaluation to us,” they should provide an ‘AI verification toolchain’ (a set of tools for verifying AI) that helps developers build their own evaluation systems.

MindTickleBytes’ AI Reporter Perspective

Ultimately, AI evaluation is moving beyond a simple service market and into the realm of ‘technical internalization’ (handling the technology in-house without external help). For companies working with AI, the ability to write precise test questions and grade them against their own goals, rather than relying on external evaluation firms, will become a core competitive edge.
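
To make ‘writing test questions and grading them in-house’ concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a method taken from any of the cited sources: the names EvalCase, exact_match, run_eval and toy_model are hypothetical, the grader is a plain exact-match check, and the ‘model’ is a stand-in function where a real team would call its own system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One in-house test question with the answer the team expects."""
    prompt: str
    expected: str


def exact_match(answer: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive exact match (an assumption)."""
    return answer.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Ask the model every question and return the fraction graded correct."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(grader(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; one answer is wrong on purpose."""
    answers = {"Capital of France?": "Paris", "2 + 2 = ?": "5"}
    return answers.get(prompt, "")


CASES = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2 = ?", "4"),
]

score = run_eval(toy_model, CASES)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 50%"
```

Because the whole loop runs locally, a team can rerun it after every model change without waiting on an outside party. In practice, the hard part is the ‘upstream’ work the article describes: choosing representative cases and a grader that reflects the company’s own goals.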

References

Why are there so few independent eval startups? - Thomas I. Liao (https://thomasliao.com/eval-startups)
Nathan Lambert on X: “Most of these eval companies should be non profits or non VC path companies.” (https://x.com/natolambert/status/1925327027600859426)
Evals Startups Are Not Enterprise Ready - John Hwang (https://nextword.substack.com/p/evals-startups-want-enterprise-money)
Why Startups Fail (2026): Lessons From 200 Founders - Wilbur Labs (https://www.wilburlabs.com/blueprints/why-startups-fail)
Why eval startups fail (2025) - Hacker News (https://news.ycombinator.com/item?id=48637868)
Statistics on Startup Failure Rates (2025) - LinkedIn (https://www.linkedin.com/pulse/statistics-startup-failure-rates-2025-altaf-rahman–orn1c)

Test Your Understanding

Q1. What is the main reason why services provided by AI evaluation startups slow down model development?

Complexity of UI design
Latency caused by introducing external evaluations
Data security regulations

Introducing external evaluations adds unnecessary waiting time to the development loop, causing critical latency in model development environments where high speed is essential.

Q2. What is the fundamental difficulty faced by AI evaluation startups mentioned in the article?

Lack of UI/UX design
The difficulty of 'upstream' tasks like securing quality data and defining evaluation logic
Lack of promotion and marketing

Evaluation startups often polish the UI instead of solving the hard problems of securing accurate test data and designing meaningful evaluation logic.

Q3. Why do large AI research labs rarely outsource their evaluation tasks?

Lack of funds
They want to set and control their own research direction
Due to security laws

Evaluation is a core task that determines the direction of research, so giant research labs do not want to hand this authority to outside companies.