How Does AI Guess Our 'Inner Thoughts'? A New Testbed for Self-Learning AI

AI Summary

Inverse Rubric Optimization (IRO) is a new testing environment that measures the intelligence of autonomous AI agents by evaluating their ability to figure out the hidden preferences of a strict judge within a limited number of attempts.

Imagine you are the newly appointed executive chef at a three-Michelin-star restaurant. A legendary food critic visits periodically, someone who is very picky and never reveals their true feelings. This critic never tells you directly what flavors they like, how much salt should be added, or what spices they prefer.

The only method available to you is to prepare a dish yourself and serve it to them. However, there is a catch. Due to the restaurant’s financial situation, you may ask the critic for an evaluation exactly five times. During these five chances, you have to slightly tweak the menu and gauge their reaction by asking, “Is this too salty?” or “Do you like this?”. Then, on the sixth and final attempt, you must serve a banquet that matches the critic’s palate perfectly in order to keep the restaurant’s stars.

Reverse-engineering a perfect recipe you have never seen, using only five pieces of feedback: this is precisely the core of the artificial intelligence technology we will explore today, and it is how machines may learn genuine “tact”.

Why It Matters

Recently, in the field of artificial intelligence, we are moving beyond simple chatbots and entering the era of ‘Agents’ that assess situations and act autonomously. If AI in the past was a ‘smart encyclopedia’ that only answered when asked, agents are different. Simply put, if you say, “I’m going on a business trip to Paris tomorrow, please plan my itinerary and book my flights,” it acts as an ‘active assistant’ that searches websites, compares budgets, makes the optimal choices, and proceeds with the payment on its own.

In fact, at NeurIPS 2023 (the Conference on Neural Information Processing Systems), a world-renowned AI conference, Large Language Model-based Autonomous Agents were treated as a key topic and drew significant attention [[NeurIPS 2023] Large Language Model-based Autonomous Agents - LG AI Research BLOG](https://www.lgresearch.ai/blog/view?seq=393).

Now, AI agents are stepping beyond the role of a simple daily assistant for humans and entering the realm of highly advanced scientific research. According to recent research, the latest LLM-based scientific agents have begun to automate extremely complex processes of scientific discovery, such as generating hypotheses, designing experiments, analyzing vast amounts of data, and running simulations on their own [[2503.24047] Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents](https://arxiv.org/abs/2503.24047). Furthermore, a massive experimental environment has been built that gathers thousands of virtual AI agents to simulate the behaviors of human society [AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society](https://arxiv.org/html/2502.08691v1).

However, a critical problem arises here: “How can we evaluate whether this AI agent is actually doing its job well and how smart it is?”

In the past, it was enough to have AI solve math problems or multiple-choice questions and grade the answers: 1 plus 1 equals 2, a clear correct answer. But evaluating an autonomously operating agent is an entirely different matter. It’s like evaluating the work performance of a new employee, where there is often no single fixed correct answer [[2503.16416] Survey on Evaluation of LLM-based Agents](https://arxiv.org/abs/2503.16416). What is needed is a sophisticated testbed that measures how quickly and accurately an AI grasps the user’s ‘true intention’ amid vague human tastes and a complex, ever-changing real world.

The Explainer

To solve these evaluation challenges, there is an ingenious new test environment devised by AI researchers. It is called ‘Inverse Rubric Optimization (IRO)’ [Inverse Rubric Optimization: A testbed for agent science](https://fulcrum.inc/2026/06/09/inverse-rubric-optimization.html). The name might seem somewhat academic and complex, but it’s easy to understand if you recall the situation of the ‘chef and the picky food critic’ mentioned earlier.

Metaphorically, this technology can be described as a virtual obstacle course to train and evaluate AI. Let’s break down this technology into three key concepts.

1. Black-box Judge

In computer science, a ‘Black-box’ refers to a dark box whose internal structure cannot be seen at all. You put something in and a result comes out, but you have no idea what criteria and calculations inside produced that result. In the IRO test environment, the AI agent has no knowledge of the final goal or rules (the rubric) it needs to reach. This finicky entity hiding the correct answers from the agent is called the ‘Black-box Judge’ [Inverse Rubric Optimization: A testbed for agent science](https://vuink.com/post/shypehz-d-dvap/2026/06/09/inverse-rubric-optimization-d-dhtml). It is exactly like the critic who never reveals the recipe to the chef and only gives short answers like “Hmm, the aroma isn’t great” or “The texture is slightly better.”
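The idea of a judge that returns a verdict without exposing its criteria can be sketched in a few lines. This is only a toy illustration, not code from the IRO repository: the class name `BlackBoxJudge`, the attribute names, and the weights are all hypothetical, and a real judge would likely give far less tidy feedback than a single number.

```python
from dataclasses import dataclass, field


@dataclass
class BlackBoxJudge:
    """Toy judge: scores a dish against a rubric the caller is not meant to read."""

    # Hypothetical hidden rubric: one weight per dish attribute.
    # repr=False keeps the weights out of the printed representation.
    _hidden_weights: dict = field(
        default_factory=lambda: {"salt": -2.0, "umami": 3.0, "spice": 0.5},
        repr=False,
    )

    def score(self, dish: dict) -> float:
        """Return only a number; the weights that produced it stay inside."""
        return sum(
            weight * dish.get(name, 0.0)
            for name, weight in self._hidden_weights.items()
        )


judge = BlackBoxJudge()
bland = {"salt": 0.0, "umami": 0.0, "spice": 0.0}
savory = {"salt": 1.0, "umami": 2.0, "spice": 0.0}

# The chef sees two scores, but not why one dish beat the other.
bland_score = judge.score(bland)
savory_score = judge.score(savory)
print(bland_score, savory_score)
```

From the outside, the only thing the chef learns is that the second dish scored higher; working out which attribute caused that is the agent's job.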

2. Label Budget

If the agent could ask an infinite number of questions and repeat its failures, it would eventually figure out someone’s taste. However, in reality, we do not make our assistants do the same task hundreds or thousands of times while we wait. There are clear constraints of money and time. To mimic this, IRO places a strict constraint on the agent called the ‘Label Budget’ [Inverse Rubric Optimization: A Testbed for Agent Science](https://memedata.com/post/125636). Simply put, the agent has a set amount of “coins” it can use to ask the judge whether its action was right or wrong (the truth label). It is just like the chef having only 5 chances to serve the dish. The agent’s true skill lies in how efficiently it utilizes this limited budget.
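One way to picture the label budget is as a wrapper that counts every question and refuses to answer once the coins run out. The sketch below is an assumption-laden toy: `BudgetedJudge`, `BudgetExhausted`, and the hidden preference of 7 grams of salt are invented for illustration and are not taken from the IRO codebase.

```python
class BudgetExhausted(Exception):
    """Raised when the agent asks for a label it can no longer afford."""


class BudgetedJudge:
    """Wraps a scoring function and charges one 'coin' per query."""

    def __init__(self, score_fn, budget):
        self._score_fn = score_fn
        self.remaining = budget

    def query(self, candidate):
        if self.remaining <= 0:
            raise BudgetExhausted("no labels left")
        self.remaining -= 1
        return self._score_fn(candidate)


# Hypothetical hidden preference: the critic's ideal is 7 grams of salt,
# and the score is the negative distance from that ideal.
judge = BudgetedJudge(lambda salt: -abs(salt - 7.0), budget=5)

# Five tastings are allowed...
scores = [judge.query(salt) for salt in (0.0, 5.0, 10.0, 7.0, 8.0)]

# ...but a sixth request for feedback is refused.
try:
    judge.query(6.0)
    sixth_allowed = True
except BudgetExhausted:
    sixth_allowed = False

print(scores, judge.remaining, sixth_allowed)
```

Because every query is charged, an agent that wastes labels on uninformative dishes has fewer left for the ones that matter.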

3. Inverse Optimization

Ordinary ‘forward’ optimization means giving a clear instruction (the rubric), such as “Add 10g of salt and cook the meat medium-rare,” and checking how well it is followed. ‘Inverse’ optimization runs the other way: the agent sees the outcome (the critic’s feedback) first and works backward to deduce the cause (the hidden recipe and taste).
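The two directions can be contrasted in a small sketch. It rests on strong simplifying assumptions that are not claims about IRO itself: the hidden rubric is a noise-free linear weighting of four made-up ingredients, so tasting one ingredient at a time reveals each weight exactly. All names here (`HIDDEN_RUBRIC`, `judge`, `forward_score`, `infer_rubric`) are hypothetical.

```python
# Hypothetical hidden rubric; the agent never reads this dictionary directly.
HIDDEN_RUBRIC = {"salt": -1.0, "butter": 2.0, "chili": -0.5, "lemon": 1.5}
INGREDIENTS = ["salt", "butter", "chili", "lemon"]


def judge(dish):
    """Black box: returns a score and nothing else."""
    return sum(HIDDEN_RUBRIC[name] * amount for name, amount in dish.items())


def forward_score(rubric, dish):
    """Forward direction: the rubric is known, so we simply apply it."""
    return sum(rubric[name] * amount for name, amount in dish.items())


def infer_rubric(judge_fn, ingredients):
    """Inverse direction: probe one ingredient at a time and read off its weight.

    This only works because the toy judge is linear and noise-free.
    """
    inferred = {}
    for name in ingredients:
        probe = {other: (1.0 if other == name else 0.0) for other in ingredients}
        inferred[name] = judge_fn(probe)  # costs one label
    return inferred


# Four labels spent on probing, leaving one spare out of a budget of five.
inferred = infer_rubric(judge, INGREDIENTS)

# Final submission: include only the ingredients the critic seems to like.
final_dish = {name: (1.0 if weight > 0 else 0.0) for name, weight in inferred.items()}
final_score = judge(final_dish)
print(inferred, final_dish, final_score)
```

With a noisy or non-linear judge, the same four probes would give only a rough picture, which is where the choice of what to ask becomes a real strategic problem.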

Let’s borrow a comparison from vehicle engineering. IRO is like a ‘Wind Tunnel’ that pushes air resistance testing to the limit when developing a new plane or car, or an ‘Ice Obstacle Course’ that verifies the safety of autonomous vehicles. Even if a car’s engine produces 1,000 horsepower, it is useless if the car cannot stop in time on an icy road. Likewise, no matter how vast a language model’s knowledge is, it cannot be a great assistant (agent) if it fails to grasp the hidden intentions of humans within limited opportunities. IRO is a dedicated training ground that tests this exact ‘situational awareness’.

Where We Stand

This fascinating and challenging concept was systematized and proposed to the academic community by four researchers: zef, leni, kaivu, and rohuang [Inverse Rubric Optimization: A testbed for agent science ...](https://www.lesswrong.com/posts/uSighG5zWbmtBembc/inverse-rubric-optimization-a-testbed-for-agent-science). They observed that the IRO environment would serve as an excellent foundation to fundamentally advance Agent Science itself, moving beyond merely testing the current capabilities of agents.

There are two main reasons why the research team considers IRO the ultimate testbed (experimental environment).

First, IRO elicits ‘Rich behavior’ from AI agents [Inverse Rubric Optimization: A testbed for agent science](https://fulcrum.inc/2026/06/09/inverse-rubric-optimization.html). Unlike multiple-choice questions where you just guess A or B, reading a judge’s mind on a limited budget requires the AI to make highly strategic choices. Complex and creative problem-solving skills naturally emerge, such as thinking, “I should ask about the broadest scope with my first question, and narrow down the details with my second question.” In other words, the test rewards agents that plan their questions strategically, much as a person would.
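The "broad first, then narrow" strategy has a familiar textbook form: bisection. The sketch below assumes a deliberately simple critic who only answers whether a dish is too salty, with a hidden ideal of 13 grams. The function names and numbers are hypothetical and say nothing about how real agents behave in IRO.

```python
def too_salty(salt_grams, hidden_ideal=13.0):
    """Hypothetical critic: True if the dish is saltier than the hidden ideal."""
    return salt_grams > hidden_ideal


def narrow_down(ask, low, high, budget):
    """Spend each question on the midpoint, halving the plausible range."""
    for _ in range(budget):
        mid = (low + high) / 2
        if ask(mid):
            high = mid  # too salty: the ideal lies in the lower half
        else:
            low = mid   # not too salty: the ideal lies in the upper half
    return low, high


# Start with a broad range of 0 to 32 grams and only five questions.
low, high = narrow_down(too_salty, 0.0, 32.0, budget=5)

# The sixth, unlabelled attempt: serve the middle of what remains.
final_guess = (low + high) / 2
print(low, high, final_guess)
```

Each question halves the remaining range, so five questions shrink a 32-gram range to 1 gram. An agent that asked about 1 gram, then 2 grams, then 3 grams would learn far less from the same budget.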

Second, IRO demonstrates ‘Smooth scaling’ [Inverse Rubric Optimization: A testbed for agent science](https://vuink.com/post/shypehz-d-dvap/2026/06/09/inverse-rubric-optimization-d-dhtml). Take the games we play as an example. A game whose difficulty rises gradually, one step at a time from level 1 to 100, can be enjoyed by everyone from beginners to experts without giving up. On the other hand, games with wildly fluctuating difficulty levels do not receive good reviews. The IRO test environment works the same way. Its evaluation structure is stable: measured performance rises smoothly and consistently with an agent’s capability, from very basic AIs to the far more advanced systems expected in the future.

Surprisingly, the core computer code that forms the backbone of all these experiments is transparently published on the ‘fulcrumresearch/iro’ repository of the open-source platform GitHub, making it accessible and usable for anyone around the world [GitHub - fulcrumresearch/iro](https://github.com/fulcrumresearch/iro). Thanks to this minimal, lightweight, and clean codebase, countless AI scientists and corporate developers globally can now bring their own AI agents and freely test them in front of this rigorous and precise ‘Black-box Judge’.

What’s Next

The future direction of AI technology development is clear: maximizing the perfection of ‘autonomous agents’ that do work on their own with minimal human intervention. And the measure of that intelligence is now shifting entirely from “how much knowledge it has memorized” to “how accurately it can figure out the user’s hidden intention with just a few hints.”

Amidst this massive current, a sophisticated and dynamic evaluation environment like IRO (Inverse Rubric Optimization) could serve as a crucial milestone propelling agent science to the next level. In the near future, the AI assistants in our new smartphones or the business automation robots introduced to companies may well hone their “human tact” by passing through this kind of ‘IRO Wind Tunnel’ before they leave the factory.

The frustrating chatbots of the past, which required ten questions just to barely catch onto your thoughts, are fading into history. The day is fast approaching when we will meet true smart assistants who can read our minds in just one or two short conversations, saying, “Ah, you need rest more than work on this business trip. Shall I book a quiet hotel with an ocean view for you?”

AI’s Take

MindTickleBytes AI Reporter’s View: Reading a person’s mind and grasping their hidden intentions might be like solving the hardest math problem in the world for a machine. This is because human language is always mixed with omitted context and subtle emotions.

If the AI of the past was an ‘honor student’ who became smart by memorizing vast amounts of data, today’s AI must be reborn as a ‘sensible practitioner’ who finds the optimal answer even in the ambiguity of reality. Going beyond executing simple commands, IRO could be a rigorous training ground for the birth of a true AI assistant equipped with intuition and tact. Won’t this technology of reverse-engineering the human mind within limited opportunities ultimately be the key to making communication between machines and humans as natural as possible?

References

[Inverse Rubric Optimization: A testbed for agent science](https://fulcrum.inc/2026/06/09/inverse-rubric-optimization.html)
[Inverse Rubric Optimization: A testbed for agent science](https://vuink.com/post/shypehz-d-dvap/2026/06/09/inverse-rubric-optimization-d-dhtml)
[Inverse Rubric Optimization: A testbed for agent science ...](https://www.lesswrong.com/posts/uSighG5zWbmtBembc/inverse-rubric-optimization-a-testbed-for-agent-science)
[GitHub - fulcrumresearch/iro](https://github.com/fulcrumresearch/iro)
[[2503.16416] Survey on Evaluation of LLM-based Agents](https://arxiv.org/abs/2503.16416)
[AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society](https://arxiv.org/html/2502.08691v1)
[[NeurIPS 2023] Large Language Model-based Autonomous Agents - LG AI Research BLOG](https://www.lgresearch.ai/blog/view?seq=393)
[[2503.24047] Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents](https://arxiv.org/abs/2503.24047)
[Inverse Rubric Optimization: A Testbed for Agent Science](https://memedata.com/post/125636)

Test Your Understanding

Q1. What is the core purpose of 'Inverse Rubric Optimization (IRO)' explained in the text?

To help AI translate existing documents faster
To evaluate AI on figuring out the hidden preferences of a judge within a limited budget
To double the text generation speed of large language models

IRO (Inverse Rubric Optimization) is an evaluation environment (testbed) that makes AI agents utilize a limited number of question opportunities (label budget) to figure out the tastes and preferences of an unknown judge (black box).

Q2. Which of the following is the correct description of modern LLM-based Agents?

A simple program that only repeats predetermined answers like past chatbots.
A technology exclusively used for numerical calculations like weather forecasting.
A paradigm that interacts with complex and dynamic environments, such as generating hypotheses and designing experiments.

Modern LLM-based agents possess complex capabilities to formulate hypotheses, analyze data, and interact with dynamic environments on their own, moving beyond simple answers.

Q3. What is the biggest constraint AI agents must overcome in the IRO environment?

The physical weight limit of ingredients in a recipe
The 'label budget', which limits the number of times it can ask questions or receive evaluations
An offline environment not connected to the internet

Agents cannot probe the judge's mind indefinitely. They can only receive evaluations and get hints for the right answer within a limited number of attempts called the 'Label budget'.