A new tool called Agent-desktop has been released, enabling AI to directly control apps using the computer's 'Accessibility Tree' without screenshots or image analysis.
Introduction: AI Assistants Have Started to ‘Truly’ Understand My Computer
Imagine this. You ask your AI assistant, “Open last month’s household budget Excel file and compare it with this month’s credit card statement.” Until now, for an AI to do this, it had to capture the screen frame by frame and use ‘eyes’ (computer vision) to find where the Excel buttons were and what the numbers were within those images.
To use an analogy, it was like trying to find the exit of a foggy maze with a single tiny flashlight. Because the AI had to scan and analyze the screen every time, the work was slow and error-prone. Now a way has opened for AI to clear the fog and work by directly reading the computer's 'blueprints,' thanks to an innovative tool called Agent-desktop.
Why is this important?
The computer programs we use every day have structures completely different from websites. While websites are transparently published in code that is easy for AI to read, programs installed on my PC—like word processors, Excel, or Photoshop—are very difficult for AI to look inside.
For existing AI agents (AI programs that judge and act on their own) to control my PC, they had to analyze screen images, which brought three major headaches:
- Slow Speed: Analyzing high-definition screen capture images takes a significant amount of time. It’s like taking a photo of an entire book and trying to decipher each character one by one.
- Low Accuracy: If another window slightly covers a button, or if the icon shape changes just a bit because of a Windows theme change, the AI quickly loses its way and gets flustered.
- High Cost: To ‘see’ the screen with eyes, expensive ‘AI Vision Models’ must be kept running constantly, consuming massive computing power and cost.
Agent-desktop solves this problem in a completely different way. Instead of 'looking' at the screen from the outside, it directly reads the 'map of information' that the operating system already holds internally.
Easy Understanding: A Braille Map for the ‘Blind Assistant’ Becomes the AI’s Weapon
The core of this technology is a somewhat unfamiliar system called the Accessibility Tree.
Originally, the accessibility tree was created to help the visually impaired. For those who cannot see the screen, the operating system (OS) organizes what buttons and text are currently on the screen into an invisible, structural map. A Screen Reader reads this map and provides voice guidance to the user.
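The shape of this 'map' can be sketched as nested data. The sketch below is purely illustrative: the node names and fields are invented for this example, not the actual schema of any OS accessibility API or of Agent-desktop.

```python
# A toy accessibility tree: each node has a role, a name, and children.
# Real trees come from OS APIs (e.g. UI Automation on Windows,
# AXUIElement on macOS), but the nested structure is the same idea.
calculator = {
    "role": "window", "name": "Calculator", "children": [
        {"role": "text", "name": "display", "value": "0", "children": []},
        {"role": "button", "name": "7", "children": []},
        {"role": "button", "name": "+", "children": []},
        {"role": "button", "name": "=", "children": []},
    ],
}

def find(node, role, name):
    """Depth-first search for the first node matching role and name."""
    if node["role"] == role and node["name"] == name:
        return node
    for child in node.get("children", []):
        hit = find(child, role, name)
        if hit:
            return hit
    return None

# No screenshots needed: the '=' button is located by structure, not pixels.
equals = find(calculator, "button", "=")
print(equals["name"])  # -> =
```

This is why a covered or re-themed button does not confuse a tree-based agent: the node is found by its role and name, not by how it happens to be drawn.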
Agent-desktop has essentially handed this ‘Braille map’ to the AI.
- To put it in perspective: If the conventional method is wandering through a complex maze with open eyes to find the way, the Agent-desktop method is like having the entire blueprint of the maze in hand and teleporting straight to the destination.
By reading the 'blueprints' directly, the AI can grasp the structure of an app with complete accuracy, without ever taking a screenshot of what is on the screen.
Key Features of Agent-desktop: The Precise Hands of a Small but Powerful AI
This tool is beginning to be evaluated by developers as the ‘most efficient hands for an AI assistant.’ Its specific features are as follows:
1. Incredibly fast and lightweight
This program is built in Rust, a modern programming language known for being extremely fast and stable. The entire installation weighs only about 15 MB, roughly the size of two or three high-resolution smartphone photos. It installs easily and runs immediately, with no complex dependencies.
2. Communicates in a language AI understands (JSON)
When an AI asks, “What’s on the screen right now?”, Agent-desktop responds in a format called JSON, rather than in complex signals only a computer can understand. Simply put, it provides answers as structured data, much like a well-organized ‘receipt list’ or ‘table of contents.’ This allows the AI to judge situations and act far more reliably.
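To make this concrete, here is what working with such a reply might look like. The JSON schema below is a hypothetical sketch in the spirit of Agent-desktop's "structured JSON output with deterministic element refs"; the actual field names may differ.

```python
import json

# A hypothetical screen snapshot, as structured JSON.
# Field names ("ref", "role", "name", "value") are assumptions for
# illustration, not Agent-desktop's documented schema.
snapshot = json.loads("""
{
  "window": "Budget.xlsx",
  "elements": [
    {"ref": "e1", "role": "button",  "name": "Save"},
    {"ref": "e2", "role": "textbox", "name": "Cell A1", "value": "1200"},
    {"ref": "e3", "role": "button",  "name": "Close"}
  ]
}
""")

# Because the data is structured, "find the Save button" is a dictionary
# lookup, not an image-recognition problem.
by_name = {el["name"]: el for el in snapshot["elements"]}
save_ref = by_name["Save"]["ref"]
print(save_ref)  # -> e1
```

A stable ref like `e1` is what lets a later "click this element" command be deterministic instead of depending on pixel coordinates.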
3. 53 versatile skills that can do almost anything
This tool is equipped with a total of 53 sophisticated commands, ranging from a single click to window management. By combining these commands, an AI can perform tasks on your PC such as:
- Accurately finding and clicking numerous buttons and checkboxes
- Typing text into input fields just like a human
- Navigating the menus of complex programs without getting stuck
- Moving files by dragging and dropping
- Reading content copied to the clipboard or writing new content
- Opening, closing, and resizing multiple running windows
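In practice, an agent chains commands like these one step at a time, reading the JSON reply after each one before deciding the next. The sketch below only *builds* such a plan; the subcommand names and flags are invented for illustration and are not Agent-desktop's documented interface.

```python
import shlex

def plan_step(tool, action, *args):
    """Compose one CLI invocation as an argument list (subprocess-ready)."""
    return [tool, action, *args]

# A hypothetical three-step plan: click a field, type into it, read it back.
# A real agent would execute each step (e.g. with subprocess.run), parse the
# JSON it returns, and only then choose the next step.
plan = [
    plan_step("agent-desktop", "click", "--ref", "e2"),
    plan_step("agent-desktop", "type", "--text", "1350"),
    plan_step("agent-desktop", "read", "--ref", "e2"),
]

for step in plan:
    print(shlex.join(step))
```

Passing argument lists rather than shell strings avoids quoting bugs when the typed text contains spaces or special characters.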
Current Situation: ‘Real’ Local AI Approaches Us
Currently, Agent-desktop is a complete 'cross-platform' tool that runs in virtually every computer environment we use: Windows, macOS, and Linux. Many AI developers around the world are already attaching these precise 'hands' to their AI agents.
In fact, open-source AI agents like [Goose](https://goose-docs.ai/) are already using such technologies to modify files and operate apps directly on a user's computer. Google's [Gemini CLI](https://developers.google.com/gemini-code-assist/docs/gemini-cli) is likewise evolving toward performing complex tasks, such as fixing bugs, by directly using tools on our PCs from within a terminal.
Of course, a challenge remains: not all apps expose their accessibility tree completely. However, the office software and system settings apps we use every day can already be controlled this way reliably.
What will the future look like? (Imagine this)
As these tools become more common, the way we interact with computers will change completely.
Just imagine. On a Monday morning, you say this to your AI while drinking a cup of coffee: “Find all the receipts among the emails that arrived last week and organize them into an Excel file. Then save that file in the ‘May Expenses’ folder and send it to the team leader via messenger.”
The AI will then use Agent-desktop to open the email app, find the receipts, run Excel to build the table, and move the file through the file explorer, finishing the whole sequence in moments.
Most importantly, all these processes take place locally and safely within my computer, without uploading my data to an external server. The era of a true ‘personal assistant’ is right before our eyes.
AI’s Perspective: Through the Eyes of Reporter MindTickleBytes AI
Until now, the way AI agents handled desktop apps was as clunky and frustrating as attempting precision surgery while wearing thick mittens. However, Agent-desktop is like handing an AI a very sharp and precise ‘surgical tool.’
Especially in an age where security is sensitive, the fact that all automation is processed locally without the need to transmit my screen to a cloud server is a very encouraging change. In the future, beyond ‘which AI is smarter,’ the core competitiveness will be ‘which AI handles the tools on my computer faster and more accurately.’ AI has finally taken a seat in the ‘real cockpit’ that controls the massive machine that is our PC.
References
- GitHub - lahfir/agent-desktop: Native desktop automation CLI for AI agents. Control any application through OS accessibility trees with structured JSON output and deterministic element refs.
- GitHub - ericclemmons/agent-native: macOS native app automation CLI for AI agents
- [DesktopCtl - Desktop Control for AI agents](https://desktopctl.com/)
- [goose - Your open source AI agent](https://goose-docs.ai/)
- [Gemini CLI - Gemini Code Assist - Google for Developers](https://developers.google.com/gemini-code-assist/docs/gemini-cli)
- [Agent Desktop - Desktop Automation CLI for AI Agents - EveryDev.ai](https://www.everydev.ai/tools/agent-desktop)
- [agent-desktop - Agents AI Agent Skill - SkillsCat](https://skills.cat/skills/lahfir/agent-desktop/agent-desktop)
- Show HN: Agent-desktop - Native desktop automation CLI for AI agents
- Agent-Desktop: AI Automation CLI for Desktops - PromptZone
- Agent Desktop - AI Skill - Termo
- Accio Work - Local-First Desktop AI Agent
- agent-desktop - MCP Store