A new tool called Agent-desktop has been released, enabling AI to directly control apps using the computer's 'Accessibility Tree' without screenshots or image analysis.
Introduction: AI Assistants Have Started to ‘Truly’ Understand My Computer
Imagine this. You ask your AI assistant, “Open last month’s household budget Excel file and compare it with this month’s credit card statement.” Until now, for an AI to do this, it had to capture the screen frame by frame and use ‘eyes’ (computer vision) to find where the Excel buttons were and what the numbers were within those images.
To use an analogy, it was like trying to find the exit of a foggy maze with a single tiny flashlight. Because the AI had to scan and analyze the screen every time, the work was slow and error-prone. Now a way has opened for AI to clear the fog and work by directly reading the computer's 'blueprints,' thanks to an innovative tool called Agent-desktop.
Why is this important?
The computer programs we use every day have structures completely different from websites. While websites are transparently published in code that is easy for AI to read, programs installed on my PC—like word processors, Excel, or Photoshop—are very difficult for AI to look inside.
For existing AI agents (AI programs that judge and act on their own) to control my PC, they had to analyze screen images, which brought three major headaches:
- Slow Speed: Analyzing high-definition screen capture images takes a significant amount of time. It’s like taking a photo of an entire book and trying to decipher each character one by one.
- Low Accuracy: If another window slightly covers a button, or if the icon shape changes just a bit because of a Windows theme change, the AI quickly loses its way and gets flustered.
- High Cost: To ‘see’ the screen with eyes, expensive ‘AI Vision Models’ must be kept running constantly, consuming massive computing power and cost.
Agent-desktop solves this problem in a completely different way. Instead of 'looking' at the screen from the outside, it directly reads the 'map of information' that the operating system already holds internally.
Easy Understanding: A Braille Map for the ‘Blind Assistant’ Becomes the AI’s Weapon
The core of this technology is a somewhat unfamiliar system called the Accessibility Tree.
Originally, the accessibility tree was created to help the visually impaired. For those who cannot see the screen, the operating system (OS) organizes what buttons and text are currently on the screen into an invisible, structural map. A Screen Reader reads this map and provides voice guidance to the user.
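The shape of this 'map' can be sketched as nested data. The sketch below is purely illustrative: the node names and fields are invented for this example, not the actual schema of any OS accessibility API or of Agent-desktop.

```python
# A toy accessibility tree: each node has a role, a name, and children.
# Real trees come from OS APIs (e.g. UI Automation on Windows,
# AXUIElement on macOS), but the nested structure is the same idea.
calculator = {
    "role": "window", "name": "Calculator", "children": [
        {"role": "text", "name": "display", "value": "0", "children": []},
        {"role": "button", "name": "7", "children": []},
        {"role": "button", "name": "+", "children": []},
        {"role": "button", "name": "=", "children": []},
    ],
}

def find(node, role, name):
    """Depth-first search for the first node matching role and name."""
    if node["role"] == role and node["name"] == name:
        return node
    for child in node.get("children", []):
        hit = find(child, role, name)
        if hit:
            return hit
    return None

# No screenshots needed: the '=' button is located by structure, not pixels.
equals = find(calculator, "button", "=")
print(equals["name"])  # -> =
```

This is why a covered or re-themed button does not confuse a tree-based agent: the node is found by its role and name, not by how it happens to be drawn.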
Agent-desktop has essentially handed this ‘Braille map’ to the AI.
- To put it in perspective: If the conventional method is wandering through a complex maze with open eyes to find the way, the Agent-desktop method is like having the entire blueprint of the maze in hand and teleporting straight to the destination.
By reading the 'blueprints' directly, the AI can grasp the structure of an app with complete accuracy, without ever taking a screenshot of what is on the screen.
Key Features of Agent-desktop: The Precise Hands of a Small but Powerful AI
This tool is beginning to be evaluated by developers as the ‘most efficient hands for an AI assistant.’ Its specific features are as follows:
1. Incredibly fast and lightweight
This program is built in Rust, a modern programming language known for being extremely fast and stable. The entire installation weighs only about 15 MB, roughly the size of two or three high-resolution smartphone photos. It installs easily and runs immediately, with no complex dependencies.
2. Communicates in a language AI understands (JSON)
When an AI asks, “What’s on the screen right now?”, Agent-desktop responds in a format called JSON, rather than in complex signals only a computer can understand. Simply put, it provides answers as structured data, much like a well-organized ‘receipt list’ or ‘table of contents.’ This allows the AI to judge situations and act far more reliably.
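To make this concrete, here is what working with such a reply might look like. The JSON schema below is a hypothetical sketch in the spirit of Agent-desktop's "structured JSON output with deterministic element refs"; the actual field names may differ.

```python
import json

# A hypothetical screen snapshot, as structured JSON.
# Field names ("ref", "role", "name", "value") are assumptions for
# illustration, not Agent-desktop's documented schema.
snapshot = json.loads("""
{
  "window": "Budget.xlsx",
  "elements": [
    {"ref": "e1", "role": "button",  "name": "Save"},
    {"ref": "e2", "role": "textbox", "name": "Cell A1", "value": "1200"},
    {"ref": "e3", "role": "button",  "name": "Close"}
  ]
}
""")

# Because the data is structured, "find the Save button" is a dictionary
# lookup, not an image-recognition problem.
by_name = {el["name"]: el for el in snapshot["elements"]}
save_ref = by_name["Save"]["ref"]
print(save_ref)  # -> e1
```

A stable ref like `e1` is what lets a later "click this element" command be deterministic instead of depending on pixel coordinates.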
3. 53 versatile skills that can do almost anything
This tool is equipped with a total of 53 sophisticated commands, ranging from a single click to window management. By combining these commands, an AI can perform tasks on your PC such as:
- Accurately finding and clicking numerous buttons and checkboxes
- Typing text into input fields just like a human
- Navigating the menus of complex programs without getting stuck
- Moving files by dragging and dropping
- Reading content copied to the clipboard or writing new content
- Opening, closing, and resizing multiple running windows
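In practice, an agent chains commands like these one step at a time, reading the JSON reply after each one before deciding the next. The sketch below only *builds* such a plan; the subcommand names and flags are invented for illustration and are not Agent-desktop's documented interface.

```python
import shlex

def plan_step(tool, action, *args):
    """Compose one CLI invocation as an argument list (subprocess-ready)."""
    return [tool, action, *args]

# A hypothetical three-step plan: click a field, type into it, read it back.
# A real agent would execute each step (e.g. with subprocess.run), parse the
# JSON it returns, and only then choose the next step.
plan = [
    plan_step("agent-desktop", "click", "--ref", "e2"),
    plan_step("agent-desktop", "type", "--text", "1350"),
    plan_step("agent-desktop", "read", "--ref", "e2"),
]

for step in plan:
    print(shlex.join(step))
```

Passing argument lists rather than shell strings avoids quoting bugs when the typed text contains spaces or special characters.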
Current Situation: ‘Real’ Local AI Approaches Us
Currently, Agent-desktop is a complete 'cross-platform' tool that runs in virtually every computer environment we use: Windows, macOS, and Linux. Many AI developers around the world are already attaching these precise 'hands' to their AI agents.
In fact, open-source AI agents like [Goose](https://goose-docs.ai/) are already using such technologies to modify files and operate apps directly on a user's computer. Google's [Gemini CLI](https://developers.google.com/gemini-code-assist/docs/gemini-cli) is likewise evolving toward performing complex tasks, such as fixing bugs, by directly using tools on our PCs from within a terminal.
Of course, a challenge remains: not all apps expose their accessibility tree completely. However, the office software and system settings apps we use every day can already be controlled this way reliably.
What will the future look like? (Imagine this)
As these tools become more common, the way we interact with computers will change completely.
Just imagine. On a Monday morning, you say this to your AI while drinking a cup of coffee: “Find all the receipts among the emails that arrived last week and organize them into an Excel file. Then save that file in the ‘May Expenses’ folder and send it to the team leader via messenger.”
The AI will then use Agent-desktop to open the email app, find the receipts, run Excel to build the table, and move the file through the file explorer, finishing the whole sequence in moments.
Most importantly, all these processes take place locally and safely within my computer, without uploading my data to an external server. The era of a true ‘personal assistant’ is right before our eyes.
AI’s Perspective: Through the Eyes of Reporter MindTickleBytes AI
Until now, the way AI agents handled desktop apps was as clunky and frustrating as attempting precision surgery while wearing thick mittens. However, Agent-desktop is like handing an AI a very sharp and precise ‘surgical tool.’
Especially in an age where security is sensitive, the fact that all automation is processed locally without the need to transmit my screen to a cloud server is a very encouraging change. In the future, beyond ‘which AI is smarter,’ the core competitiveness will be ‘which AI handles the tools on my computer faster and more accurately.’ AI has finally taken a seat in the ‘real cockpit’ that controls the massive machine that is our PC.
References
- GitHub - lahfir/agent-desktop: Native desktop automation CLI for AI agents. Control any application through OS accessibility trees with structured JSON output and deterministic element refs.
- GitHub - ericclemmons/agent-native: macOS native app automation CLI for AI agents
- [DesktopCtl - Desktop Control for AI agents](https://desktopctl.com/)
- [goose - Your open source AI agent](https://goose-docs.ai/)
- [Gemini CLI - Gemini Code Assist - Google for Developers](https://developers.google.com/gemini-code-assist/docs/gemini-cli)
- [Agent Desktop - Desktop Automation CLI for AI Agents - EveryDev.ai](https://www.everydev.ai/tools/agent-desktop)
- [agent-desktop - Agents AI Agent Skill - SkillsCat](https://skills.cat/skills/lahfir/agent-desktop/agent-desktop)
- Show HN: Agent-desktop - Native desktop automation CLI for AI agents
- Agent-Desktop: AI Automation CLI for Desktops - PromptZone
- Agent Desktop - AI Skill - Termo
- Accio Work - Local-First Desktop AI Agent
- agent-desktop - MCP Store