An AI That Clicks and Types for You? Google Unveils "Gemini 2.5 Computer Use"

AI Summary

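Google has announced an "agentic" AI model that can understand what is on the screen, carry out 13 kinds of actions on its own, and operate a web browser.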

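Picture a Monday morning: you walk into the office and face a pile of emails and receipts. You have to open each one, check the date and amount, and then type everything into the company's expense system, a tedious and lengthy process. Logging in, uploading files, filling in fields: simple, repetitive tasks like these eat up a large share of our valuable time. But what if you could simply tell an AI, "Sort these receipts and submit them for me"? The AI would watch the screen in place of your eyes, move the mouse in place of your hand, and finish the whole job. This is no longer science fiction; it is the near future sketched by Google's recently released "Gemini 2.5 Computer Use" model. (Introducing the Gemini 2.5 Computer Use model)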

Why does this matter?

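Until now, the ChatGPT and Gemini we have grown fond of have mainly been AIs that are good at "talking". They answer our questions fluently and impress us by summarizing complex papers. But think about it: arguably 80 to 90 percent of what we do on a computer is not conversation but concrete "action": clicking specific buttons, scrolling the screen, typing into a search box, and so on.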

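The arrival of Gemini 2.5 Computer Use marks AI's evolution from a "talking secretary" that simply relays knowledge into an "agent" that actually carries out tasks for the user. (Introducing Gemini 2.5 Computer Use model: A Paradigm Shift in AI’s Digital Dexterity) Much like a person, the model can intuitively understand the layout of a web browser or smartphone app screen and can directly control the mouse and keyboard. ([Introducing Gemini 2.5 Computer Use: AI for web and… | LinkedIn](https://www.linkedin.com/posts/googleaidevs_introducing-gemini-25-computer-use-available-activity-7381415403840864256-ycSe)) Put simply, AI now has "hands" that can operate a computer. This holds great potential for automating repetitive office work in businesses, and even for fundamentally changing how software is tested. ([Gemini 2.5 Computer Use model | Gemini API | Google AI for Developers](https://ai.google.dev/gemini-api/docs/models/gemini-2.5-computer-use-preview-10-2025))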

In plain terms: AI gets "eyes" and "hands"

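The way Gemini 2.5 Computer Use works can be explained with the concept of an "agent loop". Think of driving on an unfamiliar road: you keep repeating the cycle "watch the road (eyes) -> compare it with the navigation route and decide (brain) -> turn the wheel or brake (hands)". (Introducing the Gemini 2.5 Computer Use model)

Assess the situation (eyes): The AI first takes a screenshot of the current screen and analyzes it on the spot. This is the stage where it "sees" where the buttons and input fields are. (Introducing the Gemini 2.5 Computer Use model)
Reason (brain): If the user asks, "Book me a flight," the AI compares the current screen with the request and decides, for example, "First I should click the 'Log in' button." (Google’s Gemini 2.5 Computer Use model can navigate the web like a …)
Act (hands): Once it has decided, it moves the mouse cursor to that spot and clicks, or types the username and password on the keyboard. The loop then starts again from a fresh screenshot. (Introducing the Gemini 2.5 Computer Use model)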
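The eyes, brain, and hands cycle above can be sketched as a small Python loop. This is a minimal illustration under stated assumptions, not Google's implementation or the Gemini API: `take_screenshot`, `propose_action`, and `execute` are hypothetical callbacks. The sketch assumes the model only proposes each step and separate client code carries it out, which a real integration would back with a model call and a browser-automation tool. The action names are illustrative labels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One UI step proposed by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def run_agent_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    propose_action: Callable[[str, bytes, List[Action]], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Repeat observe -> reason -> act until the model reports it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                      # eyes: capture the screen
        action = propose_action(goal, screenshot, history)  # brain: choose next step
        if action is None:                                  # None means "task finished"
            break
        execute(action)                                     # hands: click or type
        history.append(action)
    return history


# Toy stand-ins (hypothetical) so the loop runs without a real browser or model.
screens = iter([b"login page", b"search page", b"results page"])
plan = [
    Action("click_at", {"x": 120, "y": 40}),
    Action("type_text_at", {"x": 300, "y": 200, "text": "flights"}),
]


def fake_screenshot() -> bytes:
    return next(screens)


def fake_model(goal: str, screenshot: bytes, history: List[Action]) -> Optional[Action]:
    # Follow the fixed plan step by step, then signal completion with None.
    return plan[len(history)] if len(history) < len(plan) else None


executed: List[Action] = []
steps = run_agent_loop("book a flight", fake_screenshot, fake_model, executed.append)
```

The `max_steps` cap is a design choice worth keeping in any real agent: it bounds how long the loop can run if the model never reports that the task is finished.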

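This remarkable ability builds on the strong visual understanding and reasoning of Gemini 2.5 Pro, one of Google's most capable AI models. ([Introducing Gemini 2.5 Computer Use: AI for web and… | LinkedIn](https://www.linkedin.com/posts/googleaidevs_introducing-gemini-25-computer-use-available-activity-7381415403840864256-ycSe)) In particular, it can control the mouse cursor with pixel-level precision, and it was trained specifically on 13 core actions that occur in a web browser to make it more proficient. (Google News - Google releases Gemini 2.5, a new AI model with web…)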
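For developers, the fixed action set and the precise cursor control translate into two small client-side chores: rejecting anything outside the supported actions, and converting the model's coordinates into real screen pixels. The sketch below is an illustration under stated assumptions, not code from Google. The 13 action names come from our reading of the Gemini API documentation for the preview model. The premise that the model reports coordinates on a normalized 1000x1000 grid, which the client then scales to the actual screen, is likewise an assumption to verify. `is_supported` and `to_pixels` are hypothetical helper names.

```python
# ASSUMPTION: action names as we read them in the Gemini API docs for
# gemini-2.5-computer-use-preview-10-2025; check the current docs before use.
SUPPORTED_ACTIONS = frozenset({
    "open_web_browser", "wait_5_seconds", "go_back", "go_forward",
    "search", "navigate", "click_at", "hover_at", "type_text_at",
    "key_combination", "scroll_document", "scroll_at", "drag_and_drop",
})

# ASSUMPTION: the model reports x/y on a normalized 0-999 grid, not in pixels.
GRID_SIZE = 1000


def is_supported(action_name: str) -> bool:
    """Return True if the proposed action is in the allowed set."""
    return action_name in SUPPORTED_ACTIONS


def to_pixels(x: int, y: int, screen_w: int, screen_h: int) -> tuple:
    """Scale normalized grid coordinates to the real screen size in pixels."""
    if not (0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE):
        raise ValueError(f"coordinate outside grid: ({x}, {y})")
    return int(x / GRID_SIZE * screen_w), int(y / GRID_SIZE * screen_h)


# Example: a click proposed at grid point (500, 250) on a 1440x900 screen.
px = to_pixels(500, 250, 1440, 900)
```

With these assumptions, grid point (500, 250) on a 1440x900 screen maps to pixel (720, 225). The same proposal on a different screen size lands at the same relative position.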

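To use another analogy: if existing AI is a theorist who has memorized the entire "computer user manual", Gemini 2.5 Computer Use is a new hire who has actually picked up the mouse and started an internship. It is still in the "Preview" stage, so it may be somewhat slow or make occasional mistakes. Even so, the fact that it can observe the screen and find its own way is a huge leap in itself. (Google releases a preview of its Gemini 2.5 Computer Use AI model …)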

Where things stand: how far has it come?

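In early October 2025, just one day after rival OpenAI talked about similar technology, Google made this model publicly available in a bid to take the lead in the AI agent market. (Google launches Gemini 2.5 Computer Use to rival OpenAI agents) The model is currently offered to developers as a "public preview", so they can test it themselves and integrate it into their own services. (Introducing Gemini 2.5 Computer Use model: A Paradigm Shift in AI’s Digital Dexterity)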

Google did more than show what is possible; it backed up its claims with objective performance measures (benchmarks).

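Online-Mind2Web & WebVoyager: The model scored strongly on these tests, which check whether an AI can reach its goal on complex websites without getting lost. (Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents - InfoQ)
AndroidWorld: Beyond web browsers, the model also performed strongly on this benchmark, which tests how well an agent can operate an Android phone environment. (Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents - InfoQ)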

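These results support the view that Gemini 2.5 Computer Use can read a screen with something like human intuition and use that understanding to solve real problems. (Gemini 2.5 Computer Use Model: How It Automates Browsers)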

What comes next?

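Experts predict that this model's debut will be a turning point in how AI enters our daily lives. (2025 Complete Guide: Gemini 2.5 Computer Use Model - Revolutionary …) Before long, we may see striking changes like these in everyday life: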

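A personal assistant beyond imagination: Just say, "I'm meeting friends near Gangnam Station this weekend. Book a restaurant rated 4 stars or higher and post the place and time in the group chat." The AI would open a reservation app to make the booking, then open a messaging app to send the details to your friends.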

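A revolution in software quality: Engineers building new apps may no longer need to stay up all night hunting for bugs. AI agents could test every corner of an app thousands of times, find the errors, and write up reports. ([Gemini 2.5 Computer Use model | Gemini API | Google AI for Developers](https://ai.google.dev/gemini-api/docs/models/gemini-2.5-computer-use-preview-10-2025))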

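Technology that benefits everyone: It could also be a big help for older people who are not comfortable with phones or computers, and for people with visual impairments. Instead of working through complicated sequences of clicks, they could use digital services freely with voice commands alone.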

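Of course, problems remain. If the AI buys the wrong item, or mishandles a user's sensitive personal information, how should that be dealt with? Questions like these call for safety and ethics guidelines. Still, this small step by Google makes it feel as though the era is close at hand when AI goes beyond being a mere tool and becomes a reliable "partner" living alongside us in the digital world. (Is Gemini 2.5 Computer Use Model the Future of AI-Driven Interface Control?)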
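One common way to reduce the risk of an agent buying the wrong item is a human-in-the-loop gate. Before any step that spends money or submits data, the agent pauses and asks the user for approval. The sketch below is a generic illustration, not a Gemini API feature. `ProposedAction`, `needs_confirmation`, `guarded_execute`, and the keyword list are all hypothetical, and keyword matching is only a crude stand-in for a real risk policy.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical keywords marking "high-stakes" steps; substring matching is
# crude and can misfire, so a real system would need a more careful policy.
RISKY_KEYWORDS = ("purchase", "checkout", "payment", "delete", "submit")


@dataclass
class ProposedAction:
    name: str          # e.g. "click_at"
    description: str   # what the agent says it is about to do


def needs_confirmation(action: ProposedAction) -> bool:
    """Flag actions whose name or description mentions a risky keyword."""
    text = f"{action.name} {action.description}".lower()
    return any(word in text for word in RISKY_KEYWORDS)


def guarded_execute(
    action: ProposedAction,
    execute: Callable[[ProposedAction], None],
    ask_user: Callable[[ProposedAction], bool],
) -> bool:
    """Run low-risk actions directly; run risky ones only if the user approves."""
    if needs_confirmation(action) and not ask_user(action):
        return False  # user declined, so nothing is executed
    execute(action)
    return True


def always_decline(action: ProposedAction) -> bool:
    # Stand-in for a real confirmation prompt shown to the user.
    return False


done: List[ProposedAction] = []
scrolled = guarded_execute(
    ProposedAction("scroll_document", "scroll down the results"), done.append, always_decline
)
bought = guarded_execute(
    ProposedAction("click_at", "press the Purchase button"), done.append, always_decline
)
```

Here the scroll runs without interruption, while the purchase click is held back because the stand-in user declines it.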

The AI's Perspective

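View from the MindTickleBytes AI reporter: "AI that could only talk a good game has finally taken hold of the computer mouse. This is a highly symbolic moment: AI technology has crossed the 'language barrier' into the 'realm of action'. Before long, we may not even stop to think 'I should have AI do this'; we will simply work with AI agents as naturally as we breathe air. As convenience grows, we should also begin a serious public conversation about how much autonomy to grant AI, and how far to trust it."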

References

Introducing the Gemini 2.5 Computer Use model
Google News - Google releases Gemini 2.5, a new AI model with web…
Gemini 2.5 Computer Use AGENT: THE BEST AGENTIC… - YouTube
[Introducing Gemini 2.5 Computer Use: AI for web and… | LinkedIn](https://www.linkedin.com/posts/googleaidevs_introducing-gemini-25-computer-use-available-activity-7381415403840864256-ycSe)
Gemini 2.5 Computer Use Model: How It Automates Browsers
Gemini Computer Use: Google’s FREE Browser… - Analytics Vidhya
[Gemini 2.5 Computer Use model | Gemini API | Google AI for Developers](https://ai.google.dev/gemini-api/docs/models/gemini-2.5-computer-use-preview-10-2025)
Is Gemini 2.5 Computer Use Model the Future of AI-Driven Interface Control?
Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents - InfoQ
2025 Complete Guide: Gemini 2.5 Computer Use Model - Revolutionary …
Google launches Gemini 2.5 Computer Use to rival OpenAI agents
Google releases a preview of its Gemini 2.5 Computer Use AI model …
Introducing Gemini 2.5 Computer Use model: A Paradigm Shift in AI’s Digital Dexterity
Google’s Gemini 2.5 Computer Use model can navigate the web like a …

FACT-CHECK SUMMARY

Claims checked: 15
Claims verified: 15
Verdict: PASS


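Test your understanding

Q1. When Gemini 2.5 Computer Use carries out a task, what data does it receive first?

The user's voice
A screenshot or contextual information
Excel file data

Through its "agent loop", the model takes a screenshot to understand the current state, then decides on the next action.

Q2. How many kinds of actions has the model been trained to perform?

5 kinds
13 kinds
100 kinds

Gemini 2.5 Computer Use is trained to perform 13 different actions for navigating and operating a browser.

Q3. Among the benchmarks on which the model performed well, which one tests the Android environment?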

Online-Mind2Web
WebVoyager
AndroidWorld

Gemini 2.5 Computer Use showed strong performance on several interface-control benchmarks, including AndroidWorld.
