AI Looks at My Computer Screen and Works for Me? The Emergence of Alibaba's Qwen3.7-Plus

AI Summary

Released by Alibaba in June 2026, Qwen3.7-Plus is a 'multimodal agent' AI that goes beyond a simple chatbot, looking at computer screens and using tools on its own to handle complex tasks.

Imagine this. You get to work in the morning, turn on your computer, and say to the AI: “Find only the emails from yesterday with receipts attached and organize them into an Excel file.” An AI of the past would have stopped at kindly telling you how to use Excel functions or writing out a report format in text. Ultimately, typing on the keyboard, clicking the mouse, and finishing the job was up to us.

But now it’s different. The AI directly opens your email window, visually reads the receipt images, launches the Excel program, and inputs the data one by one. It’s as if you have an ‘invisible assistant’ who looks at the exact same computer monitor as you and moves the mouse on your behalf.

This science-fiction-like story has become a reality. This is thanks to Qwen3.7-Plus, a new AI model released by Alibaba on June 1, 2026 [Qwen3.7Plus vs Qwen3.7Max in 2026: Multimodal Agent or…]. Moving beyond a simple ‘smart chatbot,’ this AI plays the role of a true ‘digital intern’ that looks at the computer screen on its own and works as if moving a mouse.

Why is this important?

The chatbot AIs we’ve used so far are like competent ‘librarians’ who never leave their seats. Ask a question and they will dig through an enormous number of books to find a great answer, but they won’t finish a report on your behalf and email it to your boss.

In contrast, Qwen3.7-Plus is not just a conversational AI, but an Agent model (a program that autonomously performs actions to achieve goals) [[Qwen3.7-Plus: Multimodal Agent Intelligence — LLM… explainx.ai](https://explainx.ai/llms/qwen3-7-plus-multimodal-agent-intelligence)]. Simply put, beyond just giving the AI a mouth to answer questions, it has been given ‘hands’ and ‘judgment’ to directly use software tools, write code, and lead the entire workflow of productivity tasks [Qwen3.7-Plus - Qwen Cloud].

This means the significance of the time we spend in front of monitors every day could fundamentally change. Multistep tasks like coding, data analysis, and complex web searches no longer need to be instructed one by one by a human. The AI can autonomously process tasks by opening web browsers and switching between the programs it needs [[Qwen3.7 Plus API AIML API](https://aimlapi.com/models/qwen3-7-plus)].

Understanding it easily: An AI that gained eyes and hands

To appreciate what Qwen3.7-Plus can do, you need to know the meaning of the word Multimodal (a technology that understands various forms of data, such as images and sounds, in addition to text). “Modal” refers to a kind of ‘sense’ for receiving data. A multimodal model adds a ‘visual’ sense to existing AIs that only read text, allowing them to comprehend images, videos, and even graphical user interfaces (GUIs, the visual elements shown on a screen like icons or menu windows) on computer screens at a glance [Qwen3.7-Plus Review: Alibaba’s GUI Agent, Tested].

To use a more everyday analogy: traditional text-based AI was a smart coworker who only worked over ‘phone calls.’ You had to verbally describe the tables or images you had on your screen at length and in detail for them to grasp the situation and give advice. Out of frustration, we often just ended up doing it ourselves.

However, Qwen3.7-Plus is a coworker who sits right next to you and looks at the computer monitor together with you. It can directly ‘see’ and intuitively understand where the ‘save’ icon is in the corner of the screen or what numbers are written in a complex Excel table [[Qwen3.7 Plus model NanoGPT](https://nano-gpt.com/models/text/qwen3.7-plus)].

Alibaba’s research team massively upgraded this visual capability on top of a solid fundamental backbone that processes text logically. Through this, they integrated the process of visually grasping a situation and logically inferring the next action via language into a single seamless workflow [Research - Qwen]. As a result, going beyond simply guessing what an image is, it has reached an astonishing level of autonomously determining tool invocations, thinking, “Looking at this screen, I should click this button next and run that tool” [Qwen3.7-Plus 发布：多模态 Agent 该怎么测 - HotAI - 博客园].
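The "look at the screen, decide, then act" cycle described above can be illustrated with a toy loop. This is only a minimal sketch, not Alibaba's implementation: `UIElement`, `Action`, `decide_next_action`, and `run_agent` are hypothetical names invented for illustration, the "screen" is a hand-written list of labeled buttons rather than a real screenshot, and the decision function is a simple lookup standing in for the model.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """A visible item on the (simulated) screen. Hypothetical type."""
    label: str
    x: int
    y: int


@dataclass
class Action:
    """What the agent wants to do next. Hypothetical type."""
    tool: str          # "click", "wait", or "done"
    target: str = ""


def decide_next_action(screen, plan, steps_done):
    """Stand-in for the model: find the next planned button on the screen."""
    if steps_done >= len(plan):
        return Action("done")
    wanted = plan[steps_done]
    for element in screen:
        if element.label == wanted:
            return Action("click", element.label)
    # The button the plan needs is not visible yet.
    return Action("wait")


def run_agent(screen, plan, max_turns=10):
    """Observe, decide, act; repeat until the plan is finished or turns run out."""
    log = []
    steps_done = 0
    for _ in range(max_turns):
        action = decide_next_action(screen, plan, steps_done)
        if action.tool == "done":
            break
        if action.tool == "click":
            log.append(f"click:{action.target}")
            steps_done += 1
        else:
            log.append("wait")
    return log


screen = [UIElement("Open", 10, 10), UIElement("Save", 50, 10)]
print(run_agent(screen, ["Open", "Save"]))
```

In a real system the lookup would be replaced by a call to a vision-language model that receives an actual screenshot, and the "click" would be carried out by a tool that controls the mouse.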

Current status: A dual-track of flagship text AI and multimodal agents

Alibaba first officially introduced this powerful Qwen3.7 product family at the Alibaba Cloud Summit held from May 20 to 21, 2026 [Qwen 3.7 Complete Guide: Alibaba’s Strongest AI Model Yet (2026)]. On May 19, the day before the official event, they even surprised people by quietly revealing a preview version through Qwen Chat [Qwen 3.7 Review: Alibaba’s New Flagship Ranks #1 in China …]. The most interesting point to watch is that Alibaba simultaneously released two flagship models with different specialties.

The first player is Qwen3.7-Max, a model specialized in pure-text processing that focuses all its intelligence on logical thinking in ‘text’. It recorded an accuracy rate of 60.6% on SWE-Bench Pro, a very difficult and authoritative test that evaluates software engineering capabilities, demonstrating top-tier reasoning abilities that the cited comparison describes as on par with human programmers [Qwen3.7Plus vs Qwen3.7Max in 2026: Multimodal Agent or…].

The second player is Qwen3.7-Plus, the focus of today’s article. While inheriting the robust text reasoning capabilities (text backbone) of Max, this model dramatically improves its ability to read images, videos, and computer screens (vision-language). Rather than solving lab exam questions, it is a ‘balanced’, versatile model focused on directly executing actions to perform complex real-world tasks [[Qwen3.7 Plus: The Balanced Multimodal Flagship Qwen 3.7](https://qwen3lm.com/qwen3.7-plus/)].

So, how can we try out this smart AI assistant? Currently, these models are available through platforms like Alibaba’s Model Studio and Bailian [Qwen3.7-Plus: Multimodal Agent on Bailian - kiadev.net]. They are not ‘open-source’ models whose weights anyone can download and run freely on their own computer; they are offered as closed-weights models, accessible only via API (a communication tool for exchanging data between programs) [Qwen 3.7 Complete Guide: Alibaba’s Strongest AI Model Yet (2026)].

What will happen in the future?

The emergence of Qwen3.7-Plus sends us an important message: large language models (LLMs) worldwide are moving far beyond merely conversing in text on a screen. AI is now evolving rapidly toward advanced agent systems and ‘embodied intelligence’ (artificial intelligence that solves problems by interacting with the environment through a body or tools), acting directly on the physical world or on a computer’s operating system environment [Multimodal Agent Receives Major Upgrade! Alibaba Officially …].

In the past, the hassle of copying, pasting, and executing code generated by AI was a human’s responsibility. Now, AI models have entered the realm of true ‘executability,’ where they autonomously establish work plans without human intervention, write code and execute it immediately (self-programming), and relentlessly find causes and fix them on their own without stopping when an error occurs (autonomous iteration) [Alibaba Unveils Qwen3.7-Plus Multimodal AI Agent Model].
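The "write code, run it, and fix it when an error occurs" cycle can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Qwen3.7-Plus works internally: `revise` and `run_with_self_repair` are hypothetical names, and `revise` is a hard-coded rule standing in for the model reading an error message and rewriting its own code.

```python
def revise(source, error):
    """Hypothetical stand-in for the model repairing its code after an error."""
    if isinstance(error, ZeroDivisionError):
        # Toy "fix": replace the zero divisor with 2.
        return source.replace("/ 0", "/ 2")
    return source


def run_with_self_repair(source, max_attempts=3):
    """Run the code; on failure, revise it and try again (autonomous iteration)."""
    history = []
    for attempt in range(1, max_attempts + 1):
        namespace = {}
        try:
            exec(source, namespace)          # execute the current draft
            history.append((attempt, "ok"))
            return namespace.get("result"), history
        except Exception as exc:
            history.append((attempt, type(exc).__name__))
            source = revise(source, exc)     # feed the error back into a new draft
    return None, history


value, history = run_with_self_repair("result = 84 / 0")
print(value, history)
```

The `max_attempts` limit is a deliberate design choice in this sketch: an agent that retries forever on an error it cannot fix would never hand control back to the human.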

In the near future, the way we give work instructions will change completely. The era of asking only for fragmented results, like “Translate this English document into Korean,” will fade. Instead, we will delegate whole tasks from start to finish, saying things like, “Starting from market research on competitors for this new product project, take care of everything up to analyzing the data and writing the final PPT report for the presentation.”

MindTickleBytes AI Reporter’s View: The appearance of a multimodal agent with eyes and hands suggests that the paradigm of how humans and computers communicate is changing entirely. In the past, humans had to operate computers according to the rules of keyboards and mice, but now computers directly understand human ‘natural language instructions’ and ‘visual environments’ and move on their own. Qwen3.7-Plus is akin to a declaration that a tireless assistant who understands our instructions has already begun to live inside our computers. Your next reliable work partner might not be a human.

References

Qwen3.7-Plus - Qwen Cloud
[Qwen3.7-Plus: Multimodal Agent Intelligence — LLM… explainx.ai](https://explainx.ai/llms/qwen3-7-plus-multimodal-agent-intelligence)
Qwen3.7Plus vs Qwen3.7Max in 2026: Multimodal Agent or…
[Qwen3.7 Plus API AIML API](https://aimlapi.com/models/qwen3-7-plus)
[Qwen3.7 Plus model NanoGPT](https://nano-gpt.com/models/text/qwen3.7-plus)
Qwen 3.7 Complete Guide: Alibaba’s Strongest AI Model Yet (2026)
Qwen3.7-Plus Review: Alibaba’s GUI Agent, Tested
Qwen3.7-Plus 发布：多模态 Agent 该怎么测 - HotAI - 博客园
Qwen3.7-Plus: Multimodal Agent on Bailian - kiadev.net
Multimodal Agent Receives Major Upgrade! Alibaba Officially …
Research - Qwen
Alibaba Unveils Qwen3.7-Plus Multimodal AI Agent Model
[Qwen3.7 Plus: The Balanced Multimodal Flagship Qwen 3.7](https://qwen3lm.com/qwen3.7-plus/)
Qwen 3.7 Review: Alibaba’s New Flagship Ranks #1 in China …

Test Your Understanding

Q1. What is the biggest feature of the Qwen3.7-Plus model?

A. It can only process text.
B. It is a multimodal agent that can look at computer screens and use tools.
C. It is open-source and can be downloaded for free by anyone.

Answer: B. Qwen3.7-Plus is a multimodal agent capable of understanding not only text but also images, videos, and computer screens, and can invoke tools.

Q2. Among the Qwen3.7 product family, which model focused solely on text processing capabilities and recorded a high score on SWE-Bench Pro?

A. Qwen3.7-Mini
B. Qwen3.7-Plus
C. Qwen3.7-Max

Answer: C. Qwen3.7-Max is a pure-text flagship model that recorded a score of 60.6% on the SWE-Bench Pro coding benchmark.

Q3. How can Qwen3.7-Plus currently be accessed?

A. Anyone can download its weights.
B. It operates only as a smartphone app.
C. It is a closed-weights model accessible only via API.

Answer: C. Currently, both the Qwen3.7-Plus and Max models are closed-weights and can only be accessed through an API.