ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude




A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows.

Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action. 

Trained on roughly 50B tokens and offered in 7B- and 72B-parameter versions, the PC/macOS agent achieves state-of-the-art (SOTA) performance on more than 10 GUI benchmarks spanning perception, grounding and overall agent capabilities, consistently beating OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper. 

Source: arXiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.

Its UI features two tabs — one to the left showing its step-by-step “thinking,” and a larger one to the right where it pulls up files, websites and apps and automatically takes action. 

For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.”

In response, UI-TARS navigates to the Delta Air Lines website, fills in the “from” and “to” fields, clicks on the relevant dates, then sorts and filters the results by price, explaining each step in its thinking box before taking action. 

In another scenario, it is instructed to install the autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task: 

  • It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.” 
  • Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.” 
  • It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar. 
  • Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”
  • Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
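The pattern in the VS Code walkthrough (state a thought, act, check the result, retry if the action did not land) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not ByteDance’s implementation: the `Step` record, `run_agent` and the simulated `fake_execute` environment are all hypothetical, and a real agent would decide success by inspecting the next screenshot rather than by a boolean callback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str     # text that would appear in the "thinking" panel
    action: str      # action the agent attempted
    succeeded: bool  # whether the (simulated) environment confirmed the action


def run_agent(plan: List[str], execute: Callable[[str], bool], max_retries: int = 2) -> List[Step]:
    """Hypothetical think-then-act loop: state a thought, act, and retry a failed action."""
    steps: List[Step] = []
    for action in plan:
        for attempt in range(max_retries + 1):
            if attempt == 0:
                thought = f"I need to {action}."
            else:
                thought = f"The previous attempt to {action} may not have worked, so I will try again."
            ok = execute(action)
            steps.append(Step(thought, action, ok))
            if ok:
                break
        else:
            # Every attempt failed: stop rather than act on an unexpected screen state.
            return steps
    return steps


# Simulated environment (hypothetical): the first click on the Extensions tab misses.
clicks = {"count": 0}


def fake_execute(action: str) -> bool:
    if action == "click the Extensions tab":
        clicks["count"] += 1
        return clicks["count"] >= 2
    return True


steps = run_agent(
    ["open VS Code", "wait for VS Code to finish loading",
     "click the Extensions tab", "search for autoDocstring and install it"],
    fake_execute,
)
for step in steps:
    print(step.succeeded, step.thought)
```

The retry budget (`max_retries`) is an assumed parameter; it keeps the sketch from looping forever when the screen never reaches the expected state.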

Outperforming its rivals

Across a variety of benchmarks, researchers report that UI-TARS consistently outranked OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For instance, in VisualWebBench — which measures a model’s ability to ground web elements including webpage quality assurance and optical character recognition — UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%). 
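To put the VisualWebBench numbers in perspective, the gap can be expressed in percentage points. The sketch below uses only the three scores quoted above; it computes simple differences and says nothing about statistical significance.

```python
# VisualWebBench scores (percent) as quoted in the article.
scores = {"UI-TARS-72B": 82.8, "GPT-4o": 78.5, "Claude 3.5": 78.2}

leader = scores["UI-TARS-72B"]
# Margin of UI-TARS-72B over each rival, in percentage points.
margins = {name: round(leader - s, 1) for name, s in scores.items() if name != "UI-TARS-72B"}
print(margins)  # {'GPT-4o': 4.3, 'Claude 3.5': 4.6}
```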

It also did significantly better on the WebSRC (understanding of semantic content and layout in web contexts) and ScreenQA-short (comprehension of complex mobile screen layouts and web structure) benchmarks. UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B scored 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o. 

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual ability lays the foundation for agent tasks, where accurate environmental understanding is crucial for task execution and decision-making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. Further, researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps). 

Source: arXiv

Under the hood

To help it take step-by-step actions and recognize what it’s seeing, UI-TARS was trained on a large-scale dataset of screenshots annotated with parsed metadata, including element description and type, visual description, bounding boxes (position information), element function and text, collected from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only individual elements but also their spatial relationships and the overall layout. 
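As a rough picture of what one annotated screenshot element might look like, and how element records could be turned into a dense caption that mentions spatial relationships, here is a minimal sketch. The `Element` fields mirror the metadata categories listed above, but the schema, the `(left, top, right, bottom)` box convention and the center-based `relation` rule are assumptions for illustration, not the researchers’ actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    element_type: str                # e.g. "button", "text_field"
    text: str                        # visible text, if any
    function: str                    # what the element does
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def center(bbox: Tuple[int, int, int, int]) -> Tuple[float, float]:
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def relation(a: Element, b: Element) -> str:
    """Coarse spatial relation of a relative to b, based on box centers."""
    ax, ay = center(a.bbox)
    bx, by = center(b.bbox)
    if abs(ax - bx) >= abs(ay - by):
        return "left of" if ax < bx else "right of"
    return "above" if ay < by else "below"


def dense_caption(elements: List[Element]) -> str:
    """One line per element, plus the relation between each neighboring pair."""
    lines = [f'{e.element_type} "{e.text}" at {e.bbox}: {e.function}' for e in elements]
    for a, b in zip(elements, elements[1:]):
        lines.append(f'"{a.text}" is {relation(a, b)} "{b.text}"')
    return "\n".join(lines)


# Hypothetical flight-search form, loosely modeled on the demo above.
form = [
    Element("text_field", "From", "departure airport", (100, 50, 300, 80)),
    Element("text_field", "To", "arrival airport", (320, 50, 520, 80)),
    Element("button", "Search", "submit the flight search", (540, 50, 620, 80)),
]
caption = dense_caption(form)
print(caption)
```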

The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action — such as a mouse click or keyboard input — has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image. 
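Both ideas can be illustrated on parsed screen elements. The sketch below assumes each screenshot has already been reduced to a `{bounding box: visible text}` mapping (the real system works on images), describes what changed between two consecutive states, and treats any change as evidence that an action occurred. `set_of_marks` only assigns numeric labels in reading order; actually drawing the marks onto the image is left out. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def describe_transition(before: Dict[Box, str], after: Dict[Box, str]) -> List[str]:
    """Caption the differences between two consecutive screens given as {bbox: text}."""
    changes = []
    for box in sorted(before.keys() | after.keys()):
        if box not in after:
            changes.append(f"element at {box} disappeared")
        elif box not in before:
            changes.append(f"element '{after[box]}' appeared at {box}")
        elif before[box] != after[box]:
            changes.append(f"text at {box} changed from '{before[box]}' to '{after[box]}'")
    return changes


def action_occurred(before: Dict[Box, str], after: Dict[Box, str]) -> bool:
    # Assumption: any visible change means the click or keystroke took effect.
    return bool(describe_transition(before, after))


def set_of_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Assign numeric marks in reading order (top-to-bottom, then left-to-right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i + 1: box for i, box in enumerate(ordered)}


before = {(0, 0, 100, 30): "Search", (0, 40, 200, 70): ""}
after = {(0, 0, 100, 30): "Search", (0, 40, 200, 70): "autoDocstring"}
print(describe_transition(before, after))
print(set_of_marks([(300, 10, 350, 30), (10, 10, 60, 30), (10, 50, 60, 70)]))
```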

The model is equipped with both short-term and long-term memory to handle tasks at hand while also retaining historical interactions to improve later decision-making. Researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, “reflection” thinking, milestone recognition and error correction. 
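One simple way to picture a split between short-term and long-term memory is a fixed-size window of recent steps whose oldest entries spill into a growing history. The sketch below is an assumption-laden illustration, not how UI-TARS stores its context: the `AgentMemory` class and `short_term_size` are hypothetical, and a real system might summarize older steps rather than keep them verbatim.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class AgentMemory:
    short_term_size: int = 3
    short_term: Deque[str] = field(init=False, default_factory=deque)
    long_term: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Bounded window of the most recent steps.
        self.short_term = deque(maxlen=self.short_term_size)

    def record(self, step: str) -> None:
        # When the window is full, move the oldest step into long-term history.
        if len(self.short_term) == self.short_term.maxlen:
            self.long_term.append(self.short_term[0])
        self.short_term.append(step)

    def context(self) -> str:
        """Everything the agent can condition on: history first, then recent steps."""
        return " | ".join(self.long_term + list(self.short_term))


memory = AgentMemory(short_term_size=2)
for step in ["open VS Code", "click the Extensions tab", "type autoDocstring"]:
    memory.record(step)
print(memory.context())
```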

Researchers emphasized that it is critical that the model be able to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error correction and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps. 
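The difference between the two data types can be shown by turning one trajectory with a labeled mistake into two training samples. In this minimal sketch (the `Sample` record and `build_reflection_samples` are hypothetical), the error-correction sample keeps the context just before the mistake and swaps in the corrected action as the label, while the post-reflection sample leaves the mistake in the history and labels the recovery step that should follow it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    history: Tuple[str, ...]  # actions the model sees as context
    target: str               # action the model is trained to output next


def build_reflection_samples(
    trajectory: List[str], error_index: int, corrected: str, recovery: str
) -> Tuple[Sample, Sample]:
    """Turn one trajectory with a labeled mistake into two training samples."""
    prefix = tuple(trajectory[:error_index])
    # Error correction: same context as the mistake, but the label is the right action.
    correction = Sample(history=prefix, target=corrected)
    # Post-reflection: the mistake stays in the history; the label is the recovery step.
    post_reflection = Sample(history=prefix + (trajectory[error_index],), target=recovery)
    return correction, post_reflection


trajectory = ["open VS Code", "click the Explorer tab", "type autoDocstring"]
correction, post_reflection = build_reflection_samples(
    trajectory,
    error_index=1,
    corrected="click the Extensions tab",
    recovery="close the Explorer view and click the Extensions tab",
)
print(correction)
print(post_reflection)
```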

“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers write.

Clearly, UI-TARS exhibits impressive capabilities, and it’ll be interesting to see its evolving use cases in the increasingly competitive AI agents space. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”

Researchers point out that Claude Computer Use “performs strongly in web-based tasks but significantly struggles with mobile scenarios, indicating that the GUI operation ability of Claude has not been well transferred to the mobile domain.” 

By contrast, “UI-TARS exhibits excellent performance in both website and mobile domain.” 


