A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows.
Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action.
Trained on roughly 50B tokens and offered in 7B and 72B parameter versions, the PC/macOS agent achieves state-of-the-art (SOTA) performance on 10-plus GUI benchmarks measuring performance, perception, grounding and overall agent capabilities, consistently beating out OpenAI’s GPT-4o, Claude and Google’s Gemini.
“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper.
How UI-TARS explains its thinking
UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.
Its UI features two tabs — one to the left showing its step-by-step “thinking,” and a larger one to the right where it pulls up files, websites and apps and automatically takes action.
For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.”
In response, UI-TARS navigates to the website for Delta Airlines, fills in the “from” and “to” fields, clicks on the relevant dates and sorts and filters by price, explaining each step in its thinking box before taking action.
In another scenario, it is instructed to install an autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task:
- It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”
- Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.”
- It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.
- Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”
- Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
Outperforming its rivals
Across a variety of benchmarks, researchers report that UI-TARS consistently outranked OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.
For instance, in VisualWebBench — which measures a model’s ability to understand and ground web elements, covering tasks such as webpage quality assurance and optical character recognition — UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).
It also did significantly better on WebSRC benchmarks (understanding of semantic content and layout in web contexts) and ScreenQA-short (comprehension of complex mobile screen layouts and web structure). UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.
“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual ability lays the foundation for agent tasks, where accurate environmental understanding is crucial for task execution and decision-making.”
UI-TARS also showed impressive results in ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. Further, researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).
Under the hood
To help it take step-by-step actions and recognize what it’s seeing, UI-TARS was trained on a large-scale dataset of screenshots that parsed metadata including element description and type, visual description, bounding boxes (position information), element function and text from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only elements but spatial relationships and overall layout.
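To make that concrete, here is a minimal Python sketch of what one such per-element record might look like. The field names and the VS Code example are illustrative assumptions, not ByteDance’s published schema.

```python
# A minimal sketch of the kind of per-element screenshot metadata described:
# element type, descriptions, bounding box, function and visible text.
# Field names are illustrative, not ByteDance's actual schema.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    element_type: str        # e.g. "button", "text_field", "tab"
    description: str         # short natural-language description
    visual_description: str  # appearance: color, icon, size
    bounding_box: tuple      # (x_min, y_min, x_max, y_max) in pixels
    function: str            # what interacting with the element does
    text: str                # visible text, if any

# One hypothetical element from a VS Code screenshot:
extensions_tab = ScreenElement(
    element_type="tab",
    description="Extensions view button in the left activity bar",
    visual_description="square icon made of four blocks",
    bounding_box=(0, 320, 48, 368),
    function="opens the Extensions marketplace panel",
    text="",
)
```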
The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action — such as a mouse click or keyboard input — has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
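As a rough illustration of how SoM prompting works in general, the sketch below overlays numbered marks on element bounding boxes using Pillow. It is an assumption about the technique, not ByteDance’s actual pipeline, and the file names are hypothetical.

```python
# A rough sketch of set-of-mark (SoM) prompting: draw a numbered label on each
# element's bounding box so a model can refer to regions by mark.
# Illustrative only; not ByteDance's implementation.
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path, boxes):
    """Draw a numbered label on each bounding box (x0, y0, x1, y1)."""
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")
    return img

# Hypothetical usage: mark two elements found in a screenshot.
# marked = overlay_marks("screen.png", [(0, 320, 48, 368), (60, 10, 400, 40)])
# marked.save("screen_marked.png")
```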
The model is equipped with both short-term and long-term memory to handle tasks at hand while also retaining historical interactions to improve later decision-making. Researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, “reflection” thinking, milestone recognition and error correction.
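A heavily simplified sketch of that short-term/long-term split might look like the following; the class and method names are hypothetical and only meant to show the idea of keeping recent steps at hand while archiving full task traces.

```python
# Simplified sketch of the memory split described in the paper: short-term
# memory holds the current task's recent steps, long-term memory retains past
# interactions for later decision-making. Names are illustrative assumptions.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_limit=10):
        # Recent (thought, action, observation) steps for the task at hand.
        self.short_term = deque(maxlen=short_term_limit)
        # Completed task traces retained across tasks.
        self.long_term = []

    def record_step(self, thought, action, observation):
        self.short_term.append(
            {"thought": thought, "action": action, "observation": observation}
        )

    def finish_task(self, goal):
        # Archive the whole trace so future tasks can draw on it.
        self.long_term.append({"goal": goal, "trace": list(self.short_term)})
        self.short_term.clear()
```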
Researchers emphasized that it is critical that the model be able to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error correction and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.
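For illustration only, those two data types might take a shape like the samples below; the keys, action strings and coordinates are invented, not taken from the paper.

```python
# Illustrative shapes for the two data types described: error-correction
# samples label the mistake and the corrective action; post-reflection samples
# pair an error with a simulated recovery step. All values are hypothetical.
error_correction_sample = {
    "observation": "Extensions panel did not open after the click",
    "erroneous_action": "click(x=10, y=340)",   # imprecise click
    "corrective_action": "click(x=24, y=344)",  # labeled correction
}

post_reflection_sample = {
    "reflection": "The click missed the Extensions tab; retry on its center.",
    "recovery_action": "click(x=24, y=344)",
}
```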
“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers write.
Researchers point out that Claude Computer Use “performs strongly in web-based tasks but significantly struggles with mobile scenarios, indicating that the GUI operation ability of Claude has not been well transferred to the mobile domain.”
By contrast, “UI-TARS exhibits excellent performance in both website and mobile domain.”
Clearly, UI-TARS exhibits impressive capabilities, and it’ll be interesting to see its evolving use cases in the increasingly competitive AI agents space. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”
This New AI Search Engine Has a Gimmick: Humans Answering Questions
When online search engines first appeared, they seemed miraculous. Now, though? It is a truth near-universally acknowledged that search is in the dumps, corroded by spam and ads.
Big players like Google insist that AI is the savior of search, despite many early attempts to integrate AI ending in disaster. There’s a wave of services offering AI-powered answers, including Perplexity and OpenAI’s SearchGPT. Recently, I got an email promoting another new AI search engine—but this one has a notably quirky approach to answering questions. Called Pearl, it’s coming out of beta this week. Like other AI-powered search engines, Pearl initially answers questions using large language models. Then it does something unusual: It offers a human fact-check and the option to connect with experts and chat online or on the phone about the answer.
Reading about its gimmick, I didn’t really understand why it bothered with the AI answers at all. Why not just go straight to the human? I called its CEO, Andy Kurtzig, to find out.
Kurtzig stressed that Pearl is an extension of another search project he’s been working on for decades: a more traditional offering called JustAnswer, which charges a subscription and connects people to subject-matter experts based on their questions. “We started playing with the concept of AI combined with professional services about 11 years ago,” he says. When the generative AI boom took off, he decided to make Pearl a stand-alone product. (The company has had an older chatbot product named Pearl for many years and at one point rebranded JustAnswer as Pearl and then changed it back.)
Pearl’s LLM is built on top of a number of popular foundational models, including ChatGPT, and is customized to include JustAnswer’s trove of data, which includes an extensive history of questions posed and answered since it launched in 2003.
In Kurtzig’s view, Pearl lowers the barrier to entry for answers from experts. While JustAnswer costs money, Pearl has a freemium model. Its AI answers are free, as is its first layer of human fact-checking, the TrustScore™, a rating on a scale of 1 to 5 of the quality of an AI answer. When Pearl users want to go a step further and have an expert expand on an AI answer, they are prompted to sign up for its $28-a-month service.
One line in particular had jumped out at me in the initial email I received about Pearl. It claimed that Pearl would “solve many of the mounting legal challenges AI search engines face.” But … how? Kurtzig noted that most AI search engines could be held legally liable for the answers they give, since they act more as a publisher than a platform and may therefore fall outside the protections of Section 230 of the Communications Decency Act. Because Pearl incorporates human experts into its answer process, Kurtzig believes it will retain the Section 230 protections that shield traditional search engines.
On top of that, he claims that Pearl is significantly less likely to provide misinformation than many other AI search engines—which he believes are likely to deal with “a tidal wave” of lawsuits based on bad answers they give. “Those other players are building amazing technologies. I call them Ferraris or Lamborghinis,” Kurtzig says. “We’re building a Volvo—safety first.”
This pitch about Pearl’s superiority, of course, made me even more keen to try it. Kurtzig seemed so certain that Pearl would still enjoy Section 230 protections. I asked the AI if it agreed.
Pearl said it likely qualifies as an “interactive computer service” under Section 230, which would mean that it’d be shielded from being treated as a publisher, just as Kurtzig suspected. But, the AI went on, “Pearl’s situation is unique because it generates content using AI.” It didn’t have a definitive answer for me after all.
When I asked to speak to a lawyer directly, it rerouted me to JustAnswer, where it asked me to provide the answer I wanted verified. I said I needed to go back and copy the answer, as it was several paragraphs long, but when I navigated back to the Pearl website, the conversation was gone and it had reset to a fresh chat.
When I tried again, this time opening the Pearl browser on desktop, I received a similarly uncertain answer. I decided to trigger a human fact-check; after several minutes, I received the TrustScore™—a measly 3!
Pearl recommended that I seek out an actual expert opinion, porting me to its subscription page. I’d been given a log-in so I didn’t have to pay while I tested the tool. It then connected me with one of its “legal eagle” experts.
Unfortunately, the lawyer’s answers were no clearer than the AI’s. He noted that there was ongoing legal debate about how Section 230 will apply to AI search engines and other AI tools, but when I asked him to provide specific arguments, he gave a strange answer noting that “most use shell companies or associations to file.”
When I asked for an example of one such shell company—quite confused about what that has to do with a public debate about Section 230—the “legal eagle” asked if I wanted him to put together a package. Even more confused, I said yes. I got a pop-up window indicating that my expert wanted to charge me an additional $165 to dig up the information.
I declined, frustrated.
I then asked Pearl about the history of WIRED. The AI response was serviceable, although basically the same stuff you’ll find on Wikipedia. When I asked for its TrustScore™ I was once again confronted with a 3, suggesting it was not a very good answer. I selected the option to connect with another human expert. This time around, possibly because it was a question about the media and not a straightforward legal or medical topic, it took a while for the expert to appear—well over 20 minutes. When he did, the expert (it was never established what gave him his media bona fides, although his profile indicated he’d been working with JustAnswer since 2010) gave me a remarkably similar answer to the AI. Since I was doing a free test, it didn’t matter, but I would’ve been annoyed if I had actually paid the subscription fee just to get the same mediocre answer from both a human and an AI.
For my last stab at using the service, I went for a straightforward question: how to refinish kitchen floors. This time, things went much more smoothly. The AI returned an adequate answer, akin to a transcript of a very basic YouTube tutorial. When I asked the human expert to assign a TrustScore™, they gave it a 5. It seemed accurate enough, for sure. But—as someone who really does want to DIY refinish my kitchen’s old pine planks—I think when I actually go looking for guidance, I’ll rely on other online communities of human voices, ones that don’t charge $28 a month: YouTube and Reddit.