PhoneWorld: Building Scalable and Realistic Environments for Mobile Agents

🌐 Introduction

Over the past year, Mobile Agents have evolved rapidly. Modern models can now understand mobile interfaces, navigate across applications, perform multi-step workflows, and execute increasingly complex tasks on smartphones. Yet despite significant advances in model capabilities, a fundamental bottleneck remains: the environment.

For Mobile Agents, progress is determined not only by model architecture and training algorithms, but also by the quality of the environments in which they learn. Training data, action execution, result verification, and failure reproduction all depend on the underlying environment.

Recent advances in AI-powered software generation have dramatically lowered the cost of building mobile applications. Tools such as Google’s “Generate App with a Single Sentence” demonstrate how natural language can now produce functional Android applications with minimal human intervention. In theory, this enables the creation of thousands of training environments for Mobile Agents.

However, a critical question emerges:

Are AI-generated applications truly representative of real-world mobile experiences?

If generated applications merely resemble real apps visually while lacking realistic navigation flows, state transitions, user behaviors, and interaction patterns, then agents trained within those environments may struggle to generalize to actual smartphone usage.

This challenge is the central focus of PhoneWorld: Scaling Phone-Use Agent Environments, a research project led by Tencent Hunyuan in collaboration with multiple academic institutions.

Rather than simply generating applications, PhoneWorld seeks to create environments that are:

Trainable
Verifiable
Resettable
Scalable
Realistic enough to transfer to real-world mobile scenarios

📱 Why Real Apps Are Not Enough

At first glance, using real applications appears to be the ideal solution.

After all, real apps represent the exact environment Mobile Agents are expected to operate within.

In practice, however, large-scale training on production applications introduces several significant challenges.

State Reset Is Difficult

Many mobile tasks modify application state.

Examples include:

Sending messages
Adding products to carts
Saving bookmarks
Changing settings
Creating posts

Once these actions occur, the application’s internal state changes permanently.

Re-running identical tasks requires restoring:

User accounts
Local storage
Application databases
Session information
Cached content

At scale, this process becomes extremely expensive and difficult to automate.

Task Verification Is Challenging

Determining whether an agent successfully completed a task is surprisingly difficult.

Consider a simple instruction:

Send a message to a contact.

A model may claim success, but verifying the result requires direct access to the application’s internal state.

Most commercial applications do not expose reliable interfaces for such verification, making automated evaluation unstable and labor-intensive.

Real Apps Introduce Excessive Noise

Production applications contain countless variables that are irrelevant to training objectives.

Common sources of instability include:

Login expiration
Captcha challenges
Security mechanisms
Notification popups
Permission dialogs
Network fluctuations
Application updates
Dynamic content feeds

As a result, identical tasks can follow different execution paths at different times.

For reproducible training and evaluation, this variability becomes a major obstacle.

🌍 PhoneWorld’s Core Idea

PhoneWorld seeks a middle ground between synthetic environments and production applications.

The project’s objective is not to recreate every feature of a real app, but rather to preserve the aspects that matter most for Mobile Agents:

Interface structure
Navigation paths
State transitions
Task objectives
User interaction patterns

The result is a mobile environment that remains:

Runnable
Verifiable
Resettable
Scalable

while retaining meaningful similarities to real-world usage.

🏗️ From Real Apps to Trainable Android Environments

The PhoneWorld pipeline can be summarized as follows:

Recover the interaction structure of real applications and transform it into executable Android environments suitable for agent training.

Rather than manually designing environments, PhoneWorld begins with data collected from real application usage.

The system analyzes:

Screenshots
User interaction trajectories
Page transitions
State-changing actions

Using this information, it reconstructs the functional skeleton of an application.

Rebuilding Usage Structure Instead of Screenshots
#

Many application-generation systems focus primarily on visual replication.

PhoneWorld takes a different approach.

The goal is not to copy appearances but to reconstruct how users actually interact with applications.

The system first identifies:

Home pages
Search interfaces
Detail views
Chat screens
Shopping flows
Order management pages

It then determines:

Which pages appear most frequently
How users navigate between them
Which actions modify application state

This information forms the basis for structured Product Requirements Documents (PRDs).
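
The analysis above can be sketched as a few counters over recorded trajectories. This is a minimal illustration, not PhoneWorld's actual pipeline: the trajectory format (a list of `(page_type, action, changes_state)` steps), the page names, and the `summarize` helper are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical trajectories: each step is (page_type, action, changes_state).
trajectories = [
    [("home", "tap_search", False),
     ("search", "open_result", False),
     ("detail", "add_to_cart", True)],
    [("home", "open_chat", False),
     ("chat", "send_message", True)],
    [("home", "tap_search", False),
     ("search", "open_result", False),
     ("detail", "bookmark", True)],
]

def summarize(trajs):
    """Count page visits, page-to-page transitions, and state-changing actions."""
    page_counts = Counter()
    transitions = Counter()
    mutating_actions = Counter()
    for traj in trajs:
        for i, (page, action, changes_state) in enumerate(traj):
            page_counts[page] += 1
            if changes_state:
                mutating_actions[(page, action)] += 1
            if i + 1 < len(traj):
                # Record the edge from this page to the next page visited.
                transitions[(page, traj[i + 1][0])] += 1
    return page_counts, transitions, mutating_actions

page_counts, transitions, mutating_actions = summarize(trajectories)
print(page_counts.most_common(3))   # most frequently visited page types
print(transitions.most_common(3))   # most common navigation edges
print(sorted(mutating_actions))     # actions that modify application state
```

The three counters correspond to the three questions in the list: page frequency, navigation between pages, and state-modifying actions.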

Generating Structured Application Specifications

For each important page type, PhoneWorld automatically generates a detailed specification.

These PRDs describe:

Layout structure
Interactive elements
Navigation behavior
Visual characteristics
State update requirements

In effect, the PRD becomes a blueprint for environment generation.
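
One way to picture such a blueprint is as a small typed record per page. The sketch below is an assumption for illustration only: the source does not publish the PRD schema, so the `PagePRD` and `ElementSpec` classes and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ElementSpec:
    """One interactive element on a page (hypothetical schema)."""
    element_id: str
    kind: str                             # e.g. "button", "text_field"
    navigates_to: Optional[str] = None    # target page type, if tapping navigates
    updates_state: Optional[str] = None   # mutable table written, if any

@dataclass
class PagePRD:
    """Specification for one page type (hypothetical schema)."""
    page_type: str
    layout: str
    visual_notes: str
    elements: List[ElementSpec] = field(default_factory=list)

    def navigation_targets(self):
        # Pages this page can lead to, derived from its elements.
        return sorted({e.navigates_to for e in self.elements if e.navigates_to})

    def state_updates(self):
        # Mutable tables this page is allowed to write.
        return sorted({e.updates_state for e in self.elements if e.updates_state})

detail_prd = PagePRD(
    page_type="product_detail",
    layout="image carousel, title, price, action bar",
    visual_notes="white background, primary action in accent color",
    elements=[
        ElementSpec("btn_add_cart", "button", updates_state="cart_items"),
        ElementSpec("btn_bookmark", "button", updates_state="bookmarks"),
        ElementSpec("btn_open_cart", "button", navigates_to="cart"),
    ],
)
print(detail_prd.navigation_targets())
print(detail_prd.state_updates())
```

Keeping navigation targets and state updates explicit in the specification is what lets later stages generate tasks and verifiers from it rather than from pixels.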

Instead of asking:

What does this screen look like?

PhoneWorld asks:

How is this screen actually used?

This distinction is crucial for creating meaningful training environments.

🔄 Building Applications with Real State Changes

Many automatically generated applications provide only static navigation.

While such prototypes may appear functional, they are insufficient for agent training.

Most real-world tasks involve changing the environment.

Examples include:

Bookmarking content
Sending messages
Adding products to carts
Posting comments
Updating settings

PhoneWorld therefore incorporates a controllable data layer.

Read-Only Content

The environment contains static entities such as:

Products
Videos
Contacts
Locations
Music
Social posts

These support browsing and information retrieval tasks.

Mutable State

The system also maintains dynamic data structures for:

Shopping carts
Bookmarks
Messages
Comments
Orders
User preferences

As agents interact with the environment, these states are updated and stored within a local database.

This transforms the application from a simple interface prototype into a fully interactive environment.

Most importantly, the state can be reset at any time, enabling reproducible experimentation.
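
The split between read-only content and resettable mutable state can be sketched with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the table names, seed rows, and the `MockAppState` class are hypothetical, and the generated Android apps store their state on-device rather than in Python.

```python
import sqlite3

# Hypothetical seed script: one read-only table, one mutable table.
SEED_SQL = """
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE cart_items (product_id INTEGER, quantity INTEGER);
INSERT INTO products VALUES (1, 'Wireless Earbuds', 29.99), (2, 'Phone Case', 9.5);
"""

class MockAppState:
    """In-memory app database that can be restored to its seed snapshot."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.reset()

    def reset(self):
        # Discard the current database and replay the seed script,
        # so every episode starts from the same known state.
        self.conn.close()
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(SEED_SQL)

    def add_to_cart(self, product_id, quantity=1):
        self.conn.execute(
            "INSERT INTO cart_items VALUES (?, ?)", (product_id, quantity)
        )
        self.conn.commit()

    def cart_size(self):
        return self.conn.execute("SELECT COUNT(*) FROM cart_items").fetchone()[0]

state = MockAppState()
state.add_to_cart(1)
print(state.cart_size())  # cart now holds one row
state.reset()
print(state.cart_size())  # cart is empty again after reset
```

Because reset simply replays a seed script, the same task can be re-run any number of times from an identical starting point.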

🤖 AI-Generated Apps Need Verification

Generating a runnable Android APK is only part of the challenge.

PhoneWorld uses coding agents to automatically produce:

Kotlin projects
Jetpack Compose interfaces
Android application packages

However, deployment alone does not guarantee usability.

For Mobile Agent training, environments must support reliable execution of real tasks.

Each generated application therefore undergoes extensive validation.

Automated Testing

Automated evaluation verifies:

Navigation correctness
Button functionality
State updates
Task execution pathways
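
As an illustration of the navigation check, the sketch below walks a page graph from the home page and reports pages that cannot be reached and links that point nowhere. The graph, page names, and `check_navigation` helper are hypothetical; the source does not describe how PhoneWorld's automated tests are implemented.

```python
from collections import deque

# Hypothetical navigation graph: page -> pages reachable with one tap.
NAV_GRAPH = {
    "home": ["search", "chat_list", "cart"],
    "search": ["detail"],
    "detail": ["cart"],
    "chat_list": ["chat"],
    "chat": [],
    "cart": ["orders"],
    "orders": [],
    "settings": [],  # deliberately unreachable, to show a failing check
}

def check_navigation(graph, start="home"):
    """Return (unreachable pages, link targets that are not defined pages)."""
    dangling = sorted(
        {t for targets in graph.values() for t in targets if t not in graph}
    )
    # Breadth-first search over the page graph from the start page.
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    unreachable = sorted(set(graph) - seen)
    return unreachable, dangling

unreachable, dangling = check_navigation(NAV_GRAPH)
print("unreachable:", unreachable)
print("dangling:", dangling)
```

A check like this covers only navigation correctness; button behavior and state updates would need tests that drive the running app.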

Manual Auditing

Researchers additionally compare generated applications against real-world counterparts.

The review process focuses on:

Core page structure
Navigation flows
User interactions
State transition behavior

This dual-layer validation ensures that generated environments remain useful for agent development.

✅ Tasks Must Be Executable and Verifiable

An application alone is not sufficient.

A useful training environment requires tasks whose outcomes can be objectively evaluated.

PhoneWorld generates tasks directly from:

Page specifications
Data schemas
Application entities
State definitions

As a result, every task references information that genuinely exists inside the environment.

Examples include:

Products
Contacts
Group chats
Locations
Messages

This enables reliable verification mechanisms.
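
A rough sketch of entity-grounded task generation follows. The entity lists, task templates, and the `generate_tasks` helper are assumptions for illustration; the point is only that each task carries an expected answer or expected database row taken from data that exists in the environment.

```python
# Hypothetical entities that exist inside the mock environment.
PRODUCTS = [
    {"id": 1, "name": "Wireless Earbuds", "price": "29.99"},
    {"id": 2, "name": "Phone Case", "price": "9.50"},
]
CONTACTS = [{"id": 10, "name": "Alice"}]

def generate_tasks(products, contacts):
    """Build tasks whose targets are copied from real environment entities."""
    tasks = []
    for p in products:
        # Query task: the expected answer comes from the entity itself.
        tasks.append({
            "type": "retrieval",
            "instruction": f"Find the price of {p['name']}.",
            "expected_answer": p["price"],
        })
        # Action task: the expected effect is a row in a mutable table.
        tasks.append({
            "type": "state_change",
            "instruction": f"Add {p['name']} to the cart.",
            "expected_row": {"table": "cart_items", "product_id": p["id"]},
        })
    for c in contacts:
        tasks.append({
            "type": "state_change",
            "instruction": f"Send a message to {c['name']}.",
            "expected_row": {"table": "messages", "contact_id": c["id"]},
        })
    return tasks

tasks = generate_tasks(PRODUCTS, CONTACTS)
for t in tasks:
    print(t["type"], "-", t["instruction"])
```

Since no task can mention a product or contact that is missing from the data, every generated task has a well-defined check.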

Information Retrieval Verification

For query-based tasks, the system checks whether the agent’s response matches the correct answer.

Examples include:

Finding a product price
Looking up a location
Retrieving contact information
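
A minimal sketch of such an answer check is shown below. The source says only that the response is compared with the correct answer; the normalization rules here (case, whitespace, currency symbols, numeric comparison) and the `answers_match` helper are assumptions, not PhoneWorld's documented matcher.

```python
import re

def normalize(text):
    """Lowercase, trim, collapse whitespace, drop currency symbols and end punctuation."""
    text = text.strip().lower()
    text = re.sub(r"[$¥€£]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".!?")

def answers_match(agent_answer, expected):
    """Compare an agent's answer with the ground-truth answer."""
    a, e = normalize(agent_answer), normalize(expected)
    try:
        # Compare numerically when both sides parse as numbers ("9.5" vs "9.50").
        return abs(float(a) - float(e)) < 1e-9
    except ValueError:
        # Otherwise fall back to exact match on the normalized strings.
        return a == e

print(answers_match("$29.99", "29.99"))
print(answers_match(" Central  Park. ", "central park"))
print(answers_match("30", "29.99"))
```

A stricter or looser matcher may be appropriate depending on the task; the important property is that the expected answer comes from the environment's own data.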

State Change Verification

For action-based tasks, PhoneWorld directly examines database state.

The verifier can confirm whether:

A message was sent
A bookmark was created
An item was added to a cart
A comment was submitted

This approach removes ambiguity and enables large-scale automated evaluation.
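
The idea can be sketched as a query against the environment's database after the agent finishes. The schema, the `make_env` and `verify_message_sent` helpers, and the sample message are hypothetical, and an in-memory SQLite database stands in for the app's local store.

```python
import sqlite3

def make_env():
    """Create a toy environment database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE messages (contact TEXT, body TEXT);
        CREATE TABLE bookmarks (item_id INTEGER);
    """)
    return conn

def verify_message_sent(conn, contact, body):
    """Return True if a matching message row exists in the database."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE contact = ? AND body = ?",
        (contact, body),
    ).fetchone()
    return row[0] > 0

conn = make_env()
before = verify_message_sent(conn, "Alice", "See you at 6")

# Simulate the effect of the agent tapping "Send" in the app.
conn.execute("INSERT INTO messages VALUES (?, ?)", ("Alice", "See you at 6"))
after = verify_message_sent(conn, "Alice", "See you at 6")

print(before, after)
```

The verifier never relies on what the agent claims; it inspects the stored state, so a false report of success is caught.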

📊 PhoneWorld Infrastructure at Scale

The current PhoneWorld ecosystem includes:

Metric	Value
Mock Android Apps	34
Application Domains	16
Human-Audited Evaluation Tasks	120
Successful Agent Trajectories	3,354
Interaction Steps	36,193

On average, a successful trajectory spans about 10.8 interaction steps (36,193 / 3,354). At this scale, PhoneWorld is among the larger publicly described infrastructures built specifically for Mobile Agent research.

🚀 Does PhoneWorld Actually Improve Agents?

The ultimate question is straightforward:

Can a synthetic environment built from real trajectories meaningfully improve Mobile Agents?

PhoneWorld addresses this through a series of experiments.

Training Value

Researchers replaced part of the auxiliary AndroidWorld training data with only 10,000 PhoneWorld interaction steps.

The results showed improvements on four benchmarks, three of them external to PhoneWorld:

Benchmark	Improvement
HYMobileBench	+17.7
AndroidControl	+6.0
AndroidWorld	+14.7
PhoneWorld	+52.5

These results indicate that PhoneWorld contributes transferable skills beyond its own environment.

Can Synthetic Environments Replace Real Apps?

Researchers also tested a more aggressive setup by replacing AndroidWorld auxiliary data entirely with PhoneWorld data.

The outcome was revealing.

PhoneWorld performance continued improving, while HYMobileBench and AndroidControl maintained gains.

However, AndroidWorld performance declined.

This suggests that synthetic environments are not complete replacements for real-world data.

Instead, they serve as highly effective complements.

Real applications provide authentic distributional coverage, while PhoneWorld provides scalable and controllable supervision.

Does Environment Scaling Continue to Help?

Researchers examined whether increasing environment size produces continued benefits.

Scaling Interaction Data

As PhoneWorld supervision increased, task success rates improved:

PhoneWorld Steps	Task Success Rate
0	14.2%
10K	64.2%
20K	70.0%
36K	73.3%

Success rates rise at every step, with by far the largest gain coming from the first 10K steps.
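
The marginal gain per added block of data can be worked out directly from the figures above. The snippet is plain arithmetic on the reported numbers; treating "36K" as exactly 36,000 steps is an assumption made for simplicity.

```python
steps = [0, 10_000, 20_000, 36_000]   # "36K" is treated as 36,000 here (assumption)
success = [14.2, 64.2, 70.0, 73.3]    # task success rate, in percent

gains = []
for i in range(1, len(steps)):
    gain = success[i] - success[i - 1]                    # percentage points gained
    per_10k = gain / (steps[i] - steps[i - 1]) * 10_000   # gain per 10K added steps
    gains.append((gain, per_10k))
    print(f"{steps[i - 1]:>6} -> {steps[i]:>6} steps: "
          f"+{gain:.1f} points ({per_10k:.1f} per 10K steps)")
```

Under these figures the gain per 10K steps falls from 50.0 points to about 5.8 and then about 2.1, so more data keeps helping but with diminishing returns.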

Scaling Application Diversity

Under a fixed training budget, researchers compared environments derived from:

5 apps
10 apps
20 apps
34 apps

Performance improved consistently across all major benchmarks.

The findings suggest that increasing application diversity remains a powerful source of agent improvement.
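
To make the "fixed training budget" setup concrete, the sketch below draws the same number of steps while varying how many apps they come from. Everything here is hypothetical: the toy pool of 34 apps with 100 steps each, the budget of 60, the even split across apps, and the `sample_fixed_budget` helper. The source does not state how the researchers allocated the budget.

```python
import random

# Toy pool: 34 hypothetical apps with 100 recorded steps each.
steps_by_app = {
    f"app_{i:02d}": [(f"app_{i:02d}", j) for j in range(100)]
    for i in range(34)
}

def sample_fixed_budget(pool, num_apps, budget, seed=0):
    """Pick num_apps apps, then draw budget steps spread as evenly as possible."""
    rng = random.Random(seed)
    apps = rng.sample(sorted(pool), num_apps)
    base, extra = divmod(budget, num_apps)
    sampled = []
    for i, app in enumerate(apps):
        # The first `extra` apps receive one additional step each.
        quota = base + (1 if i < extra else 0)
        sampled.extend(rng.sample(pool[app], quota))
    return sampled

for n in (5, 10, 20, 34):
    batch = sample_fixed_budget(steps_by_app, n, budget=60)
    print(n, "apps ->", len(batch), "steps")
```

Holding the total step count constant isolates the effect of diversity: any change in performance comes from spreading the same amount of data over more apps, not from adding data.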

🔮 The Future of Mobile Agent Training

The development of Mobile Agents is gradually shifting focus.

The primary question is no longer:

Can the model click buttons and navigate interfaces?

Instead, the emerging challenge is:

Can the model train inside a sufficiently realistic world?

Real applications provide realism but are difficult to reset, verify, and scale.

Purely synthetic applications are easy to generate but often lack behavioral fidelity.

PhoneWorld proposes a third path.

By reconstructing page structures, navigation flows, state transitions, and task objectives from real-world interaction trajectories, it creates environments that are both practical for large-scale training and meaningfully connected to real mobile usage.

🎯 Conclusion

PhoneWorld addresses one of the most important challenges facing Mobile Agents: environment scaling.

The project demonstrates that realistic training environments can be systematically generated from real application interactions while preserving the properties required for reproducible research:

Executability
Verifiability
Resettability
Scalability

More importantly, PhoneWorld suggests a broader shift in how the field may evolve.

As foundation models continue improving, future progress may depend less on increasing model size and more on expanding the quality and diversity of the worlds those models can interact with.

In the emerging AI phone era, stronger models are inevitable.

What may ultimately determine their capabilities is whether we can build enough realistic, interactive, and verifiable environments for them to learn from.
