Intent-Driven Code Generation for Android Application Testing Using Large Language Models

Master's Thesis in Computer Science and Engineering

Ali Gholamhosseinpour
Xiaoran Zhang

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

© Ali Gholamhosseinpour & Xiaoran Zhang, 2025.

Supervisor: Yinan Yu, Department of Computer Science and Engineering
Advisor: Dhasarathy Parthasarathy, Volvo Group Truck Technology
Examiner: Christian Berger, Department of Computer Science and Engineering

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Modern Android interfaces evolve rapidly, and conventional UI test automation struggles to keep pace with this change. This thesis presents an intent-driven framework that leverages large language models (LLMs) in combination with multi-modal UI representations to translate natural-language testing goals into executable Android tests. While inspired by crawler-based exploration, the framework adopts a modular architecture that separates planning, selection, execution, and observation stages. It incorporates memory for state tracking and includes an evaluator–optimizer loop to refine LLM outputs dynamically during execution. A hybrid screen representation—combining XML hierarchies and screenshots—enables the system to reason over both structural and visual elements of the UI, while a Python-based control layer drives actions on physical devices.

The framework is evaluated on three production-grade Volvo Group applications (Alarm Clock, System Settings, and Load Indicator). Across 45 reference scenarios, the generated tests achieve a 60% aggregate pass rate (compared to 87% for manual tests), reach up to 88% functional correctness, and reduce the amount of written code by as much as 70% compared to manually implemented baselines. Ablation studies show that visual input in addition to XML consistently supports task success and rarely confuses the model, contributing to improved reasoning across a wide range of UI challenges. XML remains valuable for precise element localization, especially where structural anchors are critical. A reasoning analysis over 42 planner steps yields an average score of 4.3 out of 5 for correctness, indicating strong semantic alignment between global testing goals and selected local actions. The framework exhibits weaknesses in dynamic screens, complex seekbar interactions, and backend-dependent states, where test reliability remains limited.
This work contributes a modular LLM-based system for intent-driven UI testing, empirical evidence of its effectiveness and conciseness on industrial applications without model fine-tuning, and practical design guidelines for future intelligent testing tools, including prompt structures, tool invocation patterns, and memory-based tracking heuristics.

Overall, the study shows that combining multi-modal LLM reasoning with structured UI representations advances automated mobile testing toward more adaptive, maintainable, and goal-aligned workflows.

Keywords: Android UI Testing, Large Language Models, Intent-Driven Code Generation, Automated Software Testing, Multi-Modal Models, Test Script Generation, Semantic Reasoning

Acknowledgements

We would like to thank our academic supervisor Yinan Yu for their guidance and support throughout the thesis. We would also like to thank our advisor Dhasarathy Parthasarathy and colleagues at Volvo Group Truck Technology for their input and collaboration. Finally, we would like to thank our examiner Christian Berger for their constructive feedback.

Ali Gholamhosseinpour & Xiaoran Zhang, Gothenburg, June 2025

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Scope
  1.5 Structure of the Thesis
2 Background
  2.1 Automated Android UI Testing
  2.2 Android User Interface Structure
    2.2.1 View Hierarchy
    2.2.2 Dynamic Behavior of Android Applications
  2.3 Controlling Android Applications
    2.3.1 UI Automator: Accessibility-Based Interaction
    2.3.2 Espresso and Instrumentation-Based Interaction
    2.3.3 Black-Box vs. White-Box Control Methods
  2.4 Fundamentals of Black-Box GUI Testing
    2.4.1 What is Black-Box Testing in Mobile Context
    2.4.2 Limitations of Accessibility-Based Control
  2.5 State Space Exploration
    2.5.1 Difficulties in Defining and Detecting App States
    2.5.2 State Explosion
    2.5.3 Model-Based Approaches
  2.6 Tools and Frameworks Overview
    2.6.1 Python-Based Control Layers
    2.6.2 ATX Server and Remote Control
3 Related Work
  3.1 Automated UI Testing
  3.2 Practical Exploration Techniques
  3.3 Language Model Integration in Testing
  3.4 Multi-Modal LLMs for Vision-Enhanced UI Understanding
  3.5 Prompting Strategies and Evaluation for Multi-Modal Models
  3.6 Recent Advances in GUI Agents for Automated UI Testing
    3.6.1 Gap in Intent-Based Android Test Generation
4 Methods
  4.1 Overview of Methodological Approach
  4.2 Justification for Design Science
  4.3 DSR and Research Questions
  4.4 App Selection for the Evaluation
  4.5 Exploratory Work: Crawler-Based Initial Implementation
    4.5.1 Overview
    4.5.2 System Architecture
    4.5.3 UI State Representation and Abstraction
    4.5.4 State Identification and Hashing
    4.5.5 Interaction and Exploration Strategies
    4.5.6 Navigation Graph Construction
    4.5.7 Intent Resolution and Semantic Annotation
    4.5.8 Crawling Algorithm
  4.6 LLM-Based Crawler
    4.6.1 Motivation
    4.6.2 Screen Representation
    4.6.3 State Tracking
    4.6.4 Exploration Procedure
  4.7 Artifact Design and Implementation
  4.8 System Architecture and Implementation Details
    4.8.1 Modules using LLMs
    4.8.2 Planning Module
    4.8.3 Selection Module
    4.8.4 Execution Module
    4.8.5 Observation Module
    4.8.6 Memory
    4.8.7 Evaluator–Optimizer Workflow
    4.8.8 Assertion Module
    4.8.9 Network Sniffer Integration
  4.9 RQ1 Evaluation: Code-Level Comparison Between Generated and Manual Tests
    4.9.1 Line-Level Correctness
    4.9.2 Unnecessary Steps
    4.9.3 Flakiness
    4.9.4 Robustness
    4.9.5 Readability
  4.10 RQ2 Evaluation: Semantic Understanding and Planner Evaluation
  4.11 RQ3 Evaluation: Prompting Strategies and Multi-Modal Effectiveness
    4.11.1 Experimental Design
    4.11.2 Evaluation Metrics
    4.11.3 Limitations
5 Results
  5.1 Test Selection and Comparison Overview
    5.1.1 Example Manual vs. Automated Test Cases
  5.2 RQ1 Results: Code-Level Comparison Between Generated and Manual Tests
  5.3 RQ2 Results: Semantic Understanding and Planning Module Evaluation
  5.4 RQ3 Results: Prompting Strategies and Multi-Modal Effectiveness
    5.4.1 click_xy: Spatial Localization Accuracy
    5.4.2 click_id: Semantic Targeting via ID Retrieval
    5.4.3 get_count: Object Enumeration
    5.4.4 instance: UI Component Classification
    5.4.5 get_text: UI Text Retrieval
    5.4.6 seekbar: Continuous Value Estimation
6 Discussion
  6.1 Discussion of RQ1: Code-Level Comparison Insights
    6.1.1 Correctness
    6.1.2 Unnecessary Steps
    6.1.3 Flakiness
    6.1.4 Robustness
    6.1.5 Readability
  6.2 Discussion of RQ2: Semantic Understanding and Planning Module Interpretation
    6.2.1 Reasoning Rubric
    6.2.2 Correctness and Semantic Alignment
    6.2.3 App-Specific Reasoning Quality
    6.2.4 Common Error Patterns and Limitations
  6.3 Discussion of RQ3: Prompting Strategies and Multi-Modal Effectiveness
    6.3.1 click_xy: Spatial Localization Accuracy
    6.3.2 click_id: Semantic Targeting via ID Retrieval
    6.3.3 get_count: Object Enumeration
    6.3.4 instance: UI Component Classification
    6.3.5 get_text: UI Text Retrieval
    6.3.6 seekbar: Continuous Value Estimation
  6.4 Threats to Validity
    6.4.1 Internal Validity
    6.4.2 External Validity
  6.5 Overall Framework Performance
7 Conclusions and Future Work
Bibliography
A Appendix
  A.1 Estimating the number of distinct states
    A.1.1 Notation
    A.1.2 Chao1 Lower Bound Estimator
    A.1.3 Lincoln-Petersen Capture-Recapture
    A.1.4 Good-Turing Missing Mass Adjustment
    A.1.5 Branching-Aware Bayesian Augmentation
    A.1.6 Combined Point Estimate and Bounds
    A.1.7 Information-Theoretic Sample Complexity
    A.1.8 Practical Limitations
  A.2 Future Validation Strategy

List of Figures

2.1 Illustration of state space complexity across applications. As the number of UI states and transitions increases, filtering techniques begin to break down. In highly dynamic apps, the state space becomes too dense and entangled to resolve effectively.
4.1 Main page of the Alarm application. Additional views can be found in Appendix A.1.
4.2 System main menu view. Additional images are included in Appendix A.2.
4.3 Main page of the Load Indicator app. Additional examples are available in Appendix A.3.
4.4 Navigation graph from a crawling session of an Android application.
4.5 An example of semantically augmented nodes and edges from Figure 4.4, where a settings button – highlighted by a transparent circle – is clicked in Node 0, which is connected to Node 1 via Edge 0.
4.6 Overview of the system's workflow for intent-driven test generation.
5.1 Number of click_xy predictions that fall inside the annotated bounding boxes.
5.2 L2 distance distributions for click_xy predictions, split by bounding box inclusion.
5.3 Manhattan distance distributions for click_xy predictions, stratified by bounding box inclusion.
5.4 Predicted click points for the task "set the fourth alarm to 2:13 AM." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth center of the target button, while the dashed red border indicates the bounding box of that button.
5.5 Predicted click points for the task "calibrate the third axle to 5 tons." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth center of the target button, while the dashed red border indicates the bounding box of that button.
5.6 Exact match counts for the click_id task. Applicable to XML modes only.
5.7 Edit distance histogram for click_id predictions in XML-capable modes.
5.8 Exact match counts for get_count, where the model must return the number of queried items.
5.9 Exact match counts for instance, where the model identifies the correct occurrence of duplicated UI elements.
5.10 Exact match counts for the get_text task.
5.11 Edit distance distributions for the get_text task.
5.12 L2 distance histogram for seekbar predictions.
5.13 Manhattan distance histogram for seekbar predictions.
5.14 Predicted click points for the task "set the 8kHz band to -9dB." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth target position on the 8kHz seekbar.
5.15 Predicted click points for the task "set media volume to 60%." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth target location on the media seekbar.
A.1 Alarm Clock App Screenshots
A.2 System Settings App Screenshots
A.3 Load Indicator App Screenshots

List of Tables

3.1 Comparison of GUI agents on test code generation, intent input, multi-modal grounding, and assertion synthesis.
4.1 Robustness Rating Scale
4.2 Readability Rating Scale
4.3 Reasoning Quality Rubric
4.4 Modality configurations used for evaluation
4.5 Evaluation metrics by field, showing input types, metric used, and the output type of each metric.
5.1 Test Result Comparison: Manual (M) vs Automated (A); the manual test pass rate is 87% and the automated test pass rate is 60%.
5.2 Code-Level Comparison Between Manual (M) and LLM-Generated (A) Tests
5.3 RQ2 Quantitative Step Correctness and Qualitative Reasoning Assessment for three different Volvo apps
5.4 Average Reasoning Score by Test ID and Equivalence Case
5.5 Step-by-step evaluation of local intents in an equivalence case. Scores range from 1 (poor) to 5 (excellent).
5.6 Step-by-step evaluation of local intents in a non-equivalence case. Scores range from 1 (poor) to 5 (excellent).
5.7 Correlation coefficients between equivalence label and average reasoning score.
5.8 Reasoning Examples by Global Intent, Local Intent, Local Intent Correctness (LIC), Reasoning, and Reasoning Score

1 Introduction

Modern Android applications are both feature-rich and highly dynamic, making reliable UI testing indispensable yet increasingly difficult. Conventional automated testing methods, including random event generators and pre-scripted workflows, often fall short in two major ways:

1. They fail to adapt when app interfaces evolve.
2. They lack the ability to align test execution with the developer's high-level intentions.

At the same time, recent advances in artificial intelligence, particularly large language models (LLMs), open a promising path forward. LLMs can interpret natural-language instructions and emit executable code, suggesting that they might bridge the gap between what developers mean and what automated tests actually do. This thesis explores that possibility by proposing and evaluating an intent-driven framework for Android UI test generation. The framework couples LLM-based reasoning with visual and structural exploration of the app, then plans and executes actions iteratively, adapting to observed outcomes instead of producing static scripts.

1.1 Motivation

The motivation for this work stems from a persistent gap in automated testing: while existing tools such as Espresso or UI Automator offer reliable and efficient execution, they operate at a low level and require developers to manually script view-specific interactions, leaving the actual testing intent implicit [5]. More recent LLM-based agents like DroidAgent attempt to generate high-level scenarios automatically, but rely solely on structural metadata and omit visual information entirely, which limits their applicability in modern, dynamic interfaces [40].
In this work, we explore whether combining LLMs with image-augmented, structured, context-aware crawling can support a more adaptive testing strategy: one that engages with semantic intent, handles interface variability, and generates test cases aligned with realistic usage goals.

1.2 Objectives

This research is guided by three main objectives:

• Design. Create an LLM-based framework that can interpret natural-language testing intents and translate them into Android UI test scripts.
• Implementation. Build a modular system that integrates planning, execution, and evaluation components, enabling adaptive test generation and exploration.
• Evaluation. Assess the system's performance across real-world industrial apps, focusing on its effectiveness, semantic alignment, robustness, and reasoning capabilities.

1.3 Contributions

This thesis makes the following contributions to the field of automated software testing:

• It introduces a novel architecture that unites intent understanding, dynamic app exploration, and automated test synthesis.
• It provides an empirical assessment on production-grade apps, reporting code-level correctness, reasoning scores, and prompt-engineering insights.
• It offers practical insights and design principles that can inform future research and development of automated UI testing tools.

By closing the gap between human intent and machine verification, the work pushes Android UI testing toward more scalable, adaptable, and semantically meaningful solutions.

1.4 Scope

This thesis focuses on industrial Android applications deployed on an embedded in-vehicle infotainment platform used in heavy-duty trucks. The platform runs a landscape-oriented display and restricts third-party services for safety. All experiments are carried out on a physical Android rig controlled via uiautomator2 and the Android Debug Bridge (ADB); the reasoning modules invoke GPT-4o without any fine-tuning. Consequently, the empirical findings are most valid for:

• Android apps whose UI elements expose accessibility metadata (XML hierarchy) and maintain relatively stable visual layouts.
• Test scenarios defined by high-level user intents rather than pixel-perfect regression checks.

To capture a realistic spread of UI patterns, we evaluate three proprietary truck apps—Alarm Clock (largely static), System Settings (scroll-intensive, many nested elements), and Load Indicator (dynamic, list-driven with pop-ups). Together, they cover static, scroll-heavy, and highly dynamic interaction styles. Findings should not be extrapolated to (i) apps dominated by custom canvas rendering or ViewGroups lacking accessibility hooks, (ii) platforms that require fully offline or on-premise inference, or (iii) test suites that hinge on strict backend data or timing constraints.

1.5 Structure of the Thesis

The remainder of this thesis is organized as follows.

• Chapter 2 – Background introduces the technical foundations of Android UI testing, accessibility hooks, and multi-modal LLM reasoning.
• Chapter 3 – Related Work surveys prior research on automated GUI exploration, intent-driven agents, and vision-language models to position our contribution.
• Chapter 4 – Methods details the design-science methodology, overall architecture, and experimental setup.
• Chapter 5 – Results presents empirical findings from three industrial truck applications, contrasting LLM-generated tests with manual baselines.
• Chapter 6 – Discussion interprets the results, analyses limitations, and outlines practical implications.
• Chapter 7 – Conclusions and Future Work summarizes the contributions and proposes future research directions.
• The Appendices supply supplementary figures, tables, and implementation details.

2 Background

2.1 Automated Android UI Testing

Modern mobile apps need to be tested thoroughly to ensure correct behavior and a good user experience. Given the fast growth of the mobile app market and the wide variety of Android devices, automating the testing of apps' graphical user interfaces (GUIs) has become essential. Researchers and practitioners have developed numerous techniques to generate inputs and explore app behaviors automatically [13]. Automated Android GUI testing aims to simulate user interactions such as taps, gestures, or text entry and verify app responses without manual effort, which improves test efficiency and consistency.

2.2 Android User Interface Structure

2.2.1 View Hierarchy

Android UIs are structured as a hierarchical tree of UI components. Each visual widget is a View, and container elements, which are subclasses of ViewGroup, can hold child views or other containers, forming a nested layout structure [6]. Developers typically declare the UI in XML layout files, where a single root element, which can be a View or ViewGroup, contains nested elements defining the interface [6]. At runtime, the Android framework inflates this XML into the corresponding View objects, preserving the parent-child relationships. This view hierarchy is the basis for rendering the UI and is accessible to testing frameworks via instrumentation or accessibility APIs.
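To make this structure concrete, the sketch below parses a small excerpt of the kind of XML hierarchy that Android's accessibility layer exposes (here modeled on the uiautomator dump format) and prints its parent-child relationships. The fragment, package name, and attribute values are illustrative assumptions, not taken from the apps studied in this thesis.

```python
import xml.etree.ElementTree as ET

# A simplified excerpt of a `uiautomator dump` hierarchy (illustrative only).
HIERARCHY = """
<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" resource-id="" bounds="[0,0][1280,800]">
    <node class="android.widget.LinearLayout" resource-id="" bounds="[0,0][1280,800]">
      <node class="android.widget.Button" resource-id="com.example:id/settings_button"
            text="Settings" bounds="[32,48][160,112]"/>
      <node class="android.widget.TextView" resource-id="com.example:id/title"
            text="Alarms" bounds="[200,48][400,112]"/>
    </node>
  </node>
</hierarchy>
"""

def walk(node, depth=0):
    """Recursively print the parent-child structure of the view tree."""
    cls = node.get("class", "?")
    rid = node.get("resource-id") or "-"
    text = node.get("text") or "-"
    print(f"{'  ' * depth}{cls}  id={rid}  text={text}")
    for child in node:
        walk(child, depth + 1)

root = ET.fromstring(HIERARCHY)
for top in root:
    walk(top)
```

The nesting visible in the printout mirrors exactly the parent-child relationships that testing frameworks navigate when locating widgets.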
2.2.2 Dynamic Behavior of Android Applications

Android applications are event-driven and dynamic in nature. The UI presented to the user can change over time in response to user inputs or background events. For example, an app may start with a login screen and, upon successful login, dynamically load a new screen or update portions of the current screen. Many modern Android apps construct or modify their UI at runtime, for example by adding list items from a web service or switching fragments within an Activity¹, so the set of UI states is not fixed at compile time [10].

In essence, an Android app can be modeled as a series of GUI states and transitions. Each screen state is defined by the current view hierarchy and content, and user or system events trigger transitions to other states [29]. This dynamic, state-dependent behavior poses challenges for automated testing, as tools must recognize when the app has reached a new state and handle the potentially large space of possible states.

¹An Activity in Android represents a single screen with a user interface. There is no universal standard for how activities are used; developers decide their structure. Many apps use just one activity, relying on fragments or navigation components for UI transitions.

2.3 Controlling Android Applications

2.3.1 UI Automator: Accessibility-Based Interaction

UI Automator is an Android testing framework that allows automated control of apps from outside the app's process, relying on Android's accessibility interface. It can interact with visible UI elements across different applications or system UI, using properties like the displayed text or content description to locate widgets [7]. UI Automator tests run as a separate instrumentation process and can simulate user actions such as clicks, swipes, or text entry on target apps without needing internal knowledge of the app's code. Notably, UI Automator is well suited for scenarios such as navigating the device UI or testing flows that span multiple apps. For example, it can open the Settings app from within a test [7]. However, because it operates via the accessibility service, it may require the UI elements to be accessible and include identifiable text or resource IDs to reliably find them.

2.3.2 Espresso and Instrumentation-Based Interaction

Espresso is a popular Android UI testing framework written in Java and Kotlin. It operates within the app under test using instrumentation and runs in the same process as the app, allowing it to directly call UI framework methods and inspect the UI hierarchy from the inside. Espresso uses a declarative, concise API where testers specify view matchers to find UI elements by ID, text, or other properties, perform actions such as click or scroll, and make assertions on view state [5]. A key feature of Espresso is its built-in synchronization: it automatically waits for the app's main UI thread to be idle before performing actions, which greatly reduces flaky tests caused by timing issues [4]. Because Espresso has access to the app's internal structure through the instrumentation API, tests can use stable identifiers such as resource IDs to locate widgets and can even utilize knowledge of the app's internals. For example, tests can use custom view matchers or access model data if exposed, which blurs the line between black-box and white-box testing [5]. The framework is designed for single-app testing and cannot directly interact with other apps on the device; it focuses on fast and reliable testing of the target app's UI. Espresso is primarily a Java or Kotlin framework and is tightly coupled with the Android SDK and tooling.

2.3.3 Black-Box vs. White-Box Control Methods

Android supports both black-box and white-box approaches to UI testing, each with distinct advantages. UI Automator exemplifies a black-box method: the test treats the app as an opaque entity, interacting only through the UI exposed to the user and not relying on any internal implementation details [7]. This has the benefit of not requiring access to the app's code and enabling cross-app interactions, but it can be limited by what information is available through the accessibility API. In contrast, Espresso represents a white-box (or at least gray-box) approach: it leverages instrumentation to run within the app process, giving tests insight into the app's structure and life-cycle. White-box methods allow more precise and efficient operations – for instance, they can avoid waiting arbitrarily by knowing exactly when the app is idle, and they often produce more stable tests due to synchronization and direct access to UI components [4]. The trade-off is that white-box tests typically require the app under test to be instrumented or built with test support, and they cannot easily exercise functionality outside the target app's scope. In practice, developers often combine these approaches: using Espresso for in-app UI flows and falling back to UI Automator for scenarios that require system UI or multiple apps.

2.4 Fundamentals of Black-Box GUI Testing

2.4.1 What is Black-Box Testing in Mobile Context

Black-box testing is a software testing methodology in which the tester evaluates an application solely through its external interface, with no knowledge of the internal code or implementation [23].
In the mobile app context, black-box testing typically means interacting with the app as an end-user would: sending touch events, entering text, and observing the resulting screen outputs or behaviors. The tester does not rely on internal variables or methods, instead verifying that, given an input (e.g., a button press), the app produces an expected output (e.g., a new screen or a message), according to requirements. Mobile GUI testing under this paradigm treats the app as a closed box where only the GUI is available for interaction and verification. This approach aligns well with functional testing from a user's perspective and is often used when source code access is limited or when testing the app in a production-like environment.

2.4.2 Limitations of Accessibility-Based Control

Although black-box GUI testing using accessibility frameworks such as UI Automator or Appium is powerful, it carries inherent limitations.² One issue is that not all UI elements may be easily accessible or uniquely identifiable through the accessibility API. If developers have not set content descriptions, or if multiple elements share the same text, a test might struggle to distinguish them. The information retrieved via accessibility can sometimes be incomplete or inconsistent. For example, custom UI components might appear with generic class names or missing attributes, making reliable identification difficult [4].

Furthermore, because the test is external to the app, it lacks built-in synchronization cues [7]. The testing tool might need to poll or wait for UI elements to appear, which can lead to fragile tests if timing is off. This lack of direct insight can cause flakiness, where tests fail intermittently. For instance, waiting for a loading spinner to disappear may require an arbitrary sleep, as there is no direct event signal.

Performance is another consideration. Interacting via the accessibility layer can be slower than using direct method calls, since each action involves cross-process communication. In addition, pure black-box control cannot easily invoke certain app behaviors that are not exposed through the UI. For example, it may be impossible to trigger an internal function if there is no UI element linked to it. These limitations mean that although accessibility-based black-box testing is broadly applicable, it may require careful design of test logic and is sometimes complemented by instrumentation to achieve more complex validation.

²https://appium.io/docs/en/2.0/ – Appium is an open-source automation framework that supports multiple platforms and backends. It uses different automation engines depending on the target environment, such as UIAutomator for Android and XCUITest for iOS.
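As a minimal illustration of how such timing issues are typically mitigated, the snippet below replaces an arbitrary sleep with explicit polling through uiautomator2's wait helpers. The package name and selector texts are hypothetical; only the wait and wait_gone calls are standard library API.

```python
import uiautomator2 as u2

d = u2.connect()  # connect to the device attached via ADB

# Launch the app under test; the package name is a placeholder.
d.app_start("com.example.loadindicator")

# Wait (up to 10 s) for the loading spinner to disappear ...
d(resourceId="com.example.loadindicator:id/spinner").wait_gone(timeout=10.0)

# ... then wait for the expected content to appear before interacting.
if d(text="Axle overview").wait(timeout=10.0):
    d(text="Axle overview").click()
else:
    raise AssertionError("Screen did not stabilize within the timeout")
```

Polling with explicit timeouts does not eliminate flakiness, but it fails with a clear signal instead of acting on a half-rendered screen.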
2.5 State Space Exploration

2.5.1 Difficulties in Defining and Detecting App States

When performing automated exploration of a GUI – for example, in model-based testing or when crawling an app – one fundamental question is how to define a "state" of the app. Intuitively, a state can be thought of as a distinct screen configuration of the app: a unique arrangement of views and content that the user sees at a given moment. However, in practice, deciding what constitutes a new state is non-trivial. Many Android apps have content that updates or changes within the same screen or activity. For instance, a news feed may load new items over time. Should two instances of the feed screen with different data be treated as the same state or different states? If the criterion is too strict, treating any difference in content as a new state, the number of states explodes. If it is too lax, overlooking meaningful differences, the testing process might treat distinct scenarios as one state and miss coverage.

Detecting states automatically adds to the challenge. The testing tool must infer from the UI hierarchy and its properties whether the app has transitioned into a state not seen before. This involves comparing the current UI structure to previously observed ones and deciding if it is equivalent to any prior state. The presence of dynamically generated identifiers, animations, or nondeterministic content such as timestamps can make such comparisons difficult, as the UI may never be exactly identical between two runs.

Prior research highlights these difficulties. For example, defining each Android Activity as one state is often too coarse to capture dynamic UI changes. On the other hand, a very fine-grained view would flag even minor UI updates as separate states. For instance, if a screen displays a timestamp, the same screen revisited one second later could be treated as a different state solely due to the updated time [10]. As a result, automated testing tools need robust heuristics or definitions for GUI states to effectively navigate an app's UI without redundancy or omission.

To manage this trade-off, testing tools can apply filtering strategies during state comparison. A filtered state is one where specific UI elements or properties are deliberately ignored to reduce sensitivity to irrelevant changes. For example, testers may exclude dynamic elements such as timestamps, ads, or notification badges from the comparison process. This allows the tool to focus on structural or semantically meaningful differences, reducing redundant states while still preserving important transitions in the app's behavior.
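A minimal sketch of such a filtered comparison is shown below: the state signature hashes only attributes assumed to be stable, so two hierarchies that differ only in volatile text collapse to the same state. The choice of ignored fields is an illustrative assumption, not the exact policy of any particular tool.

```python
import hashlib
import xml.etree.ElementTree as ET

# Attributes assumed stable enough to define a state; free-form text
# (timestamps, counters, badges) is deliberately excluded.
STABLE_ATTRS = ("class", "resource-id", "clickable", "scrollable")

def state_signature(hierarchy_xml: str) -> str:
    """Hash a filtered view of the UI hierarchy to identify a state."""
    root = ET.fromstring(hierarchy_xml)
    parts = []
    for node in root.iter("node"):
        parts.append("|".join(node.get(a, "") for a in STABLE_ATTRS))
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()[:16]

# Two dumps differing only in a timestamp TextView collapse to one state.
a = '<hierarchy><node class="TextView" resource-id="id/clock" text="12:00" clickable="false" scrollable="false"/></hierarchy>'
b = '<hierarchy><node class="TextView" resource-id="id/clock" text="12:01" clickable="false" scrollable="false"/></hierarchy>'
assert state_signature(a) == state_signature(b)
```

Widening or narrowing STABLE_ATTRS is exactly the abstraction-level dial discussed next: each added attribute makes the comparison stricter and the state space larger.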
2.5.2 State Explosion

The term "state explosion" refers to the rapid growth of the number of possible states as the system under test becomes more complex. In the context of Android GUI testing, state explosion can occur if the testing framework distinguishes states based on very granular UI differences. For example, imagine a calculator app that updates its display for each new digit entered: if the tester treats each distinct number on the screen as a separate state, the state space would be practically infinite. This extreme sensitivity leads to an explosion of states, overwhelming the testing process with redundant or trivial variations [10]. The opposite problem, over-abstraction, happens when states that are meaningfully different get conflated into one because the chosen abstraction cannot tell them apart. For instance, two screens in a shopping app showing different products might both be represented by the same high-level "ProductPage" state if the criterion only considers the screen's view class, thereby losing information about which product is shown.

[Figure 2.1: Illustration of state space complexity across applications. As the number of UI states and transitions increases, filtering techniques begin to break down. In highly dynamic apps, the state space becomes too dense and entangled to resolve effectively.]

Both extremes are problematic: state explosion wastes resources and time, while merging distinct states risks missing bugs that only occur in specific contexts. Achieving the right balance is a key challenge. Researchers have proposed solutions such as defining multi-level GUI state abstractions to tune the granularity. Baek and Bae (2016) introduced GUI Comparison Criteria (GUICC) at multiple levels of abstraction to decide state equivalence, finding that intermediate levels of abstraction can significantly reduce state explosion while still distinguishing important differences between states [10]. A conceptual overview of this trade-off is shown in Figure 2.1, where state space growth ranges from manageable to unresolvable, illustrating the need for careful abstraction strategies.

Nevertheless, some inherent trade-off remains, and no single criterion works best for all apps; testers often must choose or adjust the state definition based on the app's characteristics. For example, in apps with highly dynamic content or complex interactive components, it may be beneficial to ignore certain regions of the UI or apply targeted filtering strategies to treat volatile areas differently. In contrast, for simpler or more static apps, a stricter comparison policy may be feasible without leading to state explosion. One possible heuristic is to identify widgets that commonly host dynamic content, such as scrolling lists or repeated items. These include components like RecyclerView³, which can be flagged during exploration and handled using relaxed comparison rules to avoid over-fragmenting the state space.

³https://developer.android.com/guide/topics/ui/layout/recyclerview – RecyclerView is a flexible Android widget designed for displaying dynamic content efficiently by reusing views. It is commonly used to manage large or frequently changing data sets.

2.5.3 Model-Based Approaches

Model-based GUI testing techniques construct an abstract model, often a state machine or graph, of the app's possible states and transitions, then generate and execute test cases to cover that model. One notable model-based approach for Android is the use of GUI Comparison Criteria, or GUICC, as proposed by Baek and Bae [10]. In their framework, the app is represented as a GUI graph where nodes correspond to distinct GUI states and edges correspond to events, such as user actions or system events, that trigger transitions. GUICC defines what information is used to determine whether two GUI states are considered equivalent. For example, a simple criterion might be the Activity name, treating all screens in the same Activity as one state, while a more detailed criterion might include the presence and properties of certain key UI elements. GUICC includes a multi-level design that allows the tester to toggle between different abstraction levels for state comparison, ranging from coarse-grained to fine-grained [10]. Their empirical results showed that using a multi-level approach can improve exploration effectiveness: compared to a single fixed criterion, it achieved higher code coverage and reduced state explosion by merging redundant states that differ only in incidental details.

However, model-based approaches like this also have limitations. Defining the right abstraction level often requires insight into the app's behavior, and an inappropriate criterion could still either miss behaviors or cause overload. Moreover, building and maintaining a GUI model can be computationally expensive for very complex apps, and there are diminishing returns if the model grows too large to be fully explored. GUICC, while mitigating some issues, does not entirely solve the challenge that some app behaviors, especially those depending on unseen data or timing, might not be captured purely by the GUI state abstraction. Additionally, implementing such an approach in practice might require instrumenting the app or using custom tooling to extract the UI structure at runtime, which can be complex. In summary, GUICC represents a significant step toward controlling state explosion through smarter state equivalence criteria. However, it also illustrates that modeling dynamic GUIs remains a hard problem, often requiring careful tuning and still subject to the fundamental trade-offs of abstraction.
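As an illustration of the kind of model these approaches maintain, the sketch below keeps a GUI graph whose nodes are state identifiers (for example, signatures like the one computed in the previous listing) and whose edges are event labels. It is a simplified sketch of the general idea, not GUICC's implementation; all names and events are hypothetical.

```python
from collections import defaultdict

class GuiGraph:
    """Minimal GUI graph: nodes are state signatures, edges are events."""

    def __init__(self):
        # state -> {event: next_state}
        self.edges = defaultdict(dict)

    def record(self, state: str, event: str, next_state: str) -> bool:
        """Record a transition; return True if it was previously unseen."""
        unseen = event not in self.edges[state]
        self.edges[state][event] = next_state
        return unseen

    def unexplored(self, state: str, candidate_events: list[str]) -> list[str]:
        """Events at `state` that the crawler has not yet tried."""
        return [e for e in candidate_events if e not in self.edges[state]]

# Hypothetical usage inside a crawl loop:
graph = GuiGraph()
graph.record("s0", "click:id/settings_button", "s1")
graph.record("s1", "back", "s0")
print(graph.unexplored("s1", ["back", "click:id/brightness"]))
# -> ['click:id/brightness']
```

The crawler's coverage question then reduces to a graph question: which recorded states still have untried events, and how cheaply can they be reached.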
2.6 Tools and Frameworks Overview

A wide range of frameworks have been developed for Android UI automation. Some, like Espresso and UI Automator, are officially supported by Android and operate via instrumentation and accessibility APIs, respectively. Others are external, language-agnostic frameworks that wrap underlying automation engines.

Appium is a widely used cross-platform automation framework that supports Android and iOS through the WebDriver protocol⁴. On Android, Appium supports backends such as UIAutomator and Espresso. It allows tests to be written in multiple languages (e.g., Python, Java, JavaScript), making it popular in multi-platform environments. However, we do not use Appium in this work due to the complexity of its infrastructure: setting up Appium involves launching a Node.js⁵ server, installing automation backends, and routing commands through several layers of abstraction.

Earlier Android-specific frameworks include Robotium⁶, which extends Android's instrumentation capabilities to simplify test writing in Java, and MonkeyRunner⁷, which enables basic automation via Python scripts. While historically important, these tools are now largely obsolete, offering less flexibility and lower compatibility with modern Android UIs.

⁴https://www.w3.org/TR/webdriver2/
⁵https://nodejs.org
⁶https://github.com/RobotiumTech/robotium
⁷https://developer.android.com/studio/test/monkeyrunner

2.6.1 Python-Based Control Layers

In addition to the official frameworks provided by Android, the testing community has developed external tools to facilitate automated UI testing. One such innovation is Python-based control layers for Android, exemplified by the uiautomator2⁸ library. These tools wrap the Android UI Automator functionality and expose it through a network API, allowing tests to be written in high-level languages like Python instead of Java. The typical architecture involves running a lightweight server component on the Android device that receives commands, such as HTTP or JSON-RPC requests, and executes them using the UI Automator API [33]. In the case of uiautomator2, a background service, sometimes called the ATX agent, is installed on the device. It leverages Android's accessibility bridges to perform actions and query UI state, and communicates with a Python client running on the tester's PC.

This setup brings several benefits. Test scripts can be developed and iterated quickly in Python, taking advantage of its rich ecosystem for generating inputs, logging, or integrating with test frameworks, without going through the Android build process for each change. It also simplifies cross-platform integration. For example, a Python-based test system could coordinate multiple devices or interact with backend services as part of end-to-end testing. Essentially, Python control layers act as an intermediary that translates high-level test logic into low-level UI Automator operations, making Android UI testing more accessible and flexible for testers comfortable with scripting languages.

⁸https://uiautomator2.readthedocs.io/en/latest/api.html

In the implementation described in this thesis, we focus on a minimal yet effective set of tools. The primary execution interface is uiautomator2, which provides programmatic access to Android's accessibility API. For communication with the device and issuing shell-level commands, we use the Android Debug Bridge (ADB)⁹. Additionally, to support manual inspection and interaction during development, we use a Java-based screen mirroring utility, scrcpy¹⁰.

⁹https://developer.android.com/tools/adb
¹⁰https://github.com/Genymobile/scrcpy
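The snippet below illustrates what this toolchain looks like from a test script's perspective, using standard uiautomator2 calls; the package name and resource IDs are placeholders, not identifiers from the apps under study.

```python
import uiautomator2 as u2

# Connect to the device attached over ADB (a serial can be passed explicitly).
d = u2.connect()

# Launch the app under test; the package name is a placeholder.
d.app_start("com.example.alarmclock")

# Locate widgets via accessibility properties and interact with them.
d(resourceId="com.example.alarmclock:id/add_alarm").click()
d(className="android.widget.EditText").set_text("07:30")

# Query the current UI state: the XML hierarchy and a screenshot are the
# same two representations the framework in this thesis feeds to the LLM.
hierarchy_xml = d.dump_hierarchy()
d.screenshot("current_screen.png")
```

Each call travels over ADB to the on-device agent, which executes it through the accessibility layer and returns the result to the Python process.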
2.6.2 ATX Server and Remote Control

The Android Testing XML (ATX) server, as used in frameworks like uiautomator2, refers to the on-device service enabling remote control of the UI. It is usually packaged as an Android application or instrumentation test that, when launched, starts a server listening on a certain port of the device, often forwarded to the host machine via ADB. Through this server, a remote client can send commands such as "find UI element with text X" or "click button with resource ID Y," which the server executes using the device's UI Automator APIs [33]. The approach is similar to how Appium works, where an Appium server controls devices via the WebDriver protocol. However, ATX is a more lightweight and direct mechanism tailored to UI Automator.

Remote interaction via such a server allows complex test scenarios. For example, a Python test script can invoke device UI actions, then verify some conditions by pulling data from a remote API or database, and then continue on the device—all in one flow. The ATX server handles the execution of each action and returns the result or any exception back to the client.

One challenge with remote interaction is maintaining synchronization and state. Since the control commands go over a network interface, the client must sometimes wait for confirmation that an action is complete or that a UI state has changed. The ATX framework often provides helper methods, such as waiting for a selector to match an element, to support this. Overall, the presence of a remote control server on the device turns the device into a web service for UI actions. This enables a form of black-box testing that can be orchestrated from virtually any environment, not just from inside the device or emulator.
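For intuition about the transport underneath such client libraries, the sketch below forwards the agent's port over ADB and issues a raw JSON-RPC request. Port 9008, the /jsonrpc/0 endpoint, and the deviceInfo method follow the conventional uiautomator/ATX setup, but they should be treated as assumptions that vary across agent versions; in practice the uiautomator2 client hides this layer entirely.

```python
import json
import subprocess
import urllib.request

# Forward the on-device server port to the host (9008 is the port
# conventionally used by the uiautomator/ATX agent; adjust if needed).
subprocess.run(["adb", "forward", "tcp:9008", "tcp:9008"], check=True)

# Issue a JSON-RPC call directly to the on-device server. The endpoint and
# method name are shown for illustration of the protocol shape only.
payload = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "deviceInfo",
    "params": [],
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:9008/jsonrpc/0",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(json.loads(resp.read()))
```

Seen this way, the device really is a web service for UI actions: any environment that can open a TCP connection can drive the tests.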
3 Related Work

3.1 Automated UI Testing

Automated UI testing has long been a focus of software engineering research due to its potential for reducing manual effort and improving reliability. However, traditional approaches face persistent challenges. Randomized input generation tools like Google's UI/Application Exerciser Monkey (commonly just Monkey) [31] use stochastic or heuristic-based methods to simulate user interactions. Monkey is a utility that sends a stream of random UI events to an app; it is often used for stress-testing and is fully black-box, though not systematic. This tool differs from heuristic-based generators like Humanoid by focusing purely on random input streams without any intelligent guidance. However, such approaches often fail to achieve comprehensive coverage, particularly in complex applications with dynamic UI elements.

Model-based testing frameworks, such as the one proposed by Amalfitano et al. [2], address this by constructing finite state machine (FSM) models (e.g., GUI graphs) to guide systematic exploration. Tools like GUICC use these models to generate test cases, but scalability remains a critical limitation [10]. Modern Android applications, with their dynamically loaded content and context-dependent behaviors, strain the ability of such methods to adapt in real time [2]. Another framework, Calabash¹, provided Cucumber-style tests in natural language for Android and iOS, though its development has slowed in favor of newer tools. There are also emerging tools focusing on specific aspects, such as Sapienz² (a search-based test input generator for Android developed at Facebook) and Stoat [35], which use stochastic or evolutionary algorithms to automatically explore app states. Cloud-based testing services, including Firebase Test Lab³ and BrowserStack⁴, often incorporate one or more of these frameworks to allow running automated GUI tests on many device models.

¹https://github.com/calabash/calabash-android
²https://engineering.fb.com/2018/05/02/developer-tools/sapienz-intelligent-automated-software-testing-at-scale/
³https://firebase.google.com/products/test-lab
⁴https://www.browserstack.com

In summary, beyond the primary Android Testing Support Library frameworks, there is a broad ecosystem: some tools prioritize ease of scripting (Appium, Python wrappers), others target intelligent exploration (automated crawlers and model-based tools), and each comes with its own set of trade-offs in terms of setup complexity, supported languages, and level of control. Test engineers choose among these based on project needs, sometimes combining them to leverage the strengths of each [15].

3.2 Practical Exploration Techniques

To bridge the gap between random exploration and intent-driven testing, lightweight tools like DroidBot [24] leverage UI metadata to guide input generation. DroidBot constructs a state-transition model by simulating user actions (clicks, swipes) without modifying the application under test. While effective for basic scenarios, its lack of semantic understanding—such as interpreting user intent or decoding visual elements—limits its ability to navigate complex UIs or adapt to evolving app states. Additionally, DroidBot's focus on structural exploration overlooks opportunities to incorporate natural language specifications or developer-provided goals.

3.3 Language Model Integration in Testing

Recent advancements in large language models (LLMs) have introduced novel approaches to test automation. For example, DroidAgent [40] integrates LLMs to generate targeted test scenarios, such as creating user accounts or executing multi-step workflows. While this demonstrates the feasibility of using natural language to guide testing, DroidAgent's reliance on textual app metadata (e.g., accessibility labels) restricts its ability to interpret rich visual UI contexts, leading to misaligned actions in visually dense interfaces. Similarly, CAT (Cost-effective UI Automation Testing) [16] combines retrieval-augmented generation (RAG) with LLMs to map high-level tasks (e.g., "book a flight") to UI elements.
However, CAT's dependency on predefined datasets and its specialization in single-app workflows (e.g., WeChat) limit generalizability across diverse applications [16].

3.4 Multi-Modal LLMs for Vision-Enhanced UI Understanding

Recent work on multi-modal large language models (MLLMs) shows promising ways to improve how systems understand user interfaces by combining visual input with language reasoning. One example is ScreenAI [9], a vision-language model from Google Research that builds on the PaLI architecture and uses a flexible patching method from pix2struct. ScreenAI is trained on a mix of datasets, including a special Screen Annotation task, allowing it to recognize UI element types, positions, and descriptions from screenshots. It reaches top results on tasks like WebSRC and MoTIF, showing how vision-language models can help with UI navigation, question answering, and summarization.

Another important example is Ferret-UI [41], a multi-modal LLM designed for better understanding of mobile UIs. Ferret-UI deals with the challenge of long or narrow screens and small objects by splitting screens into sub-images before sending them through the model. It is trained on both basic tasks (like icon or widget recognition) and more advanced tasks (like describing screen content or inferring function), and it outperforms many open-source UI MLLMs, even doing better than GPT-4V on some basic tests.

Finally, MobileVLM [39] focuses on both within-screen and across-screen understanding by adding extra pre-training steps designed specifically for mobile UIs. Using their Mobile3M dataset, which contains 3 million UI pages and real user transitions, MobileVLM beats general vision-language models on both their own and public mobile benchmarks.

Overall, these systems show how combining vision and language can improve UI understanding and interaction. However, their use in intent-driven Android UI testing, especially for generating test code based on developer instructions, is still not well explored. This thesis works to fill that gap by building a framework that connects visual UI understanding with semantic intent planning to create useful automated tests.

3.5 Prompting Strategies and Evaluation for Multi-Modal Models

Recent work has exposed the difficulty of instruction grounding in mobile and web interfaces. WinClick [21] introduces WinSpot, a Windows-GUI benchmark where element selection is evaluated by matching the predicted click point against human-annotated bounding boxes; their study shows that screenshot-only agents degrade on cluttered or visually repetitive layouts, which suggests that vision alone is brittle. SoEval [26] formalizes evaluation for structured outputs, arguing that schema-aware exact-match and field-level F1 are more informative than free-form text metrics; their benchmark targets JSON/XML answers and reflects the strict output format required in our study. This is especially relevant in UI interaction settings where a model must resolve a single valid JSON answer.

3DAxisPrompt [25] investigates visual prompt engineering for 3-D grounding in GPT-4o and demonstrates that injecting explicit geometric priors (axes, SAM masks) markedly improves localization across four 3-D datasets. While the paper does not quantify token overhead, the prompt variants it explores differ substantially in length, which motivates us to log token usage as a practical cost metric.
3.6 Recent Advances in GUI Agents for Automated UI Testing

The rise of foundation models and multi-modal learning has accelerated innovation in GUI agents, leading to systems that can autonomously interact with digital environments in ways previously limited to humans. This technology could provide new methods for automated UI testing.

One notable example is Operator by OpenAI (2025), a computer-using agent (CUA) that leverages GPT-4o's multi-modal capabilities and reinforcement learning to execute tasks directly on live websites, interpreting visual screenshots and performing human-like interactions without relying on APIs [30]. Complementing this, the Browser-Use framework (2025) offers an open-source platform that simplifies web interactions for AI agents through structured interfaces and lightweight automation hooks, supporting tasks from data scraping to complex workflows [14].

Further expanding this landscape, recent research has proposed advanced training strategies for GUI agents. ARPO (Agentic Replay Policy Optimization) [27] introduces an end-to-end reinforcement learning approach with task selection and replay buffer mechanisms to address sparse rewards and delayed feedback in GUI environments. SpiritSight [20], on the other hand, integrates a Universal Block Parsing method with a large-scale GUI dataset (GUI-Lasagne) to improve precision on dynamic, high-resolution visual inputs, strengthening the agent's grounding and decision-making.

Beyond reinforcement learning, hybrid planning architectures like Agent S [1] combine hierarchical planning and experience augmentation to improve generalizability across operating systems, while smartphone-focused frameworks like CoCo-Agent [28] use multi-modal inputs to achieve state-of-the-art performance on mobile automation benchmarks. Collectively, these systems represent a new generation of UI testing that overcomes key limitations in previous work:

• They operate across diverse environments without requiring predefined APIs or static metadata.
• They incorporate multi-modal (text + vision) reasoning to handle rich visual layouts.
• They apply advanced learning strategies (e.g., reinforcement learning, replay optimization) to improve long-horizon task execution.

Compared to prior LLM-integrated tools like DroidAgent or CAT, these recent approaches demonstrate significantly higher flexibility, generalization, and robustness, opening new opportunities for research on adaptive, vision-grounded, and intent-aligned UI testing.

3.6.1 Gap in Intent-Based Android Test Generation

Recent GUI agent research can be grouped into general-purpose systems for desktop and web, and mobile-specific agents. While these frameworks demonstrate interactive capabilities, none are designed to generate reusable Android test code with structured assertions derived from external intent.

General-purpose agents such as Operator [30], Browser-Use [14], Agent S [1], ARPO [27], and SpiritSight [20] focus on online interaction or visual understanding. These systems do not emit test code or assertion logic, and intent is either inferred implicitly or not modeled at all. Their outputs are transient and cannot be reused in validation or CI pipelines.

Mobile-specific agents include CoCo-Agent [28] and DroidAgent [40]. CoCo-Agent executes smartphone tasks using multi-modal inputs but is designed for task completion rather than test generation.
DroidAgent generates test code from natural language but does not support visual inputs and lacks structured assertion synthesis. Although DroidAgent already generates Android test code from natural-language intents, it relies exclusively on accessibility metadata (resource ID, content description) and ignores screenshots, making it unsuitable for the visually driven widgets that dominate our industrial apps. Moreover, the publicly released prototype is tightly coupled to a patched version of Android Instrumentation and hard-coded UI heuristics, so adding multi-modal input or evaluator–optimizer feedback would have required a near rewrite of the core planner and executor modules. Given these architectural constraints, designing a clean, modular framework from scratch proved both lower-risk and more extensible than adapting DroidAgent to our use case.

System            Code Output   Intent Input   Multi-modal   Assertions
Operator [30]     No            No             Yes           No
Browser-Use [14]  No            Partial        Partial       No
Agent S [1]       No            No             Yes           No
ARPO [27]         No            No             No            No
SpiritSight [20]  No            N/A            Yes           No
CoCo-Agent [28]   No            Fixed          Yes           No
DroidAgent [40]   Yes           Yes            No            No
Our Framework     Yes           Yes            Yes           Yes

Table 3.1: Comparison of GUI agents on test code generation, intent input, multi-modal grounding, and assertion synthesis.

As shown in Table 3.1, to the best of our knowledge, no existing system supports all four key requirements: accepting external intent, using multi-modal input, generating test code, and inserting structured assertions. Our framework attempts to integrate these capabilities. Its assertion module uses multi-hop reasoning over the GUI hierarchy and screenshots to synthesize context-sensitive checks, allowing for end-to-end generation of executable Android tests that are aligned with developer intent.

4 Methods

4.1 Overview of Methodological Approach

This thesis applies Design Science Research (DSR) as the methodological foundation for both the construction and evaluation of the artifact. DSR is defined by its focus on creating purposeful artifacts and generating knowledge through their iterative refinement and testing. The method is particularly appropriate when the objective is to solve practical problems through design, while also contributing generalizable insights to the knowledge base. This dual purpose is core to DSR, and it distinguishes it from explanatory or interpretive methods that aim primarily to describe or understand existing phenomena.

We follow the process model articulated by Peffers et al. [32] and the conceptual framework introduced by Hevner et al. [19], in which the design process is organized around six core activities: problem identification, definition of solution objectives, artifact design and development, demonstration in context, evaluation against objectives, and communication of the resulting knowledge.

The artifact constructed in this project is a test generation system that uses large language models to interpret user intent and produce executable Android UI tests. This artifact is iteratively developed, tested, and refined through evaluations on real industrial apps. Design choices are informed by the needs of professional Android developers, as well as by technical limitations uncovered during empirical use. Each iteration contributes both to improved system behavior and to the accumulation of design knowledge related to semantic UI planning.
4.2 Justification for Design Science

DSR is the appropriate methodological choice for this work because the research has two simultaneous aims. The first is to construct a functional artifact: a software system that generates Android tests based on user intent. The second is to derive reusable insights into how large language models can be structured, prompted, and evaluated in the context of interactive software testing. DSR supports both aims by integrating structured artifact design with systematic evaluation and by requiring that research outputs include both implemented systems and formalized knowledge contributions [19].

The project begins from a real-world need: enabling software testers and developers to write test cases without manually specifying each UI interaction. This need defines the problem space and anchors the relevance cycle. The artifact is developed through iterative implementation cycles, where each version is evaluated against its ability to interpret intent and successfully operate on Android UI structures. These iterations are part of the design cycle. Throughout the process, prior work on automated testing, multi-modal models, and prompt engineering is used to inform design decisions and complete the rigor cycle described by Hevner et al.

Alternative methodologies were considered but rejected. A purely experimental approach would allow performance measurement, but not artifact construction. A grounded theory approach would allow theoretical development but would not produce a working system. DSR uniquely supports the creation of novel artifacts and the rigorous analysis of their utility and behavior, making it the only suitable choice for a project whose contribution is both practical and theoretical.

4.3 DSR and Research Questions

This research is guided by three core questions:
• RQ1: How effective is the proposed framework in generating valid and executable test code for developer-defined intents?
• RQ2: How accurately does the framework understand developer intents and translate them into meaningful and correct actions?
• RQ3: What are the most effective prompting strategies and modalities for enabling accurate and semantically grounded responses from multi-modal models given visual and textual input?

These questions are designed to span all major activities in the Design Science Research process model as defined by Peffers et al. [32], and to connect explicitly to the three cycles in Hevner’s DSR framework: the relevance cycle (connection to the practical problem context), the rigor cycle (connection to existing knowledge), and the design cycle (artifact creation and evaluation) [19]. Each question isolates a distinct axis of design concern: artifact utility in deployment, internal semantic performance, and external model control via prompt design.

RQ1 concerns the artifact’s practical effectiveness when deployed in real-world development workflows. It corresponds to the Demonstration and Evaluation phases of the DSR process. The question asks whether the generated output meets the expectations of professional developers in terms of correctness, coverage, and executability. Here, the evaluation focuses on system-level behavior using empirical comparison against manually written tests for three industrial Android applications. These tests are treated as application-specific ground truth, and include embedded requirements documentation that defines intended behavior.
The system’s output is evaluated based on pass/fail outcomes, execution trace validity, and behavioral alignment with those expectations. This question also supports the relevance cycle by grounding the artifact in a real industrial need: reducing the manual burden of writing UI tests by enabling developers to express them at a semantic level. In DSR terms, RQ1 evaluates whether the artifact, as instantiated, fulfills its practical design objectives in context and demonstrates fitness for purpose.

RQ2 addresses the internal reasoning quality of the artifact and corresponds to the Design and Evaluation activities of DSR. While RQ1 tests whether the system performs well at the output level, RQ2 investigates whether that output is produced through semantically coherent and logically consistent internal steps. Specifically, it asks how well the system interprets global intent and translates it into meaningful actions using the current GUI context. This includes the performance of the planning module, which decomposes intent into sub-steps, and the evaluation module, which classifies whether the intermediate state transitions match the task objective.

Failures in intent understanding may manifest as action misalignment, missed preconditions, or irrelevant navigation paths. The correctness of these decisions depends not only on LLM outputs but on how those outputs are conditioned by prompt structure, state abstraction, and context memory. Within DSR, RQ2 maps to the rigor cycle by linking design decisions—such as modular LLM role separation, screen representation, or state tracking—with observed behavior. The goal is to assess whether the artifact exhibits robust semantic behavior across variable UI states and task types.

RQ3 treats prompt design as a primary design variable and investigates its effect on the quality of outputs from a multi-modal model. It aligns with the Design and Evaluation phases of the DSR process, but contributes primarily to the rigor cycle by extracting knowledge that is not specific to this artifact alone. The question is motivated by the observation that prompting strategies for multi-modal models are still poorly understood, particularly in domains that involve vision-plus-action tasks such as GUI reasoning [38].

We selected GPT-4o because, at the time of this work, it was the only commercially available multi-modal LLM that simultaneously offered:
• state-of-the-art multi-modal reasoning with vision support;
• a dependable public API with a large context window and consistent sub-second latency;
• native support for interleaving screenshots and structured text in a single prompt without extra fine-tuning.

These characteristics let us run zero-shot experiments and keep the infrastructure simple. Open-source vision-language models such as Qwen-VL [11], Ferret-UI [41], or MobileVLM [39] would have required on-premise GPU clusters plus domain-specific fine-tuning to match GPT-4o’s accuracy, which would have added impractical cost for an industrial test rig; research models (ScreenAI [9], SpiritSight [20]) are not publicly usable; and other commercial MLLMs (Gemini 1.5 Pro¹, Claude 3²) are not accessible in our corporate environment. GPT-4o therefore remained the only option combining high accuracy, low latency, and a stable vision-text API. The artifact uses GPT-4o to consume both screenshots and XML UI hierarchies as input, and the design of prompts determines how this mixed-modal information is presented.
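As a concrete illustration of such a mixed-modal prompt, the following sketch shows how one screenshot and its XML hierarchy can be interleaved in a single GPT-4o request. It assumes the official OpenAI Python client; the prompt wording and the function name describe_screen are illustrative, not the framework’s actual prompts.

import base64
from openai import OpenAI

client = OpenAI()

def describe_screen(screenshot_path: str, xml_hierarchy: str) -> str:
    """Send one screenshot plus its XML hierarchy to GPT-4o in one prompt."""
    with open(screenshot_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the UI state shown below. "
                         "The XML hierarchy follows the screenshot."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
                {"type": "text", "text": xml_hierarchy},
            ],
        }],
    )
    return response.choices[0].message.content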
RQ3 explores how the structure, phrasing, and sequencing of prompts affect the interpretive performance of the model. Examples include whether to use chain-of-thought decomposition over visual input, whether to annotate screenshots with natural language descriptions, or how to order context (e.g., state summaries before action instructions). These design decisions are not incidental—they determine the system’s ability to generalize across new apps and tasks. In DSR terms, RQ3 evaluates how a configurable design artifact (the prompt itself) affects artifact behavior, and produces transferable design knowledge that applies to any task involving multi-modal interface reasoning.

4.4 App Selection for the Evaluation

We selected three applications from the Volvo Android environment for evaluation: Alarm Clock, System Settings, and Load Indicator. These were chosen not only for practical reasons of scope but also because they expose a broad range of interaction patterns, UI complexities, and architectural behaviors that are representative of Volvo’s application ecosystem.

Alarm Clock appears simple in structure but presents significant challenges for automated testing. Several UI elements in this app—such as toggles and icons—are only visible as changes in the screenshot and are not reflected in the XML hierarchy at all. For instance, toggling an alarm does not result in any change to the underlying view tree, and the toggle itself is not recognized as an interactive element. This means that neither the action nor the resulting state change is detectable through structural inspection alone. Instead, image analysis is necessary to identify and reason about these interactions. Additionally, some icons within the app are purely visual with no descriptive attributes or resource IDs, requiring visual–structural correlation to infer their role. While the navigation structure is shallow and predictable, the app’s dependence on visual context over structural metadata makes it uniquely challenging in the context of black-box exploration. A representative screenshot of the main page is shown in Figure 4.1.

¹Google DeepMind, “Our next-generation model: Gemini 1.5,” Google AI Blog (Feb 15, 2024).
²Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku,” Model Card v1.0 (Mar 2025).

Figure 4.1: Main page of the Alarm application. Additional views can be found in Appendix A.1.

System Settings resembles standard Android system settings in layout and behavior. It includes a large number of scrollable views and deeply nested menus, making exploration time-consuming. The app also contains several types of interactive widgets that require special handling, including seekbars and toggle switches. These elements are often sensitive to small changes in gesture precision and timing. In many cases, setting a value cannot be done purely by invoking an action; it requires visual feedback to determine success. The presence of multiple layers of menus and the need for contextual interaction across those layers pose practical challenges for planning and execution. Figure 4.2 illustrates the top-level view of the system settings interface. Additional UI cases are shown in Appendix A.2.

Figure 4.2: System main menu view. Additional images are included in Appendix A.2.

Load Indicator is the most complex of the three applications, particularly in terms of state space and navigation ambiguity.
It contains a large number of dynamic UI components whose appearance and layout depend on the current backend mode of the app. These modes are often invoked through undocumented API calls, and their effects can completely reshape the UI layout. The app includes repeated, nearly identical screens, making state differentiation difficult. It also includes many visual-only components such as graphical representations of axles or trailers, which cannot be semantically interpreted through the XML hierarchy alone. Pop-up messages, mode changes, and polling-based updates further increase the difficulty of extracting stable state representations. In practice, understanding and interacting with Load Indicator requires tracking navigation paths, backend context, and visual structure in combination. A representative UI screen is shown in Figure 4.3.

Figure 4.3: Main page of the Load Indicator app. Additional examples are available in Appendix A.3.

These three applications were selected after evaluating other available apps in the same environment. Collectively, they encompass the most common design patterns and interaction challenges observed across Volvo’s internal applications. Based on this coverage, we consider it reasonable to expect that methods developed and evaluated on this set will generalize to a broader portion of the application ecosystem.

4.5 Exploratory Work: Crawler-Based Initial Implementation

4.5.1 Overview

The crawler explores the graphical user interface (GUI) of an Android application without human guidance. Its output is a directed navigation graph whose vertices represent distinct interface states and whose edges represent user actions. That graph feeds a test generator that can replay paths matching a given intent.

Assumptions

The crawler operates under several structural and behavioral assumptions about the application under test and the Android platform. It assumes that the application is the only foreground Activity and that the Android accessibility layer exposes a complete and stable view hierarchy on demand. The crawler treats the screen as stable during the short interval following an action, meaning that the bounds and structure of UI elements do not change in response to background processing or transient animations within that window. This permits accurate coordinate-based interaction with screen elements, as their positions are assumed not to shift after initial observation.

For the purposes of this thesis, we define a UI change as any detectable modification in screen state based on either (i) a structural difference in the view hierarchy—such as node additions, deletions, or attribute changes—or (ii) a difference in visual appearance inferred through LLM-based reasoning over screenshots. Unlike low-level pixel hashing, our visual comparison is semantic in nature: the multi-modal model is used to judge whether a visible change has occurred that is meaningful in context, such as a toggle changing from "enabled" to "disabled" or a view element appearing or disappearing. This approach allows the crawler to capture both structural transitions and visually grounded changes, even when such changes are not reflected in the XML view hierarchy.

Every action performed by the crawler is expected to produce a deterministic and reproducible state, up to elements that are explicitly excluded from hashing (such as dynamic text).
Transitions between screens are assumed to be reversible either through identifiable back-navigation elements or the Android system back button. This reversibility allows the crawler to return to prior states when exploration requires it.

The use of a language model is limited to generating semantic labels for states and UI elements. These labels are intended to assist with downstream test planning and reporting but are not used in any decision-making during exploration itself. The model is not queried continuously but only once per new element or state that lacks meaningful metadata.

4.5.2 System Architecture

The crawler is structured as a modular pipeline in which each stage transforms the application state or performs a decision based on it. At the base is the device interface, which executes low-level input events and retrieves the current screen content. On top of that, the state extraction module interprets the GUI hierarchy and screenshot to form an abstract representation of the screen that captures stable identifiers and actionable elements.

The extracted state is passed to the hashing module, which computes a content-based identifier to determine whether this state has already been visited. The exploration engine then selects an element with unexplored interactions and chooses the highest-priority action for that element. Once an action is executed, the crawler captures the resulting screen, computes its hash, and determines whether it represents a new state or a virtual variant of an existing one.

A central controller coordinates these modules and maintains a navigation graph as new states and transitions are discovered. This graph encodes the structure of the application’s UI flow and accumulates metadata required for downstream test generation. A single orchestrator object coordinates the pipeline: capture → hash → decide → act → repeat.

4.5.3 UI State Representation and Abstraction

Each visited UI node stores a structured state descriptor, which summarizes the current screen configuration and serves as the primary unit of exploration. This descriptor is defined as the tuple

S = (pkg, activity, I, E),

where pkg is the package name, activity the foreground activity, I a base-64 screenshot thumbnail, and E = {e1, . . . , en} the set of UI elements extracted from the view hierarchy. This representation captures all semantic and structural data necessary for planning and test synthesis, but it is not used as a state identifier. Equality between UI states—used, for example, in caching or cycle detection—is defined separately via a hashing function over selected components of S (see Section 4.5.4).

For every element ei ∈ E the crawler stores:
• a stable view-tree path (XPath),
• the screen center (x, y),
• the widget class and resource identifier,
• interaction capabilities (clickable, toggle, scroll, editable),
• semantic labels (name, description, intent) generated once by an LLM.

Special element groups. Certain compound widgets demand bespoke handling:
• Toggles – binary widgets (check-boxes, switches).
• Number pickers – triplets {decrement, numeric display, increment}.
• Replicators – list/grid containers whose children share an identical structure.

These groups are detected heuristically (class names, sibling layout) and treated as virtual action spaces so that repetitive elements do not bloat the graph.
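For illustration, the descriptor and its element records can be expressed as plain data containers, as in the following sketch. The field names are illustrative; the actual implementation may organize this data differently.

from dataclasses import dataclass, field

@dataclass
class UIElement:
    """One element e_i of E, carrying only the per-element fields listed above."""
    xpath: str                      # stable view-tree path
    center: tuple[int, int]         # screen center (x, y)
    widget_class: str               # e.g. "android.widget.Switch"
    resource_id: str | None         # may be absent for purely visual widgets
    capabilities: set[str] = field(default_factory=set)  # clickable, toggle, ...
    name: str | None = None         # semantic labels generated once by an LLM
    description: str | None = None
    intent: str | None = None

@dataclass
class UIState:
    """The state descriptor S = (pkg, activity, I, E)."""
    pkg: str                        # package name
    activity: str                   # foreground activity
    thumbnail_b64: str              # base-64 screenshot thumbnail I
    elements: list[UIElement]       # the element set E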
4.5.4 State Identification and Hashing

The crawler assigns a stable identifier to each observed UI state to detect revisits, control the exploration loop, and manage graph connectivity. Hashing is based on the structural and semantic content of the screen rather than its pixel-level appearance.

Each state is assigned a hash value derived from its activity name and a set of hashed element descriptors. An element descriptor includes the element’s resource ID, class name, XPath, and optionally an embedded image crop if no stable textual or structural identifier is available. Attributes such as live text, scroll positions, or dynamic counters are excluded to avoid false negatives due to incidental UI changes. Formally, the state hash is defined as

h(S) = MD5(activity ∥ concat_{j=1}^{m} MD5(p_j)),

where p_j is the serialized stable descriptor of the j-th UI element. The descriptors are sorted to make the hash invariant to view-hierarchy ordering. MD5³ is a hashing algorithm used here to generate fixed-length fingerprints of both the full state and individual UI element descriptors. It provides a compact, deterministic way to compare structured input for equality.

The term hierarchy ordering refers to the structure and ordering of elements returned by the Android UI automation layer. In practice, the XML view hierarchy can vary across identical screens in several ways: (i) the order of attributes within a node may change; (ii) the order of child nodes may be unstable; and (iii) the absolute screen coordinates of elements may shift slightly due to layout rendering or animation effects. These variations do not necessarily reflect any meaningful change in the user interface. To ensure hash stability under such non-semantic differences, all descriptors p_j are normalized and sorted before hashing. This normalization allows the crawler to treat visually identical but structurally reordered screens as the same logical UI state.

To account for screens that are visually different but structurally equivalent, the crawler includes several layers of comparison. When a newly observed state produces a hash collision with an existing vertex, the corresponding UI trees are structurally compared using a recursive attribute-matching algorithm. If the number and types of elements are identical and their bounding boxes and identifiers match within a threshold, the states are treated as equal and no new vertex is added. If the structural diff detects small changes that do not justify a new node (e.g., label text or image content variations), the new state is added as a virtual variant of the parent. These virtual nodes are flagged and tracked separately from primary graph vertices.

In cases where the structure is ambiguous, or the element identifiers are absent or repetitive, the crawler captures a cropped screenshot region for each such element and includes a perceptual hash of the crop in the element descriptor. This provides a fallback path to distinguish visually unique components that lack stable metadata.

States that are similar but not identical are still stored independently if the structural diff suggests different possible interactions or navigational affordances. This conservative approach avoids mistakenly collapsing distinct user-visible configurations.

³https://www.ietf.org/rfc/rfc1321.txt, the official specification of MD5 by the Internet Engineering Task Force (IETF).
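The following minimal sketch shows how such a hash can be computed in Python; descriptor serialization is simplified here to three stable fields, and the normalization and image-crop fallback described above are omitted.

import hashlib

def state_hash(activity: str, elements: list[tuple[str, str, str]]) -> str:
    """h(S) = MD5(activity || concat_{j=1}^{m} MD5(p_j)).

    Each descriptor p_j is a (resource_id, class, xpath) triple; live text and
    scroll positions are deliberately excluded. Sorting the per-element digests
    makes the hash invariant to view-hierarchy ordering.
    """
    digests = sorted(
        hashlib.md5("|".join(part or "" for part in desc).encode()).hexdigest()
        for desc in elements
    )
    return hashlib.md5((activity + "".join(digests)).encode()).hexdigest()

# Two orderings of the same elements yield the same hash:
e1 = ("com.app:id/save", "android.widget.Button", "/root/btn[1]")
e2 = ("com.app:id/list", "android.widget.ListView", "/root/list")
assert state_hash("MainActivity", [e1, e2]) == state_hash("MainActivity", [e2, e1])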
4.5.5 Interaction and Exploration Strategies

The crawler explores each GUI state by attempting interactions on every actionable UI element. An interaction is defined as the application of a concrete user-level action to a specific UI element in a specific state. To avoid redundant or semantically inert actions, the crawler follows a fixed priority scheme when selecting which action to attempt. Each element supports a subset of the actions click, long_click, toggle, and set_text, based on its declared properties and inferred type. The priority order is

set_text ≺ toggle ≺ long_click ≺ click.

This order reflects the assumption that text input and toggling are more likely to discover sub-states rather than completely new states, and are therefore less disruptive. Each action is executed only once per element and per state unless the element is later reclassified.

After performing an action, the crawler captures the resulting state and compares its hash against known states. If the hash is unchanged, the action is marked as non-state-changing and not repeated. If the action produces a new state or a virtual variant, an edge is added to the navigation graph and the resulting state is further explored.

Toggle widgets require additional logic. If a click does not change the screen hash, the element is tentatively marked as a toggle. The crawler performs a second click to test reversibility. If the second click restores the original hash, the element is confirmed to be binary and both toggled states are modeled as virtual vertices. These transitions are labeled with synthetic intent strings (toggle_on, toggle_off) and the toggle behavior is not explored further.

For composite structures such as number pickers and replicator containers, the crawler applies group-level logic. In a number picker, increment and decrement buttons are treated as independent actions that change an internal numeric value but not necessarily the whole screen state. A sequence of increment actions may be simulated to discover value ranges or trigger dynamic behaviors. Replicators, such as lists or grids, are handled by sampling representative children and applying actions to a subset, rather than fully traversing every repeated instance. To avoid redundant transitions, only the first few distinct elements within a replicator are explored unless their content appears to change dynamically.

Each element–action pair is recorded with metadata about its outcome: whether it caused a state change, whether the transition was reversible, and whether it exposed any new interactive elements. This metadata is used by the planner to avoid revisiting inert interactions in future paths and to prioritize transitions likely to yield new application states.

4.5.6 Navigation Graph Construction

The crawler maintains a directed multi-graph G = (V, E), where each vertex v ∈ V corresponds to a unique UI state and each edge e ∈ E represents a concrete user interaction that causes a transition between states. This graph is updated incrementally during crawling and serves as the primary data structure for modeling application behavior.

Figure 4.4: Navigation graph from a crawling session of an Android application.

Each vertex is keyed by a state hash and stores the full abstract state S, including the UI elements, activity name, semantic annotations, and a screenshot thumbnail. Vertices are flagged as either primary or virtual, depending on whether they represent distinct application screens or intermediate configurations introduced by toggles or similar behaviors. Virtual vertices are not explored recursively but are connected by reversible edges to their parent nodes.

Edges encode the user interaction that triggered a transition from one state to another. Each edge is labeled with the action type (e.g., click, toggle), the element’s stable XPath, its center coordinate on the screen, and an optional semantic intent string. These intent labels are derived from resource identifiers or inferred using the language model and are used to support intent-based test generation.

When the crawler executes an action on an element in state S and observes a resulting state S′, it computes the hash h′ = h(S′). If h′ does not exist in the graph, a new vertex is added and linked from the current state. If h′ matches an existing vertex, a new edge is added from the current state to that vertex. If the content differs only slightly, the new state may be added as a virtual child of the matched node, as explained in Section 4.5.4.

The graph also stores metadata used for navigation and analysis, including whether a vertex has been fully explored, whether an edge is reversible, and which elements have unexplored actions. Edges can also be annotated post hoc with additional metadata such as execution success or failure, timeouts, or abnormal terminations.

The graph structure supports multiple downstream queries: shortest-path retrieval between arbitrary states, matching of screen states to intent descriptions, detection of navigation loops, and extraction of subgraphs rooted at semantically important states. All crawl-time decisions and post-processing analysis operate over this shared graph representation.
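A minimal sketch of this structure, assuming the networkx library (the attribute names are illustrative, not the exact implementation):

import networkx as nx

def add_state(g: nx.MultiDiGraph, h: str, state, virtual: bool = False) -> None:
    # Vertices are keyed by the state hash h and carry the full abstract state S.
    if h not in g:
        g.add_node(h, state=state, virtual=virtual, fully_explored=False)

def add_transition(g: nx.MultiDiGraph, h_src: str, h_dst: str, action: str,
                   xpath: str, center: tuple[int, int],
                   intent: str | None = None) -> None:
    # A multi-graph keeps parallel edges: several actions may link two states.
    g.add_edge(h_src, h_dst, action=action, xpath=xpath, center=center,
               intent=intent, reversible=None)

G = nx.MultiDiGraph()
# Downstream queries reuse standard graph algorithms, e.g. path replay:
# path = nx.shortest_path(G, source=h_start, target=h_goal)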
4.5.7 Intent Resolution and Semantic Annotation

The crawler utilizes an LLM to augment the navigation graph with semantic information. This is done in a constrained and strategic manner, without allowing the model to influence any control flow or decision-making within the exploration algorithm. The model is invoked only when the built-in metadata of a UI element or state is insufficient to describe its purpose or content.

Figure 4.5: An example of semantically augmented nodes and edges from Figure 4.4, where a settings button – highlighted by a transparent circle – is clicked in Node 0, which gets connected to Node 1 via Edge 0.

When a new UI element lacks a descriptive text label or resource identifier, the crawler extracts relevant context such as visible text, class name, and screen location, then queries the language model to generate a short name and a coarse intent string, as can be seen in Figure 4.5. These fields are stored alongside the element and associated with any outgoing edges that result from interacting with it. Examples of generated intent labels include descriptions like “navigate to profile settings” or “submit form.”

Similarly, for each unique UI state, the crawler may request a short summary title and a one-line description. This is done only once per state and cached in the graph. These summaries improve the interpretability of the graph and assist downstream tools that generate tests from descriptions.

The semantic fields are optional and used only to annotate the graph. No interaction selection, path prioritization, or state comparison depends on the output of the language model. The crawler is designed to function fully without these annotations, and the presence or absence of a label does not influence its behavior.
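A minimal sketch of such a one-off labeling query is shown below; llm stands in for any callable that sends a prompt string to the model and returns its text reply, and is not the thesis’ exact interface.

import json

def annotate_element(llm, text: str, cls: str, bounds: str) -> dict:
    """Generate a short name and coarse intent for one unlabeled element."""
    prompt = (
        "You label Android UI elements. Reply with JSON of the form "
        '{"name": "<short name>", "intent": "<coarse intent>"}.\n'
        f"Visible text: {text}\nClass: {cls}\nScreen location: {bounds}"
    )
    # Queried once per element and cached; never used for exploration decisions.
    reply = llm(prompt)
    # e.g. {"name": "settings button", "intent": "navigate to profile settings"}
    return json.loads(reply)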
4.5.8 Crawling Algorithm

Algorithm 1 summarizes the exploration loop; it proceeds until no vertex contains an unexplored interactive element.

Algorithm 1: Android GUI Crawling Loop
 1  S ← CaptureState()
 2  h ← Hash(S)
 3  InsertVertex(h, S)
 4  while true do
 5      (htarget, i) ← NextUnexplored()
 6      if htarget = nil then
 7          break
 8      end
 9      if h ≠ htarget then
10          NavigateTo(htarget)
11          h ← htarget
12      end
13      a ← SelectAction(h, i)
14      Execute(a)
15      S′ ← CaptureState()
16      h′ ← Hash(S′)
17      ProcessTransition(h, a, h′, S′)
18      h ← h′
19  end

Explanation of Algorithm 1. The crawler begins by capturing the initial GUI state and inserting it as a vertex in the exploration graph. On each iteration, it invokes NextUnexplored(), which returns a pair (htarget, i) representing the i-th unexplored interactive element in a previously visited state with hash htarget. If the current state h is not equal to htarget, the crawler navigates back to it using NavigateTo(), which replays a known path through the interaction graph. Once at the correct state, the crawler selects and executes the i-th action using SelectAction() and Execute(). The resulting UI state is captured, hashed, and added to the graph via ProcessTransition(). This loop continues until all reachable interactive elements have been explored.

The function NextUnexplored() selects the next unexplored action based on a priority ordering defined by interaction type and availability. These heuristics are described in Section 4.5.5 and guide the crawler toward interactions that are more likely to yield meaningful state transitions.

Termination Guarantee. In principle, the crawling loop halts once every vertex has no remaining unexplored element–action pair. This assumes that the set of reachable states is finite and that the hash function correctly identifies all structurally equivalent screens. However, in practice, this assumption does not always hold. Android provides no reliable or stable mechanism for uniquely identifying GUI states across time. Neither the view hierarchy, nor logcat⁴, nor low-level diagnostics such as surfaceflinger⁵ or dumpsys⁶ expose a persistent identifier or canonical structure for UI states. As a result, hash collisions and near-duplicates are both possible and unavoidable.

When the hash function under-approximates state identity, the crawler may treat visually different but functionally identical screens as distinct. This leads to state-space explosion, particularly in apps that are highly dynamic. Conversely, when the hash function over-approximates and collapses states that differ in behaviorally relevant ways, some transitions may be lost or misclassified.

Therefore, the crawler does not have a formal termination guarantee in the general case. Its practical termination depends on the stability of the app under test, the reliability of state hashing, and the degree of structural noise in the UI. In well-behaved applications with static or semi-static screens, the crawl typically completes in finite time with manageable graph size. In highly dynamic applications, manual intervention or additional heuristics may be required to limit or bound the crawl.

⁴https://developer.android.com/tools/logcat
⁵https://source.android.com/docs/core/graphics/surfaceflinger-windowmanager
⁶https://developer.android.com/tools/dumpsys
4.6 LLM-Based Crawler

4.6.1 Motivation

The original crawler implementation used algorithmic heuristics and tree-based XML comparison to explore the application’s GUI. This method was limited by its inability to reliably identify whether a new screen represented a unique application state. Small layout shifts or structural differences in the XML often led to duplicated traversal or missed state transitions. To address this, a second crawler design was implemented. Unlike the first, this version delegates most decision-making to an LLM, using visual and semantic descriptions of the interface rather than structural comparisons.

4.6.2 Screen Representation

Each screen is represented using two primary sources: a screenshot and its corresponding UI XML hierarchy. The system collects both and passes them into a module called DescribeScreen, which returns a textual title, a semantic description of the interface, and a list of actions that can be taken. This information is stored as a representation of the current screen. Screens that involve scrollable containers use the DescribeScreens module, which aggregates descriptions from multiple vertically scrollable subviews.

4.6.3 State Tracking

The crawler attempts to determine whether the current screen has already been explored. This process uses a hybrid approach. First, image similarity is computed using the Structural Similarity Index Measure (SSIM⁷). If the SSIM between the current screen and any previous screen exceeds a fixed threshold (≥ 0.95), the system considers them visually similar. Second, the LLM module IsUniqueState is invoked to evaluate whether the semantic description of the current screen matches any stored descriptions. Only if both conditions suggest duplication is the screen treated as a previously seen state. Otherwise, the crawler assumes a new state and records it.

This approach does not fully eliminate the state detection problem. Visual similarity does not guarantee semantic equivalence, and LLM-based descriptions may introduce inconsistency or aliasing. As a result, the system’s memory may contain duplicate or overly merged states. This design is functional for limited runs of the crawler, where the memory is used only as an approximate record of visited states.

4.6.4 Exploration Procedure

The exploration loop consists of the following steps:
1. Capture the current screen image and XML hierarchy.
2. Describe the screen using DescribeScreen or DescribeScreens.
3. Determine whether the state has already been seen.
4. Select an unexplored action from the list associated with the current screen.
5. Use the Execute module to generate code that performs the action.
6. Run the code, capture the post-execution screen, and classify the change using GetChangingType.
7. Update the memory:
   • If the action caused a layout change, add a new node and edge in the memory graph.
   • If the action only altered element states or failed, record the result accordingly.
8. If there are no more unexplored actions from the current state, resume from another state that still has unexplored transitions.

Each node in the memory graph stores a screen title, description, actions, and screenshot. Each edge represents an action taken between two screens, along with the code used.

⁷SSIM is a method for measuring the similarity between two images. It evaluates perceived changes in structural information by comparing patterns of pixel intensities, and returns a score between −1 and 1, where 1 indicates perfect similarity.
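The visual half of this duplicate check can be sketched as follows, assuming OpenCV and scikit-image; the real state tracker additionally consults the LLM (IsUniqueState) before declaring a duplicate.

import cv2
from skimage.metrics import structural_similarity

SSIM_THRESHOLD = 0.95  # fixed threshold used by the state-tracking heuristic

def visually_similar(path_a: str, path_b: str) -> bool:
    """Whole-image SSIM on grayscale screenshots (first stage of the check)."""
    a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    b = cv2.resize(b, (a.shape[1], a.shape[0]))  # SSIM needs equal dimensions
    return structural_similarity(a, b) >= SSIM_THRESHOLD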
4.7 Artifact Design and Implementation

This study develops an AI agent prototype system designed to support automated UI test generation for Android apps based on user intent. The system is implemented in Python and integrates several key components:
• an Evaluator–Optimizer workflow, which systematically evaluates LLM-generated outputs and fine-tunes prompt parameters to improve accuracy, alignment, and overall performance;
• GPT-4o, a multi-modal large language model used to interpret natural-language intents and generate test logic;
• uiautomator2, a tool for interacting with and controlling Android devices at the UI level.

These parts work together so that the AI agent system can take a user instruction, understand the current screen of the Android app, and choose the best actions to perform, similar to what a human tester would do.

Figure 4.6: Overview of the system’s workflow for intent-driven test generation.

4.8 System Architecture and Implementation Details

The system is written in Python 3.10 and follows a modular design, based on the structure shown in Figure 4.6. Each module has its own role and handles one part of the test generation and execution pipeline.

4.8.1 Modules Using LLMs

The robot-like blue chat icon in Figure 4.6 indicates that LLMs are used inside a module, e.g., to generate code or to act as a judge. In our system, the Planning Module and Selection Module use LLMs as decision makers, the Execution Module and Assertion Module use LLMs as code generators, and the Observation Module uses LLMs as judge and reasoner.

4.8.2 Planning Module

First, let us define the inputs and outputs of this module. The requirements for the test case, which can also be understood as the ultimate goal we want the system to achieve at the end of the loop, are called the global intent.

Global Intent. A natural-language string representing the user’s input that serves as the overall guiding objective for the system. The global intent is sent to the Planning Module, where it is decomposed into a sequence of local intents for step-by-step execution. It is also sent to the Observation Module, which continuously evaluates whether the system has successfully fulfilled the user’s intended goal. This evaluation governs the control flow of the Evaluator–Optimizer workflow: if the global intent is satisfied, the loop terminates.

Another input to the Planning Module is the GUI context, which provides visual information about the current UI state and exploits the multi-modal abilities of modern language models.

GUI Context. The GUI context represents the observable state of the application at a specific moment in time. It consists of a list of current screenshots capturing the visible UI and an accompanying XML hierarchy file that encodes metadata about the underlying UI elements—such as their types, positions, properties, and relationships.

The output of this module is the local intent, a single string containing the next immediate action to be performed. The action is drawn from a fixed set of types for the language model to select from, and is generated with details indicating the UI elements to interact with or the values to be input, e.g., “Tap on the ’Edit’ button under the first Alarm.”

Local Intent. A decomposed step of the global intent to be performed on the current app screen, generated by the Planning Module.
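For concreteness, a planner response of the kind described above might look as follows; this is a hypothetical example, and the field names are illustrative rather than the exact schema used by the framework.

# Hypothetical planner output for the alarm example above:
planner_output = {
    "reasoning": (
        "The global intent asks to edit the first alarm. The current screenshot "
        "shows the alarm list, so the next immediate step is to open the first "
        "alarm's edit view. Tapping 'Edit' may cause a page transition, so no "
        "further action is chained after it."
    ),
    "local_intent": "Tap on the 'Edit' button under the first Alarm.",
}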
The generation of this local intent is subject to several constraints: (i) all actions must operate strictly within the current screen context (i.e., the visible screenshot(s)); (ii) once an action may cause a page transition, no further actions should be chained after it; (iii) the output must focus solely on accomplishing the next immediate step toward fulfilling the global intent, without over-executing; (iv) for scrolling actions, the specific scrollable list must be identified; and (v) if a scroll is intended to locate a specific string, that string must be explicitly mentioned in the step.

Together, the global intent, GUI context, and the reasoning output from the Observation Module—captured after the last set of local actions—are passed as input to the language model. This enables comprehensive understanding and reasoning about the current task. Using chain-of-thought prompting, the language model generates the most appropriate next action to be performed, referred to as the local intent. In addition, the model outputs a textual reasoning trace, which reveals the intermediate thought process behind the decision and allows for human inspection and evaluation of its reasoning quality.

This module acts as the decision maker of the entire system: it guides the system by outputting a local intent as the next step for the Execution Module to run as local actions; the result of the local actions is then captured by the Observation Module and sent back to the Planning Module for better reasoning.

4.8.3 Selection Module

Local Actions. A single action step generated by the Selection Module, translated from the local intent into actionable commands under the Volvo Android environment. The action represents one or more UI operations required to fulfill the specified local intent. These actions are executed on the real application using uiautomator2, enabling direct interaction with the device’s user interface.

The Selection Module takes the local intent as input, sends it to the LLM for tool selection, and generates a local action as output. Tool selection involves a list of predefined tools and limits the LLM to picking the most appropriate one for the current app screen. The local action must be one of the predefined primitive types: tap, swipe, scroll, set_numberpicker, set_seekbar. The tool list contains normal Android operations like tap, swipe, and scroll. There are also certain types of special UI actions like set_numberpicker or set_seekbar, which involve a complicated and correctly ordered series of actions, such as a long click to enter edit mode, setting text to set a digit, and a click to save the value of that digit. We use few-shot prompting to guide the language model to generate a correct series of local actions.

4.8.4 Execution Module

This module receives the local actions and sends them to the language model, which translates them into Python code that can be executed on the Android device. We use uiautomator2 to perform actions such as clicking buttons, swiping, or entering text, enabling direct interaction with the device’s user interface. For example, a local action such as “tap the ‘Save’ button” may be translated into device(text="Save").click(), while a local action like “scroll to the item labeled ‘Volume’” could result in device(scrollable=True).scroll.to(text="Volume"). The execution result, along with any error messages, is then passed to the Observation Module as the outcome of the Execution Module.
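The examples above use real uiautomator2 calls; a minimal sketch of how the module might run one generated snippet and report the outcome is shown below (the wrapper itself is illustrative, not the exact implementation).

import uiautomator2 as u2

device = u2.connect()  # attach to the device under test

def run_local_action(code: str) -> tuple[bool, str]:
    """Execute one generated snippet; the outcome and any error text are
    forwarded to the Observation Module."""
    try:
        exec(code, {"device": device})  # e.g. 'device(text="Save").click()'
        return True, ""
    except Exception as err:            # error message is fed back to the LLM
        return False, str(err)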
4.8.5 Observation Module

After the Python code translated from the local actions has been executed on the device, the application may transition to a new state (e.g., a different screen or modified UI elements) or remain in the original state, which may indicate that the actions were unsuccessful. To determine which path the app has taken and to obtain a visual understanding of the current state, we recapture the GUI context and pass it, together with the execution result of the local actions and the global intent, to the language model.

The language model is first prompted to assess whether the performed local actions have successfully fulfilled the global intent. If so, the system exits the Evaluator–Optimizer loop. If not, the model is instructed to reason about why the global intent remains unfulfilled. This reasoning includes: (i) whether the app state has changed as expected; (ii) whether the local actions were successfully executed; (iii) the degree to which the global intent has been fulfilled; and (iv) the remaining steps required to achieve full completion of the intent. The reasoning is sent back to the Planning Module to inform the next move.

4.8.6 Memory

The memory component maintains a record of previously visited application screens and supports the identification of already-explored states. Each time a new screen is encountered, the framework captures a screenshot and its corresponding XML hierarchy. These inputs are passed to a summarization module that uses an LLM to produce a textual description of the screen. The prompt includes constraints and examples to constrain the model, increase consistency, and limit irrelevant output. This textual description becomes the key under which the screen’s metadata is stored.

The metadata includes the screenshot, XML hierarchy, and information about the visible UI elements and their bounding boxes. Because both the image and bounds are available, individual elements can be isolated by cropping the relevant region from the screenshot. These stored elements and descriptions are used later to compare new screens against previously observed ones.

To determine whether a newly visited screen has already been seen, the framework generates a fresh description and searches for similar entries in memory. If candidates are found, it compares the new screenshot against stored screenshots using SSIM. If the similarity score exceeds a threshold (e.g., 0.95), the screen is considered visually identical to a known state. This comparison is performed over the entire image.

Using SSIM alone is not always sufficient. In Android applications, screens may include dynamic content such as timestamps, small status messages, or slightly changed icons. These can cause minor visual differences without affecting the actual layout or meaning of the screen. A high SSIM score despite these small variations is usually a good indicator that the screens are functionally the same.

In addition to image similarity, the framework also considers how the screen was reached. If the current path through the app (for example, moving from screen A to B to C) matches a previously recorded path that also ended at a similar-looking screen, this provides further evidence that the two states are the same. The description, screenshot, XML hierarchy, and navigation path together form a clearer context. When needed, the framework can provide the LLM with a shortlist of similar memory entries.
The model is then asked whether the current screen should be treated as new or already visited. This additional step helps disambiguate borderline cases where image similarity is high but uncertainty remains. Combining SSIM, semantic description, execution path, and LLM judgment provides a more reliable basis for determining whether a screen is new or already known.

4.8.7 Evaluator–Optimizer Workflow

The Evaluator–Optimizer workflow is a structured pattern commonly used in agentic systems and LLM-based pipelines to iteratively improve output quality. In this workflow, an optimizer generates an initial response or action, and an evaluator assesses its quality based on predefined criteria. The evaluation result is then used to refine the next output, enabling continuous feedback and learning [8].

In our system, the Planning Module, Execution Module, and Observation Module collectively implement this workflow. The process starts by evaluating how well the generated actions match the user’s intent and how accurate those actions are when executed on the real UI. The evaluation outcomes are then used to update prompt parameters, adjust prompt structure, and guide the Planning Module toward more accurate decisions. This feedback loop lets the system keep improving as it executes more tests and observes more GUI contexts. The result is a dynamic, learning-capable AI agent that can behave like a human tester: understanding user intent, planning next steps, executing actions, and validating outcomes—all while refining its behavior in real time.

This AI agent system is a key part of the research. It shows how LLMs can be used in Android testing, and it enables experiments that measure properties such as test coverage, intent alignment, and action success on real Android apps used at Volvo Trucks.

4.8.8 Assertion Module

After the global intent has been fulfilled, the system exits the Evaluator–Optimizer workflow and transitions to the Assertion Module. The Assertion Module takes the complete code snippets as input and prompts the LLM to generate reasonable assertions line by line for the input code snippets. At this point, assertions are added to each code snippet produced from the local actions. All these annotated code snippets are then assembled to compose a complete test case corresponding to the original global intent. The assembled code is then emitted via the Compose Final Test Script step, which produces the output of the entire workflow.
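Putting the modules together, the overall control flow from global intent to final test script can be sketched as follows; the callables stand in for the Planning, Selection, Execution, Observation, and Assertion Modules, and their signatures are illustrative rather than the exact implementation.

from typing import Callable

def run_intent(global_intent: str, capture: Callable, plan: Callable,
               select: Callable, execute: Callable, observe: Callable,
               compose: Callable, max_steps: int = 20) -> str:
    """Simplified Evaluator–Optimizer loop (cf. Figure 4.6)."""
    feedback, snippets = "", []
    for _ in range(max_steps):
        gui = capture()                                    # screenshot + XML
        local_intent = plan(global_intent, gui, feedback)  # decide next step
        local_action = select(local_intent)                # pick a primitive tool
        ok, error, code = execute(local_action)            # run on the device
        snippets.append(code)
        done, feedback = observe(global_intent, capture(), ok, error)
        if done:                                           # global intent fulfilled
            return compose(snippets)                       # annotated test script
    raise RuntimeError("Global intent not fulfilled within the step budget")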
4.8.9 Network Sniffer Integration

Certain UI states in Volvo Android applications are tightly coupled to backend behavior. For example, some pop-up dialogs or confirmation messages are triggered only after a specific API response is received. Similarly, application settings may be populated dynamically based on periodic polling of backend services. These interactions are not observable through the UI hierarchy alone and cannot be reliably inferred from screen transitions. This made it necessary to design a mechanism that could capture and analyze backend traffic associated with UI actions.

To address this, we implemented a custom packet sniffer and integrated it into the framework. The goal was to observe HTTP traffic in response to UI events and extract information such as endpoint, method, status code, and request/response payloads. This information can be used for reverse engineering undocumented behavior and supporting test assertions that depend on backend responses. Additionally, identifying polling mechanisms or control requests helps in reconstructing payloads to restore application state in later tests.

Rather than relying on high-level packet capture libraries such as PyShark⁹, the sniffer was implemented from scratch to allow fine-grained control over both the packet parsing and the capture strategy. This decision was motivated by two constraints: first, many execution environments had strict limitations on installing external dependencies; and second, we required full control over when and how packets were filtered, parsed, and assembled. The sniffer operates locally or remotely via SSH, with support for interface discovery and scoped, timed capture windows.

At the core of the sniffer is a TCP assembler. This component is responsible for grouping raw packets into bidirectional TCP connections. It tracks sequence numbers and connection metadata to avoid duplicate payloads and incomplete streams. It detects connection setup (SYN flag) and teardown (FIN or RST)¹⁰, and classifies each connection as transient, persistent, WebSocket, or event-stream based on observed headers. All payloads for each direction are accumulated separately and stored once the connection is closed on both ends.

Once assembled, each TCP stream is parsed by a custom HTTP parser. This parser extracts fields from both the client and server side, including method, endpoint path, status code, headers, and content type. If the content type suggests a JSON payload, the parser attempts to deserialize it. The parser also includes logic to detect Volvo-specific media types and extract application name and version identifiers. Parsed results are stored as structured objects and logged to file for later analysis.

The sniffer supports both continuous capture and scoped capture. In scoped-capture mode, the sniffer records only the transactions that occur during a defined window, typically tied to a UI action. This is implemented via a context manager that synchronizes packet sniffing with the start and end of the UI interaction. The mechanism is integrated at the HTTP request layer of the device control interface, so that any click or swipe can be wrapped with packet capture automatically. The duration of capture is configurable and defaults to a short window sufficient to include most API responses.

Captured transactions can be filtered post-capture using a flexible filtering system. Transactions can be included or excluded based on criteria such as HTTP method, status code, endpoint, or application name. These filters are used to suppress background noise or isolate specific types of traffic, such as polling endpoints or user-triggered control messages.

⁹PyShark is a Python wrapper for the Wireshark packet analyzer. It provides access to parsed packet fields but depends on tshark and is less suitable for constrained environments.
¹⁰SYN, FIN, and RST are TCP control flags. SYN initiates a connection, FIN signals an orderly shutdown, and RST indicates an immediate termination of the connection.
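The scoped-capture mechanism can be sketched as a context manager like the following; the sniffer interface (start/stop/transactions) is illustrative, not the exact implementation.

import time
from contextlib import contextmanager

@contextmanager
def scoped_capture(sniffer, tail_s: float = 3.0):
    """Record backend traffic only while a UI action (plus a short tail) runs."""
    sniffer.start()
    try:
        yield sniffer
    finally:
        time.sleep(tail_s)  # tail window so late API responses are included
        sniffer.stop()

# Usage: wrap a UI action so its backend effects are recorded.
# with scoped_capture(sniffer) as s:
#     device(text="Save").click()
# for tx in s.transactions():
#     print(tx.method, tx.endpoint, tx.status_code)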
The packet sniffer is not actively integrated into the core framework beyond basic logging. Its current role is limited to passive observation and payload capture, without influencing test generation or execution. However, it establishes the groundwork for more advanced functionality. Future iterations could support automated generation of backend assertions, classification of traffic types, or active replay of requests to emulate specific application states. The implementation was built to be modular and extensible for these purposes.

4.9 RQ1 Evaluation: Code-Level Comparison Between Generated and Manual Tests

To evaluate RQ1 — “How effective is the proposed framework in generating valid and executable test code for developer-defined intents?” — we define effectiveness as the ability of a test to correctly, reliably, and clearly express a developer’s intent with minimal manual intervention or maintenance overhead. A generated test is effective if it (i) performs the intended interactions correctly, (ii) exhibits stable behavior across repeated executions, (iii) avoids redundant operations, and (iv) is written in a form that is understandable and maintainable by human engineers.

Based on this definition, our evaluation includes only the generated tests that passed, along with their corresponding manually written counterparts that also passed, which serve as the ground truth. All tests are assessed across five structured metrics. Failed tests are excluded because the structured metrics—such as robustness and readability—are designed to evaluate the quality of valid and executable test cases.

4.9.1 Line-Level Correctness

Line-level correctness in software engineering refers to the accuracy and reliability of individual lines of code within a program. It assesses whether each line performs its intended function correctly and contributes appropriately to the overall behavior of the software. In our system, we measure line-level correctness by annotating each line as either correct or incorrect. A line is considered correct if it contributes directly to achieving the developer’s specified intent through appropriate UI interaction or validation. Lines that target the wrong elements, perform actions not aligned with the test goal, or introduce unintended state changes are labeled incorrect.

This metric directly quantifies the functional precision of the generated code. An effective test must exhibit high correctness, meaning that nearly every line plays a valid and necessary role in fulfilling the intent. This is the most central metric in determining whether the generated output is actually valid and usable as a test.

4.9.2 Unnecessary Steps

Unnecessary steps refer to actions in a test script that do not contribute to fulfilling the intended goal or verifying the desired behavior of the system under test. We identify and count operations that do not contribute meaningfully to achieving the intended goal of the test. These are annotated manually and include:
• repeated navigations already covered by previous lines,
• sleep delays without a corresponding asynchronous dependency,
• hard-coded values or paths that do not influence test success.

A recent study by Amalfitano et al. suggests that structured exploration leads to more efficient and meaningful test coverage [3], and Wang et al. discuss how unnecessary navigations can significantly inflate test duration and divert focus from intended test goals [37].
In our evaluation, we use this criterion to assess how well the generated test code remains focused and efficient in fulfilling its intended purpose.

4.9.3 Flakiness

Flakiness is measured as a binary attribute. A test is marked flaky if it produces inconsistent results (i.e., passes and fails non-deterministically) across three consecutive runs on a stable device configuration with unchanged application state.

Flaky tests are considered ineffective by definition, since their results cannot be trusted in regression analysis or continuous integration pipelines. Even a semantically correct test becomes unusable if it behaves inconsistently under identical conditions. Therefore, absence of flakiness is treated as a necessary condition for test effectiveness.
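Operationally, this check reduces to executing the same test three times and flagging any disagreement among the outcomes. A minimal sketch, assuming a hypothetical run_test callable that returns True on pass and False on fail:

```python
def is_flaky(run_test, runs=3):
    """Flag a test as flaky if consecutive runs on a stable device
    configuration with unchanged app state do not all agree."""
    outcomes = {run_test() for _ in range(runs)}
    return len(outcomes) > 1
```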
4.9.4 Robustness

Robustness is the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions [22]. To evaluate this property in the context of generated test code, we inspect the degree to which a test can tolerate minor UI, timing, or state changes without failing. We assign robustness on a fixed 5-point scale based on the presence of defensive constructs.

In industrial settings where network delays or minor UI shifts are common, robustness ensures that tests remain valid and maintainable over time. High robustness supports effectiveness by reducing the cost and frequency of manual rework.

Score  Definition
1      No assertions or handling; test fails immediately on any unexpected condition.
2      Sparse assertions; no retries or timeouts.
3      Includes basic assertions and a minimal form of retry or timeout handling.
4      Assertions on critical steps and conditional logic for error recovery.
5      Fully defensive: structured assertions, retries, timeouts, and fallback paths.

Table 4.1: Robustness Rating Scale

4.9.5 Readability

Readability captures the clarity and structure of the test code. This includes the use of helper abstractions, meaningful naming, logical structure, and appropriate commenting. Each test is scored on a 5-point scale via dual human review, supplemented with pylint output.

Score  Definition
1      Flat structure, poor naming, no modularization or comments.
2      Basic structure with limited abstraction or clarity.
3      Adequate structure; readable but lacks modularity.
4      Well-structured with helper methods and consistent naming.
5      Highly readable; modular, clearly named, and well documented.

Table 4.2: Readability Rating Scale

Effective tests are not only functionally correct but also maintainable by others. High readability reduces onboarding time and debugging effort in collaborative and long-term codebases.

4.10 RQ2 Evaluation: Semantic Understanding and Planner Evaluation

To evaluate RQ2 — "How accurately does the framework understand developer intents and translate them into meaningful and correct actions?" — we perform a semantic alignment analysis between the input developer intent and the planner-generated actions. We evaluate the ability to translate a global intent1 into accurate local intents3 using a two-part process comprising (i) quantitative action-correctness scoring and (ii) qualitative reasoning assessment. We evaluate a total of 42 planner-generated steps across three representative Volvo applications: Alarm Clock (N=16), System Settings (N=11), and Load Indicator (N=15). For each step, both the action correctness and the reasoning quality were annotated manually.

Quantitative Step Correctness

To assess the planner's ability to generate valid and executable steps, we sample 42 planner logs under real app contexts. Each input includes a global intent and GUI context2. These inputs are fed to the planner, which queries the language model and returns a single string containing the next immediate step, plus a reasoning field describing the context and the thinking behind this step. Human annotators assess whether the generated action is valid under the context of the current app screen and action sequence:

• Correct: The action is executable and appropriate in context. Performing it advances the app meaningfully toward fulfilling the intended task.
• Incorrect: The action is irrelevant, misaligned, or targets an incorrect UI element. Performing it does not bring the app closer to fulfilling the intent or may even deviate from the intended task path.

We report binary correctness using the following counts:

• Ntotal: Total number of planner-generated steps evaluated.
• Ncorrect, Nincorrect: Number of actions labeled as correct or incorrect, based on the criteria above.

Qualitative Reasoning Assessment

To assess the planner's reasoning quality, we apply a 5-point scoring rubric to each explanation string, regardless of whether the corresponding action was correct. The three apps were selected to cover a range of UI complexities: Alarm Clock features relatively static layouts, System Settings involves nested toggles and configurations, and Load Indicator presents dynamic lists and pop-up dialogs. This score reflects the planner's understanding of the input user intent and its use under the current app context:

Score  Description
1      Poor: Illogical or hallucinated reasoning; reflects a misunderstanding of the current screen or recognition of the wrong UI elements.
2      Weak: Partial understanding of the current context, but flawed reasoning logic, such as a correct understanding of the current app functionality but a misunderstanding of the elements on the current screen.
3      Moderate: Plausible reasoning with key gaps, such as a correct understanding of the app context but an incorrect chain of thought.
4      Good: Mostly sound reasoning with minor issues, such as a correct understanding of the entire context but one or two missed components on the screen.
5      Excellent: Clear, contextually sound reasoning aligned with the intent; a fully correct understanding of the entire context and strong reasoning about the app logic and functionalities.

Table 4.3: Reasoning Quality Rubric

Each step's reasoning explanation was scored independently of correctness, using the 5-point rubric in Table 4.3. Reasoning was assessed based on clarity, contextual understanding, and logical soundness. Scores were recorded by human annotators with domain knowledge of the app under test. Table 5.8 presents example global intents, planner steps, reasoning texts, and their corresponding scores.

We report reasoning quality scores for each app as µ and σ: the mean and standard deviation of reasoning scores, computed across all, correct, and incorrect steps, respectively. Aggregate statistics across apps (shown in the "Aggregate" row of Table 5.3) are computed by pooling all steps from all apps; the mean and standard deviation are then calculated over the full set of reasoning scores.
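As a concrete illustration, the per-app aggregates reported in Table 5.3 can be computed as follows. This is a minimal sketch that assumes each annotation is stored as a (correct, score) pair; the data layout is our own illustration, not the thesis tooling.

```python
from statistics import mean, stdev

def summarize(steps):
    """Compute (mu, sigma) of reasoning scores over all steps,
    correct-only steps, and incorrect-only steps.

    `steps` is a list of (correct: bool, score: int) annotations.
    """
    def stats(scores):
        sigma = stdev(scores) if len(scores) > 1 else 0.0
        return mean(scores), sigma

    return {
        "total": stats([s for _, s in steps]),
        "correct": stats([s for ok, s in steps if ok]),
        "incorrect": stats([s for ok, s in steps if not ok]),
    }
```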
4.11 RQ3 Evaluation: Prompting Strategies and Multi-Modal Effectiveness

Research Question 3 (RQ3) investigates the question: "What are the most effective prompting strategies and modalities for enabling accurate and semantically grounded responses from multi-modal models given visual and textual input?"

This study evaluates multi-modal prompting strategies through synthetic and isolated tests derived from key patterns used in the full system. While the deployed framework often combines multiple inference modules in a single pipeline (e.g., ID and instance resolution, or center prediction followed by action validation), this study isolates atomic behaviors. This enables precise measurement of modality contributions and avoids confounds from task chaining, history, or fallback logic.

4.11.1 Experimental Design

We define six task types that cover common prediction demands in UI automation:

• click_xy: predict the (x, y) location of a target element
• click_id: retrieve the element ID corresponding to a prompt
• get_count: count the number of visual or logical items
• get_text: return the exact or best-matching visible string
• instance: identify the index of a repeated element
• seekbar: compute the (x, y) position for setting a slider to a given value

Each task is specified as a standalone prompt and has a known correct output in structured form. All tasks require returning a valid JSON-compatible value for a known field (e.g., center, resource_id, count, etc.).

The dataset consists of 44 high-level prompts, decomposed into 480 atomic tasks spanning six task types. Each task is stored in a structured CSV and linked to a JSON file containing:

• a screenshot image (Base64-encoded PNG)
• the raw Android XML layout hierarchy
• the labeled answer fields for each task

Each task is evaluated under three input modality configurations—XML only (x), image only (i), and both (xi)—as summarized in Table 4.4. For each configuration, three deterministic seeds (42, 88, 96) are used, resulting in 1440 total model invocations. All evaluations use GPT-4o via a stateless API with temperature=0.

Configuration  XML  Image
XML            ✓    –
IMG            –    ✓
XML ⊕ IMG      ✓    ✓

Table 4.4: Modality configurations used for evaluation

Prompts are dynamically generated per task to predict a single output field (e.g., text, target_center). Each prompt includes a natural language instruction and the relevant inputs (XML layout, screenshot, or both), passed via labeled fields such as task, xml_hierarchy, and image. Prompts enforce schema consistency and field naming across all tasks.

No few-shot examples, retrieval augmentation, chain-of-thought reasoning, fallback logic, or external schema validation are used. Only the model's final structured prediction is scored. All outputs are logged with the prompt, gold label, and computed metrics for analysis.
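The following sketch shows how such a stateless, single-field invocation might look with the OpenAI Python client. The field labels (task, xml_hierarchy, image) mirror those named above, but the message structure, function names, and prompt wording are our own illustration; only the model name, temperature, and seed values come from the setup described here.

```python
import json
from openai import OpenAI

client = OpenAI()

def build_messages(task, xml_hierarchy=None, image_b64=None):
    """Assemble a single-task prompt from labeled fields."""
    parts = [{"type": "text",
              "text": f"task: {task}\n"
                      "Return only a JSON object with the requested field."}]
    if xml_hierarchy is not None:  # XML and XML (+) IMG configurations
        parts.append({"type": "text",
                      "text": f"xml_hierarchy: {xml_hierarchy}"})
    if image_b64 is not None:      # IMG and XML (+) IMG configurations
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{image_b64}"}})
    return [{"role": "user", "content": parts}]

def run_task(task, xml=None, img=None, seed=42):
    """One stateless invocation at temperature 0 with a fixed seed."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        seed=seed,
        messages=build_messages(task, xml, img),
    )
    return json.loads(response.choices[0].message.content)
```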
4.11.2 Evaluation Metrics

Each predicted field is evaluated with metrics appropriate to its type. The following table summarizes the metrics used:

Field           Type        Metric         Metric Type
element_center  (int, int)  CIB            binary
                            L2             float
                            Manhattan      int
target_center   (int, int)  L2             float
                            Manhattan      int
resource_id     string      Exact match    binary
                            Edit distance  int
text            string      Exact match    binary
                            Edit distance  int
count           int         Exact match    binary
instance        int         Exact match    binary

Table 4.5: Evaluation metrics by field, showing input types, metric used, and the output type of each metric.

For coordinate predictions such as element_center and target_center, we report the L2 (Euclidean11) distance and Manhattan12 distance to the ground truth in pixel units. Additionally, for element_center, we include a binary Center-in-Box (CIB) metric indicating whether the predicted center lies within the gold bounding box. Here, element_center refers to the center of a UI element with defined bounds (e.g., a button), while target_center denotes a specific point of interaction, typically on continuous controls like sliders or seekbars.

String fields like resource_id and text are evaluated using case-sensitive exact match as the primary metric. We also compute character-level edit distance (Levenshtein13) to support fine-grained analysis, though it does not affect correctness scoring.

Integer fields, including count and instance, are evaluated solely via exact match. Any deviation from the gold value is treated as incorrect.

All predictions are parsed deterministically. Missing, malformed, or improperly typed outputs are scored as incorrect. There is no learned evaluator and no tolerance via soft thresholds or fuzzy matching.

11 Euclidean (L2) distance is the straight-line distance between two points in pixel space, computed as the square root of the sum of squared differences across axes.
12 Manhattan distance is the sum of absolute differences across axes, reflecting axis-aligned movement between two points.
13 Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
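For concreteness, the distance and string metrics above reduce to a few lines of code. The sketch below restates their definitions in Python; the function names are ours, not those of the evaluation tooling.

```python
import math

def l2(pred, gold):
    """Euclidean (L2) distance between predicted and gold (x, y) points."""
    return math.dist(pred, gold)

def manhattan(pred, gold):
    """Sum of absolute per-axis differences between two points."""
    return abs(pred[0] - gold[0]) + abs(pred[1] - gold[1])

def center_in_box(pred, box):
    """CIB: True if the point lies inside box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    return left <= pred[0] <= right and top <= pred[1] <= bottom

def levenshtein(a, b):
    """Minimum single-character insertions, deletions, or substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```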
4.11.3 Limitations

This evaluation isolates each task to measure model performance on a single prompt with known inputs. It does not evaluate recovery, chaining, or interaction planning, which are part of the broader system. For instance, the main framework includes validation checks (e.g., verifying a predicted center lies within a bounding box or that an element exists before interaction), but these are excluded here to maintain test purity.

Likewise, prompt design is uniform and does not include demonstrations, retrieval augmentation, or few-shot conditioning. This maximizes internal consistency but may underestimate the performance achievable with richer context or in-system integrations.

5 Results

5.1 Test Selection and Comparison Overview

As shown in Table 5.1, from the broader pool of test cases created manually by Volvo developers, we selected a subset and compared our results against 11 manual tests for the Alarm Clock, 16 for System Settings, and 60 for the Load Indicator. We use the global intent1 as input to our Planning Module. However, a portion of the manual tests includes verification of backend API data and external link checks, which are currently beyond the scope of our system. Therefore, we exclude these tests from our evaluation. As a result, we focus on 9 automated tests for the Alarm Clock, 14 automated tests for System Settings, and 22 automated tests for the Load Indicator.

The test scenarios were defined first by Volvo's team as manual test cases; we then used the exact same high-level intents when generating automated (A) tests. Consequently, every automated test (A) has a one-to-one counterpart in the manual set. The test cases that failed in both automated and manual testing are identical.

App              Method  Total  Passed  Failed  Expected Failures
Alarm Clock      M       11     7       4       0
                 A       9      5       4       0
System Settings  M       16     16      0       0
                 A       14     6       8       0
Load Indicator   M       60     53      4       3
                 A       22     16      6       0
Total            M       87     76      8       3
                 A       45     27      18      0

Table 5.1: Test Result Comparison: Manual (M) vs Automated (A). The manual test pass rate is 87% and the automated test pass rate is 60%.

5.1.1 Example Manual vs. Automated Test Cases

To illustrate the differences captured by our code-level metrics, we present a representative pair of test scripts targeting the same user intent in the Alarm Clock app. Both test cases fulfill the following scenario:

Click on the edit button of Alarm1 and once it opens just click the save button. Navigate to the second page of the app. Click on the edit button of Alarm4 and once it opens just click the save button. Finally, return to the home screen.

Manual Test Case

The manually written test uses helper methods and is well-annotated with semantic markers such as preconditions, expected behavior, and system-level tags. It abstracts UI actions through reusable domain-specific methods for maintainability.

Automated Test Case

The generated test explicitly encodes every UI interaction using raw device commands. While this ensures transparency and functional correctness, it also leads to verbosity, repetition, and reliance on fixed delays. Unlike the manual tests, which use heavily abstracted layers that can sometimes obscure intent and make maintenance harder, the generated test remains straightforward.
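To make the contrast concrete, the following sketch approximates the two styles for the scenario above. It assumes a uiautomator2-style Python driver; all selectors and helper names are illustrative, not the actual Volvo test code.

```python
import time
import uiautomator2 as u2  # assumed driver; selectors below are illustrative

d = u2.connect()

# Generated style: every interaction is a raw device command
# with fixed delays instead of explicit waits.
d(description="Edit Alarm1").click()
time.sleep(0.5)
d(description="Save").click()
time.sleep(0.5)
d(description="Next page").click()   # pagination to reach Alarm4
time.sleep(0.5)
d(description="Edit Alarm4").click()
time.sleep(0.5)
d(description="Save").click()
d.press("home")

# Manual style, for contrast: the same flow behind domain-specific
# helpers, e.g. app.select_edit_timer(0); app.save_timer(); app.go_home()
```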
5.2 RQ1 Results: Code-Level Comparison Between Generated and Manual Tests

As described in Section 4.9, we evaluate five metrics: Line-Level Correctness, Unnecessary Steps, Flakiness, Robustness, and Readability. The evaluation is performed on the passed manual and generated test cases, including 5 for Alarm Clock, 6 for System Settings, and 16 for Load Indicator.

Table 5.2 presents the results. Each row corresponds to a test case, where the ID column indicates the identifier of the original manual test case used by Volvo developers, which serves as a traceable reference for aligning generated and manual test pairs. M denotes a manually written test and A denotes a generated (automated) test. The table enables side-by-side comparison between manual and automated tests targeting the same functionality.

ID   Type  Correct (%)  Unnec. Steps  Flaky  Robust  Readable

Alarm Clock
A01  M     100          0             F      4       3
     A     100          0             F      3       3
A06  M     100          2             F      4       3
     A     100          0             F      3       2
A08  M     100          2             F      5       3
     A     83           1             T      2       2
A10  M     100          2             F      5       3
     A     82           0             T      3       3
A14  M     100          1             F      5       3
     A     85           1             F      3       3

System Settings
S04  M     100          1             F      4       4
     A     65           1             T      2       3
S05  M     100          1             F      4       4
     A     60           1             T      2       2
S06  M     100          1             F      4       4
     A     57           4             T      1       2
S08  M     100          1             F      4       4
     A     53           4             F      2       2
S09  M     100          1             F      4       4
     A     53           3             T      2       2
S10  M     100          1             F      4       4
     A     60           3             T      2       2

Load Indicator
L02  M     100          0             F      4       4
     A     44           1             F      2       2
L08  M     100          1             F      5       4
     A     46           4             F      1       2
L09  M     100          1             F      5       4
     A     54           4             F      2       2
L12  M     100          1             F      5       4
     A     29           8             T      1       1
L13  M     100          1             F      5       4
     A     29           7             T      1       1
L16  M     100          1             F      4       4
     A     33           5             F      1       1
L20  M     100          1             F      5       4
     A     50           3             F      2       2
L21  M     100          1             F      4       4
     A     40           7             F      1       2
L22  M     100          1             F      5       4
     A     33           5             F      1       2
L23  M     100          1             F      4       4
     A     50           1             F      1       2
L24  M     100          1             F      4       4
     A     57           0             F      1       3
L25  M     100          1             F      3       4
     A     40           3             F      1       2
L26  M     100          1             F      5       4
     A     33           2             F      1       2
L27  M     100          0             F      5       5
     A     15           6             F      1       1
L28  M     100          0             F      5       5
     A     20           2             F      2       1
L29  M     100          0             F      5       5
     A     18           6             F      1       1

Table 5.2: Code-Level Comparison Between Manual (M) and LLM-Generated (A) Tests

5.3 RQ2 Results: Semantic Understanding and Planning Module Evaluation

As described in Section 4.10, we assess the semantic alignment between developer global intents1 and the local intents3 generated by the Planning Module, using both quantitative and qualitative measures. Specifically, we evaluated 42 Planning Module logs across three applications: Alarm Clock (N=16), System Settings (N=11), and Load Indicator (N=15). Each low-level intent is annotated for action correctness (correct or incorrect), and the reasoning field generated together with each low-level intent is rated for reasoning quality on a 5-point rubric.

Table 5.3 summarizes the evaluation results. For each application, we report the total number of low-level intents generated by the Planning Module (Ntotal), the number of correct and incorrect intents (Ncorrect, Nincorrect), and the average reasoning score (µ) with standard deviation (σ) across all, correct-only, and incorrect-only low-level intents.

App              Ntotal  Ncorrect  Nincorrect  µtotal  σtotal  µcorrect  σcorrect  µincorrect  σincorrect
Alarm Clock      16      14        2           4.00    1.32    4.29      1.07      2.00        1.41
System Settings  11      8         3           3.91    1.70    4.88      0.35      1.33        0.58
Load Indicator   15      8         7           3.13    1.06    3.88      0.83      2.29        0.49
Aggregate        42      30        12          3.67    1.37    4.33      0.92      2.00        0.74

Table 5.3: RQ2 quantitative step correctness and qualitative reasoning assessment for the three Volvo apps

Table 5.4 presents the average reasoning scores and binary equivalence outcomes for all 45 automated tests paired with their corresponding manual tests.

Test ID  Equivalent  Average Reasoning Score
A01      T           5.00
A02      F           2.14
A03      F           2.65
A05      F           2.15
A06      T           4.83
A08      T           4.97
A10      T           5.00
A12      F           2.78
A14      T           5.00
S04      T           4.36
S05      T           4.85
S06      T           4.02
S07      T           4.44
S08      T           3.46
S09      T           4.00
S10      T           4.70
S11      T           4.92
S12      F           3.47
S13      F           1.28
S14      F           1.89
S15      F           2.33
S16      F           2.01
S17      F           3.08
L01      F           1.87
L02      F           1.67
L03      F           2.36
L04      F           1.33
L05      F           1.79
L08      T           3.52
L09      T           4.61
L10      F           2.25
L12      T           3.04
L13      F           2.49
L16      T           3.68
L19      F           2.24
L20      T           4.57
L21      T           3.08
L22      F           1.55
L23      F           1.17
L24      F           2.64
L25      F           2.01
L26      F           2.87
L27      F           1.83
L28      T           2.36
L29      F           1.98

Table 5.4: Average Reasoning Score by Test ID and Equivalence Case

Table 5.5 presents one example of an automated test case that is equivalent to its corresponding manual test. It lists the local intents generated by the Planning Module, their correctness, and the associated reasoning scores for each local intent.

Local Intent                             Correct  Score
Tap on the 'Edit' button for Alarm1      T        5
Tap the 'Save' button in top-right       T        5
Tap on right arrow to go to next page    T        5
Tap on the 'Edit' button for Alarm4      T        5
Tap the 'Save' button in top-right       T        5
Tap the 'Home' button                    T        5
Total Correct: 6/6                       Average = 5.0

Table 5.5: Step-by-step evaluation of local intents in an equivalence case. Scores range from 1 (poor) to 5 (excellent).

Table 5.6 presents one example of an automated test case that is not equivalent to its corresponding manual test. It lists the local intents generated by the Planning Module, their correctness, and the associated reasoning scores for each local intent.

Local Intent                                                 Correct  Score
Tap on the 'Edit' button for Alarm1                          T        5
Tap on the 'Save' button to save changes to Alarm1           F        3
Tap on the right chevron icon to navigate to next alarms     T        2
Tap on the 'Edit' button for second alarm                    F        2
Tap on the toggle button to enable Repeat for fifth alarm    F        2
Tap on the toggle button to enable Repeat for fifth alarm    F        2
Total Correct: 2/6                                           Average = 2.67

Table 5.6: Step-by-step evaluation of local intents in a non-equivalence case.
Scores range from 1 (poor) to 5 (excellent).

Table 5.7 presents the correlation results computed from the data in Table 5.4, quantifying the relationship between average reasoning score and test equivalence using Spearman Rank and Point-Biserial coefficients. Both metrics indicate a strong positive association, with the point-biserial coefficient showing particularly high alignment.

Correlation Type  Coefficient  p-value
Spearman Rank     0.820        5.706 × 10−12
Point-Biserial    0.842        4.165 × 10−13

Table 5.7: Correlation coefficients between equivalence label and average reasoning score.

Table 5.8 provides detailed examples of Planning Module generated low-level intents evaluated for semantic correctness and reasoning quality. Each row corresponds to a single low-level intent derived from a global intent, and is annotated with:

• the Global Intent provided as input,
• the corresponding Local Intent generated to fulfill the intent,
• a binary label indicating whether the low-level intent was Correct (LIC),
• the Reasoning justifying the low-level intent, and
• a Score from 1 to 5 based on reasoning quality.

Global Intent: Turn ON the Repeat option for all the five alarms. Verify the recurring status of all alarms on UI.
Local Intent: Tap on the "Edit" button for the second alarm to modify its settings and enable the repeat option. After tapping, wait for 0.5s to allow the screen to update.
LIC: True
Reasoning: The current screen displays two alarms, with the first alarm having the repeat option enabled for specific days, while the second alarm does not have the repeat option enabled. The intent is to ensure that the repeat option is turned on for all five alarms. Since the second alarm does not have the repeat option enabled, the next step is to edit this alarm to enable the repeat option. The "Edit" button for the second alarm is visible and can be tapped to modify its settings.
Score: 1

Global Intent: Click 'Advanced Settings', verify options Legal information [System WebView licences, Third-party licences], Reset options [Reset network, Reset app preferences, Delete all data (factory reset)], Apps and notifications [Show all apps, Default apps, App permissions, Special app access], About [Build number, Android version].
Local Intent: Scroll down the list to reveal more options, including 'Advanced Settings'. Wait for 0.5s after the scroll action to allow the screen to update.
LIC: False
Reasoning: The intent is to access the 'Advanced Settings' screen and verify specific options. The previous steps included scrolling and tapping on 'Advanced Settings', but the current screen still shows the main settings menu. This indicates that the tap on 'Advanced Settings' did not successfully navigate to the intended screen. It is possible that the 'Advanced Settings' option was not visible or not tapped correctly, or there might be a misidentification of the 'Advanced Settings' option in the XML hierarchy.
Score: 1

Global Intent: Turn ON the Repeat option for all the five alarms. Verify the recurring status of all alarms on UI.
Local Intent: Tap on the right chevron icon to navigate to the next set of alarms. After tapping, wait for 0.5s to allow the screen to update.
LIC: True
Reasoning: The current screen displays three alarms, with the first alarm's repeat option not active, and the second and third alarms having the repeat option active only for Monday. The intent is to enable the repeat option for all five alarms, but only three alarms are visible on this screen. To proceed, we need to ensure that the repeat option is enabled for the visible alarms and then navigate to the next set of alarms to continue the process.
Score: 2
Global Intent: Ensures the spapi contains the correct data. Launches the app, ensure that we start at the main page, and wait for the main page to properly load. Asserts that the displayed weights are the same as the ones uploaded to spapi.
Local Intent: Tap on the back arrow icon at the top left corner to return to the previous page or main page where weight data might be displayed. After tapping, wait for 0.5 seconds to allow the page to load.
LIC: False
Reasoning: The current screen displays a menu with options related to calibration and data management, but it does not show the main page where the weight data is displayed. The intent is to verify that the displayed weights match those uploaded to the spapi. To achieve this, we need to navigate back to the main page where the weight data is visible. The back arrow icon at the top left corner of the screen is likely to take us back to the previous page, which might be the main page with the weight data.
Score: 2

Global Intent: Click on 'Audio' option, then verify availability of settings: Volume Control, Soundstage, Equalizer, Balance/Fader, Loudness.
Local Intent: Scroll up the list in the 'Audio' settings menu to reveal the options that are currently not visible. After scrolling, wait for 0.5 seconds to allow the screen to update.
LIC: True
Reasoning: The intent is to verify the availability of specific audio settings, including Volume Control and Equalizer, which are not currently visible in the screenshot. The current screen shows the 'Audio' settings menu with options like Balance/Fader, Soundstage, and Loudness, but not all required options are visible. To fulfill the intent, we need to scroll up to check if the missing options are located above the currently visible items.
Score: 3

Global Intent: Turn ON the Repeat option for all the five alarms. Verify the recurring status of all alarms on UI.
Local Intent: Tap on the "Save" button to save the changes made to the first alarm. After tapping, wait for 0.5s to allow the screen to update and return to the list of alarms.
LIC: False
Reasoning: The current screen is for editing the settings of the first alarm. The intent is to enable the Repeat option for all five alarms. The screenshot shows that the repeat option is enabled for the first alarm, as indicated by the highlighted days of the week. However, the task is not complete because the repeat option needs to be enabled for the remaining four alarms. The next step should be to save the changes made to the first alarm and return to the list of alarms to select the next one for editing.
Score: 3

Global Intent: Click on 'Audio' and scroll to 'Speed Dependent Volume', verify if options Off, Low, Mid, High can be changed.
Local Intent: Scroll down the 'Audio' settings list to reveal more options. After scrolling, wait for 0.5 seconds to allow the screen to update.
LIC: True
Reasoning: The current screen displays the 'Audio' settings, but the 'Speed Dependent Volume' option is not visible. To fulfill the intent, we need to locate and access the 'Speed Dependent Volume' settings. Since the option is not visible, it is likely further down the list, requiring us to scroll through the list to find it.
Score: 4

Global Intent: Click on edit button of Alarm1 and once it opens just click save button. Navigate to second page of the app. Click on edit button of Alarm4 and once it opens just click save button. Click on Home button.
Local Intent: Tap on the "Edit" button for Alarm1. After tapping, wait for 0.5 seconds to allow the edit screen to load.
LIC: True
Reasoning: The intent is to edit Alarm1 and then navigate to the second page to edit Alarm4. Currently, the screen displays three alarms, with Alarm1 being the first one on the left. The immediate step is to tap the "Edit" button for Alarm1 to proceed with the intent.
Score: 5

Table 5.8: Reasoning examples by Global Intent, Local Intent, Local Intent Correctness (LIC), Reasoning, and Reasoning Score
5.4 RQ3 Results: Prompting Strategies and Multi-Modal Effectiveness

Quantitative Results Across Interaction Modes

We report quantitative results from an ablation study comparing three interaction modes: XML-only (XML), Image-only (IMG), and XML & Image (XML ⊕ IMG) for a multi-modal large language model (MLLM) across multiple UI-focused tasks. Each mode was evaluated using consistent prompts and assessed using task-specific metrics, covering structured classification, spatial localization, text prediction, and continuous regression.

5.4.1 click_xy: Spatial Localization Accuracy

The click_xy task requires the model to predict a screen coordinate corresponding to a target UI element. Figure 5.1 quantifies the number of predictions that fall within ground-truth bounding boxes. The IMG modality shows the lowest correct count, while the XML and XML ⊕ IMG modalities yield similarly high counts.

Figure 5.1: Number of click_xy predictions that fall inside the annotated bounding boxes (XML: 285, IMG: 60, XML ⊕ IMG: 290).

In addition to inclusion counts, spatial precision is evaluated using continuous distance metrics. Figures 5.2 and 5.3 show histograms of Euclidean and Manhattan distances, respectively, stratified by whether the click fell inside or outside the target bounding box. In both metrics, XML and XML ⊕ IMG exhibit distributions concentrated in the low-error regions, particularly for in-box clicks, whereas IMG shows a poor distribution across the board, with most of its clicks falling outside the bounds.

Figure 5.2: L2 distance distributions for click_xy predictions, split by bounding box inclusion (panels: Total, In BBox, and Out BBox counts).

Figure 5.3: Manhattan distance distributions for click_xy predictions, stratified by bounding box inclusion (panels: Total, In BBox, and Out BBox counts).

Figure 5.4 illustrates the predicted click locations for the task "set the fourth alarm to 2:13 AM" using three different input modalities: XML, IMG, and XML ⊕ IMG. In this case, the interface initially displays only the first three alarms, and accessing the fourth requires interaction with a pagination control. Each modality's prediction reveals a distinct pattern of reasoning: the XML model appears to rely solely on the view hierarchy and confuses the third alarm's "Edit" button with the target; the IMG model predicts an approximate region for the navigation button without precise alignment; and the combined modality aligns its prediction with the interactive element expected to lead to the fourth alarm.

Figure 5.4: Predicted click points for the task "set the fourth alarm to 2:13 AM." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality.
The X symbol marks the ground truth center of the target button, while the dashed red border indicates the bounding box of that button.

Figure 5.5 presents the predicted click locations for the task "calibrate the third axle to 5 tons" in the Load Indicator application. The screen displays a graphical truck representation with weight values aligned to each axle, and the user is expected to select the appropriate axle by clicking on the associated control. In this instance, both the XML and XML ⊕ IMG modalities predict clicks at the correct location, precisely at the center of the target button tied to the third axle. The IMG modality, by contrast, produces a click prediction that is misplaced toward the bottom right of the screen, far from any actionable UI element.

Figure 5.5: Predicted click points for the task "calibrate the third axle to 5 tons." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth center of the target button, while the dashed red border indicates the bounding box of that button.

5.4.2 click_id: Semantic Targeting via ID Retrieval

In the click_id task, the model must identify and click the element associated with a given semantic ID. Only XML-enabled modes (XML, XML ⊕ IMG) are evaluated here. Figure 5.6 reports exact match accuracy, where both the XML and XML ⊕ IMG modalities achieve similarly high accuracy. The IMG modality is excluded because element IDs are not present within images.

Figure 5.6: Exact match counts for the click_id task (XML: 230, XML ⊕ IMG: 220). Applicable to XML modes only.

Figure 5.7 shows edit distance distributions for the same task, providing a finer-grained view of output fidelity. Both the XML and XML ⊕ IMG modalities have similar distributions.

Figure 5.7: Edit distance histogram for click_id predictions in XML-capable modes.

5.4.3 get_count: Object Enumeration

The get_count task evaluates the model's ability to enumerate specific UI components. Figure 5.8 presents exact match counts. The XML modality performs worse, while both the IMG and XML ⊕ IMG modalities achieve similarly strong performance.

Figure 5.8: Exact match counts for get_count, where the model must return the number of queried items (XML: 70, IMG: 90, XML ⊕ IMG: 90).

5.4.4 instance: UI Component Classification

The instance task evaluates the model's ability to correctly identify a specific occurrence of a duplicated UI element. As shown in Figure 5.9, all modalities performed similarly well, achieving high accuracy in instance selection across tasks.

Figure 5.9: Exact match counts for instance, where the model identifies the correct occurrence of duplicated UI elements (75 for each modality).

5.4.5 get_text: UI Text Retrieval

The get_text task requires the model to retrieve specific text from the UI. The XML and IMG modalities perform similarly. The XML ⊕ IMG modality performs best, with a small but measurable improvement.

Figure 5.10: Exact match counts for the get_text task (XML: 120, IMG: 120, XML ⊕ IMG: 125).

Figure 5.11: Edit distance distributions for the get_text task.
5.4.6 seekbar: Continuous Value Estimation

The seekbar task involves predicting XY coordinates corresponding to a specified seekbar position within the UI. Figures 5.12 and 5.13 show the L2 and Manhattan distance distributions between predicted and ground truth coordinates. The XML modality shows a tight distribution centered near zero, indicating many exact or near-exact predictions. The IMG modality displays a broad distribution with high error. The XML ⊕ IMG modality also shows a wide error distribution and performs worse than XML.

Figure 5.12: L2 distance histogram for seekbar predictions.

Figure 5.13: Manhattan distance histogram for seekbar predictions.

Figure 5.14 presents the predicted click locations for the task "set the 8kHz band to -9dB" in the Equalizer interface. The screen contains eight vertical seek bars, each corresponding to a different frequency band. The target in this case is the 8kHz band, and the intended click location lies near the lower end of its range, as indicated by the X mark. The XML and XML ⊕ IMG modalities both predict clicks at the 8kHz band but align with the upper part of the seekbar, closer to +9dB and overlooking the required negative value. The IMG modality prediction is farther off, registering a click near the top of the adjacent 16kHz band instead.

Figure 5.14: Predicted click points for the task "set the 8kHz band to -9dB." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth target position on the 8kHz seekbar.

Figure 5.15 shows click predictions for the task "set media volume to 60%" in the Volume Control interface. Multiple horizontal seek bars are visible, each tied to a different audio category. The media volume seekbar, situated near the top of the screen, includes a circular handle that marks the target location, approximated by the X. Both the XML and XML ⊕ IMG modalities predict click points close to this handle, which aligns with the intended 60% position. The IMG modality prediction diverges significantly, producing a click location in the upper-left corner of the screen, disconnected from any seekbar or handle.

Figure 5.15: Predicted click points for the task "set media volume to 60%." Pink corresponds to XML, orange to IMG, and green to XML ⊕ IMG modality. The X symbol marks the ground truth target location on the media seekbar.

6 Discussion

6.1 Discussion of RQ1: Code-Level Comparison Insights

To understand the practical implications of the RQ1 results, we analyzed 27 matched test pairs across three Volvo apps: Alarm Clock, System Settings, and Load Indicator. Each pair includes a manually written test (M) and an LLM-generated test (A), evaluated on five structured metrics. The key observations and implications are as follows:

6.1.1 Correctness

Manual tests consistently achieve 100% correctness, serving as the ground truth. In contrast, the correctness of LLM-generated tests varies widely, from 15% to 100%. Importantly, several automated test cases in Load Indicator have low correctness; for example, L27 has only 15% line-level correctness, which indicates that it contains a large proportion of incorrect lines and produces incorrect actions on the app.
Correctness tends to drop significantly in the Load Indicator app, with most automated tests scoring below 60%. This suggests that LLM performance degrades in complex scenarios, where coming up with correct local intents is harder.

6.1.2 Unnecessary Steps

Automated tests overall exhibit more unnecessary steps than manual tests. This trend is especially pronounced in the Load Indicator app, where the number of redundant or exploratory actions is significantly higher than in Alarm Clock or System Settings. The primary reason is the increased complexity and density of UI elements in Load Indicator, which makes it more challenging for the framework to semantically interpret the app structure and avoid redundant interactions during UI exploration.

An important nuance in this comparison is that manual tests often use abstraction through external packages (e.g., assert app.select_edit_timer(0)), which effectively eliminates unnecessary steps at the script level but introduces an additional abstraction layer that can increase maintenance overhead. In contrast, automated tests—especially in the simpler Alarm Clock and System Settings cases—achieve comparable functionality without introducing new abstraction layers. Instead, they operate directly using elementary operations provided by UIAutomator, which can enhance maintainability and offer a more developer-friendly, transparent structure.

6.1.3 Flakiness

As expected, all manual tests are non-flaky, as they serve as the ground truth and have been carefully curated by developers. However, several automated tests are marked as flaky, particularly in the System Settings app. This flakiness primarily arises from the scrolling actions required to explore the settings menus. Even when performing the same scroll commands, slight differences in system delays, touch accuracy, or gesture precision can lead to divergent app states, causing inconsistent outcomes across repeated runs. Additionally, the seekbar manipulation actions in System Settings further contribute to flakiness, as accurately setting the seekbar position can be sensitive to pixel-level precision.

Similar issues appear in the Alarm Clock and Load Indicator apps, where two automated tests in each were marked flaky. In these cases, the root cause was often related to actions such as number picker interactions, which depend on accurate scrolling behavior.

Overall, these findings highlight a current weakness in the framework's handling of scroll-based and precision-sensitive actions. Addressing this limitation will require enhancing the system to select more reliable alternative actions or introducing corrective mechanisms to improve action accuracy and consistency.

6.1.4 Robustness

The manual tests, serving as ground truth, consistently achieve high robustness scores. This is because they include assertions and validations on nearly every critical line of code. However, it is important to note that many of these assertions are embedded within external abstraction layers (e.g., assert app.save_timer()), which encapsulate multiple low-level checks into a single high-level call.

In contrast, the automated tests exhibit relatively low robustness scores. This limitation stems from the current weaknesses of our system: at present, the system lacks the ability to check backend data or interact with API endpoints to validate deeper system states.
As a result, automated tests are restricted to UI-level assertions, which limits the scope and reliability of validations. This leads to a noticeable absence of robust validation and comprehensive assertions in most automated test cases, reducing their defensive capacity compared to the manual baselines.

6.1.5 Readability

The manual tests generally demonstrate good readability as ground truth. However, their readability is somewhat reduced by the use of abstraction layers (e.g., app.launch_by_navigation()). While helper methods encapsulate complex logic, they can obscure the underlying actions, making it difficult for developers to understand the full behavior without inspecting the external package or diving into the function definitions.

In contrast, the automated tests provide full visibility into the test structure, as all actions are primitive UIAutomator calls explicitly listed within the test cases. While this flat and transparent listing enhances immediate understandability, it lacks meaningful abstraction, helper methods, logical structuring, and documentation — key factors in our readability criteria. As a result, the overall readability score of the automated tests is reduced. Moving forward, we plan to introduce enhanced prompting strategies to encourage the language model to generate modularized, well-commented test cases to improve readability.

6.2 Discussion of RQ2: Semantic Understanding and Planning Module Interpretation

6.2.1 Reasoning Rubric

As shown in Table 4.3, we evaluate reasoning quality using a 5-point rubric. The scoring is justified through an inductive comparison against our ground truth—manual test cases—based on the following equivalence criteria:

1. The automated test leads to the same resulting app state.
2. The automated test reproduces the same functionality as the manual test, step by step.

Given a global intent as input, we inspect each local intent generated by the Planning Module within the automated test. Each reasoning step is scored individually according to the rubric, and the overall reasoning quality is determined by averaging these scores. As shown in Table 5.5 and Table 5.6, a high average reasoning score across the full test case aligns with equivalence between the automated and manual tests in both outcome and functionality. Conversely, a low average score indicates that the automated test fails to faithfully replicate the manual test behavior. These case-level comparisons provide the basis for our justification of the reasoning quality evaluation.

Table 5.4 presents the average reasoning scores and binary equivalence outcomes for all 45 automated tests paired with their corresponding manual tests. To validate the effectiveness of our Reasoning Rubric, we analyzed the relationship between these two variables—average reasoning score (continuous) and equivalence (binary). We computed two correlation measures to quantify the strength of this association: Spearman1 Rank correlation and Point-Biserial2 correlation. As shown in Table 5.7, the Spearman Rank coefficient is 0.820 with a p-value of 5.706 × 10−12, and the Point-Biserial coefficient is 0.842 with a p-value of 4.165 × 10−13. Both coefficients indicate a strong positive correlation between reasoning quality and test equivalence. The relatively high coefficient values suggest that higher reasoning scores are strongly predictive of successful equivalence. The low p-values3 for both tests confirm that these relationships are statistically significant. These findings support our claim that the Reasoning Rubric effectively captures meaningful aspects of agent behavior and closely aligns with the system's ability to produce functionally correct outcomes.

1 Spearman rank correlation measures monotonic relationships between ordinal or continuous variables; suitable for comparing subjective scores with graded outputs.
2 Point-biserial correlation measures the association between a continuous variable and a binary variable; appropriate for linking subjective scores to correctness labels.
3 A p-value indicates the probability of observing the data (or more extreme) assuming the null hypothesis is true; lower values suggest stronger evidence against the null.
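Both coefficients can be reproduced directly from the two columns of Table 5.4. A minimal sketch using SciPy, shown here on a small excerpt of that table (full lists would simply contain all 45 entries):

```python
from scipy.stats import spearmanr, pointbiserialr

# Excerpt of Table 5.4: average reasoning score per test and the
# binary equivalence label (1 = equivalent to the manual test).
scores     = [5.00, 2.14, 2.65, 4.83, 4.97, 1.28, 3.52, 1.17]
equivalent = [1,    0,    0,    1,    1,    0,    1,    0]

rho, p_spearman = spearmanr(scores, equivalent)
r_pb, p_pointbiserial = pointbiserialr(equivalent, scores)
print(f"Spearman: {rho:.3f} (p={p_spearman:.3g})")
print(f"Point-biserial: {r_pb:.3f} (p={p_pointbiserial:.3g})")
```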
6.2.2 Correctness and Semantic Alignment

To further evaluate how accurately the framework translates developer intents into meaningful and correct actions, we conducted a quantitative analysis of local intents across three representative Volvo applications: Alarm Clock, System Settings, and Load Indicator. Each local intent was annotated for correctness, and the associated reasoning score was computed using our rubric.

As shown in Table 5.3, the framework generated a total of 42 local intents, of which 30 were deemed correct. This corresponds to an overall correctness rate of 71.4%, indicating that in the majority of cases, the Planning Module produced contextually valid and meaningful local intents.

We also conducted a qualitative analysis of the reasoning the Planning Module produced when generating the local intents. More importantly, the reasoning scores provide additional insight into the quality of these local intents. Correct steps had a high average reasoning score of 4.33, while incorrect steps had a significantly lower average score of 2.00. This consistent gap demonstrates that the Reasoning Rubric is not only correlated with equivalence at the test level, but also sensitive enough to differentiate correct vs. incorrect actions at the local intent level.

These findings reinforce the conclusion that high reasoning quality strongly predicts successful local intent generation. Misunderstandings or reasoning breakdowns often coincide with functional failures, validating our semantic alignment methodology for evaluating intent understanding.

6.2.3 App-Specific Reasoning Quality

As shown in Table 5.3, the reasoning quality of local intents generated by the Planning Module varied across the three evaluated apps, reflecting differences in UI complexity and structural predictability. Alarm Clock, with relatively static and repetitive layouts, achieved the highest correctness rate (87.5%) and a strong average reasoning score (µ = 4.00). In contrast, Load Indicator, which features dynamic lists and pop-up interactions, showed both a lower correctness rate (53.3%) and the lowest average reasoning score (µ = 3.13).

Interestingly, the System Settings app, which includes nested toggles and deep menu structures, exhibited higher reasoning scores for correct steps (µ = 4.88) than Alarm Clock or Load Indicator. This suggests that when the Planning Module navigated the structure successfully, its understanding was clearly conveyed.
These results show that the framework is more reliable in apps with predictable UI layouts and limited variability, such as System Settings, where there is no recursion or overlapping states — each action leads uniquely to the same destination, and the structure (scroll, click, etc.) is quite predictable. In contrast, dynamic or deeply nested UIs introduce challenges, as seen in the other two apps, where recursion and overlapping states allow multiple paths to share the same states, making the framework more prone to getting lost even if the state exists in memory. This variability affects both the Planning Module's decision-making accuracy and the clarity of its reasoning.

6.2.4 Common Error Patterns and Limitations

Analysis of incorrect steps revealed several recurring issues that contributed to test failures. One common pattern involved incorrect UI element targeting, such as confusing sibling list items or selecting the wrong instance (e.g., incorrect selection of an Alarm Clock instance). These errors typically coincided with reasoning scores of 2 or lower. Another failure mode was reasoning repetition (e.g., attempting to rediscover or reselect UI elements that had already been accessed), especially prevalent in apps with scrollable or layered interfaces. These issues point to limitations stemming from incomplete memory or state tracking across local intents.

While our Reasoning Rubric effectively captures these reasoning flaws, the underlying challenge remains: enhancing the Planning Module's awareness of app state transitions and screen-specific dynamics.

6.3 Discussion of RQ3: Prompting Strategies and Multi-Modal Effectiveness

6.3.1 click_xy: Spatial Localization Accuracy

Task Overview

The click_xy task evaluates the model's ability to predict the precise screen coordinates of a target UI element based on a natural language prompt. This requires the model to interpret the prompt, identify the correct UI component from the input representation(s), and output the center point of that component's bounding box in (x, y) format. Accurate completion of this task depends on both semantic understanding and spatial reasoning, and serves as a measure of how well the model can localize actionable elements within a mobile interface.

Performance Comparison Across Modalities

The results in Figure 5.1 reveal significant performance disparities across the three interaction modalities for the click_xy task. The XML-only (XML) modality achieves approximately 285 correct clicks out of 350, while the image-only (IMG) modality performs considerably worse, with only around 60 successful predictions. The combined modality (XML ⊕ IMG) slightly outperforms the XML-only modality, reaching nearly 290 correct predictions.

These results suggest that XML-based information provides highly reliable spatial localization cues. This is attributable to the structured nature of the XML data, which explicitly encodes bounding boxes for interactive elements. Given sufficient prompt guidance, the model can extract these coordinates, compute a center point, and return accurate (x, y) values. In contrast, the image-only modality lacks access to such structured spatial information. The model must infer clickable regions purely from visual features, which is inherently ambiguous and error-prone, particularly without specialized training on coordinate regression tasks in UI contexts.
The marginal improvement observed in the XML ⊕ IMG condition over the XML-only modality suggests that the inclusion of images can be beneficial when used in conjunction with XML. Although the images alone are insufficient for accurate localization, they may help disambiguate the target element when multiple XML nodes share similar attributes. However, this interpretation is speculative; a controlled analysis isolating such cases was not conducted and thus remains an open question.

Distance-Based Error Analysis

Figures 5.2 and 5.3 further support these conclusions by illustrating the distribution of localization errors in L2 and Manhattan distance metrics. The XML-only modality yields a sharply peaked distribution near zero, indicating that most predictions fall very close to the true center of the target bounding box. The shape is reminiscent of a right-skewed unimodal distribution, concentrated in the low-error region.

By contrast, the image-only modality displays a wider and flatter distribution. Although some predictions cluster near the correct location, many are significantly off-target. This implies that the model, despite lacking explicit spatial representations, still attempts to localize the correct region heuristically – yet it often fails to refine its prediction to precise coordinates.

The XML ⊕ IMG modality demonstrates a modest leftward shift in the error distribution relative to the XML-only case. This shift suggests that when the model integrates visual context alongside structured XML, it can further reduce spatial error in some cases. This may occur when visual features reinforce or clarify the XML-based decision, leading to more confident and centered predictions. However, the gain is small, and the XML information remains the dominant factor in spatial localization performance.

Click Prediction Across Modalities

The results in Figure 5.4 reveal important modality-specific behaviors in response to the task of setting the fourth alarm. The XML modality incorrectly selects the "Edit" button associated with the third alarm. This behavior is expected given the limitations of the XML hierarchy, which lacks any semantic markers distinguishing the individual alarms beyond replicated structural patterns. The numerical labels (1, 2, 3) that visually denote the alarm indices in the screenshot are not present in the hierarchy. As such, the model encounters three visually identical alarm elements with incrementing instance indices but no semantic grounding. Without an explicit reasoning chain or iterative step-tracking—both of which were excluded in this one-shot ablation setup—it defaults to the last visible instance.

In contrast, the IMG modality appears to recognize from visual cues that the fourth alarm is not available on the current screen. It identifies an arrow-like UI component positioned on the right edge of the interface, likely indicating a pagination control, and attempts to click near it. While the click is not precisely aligned with the control, the spatial reasoning exhibited here is notable given the lack of structural information.

The combined XML ⊕ IMG modality performs the task most effectively. The visual input provides the contextual insight that a navigation step is required, while the structural XML allows the model to locate the actual interactive component responsible for this transition. As a result, the predicted click lands directly at the center of the pagination button, coinciding with the ground truth.
This indicates that the multi-modal combination facilitates both high-level intent recognition and low-level target resolution, which enables a correct and contextually appropriate action.

Limitations and Future Directions

While the results establish the primacy of XML in spatial localization, they also raise questions about how to better leverage visual information. One unexplored avenue is image annotation. Augmenting images with visual cues such as bounding boxes or highlights may help the model correlate visual regions with semantic targets, potentially improving image-only and multi-modal performance. Further, training or fine-tuning with tasks explicitly targeting spatial coordinate extraction may enhance the model's sensitivity to such requirements.

Overall, the findings from the click_xy task affirm that current MLLM systems benefit substantially from structured XML input when tasked with coordinate prediction. Visual input, while insufficient on its own, can provide marginal gains when used in combination.

6.3.2 click_id: Semantic Targeting via ID Retrieval

Task Overview

The click_id task assesses the model's ability to identify and return the exact resource ID of a UI element based on a semantic prompt. The prompt typically describes the function, label, or visual role of the target element. To succeed, the model must parse the prompt, locate the matching element in the input representation, and extract its unique identifier. This task tests the model's capacity for semantic matching and symbolic precision, particularly in structured input formats like XML where element IDs are explicitly defined.

Modality-Based ID Retrieval

The click_id task evaluates the model's ability to identify and retrieve the exact element ID corresponding to a semantic prompt. As noted in the results (Figure 5.6), the XML-only (XML) modality achieves high accuracy, with approximately 230 out of 250 samples returning the correct ID. This is expected, as the XML provides direct access to structured metadata, including element IDs.

By design, this task excludes the image-only (IMG) modality. Images inherently lack embedded element identifiers, making them unsuitable for this task. The absence of an ID field in visual input renders any model operating solely on image data incapable of meaningful participation in this evaluation.

Impact of Multi-Modal Input

When combining XML with images (XML ⊕ IMG), performance remains high, but a small drop is observed – with accuracy falling slightly to around 220 out of 250. This minor degradation may indicate that the addition of image information introduces ambiguity into the model's reasoning process. Specifically, it is possible that the model attempts to correlate visual features with the structured XML representation. Since the image does not contain element IDs, but may include visible labels or text resembling IDs, such correlation might lead to incorrect associations and distract the model from selecting the correct XML element. This hypothesis remains speculative, as no targeted ablation was performed to confirm this behavior.

Nonetheless, it is notable that the performance remains robust even with multi-modal input. The small decline does not compromise the overall efficacy of the system. This suggests that while there may be edge cases where the visual context introduces confusion, the model can generally prioritize the structured XML information effectively during ID resolution.
Edit Distance Analysis

To gain finer insight into the nature of incorrect predictions, we examine the edit distance distributions in Figure 5.7. These reveal that incorrect IDs typically diverge significantly from the ground truth. In both XML and XML ⊕ IMG modes, the majority of incorrect outputs have edit distances well above 10, with very few near misses (edit distance less than 4). This indicates that errors are not due to minor string variations or case mismatches; rather, they represent entirely incorrect selections – often pointing to unrelated elements.

This observation highlights an important caveat in the use of edit distance as a supplementary metric for ID prediction tasks. In Android development contexts, element IDs must match exactly, including case sensitivity. Therefore, any deviation from the correct string constitutes a failure, regardless of how minor it may seem in edit distance terms. The distributions, while informative, should be interpreted as descriptive rather than diagnostic in this setting.

Interpretation and Recommendations

Overall, the findings affirm the effectiveness of the XML modality for semantic ID retrieval tasks. Adding visual information yields no substantial improvement and may slightly degrade performance due to multi-modal interference. This warrants caution when designing multi-modal prompts for tasks dependent on exact symbolic matching.

Further research could explore whether integrating visual input through more constrained or filtered mechanisms (e.g., masking irrelevant regions) might preserve or enhance precision without introducing confusion. The current results suggest that XML remains the most reliable source for structured attribute extraction, and visual data should be integrated with care in tasks requiring precise identifier resolution.

6.3.3 get_count: Object Enumeration

Task Overview

The get_count task requires the model to enumerate elements of a specific type present on the screen. These include prompts such as “How many wheels are visible?”, “How many active alarms are currently shown?”, or “How many axles does the truck have?”. Solving this task requires the model to understand spatial repetitions and context-specific definitions of visual objects, often necessitating aggregation across either structural or visual cues.

Performance Trends Across Modalities

According to the results shown in Figure 5.8, the image-only (IMG) modality performs best, with around 90 correct predictions out of 100. The XML-only (XML) modality performs significantly worse, at around 70 out of 100. Notably, the combination of XML and image (XML ⊕ IMG) matches the performance of the image-only case, with no observable improvement or degradation.

This pattern suggests that enumeration in user interfaces is primarily a visually grounded task. The XML representation often lacks explicit counts or does not describe repeating visual patterns in a way that lends itself to direct aggregation. In some cases, the XML may not contain any useful structure for the objects being queried, or the relevant information may be encoded in a highly indirect form that is not straightforward to extract. Such cases can be challenging even for human readers, let alone for LLMs operating over structured data.

Interpretation and Modal Implications

The strong performance of the image-only modality indicates that vision-language models are well-suited to handle enumeration tasks when the target elements are visually distinct or spatially grouped.
This includes cases where similar shapes, colors, or layout structures naturally imply a count. In contrast, structural representations like XML, while rich in node-level detail, may obscure the broader visual patterns necessary for such reasoning.

It is noteworthy that the addition of XML in the XML ⊕ IMG configuration does not impair the model’s performance. This implies that the model can rely on visual input when it is sufficient, without being misled by uninformative or misleading XML structures. The ability to retain image-based performance while including XML data suggests a degree of robustness in the model’s multi-modal integration strategy. This could be advantageous in settings where XML data must be included for pipeline uniformity or other task components.

This also offers a subtle insight: while XML contributes significantly to tasks requiring exact structural identifiers or coordinate-based localization, it appears to play a minimal role in visual aggregation tasks like counting. The model’s ability to ignore extraneous or redundant input without degrading accuracy is a positive signal for broader multi-modal deployment, particularly in scenarios involving heterogeneous or variable-quality XML data.

6.3.4 instance: UI Component Classification

Task Overview

The instance task evaluates the model’s ability to identify a specific occurrence of a duplicated UI component. This is particularly relevant in dynamic user interfaces where multiple elements of the same type and attributes exist—such as repeated list items or buttons—requiring disambiguation by instance index. In tools like UI Automator, selecting the correct instance is essential for reliable automation, as only the indexed form of the element can be targeted in the view hierarchy.

Performance Trends and Observations

In this experiment, 80 test samples were evaluated under three modality configurations: XML-only (XML), image-only (IMG), and combined XML and image (XML ⊕ IMG). All three modalities performed identically, achieving near-perfect accuracy in selecting the correct instance of the target element. This indicates that the task was solvable using any of the input types available, without any one modality providing a distinct advantage.

Interpretation and Modality Relevance

The uniform performance across all three conditions suggests that the information necessary to disambiguate instances was encoded consistently across modalities. In the case of XML, this likely comes from the explicit structural position of elements within the hierarchy, such as the index within a parent node. In images, this may be inferred visually through spatial order, layout alignment, or repeated visual patterns.

Importantly, the inclusion of both modalities in the XML ⊕ IMG modality did not introduce any interference or confusion, nor did it yield improvement. This result reinforces the idea that when both modalities redundantly encode the needed information, the model is able to select a consistent interpretation path regardless of how the information is presented.

From a practical standpoint, this suggests that for instance-level targeting tasks, either modality is sufficient, and multi-modal inputs can be used without risk of performance degradation. This flexibility may be useful in deployment scenarios where either visual or structural data is intermittently unavailable or varies in quality.
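For context, the instance-indexed targeting evaluated here maps directly onto the control layer's selector API. A minimal sketch, assuming the uiautomator2 bindings and a screen with several identical “Edit” buttons; the class name and text are illustrative:

```python
import uiautomator2 as u2

d = u2.connect()

# When several nodes share the same className and text, only the
# instance index distinguishes them. The index is zero-based, so
# instance=2 targets the third "Edit" button in the view hierarchy.
third_edit = d(className="android.widget.Button", text="Edit", instance=2)
if third_edit.exists:
    third_edit.click()
```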
6.3.5 get_text: UI Text Retrieval

Task Overview

The get_text task evaluates the model’s ability to extract a specific string of text from the user interface. The prompt specifies what text to retrieve, and the model must identify and return the exact string as it appears on screen. This requires both locating the appropriate element and correctly extracting its textual content.

Performance Across Modalities

As shown in Figure 5.10, both the XML-only (XML) and image-only (IMG) modalities performed similarly well, with approximately 120 correct responses out of 140. This suggests that either the structural representation or the visual appearance alone is often sufficient to identify and extract the requested text.

When both modalities are used together (XML ⊕ IMG), performance increases to around 130 out of 140, indicating a measurable benefit from multi-modal input. This improvement suggests that some instances are better handled when both XML and visual representations are available, likely due to complementary strengths in how the information is encoded across the two formats.

Interpretation and Modal Synergy

The observed improvement in the XML ⊕ IMG setting may reflect a disambiguating effect when the model is given access to both structural and visual inputs. In some cases, the XML may omit relevant text or present it in a non-obvious location within the hierarchy, while the image offers a direct visual representation. In other cases, the visual representation may be ambiguous—for example, due to overlapping UI elements or unconventional layout—where the XML provides the needed structure to disambiguate.

The model appears to benefit from this redundancy, aligning cues across modalities to more reliably identify the correct answer. However, this interpretation remains speculative, as no ablation analysis was conducted to isolate which specific samples were corrected by the addition of visual or structural context. Whether this performance gain is due to true complementarity or merely reinforcement of information remains an open question.

The distribution of edit distances for the get_text task, shown in Figure 5.11, indicates that when predictions are correct, they are usually exact, with an edit distance of zero—consistent with the exact match results. When the model is wrong, the deviations are often minor: many fall within an edit distance of 1 or 2, which may hint at superficial errors such as mismatched capitalization or minor omissions. We have not explicitly analyzed whether these small distances correspond to case sensitivity, and it remains unclear whether the Levenshtein metric we use treats case as significant. Notably, there are a few instances with much higher distances (e.g., 10+), which likely correspond to cases where the model predicted an entire phrase or string instead of a specific unit—something we have seen in examples where disambiguation was semantically demanding.

Interestingly, the combined modality (XML ⊕ IMG) avoids any edit distance of exactly 1, suggesting that combining modalities may resolve certain borderline failures. Conversely, at distance 2, IMG alone was correct while XML and XML ⊕ IMG both made errors, indicating that added structure is not universally beneficial. These patterns suggest nuanced modality interactions that warrant closer error-level analysis.

Implications

This result provides a useful demonstration that combining structured and visual representations can improve retrieval of exact text from UIs.
Where accuracy is critical, using both modalities appears to offer an advantage with no apparent downside. This may be particularly valuable in testing contexts where string equality is strict and tolerances are low.

6.3.6 seekbar: Continuous Value Estimation

Task Overview

The seekbar task evaluates the model’s ability to either determine or set the value of a seekbar widget within an Android UI. These widgets represent continuous values and require interaction through specific (x, y) coordinates, either by identifying the current position or selecting a new target location (e.g., “set seekbar to 80%”).

This task is intentionally constructed to be difficult, as it requires precise spatial reasoning over both visual and structural inputs. It cannot be solved reliably by either modality alone; success depends on the model’s ability to correlate visual layout with XML metadata and compute actionable coordinate values.

Performance Across Modalities

The results, presented in Figures 5.12 and 5.13, show that the XML-only (XML) modality performs moderately well. While not highly accurate, it produces a number of correct or near-correct predictions. The image-only (IMG) modality, by contrast, performs poorly, with a wide and dispersed error distribution. The combined modality (XML ⊕ IMG) also underperforms, only marginally improving upon image-only, and in some cases even trailing XML-only.

Figures 5.14 and 5.15 present modality-specific click behavior for two seekbar interactions: adjusting the 8 kHz equalizer band to −9 dB and setting the media volume to 60%, respectively.

In Figure 5.14, a consistent pattern is observed in the XML and XML ⊕ IMG modalities, where the model predicts a click near the +9 dB region of the 8 kHz seekbar, despite the prompt requesting −9 dB. While the model appears to interpret the magnitude of “9 dB” correctly, the directionality indicated by the negative sign is ignored. This occurs despite the visual presence of a labeled vertical scale from +12 dB to −12 dB on the left side of the interface, suggesting that the negative sign was either visually overlooked or semantically dismissed.

The IMG modality performs worse in this task, placing its click on the 16 kHz seekbar rather than the intended 8 kHz target. This misalignment is likely due to the absence of structured layout information, leaving the model to infer positions solely from visual patterns. In contrast, the XML and combined modalities are able to correctly locate the 8 kHz band—most plausibly by identifying the “8kHz” text node in the XML hierarchy, then resolving its sibling node corresponding to the interactive seekbar and computing its center based on bounding box coordinates.

In Figure 5.15, which addresses setting the media volume to 60%, both the XML and XML ⊕ IMG modalities again predict the correct click location on the “Media” seekbar. This alignment likely stems from the presence of the label “Media” in the XML, which allows the model to associate the prompt with the corresponding UI element. Notably, the model performs this alignment without the aid of any arithmetic tools or helper abstractions—inferring the target position from the four bounding box edges alone. The IMG modality, however, predicts a click in a visually unrelated region near the top of the screen, which does not correspond to any active seekbar.
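The successful XML-based predictions described above amount to mapping a requested value onto a point inside the seekbar's bounding box. A minimal sketch of that computation, assuming the "[left,top][right,bottom]" bounds format found in UI Automator XML dumps; the bounds values below are illustrative:

```python
import re

def seekbar_target(bounds: str, fraction: float) -> tuple[int, int]:
    """Map a requested fraction (e.g. 0.6 for 60%) onto a click
    coordinate inside a horizontal seekbar's bounding box."""
    left, top, right, bottom = map(int, re.findall(r"-?\d+", bounds))
    x = left + round(fraction * (right - left))  # position along the axis
    y = (top + bottom) // 2                      # vertical center
    return x, y

# Setting the "Media" volume seekbar to 60% (illustrative bounds):
x, y = seekbar_target("[120,840][960,900]", 0.60)
# A control layer such as uiautomator2 could then issue d.click(x, y).
```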
Interpretation and Observations

The relative success of the XML-only modality can be attributed to the presence of bounding box metadata for the seekbar element. When prompted to set a value like “80%,” the model can use this structural information to estimate a coordinate corresponding to that percentage along the widget’s axis. In contrast, the image modality lacks such direct cues. The model must first visually identify the widget, interpret its scale and orientation, and then infer the appropriate position to interact with—an inherently difficult sequence without precise spatial anchors.

The fact that XML ⊕ IMG slightly outperforms image-only indicates that the XML still contributes useful information in the combined modality. However, the overall poor performance shows that the model struggles to reconcile spatial data across modalities in a precise, coordinated way. This is further evidenced by the error distribution: correct clicks tend to be exact, while incorrect ones are significantly off-target, reflecting an all-or-nothing pattern in prediction reliability.

Implications for Design and Evaluation

This task illustrates the limitations of current multi-modal prompting strategies in scenarios that require spatial alignment and fine-grained localization. It is one of the few examples where adding visual input degraded the performance relative to using XML alone, suggesting that naive multi-modal fusion can hinder reasoning in high-precision tasks.

The difficulty of this task underscores the importance of carefully designed prompts and alignment strategies when working with coordinate-dependent interactions. Further work is needed to improve grounding between visual and structural elements, particularly for UI widgets that span both modalities semantically but offer distinct kinds of spatial data.

6.4 Threats to Validity

We recognize several potential threats to the validity of our evaluation and results:

6.4.1 Internal Validity

Our evaluation relies on human annotations for key metrics such as correctness, robustness, readability, and reasoning quality. Although we applied a statistical validation process to mitigate subjectivity and calculated correlations (e.g., between equivalence and average reasoning score) to check consistency, some interpretative bias may still remain. Additionally, our system’s current scope explicitly excludes tests involving backend or API validations. This limitation may lead to an underestimation of the true potential of the generated tests, as their robustness and effectiveness could be enhanced with backend integrations.

For the interaction tasks involving spatial or semantic understanding, we rely on task-specific automatic metrics (e.g., coordinate distance, exact ID match, edit distance). These may fail to capture near-miss predictions or semantically correct but formally invalid outputs. The absence of manual correction or qualitative inspection in such cases may under-report partial success. Furthermore, without controlled ablation studies or targeted error analysis, some claims—such as the benefits or interference of combining visual and structural inputs—remain speculative.

6.4.2 External Validity

The evaluation was conducted on three representative Volvo applications (Alarm Clock, System Settings, Load Indicator), which may limit generalizability. The performance of the system on other domains, app types, or non-Android platforms remains untested.
Additionally, the visual and structural properties of the selected applications may influence the observed effectiveness of different input modalities. For example, tasks involving visual counting or spatial localization may behave differently in UIs with complex layouts or sparse metadata. As such, the reported modality-specific trends should not be assumed to generalize across all UI environments without further validation.

6.5 Overall Framework Performance

The overall performance of the proposed framework is promising: it generates executable and meaningful automated tests that are on par with the manual tests, although several critical areas for improvement remain.

Across both RQ1 and RQ2 evaluations, the framework successfully generates tests that are often concise (lower or comparable lines of code compared to manual tests) and functionally aligned with developer-defined global intents, particularly in simpler application contexts like Alarm Clock. The correctness rates, reasoning quality, and equivalence measures confirm that the system effectively captures developer-intended behaviors in many cases, demonstrating the potential of LLM-driven test generation in real testing scenarios.

However, several limitations emerged in more complex applications such as System Settings and Load Indicator. The framework struggles to handle complex UI structures with dynamic or deeply nested elements, which leads to noticeably lower correctness and reasoning scores. It also falls short in maintaining robustness, as the automated tests often lack the defensive coding practices or validation layers that are present in the manual test baselines. Additionally, the system has difficulty avoiding flakiness, particularly in scenarios that require precise scrolling or seekbar adjustments, where even minor differences in execution can produce inconsistent app states. Finally, the readability of the generated code is limited, as the lack of modularization and absence of comments reduce the maintainability and clarity of the automated tests.

Despite these weaknesses, the framework’s ability to generate functionally aligned automated tests, semantically interpret developer-defined global intents, and produce executable local actions represents an important advancement toward reducing manual testing effort. Future enhancements focused on improved prompt engineering, backend and API integration, memory-aware planning, and automated post-processing could further strengthen the framework’s performance and move it closer to becoming a robust tool for industrial-scale automated testing.

7 Conclusions and Future Work

This thesis set out to examine whether large language models can translate high-level, natural-language testing intents into reliable Android UI test scripts. Grounded in Design Science Research, the work combined systematic crawler design, a modular LLM-driven agent, and empirical evaluation on three production-grade Volvo applications. The resulting framework demonstrates that intent-driven automation is not only feasible but also capable of producing concise and functionally accurate tests that reduce manual scripting effort. By coupling structured exploration with multi-modal reasoning, the system achieved a 71.4% step-level correctness across 42 planner actions and generated tests that, in simpler interfaces, matched or outperformed manual baselines in code brevity without sacrificing clarity.
The inclusion of visual context contributed positively to the overall performance of the system. In the majority of tasks, it had no measurable effect on accuracy, but crucially, it also did not degrade performance. In a notable subset of cases, however, visual input significantly improved the agent’s ability to identify the correct UI element or action, leading to successful completions that were not achievable through structural information alone. These gains illustrate that visual and structural data can play complementary roles. While there was a specific interaction type where visual input led to a decline in performance due to coordinate ambiguity, such cases were rare. Overall, visual context proved beneficial more often than not and can serve as a valuable addition when applied with consideration for task characteristics.

The study also surfaced clear limitations. Test robustness lagged behind manual scripts due to sparse assertions and sensitivity to timing, and performance dropped in highly dynamic or deeply nested interfaces where state tracking and reasoning became less reliable. Nonetheless, the strong positive correlation between reasoning quality and functional equivalence confirms that qualitative assessment of LLM rationale is a meaningful indicator of test validity.

Overall, this research contributes a concrete architecture, a reproducible evaluation protocol, and evidence that LLM-based agents can narrow the gap between developer intent and executable verification. While challenges remain in scaling to complex state spaces and reinforcing defensive test practices, the thesis provides a foundation for future refinement of intent-aligned, vision-aware testing tools.

While the proposed framework demonstrates the feasibility of generating executable, meaningful, and semantically aligned test cases for Android applications, several areas remain underdeveloped and offer promising directions for future work.

Interaction Validity and Assertion Coverage

A critical limitation in the current pipeline is the absence of sufficient defensive assertions. The framework frequently attempts invalid interactions without verifying the existence or visibility of the targeted UI elements. Future implementations should incorporate preconditions—such as element existence, interactivity, and visibility checks—before executing any interaction. This would significantly reduce error propagation and improve overall test reliability.

Enhancing Screen Understanding Across Modules

The quality of the interaction, observation, and planning modules is directly tied to their understanding of the current screen context. Misinterpretations at any of these stages often lead to invalid actions, incorrect assertions, or flawed intent planning. Improving prompt engineering and enabling deeper correlation between XML structure and visual features may help the LLM generate more context-aware responses. Structured prompt templates, visual–structural alignment, and training on UI-specific reasoning tasks could serve as mechanisms to boost understanding.

Robust State Detection and Tracking

One of the most persistent challenges encountered during this project was reliable state detection. Determining whether a newly visited screen is truly distinct or a slight variation of a known state remains an unresolved issue.
The current memory-based screen comparison methods are prone to both false positives and false negatives, resulting in either under-exploration or graph explosion during crawling. Developing a hybrid state detection mechanism that combines SSIM with more advanced semantic embeddings or learning-based screen representations may yield more reliable state equivalence judgments.

Revisiting Crawling Strategies

Although we transitioned away from full crawling in favor of intent-driven limited exploration, this decision was motivated by limitations in state detection, not by inherent flaws in crawling. With improved state tracking and de-duplication, it may be possible to revisit crawling-based approaches to achieve broader coverage. This could be particularly beneficial for applications with deeply nested menus or non-linear navigation paths, where full exploration remains useful for building interaction graphs or discovering unreachable states.

Tool Design and Protocol Abstraction

A promising direction for future development is the standardization of tool interfaces through a structured Model Context Protocol (MCP). MCP¹ is an abstraction protocol that defines how tools can be represented as structured function calls with typed inputs and outputs, allowing them to be invoked by a language model in a controlled and deterministic manner. Each tool (e.g., seekbar setter, toggle switcher, element selector) is described through a schema that specifies the function name, input arguments, output structure, and side effects.

In this framework, MCP would allow tools to be defined independently of the model and executed remotely. The LLM would be provided with the available tool definitions, and instead of generating raw code or free-form descriptions, it would select and invoke tools by emitting structured requests conforming to the MCP schema. This provides a standardized, inspectable interface between planning modules and execution backends; a sketch of such a tool definition is shown below.

Integrating MCP into the framework would allow for better modularization, remote execution on test rigs, and safer interaction handling. It would also enable systematic logging, testing, and extension of tool capabilities without altering the planner logic.

Backend Integration

The framework currently lacks a mechanism for interacting with backend APIs, even though many application behaviors—such as mode switching, settings updates, or state confirmations—are mediated by backend services. While we implemented a packet sniffer to observe and correlate UI actions with backend traffic, this pipeline stops at passive analysis. Future work should focus on designing a module that synthesizes API calls based on observed traffic and partial documentation (e.g., incomplete Swagger files or undocumented endpoints). Such a module could use LLMs to infer the appropriate request structure and generate valid payloads. These backend interactions could serve two purposes: (1) setting up complex application modes by issuing API calls directly, and (2) validating frontend operations by querying the backend to ensure consistency.

In particular, these assertions would allow the framework to verify that user-facing changes are correctly reflected in the vehicle’s internal systems. Many automotive applications communicate with ECUs² over the backend. Confirming that a UI-based mode change is reflected in the corresponding ECU would significantly increase test reliability and enable better end-to-end validation.
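To make the tool-abstraction idea concrete, the sketch below shows how a seekbar-setter tool might be described and invoked. The field names follow the general shape of MCP tool definitions (name, description, JSON-Schema inputs), but the specific tool and its arguments are illustrative assumptions, not part of the implemented framework:

```python
import json

# Illustrative MCP-style tool definition: a typed schema that can be
# advertised to the LLM instead of asking it to emit raw code.
SET_SEEKBAR_TOOL = {
    "name": "set_seekbar",
    "description": "Set a seekbar, identified by resource ID, to a "
                   "fraction of its range (0.0 to 1.0).",
    "inputSchema": {
        "type": "object",
        "properties": {
            "resource_id": {"type": "string"},
            "fraction": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        },
        "required": ["resource_id", "fraction"],
    },
}

# Instead of free-form text, the model emits a structured call that the
# execution backend validates against the schema and runs deterministically.
call = {
    "tool": "set_seekbar",
    "arguments": {"resource_id": "android:id/seekbar_media", "fraction": 0.6},
}
print(json.dumps(call, indent=2))
```

Because such a schema is model-independent, the same definition could be served to different planners or executed on a remote test rig without changes to the planner logic.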
¹ For more information on MCP, see https://modelcontextprotocol.io/ and https://www.anthropic.com/news/model-context-protocol.
² ECU (Electronic Control Unit) refers to embedded systems in vehicles that control specific hardware functions, such as engine control, brakes, or transmission.

Bibliography

[1] Saaket Agashe et al. Agent S: An Open Agentic Framework that Uses Computers Like a Human. 2024. arXiv: 2410.08164 [cs.AI]. url: https://arxiv.org/abs/2410.08164.
[2] Domenico Amalfitano et al. “MobiGUITAR: Automated Model-Based Testing of Mobile Apps”. In: IEEE Softw. 32.5 (Sept. 2015), pp. 53–59. issn: 0740-7459. doi: 10.1109/MS.2014.55. url: https://doi.org/10.1109/MS.2014.55.
[3] Domenico Amalfitano et al. “Using GUI ripping for automated testing of Android applications”. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. 2012, pp. 258–261.
[4] Android Developers. Espresso Idling Resources. https://developer.android.com/training/testing/espresso/idling-resource. Official guidance on synchronization in Espresso tests; accessed 2025-04-29.
[5] Android Developers. Espresso Testing Framework. https://developer.android.com/training/testing/espresso. Accessed: 2025-04-29.
[6] Android Developers. Layouts. https://developer.android.com/guide/topics/ui/declaring-layout. Accessed: 2025-04-29.
[7] Android Developers. UI Automator. https://developer.android.com/training/testing/ui-automator. Accessed: 2025-04-29.
[8] Anthropic. Building Effective AI Agents. https://www.anthropic.com/engineering/building-effective-agents. Accessed: 2025-05-22. 2023.
[9] Gilles Baechler et al. ScreenAI: A Vision-Language Model for UI and Infographics Understanding. 2024. arXiv: 2402.04615 [cs.CV]. url: https://arxiv.org/abs/2402.04615.
[10] Young-Min Baek and Doo-Hwan Bae. “Automated Model-Based Android GUI Testing Using Multi-Level GUI Comparison Criteria”. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 2016, pp. 238–249. doi: 10.1145/2970276.2970313.
[11] Jinze Bai et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. 2023. arXiv: 2308.12966 [cs.CV]. url: https://arxiv.org/abs/2308.12966.
[12] Anne Chao. “Nonparametric estimation of the number of classes in a population”. In: Scandinavian Journal of Statistics 11.4 (1984), pp. 265–270.
[13] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. “Automated test input generation for Android: Are we there yet?” In: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2015), pp. 429–440.
[14] Browser-Use Developers. Browser-Use: The AI Browser Agent. https://browser-use.com. Accessed: 2025-05-25. 2025.
[15] Sergio Di Martino et al. “GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies”. In: Journal of Software: Evolution and Process 36.7 (2024), e2640. doi: 10.1002/smr.2640.
[16] Sidong Feng et al. “Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat”. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. ASE ’24. Sacramento, CA, USA: Association for Computing Machinery, 2024, pp. 1973–1978. isbn: 9798400712487. doi: 10.1145/3691620.3695260. url: https://doi.org/10.1145/3691620.3695260.
[17] Andrew Gelman et al.
Bayesian Data Analysis. 3rd ed. CRC Press, 2014.
[18] I. J. Good. “The population frequencies of species and the estimation of population parameters”. In: Biometrika 40.3–4 (1953), pp. 237–264.
[19] Alan R. Hevner et al. “Design science in information systems research”. In: MIS Q. 28.1 (Mar. 2004), pp. 75–105. issn: 0276-7783.
[20] Zhiyuan Huang et al. SpiritSight Agent: Advanced GUI Agent with One Look. 2025. arXiv: 2503.03196 [cs.CV]. url: https://arxiv.org/abs/2503.03196.
[21] Zheng Hui et al. “WinClick: GUI Grounding with Multimodal Large Language Models”. In: arXiv preprint (2025). arXiv: 2503.04730.
[22] ISO/IEC/IEEE. ISO/IEC/IEEE 24765:2017 – Systems and Software Engineering Vocabulary. https://www.iso.org/standard/71952.html. International Standard. 2017.
[23] Richard Kissel. Glossary of Key Information Security Terms. NIST IR 7298 Revision 3. National Institute of Standards and Technology. 2019.
[24] Yuanchun Li et al. “DroidBot: a lightweight UI-guided test input generator for Android”. In: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). 2017, pp. 23–26. doi: 10.1109/ICSE-C.2017.8.
[25] Dingning Liu et al. 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o. 2025. arXiv: 2503.13185 [cs.CV]. url: https://arxiv.org/abs/2503.13185.
[26] Yu Liu et al. “Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs”. In: Information Processing & Management 61.5 (2024), p. 103809. issn: 0306-4573. doi: 10.1016/j.ipm.2024.103809. url: https://www.sciencedirect.com/science/article/pii/S0306457324001687.
[27] Fanbin Lu et al. ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay. 2025. arXiv: 2505.16282 [cs.CV]. url: https://arxiv.org/abs/2505.16282.
[28] Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation. 2024. arXiv: 2402.11941 [cs.CL]. url: https://arxiv.org/abs/2402.11941.
[29] Aravind Machiry, Rohan Tahiliani, and Mayur Naik. “Dynodroid: An input generation system for Android apps”. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM. 2013, pp. 224–234.
[30] OpenAI. Introducing Operator. https://openai.com/index/introducing-operator/. Accessed: 2025-05-25. 2025.
[31] Samad Paydar, Mahdi Houshmand, and Elham Hayeri. “Experimental study on the importance and effectiveness of monkey testing for android applications”. In: 2017 International Symposium on Computer Science and Software Engineering Conference (CSSE). 2017, pp. 73–79. doi: 10.1109/CSICSSE.2017.8364659.
[32] Ken Peffers et al. “A Design Science Research Methodology for Information Systems Research”. In: J. Manage. Inf. Syst. 24.3 (Dec. 2007), pp. 45–77. issn: 0742-1222. doi: 10.2753/MIS0742-1222240302. url: https://doi.org/10.2753/MIS0742-1222240302.
[33] OpenATX Project. openatx/uiautomator2. https://github.com/openatx/uiautomator2. Accessed: 2025-04-29. 2024.
[34] G. A. F. Seber. The Estimation of Animal Abundance and Related Parameters. 2nd ed. Macmillan, 1982.
[35] Ting Su et al. “Stoat: Automated framework for efficient exploration of Android apps”. In: Proceedings of the 2017 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM. 2017, pp. 66–76.
[36] Gregory Valiant and Paul Valiant. “Estimating the unseen: An n/log n-sample estimator for entropy and support size”. In: Proceedings of STOC 2011. ACM, 2011, pp. 685–694.
[37] Wenyu Wang et al. “Vet: identifying and avoiding UI exploration tarpits”. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE ’21. ACM, Aug. 2021, pp. 83–94. doi: 10.1145/3468264.3468554. url: http://dx.doi.org/10.1145/3468264.3468554.
[38] Junda Wu et al. Visual Prompting in Multimodal Large Language Models: A Survey. 2024. arXiv: 2409.15310 [cs.LG]. url: https://arxiv.org/abs/2409.15310.
[39] Qinzhuo Wu et al. MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding. 2024. arXiv: 2409.14818 [cs.CL]. url: https://arxiv.org/abs/2409.14818.
[40] Juyeon Yoon, Robert Feldt, and Shin Yoo. “Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents”. In: 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). 2024, pp. 129–139. doi: 10.1109/ICST60714.2024.00020.
[41] Keen You et al. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. 2024. arXiv: 2404.05719 [cs.CV]. url: https://arxiv.org/abs/2404.05719.

A Appendix

A.1 Estimating the number of distinct states

This section is exploratory and not part of the core contributions of the thesis.

Estimating the number of distinct UI states in a mobile application serves three key purposes. First, it provides a way to estimate coverage by comparing the number of observed states to the inferred total. This allows us to know how much of the app has been explored. Second, it helps determine how little crawling effort might be sufficient by indicating when additional exploration is unlikely to uncover many new states. Third, the total number of reachable UI states can act as a rough complexity metric for the app’s user experience and interface structure.

Given an Android app, let S_tot denote the (finite) set of all reachable UI states, each being the full view hierarchy produced by some interaction sequence. Because an exhaustive crawl is infeasible, we explore ways to obtain a rough estimate of |S_tot| based on a crawl log containing n total state visits. The techniques below are drawn from other domains such as Ecological Statistics and Information Theory, and should be seen as exploratory rather than definitive.

A.1.1 Notation

S_tot — total number of distinct reachable UI states.
S_obs — number of distinct states observed during the crawl.
f_k — number of states seen exactly k times.
f_1 — states seen once (singletons).
f_2 — states seen twice (doubletons).
n — total state visits (including repeats): n = \sum_{k \geq 1} k f_k.
b_i — interactive widgets unvisited on the first encounter of state s_i.

A.1.2 Chao1 Lower Bound Estimator

Chao’s nonparametric estimator [12] may serve as a rough lower bound on the number of distinct items in a population:

\[
\hat{S}_{\mathrm{Chao1}} =
\begin{cases}
S_{\mathrm{obs}} + \dfrac{f_1^2}{2 f_2}, & f_2 > 0,\\[6pt]
S_{\mathrm{obs}} + \dfrac{f_1 (f_1 - 1)}{2 (f_2 + 1)}, & f_2 = 0.
\end{cases}
\tag{A.1}
\]

The estimator assumes random sampling and a closed population. Since these assumptions do not hold in deterministic app crawling, any result should be treated as a rough lower bound, not a precise measure.

A.1.3 Lincoln-Petersen Capture-Recapture

The Lincoln-Petersen estimator [34], originally used in ecological capture-recapture studies, could provide an alternate way to estimate total cardinality. If two independent crawls are run with comparable effort, the overlap O might reflect how saturated the exploration was:

\[
\hat{S}_{\mathrm{LP}} = \frac{L_1 L_2}{O},
\tag{A.2}
\]

where L_1 and L_2 are the numbers of unique states found in each crawl.
This method assumes independent sampling. In practice, deterministic traversal strategies likely break this assumption, so the resulting estimate may be unstable or misleading.

A.1.4 Good-Turing Missing Mass Adjustment

Good and Turing [18] proposed that the probability of encountering an unseen item on the next draw is roughly f_1/n. Applying this idea, one might adjust the Chao1 estimate to reflect that unseen states still carry mass:

\[
\hat{S}_{\mathrm{GT}} = \hat{S}_{\mathrm{Chao1}} \left( 1 + \frac{f_1}{n} \right).
\tag{A.3}
\]

This adjustment is only a rough approximation, and like Chao1, it inherits strong assumptions about the underlying sampling process.

A.1.5 Branching-Aware Bayesian Augmentation

To account for unvisited interactive elements that may lead to undiscovered states, we consider a simple Bayesian augmentation. We assume the number of such hypothetical children u_i follows a Poisson distribution with rate λ, and place a Gamma prior on λ [17]. This is a common approach for count modeling, though its suitability here is uncertain:

\[
\mathbb{E}[u_i \mid b_i] = \frac{\alpha + b_i}{\beta + 1}.
\tag{A.4}
\]

With basic hyperparameters α = β = 1, this yields a soft adjustment:

\[
\hat{S}_{\mathrm{Bayes}} = \hat{S}_{\mathrm{GT}} + \sum_{i=1}^{S_{\mathrm{obs}}} \mathbb{E}[u_i \mid b_i].
\tag{A.5}
\]

This formulation is speculative. It implicitly assumes that unexplored widgets are equally likely to lead to new states, which may not be true in practice.

A.1.6 Combined Point Estimate and Bounds

To bound the estimate conservatively, we take:

\[
\hat{S}_{\mathrm{tot}} = \min(\hat{S}_{\mathrm{Bayes}}, \hat{S}_{\mathrm{LP}}),
\tag{A.6}
\]

and define the upper bound as \max(\hat{S}_{\mathrm{Bayes}}, \hat{S}_{\mathrm{LP}}). This pairing is entirely heuristic and not based on formal statistical justification. It is intended to provide a plausible range, not an accurate confidence interval.

A.1.7 Information-Theoretic Sample Complexity

Valiant and Valiant [36] showed that any estimator attempting constant relative error must observe

\[
n \gtrsim \frac{S_{\mathrm{tot}}}{\log S_{\mathrm{tot}}}
\tag{A.7}
\]

samples. Given that app crawls typically gather only n ≤ 100 visits, this theoretical requirement is rarely met. As such, we do not expect the estimates here to converge or generalize reliably. These bounds serve more as qualitative indicators than quantitative estimations.

A.1.8 Practical Limitations

1. Traversal bias: strategies like DFS or BFS produce non-random samples, which can distort all statistical assumptions.
2. Sparse data: many real crawls yield f_2 = 0, which can destabilize Chao1.
3. Skewed access: some UI states are deeply gated or rare, violating equal discovery probability assumptions.
4. Widget overcount: many unexplored widgets do not actually lead to new states, which can inflate b_i.
5. Deterministic overlap: capture-recapture fails if the two crawlers are too similar or too divergent.

A.2 Future Validation Strategy

The estimation techniques outlined in this section remain speculative without empirical validation. Due to time constraints, we were unable to perform a systematic evaluation of these methods against known ground truths. However, we propose a feasible approach for future work to assess the reliability and utility of these estimators.

A promising strategy is to select one or two relatively simple Android applications whose state spaces can be manually enumerated with reasonable effort. For these apps, we could construct a high-confidence estimate—or in some cases, an exact count—of the number of distinct UI states by exhaustively exploring all reachable screens and transitions. This would provide a baseline ground truth against which estimated values can be compared.
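As a concrete starting point for such a validation, below is a minimal sketch of the two purely frequency-based estimators (Eqs. A.1 and A.3), computed from a crawl log of state identifiers. Lincoln-Petersen and the Bayesian term need a second crawl and per-state widget counts, respectively, and are omitted:

```python
from collections import Counter

def chao1(visits: list[str]) -> float:
    """Chao1 lower-bound estimate (Eq. A.1) from a crawl log that
    records one state identifier (e.g. a screen hash) per visit."""
    freqs = Counter(Counter(visits).values())  # f_k: number of states seen k times
    s_obs = sum(freqs.values())                # distinct states observed
    f1, f2 = freqs.get(1, 0), freqs.get(2, 0)
    if f2 > 0:
        return s_obs + f1 ** 2 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def good_turing(visits: list[str]) -> float:
    """Good-Turing adjustment of the Chao1 estimate (Eq. A.3)."""
    f1 = sum(1 for count in Counter(visits).values() if count == 1)
    return chao1(visits) * (1 + f1 / len(visits))

# Toy crawl log: states A and B revisited, C-E seen once each.
log = ["A", "B", "A", "C", "D", "B", "E"]
print(chao1(log))        # 5 + 3**2 / (2*2) = 7.25
print(good_turing(log))  # 7.25 * (1 + 3/7) ~= 10.36
```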
Each of the proposed estimation methods (Chao1, Lincoln-Petersen, Good-Turing adjustment, and Bayesian augmentation) could then be applied to the crawl logs of these applications. Their outputs can be assessed based on proximity to the manually obtained reference count, allowing us to evaluate the relative accuracy and robustness of the estimators. While perfect accuracy is not expected, estimators yielding results in the correct order of magnitude or exhibiting consistent under-/overestimation behavior may still prove useful in practice.

Additionally, this methodology can be extended to a broader set of applications with differing characteristics:

• Applications with near-infinite state spaces due to dynamic content loading or unbounded user input.
• Applications known to have very large but finite state spaces (e.g., on the order of thousands), where the exact count is unknown but approximate scale is understood.

In these scenarios, while ground truth is unavailable, qualitative validation can still be performed by comparing the estimators’ outputs to known behavioral patterns of the app (e.g., size, structure, interaction density). Divergence between estimation outputs across applications may also indicate sensitivity to app complexity, which can inform method selection in future work.

To collect the necessary input data for these estimators, we can employ the existing crawler described in Section 4.6. This crawler captures a sequence of UI states during exploration by hashing screen content, allowing a rough identification of state uniqueness. Alongside hash-based de-duplication, it records metadata such as the number and types of visible widgets, the count and class of interactive elements, and structural layout features. This information can be used to instantiate the variables required by the estimation formulas (f_1, f_2, n, b_i) and provide a useful dataset for empirical evaluation.

Figure A.1: Alarm Clock App Screenshots — (a) Main Page of Alarm, (b) Second Page of Alarm, (c) Edit Alarm.
Figure A.2: System Settings App Screenshots — (a) System Main Menu, (b) Audio Settings Page, (c) Distance Unit Toggle, (d) Volume Control Seekbar.
Figure A.3: Load Indicator App Screenshots — (a) Main Page, (b) Trailer View 1, (c) Trailer View 2, (d) Calibration Page, (e) Settings Page.